Historical Data Collection
The system relies on a multi-source data collection strategy to provide agents with a comprehensive view of historical market conditions. This includes real-time and historical news feeds, as well as a vectorized knowledge base of research papers and technical documentation used for Retrieval-Augmented Generation (RAG).
News Data Connectors
The NewsAPIConnector is the primary utility for gathering market sentiment and event data. It is designed with a fallback mechanism to ensure data availability even without paid API keys.
NewsAPIConnector Interface
The connector interacts with NewsAPI.org and various RSS feeds (Google News, CoinTelegraph, CoinDesk) to compile news datasets.
from data_connectors.newsapi_connector import NewsAPIConnector
# Initialize connector (Optional API key for NewsAPI)
connector = NewsAPIConnector(api_key="your_newsapi_key")
# Fetch recent Bitcoin/Crypto news
articles = connector.get_bitcoin_news(limit=20)
Key Features:
- Source Fallback: Automatically tries CoinTelegraph and CoinDesk RSS feeds if NewsAPI is unavailable or the API key is missing.
- Data Cleaning: Automatically strips HTML tags from news descriptions and truncates text to ensure compatibility with LLM context windows.
- Structure: Returns a list of dictionaries containing title, description, url, published_at, and source.
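The data-cleaning step can be sketched as follows. The regex-based tag stripping and the 500-character truncation limit are illustrative assumptions, not the connector's actual internals:

```python
import re

def clean_description(raw_html: str, max_len: int = 500) -> str:
    """Strip HTML tags and truncate, mirroring the connector's cleaning step.

    The tag-stripping regex and the character cap are illustrative
    assumptions about how the connector prepares text for LLM context.
    """
    text = re.sub(r"<[^>]+>", "", raw_html)   # drop HTML tags
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text[:max_len]
```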
Knowledge Indexing (RAG)
To provide the agents with domain-specific expertise, the system includes a KnowledgeIndexer. This utility processes static documents (Markdown, Text, PDFs) and prepares them for the vector store.
Directory Structure
The indexer expects the knowledge/ directory to be organized by agent domains:
- knowledge/technical_analysis/: Chart patterns, indicator math.
- knowledge/sentiment/: Market psychology, social media impact papers.
- knowledge/fundamental/: On-chain metrics, network health docs.
- knowledge/risk_management/: Position sizing and volatility theory.
- knowledge/papers/: General research PDFs.
Usage
The KnowledgeIndexer chunks large documents and assigns metadata based on the folder structure to ensure that a Technical Agent only retrieves technical documents.
from agents.rag.indexer import KnowledgeIndexer
indexer = KnowledgeIndexer(knowledge_dir="./knowledge")
# Index all documents into their respective vector stores
results = indexer.index_all()
print(f"Indexed chunks: {results}")
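The folder-to-domain metadata assignment described above might look like the following sketch. The mapping table and the metadata key are assumptions for illustration; the real indexer's internals may differ:

```python
from pathlib import Path

# Illustrative mapping from knowledge/ subfolder to agent domain tag.
# The folder names match the documented layout; the domain labels are
# assumptions about what the indexer attaches as metadata.
DOMAIN_FOLDERS = {
    "technical_analysis": "technical",
    "sentiment": "sentiment",
    "fundamental": "fundamental",
    "risk_management": "risk",
    "papers": "general",
}

def domain_for(path: Path) -> str:
    """Derive the agent domain tag from a document's parent folder."""
    return DOMAIN_FOLDERS.get(path.parent.name, "general")
```

Tagging chunks this way is what lets a Technical Agent's retrieval query filter the vector store down to technical documents only.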
Supported Formats:
- Markdown (.md): Preferred format; split by headers.
- Text (.txt): Split by paragraph.
- PDF (.pdf): Supported via PyPDF2 (note: curated Markdown summaries are recommended over raw PDFs for higher reasoning quality).
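The header-based splitting for Markdown can be sketched as below. This is a simplified illustration; the real chunker may also enforce size limits and chunk overlap:

```python
import re

def split_markdown_by_headers(text: str) -> list[str]:
    """Split a Markdown document into chunks at level-1 and level-2
    headers. A minimal sketch of the header-based splitting applied
    to .md files, not the indexer's actual implementation.
    """
    chunks, current = [], []
    for line in text.splitlines():
        # Start a new chunk whenever a "# " or "## " header begins
        if re.match(r"^#{1,2} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks
```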
Historical Trade Logs
During backtesting, the system generates detailed JSON reports stored in the logs/ directory. These logs serve as historical data for the Dashboard and for performance auditing.
Viewing Historical Trades
You can inspect the results of a collection run or backtest using the view_trades.py utility:
# View the most recent backtest log
python view_trades.py
# View a specific report
python view_trades.py logs/backtest_report_20231027.json
Data Schemas
Historical data is standardized using Pydantic models to ensure consistency across the orchestrator and the dashboard.
| Model | Purpose |
| :--- | :--- |
| AgentRecommendation | Base schema for all historical agent decisions. |
| TechnicalAnalysis | Includes metadata for RSI, MACD, and Support/Resistance levels. |
| SentimentAnalysis | Includes Fear & Greed Index interpretations. |
| RiskMetadata | Stores calculated ATR stop-losses and position sizes. |
These schemas ensure that when data is "collected" during a backtest, it is stored with the full context of the agent's reasoning, not just the final trade action.
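As a rough illustration of the base schema's shape, here is a minimal stand-in using stdlib dataclasses rather than Pydantic; the field names and types are assumptions for illustration, not the actual model definition:

```python
from dataclasses import dataclass, field

@dataclass
class AgentRecommendation:
    """Stand-in sketch for the Pydantic base schema. Field names are
    illustrative assumptions, not the real model."""
    agent: str            # e.g. "technical", "sentiment", "risk"
    action: str           # e.g. "BUY", "SELL", "HOLD"
    confidence: float     # agent's confidence in the recommendation
    reasoning: str = ""   # full reasoning context, not just the action
    metadata: dict = field(default_factory=dict)  # e.g. RSI/MACD values
```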