Knowledge Indexing (RAG)

The system employs a Retrieval-Augmented Generation (RAG) pipeline to ground trading agents in domain-specific expertise. This ensures that recommendations from Technical, Fundamental, or Sentiment agents are backed by established research and curated trading strategies rather than relying solely on the LLM's base knowledge.

Knowledge Directory Structure

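The KnowledgeIndexer expects a structured knowledge/ directory at the project root. Files placed in its subdirectories are automatically routed to the corresponding agent's vector store. For example, the knowledge/technical_analysis folder used later in this section maps to the technical domain.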

Supported Formats

The indexer reads three file formats and works best with concise, curated content:

Markdown (.md): Preferred format. The indexer uses Markdown headers (#, ##) to split content into semantic sections.
Plain Text (.txt): Supported for simple documentation.
PDF (.pdf): Supported when the optional PyPDF2 dependency is installed. Curated Markdown summaries are still preferred because they give a higher signal-to-noise ratio for trading decisions.
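
As a rough illustration of how format handling with an optional PDF dependency could look, the sketch below dispatches on file extension. The function name load_document and the decision to skip unsupported files are assumptions made for this example and are not taken from the KnowledgeIndexer source.

```python
from pathlib import Path
from typing import Optional

try:
    # PyPDF2 is optional: when it is missing, PDF files are simply skipped.
    from PyPDF2 import PdfReader
except ImportError:
    PdfReader = None

TEXT_SUFFIXES = {".md", ".txt"}


def load_document(path: Path) -> Optional[str]:
    """Hypothetical helper: return a file's text, or None if unsupported."""
    suffix = path.suffix.lower()
    if suffix in TEXT_SUFFIXES:
        return path.read_text(encoding="utf-8")
    if suffix == ".pdf" and PdfReader is not None:
        reader = PdfReader(str(path))
        # extract_text() can return an empty value for image-only pages.
        return "\n\n".join((page.extract_text() or "") for page in reader.pages)
    # Unknown extension, or a PDF without PyPDF2 installed.
    return None
```

Returning None for unsupported files lets a caller log and skip them without aborting the whole indexing run.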

Using the Knowledge Indexer

The KnowledgeIndexer class handles the loading, chunking, and upserting of documents into the vector database.

Initializing and Indexing All Files

To process the entire knowledge/ directory:

from agents.rag.indexer import KnowledgeIndexer

# Initialize with the path to your knowledge directory
indexer = KnowledgeIndexer(knowledge_dir="./knowledge")

# Index all domains at once
stats = indexer.index_all()

for domain, count in stats.items():
    print(f"Domain '{domain}': Indexed {count} chunks.")

Indexing a Specific Domain

If you only want to update a specific area (e.g., after adding new technical indicators):

from pathlib import Path
from agents.rag.indexer import KnowledgeIndexer

indexer = KnowledgeIndexer()
folder_path = Path("./knowledge/technical_analysis")

count = indexer.index_domain(domain="technical", folder_path=folder_path)
print(f"Indexed {count} technical analysis chunks.")

Processing Logic

Chunking Strategy

To maintain context, the indexer employs a hierarchical splitting strategy:

Header Splitting: It first splits documents by Markdown headers (e.g., ## Strategy Name).
Paragraph Splitting: If a section exceeds the target chunk_size (default: 500 characters), it further subdivides the text by paragraphs.
Semantic Metadata: Each chunk is tagged with its source filename and domain, allowing agents to cite their sources during the reasoning phase.
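
The three steps above can be sketched as follows. This is a minimal illustration and not the indexer's actual code: the Chunk dataclass, the function names, and the choice to pack consecutive paragraphs together up to chunk_size are all assumptions.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    text: str
    source: str  # source filename, so agents can cite it
    domain: str  # e.g. "technical"


def split_by_headers(text: str) -> List[str]:
    """Split Markdown so each section begins at a '# ' or '## ' header."""
    sections = re.split(r"(?m)^(?=#{1,2} )", text)
    return [s.strip() for s in sections if s.strip()]


def split_section(section: str, chunk_size: int = 500) -> List[str]:
    """Keep a short section whole; otherwise pack paragraphs into chunks."""
    if len(section) <= chunk_size:
        return [section]
    chunks, current = [], ""
    for para in re.split(r"\n\s*\n", section):
        para = para.strip()
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        # A single paragraph longer than chunk_size is kept whole.
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def chunk_document(text: str, source: str, domain: str,
                   chunk_size: int = 500) -> List[Chunk]:
    """Header split first, paragraph split second, then attach metadata."""
    return [
        Chunk(piece, source, domain)
        for section in split_by_headers(text)
        for piece in split_section(section, chunk_size)
    ]
```

Splitting on headers first keeps each strategy's text together, so the paragraph-level fallback only applies to sections that are too long to embed as a single unit.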

Vector Storage

Once processed, chunks are converted into embeddings and stored in domain-specific vector stores. When an agent (e.g., the Technical Agent) receives a query, it searches its own vector store and retrieves the top-K most relevant chunks before formulating a response.
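
To make the retrieval step concrete, here is a toy, in-memory sketch of per-domain top-K lookup. It assumes a bag-of-words vector in place of a real embedding model and a plain list in place of a real vector database. DomainStore, embed, and cosine are hypothetical names, and the "fundamental" key is an assumed domain label.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: term counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class DomainStore:
    """In-memory stand-in for one domain's vector store."""

    def __init__(self) -> None:
        self._items = []  # list of (embedding, chunk_text) pairs

    def add(self, chunk_text: str) -> None:
        self._items.append((embed(chunk_text), chunk_text))

    def query(self, question: str, top_k: int = 3) -> List[str]:
        """Return the top_k chunk texts ranked by similarity to the question."""
        q = embed(question)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[0]),
                        reverse=True)
        return [text for _, text in ranked[:top_k]]


# One store per domain; each agent only ever searches its own store.
# "technical" appears in the examples above; "fundamental" is an assumed key.
stores = {"technical": DomainStore(), "fundamental": DomainStore()}
stores["technical"].add("RSI above 70 signals overbought conditions")
stores["technical"].add("MACD crossover signals momentum shift")
context = stores["technical"].query("what does RSI overbought mean", top_k=1)
```

Keeping a separate store per domain means a technical query never competes with fundamental or sentiment chunks for the top-K slots.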

Best Practices for Knowledge Content

For optimal retrieval performance:

Use Descriptive Headers: Use H2/H3 tags in Markdown to define clear topics.
Keep Chunks Focused: Write concise summaries of strategies rather than long, rambling prose.
Include Citations: Put source names and dates in the text so agents can attribute claims and reason about how recent they are.