Document Size Handling in Aitana

This document explains how Aitana’s backend handles documents of varying sizes, from small documents that pass through untouched to massive collections of 10M+ characters (~2.5M tokens), using intelligent summarization and chunking strategies to preserve important details.

Overview

The system uses a multi-tiered approach to handle documents based on their size:

  1. Small documents (< token limit): Passed through unchanged
  2. Medium documents (< 400K chars): Direct summarization
  3. Large documents (400K - 1M chars): Chunked summarization
  4. Very large documents (> 1M chars): Recursive chunked summarization with executive summaries

Key Components

1. limit_content.py - Core Summarization Engine

This module handles the intelligent reduction of document content while preserving relevant information.

Constants

LIMIT_CONTENT_MAX = 1000000        # 1M chars - max for direct summarization
CHUNK_SIZE = 200000                # 200K chars - size of individual chunks
RECURSIVE_THRESHOLD = 400000       # 400K chars - when to start chunking
MODEL_TOKEN_OUTPUT_LIMIT = 65535   # ~65K tokens - model output limit

Processing Tiers

Tier 1: Small Documents (< token_limit)
  • Threshold: Document smaller than the requested token limit
  • Processing: No modification needed
  • Example: A 50K token document with a 100K limit passes through unchanged
Tier 2: Direct Summarization (< 400K chars)
  • Threshold: Documents between token_limit and 400K characters
  • Processing: Single-pass summarization using Gemini-2.5-flash
  • Preservation Strategy:
    • Quotes heavily from relevant content
    • Maintains metadata and URL links
    • Creates brief summaries for removed content
    • Preserves precise copies of relevant information
Tier 3: Chunked Summarization (400K - 1M chars)
  • Threshold: Documents exceeding 400K characters
  • Processing:
    1. Splits into 200K character chunks with 500-char overlap
    2. Processes chunks in parallel using asyncio
    3. Each chunk gets proportional token allocation
    4. Combines chunk summaries
Tier 4: Recursive Summarization (> 1M chars or when combined chunks too large)
  • Processing:
    1. Creates executive summary (2000 tokens max)
    2. Recursively summarizes detailed content
    3. Maximum recursion depth of 3 levels
    4. Combines executive + detailed summaries
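
Putting the tiers together, the dispatch can be pictured with a small sketch; the constants mirror limit_content.py, but select_tier and its signature are illustrative, not the module's actual API:

LIMIT_CONTENT_MAX = 1_000_000      # chars, max for direct/chunked summarization
RECURSIVE_THRESHOLD = 400_000      # chars, when chunking starts

def select_tier(char_len: int, token_len: int, token_limit: int) -> str:
    if token_len <= token_limit:
        return "pass-through"             # Tier 1: fits as-is
    if char_len < RECURSIVE_THRESHOLD:
        return "direct-summarization"     # Tier 2: single pass
    if char_len <= LIMIT_CONTENT_MAX:
        return "chunked-summarization"    # Tier 3: parallel chunks
    return "recursive-summarization"      # Tier 4: executive + detail

assert select_tier(200_000, 50_000, 100_000) == "pass-through"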

2. anthropic.py - Anthropic Model Integration

The Anthropic integration uses percentage-based allocation for different content types:

total_char_limit = 180000  # ~180K chars total
chat_history_pct = 40%     # 72K chars for chat history
question_pct = 10%         # 18K chars for question
context_pct = 40%          # 72K chars for context/answers

Content Limiting Strategy

  1. Question limiting: Ensures the question fits within 10% allocation
  2. Context limiting: Tool results and context limited to 40%
  3. Chat history limiting: Previous messages limited to 40%
  4. Image handling: Deduplicates and adds signed URIs
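
The percentage split is plain arithmetic; a small illustrative helper (not the module's actual code) makes the budgets concrete:

def char_budgets(total_char_limit: int) -> dict[str, int]:
    # 40/10/40 split, with 10% reserved for overhead
    pcts = {"chat_history": 40, "question": 10, "context": 40, "reserved": 10}
    return {name: total_char_limit * pct // 100 for name, pct in pcts.items()}

char_budgets(180_000)  # {'chat_history': 72000, 'question': 18000, 'context': 72000, 'reserved': 18000}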

3. gemini_smart_utils.py - Gemini Token Management

Handles token-based content limiting for Gemini models:

Token Counting

  • Uses actual Gemini token counting for accuracy
  • Processes messages from newest to oldest
  • Maintains chronological order in output

Overflow Handling

A fleshed-out sketch of the logic (count_gemini_tokens and format_overflow stand in for the module's internal helpers):

def limit_gemini_content_by_tokens(contents, token_limit):
    kept, used = [], 0
    for msg in reversed(contents):       # iterate backwards (newest first)
        used += count_gemini_tokens(msg)
        if used > token_limit:
            break                        # this and all older messages overflow
        kept.insert(0, msg)              # keep chronological order in output
    return kept, format_overflow(contents[:len(contents) - len(kept)])

Information Preservation Strategies

1. Intelligent Summarization Prompts

The system uses carefully crafted prompts to preserve important information:

"Do not remove anything that may be relevant to the question."
"For anything you do remove, create a brief summary so we at least know what it was."
"Keep a precise and accurate copy of any information that does look relevant."
"Quote heavily from the content if you think its relevant to the question."
"Reproduce any metadata or URL links you find in the relevant chunks."

2. Chunk Overlap

When splitting large documents:

  • 500 character overlap between chunks
  • Preserves context across chunk boundaries
  • Prevents loss of information at split points
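
A minimal sketch of the overlap-aware split, using the constants above (the function name is illustrative):

CHUNK_SIZE = 200_000     # chars per chunk
CHUNK_OVERLAP = 500      # chars shared between neighboring chunks

def split_with_overlap(text: str) -> list[str]:
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + CHUNK_SIZE])
        start += CHUNK_SIZE - CHUNK_OVERLAP   # step back 500 chars each iteration
    return chunks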

3. Parallel Processing

For large documents:

  • Chunks processed simultaneously using asyncio
  • Each chunk maintains its position context (e.g., [Chunk 1/5])
  • Errors in one chunk don’t affect others
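
A hedged sketch of the fan-out: summarize_chunk stands in for the real Gemini call, and asyncio.gather with return_exceptions=True is what keeps one failure from sinking the batch:

import asyncio

async def summarize_chunk(chunk, question, budget, idx, total):
    # Stand-in for the real Gemini call; tags its output with position context.
    return f"[Chunk {idx}/{total}] summary (≤{budget} tokens)"

async def summarize_chunks(chunks, question, token_limit):
    per_chunk = token_limit // len(chunks)            # proportional token allocation
    tasks = [summarize_chunk(c, question, per_chunk, i + 1, len(chunks))
             for i, c in enumerate(chunks)]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    # A failed chunk degrades to an error marker; the others are unaffected.
    return [r if not isinstance(r, Exception) else f"[Chunk {i + 1}/{len(chunks)} - ERROR]"
            for i, r in enumerate(results)]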

4. Executive Summaries

For very large documents:

  • High-level overview capturing essence
  • Focus on key findings and main themes
  • Prioritizes question-relevant information
  • Limited to 2000 tokens for conciseness

5. Error Handling

Graceful degradation when summarization fails:

  • Fallback to truncation with clear markers
  • Error chunks marked as [Chunk X/Y - ERROR]
  • Preserves as much content as possible
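
The truncation fallback might look like this minimal sketch (the marker format is illustrative):

def truncate_with_marker(content: str, char_limit: int) -> str:
    if len(content) <= char_limit:
        return content
    removed = len(content) - char_limit
    return content[:char_limit] + f"\n[TRUNCATED: {removed} chars removed after summarization failed]"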

Model-Specific Token Handling

The system uses different models for different purposes, each with distinct document processing strategies:

Recent Enhancement: 8x Larger Summary Outputs

Previous Limitation: Summaries were limited to ~8,000 tokens (8K)
Current Capability: Summaries can now be up to ~65,000 tokens (64K)

This 8x increase in summary output capacity has dramatically improved information preservation and reduced the need for aggressive compression. The impact is significant across all document processing scenarios.

Model Roles

Gemini 2.5 Flash - Summarization Engine

  • Primary Use: Content summarization, chunking, and document processing
  • Context Window: ~1M tokens (~4M characters)
  • Processing Strategy: Efficient summarization with less aggressive compression
  • Direct Pass-Through Threshold: Documents up to 800K tokens
  • Chunking Threshold: 3.2M characters (800K tokens)

Gemini 2.5 Pro - Final Answer Generation

  • Primary Use: Final answer synthesis using summarized content
  • Context Window: ~1M tokens (~4M characters)
  • Processing Strategy: Comprehensive analysis of pre-processed content
  • Content Integration: Receives summarized content from Flash

Anthropic Claude - Final Answer Generation

  • Primary Use: Final answer generation (alternative to Gemini Pro)
  • Context Window: ~200K tokens (~800K characters)
  • Processing Strategy: Works with heavily compressed summaries
  • Content Integration: Requires more aggressive pre-summarization

Processing Pipeline by Model

Gemini-Based Pipeline (Flash → Pro)

1. Content Processing (Gemini 2.5 Flash):
   - Large document summarization
   - Tool result processing
   - Chunk-based analysis
   - Less aggressive compression (leverages 1M context)

2. Final Answer (Gemini 2.5 Pro):
   - Receives Flash-summarized content
   - 1M token context for comprehensive analysis
   - Can work with larger summarized datasets
   - More detailed final responses possible

Anthropic-Based Pipeline (Flash → Claude)

1. Content Processing (Gemini 2.5 Flash):
   - Same summarization as above
   - Additional compression layer for Claude compatibility
   - More aggressive final summarization
   - Reduced to fit 200K context limit

2. Final Answer (Anthropic Claude):
   - Receives heavily compressed summaries
   - 200K token context requires focused content
   - Excellent at synthesis from compressed information
   - Highly focused, distilled responses

Token Allocation Differences

Gemini Pro Final Answer (1M Context)

total_char_limit = 4000000        # ~1M tokens
chat_history_pct = 40%            # 1.6M chars for chat history
question_pct = 10%                # 400K chars for question  
context_pct = 40%                 # 1.6M chars for context/tools
reserved_pct = 10%                # 400K chars overhead

Anthropic Claude Final Answer (200K Context)

total_char_limit = 800000         # ~200K tokens
chat_history_pct = 40%            # 320K chars for chat history
question_pct = 10%                # 80K chars for question
context_pct = 40%                 # 320K chars for context/tools
reserved_pct = 10%                # 80K chars overhead

Real-World Document Size Context

Document Size Reference

  • 100K tokens = ~250 pages = PhD thesis, technical manual
  • 250K tokens = ~625 pages = Complete book, comprehensive documentation
  • 1M tokens = ~2,500 pages = Multiple books, complete documentation suite
  • 2.5M tokens = ~6,250 pages = Massive dataset, legal case collection

Processing Categories

  • Tiny documents: Up to 5K tokens (Email, newsletter, contract, article)
  • Small documents: 5K-100K tokens (Report, manual chapter, research paper)
  • Medium documents: 100K-250K tokens (Complete book, technical documentation)
  • Large collections: 250K-1M tokens (Multiple books, documentation suite)
  • Very large collections: 1M+ tokens (Massive datasets, legal archives)

Everyday Document Examples

  • Email/Newsletter: ~500 tokens (1-2 pages) - No processing needed
  • News Article: ~800 tokens (2 pages) - No processing needed
  • Business Contract: ~2,000 tokens (5 pages) - No processing needed
  • Blog Post: ~1,200 tokens (3 pages) - No processing needed
  • White Paper: ~10K tokens (25 pages) - No processing needed
  • Research Report: ~30K tokens (75 pages) - No processing needed
  • Technical Manual: ~60K tokens (150 pages) - Direct processing
  • Complete Book: ~120K tokens (300 pages) - Light chunking needed

Size-Based Examples by Processing Pipeline

Example 0: Tiny Documents (No Processing Needed)

Most everyday documents are tiny and pass through without any summarization:

Business Contract (5 pages, 2K tokens)

Both Gemini and Anthropic Pipelines:
Input: 2K token business contract (5 pages)
Processing: No summarization needed (well under all limits)
Output: Full contract passed through unchanged
Result: 
- Complete contract terms and conditions
- All legal clauses preserved verbatim
- Signatures, dates, and parties intact
- Ready for detailed legal analysis

Newsletter/Email (2 pages, 800 tokens)

Both Gemini and Anthropic Pipelines:
Input: 800 token newsletter (2 pages)
Processing: No summarization needed
Output: Full newsletter content unchanged
Result:
- All articles and announcements preserved
- Contact information and links intact
- Formatting and structure maintained
- Complete context for Q&A

Research Paper (15 pages, 6K tokens)

Both Gemini and Anthropic Pipelines:
Input: 6K token research paper (15 pages)
Processing: No summarization needed
Output: Full paper passed through unchanged
Result:
- Complete abstract, methodology, and results
- All references and citations preserved
- Figures and tables descriptions intact
- Full context for academic analysis

Example 1: Small Document (PhD Thesis - ~200 pages, 80K tokens)

Gemini Pipeline (Flash → Pro)

Step 1 - Gemini 2.5 Flash Processing:
Input: 80K token PhD thesis (200 pages)
Processing: No summarization needed (under threshold)
Output: Full document passed through unchanged

Step 2 - Gemini 2.5 Pro Final Answer:
Input: 80K token thesis + chat history
Token budget: 1M tokens available
Context allocation: 1.6M chars (400K tokens) for tool content
Result: Comprehensive analysis with full document context
- Complete access to all chapters and references
- Detailed methodology analysis
- Full literature review preserved
Context remaining: 920K tokens for detailed response

Anthropic Pipeline (Flash → Claude)

Step 1 - Gemini 2.5 Flash Processing:
Input: 80K token PhD thesis (200 pages)
Processing: Light compression for Claude compatibility
Output: ~65K token summary (compressed for 200K limit)

Step 2 - Anthropic Claude Final Answer:
Input: 65K token summary + chat history
Token budget: 200K tokens available
Context allocation: 320K chars (80K tokens) for tool content
Result: Focused analysis with key elements preserved
- Main findings and conclusions intact
- Core methodology preserved
- Key references maintained
Context remaining: 135K tokens for response

Example 2: Medium Document (Complete Book - ~300 pages, 120K tokens)

Gemini Pipeline (Flash → Pro)

Step 1 - Gemini 2.5 Flash Processing:
Input: 120K token novel (300 pages)
Processing: Direct summarization (single pass)
Output: ~100K token summary with detailed preservation

Step 2 - Gemini 2.5 Pro Final Answer:
Input: 100K token summary + chat history
Token allocation: 1.6M chars (400K tokens) for context
Result: 
- Detailed chapter-by-chapter analysis
- Character development preserved
- Plot structure and themes intact
- Key dialogue and scenes quoted
- Literary analysis with extensive examples

Anthropic Pipeline (Flash → Claude)

Step 1 - Gemini 2.5 Flash Processing:
Input: 120K token novel (300 pages)
Processing: Aggressive compression for Claude compatibility
Output: ~40K token summary (heavily compressed)

Step 2 - Anthropic Claude Final Answer:
Input: 40K token summary + chat history  
Token allocation: 320K chars (80K tokens) for context
Result:
- High-level plot summary and themes
- Main character arcs preserved
- Key scenes and turning points
- Focused literary analysis
- Essential quotes and examples

Example 3: Large Document Collection (Encyclopedia Volume - ~800 pages, 320K tokens)

Gemini Pipeline (Flash → Pro)

Step 1 - Gemini 2.5 Flash Processing:
Input: Encyclopedia volume (800 pages, 320K tokens)
Processing:
1. Split into 2-3 large chunks (160K per chunk)
2. Each chunk gets substantial token allocation (133K-200K tokens)
3. Minimal compression needed due to large output capacity
Output: ~280K token comprehensive summary

Step 2 - Gemini 2.5 Pro Final Answer:
Input: 280K token summary + chat history
Token allocation: 1.6M chars (400K tokens) for context
Result:
- Detailed entries for major topics preserved
- Cross-references between articles maintained
- Historical context and examples included
- Technical definitions with full explanations
- Comprehensive coverage across all subject areas

Anthropic Pipeline (Flash → Claude)

Step 1 - Gemini 2.5 Flash Processing:
Input: Encyclopedia volume (800 pages, 320K tokens)
Processing:
1. Split into 8-10 focused chunks
2. Aggressive compression for Claude compatibility
3. Hierarchical summarization by subject area
Output: ~65K token compressed summary

Step 2 - Anthropic Claude Final Answer:
Input: 65K token summary + chat history
Token allocation: 320K chars (80K tokens) for context
Result:
- Essential entries for major topics
- Key definitions and concepts
- Important historical facts
- Cross-references between related topics
- Focused synthesis by subject category

Example 4: Very Large Document Collection (Legal Case Files - ~1,000 pages, 400K tokens)

Gemini Pipeline (Flash → Pro)

Step 1 - Gemini 2.5 Flash Processing:
Input: Complex legal case files (1,000 pages, 400K tokens)
Processing:
1. Split into 3-4 major case chunks
2. Each chunk processed with high detail preservation
3. Legal precedents and citations maintained
Output: ~350K token comprehensive summary

Step 2 - Gemini 2.5 Pro Final Answer:
Input: 350K token summary + chat history
Token allocation: 1.6M chars (400K tokens) for context
Result:
[CASE OVERVIEW - 50K tokens]
- Complete case timeline and key events
- All parties and their roles identified
- Jurisdiction and legal framework

[LEGAL ANALYSIS - 300K tokens]
[Case 1] - Contract Dispute
- Full contract terms and disputed clauses
- Precedent cases cited with details
- Court decisions with reasoning
- Settlement terms and implications

[Case 2] - Liability Claims  
- Incident details and evidence presented
- Expert testimony summaries
- Damage assessments and calculations
- Appeal outcomes and final judgments
...

Anthropic Pipeline (Flash → Claude)

Step 1 - Gemini 2.5 Flash Processing:
Input: Complex legal case files (1,000 pages, 400K tokens)
Processing:
1. Heavy summarization focused on key legal points
2. Essential precedents and outcomes preserved
3. Multi-level compression for Claude compatibility
Output: ~65K token focused summary

Step 2 - Anthropic Claude Final Answer:
Input: 65K token summary + chat history
Token allocation: 320K chars (80K tokens) for context
Result:
[EXECUTIVE SUMMARY - 15K tokens]
- Key legal issues and outcomes
- Major precedents established
- Financial implications and settlements

[CASE SUMMARIES - 50K tokens]
[Contract Disputes - 20K tokens]
- Core contractual issues
- Key precedents cited
- Final outcomes and implications

[Liability Claims - 20K tokens]
- Main liability determinations
- Damage awards and reasoning
- Appeal results

[Regulatory Issues - 10K tokens]
- Compliance violations identified
- Regulatory responses and penalties
...

Processing Pipeline Differences

Gemini Pipeline (Flash → Pro) Characteristics

  • Summarization Stage (Flash):
    • Fewer chunks needed due to 1M context
    • Less recursive summarization required
    • Larger batch sizes for efficient processing
    • Less aggressive compression preserves more detail
  • Final Answer Stage (Pro):
    • 1M context allows comprehensive analysis
    • Can work with larger summarized datasets
    • More detailed final responses possible
    • Better cross-document synthesis

Anthropic Pipeline (Flash → Claude) Characteristics

  • Summarization Stage (Flash):
    • Same Flash processing capabilities
    • Additional compression layer for Claude compatibility
    • More aggressive final summarization required
    • Optimized for 200K context target
  • Final Answer Stage (Claude):
    • 200K context requires focused content
    • Excellent synthesis from compressed information
    • Highly distilled, focused responses
    • Superior reasoning with limited context

Content Quality Trade-offs by Pipeline

Gemini Pipeline Advantages

  • Higher detail preservation: Flash’s 1M context enables nuanced summaries
  • Better context retention: Cross-references maintained through Pro’s large context
  • Less information loss: Larger summaries possible throughout pipeline
  • Comprehensive analysis: Pro can synthesize more complex relationships
  • Detailed responses: More space for thorough explanations

Anthropic Pipeline Advantages

  • Highly focused summaries: Forces extraction of essential points
  • Superior reasoning: Claude excels at synthesis from compressed information
  • Consistent quality: Proven compression and distillation capabilities
  • Efficient processing: Lower token costs for summarization
  • Concise insights: Excellent at identifying core concepts

Model Selection Strategy

Choose Gemini Pipeline When:

  • Document complexity is high: Technical specifications, research papers
  • Detail preservation is critical: Legal documents, medical records
  • Cross-referencing needed: Multiple related documents
  • Comprehensive analysis required: In-depth technical questions
  • Context richness matters: Complex queries requiring nuanced understanding

Choose Anthropic Pipeline When:

  • Focus and clarity are paramount: Executive summaries, decision-making
  • Processing efficiency is important: Large-scale document processing
  • Distillation quality matters: Extracting key insights from complex data
  • Reasoning depth needed: Complex logical analysis and synthesis
  • Concise responses preferred: Clear, actionable insights

Impact of 8x Larger Summary Outputs (8K → 64K)

The increase from 8,000 to 65,000 token summary outputs has fundamentally changed how the system handles documents:

Before vs After Comparison

Small Document Processing (400K chars ~100K tokens)

Before (8K Summary Limit):

Input: 400K character research paper
Processing: Triggers recursive chunking (>400K threshold)
1. Split into 2 chunks of 200K chars each
2. Each chunk gets token_limit//2 = 4K tokens per chunk
3. 2 × 4K = 8K combined summaries
4. No recursive pass needed (fits in 8K limit)
Output: 8K token summary
Information Preservation: ~8% of original content

Result: 
- Only key findings preserved
- Most methodology details lost
- Limited quotes and examples
- Minimal cross-references

After (64K Summary Limit):

Input: 400K character research paper  
Processing: Chunked summarization (under 1M limit, above 400K threshold)
1. Split into 2 chunks of 200K chars each
2. Each chunk gets token_limit//2 = 32K tokens per chunk
3. 2 × 32K = 64K combined summaries
4. No recursive pass needed (fits in 64K limit)
Output: 64K token summary
Information Preservation: ~64% of original content

Result:
- Comprehensive methodology preserved
- Extensive quotes and examples
- Detailed findings with supporting evidence
- Full cross-references and citations maintained

Medium Document Processing (1M chars ~250K tokens)

Before (8K Summary Limit):

Input: 1M character technical documentation
Processing: Recursive chunking strategy
1. Split into 5 chunks of 200K chars each
2. Each chunk gets min(8K//5, 8K) = 1.6K tokens per chunk
3. 5 × 1.6K = 8K combined summaries
4. Combined summaries = 8K (no recursive pass needed)
Output: 8K token summary
Information Preservation: ~3.2% of original content

Result: Severe information loss
- Only highest-level concepts preserved
- Most technical details lost
- Implementation examples removed
- Architecture details heavily simplified

After (64K Summary Limit):

Input: 1M character technical documentation
Processing: Recursive chunking strategy
1. Split into 5 chunks of 200K chars each
2. Each chunk gets min(64K//5, 64K) = 12.8K tokens per chunk
3. 5 × 12.8K = 64K combined summaries
4. Combined summaries = 64K (no recursive pass needed)
Output: 64K token summary
Information Preservation: ~25.6% of original content

Result: Significant information preservation
- Technical details largely maintained
- Code examples preserved
- Architecture diagrams and explanations intact
- Implementation guidance preserved

Large Document Collection (5M chars ~1.25M tokens)

Before (8K Summary Limit):

Input: 5M character document collection
Processing: Multi-level recursive chunking
1. Split into 25 chunks of 200K chars each
2. Each chunk gets min(8K//25, 8K) = 320 tokens per chunk
3. 25 × 320 = 8K combined summaries
4. Combined summaries trigger recursive summarization:
   - Executive summary: min(2K, 8K//4) = 2K tokens
   - Detailed summary: 8K - 2K - 100 = 5.9K tokens
   - Further recursive compression if needed
Output: ~8K token summary
Information Preservation: ~0.64% of original content

Result:
- Only highest-level themes preserved
- No detailed methodology
- No comparative analysis possible
- Lost connections between documents

After (64K Summary Limit):

Input: 5M character document collection
Processing: Single-level recursive chunking
1. Split into 25 chunks of 200K chars each
2. Each chunk gets min(64K//25, 64K) = 2.56K tokens per chunk
3. 25 × 2.56K = 64K combined summaries
4. Combined summaries = 64K (no recursive pass needed)
Output: 64K token summary
Information Preservation: ~5.12% of original content

Result:
- Key methodologies for each document preserved
- Important findings with supporting evidence
- Some comparative analysis across documents
- Critical connections and relationships maintained

Very Large Document Collection (10M chars ~2.5M tokens)

Before (8K Summary Limit):

Input: 10M character document collection
Processing: Deep recursive chunking (multiple levels)
1. Split into 50 chunks of 200K chars each
2. Each chunk gets min(8K//50, 8K) = 160 tokens per chunk
3. 50 × 160 = 8K combined summaries
4. Triggers recursive summarization:
   - Executive summary: 2K tokens
   - Detailed summary: 5.9K tokens
   - May trigger additional recursive levels
Output: ~8K token summary
Information Preservation: ~0.32% of original content

Result:
- Only document titles and main themes
- No detailed content preserved
- No relationships between documents
- Minimal actionable information

After (64K Summary Limit):

Input: 10M character document collection
Processing: Controlled recursive chunking
1. Split into 50 chunks of 200K chars each
2. Each chunk gets min(64K//50, 64K) = 1.28K tokens per chunk
3. 50 × 1.28K = 64K combined summaries
4. Combined summaries = 64K (fits without recursion)
Output: 64K token summary
Information Preservation: ~2.56% of original content

Result:
- Document summaries with key points preserved
- Essential findings and methodologies
- Basic comparative analysis possible
- Important cross-document relationships maintained

Tool-Specific Improvements

extract_files.py Enhancement Needed

# Current limitation (needs update)
max_output_tokens=8192

# Should be updated to
max_output_tokens=65535  # or MODEL_TOKEN_OUTPUT_LIMIT

Impact on File Processing:

  • Before: Each file limited to 8K summary regardless of size
  • After: Each file can generate up to 64K summary, preserving much more detail
  • Result: 8x more information preserved per file processed

google_search.py Enhancement Needed

# Current limitation (needs update)
max_output_tokens=8192

# Should be updated to  
max_output_tokens=65535

Impact on Search Results:

  • Before: Search results heavily compressed, losing context and details
  • After: Comprehensive search summaries with full context and quotes

url_processing.py Enhancement Needed

# Current limitation (needs update)
max_output_tokens=8192

# Should be updated to
max_output_tokens=65535

Impact on Web Content:

  • Before: Web pages reduced to basic summaries
  • After: Full article content with detailed analysis

Real-World Scenarios

Scenario 1: Everyday Business Documents (No Processing)

Input: Business contract (5 pages, 2K tokens)
Newsletter (2 pages, 800 tokens)
Email thread (3 pages, 1.2K tokens)

Processing: No summarization needed for any model
Result: Full documents preserved unchanged
- All legal terms and conditions intact
- Complete newsletter content and links
- Full email conversation history
- Ready for immediate analysis

Scenario 2: Professional Reports (Light Processing)

Input: Technical manual (150 pages, 60K tokens)

Gemini Pipeline:
- Direct processing, minimal compression
- Full technical details preserved
- All examples and procedures intact

Anthropic Pipeline:  
- Light compression for Claude compatibility
- Key procedures and examples preserved
- Technical specifications maintained

Scenario 3: Large Document Analysis (Heavy Processing)

Input: Complete legal case file (1,000 pages, 400K tokens)

Before (8K limit):
- Only case outcomes preserved
- Legal reasoning lost
- Precedent details missing
- Evidence summaries removed

After (64K limit):
Gemini Pipeline:
- Comprehensive case analysis
- Legal reasoning preserved
- Precedent cases with details
- Evidence and testimony summaries

Anthropic Pipeline:
- Focused legal analysis
- Key precedents and outcomes
- Essential evidence highlights
- Concentrated legal insights

Performance Impact

Processing Efficiency

  • Reduced Chunking: Fewer chunks needed due to larger output capacity
  • Fewer Recursive Passes: Less need for multi-level summarization
  • Better Parallel Processing: Larger chunks process more efficiently

Quality Improvements

  • Context Preservation: Relationships between concepts maintained
  • Detail Retention: Technical specifications preserved
  • Quote Accuracy: More verbatim content possible
  • Reference Integrity: Citations and links maintained

Recommendations for Tool Updates

Tools still using 8K limits should be updated:

# Update these tools to use the new limit:
# - extract_files.py
# - google_search.py  
# - url_processing.py
# - image_creation.py

# Replace this:
max_output_tokens=8192

# With this:
from models.limit_content import MODEL_TOKEN_OUTPUT_LIMIT
max_output_tokens=MODEL_TOKEN_OUTPUT_LIMIT  # 65535

This update will provide immediate benefits:

  • 8x more information preserved per tool operation
  • Better context for final answer generation
  • Reduced information loss in tool processing chains
  • More comprehensive responses to user queries

Best Practices

  1. Question Context: Always provide clear questions to guide summarization
  2. Metadata Preservation: URLs, timestamps, and identifiers are prioritized
  3. Structured Data: Tables and lists are preserved when relevant
  4. Code Blocks: Programming code is quoted verbatim when relevant
  5. Error Information: Error messages and stack traces preserved in full

Configuration

Adjusting Limits

# In limit_content.py
LIMIT_CONTENT_MAX = 1000000      # Increase for larger direct summaries
CHUNK_SIZE = 200000              # Adjust chunk granularity
RECURSIVE_THRESHOLD = 400000     # Change when chunking starts

# In anthropic.py
total_char_limit = 180000        # Total context window
chat_history_pct = 40            # Adjust allocation percentages
question_pct = 10
context_pct = 40

Model Selection

  • Summarization: Gemini-2.5-flash (fast, efficient)
  • Executive Summaries: Gemini-2.5-flash with 2000 token limit
  • Final Answers: Anthropic Claude or Gemini based on configuration

Monitoring and Debugging

Logging

log.info("Content size X exceeds threshold, using recursive chunking")
log.warning("Content was > 1M, truncating for direct summarization")
log.error("Error summarizing chunk X: error_message")

Tracing

  • Langfuse integration tracks each summarization step
  • Parent-child observation hierarchy
  • Token usage tracked at each level

Metrics

  • Input size vs output size ratios
  • Chunking frequency and depths
  • Summarization latency by document size
  • Error rates by chunk

Tool-Specific Large Output Handling

The system includes multiple tools that can generate extremely large outputs (millions of tokens). Each tool has specific strategies for managing size:

High-Risk Tools (Multi-Million Token Potential)

1. extract_files.py - File Processing Tool

  • Output Size: Can process 100-200 files, generating millions of tokens
  • Limiting Strategy:
    • File size check: 49MB limit per file for Gemini processing
    • Batch processing: Handles files in batches of 1-20 files
    • Progressive limits: Reduces batch size for large file counts
    • Timeout controls: 4-8 minute timeouts per file
    • Sequential processing: Forces one-at-a-time for 100+ files
    • File prioritization: Text files → PDFs → Images → Others
    • Direct text extraction: Small text files (<1MB) extracted directly

# File processing limits
GEMINI_FILE_SIZE_LIMIT = 1024*1024*49  # 49MB per file
batch_size = 1 if len(files) > 100 else min(20, len(files))
timeout_per_file = 4*60  # 4 minutes
hard_timeout_per_file = 8*60  # 8 minutes hard limit
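
The file prioritization above could be expressed as a simple sort key (an illustrative sketch; the real file representation may differ):

PRIORITY = {"text": 0, "pdf": 1, "image": 2}   # everything else sorts last

def processing_order(files: list[dict]) -> list[dict]:
    # Text files → PDFs → Images → Others
    return sorted(files, key=lambda f: PRIORITY.get(f["kind"], 3))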

2. ai_search.py - Document Search Tool

  • Output Size: Searches large document collections, potentially millions of tokens
  • Limiting Strategy:
    • Content limiting: 500K character limit with fallback to 100K
    • Result limiting: Default max 10 search results
    • Summarization: Uses limit_content() for large result sets

# AI search limits
max_limit = 10  # Maximum search results
content_limit = 500000  # 500K character limit
fallback_limit = 100000  # 100K fallback limit
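
The 500K-to-100K fallback could be sketched as follows; limit_content() is the function named above, but its signature here is an assumption:

def limit_search_content(results_text: str, question: str) -> str:
    try:
        return limit_content(results_text, question, limit=500_000)   # primary 500K budget
    except Exception:
        return limit_content(results_text, question, limit=100_000)   # 100K fallback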

3. user_history.py - Chat History Tool

  • Output Size: Searches across user chat histories, can accumulate large amounts
  • Limiting Strategy:
    • Result limiting: Default max 5 history results
    • Content summarization: Applies limit_content() to results
    • Query specificity: Focuses searches to reduce result size

4. code_execution.py - Code Runner Tool

  • Output Size: Code execution can produce large outputs (data analysis, visualizations)
  • Limiting Strategy:
    • Content limiting: 200K character limit via limit_gemini_content()
    • History limiting: Manages conversation history size
    • Output truncation: Large code outputs are summarized

Medium-Risk Tools

5. google_search.py - Web Search Tool

  • Output Size: Comprehensive web search results with metadata
  • Limiting Strategy:
    • Token limiting: 8192 max output tokens
    • Result filtering: Focused search result processing

6. url_processing.py - Web Content Tool

  • Output Size: Processes content from multiple URLs
  • Limiting Strategy:
    • Token limiting: 8192 max output tokens per URL
    • Content extraction: Focuses on relevant content sections

Tool Result Integration Strategy

When tools return large outputs, the system applies a multi-stage reduction process:

Stage 1: Tool-Level Limiting

  • Each tool implements its own size controls
  • Pre-processing limits (file counts, result counts)
  • Timeout and batch controls for processing

Stage 2: Content Integration

  • Tool results combined in anthropic.py and gemini_smart_utils.py
  • Percentage-based allocation:
    • Context/tool results: 40% of total token budget (72K chars)
    • Chat history: 40% (72K chars)
    • Question: 10% (18K chars)
    • Reserved: 10% for overhead

Stage 3: Final Summarization

  • If combined tool results exceed context allocation (72K chars)
  • limit_context() function applies intelligent summarization
  • Preserves key findings and relevant details
  • Maintains tool metadata and source references
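
A sketch of the stage-2/3 handoff; limit_context() is named above, but its signature and the glue code here are assumptions:

CONTEXT_BUDGET = 180_000 * 40 // 100    # 40% of the 180K total = 72K chars

async def integrate_tool_results(tool_results: list[str], question: str) -> str:
    combined = "\n\n".join(tool_results)
    if len(combined) <= CONTEXT_BUDGET:
        return combined                           # fits the context allocation as-is
    return await limit_context(combined, question, CONTEXT_BUDGET)   # assumed signature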

Example: Large File Collection Processing

Input: 150 files (50 PDFs, 100 text files) totaling 50M tokens
Processing:
1. Tool-level limiting:
   - Prioritizes text files first
   - Batches into groups of 3-5 files
   - Direct text extraction for small files
   - Gemini processing for PDFs/large files
   
2. Content integration:
   - 150 file summaries combined
   - Total output: ~2M tokens
   - Exceeds 72K context allocation
   
3. Final summarization:
   - Executive summary of key documents
   - Detailed chunk summaries by file type
   - Preserved source references and metadata
   - Final result: ~65K token summary (the model's output cap) that fits the final context window

Tool Result Caching

Several tools implement caching to avoid re-processing:

  • extract_files.py: Caches file content extractions and URL processing
  • ai_search.py: May cache search results based on query patterns
  • url_processing.py: Caches processed URL content

Error Handling for Large Outputs

Tools implement graceful degradation when size limits are exceeded:

  1. Timeout handling: Partial results returned if processing times out
  2. Memory limits: Fallback to smaller batch sizes or sequential processing
  3. API limits: Retry with reduced content or alternative processing
  4. Truncation: Clear markers when content is truncated due to limits

Monitoring Tool Output Sizes

The system tracks tool performance through Langfuse:

  • Input/output token counts per tool
  • Processing times and timeout rates
  • Content reduction ratios
  • Error rates by tool and content size

Tool Configuration for Size Management

Tools can be configured with size-related parameters:

# In tool configurations
toolConfigs = {
    "extract_files": {
        "max_files": 100,
        "batch_size": 5,
        "timeout_per_file": 240
    },
    "ai_search": {
        "max_results": 10,
        "content_limit": 500000
    },
    "user_history": {
        "max_results": 5,
        "days_back": 30
    }
}

Future Improvements

  1. Adaptive Chunking: Dynamic chunk sizes based on content density
  2. Semantic Chunking: Split at natural boundaries (paragraphs, sections)
  3. Priority Scoring: Score content relevance before summarization
  4. Caching: Cache summaries for frequently accessed large documents
  5. Streaming: Stream partial summaries as they’re generated
  6. Tool-Specific Limits: Per-tool token budgets based on typical output sizes
  7. Progressive Loading: Stream tool results as they complete processing
  8. Content Deduplication: Remove duplicate content across multiple tool results

Summary: Key Impacts of 8K → 64K Summary Increase

Dramatic Information Preservation Improvements (Based on Actual Recursive Chunking):

  • Small documents (400K chars/100K tokens): Preservation improved from 8% to 64% (8x improvement)
  • Medium documents (1M chars/250K tokens): Preservation improved from 3.2% to 25.6% (8x improvement)
  • Large collections (5M chars/1.25M tokens): Preservation improved from 0.64% to 5.12% (8x improvement)
  • Very large collections (10M chars/2.5M tokens): Preservation improved from 0.32% to 2.56% (8x improvement)

Critical Insight:

The improvement follows the chunking algorithm exactly: each chunk gets token_limit // num_chunks tokens, so the 8x increase in token_limit translates directly into 8x more tokens per chunk and therefore consistently 8x better information preservation across all document sizes.
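
The arithmetic is easy to verify with a tiny illustrative helper (numbers match the 25-chunk example above):

def per_chunk_budget(token_limit: int, num_chunks: int) -> int:
    return min(token_limit // num_chunks, token_limit)

per_chunk_budget(8_000, 25)    # 320 tokens per chunk (before)
per_chunk_budget(64_000, 25)   # 2,560 tokens per chunk (after): exactly 8x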

Reduced Recursive Complexity:

  • Before: Large documents often triggered multiple levels of recursive summarization
  • After: Most documents can be processed in a single chunking pass, avoiding recursive compression losses
  • Result: More predictable and higher-quality summaries

Tools Requiring Updates:

Status of the tools audited for 8K limits (two updated, two still pending):

  • extract_files.py (Updated: Now uses default Gemini limits)
  • google_search.py (Updated: Now uses default Gemini limits)
  • url_processing.py (Needs update: Still has max_output_tokens=8192)
  • image_creation.py (Needs update: Still has max_output_tokens=8192)