Document Size Handling in Aitana
This document explains how Aitana’s backend handles documents of varying sizes, from small documents that pass through untouched to massive collections spanning millions of tokens, using intelligent summarization and chunking strategies to preserve important details.
Overview
The system uses a multi-tiered approach to handle documents based on their size:
- Small documents (< token limit): Passed through unchanged
- Medium documents (< 400K chars): Direct summarization
- Large documents (400K - 1M chars): Chunked summarization
- Very large documents (> 1M chars): Recursive chunked summarization with executive summaries
Key Components
1. limit_content.py - Core Summarization Engine
This module handles the intelligent reduction of document content while preserving relevant information.
Constants
LIMIT_CONTENT_MAX = 1000000 # 1M chars - max for direct summarization
CHUNK_SIZE = 200000 # 200K chars - size of individual chunks
RECURSIVE_THRESHOLD = 400000 # 400K chars - when to start chunking
MODEL_TOKEN_OUTPUT_LIMIT = 65535 # ~65K tokens - model output limit
Processing Tiers
Tier 1: Small Documents (< token_limit)
- Threshold: Document smaller than the requested token limit
- Processing: No modification needed
- Example: A 50K token document with a 100K limit passes through unchanged
Tier 2: Direct Summarization (token_limit to 400K chars)
- Threshold: Documents larger than the requested token limit but under 400K characters (RECURSIVE_THRESHOLD)
- Processing: Single-pass summarization using Gemini-2.5-flash
- Preservation Strategy:
- Quotes heavily from relevant content
- Maintains metadata and URL links
- Creates brief summaries for removed content
- Preserves precise copies of relevant information
Tier 3: Chunked Summarization (400K - 1M chars)
- Threshold: Documents exceeding 400K characters
- Processing:
- Splits into 200K character chunks with 500-char overlap
- Processes chunks in parallel using asyncio
- Each chunk gets proportional token allocation
- Combines chunk summaries
Tier 4: Recursive Summarization (> 1M chars or when combined chunks too large)
- Processing:
- Creates executive summary (2000 tokens max)
- Recursively summarizes detailed content
- Maximum recursion depth of 3 levels
- Combines executive + detailed summaries
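The routing between these tiers can be pictured with a small sketch. The constants mirror the values above, while the function name, the chars-per-token heuristic, and the return values are purely illustrative, not the actual limit_content.py implementation:
# Illustrative tier selection only; not the actual limit_content.py code.
LIMIT_CONTENT_MAX = 1_000_000    # max chars for direct summarization
RECURSIVE_THRESHOLD = 400_000    # chars at which chunking starts
CHARS_PER_TOKEN = 4              # rough sizing heuristic (assumption)

def select_tier(content: str, token_limit: int) -> str:
    """Return which processing tier a document would fall into."""
    approx_tokens = len(content) // CHARS_PER_TOKEN
    if approx_tokens <= token_limit:
        return "Tier 1: pass through unchanged"
    if len(content) > LIMIT_CONTENT_MAX:
        return "Tier 4: recursive chunked summarization"
    if len(content) > RECURSIVE_THRESHOLD:
        return "Tier 3: chunked summarization"
    return "Tier 2: direct single-pass summarization"

print(select_tier("x" * 500_000, token_limit=100_000))  # -> Tier 3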
2. anthropic.py - Anthropic Model Integration
The Anthropic integration uses percentage-based allocation for different content types:
total_char_limit = 180000 # ~180K chars total
chat_history_pct = 40% # 72K chars for chat history
question_pct = 10% # 18K chars for question
context_pct = 40% # 72K chars for context/answers
Content Limiting Strategy
- Question limiting: Ensures the question fits within 10% allocation
- Context limiting: Tool results and context limited to 40%
- Chat history limiting: Previous messages limited to 40%
- Image handling: Deduplicates and adds signed URIs
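As a rough illustration of how the percentage split above turns into character budgets (the numbers follow this section; the function itself is a sketch, not the anthropic.py code):
def allocate_budgets(total_char_limit: int = 180_000) -> dict:
    """Split the total character budget across content types (illustrative only)."""
    return {
        "chat_history": int(total_char_limit * 0.40),  # 72,000 chars
        "question": int(total_char_limit * 0.10),      # 18,000 chars
        "context": int(total_char_limit * 0.40),       # 72,000 chars
        "reserved": int(total_char_limit * 0.10),      # 18,000 chars overhead
    }

print(allocate_budgets())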
3. gemini_smart_utils.py - Gemini Token Management
Handles token-based content limiting for Gemini models:
Token Counting
- Uses actual Gemini token counting for accuracy
- Processes messages from newest to oldest
- Maintains chronological order in output
Overflow Handling
def limit_gemini_content_by_tokens(contents, token_limit):
# Iterate backwards (newest first)
# Keep messages that fit within limit
# Return kept messages + formatted overflow string
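A runnable sketch of that backwards iteration follows. The real module uses Gemini's own token counting, so the count_tokens callable and the exact signature here are assumptions made to keep the example self-contained:
def limit_gemini_content_by_tokens(contents, token_limit, count_tokens):
    """Keep the newest messages that fit the budget; report the rest as overflow (sketch only)."""
    kept, overflow, used = [], [], 0
    for message in reversed(contents):        # newest first
        cost = count_tokens(message)
        if used + cost <= token_limit:
            kept.append(message)
            used += cost
        else:
            overflow.append(message)          # does not fit within the remaining budget
    kept.reverse()                            # restore chronological order
    overflow.reverse()
    return kept, "\n".join(overflow)          # kept messages + formatted overflow string

# Example with a crude 4-chars-per-token estimate standing in for real token counting:
kept, dropped = limit_gemini_content_by_tokens(
    ["oldest " * 100, "middle " * 20, "newest message"],
    token_limit=100,
    count_tokens=lambda text: len(text) // 4,
)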
Information Preservation Strategies
1. Intelligent Summarization Prompts
The system uses carefully crafted prompts to preserve important information:
"Do not remove anything that may be relevant to the question."
"For anything you do remove, create a brief summary so we at least know what it was."
"Keep a precise and accurate copy of any information that does look relevant."
"Quote heavily from the content if you think its relevant to the question."
"Reproduce any metadata or URL links you find in the relevant chunks."
2. Chunk Overlap
When splitting large documents:
- 500 character overlap between chunks
- Preserves context across chunk boundaries
- Prevents loss of information at split points
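A small sketch of overlap-aware splitting; the 200K chunk size and 500-character overlap come from the sections above, while the function itself is illustrative:
CHUNK_SIZE = 200_000   # characters per chunk
OVERLAP = 500          # characters shared between adjacent chunks

def split_with_overlap(text: str, chunk_size: int = CHUNK_SIZE, overlap: int = OVERLAP) -> list:
    """Split text into fixed-size chunks that share an overlap across boundaries."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap   # step back by the overlap before starting the next chunk
    return chunks

# A 450K-character document becomes three chunks (200K, 200K, 51K characters).
print([len(c) for c in split_with_overlap("x" * 450_000)])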
3. Parallel Processing
For large documents:
- Chunks processed simultaneously using asyncio
- Each chunk maintains its position context (e.g., [Chunk 1/5])
- Errors in one chunk don’t affect others (see the sketch below)
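A minimal asyncio sketch of this fan-out; summarize_chunk is an illustrative stand-in for the real per-chunk Gemini call, and return_exceptions=True is what keeps one failed chunk from aborting the others:
import asyncio

async def summarize_chunk(chunk: str, question: str) -> str:
    # Stand-in for the real per-chunk summarization request.
    return f"summary of {len(chunk)} chars"

async def summarize_all(chunks: list, question: str) -> list:
    tasks = [summarize_chunk(chunk, question) for chunk in chunks]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    summaries = []
    for i, result in enumerate(results, start=1):
        if isinstance(result, Exception):
            # A failed chunk is marked rather than failing the whole document.
            summaries.append(f"[Chunk {i}/{len(chunks)} - ERROR] {result}")
        else:
            summaries.append(f"[Chunk {i}/{len(chunks)}] {result}")
    return summaries

# asyncio.run(summarize_all(["a" * 1000, "b" * 2000], "What changed?"))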
4. Executive Summaries
For very large documents:
- High-level overview capturing essence
- Focus on key findings and main themes
- Prioritizes question-relevant information
- Limited to 2000 tokens for conciseness
5. Error Handling
Graceful degradation when summarization fails:
- Fallback to truncation with clear markers
- Error chunks marked as [Chunk X/Y - ERROR]
- Preserves as much content as possible (sketched below)
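A sketch of that graceful-degradation path; the marker text, truncation length, and the injected summarize callable are illustrative assumptions:
async def summarize_with_fallback(chunk: str, question: str, summarize, max_chars: int = 50_000) -> str:
    """Try to summarize; on failure, fall back to marked truncation so content is not lost silently."""
    try:
        return await summarize(chunk, question)
    except Exception as exc:
        # Fall back to truncation with a clear marker so downstream steps know what happened.
        return f"[SUMMARIZATION FAILED: {exc}] Truncated original content follows:\n{chunk[:max_chars]}"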
Model-Specific Token Handling
The system uses different models for different purposes, each with distinct document processing strategies:
Recent Enhancement: 8x Larger Summary Outputs
Previous Limitation: Summaries were limited to ~8,000 tokens (8K)
Current Capability: Summaries can now be up to ~65,000 tokens (64K)
This 8x increase in summary output capacity has dramatically improved information preservation and reduced the need for aggressive compression. The impact is significant across all document processing scenarios.
Model Roles
Gemini 2.5 Flash - Summarization Engine
- Primary Use: Content summarization, chunking, and document processing
- Context Window: ~1M tokens (~4M characters)
- Processing Strategy: Efficient summarization with less aggressive compression
- Direct Pass-Through Threshold: Documents up to 800K tokens
- Chunking Threshold: 3.2M characters (800K tokens)
Gemini 2.5 Pro - Final Answer Generation
- Primary Use: Final answer synthesis using summarized content
- Context Window: ~1M tokens (~4M characters)
- Processing Strategy: Comprehensive analysis of pre-processed content
- Content Integration: Receives summarized content from Flash
Anthropic Claude - Final Answer Generation
- Primary Use: Final answer generation (alternative to Gemini Pro)
- Context Window: ~200K tokens (~800K characters)
- Processing Strategy: Works with heavily compressed summaries
- Content Integration: Requires more aggressive pre-summarization
Processing Pipeline by Model
Gemini-Based Pipeline (Flash → Pro)
1. Content Processing (Gemini 2.5 Flash):
- Large document summarization
- Tool result processing
- Chunk-based analysis
- Less aggressive compression (leverages 1M context)
2. Final Answer (Gemini 2.5 Pro):
- Receives Flash-summarized content
- 1M token context for comprehensive analysis
- Can work with larger summarized datasets
- More detailed final responses possible
Anthropic-Based Pipeline (Flash → Claude)
1. Content Processing (Gemini 2.5 Flash):
- Same summarization as above
- Additional compression layer for Claude compatibility
- More aggressive final summarization
- Reduced to fit 200K context limit
2. Final Answer (Anthropic Claude):
- Receives heavily compressed summaries
- 200K token context requires focused content
- Excellent at synthesis from compressed information
- Highly focused, distilled responses
Token Allocation Differences
Gemini Pro Final Answer (1M Context)
total_char_limit = 4000000 # ~1M tokens
chat_history_pct = 40% # 1.6M chars for chat history
question_pct = 10% # 400K chars for question
context_pct = 40% # 1.6M chars for context/tools
reserved_pct = 10% # 400K chars overhead
Anthropic Claude Final Answer (200K Context)
total_char_limit = 800000 # ~200K tokens
chat_history_pct = 40% # 320K chars for chat history
question_pct = 10% # 80K chars for question
context_pct = 40% # 320K chars for context/tools
reserved_pct = 10% # 80K chars overhead
Real-World Document Size Context
Document Size Reference
- 100K tokens = ~250 pages = PhD thesis, technical manual
- 250K tokens = ~625 pages = Complete book, comprehensive documentation
- 1M tokens = ~2,500 pages = Multiple books, complete documentation suite
- 2.5M tokens = ~6,250 pages = Massive dataset, legal case collection
Processing Categories
- Tiny documents: Up to 5K tokens (Email, newsletter, contract, article)
- Small documents: 5K-100K tokens (Report, manual chapter, research paper)
- Medium documents: 100K-250K tokens (Complete book, technical documentation)
- Large collections: 250K-1M tokens (Multiple books, documentation suite)
- Very large collections: 1M+ tokens (Massive datasets, legal archives)
Everyday Document Examples
- Email/Newsletter: ~500 tokens (1-2 pages) - No processing needed
- News Article: ~800 tokens (2 pages) - No processing needed
- Business Contract: ~2,000 tokens (5 pages) - No processing needed
- Blog Post: ~1,200 tokens (3 pages) - No processing needed
- White Paper: ~10K tokens (25 pages) - No processing needed
- Research Report: ~30K tokens (75 pages) - No processing needed
- Technical Manual: ~60K tokens (150 pages) - Direct processing
- Complete Book: ~120K tokens (300 pages) - Light chunking needed
Size-Based Examples by Processing Pipeline
Example 0: Tiny Documents (No Processing Needed)
Most everyday documents are tiny and pass through without any summarization:
Business Contract (5 pages, 2K tokens)
Both Gemini and Anthropic Pipelines:
Input: 2K token business contract (5 pages)
Processing: No summarization needed (well under all limits)
Output: Full contract passed through unchanged
Result:
- Complete contract terms and conditions
- All legal clauses preserved verbatim
- Signatures, dates, and parties intact
- Ready for detailed legal analysis
Newsletter/Email (2 pages, 800 tokens)
Both Gemini and Anthropic Pipelines:
Input: 800 token newsletter (2 pages)
Processing: No summarization needed
Output: Full newsletter content unchanged
Result:
- All articles and announcements preserved
- Contact information and links intact
- Formatting and structure maintained
- Complete context for Q&A
Research Paper (15 pages, 6K tokens)
Both Gemini and Anthropic Pipelines:
Input: 6K token research paper (15 pages)
Processing: No summarization needed
Output: Full paper passed through unchanged
Result:
- Complete abstract, methodology, and results
- All references and citations preserved
- Figures and tables descriptions intact
- Full context for academic analysis
Example 1: Small Document (PhD Thesis - ~200 pages, 80K tokens)
Gemini Pipeline (Flash → Pro)
Step 1 - Gemini 2.5 Flash Processing:
Input: 80K token PhD thesis (200 pages)
Processing: No summarization needed (under threshold)
Output: Full document passed through unchanged
Step 2 - Gemini 2.5 Pro Final Answer:
Input: 80K token thesis + chat history
Token budget: 1M tokens available
Context allocation: 1.6M chars (400K tokens) for tool content
Result: Comprehensive analysis with full document context
- Complete access to all chapters and references
- Detailed methodology analysis
- Full literature review preserved
Context remaining: 920K tokens for detailed response
Anthropic Pipeline (Flash → Claude)
Step 1 - Gemini 2.5 Flash Processing:
Input: 80K token PhD thesis (200 pages)
Processing: Light compression for Claude compatibility
Output: ~65K token summary (compressed for 200K limit)
Step 2 - Anthropic Claude Final Answer:
Input: 65K token summary + chat history
Token budget: 200K tokens available
Context allocation: 320K chars (80K tokens) for tool content
Result: Focused analysis with key elements preserved
- Main findings and conclusions intact
- Core methodology preserved
- Key references maintained
Context remaining: 135K tokens for response
Example 2: Medium Document (Complete Book - ~300 pages, 120K tokens)
Gemini Pipeline (Flash → Pro)
Step 1 - Gemini 2.5 Flash Processing:
Input: 120K token novel (300 pages)
Processing: Direct summarization (single pass)
Output: ~100K token summary with detailed preservation
Step 2 - Gemini 2.5 Pro Final Answer:
Input: 100K token summary + chat history
Token allocation: 1.6M chars (400K tokens) for context
Result:
- Detailed chapter-by-chapter analysis
- Character development preserved
- Plot structure and themes intact
- Key dialogue and scenes quoted
- Literary analysis with extensive examples
Anthropic Pipeline (Flash → Claude)
Step 1 - Gemini 2.5 Flash Processing:
Input: 120K token novel (300 pages)
Processing: Aggressive compression for Claude compatibility
Output: ~40K token summary (heavily compressed)
Step 2 - Anthropic Claude Final Answer:
Input: 40K token summary + chat history
Token allocation: 320K chars (80K tokens) for context
Result:
- High-level plot summary and themes
- Main character arcs preserved
- Key scenes and turning points
- Focused literary analysis
- Essential quotes and examples
Example 3: Large Document Collection (Encyclopedia Volume - ~800 pages, 320K tokens)
Gemini Pipeline (Flash → Pro)
Step 1 - Gemini 2.5 Flash Processing:
Input: Encyclopedia volume (800 pages, 320K tokens)
Processing:
1. Split into 2-3 large chunks (160K per chunk)
2. Each chunk gets substantial token allocation (133K-200K tokens)
3. Minimal compression needed due to large output capacity
Output: ~280K token comprehensive summary
Step 2 - Gemini 2.5 Pro Final Answer:
Input: 280K token summary + chat history
Token allocation: 1.6M chars (400K tokens) for context
Result:
- Detailed entries for major topics preserved
- Cross-references between articles maintained
- Historical context and examples included
- Technical definitions with full explanations
- Comprehensive coverage across all subject areas
Anthropic Pipeline (Flash → Claude)
Step 1 - Gemini 2.5 Flash Processing:
Input: Encyclopedia volume (800 pages, 320K tokens)
Processing:
1. Split into 8-10 focused chunks
2. Aggressive compression for Claude compatibility
3. Hierarchical summarization by subject area
Output: ~65K token compressed summary
Step 2 - Anthropic Claude Final Answer:
Input: 65K token summary + chat history
Token allocation: 320K chars (80K tokens) for context
Result:
- Essential entries for major topics
- Key definitions and concepts
- Important historical facts
- Cross-references between related topics
- Focused synthesis by subject category
Example 4: Very Large Dataset (Legal Case Collection - ~1,000 pages, 400K tokens)
Gemini Pipeline (Flash → Pro)
Step 1 - Gemini 2.5 Flash Processing:
Input: Complex legal case files (1,000 pages, 400K tokens)
Processing:
1. Split into 3-4 major case chunks
2. Each chunk processed with high detail preservation
3. Legal precedents and citations maintained
Output: ~350K token comprehensive summary
Step 2 - Gemini 2.5 Pro Final Answer:
Input: 350K token summary + chat history
Token allocation: 1.6M chars (400K tokens) for context
Result:
[CASE OVERVIEW - 50K tokens]
- Complete case timeline and key events
- All parties and their roles identified
- Jurisdiction and legal framework
[LEGAL ANALYSIS - 300K tokens]
[Case 1] - Contract Dispute
- Full contract terms and disputed clauses
- Precedent cases cited with details
- Court decisions with reasoning
- Settlement terms and implications
[Case 2] - Liability Claims
- Incident details and evidence presented
- Expert testimony summaries
- Damage assessments and calculations
- Appeal outcomes and final judgments
...
Anthropic Pipeline (Flash → Claude)
Step 1 - Gemini 2.5 Flash Processing:
Input: Complex legal case files (1,000 pages, 400K tokens)
Processing:
1. Heavy summarization focused on key legal points
2. Essential precedents and outcomes preserved
3. Multi-level compression for Claude compatibility
Output: ~65K token focused summary
Step 2 - Anthropic Claude Final Answer:
Input: 65K token summary + chat history
Token allocation: 320K chars (80K tokens) for context
Result:
[EXECUTIVE SUMMARY - 15K tokens]
- Key legal issues and outcomes
- Major precedents established
- Financial implications and settlements
[CASE SUMMARIES - 50K tokens]
[Contract Disputes - 20K tokens]
- Core contractual issues
- Key precedents cited
- Final outcomes and implications
[Liability Claims - 20K tokens]
- Main liability determinations
- Damage awards and reasoning
- Appeal results
[Regulatory Issues - 10K tokens]
- Compliance violations identified
- Regulatory responses and penalties
...
Processing Pipeline Differences
Gemini Pipeline (Flash → Pro) Characteristics
- Summarization Stage (Flash):
- Fewer chunks needed due to 1M context
- Less recursive summarization required
- Larger batch sizes for efficient processing
- Less aggressive compression preserves more detail
- Final Answer Stage (Pro):
- 1M context allows comprehensive analysis
- Can work with larger summarized datasets
- More detailed final responses possible
- Better cross-document synthesis
Anthropic Pipeline (Flash → Claude) Characteristics
- Summarization Stage (Flash):
- Same Flash processing capabilities
- Additional compression layer for Claude compatibility
- More aggressive final summarization required
- Optimized for 200K context target
- Final Answer Stage (Claude):
- 200K context requires focused content
- Excellent synthesis from compressed information
- Highly distilled, focused responses
- Superior reasoning with limited context
Content Quality Trade-offs by Pipeline
Gemini Pipeline Advantages
- Higher detail preservation: Flash’s 1M context enables nuanced summaries
- Better context retention: Cross-references maintained through Pro’s large context
- Less information loss: Larger summaries possible throughout pipeline
- Comprehensive analysis: Pro can synthesize more complex relationships
- Detailed responses: More space for thorough explanations
Anthropic Pipeline Advantages
- Highly focused summaries: Forces extraction of essential points
- Superior reasoning: Claude excels at synthesis from compressed information
- Consistent quality: Proven compression and distillation capabilities
- Efficient processing: Lower token costs for summarization
- Concise insights: Excellent at identifying core concepts
Model Selection Strategy
Choose Gemini Pipeline When:
- Document complexity is high: Technical specifications, research papers
- Detail preservation is critical: Legal documents, medical records
- Cross-referencing needed: Multiple related documents
- Comprehensive analysis required: In-depth technical questions
- Context richness matters: Complex queries requiring nuanced understanding
Choose Anthropic Pipeline When:
- Focus and clarity are paramount: Executive summaries, decision-making
- Processing efficiency is important: Large-scale document processing
- Distillation quality matters: Extracting key insights from complex data
- Reasoning depth needed: Complex logical analysis and synthesis
- Concise responses preferred: Clear, actionable insights
Impact of 8x Larger Summary Outputs (8K → 64K)
The increase from 8,000 to 65,000 token summary outputs has fundamentally changed how the system handles documents:
Before vs After Comparison
Small Document Processing (400K chars ~100K tokens)
Before (8K Summary Limit):
Input: 400K character research paper
Processing: Triggers recursive chunking (>400K threshold)
1. Split into 2 chunks of 200K chars each
2. Each chunk gets token_limit//2 = 4K tokens per chunk
3. 2 × 4K = 8K combined summaries
4. No recursive pass needed (fits in 8K limit)
Output: 8K token summary
Information Preservation: ~8% of original content
Result:
- Only key findings preserved
- Most methodology details lost
- Limited quotes and examples
- Minimal cross-references
After (64K Summary Limit):
Input: 400K character research paper
Processing: Chunked summarization (above 400K threshold, under 1M limit)
1. Split into 2 chunks of 200K chars each
2. Each chunk gets token_limit//2 = 32K tokens per chunk
3. 2 × 32K = 64K combined summaries
4. No recursive pass needed (fits in 64K limit)
Output: 64K token summary
Information Preservation: ~64% of original content
Result:
- Comprehensive methodology preserved
- Extensive quotes and examples
- Detailed findings with supporting evidence
- Full cross-references and citations maintained
Medium Document Processing (1M chars ~250K tokens)
Before (8K Summary Limit):
Input: 1M character technical documentation
Processing: Recursive chunking strategy
1. Split into 5 chunks of 200K chars each
2. Each chunk gets min(8K//5, 8K) = 1.6K tokens per chunk
3. 5 × 1.6K = 8K combined summaries
4. Combined summaries = 8K (no recursive pass needed)
Output: 8K token summary
Information Preservation: ~3.2% of original content
Result: Severe information loss
- Only highest-level concepts preserved
- Most technical details lost
- Implementation examples removed
- Architecture details heavily simplified
After (64K Summary Limit):
Input: 1M character technical documentation
Processing: Recursive chunking strategy
1. Split into 5 chunks of 200K chars each
2. Each chunk gets min(64K//5, 64K) = 12.8K tokens per chunk
3. 5 × 12.8K = 64K combined summaries
4. Combined summaries = 64K (no recursive pass needed)
Output: 64K token summary
Information Preservation: ~25.6% of original content
Result: Significant information preservation
- Technical details largely maintained
- Code examples preserved
- Architecture diagrams and explanations intact
- Implementation guidance preserved
Large Document Collection (5M chars ~1.25M tokens)
Before (8K Summary Limit):
Input: 5M character document collection
Processing: Multi-level recursive chunking
1. Split into 25 chunks of 200K chars each
2. Each chunk gets min(8K//25, 8K) = 320 tokens per chunk
3. 25 × 320 = 8K combined summaries
4. Combined summaries trigger recursive summarization:
- Executive summary: min(2K, 8K//4) = 2K tokens
- Detailed summary: 8K - 2K - 100 = 5.9K tokens
- Further recursive compression if needed
Output: ~8K token summary
Information Preservation: ~0.64% of original content
Result:
- Only highest-level themes preserved
- No detailed methodology
- No comparative analysis possible
- Lost connections between documents
After (64K Summary Limit):
Input: 5M character document collection
Processing: Single-level recursive chunking
1. Split into 25 chunks of 200K chars each
2. Each chunk gets min(64K//25, 64K) = 2.56K tokens per chunk
3. 25 × 2.56K = 64K combined summaries
4. Combined summaries = 64K (no recursive pass needed)
Output: 64K token summary
Information Preservation: ~5.12% of original content
Result:
- Key methodologies for each document preserved
- Important findings with supporting evidence
- Some comparative analysis across documents
- Critical connections and relationships maintained
Very Large Document Collection (10M chars ~2.5M tokens)
Before (8K Summary Limit):
Input: 10M character document collection
Processing: Deep recursive chunking (multiple levels)
1. Split into 50 chunks of 200K chars each
2. Each chunk gets min(8K//50, 8K) = 160 tokens per chunk
3. 50 × 160 = 8K combined summaries
4. Triggers recursive summarization:
- Executive summary: 2K tokens
- Detailed summary: 5.9K tokens
- May trigger additional recursive levels
Output: ~8K token summary
Information Preservation: ~0.32% of original content
Result:
- Only document titles and main themes
- No detailed content preserved
- No relationships between documents
- Minimal actionable information
After (64K Summary Limit):
Input: 10M character document collection
Processing: Controlled recursive chunking
1. Split into 50 chunks of 200K chars each
2. Each chunk gets min(64K//50, 64K) = 1.28K tokens per chunk
3. 50 × 1.28K = 64K combined summaries
4. Combined summaries = 64K (fits without recursion)
Output: 64K token summary
Information Preservation: ~2.56% of original content
Result:
- Document summaries with key points preserved
- Essential findings and methodologies
- Basic comparative analysis possible
- Important cross-document relationships maintained
Tool-Specific Improvements
extract_files.py Enhancement Needed
# Current limitation (needs update)
max_output_tokens=8192
# Should be updated to
max_output_tokens=65535 # or MODEL_TOKEN_OUTPUT_LIMIT
Impact on File Processing:
- Before: Each file limited to 8K summary regardless of size
- After: Each file can generate up to 64K summary, preserving much more detail
- Result: 8x more information preserved per file processed
google_search.py Enhancement Needed
# Current limitation (needs update)
max_output_tokens=8192
# Should be updated to
max_output_tokens=65535
Impact on Search Results:
- Before: Search results heavily compressed, losing context and details
- After: Comprehensive search summaries with full context and quotes
url_processing.py Enhancement Needed
# Current limitation (needs update)
max_output_tokens=8192
# Should be updated to
max_output_tokens=65535
Impact on Web Content:
- Before: Web pages reduced to basic summaries
- After: Full article content with detailed analysis
Real-World Scenarios
Scenario 1: Everyday Business Documents (No Processing)
Input: Business contract (5 pages, 2K tokens)
Newsletter (2 pages, 800 tokens)
Email thread (3 pages, 1.2K tokens)
Processing: No summarization needed for any model
Result: Full documents preserved unchanged
- All legal terms and conditions intact
- Complete newsletter content and links
- Full email conversation history
- Ready for immediate analysis
Scenario 2: Professional Reports (Light Processing)
Input: Technical manual (150 pages, 60K tokens)
Gemini Pipeline:
- Direct processing, minimal compression
- Full technical details preserved
- All examples and procedures intact
Anthropic Pipeline:
- Light compression for Claude compatibility
- Key procedures and examples preserved
- Technical specifications maintained
Scenario 3: Large Document Analysis (Heavy Processing)
Input: Complete legal case file (1,000 pages, 400K tokens)
Before (8K limit):
- Only case outcomes preserved
- Legal reasoning lost
- Precedent details missing
- Evidence summaries removed
After (64K limit):
Gemini Pipeline:
- Comprehensive case analysis
- Legal reasoning preserved
- Precedent cases with details
- Evidence and testimony summaries
Anthropic Pipeline:
- Focused legal analysis
- Key precedents and outcomes
- Essential evidence highlights
- Concentrated legal insights
Performance Impact
Processing Efficiency
- Reduced Chunking: Fewer chunks needed due to larger output capacity
- Fewer Recursive Passes: Less need for multi-level summarization
- Better Parallel Processing: Larger chunks process more efficiently
Quality Improvements
- Context Preservation: Relationships between concepts maintained
- Detail Retention: Technical specifications preserved
- Quote Accuracy: More verbatim content possible
- Reference Integrity: Citations and links maintained
Recommendations for Tool Updates
Tools still using 8K limits should be updated:
# Update these tools to use the new limit:
# - extract_files.py
# - google_search.py
# - url_processing.py
# - image_creation.py
# Replace this:
max_output_tokens=8192
# With this:
from models.limit_content import MODEL_TOKEN_OUTPUT_LIMIT
max_output_tokens=MODEL_TOKEN_OUTPUT_LIMIT # 65535
This update will provide immediate benefits:
- 8x more information preserved per tool operation
- Better context for final answer generation
- Reduced information loss in tool processing chains
- More comprehensive responses to user queries
Best Practices
- Question Context: Always provide clear questions to guide summarization
- Metadata Preservation: URLs, timestamps, and identifiers are prioritized
- Structured Data: Tables and lists are preserved when relevant
- Code Blocks: Programming code is quoted verbatim when relevant
- Error Information: Error messages and stack traces preserved in full
Configuration
Adjusting Limits
# In limit_content.py
LIMIT_CONTENT_MAX = 1000000 # Increase for larger direct summaries
CHUNK_SIZE = 200000 # Adjust chunk granularity
RECURSIVE_THRESHOLD = 400000 # Change when chunking starts
# In anthropic.py
total_char_limit = 180000 # Total context window
chat_history_pct = 40 # Adjust allocation percentages
question_pct = 10
context_pct = 40
Model Selection
- Summarization: Gemini-2.5-flash (fast, efficient)
- Executive Summaries: Gemini-2.5-flash with 2000 token limit
- Final Answers: Anthropic Claude or Gemini based on configuration
Monitoring and Debugging
Logging
log.info("Content size X exceeds threshold, using recursive chunking")
log.warning("Content was > 1M, truncating for direct summarization")
log.error("Error summarizing chunk X: error_message")
Tracing
- Langfuse integration tracks each summarization step
- Parent-child observation hierarchy
- Token usage tracked at each level
Metrics
- Input size vs output size ratios
- Chunking frequency and depths
- Summarization latency by document size
- Error rates by chunk
Tool-Specific Large Output Handling
The system includes multiple tools that can generate extremely large outputs (millions of tokens). Each tool has specific strategies for managing size:
High-Risk Tools (Multi-Million Token Potential)
1. extract_files.py - File Processing Tool
- Output Size: Can process 100-200 files, generating millions of tokens
- Limiting Strategy:
- File size check: 49MB limit per file for Gemini processing
- Batch processing: Handles files in batches of 1-20 files
- Progressive limits: Reduces batch size for large file counts
- Timeout controls: 4-8 minute timeouts per file
- Sequential processing: Forces one-at-a-time for 100+ files
- File prioritization: Text files → PDFs → Images → Others
- Direct text extraction: Small text files (<1MB) extracted directly
# File processing limits
GEMINI_FILE_SIZE_LIMIT = 1024*1024*49 # 49MB per file
batch_size = 1 if len(files) > 100 else min(20, len(files))
timeout_per_file = 4*60 # 4 minutes
hard_timeout_per_file = 8*60 # 8 minutes hard limit
2. ai_search.py - Document Search Tool
- Output Size: Searches large document collections, potentially millions of tokens
- Limiting Strategy:
- Content limiting: 500K character limit with fallback to 100K
- Result limiting: Default max 10 search results
- Summarization: Uses limit_content() for large result sets
# AI search limits
max_limit = 10 # Maximum search results
content_limit = 500000 # 500K character limit
fallback_limit = 100000 # 100K fallback limit
3. user_history.py - Chat History Tool
- Output Size: Searches across user chat histories, can accumulate large amounts
- Limiting Strategy:
- Result limiting: Default max 5 history results
- Content summarization: Applies limit_content() to results
- Query specificity: Focuses searches to reduce result size
4. code_execution.py - Code Runner Tool
- Output Size: Code execution can produce large outputs (data analysis, visualizations)
- Limiting Strategy:
- Content limiting: 200K character limit via limit_gemini_content()
- History limiting: Manages conversation history size
- Output truncation: Large code outputs are summarized
Medium-Risk Tools
5. google_search.py - Web Search Tool
- Output Size: Comprehensive web search results with metadata
- Limiting Strategy:
- Token limiting: 8192 max output tokens
- Result filtering: Focused search result processing
6. url_processing.py - Web Content Tool
- Output Size: Processes content from multiple URLs
- Limiting Strategy:
- Token limiting: 8192 max output tokens per URL
- Content extraction: Focuses on relevant content sections
Tool Result Integration Strategy
When tools return large outputs, the system applies a multi-stage reduction process:
Stage 1: Tool-Level Limiting
- Each tool implements its own size controls
- Pre-processing limits (file counts, result counts)
- Timeout and batch controls for processing
Stage 2: Content Integration
- Tool results combined in anthropic.py and gemini_smart_utils.py
- Percentage-based allocation:
- Context/tool results: 40% of total token budget (72K chars)
- Chat history: 40% (72K chars)
- Question: 10% (18K chars)
- Reserved: 10% for overhead
Stage 3: Final Summarization
- If combined tool results exceed the context allocation (72K chars), the limit_context() function applies intelligent summarization
- Preserves key findings and relevant details
- Maintains tool metadata and source references
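The three stages can be pictured as a small piece of glue code. limit_context is the summarization entry point named above, but its exact signature and the injection style here are assumptions:
CONTEXT_CHAR_BUDGET = 72_000   # 40% of the 180K total character budget

async def integrate_tool_results(tool_results: list, question: str, limit_context) -> str:
    combined = "\n\n".join(tool_results)           # Stage 2: combine tool outputs
    if len(combined) <= CONTEXT_CHAR_BUDGET:       # already fits the 40% context allocation
        return combined
    # Stage 3: intelligent summarization down to the context allocation.
    return await limit_context(combined, question, CONTEXT_CHAR_BUDGET)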
Example: Large File Collection Processing
Input: 150 files (50 PDFs, 100 text files) totaling 50M tokens
Processing:
1. Tool-level limiting:
- Prioritizes text files first
- Batches into groups of 3-5 files
- Direct text extraction for small files
- Gemini processing for PDFs/large files
2. Content integration:
- 150 file summaries combined
- Total output: ~2M tokens
- Exceeds 72K context allocation
3. Final summarization:
- Executive summary of key documents
- Detailed chunk summaries by file type
- Preserved source references and metadata
- Final result: ~65K tokens fitting context window
Tool Result Caching
Several tools implement caching to avoid re-processing:
- extract_files.py: Caches file content extractions and URL processing
- ai_search.py: May cache search results based on query patterns
- url_processing.py: Caches processed URL content
Error Handling for Large Outputs
Tools implement graceful degradation when size limits are exceeded:
- Timeout handling: Partial results returned if processing times out
- Memory limits: Fallback to smaller batch sizes or sequential processing
- API limits: Retry with reduced content or alternative processing
- Truncation: Clear markers when content is truncated due to limits
Monitoring Tool Output Sizes
The system tracks tool performance through Langfuse:
- Input/output token counts per tool
- Processing times and timeout rates
- Content reduction ratios
- Error rates by tool and content size
Tool Configuration for Size Management
Tools can be configured with size-related parameters:
# In tool configurations
toolConfigs = {
"extract_files": {
"max_files": 100,
"batch_size": 5,
"timeout_per_file": 240
},
"ai_search": {
"max_results": 10,
"content_limit": 500000
},
"user_history": {
"max_results": 5,
"days_back": 30
}
}
Future Improvements
- Adaptive Chunking: Dynamic chunk sizes based on content density
- Semantic Chunking: Split at natural boundaries (paragraphs, sections)
- Priority Scoring: Score content relevance before summarization
- Caching: Cache summaries for frequently accessed large documents
- Streaming: Stream partial summaries as they’re generated
- Tool-Specific Limits: Per-tool token budgets based on typical output sizes
- Progressive Loading: Stream tool results as they complete processing
- Content Deduplication: Remove duplicate content across multiple tool results
Summary: Key Impacts of 8K → 64K Summary Increase
Dramatic Information Preservation Improvements (Based on Actual Recursive Chunking):
- Small documents (400K chars/100K tokens): Preservation improved from 8% to 64% (8x improvement)
- Medium documents (1M chars/250K tokens): Preservation improved from 3.2% to 25.6% (8x improvement)
- Large collections (5M chars/1.25M tokens): Preservation improved from 0.64% to 5.12% (8x improvement)
- Very large collections (10M chars/2.5M tokens): Preservation improved from 0.32% to 2.56% (8x improvement)
Critical Insight:
The improvement follows the chunking algorithm exactly - each chunk gets token_limit // num_chunks tokens, so the 8x increase in token_limit translates directly to 8x more tokens per chunk, resulting in consistently 8x better information preservation across all document sizes.
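That arithmetic can be checked directly; the chunk counts below follow the worked examples in this document, and the helper is only a restatement of the token_limit // num_chunks rule:
CHUNK_SIZE = 200_000  # characters per chunk

def per_chunk_tokens(doc_chars: int, token_limit: int) -> int:
    """Token budget each chunk receives: the summary limit split evenly over the chunks."""
    num_chunks = -(-doc_chars // CHUNK_SIZE)       # ceiling division
    return min(token_limit // num_chunks, token_limit)

for doc_chars in (400_000, 1_000_000, 5_000_000, 10_000_000):
    old = per_chunk_tokens(doc_chars, 8_000)       # before: 8K summary limit
    new = per_chunk_tokens(doc_chars, 64_000)      # after: 64K summary limit
    print(f"{doc_chars:>10} chars: {old:>5} -> {new:>6} tokens per chunk")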
Reduced Recursive Complexity:
- Before: Large documents often triggered multiple levels of recursive summarization
- After: Most documents can be processed in a single chunking pass, avoiding recursive compression losses
- Result: More predictable and higher-quality summaries
Tools Requiring Updates:
The following tools were reviewed for hard-coded 8K limits; those not yet updated should be changed to leverage the full 64K capacity:
- extract_files.py (Updated: Now uses default Gemini limits)
- google_search.py (Updated: Now uses default Gemini limits)
- url_processing.py (Needs update: Still has max_output_tokens=8192)
- image_creation.py (Needs update: Still has max_output_tokens=8192)