# Chunking Strategy Optimizations for 1M Input / 64K Output
This document proposes optimizations to the current chunking strategy to maximize information preservation while staying within model limits.
## Current Strategy Analysis

### Current Constants
```python
LIMIT_CONTENT_MAX = 1000000        # 1M chars max for direct summarization
CHUNK_SIZE = 200000                # 200K chars per chunk
RECURSIVE_THRESHOLD = 400000       # When to start chunking
MODEL_TOKEN_OUTPUT_LIMIT = 65535   # 64K tokens output limit
```
### Current Issues
- Conservative chunk sizing: 200K-char chunks underutilize Gemini's 1M-token input capacity
- Fixed chunk allocation: `token_limit // num_chunks` can leave very small per-chunk summary budgets for large documents (illustrated in the sketch below)
- No model-specific optimization: the same strategy is used for Anthropic (200K-token) and Gemini (1M-token) contexts
- Character-based chunking: does not account for token density variations
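To make the allocation problem concrete, here is a minimal sketch using the current constants, assuming the shared output budget is split evenly across chunk summaries (which is what `token_limit // num_chunks` does):

```python
import math

CHUNK_SIZE = 200000                # current chunk size (chars)
MODEL_TOKEN_OUTPUT_LIMIT = 65535   # shared output budget for all chunk summaries

for content_chars in (1_000_000, 5_000_000, 20_000_000):
    num_chunks = math.ceil(content_chars / CHUNK_SIZE)
    tokens_per_chunk = MODEL_TOKEN_OUTPUT_LIMIT // num_chunks
    print(f"{content_chars:,} chars -> {num_chunks} chunks -> {tokens_per_chunk:,} tokens per chunk summary")

# 1,000,000 chars -> 5 chunks -> 13,107 tokens per chunk summary
# 5,000,000 chars -> 25 chunks -> 2,621 tokens per chunk summary
# 20,000,000 chars -> 100 chunks -> 655 tokens per chunk summary
```

At 20M chars, each chunk is compressed into a few hundred tokens, which is where most of the information loss comes from.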
## Proposed Optimizations

### 1. Model-Aware Adaptive Chunking
```python
from typing import Optional

# New optimized constants
GEMINI_INPUT_LIMIT = 1000000       # ~1M tokens (~4M chars)
ANTHROPIC_INPUT_LIMIT = 200000     # ~200K tokens (~800K chars)
GEMINI_CHUNK_SIZE = 800000         # 800K chars per chunk for Gemini
ANTHROPIC_CHUNK_SIZE = 160000      # 160K chars per chunk for Anthropic
MIN_SUMMARY_SIZE = 8000            # Minimum viable summary size (tokens)
MAX_SUMMARY_SIZE = 65535           # Maximum output tokens

async def adaptive_limit_content(
    content_str: str,
    question: str,
    target_model: str = "gemini",              # "gemini" or "anthropic"
    final_token_limit: Optional[int] = None,
    **kwargs,
) -> str:
    """Model-aware content limiting with optimized chunking"""
    # Set model-specific parameters
    if target_model == "anthropic":
        input_limit = ANTHROPIC_INPUT_LIMIT * 4    # ~800K chars
        chunk_size = ANTHROPIC_CHUNK_SIZE
        final_limit = final_token_limit or 80000   # 80K for Anthropic context
    else:  # gemini
        input_limit = GEMINI_INPUT_LIMIT * 4       # ~4M chars
        chunk_size = GEMINI_CHUNK_SIZE
        final_limit = final_token_limit or 400000  # 400K for Gemini context

    # Strategy selection based on content size
    if len(content_str) <= final_limit:
        return content_str  # No processing needed

    if len(content_str) <= input_limit:
        return await _optimized_direct_summarize(content_str, question, final_limit, target_model)

    return await _optimized_recursive_summarize(content_str, question, final_limit, chunk_size, target_model)
```
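A minimal usage sketch (the placeholder document and question are illustrative only, and the call dispatches to the `_optimized_*` helpers, which are assumed to be implemented elsewhere):

```python
import asyncio

async def main() -> None:
    # Placeholder: ~600K chars of synthetic text standing in for scraped content.
    document = "Quarterly revenue grew in the APAC region. " * 14_000

    condensed = await adaptive_limit_content(
        document,
        question="What are the key findings?",
        target_model="anthropic",   # defaults to an 80K final limit, per the code above
    )
    print(f"{len(document):,} chars in, {len(condensed):,} chars out")

asyncio.run(main())
```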
### 2. Optimized Chunk Size Calculation

Instead of a fixed division of the token budget, derive the chunk configuration from the content size, the target model, and a minimum viable per-chunk summary budget:
```python
import math

def calculate_optimal_chunks(
    content_size: int,
    target_model: str,
    final_limit: int,
    min_summary_tokens: int = MIN_SUMMARY_SIZE,
) -> tuple:
    """Calculate optimal chunk configuration"""
    if target_model == "anthropic":
        max_chunk_size = ANTHROPIC_CHUNK_SIZE
    else:
        max_chunk_size = GEMINI_CHUNK_SIZE

    # Number of chunks needed at the model's preferred chunk size
    num_chunks = math.ceil(content_size / max_chunk_size)

    # Token allocation per chunk summary
    tokens_per_chunk = final_limit // num_chunks

    # If summaries would be too small, reduce the chunk count instead
    if tokens_per_chunk < min_summary_tokens:
        max_chunks = max(1, final_limit // min_summary_tokens)
        num_chunks = min(num_chunks, max_chunks)
        tokens_per_chunk = final_limit // num_chunks

    # Recalculate actual chunk size to spread content evenly
    actual_chunk_size = content_size // num_chunks

    return num_chunks, actual_chunk_size, tokens_per_chunk

# Example results (final_limit is the shared summary budget in tokens):
# 5M chars, Gemini, 64K budget                → 7 chunks,  ~714K chars each, ~9.1K tokens each
# 5M chars, Anthropic, 80K budget, 2K minimum → 32 chunks, ~156K chars each,  2.5K tokens each
```
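Reproducing both example rows with the function above (the 64K and 80K budgets and the 2K Anthropic minimum are the assumptions stated in the comments):

```python
print(calculate_optimal_chunks(5_000_000, "gemini", 64_000))
# (7, 714285, 9142)

print(calculate_optimal_chunks(5_000_000, "anthropic", 80_000, min_summary_tokens=2_000))
# (32, 156250, 2500)
```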
### 3. Smart Semantic Chunking
Replace arbitrary character boundaries with intelligent content-aware splitting:
```python
from typing import List

async def semantic_chunk_content(content: str, chunk_size: int) -> List[str]:
    """Split content at natural boundaries to preserve context"""
    chunks = []
    current_pos = 0

    while current_pos < len(content):
        end_pos = min(current_pos + chunk_size, len(content))

        # If not at the end of the content, look for a better boundary
        if end_pos < len(content):
            # Natural breakpoints in order of preference
            breakpoints = [
                content.rfind('\n\n# ', current_pos, end_pos),   # Markdown headers
                content.rfind('\n\n## ', current_pos, end_pos),  # Subheaders
                content.rfind('\n\n', current_pos, end_pos),     # Paragraph breaks
                content.rfind('. ', current_pos, end_pos),       # Sentence ends
                content.rfind(' ', current_pos, end_pos),        # Word boundaries
            ]

            # Use the most preferred breakpoint that keeps the chunk reasonably full
            for candidate in breakpoints:
                if candidate > current_pos + chunk_size * 0.8:  # At least 80% of the chunk
                    end_pos = candidate + 1
                    break

        chunk = content[current_pos:end_pos].strip()
        if chunk:
            chunks.append(chunk)

        current_pos = end_pos

    return chunks
```
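A small self-contained check (the synthetic markdown document is only for illustration) confirms that splits land on header boundaries rather than mid-paragraph:

```python
import asyncio

async def demo() -> None:
    # Synthetic markdown document: 40 sections of ~2.4K chars each.
    doc_text = "\n\n".join(f"# Section {i}\n" + ("lorem ipsum " * 200) for i in range(40))

    chunks = await semantic_chunk_content(doc_text, chunk_size=20_000)

    print(f"{len(chunks)} chunks")
    # Every chunk after the first should begin at a section header.
    print(all(chunk.startswith("# Section") for chunk in chunks[1:]))

asyncio.run(demo())
```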
### 4. Hierarchical Summary Strategy
Instead of flat chunking, use a hierarchical approach for very large documents:
```python
async def hierarchical_summarize(content: str, question: str, target_model: str) -> str:
    """Multi-level hierarchical summarization for optimal preservation"""
    # Level 1: Semantic sections (major topics)
    sections = await extract_semantic_sections(content)

    # Level 2: Chunk large sections
    section_summaries = []
    for section in sections:
        if len(section) > CHUNK_SIZE:
            summary = await adaptive_limit_content(section, question, target_model)
        else:
            summary = section
        section_summaries.append(summary)

    # Level 3: Combine and final summarization
    combined = "\n\n".join(section_summaries)

    if target_model == "anthropic":
        final_limit = 80000
    else:
        final_limit = 400000

    if len(combined) <= final_limit:
        return combined

    return await _optimized_direct_summarize(combined, question, final_limit, target_model)

async def extract_semantic_sections(content: str) -> List[str]:
    """Extract logical sections based on content structure"""
    # For code: split by classes, functions, modules
    if '```' in content or 'def ' in content or 'class ' in content:
        return extract_code_sections(content)

    # For markdown: split by headers
    if content.count('\n# ') > 2 or content.count('\n## ') > 5:
        return extract_markdown_sections(content)

    # For academic papers: split by standard sections
    if any(marker in content.lower() for marker in ['abstract', 'methodology', 'results', 'conclusion']):
        return extract_paper_sections(content)

    # Default: paragraph-based chunking
    return content.split('\n\n')
```
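The `extract_code_sections`, `extract_markdown_sections`, and `extract_paper_sections` helpers are referenced but not shown. As an illustration only, a minimal markdown variant might look like the following (the header-based splitting rule is an assumption, not the existing implementation):

```python
import re
from typing import List

def extract_markdown_sections(content: str) -> List[str]:
    """Illustrative sketch: split markdown on top-level and second-level headers.

    This is an assumed implementation, not the existing helper.
    """
    # Split immediately before lines that start with '# ' or '## '.
    parts = re.split(r'\n(?=#{1,2} )', content)
    return [part.strip() for part in parts if part.strip()]
```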
### 5. Context-Preserving Overlap Strategy
Improve the current 500-character overlap with intelligent context preservation:
```python
import re
from typing import List

def calculate_intelligent_overlap(prev_chunk: str, next_chunk: str, base_overlap: int = 1000) -> str:
    """Calculate contextually relevant overlap between chunks"""
    # Extract key entities and concepts from the boundary region
    boundary_text = prev_chunk[-base_overlap:] + next_chunk[:base_overlap]

    # Identify important elements to preserve
    key_elements = extract_key_elements(boundary_text)

    # Create overlap that maintains context
    overlap_content = []

    # Include complete sentences around the boundary
    prev_sentences = prev_chunk.split('. ')[-3:]  # Last 3 sentences
    next_sentences = next_chunk.split('. ')[:3]   # First 3 sentences
    overlap_content.extend(prev_sentences)
    overlap_content.extend(next_sentences)

    # Add key entity references
    overlap_content.extend(key_elements)

    return '. '.join(overlap_content)

def extract_key_elements(text: str) -> List[str]:
    """Extract key entities, dates, numbers, and concepts"""
    elements = []

    # Extract dates (YYYY-MM-DD, Month Year, etc.)
    date_patterns = [
        r'\b\d{4}-\d{2}-\d{2}\b',
        r'\b\w+ \d{4}\b',
        r'\b\d{1,2}/\d{1,2}/\d{4}\b',
    ]

    # Extract numbers, percentages, and currency amounts
    number_patterns = [
        r'\b\d+\.?\d*%',
        r'\$\d+\.?\d*[MBK]?\b',
        r'\b\d+\.?\d*[MBK]?\b',
    ]

    # Extract capitalized entities (names, places, organizations)
    entity_pattern = r'\b[A-Z][a-z]+ [A-Z][a-z]+\b'

    for pattern in date_patterns + number_patterns + [entity_pattern]:
        elements.extend(re.findall(pattern, text))

    return list(set(elements))
```
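For example, running the extractor on a short boundary snippet (illustrative input only):

```python
sample = "Acme Corp raised $12M in March 2024, growing revenue 35% year over year."
print(sorted(extract_key_elements(sample)))
# ['$12M', '12M', '2024', '35', '35%', 'Acme Corp', 'March 2024']
```

The bare-number pattern is deliberately broad, so expect some noise (e.g. `12M` alongside `$12M`).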
### 6. Token-Aware Processing
Replace character-based limits with actual token counting:
```python
from typing import List

from models.gemini_utils import count_gemini_tokens

async def token_aware_chunking(content: str, target_tokens_per_chunk: int) -> List[str]:
    """Split content based on actual token counts rather than characters"""
    chunks = []
    current_chunk = ""
    sentences = content.split('. ')

    for sentence in sentences:
        test_chunk = current_chunk + sentence + '. '
        if count_gemini_tokens(test_chunk) <= target_tokens_per_chunk:
            current_chunk = test_chunk
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = sentence + '. '

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks
```
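Re-tokenizing the accumulated chunk on every sentence makes the loop quadratic in chunk length. A variant that tokenizes each sentence once and accumulates the counts avoids this; a sketch, assuming per-sentence token counts are roughly additive and that `count_gemini_tokens` behaves as above:

```python
from typing import List

from models.gemini_utils import count_gemini_tokens

def token_aware_chunking_fast(content: str, target_tokens_per_chunk: int) -> List[str]:
    """Sketch: count tokens per sentence once, then accumulate the counts."""
    chunks: List[str] = []
    current_chunk = ""
    current_tokens = 0

    for sentence in content.split('. '):
        piece = sentence + '. '
        piece_tokens = count_gemini_tokens(piece)
        # Close the current chunk before it would exceed the target budget.
        if current_chunk and current_tokens + piece_tokens > target_tokens_per_chunk:
            chunks.append(current_chunk.strip())
            current_chunk, current_tokens = "", 0
        current_chunk += piece
        current_tokens += piece_tokens

    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks
```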
## Optimized Strategy Implementation

### For Gemini Pipeline (Flash → Pro)
```python
# Optimized for the 1M-token Gemini context
GEMINI_STRATEGY = {
    "input_limit": 4000000,              # ~1M tokens (~4M chars)
    "chunk_size": 800000,                # 800K chars per chunk
    "min_summary_tokens": 12000,         # 12K tokens minimum per chunk
    "max_summary_tokens": 65535,         # 64K tokens maximum
    "overlap_size": 2000,                # 2K char overlap for context
    "final_compression_target": 400000,  # 400K chars for the Pro context
}

async def optimize_for_gemini(content: str, question: str) -> str:
    """Optimized processing for the Gemini pipeline"""
    if len(content) <= 400000:    # Fits in the final context
        return content

    if len(content) <= 4000000:   # Single pass with Gemini Flash
        return await direct_summarize_gemini(content, question, 400000)

    # Multi-chunk strategy optimized for Gemini
    chunks = await semantic_chunk_content(content, 800000)

    # Each chunk gets a generous token allocation
    tokens_per_chunk = min(65535, 400000 // len(chunks))

    if tokens_per_chunk < 12000:      # Reduce chunk count if summaries would be too small
        max_chunks = 400000 // 12000  # ~33 chunks max
        chunks = await semantic_chunk_content(content, len(content) // max_chunks)
        tokens_per_chunk = 12000

    # Process chunks with the generous token allocation
    summaries = await process_chunks_parallel(chunks, question, tokens_per_chunk)
    combined = "\n\n".join(summaries)

    # Final compression if needed
    if len(combined) > 400000:
        return await direct_summarize_gemini(combined, question, 400000)

    return combined
```
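`process_chunks_parallel` is referenced but not shown here. A minimal sketch using `asyncio.gather`, assuming a hypothetical per-chunk summarizer `summarize_chunk(chunk, question, token_limit)` (e.g. a single Gemini Flash call with `tokens_per_chunk` as its max output tokens):

```python
import asyncio
from typing import List

async def process_chunks_parallel(chunks: List[str], question: str, tokens_per_chunk: int) -> List[str]:
    """Illustrative sketch: summarize all chunks concurrently.

    summarize_chunk is a hypothetical per-chunk summarizer, not an existing helper.
    """
    # Cap concurrency so we do not flood the model API.
    semaphore = asyncio.Semaphore(8)

    async def _one(chunk: str) -> str:
        async with semaphore:
            return await summarize_chunk(chunk, question, token_limit=tokens_per_chunk)

    return await asyncio.gather(*(_one(chunk) for chunk in chunks))
```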
### For Anthropic Pipeline (Flash → Claude)
```python
# Optimized for the 200K-token Claude context
ANTHROPIC_STRATEGY = {
    "input_limit": 4000000,             # Still use Gemini Flash for processing
    "chunk_size": 200000,               # Smaller chunks for more aggressive compression
    "min_summary_tokens": 2000,         # 2K tokens minimum per chunk
    "max_summary_tokens": 65535,        # 64K tokens maximum
    "overlap_size": 1000,               # 1K char overlap
    "final_compression_target": 80000,  # 80K chars for the Claude context
}

async def optimize_for_anthropic(content: str, question: str) -> str:
    """Optimized processing for the Anthropic pipeline"""
    if len(content) <= 80000:     # Fits in the final context
        return content

    if len(content) <= 4000000:   # Single pass with aggressive compression
        return await direct_summarize_aggressive(content, question, 80000)

    # Multi-chunk strategy optimized for Claude's smaller context
    chunks = await semantic_chunk_content(content, 200000)

    # More conservative token allocation
    tokens_per_chunk = min(65535, 80000 // len(chunks))

    if tokens_per_chunk < 2000:     # Reduce chunk count if summaries would be too small
        max_chunks = 80000 // 2000  # ~40 chunks max
        chunks = await semantic_chunk_content(content, len(content) // max_chunks)
        tokens_per_chunk = 2000

    # Process with a focus on essential information
    summaries = await process_chunks_focused(chunks, question, tokens_per_chunk)
    combined = "\n\n".join(summaries)

    # Aggressive final compression for Claude
    if len(combined) > 80000:
        return await direct_summarize_aggressive(combined, question, 80000)

    return combined
```
## Expected Performance Improvements

### Information Preservation Gains
| Document Size | Current Preservation | Optimized Gemini | Optimized Anthropic |
|---|---|---|---|
| 1M chars | 25.6% | 40-50% | 8-12% |
| 5M chars | 5.12% | 15-20% | 1.6-2.4% |
| 10M chars | 2.56% | 8-12% | 0.8-1.2% |
| 20M chars | 1.28% | 4-6% | 0.4-0.6% |
### Processing Efficiency Gains
- Reduced chunk count: 40-60% fewer chunks needed for Gemini pipeline
- Better token utilization: Each chunk gets meaningful summary allocation
- Semantic coherence: Natural boundaries preserve context better
- Parallel processing: Larger chunks process more efficiently
### Context Quality Improvements
- Relationship preservation: Semantic chunking maintains logical connections
- Reference integrity: Intelligent overlap preserves citations and cross-references
- Technical detail retention: Larger summaries can include code examples and specifics
- Hierarchical organization: Multi-level approach preserves document structure
## Implementation Recommendations
- Phase 1: Update constants and implement model-aware chunking
- Phase 2: Add semantic boundary detection for common document types
- Phase 3: Implement hierarchical summarization for very large documents
- Phase 4: Add token-aware processing to replace character-based limits
This optimized strategy should significantly improve information preservation while staying within model limits and maximizing the new 64K output capacity.