Chunking Strategy Optimizations for 1M Input / 64K Output

This document proposes optimizations to the current chunking strategy to maximize information preservation while staying within model limits.

Current Strategy Analysis

Current Constants

LIMIT_CONTENT_MAX = 1000000        # 1M chars max for direct summarization
CHUNK_SIZE = 200000                # 200K chars per chunk
RECURSIVE_THRESHOLD = 400000       # When to start chunking
MODEL_TOKEN_OUTPUT_LIMIT = 65535   # 64K tokens output limit

Current Issues

  1. Conservative chunk sizing: 200K chunks underutilize Gemini’s 1M input capacity
  2. Fixed chunk allocation: token_limit // num_chunks can create very small summaries for large documents (see the worked example after this list)
  3. No model-specific optimization: Same strategy for Anthropic (200K) and Gemini (1M) contexts
  4. Character-based chunking: Doesn’t account for token density variations
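
To make issue 2 concrete, here is a quick calculation with the current constants, assuming the full 64K output limit is the token_limit being divided:

# Current behavior: the output budget is split evenly across fixed 200K-char chunks
content_chars = 10_000_000                                   # a 10M-char document
num_chunks = content_chars // CHUNK_SIZE                     # 10M / 200K = 50 chunks
tokens_per_chunk = MODEL_TOKEN_OUTPUT_LIMIT // num_chunks    # 65535 // 50 ≈ 1310 tokens
# Each 200K-char chunk is squeezed into ~1.3K tokens (~5K chars), roughly 2.6% of the chunk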

Proposed Optimizations

1. Model-Aware Adaptive Chunking

# New optimized constants
GEMINI_INPUT_LIMIT = 1000000      # ~1M tokens (~4M chars)
ANTHROPIC_INPUT_LIMIT = 200000    # ~200K tokens (~800K chars)
GEMINI_CHUNK_SIZE = 800000        # 800K chars per chunk for Gemini
ANTHROPIC_CHUNK_SIZE = 160000     # 160K chars per chunk for Anthropic
MIN_SUMMARY_SIZE = 8000           # Minimum viable summary size (tokens)
MAX_SUMMARY_SIZE = 65535          # Maximum output tokens per summary

async def adaptive_limit_content(
    content_str: str,
    question: str,
    target_model: str = "gemini",  # "gemini" or "anthropic"
    final_token_limit: int | None = None,
    **kwargs
) -> str:
    """Model-aware content limiting with optimized chunking (limits below are character counts, ~4 chars per token)"""
    
    # Set model-specific parameters
    if target_model == "anthropic":
        input_limit = ANTHROPIC_INPUT_LIMIT * 4  # ~800K chars
        chunk_size = ANTHROPIC_CHUNK_SIZE
        final_limit = final_token_limit or 80000  # ~80K chars for the Claude context
    else:  # gemini
        input_limit = GEMINI_INPUT_LIMIT * 4  # ~4M chars  
        chunk_size = GEMINI_CHUNK_SIZE
        final_limit = final_token_limit or 400000  # ~400K chars for the Gemini Pro context
    
    # Strategy selection based on content size
    if len(content_str) <= final_limit:
        return content_str  # No processing needed
    
    if len(content_str) <= input_limit:
        return await _optimized_direct_summarize(content_str, question, final_limit, target_model)
    
    return await _optimized_recursive_summarize(content_str, question, final_limit, chunk_size, target_model)
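
A brief call-site sketch (inside an async caller; retrieved_docs and the question are placeholders):

# Hypothetical call site: compress gathered research material before the final answer pass
compressed = await adaptive_limit_content(
    content_str=retrieved_docs,        # placeholder for the raw collected content
    question="What changed in the Q3 architecture review?",
    target_model="anthropic",          # route the condensed context toward the 200K Claude window
)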

2. Optimized Chunk Size Calculation

Instead of a fixed token_limit // num_chunks split, size chunks and summary budgets dynamically so each chunk summary stays above a minimum useful size:

import math

def calculate_optimal_chunks(content_size: int, target_model: str, final_limit: int) -> tuple:
    """Calculate optimal chunk configuration: (num_chunks, chunk size in chars, summary tokens per chunk)"""
    
    if target_model == "anthropic":
        max_chunk_size = ANTHROPIC_CHUNK_SIZE
    else:
        max_chunk_size = GEMINI_CHUNK_SIZE
    
    # Calculate number of chunks needed
    num_chunks = math.ceil(content_size / max_chunk_size)
    
    # Split the final token budget evenly across chunk summaries
    tokens_per_chunk = final_limit // num_chunks
    
    # If summaries would be too small, reduce chunk count so each summary stays meaningful
    if tokens_per_chunk < MIN_SUMMARY_SIZE:
        max_chunks = max(1, final_limit // MIN_SUMMARY_SIZE)
        num_chunks = min(num_chunks, max_chunks)
        tokens_per_chunk = final_limit // num_chunks
    
    # Cap each summary at the model's output limit
    tokens_per_chunk = min(tokens_per_chunk, MAX_SUMMARY_SIZE)
    
    # Recalculate actual chunk size to spread content evenly across the chunks
    actual_chunk_size = math.ceil(content_size / num_chunks)
    
    return num_chunks, actual_chunk_size, tokens_per_chunk

# Example results (final summary budgets of ~64K tokens for Gemini, ~80K for Anthropic):
# 5M chars, Gemini → 7 chunks, ~714K chars each, ~9.1K tokens each ≈ 64K tokens total
# 5M chars, Anthropic → 32 chunks, ~156K chars each, 2.5K tokens each = 80K tokens total
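
A hypothetical driver showing how the returned tuple would feed the rest of the pipeline (summarize_chunk is a placeholder for whichever per-chunk summarizer is used; semantic_chunk_content is defined in the next section):

# Inside an async caller
num_chunks, chunk_size, tokens_per_chunk = calculate_optimal_chunks(
    content_size=len(content), target_model="gemini", final_limit=64000
)
chunks = await semantic_chunk_content(content, chunk_size)
summaries = [await summarize_chunk(c, question, tokens_per_chunk) for c in chunks]  # hypothetical helper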

3. Smart Semantic Chunking

Replace arbitrary character boundaries with intelligent content-aware splitting:

from typing import List

async def semantic_chunk_content(content: str, chunk_size: int) -> List[str]:
    """Split content at natural boundaries to preserve context"""
    
    chunks = []
    current_pos = 0
    
    while current_pos < len(content):
        end_pos = min(current_pos + chunk_size, len(content))
        
        # If not at end of content, find better boundary
        if end_pos < len(content):
            # Look for natural breakpoints in order of preference
            breakpoints = [
                content.rfind('\n\n# ', current_pos, end_pos),      # Markdown headers
                content.rfind('\n\n## ', current_pos, end_pos),     # Subheaders  
                content.rfind('\n\n', current_pos, end_pos),        # Paragraph breaks
                content.rfind('. ', current_pos, end_pos),          # Sentence ends
                content.rfind(' ', current_pos, end_pos),           # Word boundaries
            ]
            
            # Use the best available breakpoint; rfind returns -1 when a pattern is absent,
            # and the 80% threshold below filters those misses out automatically
            for bp in breakpoints:  # avoid shadowing the built-in breakpoint()
                if bp > current_pos + chunk_size * 0.8:  # keep each chunk at least 80% full
                    end_pos = bp + 1
                    break
        
        chunk = content[current_pos:end_pos].strip()
        if chunk:
            chunks.append(chunk)
        
        current_pos = end_pos
    
    return chunks
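
A minimal self-check of the splitter on a synthetic markdown document (illustrative only; real inputs will split less cleanly):

import asyncio

async def _demo_semantic_chunking():
    # 40 markdown sections of ~18K chars each, so section breaks fall inside the 80% window
    doc = "\n\n".join("# Section %d\n" % i + "Lorem ipsum dolor sit amet. " * 640 for i in range(40))
    chunks = await semantic_chunk_content(doc, chunk_size=20000)
    print(len(chunks), max(len(c) for c in chunks))          # 40 chunks, each under 20K chars
    print(all(c.startswith("# Section") for c in chunks))    # every split landed on a section header

asyncio.run(_demo_semantic_chunking())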

4. Hierarchical Summary Strategy

Instead of flat chunking, use a hierarchical approach for very large documents:

async def hierarchical_summarize(content: str, question: str, target_model: str) -> str:
    """Multi-level hierarchical summarization for optimal preservation"""
    
    # Level 1: Semantic sections (major topics)
    sections = await extract_semantic_sections(content)
    
    # Level 2: Chunk large sections  
    section_summaries = []
    for section in sections:
        if len(section) > CHUNK_SIZE:  # CHUNK_SIZE: the current 200K-char constant
            summary = await adaptive_limit_content(section, question, target_model)
        else:
            summary = section
        section_summaries.append(summary)
    
    # Level 3: Combine and final summarization
    combined = "\n\n".join(section_summaries)
    
    # Character budgets for the final context: ~80K chars for Claude, ~400K chars for Gemini Pro
    if target_model == "anthropic":
        final_limit = 80000
    else:
        final_limit = 400000
        
    if len(combined) <= final_limit:
        return combined
    
    return await _optimized_direct_summarize(combined, question, final_limit, target_model)

async def extract_semantic_sections(content: str) -> List[str]:
    """Extract logical sections based on content structure"""
    
    # For code: split by classes, functions, modules
    if '```' in content or 'def ' in content or 'class ' in content:
        return extract_code_sections(content)
    
    # For markdown: split by headers
    if content.count('\n# ') > 2 or content.count('\n## ') > 5:
        return extract_markdown_sections(content)
    
    # For academic papers: split by standard sections  
    if any(marker in content.lower() for marker in ['abstract', 'methodology', 'results', 'conclusion']):
        return extract_paper_sections(content)
    
    # Default: paragraph-based chunking
    return content.split('\n\n')
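
The three extractors referenced above are left undefined in this proposal; a minimal sketch of the markdown variant, assuming top-level '#' and '##' headers mark section boundaries, could look like this:

import re
from typing import List

def extract_markdown_sections(content: str) -> List[str]:
    """Split markdown at '#'/'##' headers so each header stays attached to its body (sketch)."""
    parts = re.split(r'\n(?=#{1,2} )', content)   # split just before header lines, keeping them
    return [p.strip() for p in parts if p.strip()]

The code and paper variants would follow the same pattern with different boundary markers (e.g. top-level def/class lines, or 'Abstract'/'Methodology'/'Results' headings).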

5. Context-Preserving Overlap Strategy

Improve the current 500-character overlap with intelligent context preservation:

import re
from typing import List

def calculate_intelligent_overlap(prev_chunk: str, next_chunk: str, base_overlap: int = 1000) -> str:
    """Calculate contextually relevant overlap between chunks"""
    
    # Extract key entities and concepts from the boundary
    boundary_text = prev_chunk[-base_overlap:] + next_chunk[:base_overlap]
    
    # Identify important elements to preserve
    key_elements = extract_key_elements(boundary_text)
    
    # Create overlap that maintains context
    overlap_content = []
    
    # Include complete sentences around the boundary
    prev_sentences = prev_chunk.split('. ')[-3:]  # Last 3 sentences
    next_sentences = next_chunk.split('. ')[:3]   # First 3 sentences
    
    overlap_content.extend(prev_sentences)
    overlap_content.extend(next_sentences)
    
    # Add key entity references
    overlap_content.extend(key_elements)
    
    return '. '.join(overlap_content)

def extract_key_elements(text: str) -> List[str]:
    """Extract key entities, dates, numbers, and concepts"""
    elements = []
    
    # Extract dates (YYYY-MM-DD, Month Year, etc.)
    date_patterns = [
        r'\b\d{4}-\d{2}-\d{2}\b',
        r'\b\w+ \d{4}\b',
        r'\b\d{1,2}/\d{1,2}/\d{4}\b'
    ]
    
    # Extract numbers, percentages, and currency amounts
    number_patterns = [
        r'\b\d+\.?\d*%',          # no trailing \b: '%' is a non-word char, so \b would require a following word char
        r'\$\d+\.?\d*[MBK]?\b',
        r'\b\d+\.?\d*[MBK]?\b'
    ]
    
    # Extract capitalized entities (names, places, organizations)
    entity_pattern = r'\b[A-Z][a-z]+ [A-Z][a-z]+\b'
    
    for pattern in date_patterns + number_patterns + [entity_pattern]:
        elements.extend(re.findall(pattern, text))
    
    return list(set(elements))
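
An illustrative check of what the boundary extractor keeps (output order varies because a set is used, and the looser patterns also pick up some noise):

sample = "On 2024-03-15, Acme Corp reported revenue of $12.5M, up 18% from March 2023."
print(extract_key_elements(sample))
# includes '2024-03-15', 'March 2023', '$12.5M', '18%', 'Acme Corp',
# plus stray matches such as 'On 2024' and bare numbers from the generic pattern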

6. Token-Aware Processing

Replace character-based limits with actual token counting:

from typing import List

from models.gemini_utils import count_gemini_tokens

async def token_aware_chunking(content: str, target_tokens_per_chunk: int) -> List[str]:
    """Split content based on actual token counts rather than characters.

    Note: count_gemini_tokens is called synchronously here, and recounting the growing chunk
    for every sentence is O(n^2), so at scale counts should be cached or accumulated incrementally.
    """
    
    chunks = []
    current_chunk = ""
    sentences = content.split('. ')  # naive sentence split; an oversized sentence becomes its own chunk
    
    for sentence in sentences:
        test_chunk = current_chunk + sentence + '. '
        
        if count_gemini_tokens(test_chunk) <= target_tokens_per_chunk:
            current_chunk = test_chunk
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = sentence + '. '
    
    if current_chunk:
        chunks.append(current_chunk.strip())
    
    return chunks
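
A minimal usage sketch (wrapped in asyncio since the helper is declared async; count_gemini_tokens comes from the project's gemini_utils, and the input path is hypothetical):

import asyncio

async def _demo_token_chunking():
    with open("large_report.txt") as f:                       # hypothetical input file
        report = f.read()
    chunks = await token_aware_chunking(report, target_tokens_per_chunk=12000)
    for i, chunk in enumerate(chunks):
        print(i, count_gemini_tokens(chunk))                  # counts should stay near the 12K target

asyncio.run(_demo_token_chunking())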

Optimized Strategy Implementation

For Gemini Pipeline (Flash → Pro)

# Optimized for 1M token context
GEMINI_STRATEGY = {
    "input_limit": 4000000,           # ~1M tokens
    "chunk_size": 800000,             # 800K chars per chunk  
    "min_summary_tokens": 12000,      # 12K tokens minimum per chunk
    "max_summary_tokens": 65535,      # 64K tokens maximum
    "overlap_size": 2000,             # 2K char overlap for context
    "final_compression_target": 400000 # 400K chars for Pro context
}

async def optimize_for_gemini(content: str, question: str) -> str:
    """Optimized processing for Gemini pipeline"""
    
    if len(content) <= 400000:  # Fits in final context
        return content
    
    if len(content) <= 4000000:  # Single pass with Gemini Flash
        return await direct_summarize_gemini(content, question, 400000)
    
    # Multi-chunk strategy optimized for Gemini
    chunks = await semantic_chunk_content(content, 800000)
    
    # Each chunk gets generous token allocation
    tokens_per_chunk = min(65535, 400000 // len(chunks))
    
    if tokens_per_chunk < 12000:  # Reduce chunks if summaries would be too small
        max_chunks = 400000 // 12000  # ~33 chunks max
        chunks = await semantic_chunk_content(content, len(content) // max_chunks)
        tokens_per_chunk = 12000
    
    # Process chunks with generous token allocation
    summaries = await process_chunks_parallel(chunks, question, tokens_per_chunk)
    
    combined = "\n\n".join(summaries)
    
    # Final compression if needed
    if len(combined) > 400000:
        return await direct_summarize_gemini(combined, question, 400000)
    
    return combined
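
process_chunks_parallel is referenced but not defined in this proposal; a minimal sketch, assuming the per-chunk summarizer has the same (content, question, token_limit) shape as direct_summarize_gemini and that concurrency should be capped to respect rate limits:

import asyncio

async def process_chunks_parallel(chunks, question, tokens_per_chunk, max_concurrency=8):
    """Summarize chunks concurrently with bounded concurrency (sketch); gather preserves chunk order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def _summarize(chunk):
        async with semaphore:
            return await direct_summarize_gemini(chunk, question, tokens_per_chunk)

    return list(await asyncio.gather(*(_summarize(c) for c in chunks)))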

For Anthropic Pipeline (Flash → Claude)

# Optimized for 200K token context
ANTHROPIC_STRATEGY = {
    "input_limit": 4000000,           # Still use Gemini Flash for processing
    "chunk_size": 200000,             # Smaller chunks for more aggressive compression
    "min_summary_tokens": 2000,       # 2K tokens minimum per chunk
    "max_summary_tokens": 65535,      # 64K tokens maximum  
    "overlap_size": 1000,             # 1K char overlap
    "final_compression_target": 80000  # 80K chars for Claude context
}

async def optimize_for_anthropic(content: str, question: str) -> str:
    """Optimized processing for Anthropic pipeline"""
    
    if len(content) <= 80000:  # Fits in final context
        return content
    
    if len(content) <= 4000000:  # Single pass with aggressive compression
        return await direct_summarize_aggressive(content, question, 80000)
    
    # Multi-chunk strategy optimized for Claude's smaller context
    chunks = await semantic_chunk_content(content, 200000)
    
    # More conservative token allocation
    tokens_per_chunk = min(65535, 80000 // len(chunks))
    
    if tokens_per_chunk < 2000:  # Reduce chunks if summaries would be too small
        max_chunks = 80000 // 2000  # ~40 chunks max
        chunks = await semantic_chunk_content(content, len(content) // max_chunks)
        tokens_per_chunk = 2000
    
    # Process with focus on essential information
    summaries = await process_chunks_focused(chunks, question, tokens_per_chunk)
    
    combined = "\n\n".join(summaries)
    
    # Aggressive final compression for Claude
    if len(combined) > 80000:
        return await direct_summarize_aggressive(combined, question, 80000)
    
    return combined

Expected Performance Improvements

Information Preservation Gains

Document Size   Current Preservation   Optimized Gemini   Optimized Anthropic
1M chars        25.6%                  40-50%             8-12%
5M chars        5.12%                  15-20%             1.6-2.4%
10M chars       2.56%                  8-12%              0.8-1.2%
20M chars       1.28%                  4-6%               0.4-0.6%
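
The "Current Preservation" column follows directly from the fixed output cap: roughly 64K output tokens ≈ 256K chars at the ~4 chars/token heuristic used throughout this document.

# Rough reproduction of the 'Current Preservation' column
PRESERVED_CHARS = 64_000 * 4     # ~256K chars of summary output
for doc_chars in (1_000_000, 5_000_000, 10_000_000, 20_000_000):
    print(f"{doc_chars:,} chars → {100 * PRESERVED_CHARS / doc_chars:.2f}% preserved")
# 25.60%, 5.12%, 2.56%, 1.28%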

Processing Efficiency Gains

  • Reduced chunk count: 40-60% fewer chunks needed for Gemini pipeline
  • Better token utilization: Each chunk gets meaningful summary allocation
  • Semantic coherence: Natural boundaries preserve context better
  • Parallel processing: Larger chunks process more efficiently

Context Quality Improvements

  • Relationship preservation: Semantic chunking maintains logical connections
  • Reference integrity: Intelligent overlap preserves citations and cross-references
  • Technical detail retention: Larger summaries can include code examples and specifics
  • Hierarchical organization: Multi-level approach preserves document structure

Implementation Recommendations

  1. Phase 1: Update constants and implement model-aware chunking
  2. Phase 2: Add semantic boundary detection for common document types
  3. Phase 3: Implement hierarchical summarization for very large documents
  4. Phase 4: Add token-aware processing to replace character-based limits

This optimized strategy should significantly improve information preservation while staying within model limits and maximizing the new 64K output capacity.