Batch Extraction Feature

Overview

The batch extraction feature provides high-performance document processing capabilities for extracting structured data from large collections of files in Google Cloud Storage (GCS). This feature bypasses the HTTP API layer and directly calls backend functions, enabling longer processing timeouts and better performance.

Key Benefits

1. Extended Timeouts

  • UI Context: 60 seconds (standard web interface)
  • API Context: 5 minutes (batch extraction)
  • Email Context: 5 minutes (background processing)

The batch extraction uses the “api” context, providing 5 minutes per file instead of the 60-second UI timeout. This is critical for processing large PDFs or complex documents.

2. Direct Backend Integration

Instead of going through Flask/FastAPI HTTP endpoints, the batch extraction directly imports and calls backend functions:

  • Eliminates HTTP overhead
  • Reduces serialization/deserialization cycles
  • Avoids Flask response tuple formatting issues
  • Direct access to backend error handling
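
A minimal sketch of what the direct path looks like (the module path and call signature follow the Direct Backend Usage example in the API Reference below; the assistant ID is illustrative):

import sys
sys.path.insert(0, "/path/to/backend")  # make the backend package importable

from assistant_utils import process_assistant_request

async def extract_one(file_uri: str) -> dict:
    # Awaited in-process: no HTTP hop, no JSON (de)serialization round-trip,
    # and no Flask response-tuple formatting to unwrap.
    return await process_assistant_request(
        assistant_id="my-assistant",
        user_input=f"Extract data from {file_uri}",
        context="api",  # selects the 5-minute timeout
        tools=["file-browser", "structured_extraction"],
    )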

3. Parallel Processing

Files are processed in configurable batches (default: 5 files simultaneously):

  • Dramatically reduces total processing time
  • Maintains system stability with controlled concurrency
  • Progress tracking for each batch
  • Automatic retry on failures
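
A sketch of how this batched concurrency might look (process_file is a placeholder for the per-file extraction call, and the retry step is simplified to recording the failure):

import asyncio

async def process_file(file_uri: str) -> dict:
    ...  # placeholder: per-file extraction via the direct backend call

async def process_in_batches(files: list[str], batch_size: int = 5) -> list[dict]:
    results: list[dict] = []
    for i in range(0, len(files), batch_size):
        batch = files[i:i + batch_size]
        # Run one batch concurrently; return_exceptions=True keeps a single
        # failure from aborting the rest of the batch.
        outcomes = await asyncio.gather(
            *(process_file(f) for f in batch), return_exceptions=True
        )
        for f, outcome in zip(batch, outcomes):
            if isinstance(outcome, Exception):
                results.append({"filename": f, "success": False, "error": str(outcome)})
            else:
                results.append({"filename": f, "success": True, "data": outcome})
        print(f"Completed batch {i // batch_size + 1}: {len(results)}/{len(files)} files")
    return results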

4. Structured Data Export

Results are automatically exported in multiple formats:

  • JSON: Complete results with all metadata
  • CSV: Main data table with document information
  • Obligations CSV: Normalized table for contractual obligations with document ID references
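
For illustration, a sketch of the export step (the field names mirror the sample outputs in the Output Format section below; the shape of the per-file result dict is an assumption):

import csv
import json

def save_results(results: list[dict], output_dir: str) -> None:
    # Complete results with all metadata
    with open(f"{output_dir}/results.json", "w") as f:
        json.dump(results, f, indent=2)

    # Flat summary table, one row per file
    with open(f"{output_dir}/summary.csv", "w", newline="") as f:
        writer = csv.DictWriter(
            f,
            fieldnames=["filename", "success", "structured_fields", "answer_length", "error"],
            extrasaction="ignore",
        )
        writer.writeheader()
        writer.writerows(results)

    # Obligations normalized to one row per obligation, keyed by filename so
    # rows can be joined back to the per-document tables.
    with open(f"{output_dir}/obligations.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "obligation_number", "description"])
        for r in results:
            for n, ob in enumerate(r.get("obligations", []), start=1):
                writer.writerow([r["filename"], n, ob.get("description", "")])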

Architecture

Call Flow Comparison

Standard HTTP API Flow (60-second timeout):

CLI → HTTP Client → Flask/FastAPI → Backend → AI Model
 ↑                                               ↓
 └── JSON Response ← HTTP Response ← Result ←────┘

Direct Backend Flow (5-minute timeout):

CLI → Backend Functions → AI Model
 ↑                           ↓
 └── Direct Response ←───────┘

Context Parameter Propagation

The context parameter flows through the entire call chain:

  1. Entry Point: batch_extract.py
    # Pre-initialize GenAI client with API context
    from models.genai_client import genai_client
    genai_client(location="global", force_recreate=True, context="api")
       
    # Then call with API context
    result = await process_assistant_request(
        context="api"  # Ensures 5-minute timeout
    )
    
  2. Assistant Utils: assistant_utils.py
    async def process_assistant_request(..., context="ui"):
        # Passes context to vac_stream
    
  3. VAC Service: vac_service.py
    request_context = kwargs.get("context", "ui")
    # Creates appropriate GenAI client based on context
    
  4. GenAI Client Configuration: genai_client.py
    if context == "api":
        timeout_ms = 300000  # 5 minutes
    else:
        timeout_ms = 60000   # 1 minute
    

Installation

# Install using uv tool (recommended)
uv tool install ./aitana --force

# Or install with pip
pip install ./aitana

Usage

Command Line Interface

# Basic usage with any assistant
aitana batch-extract <assistant-id> <bucket-url> \
    --prompt "Your extraction prompt" \
    [options]

Examples

1. Contract Extraction

aitana batch-extract contract-assistant gs://legal-docs/contracts/ \
    --prompt "Extract all obligations, parties, and key dates" \
    --schema-name contract_extraction_schema \
    --max-files 50 \
    --batch-size 10 \
    --output-dir results/contracts/

2. Invoice Processing

aitana batch-extract invoice-processor gs://finance/invoices/2024/ \
    --prompt "Extract invoice number, amount, date, and line items" \
    --schema-name invoice_schema \
    --file-pattern ".*\.pdf$" \
    --batch-size 5

3. Document Analysis

aitana batch-extract document-analyzer gs://research/papers/ \
    --prompt "Summarize key findings and methodologies" \
    --tools "file-browser,structured_extraction" \
    --max-files 100 \
    --output-dir analysis/

Output Format

Directory Structure

batch_results_20240910_143022/
├── results.json          # Complete results with all data
├── summary.csv           # Summary of all processed files
├── failed.csv            # Details of failed files (if any)
├── structured_data.csv   # Flattened structured extraction data
└── obligations.csv       # Normalized obligations table

Summary CSV Format

filename,success,structured_fields,answer_length,error
contract1.pdf,true,12,4500,
contract2.pdf,true,15,5200,
contract3.pdf,false,0,0,"Timeout exceeded"

Structured Data CSV Format

filename,file_uri,processed_at,party_1,party_2,effective_date,contract_type
contract1.pdf,gs://bucket/contract1.pdf,2024-09-10T14:30:22,"ABC Corp","XYZ Ltd","2024-01-01","Service Agreement"
contract2.pdf,gs://bucket/contract2.pdf,2024-09-10T14:30:25,"DEF Inc","GHI LLC","2024-02-01","Purchase Agreement"

Obligations CSV Format

filename,obligation_source,obligation_number,description,responsible_party,due_date,penalty
contract1.pdf,obligations,1,"Deliver goods","ABC Corp","2024-03-01","$1000/day"
contract1.pdf,obligations,2,"Payment terms","XYZ Ltd","2024-03-15","2% interest"
contract2.pdf,obligations,1,"Service level","DEF Inc","Ongoing","Service credits"

Failed CSV Format (if failures occur)

filename,file_uri,error,processed_at
large_document.pdf,gs://bucket/large_document.pdf,"DEADLINE_EXCEEDED: Request timed out",2024-09-10T14:35:00
corrupted.pdf,gs://bucket/corrupted.pdf,"Invalid PDF format",2024-09-10T14:35:05
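
Because the obligations table carries the filename as a document reference, it can be joined back to the per-document data. For example, assuming pandas is installed (column names follow the sample CSVs above):

import pandas as pd

docs = pd.read_csv("batch_results_20240910_143022/structured_data.csv")
obligations = pd.read_csv("batch_results_20240910_143022/obligations.csv")

# One row per obligation, enriched with its document's parties and dates
joined = obligations.merge(
    docs[["filename", "party_1", "party_2", "effective_date"]], on="filename"
)
print(joined[["filename", "description", "responsible_party", "due_date"]].head())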

Configuration

Authentication Requirements

Important: User authentication is mandatory. Use one of these methods:

# Method 1: Use aitana auth (recommended)
aitana auth

# Method 2: Provide email via command line
aitana batch-extract --user-email your-email@example.com ...

There are no default fallback emails; authentication must be explicit.

Backend Setup Requirements

The batch extraction requires the backend to be properly configured:

# Ensure backend dependencies are installed
cd /Users/mark/dev/aitana-labs/frontend/backend
uv sync
source .venv/bin/activate

Environment Variables

The batch extraction automatically sets default environment variables:

  • GOOGLE_CLOUD_PROJECT: “aitana-multivac-dev”
  • GOOGLE_CLOUD_LOCATION: “global”

These can be overridden if needed.
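
For example, to override both for a single invocation (the project and location values here are illustrative):

# Per-invocation override of the default environment variables
GOOGLE_CLOUD_PROJECT=my-project GOOGLE_CLOUD_LOCATION=us-central1 \
    aitana batch-extract ...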

Tool Configuration

Default tools for batch extraction:

  • file-browser: For accessing GCS files
  • structured_extraction: For extracting structured data

Tool configurations can be customized:

--tools "file-browser,structured_extraction" \
--schema-name "custom_schema"
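
In the Python API (see the API Reference below), the same selection is expressed as a tools list plus a tool_configs mapping; the schema name here is illustrative:

# Python API equivalent of the CLI flags above
tools = ["file-browser", "structured_extraction"]
tool_configs = {"structured_extraction": {"schema_name": "custom_schema"}}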

Performance Optimization

Batch Size Selection

Choose batch size based on:

  • Document complexity: Complex PDFs need smaller batches
  • System resources: Higher concurrency needs more memory
  • API limits: Some assistants have rate limits

Recommended batch sizes:

  • Simple text files: 10-20
  • Standard PDFs: 5-10
  • Complex documents: 2-5

File Filtering

Use regex patterns to process only relevant files:

--file-pattern ".*2024.*\.pdf$"  # Only 2024 PDFs
--file-pattern "^contracts/.*"   # Only contracts folder
--file-pattern ".*\.(pdf|docx)$" # Only PDFs and Word docs

Maximum Files

Limit processing for testing or batching:

--max-files 10  # Process first 10 files for testing
--max-files 100 # Process a large bucket in runs of 100 files

Troubleshooting

Common Issues

1. Backend Import Errors

Failed to import backend modules: No module named 'assistant_utils'

Solution: Ensure backend is set up:

cd backend && uv sync

2. Timeout Issues

Error: DEADLINE_EXCEEDED

Solution: Even the 5-minute API timeout was not enough for the file. Consider:

  • Reducing document size
  • Simplifying extraction prompt
  • Using smaller batch size

3. No Files Found

No files found in bucket

Solution: Check:

  • Bucket URL format: gs://bucket/path/
  • GCS permissions
  • File pattern if using --file-pattern

4. Schema Not Found

Schema 'contract_extraction_schema' not found

Solution: Verify schema name with:

aitana inspect <assistant-id>

Comparison with Other Methods

Batch Extract vs Process Files

Feature               batch-extract      process files
Timeout               5 minutes          60 seconds
Method                Direct backend     HTTP API
Performance           Faster             Standard
CSV Export            Multiple formats   JSON only
Summary CSV           Yes                No
Failed Files CSV      Yes                No
Structured Data CSV   Yes                No
Obligations Table     Yes                No
Error Recovery        Better             Standard
Authentication        Required           Optional
Context Override      API (5 min)        UI (60 sec)

When to Use Each

Use batch-extract when:

  • Processing large PDFs (>10MB)
  • Extracting complex structured data
  • Need CSV export with normalization
  • Processing >50 files
  • Documents take >30 seconds each

Use process files when:

  • Simple text extraction
  • Small files (<1MB)
  • Need HTTP API features
  • Integration with other tools

Future Enhancements

Planned Features

  1. Resume capability: Continue from last processed file
  2. Incremental saves: Save results after each batch
  3. Custom schemas: Upload Pydantic schemas
  4. Multiple output formats: Excel, Parquet, BigQuery
  5. Progress webhooks: Send progress updates to external systems

Performance Improvements

  1. Adaptive batch sizing: Automatically adjust based on success rate
  2. Smart retries: Exponential backoff with jitter
  3. Caching: Skip already processed files
  4. Compression: Automatic result compression for large datasets

API Reference

Python API

from aitana.processors.batch_extract import BatchExtractProcessor

# Initialize processor
processor = BatchExtractProcessor(
    assistant_id="my-assistant",
    prompt="Extract key information",
    batch_size=5,
    verbose=True
)

# Process files
results = await processor.process_batch(
    bucket_url="gs://my-bucket/",
    tools=["file-browser", "structured_extraction"],
    tool_configs={"structured_extraction": {"schema_name": "my_schema"}},
    max_files=100,
    file_pattern=r".*\.pdf$"  # raw string avoids invalid-escape warnings
)

# Save results
saved_files = processor.save_results("output_dir/")
processor.print_summary()

Direct Backend Usage

# For custom integrations
import sys
sys.path.insert(0, "/path/to/backend")

from assistant_utils import process_assistant_request

# Note: await requires a running event loop (e.g. inside an async function)
result = await process_assistant_request(
    assistant_id="my-assistant",
    user_input="Extract data from document",
    context="api",  # Critical for 5-minute timeout
    tools=["file-browser", "structured_extraction"],
    toolConfigs={...}
)