Batch Extraction Feature

Overview

The batch extraction feature provides high-performance document processing capabilities for extracting structured data from large collections of files in Google Cloud Storage (GCS). This feature bypasses the HTTP API layer and directly calls backend functions, enabling longer processing timeouts and better performance.

Key Benefits

1. Extended Timeouts

  • UI Context: 60 seconds (standard web interface)
  • API Context: 5 minutes (batch extraction)
  • Email Context: 5 minutes (background processing)

The batch extraction uses the “api” context, providing 5 minutes per file instead of the 60-second UI timeout. This is critical for processing large PDFs or complex documents.

2. Direct Backend Integration

Instead of going through Flask/FastAPI HTTP endpoints, the batch extraction directly imports and calls backend functions:

  • Eliminates HTTP overhead
  • Reduces serialization/deserialization cycles
  • Avoids Flask response tuple formatting issues
  • Direct access to backend error handling
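
A minimal sketch of what the direct path looks like (the module path and call signature follow the Direct Backend Usage example in the API Reference below; the assistant ID is illustrative):

import sys
sys.path.insert(0, "/path/to/backend")  # make the backend package importable

from assistant_utils import process_assistant_request

async def extract_one(file_uri: str) -> dict:
    # Awaited in-process: no HTTP hop, no JSON (de)serialization round-trip,
    # and no Flask response-tuple formatting to unwrap.
    return await process_assistant_request(
        assistant_id="my-assistant",
        user_input=f"Extract data from {file_uri}",
        context="api",  # selects the 5-minute timeout
        tools=["file-browser", "structured_extraction"],
    )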

3. Parallel Processing

Files are processed in configurable batches (default: 5 files simultaneously):

  • Dramatically reduces total processing time
  • Maintains system stability with controlled concurrency
  • Progress tracking for each batch
  • Automatic retry on failures
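
A sketch of how this batched concurrency might look (process_file is a placeholder for the per-file extraction call, and the retry step is simplified to recording the failure):

import asyncio

async def process_file(file_uri: str) -> dict:
    ...  # placeholder: per-file extraction via the direct backend call

async def process_in_batches(files: list[str], batch_size: int = 5) -> list[dict]:
    results: list[dict] = []
    for i in range(0, len(files), batch_size):
        batch = files[i:i + batch_size]
        # Run one batch concurrently; return_exceptions=True keeps a single
        # failure from aborting the rest of the batch.
        outcomes = await asyncio.gather(
            *(process_file(f) for f in batch), return_exceptions=True
        )
        for f, outcome in zip(batch, outcomes):
            if isinstance(outcome, Exception):
                results.append({"filename": f, "success": False, "error": str(outcome)})
            else:
                results.append({"filename": f, "success": True, "data": outcome})
        print(f"Completed batch {i // batch_size + 1}: {len(results)}/{len(files)} files")
    return results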

4. Structured Data Export

Results are automatically exported in multiple formats:

  • JSON: Complete results with all metadata
  • CSV: Main data table with document information
  • Obligations CSV: Normalized table for contractual obligations with document ID references
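
For illustration, a sketch of the export step (the field names mirror the sample outputs in the Output Format section below; the shape of the per-file result dict is an assumption):

import csv
import json

def save_results(results: list[dict], output_dir: str) -> None:
    # Complete results with all metadata
    with open(f"{output_dir}/results.json", "w") as f:
        json.dump(results, f, indent=2)

    # Flat summary table, one row per file
    with open(f"{output_dir}/summary.csv", "w", newline="") as f:
        writer = csv.DictWriter(
            f,
            fieldnames=["filename", "success", "structured_fields", "answer_length", "error"],
            extrasaction="ignore",
        )
        writer.writeheader()
        writer.writerows(results)

    # Obligations normalized to one row per obligation, keyed by filename so
    # rows can be joined back to the per-document tables.
    with open(f"{output_dir}/obligations.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "obligation_number", "description"])
        for r in results:
            for n, ob in enumerate(r.get("obligations", []), start=1):
                writer.writerow([r["filename"], n, ob.get("description", "")])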

Architecture

Call Flow Comparison

Standard HTTP API Flow (60-second timeout):

CLI → HTTP Client → Flask/FastAPI → Backend → AI Model
 ↑                                               ↓
 └── JSON Response ← HTTP Response ← Result ←────┘

Direct Backend Flow (5-minute timeout):

CLI → Backend Functions → AI Model
 ↑                           ↓
 └── Direct Response ←───────┘

Context Parameter Propagation

The context parameter flows through the entire call chain:

  1. Entry Point: batch_extract.py
    # Pre-initialize GenAI client with API context
    from models.genai_client import genai_client
    genai_client(location="global", force_recreate=True, context="api")
       
    # Then call with API context
    result = await process_assistant_request(
        context="api"  # Ensures 5-minute timeout
    )
    
  2. Assistant Utils: assistant_utils.py
    async def process_assistant_request(..., context="ui"):
        # Passes context to vac_stream
    
  3. VAC Service: vac_service.py
    request_context = kwargs.get("context", "ui")
    # Creates appropriate GenAI client based on context
    
  4. GenAI Client Configuration: genai_client.py
    if context == "api":
        timeout_ms = 300000  # 5 minutes
    else:
        timeout_ms = 60000   # 1 minute
    

Installation

# Install using uv tool (recommended)
uv tool install ./aitana --force

# Or install with pip
pip install ./aitana

Usage

Command Line Interface

# Basic usage with any assistant
aitana batch-extract <assistant-id> <bucket-url> \
    --prompt "Your extraction prompt" \
    [options]

Examples

1. Contract Extraction

aitana batch-extract contract-assistant gs://legal-docs/contracts/ \
    --prompt "Extract all obligations, parties, and key dates" \
    --schema-name contract_extraction_schema \
    --max-files 50 \
    --batch-size 10 \
    --output-dir results/contracts/

2. Invoice Processing

aitana batch-extract invoice-processor gs://finance/invoices/2024/ \
    --prompt "Extract invoice number, amount, date, and line items" \
    --schema-name invoice_schema \
    --file-pattern ".*\.pdf$" \
    --batch-size 5

3. Document Analysis

aitana batch-extract document-analyzer gs://research/papers/ \
    --prompt "Summarize key findings and methodologies" \
    --tools "file-browser,structured_extraction" \
    --max-files 100 \
    --output-dir analysis/

Output Format

Directory Structure

batch_results_20240910_143022/
├── results.json          # Complete results with all data
├── summary.csv           # Summary of all processed files
├── failed.csv            # Details of failed files (if any)
├── structured_data.csv   # Flattened structured extraction data
└── obligations.csv       # Normalized obligations table

Summary CSV Format

filename,success,structured_fields,answer_length,error
contract1.pdf,true,12,4500,
contract2.pdf,true,15,5200,
contract3.pdf,false,0,0,"Timeout exceeded"

Structured Data CSV Format

filename,file_uri,processed_at,party_1,party_2,effective_date,contract_type
contract1.pdf,gs://bucket/contract1.pdf,2024-09-10T14:30:22,"ABC Corp","XYZ Ltd","2024-01-01","Service Agreement"
contract2.pdf,gs://bucket/contract2.pdf,2024-09-10T14:30:25,"DEF Inc","GHI LLC","2024-02-01","Purchase Agreement"

Obligations CSV Format

filename,obligation_source,obligation_number,description,responsible_party,due_date,penalty
contract1.pdf,obligations,1,"Deliver goods","ABC Corp","2024-03-01","$1000/day"
contract1.pdf,obligations,2,"Payment terms","XYZ Ltd","2024-03-15","2% interest"
contract2.pdf,obligations,1,"Service level","DEF Inc","Ongoing","Service credits"

Failed CSV Format (if failures occur)

filename,file_uri,error,processed_at
large_document.pdf,gs://bucket/large_document.pdf,"DEADLINE_EXCEEDED: Request timed out",2024-09-10T14:35:00
corrupted.pdf,gs://bucket/corrupted.pdf,"Invalid PDF format",2024-09-10T14:35:05
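
Because the obligations table carries the filename as a document reference, it can be joined back to the per-document data. For example, assuming pandas is installed (column names follow the sample CSVs above):

import pandas as pd

docs = pd.read_csv("batch_results_20240910_143022/structured_data.csv")
obligations = pd.read_csv("batch_results_20240910_143022/obligations.csv")

# One row per obligation, enriched with its document's parties and dates
joined = obligations.merge(
    docs[["filename", "party_1", "party_2", "effective_date"]], on="filename"
)
print(joined[["filename", "description", "responsible_party", "due_date"]].head())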

Configuration

Authentication Requirements

Important: User authentication is mandatory. Use one of these methods:

# Method 1: Use aitana auth (recommended)
aitana auth

# Method 2: Provide email via command line
aitana batch-extract --user-email your-email@example.com ...

There are no default fallback emails; authentication must be explicit.

Backend Setup Requirements

The batch extraction requires the backend to be properly configured:

# Ensure backend dependencies are installed
cd /Users/mark/dev/aitana-labs/frontend/backend
uv sync
source .venv/bin/activate

Environment Variables

The batch extraction automatically sets default environment variables:

  • GOOGLE_CLOUD_PROJECT: “aitana-multivac-dev”
  • GOOGLE_CLOUD_LOCATION: “global”

These can be overridden if needed.
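
For example, to override both for a single invocation (the project and location values here are illustrative):

# Per-invocation override of the default environment variables
GOOGLE_CLOUD_PROJECT=my-project GOOGLE_CLOUD_LOCATION=us-central1 \
    aitana batch-extract ...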

Tool Configuration

Default tools for batch extraction:

  • file-browser: For accessing GCS files
  • structured_extraction: For extracting structured data

Tool configurations can be customized:

--tools "file-browser,structured_extraction" \
--schema-name "custom_schema"
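
In the Python API (see the API Reference below), the same selection is expressed as a tools list plus a tool_configs mapping; the schema name here is illustrative:

# Python API equivalent of the CLI flags above
tools = ["file-browser", "structured_extraction"]
tool_configs = {"structured_extraction": {"schema_name": "custom_schema"}}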

Performance Optimization

Batch Size Selection

Choose batch size based on:

  • Document complexity: Complex PDFs need smaller batches
  • System resources: Higher concurrency needs more memory
  • API limits: Some assistants have rate limits

Recommended batch sizes:

  • Simple text files: 10-20
  • Standard PDFs: 5-10
  • Complex documents: 2-5

File Filtering

Use regex patterns to process only relevant files:

--file-pattern ".*2024.*\.pdf$"  # Only 2024 PDFs
--file-pattern "^contracts/.*"   # Only contracts folder
--file-pattern ".*\.(pdf|docx)$" # Only PDFs and Word docs

Maximum Files

Limit processing for testing or batching:

--max-files 10  # Process first 10 files for testing
--max-files 100 # Process a large bucket in runs of 100 files

Troubleshooting

Common Issues

1. Backend Import Errors

Failed to import backend modules: No module named 'assistant_utils'

Solution: Ensure backend is set up:

cd backend && uv sync

2. Timeout Issues

Error: DEADLINE_EXCEEDED

Solution: Even the 5-minute API timeout was not enough for the file. Consider:

  • Reducing document size
  • Simplifying extraction prompt
  • Using smaller batch size

3. No Files Found

No files found in bucket

Solution: Check:

  • Bucket URL format: gs://bucket/path/
  • GCS permissions
  • File pattern if using --file-pattern

4. Schema Not Found

Schema 'contract_extraction_schema' not found

Solution: Verify schema name with:

aitana inspect <assistant-id>

Comparison with Other Methods

Batch Extract vs Process Files

Feature               batch-extract      process files
Timeout               5 minutes          60 seconds
Method                Direct backend     HTTP API
Performance           Faster             Standard
CSV Export            Multiple formats   JSON only
Summary CSV           Yes                No
Failed Files CSV      Yes                No
Structured Data CSV   Yes                No
Obligations Table     Yes                No
Error Recovery        Better             Standard
Authentication        Required           Optional
Context Override      API (5 min)        UI (60 sec)

When to Use Each

Use batch-extract when:

  • Processing large PDFs (>10MB)
  • Extracting complex structured data
  • Need CSV export with normalization
  • Processing >50 files
  • Documents take >30 seconds each

Use process files when:

  • Simple text extraction
  • Small files (<1MB)
  • Need HTTP API features
  • Integration with other tools

Future Enhancements

Planned Features

  1. Resume capability: Continue from last processed file
  2. Incremental saves: Save results after each batch
  3. Custom schemas: Upload Pydantic schemas
  4. Multiple output formats: Excel, Parquet, BigQuery
  5. Progress webhooks: Send progress updates to external systems

Performance Improvements

  1. Adaptive batch sizing: Automatically adjust based on success rate
  2. Smart retries: Exponential backoff with jitter
  3. Caching: Skip already processed files
  4. Compression: Automatic result compression for large datasets

API Reference

Python API

from aitana.processors.batch_extract import BatchExtractProcessor

# Initialize processor
processor = BatchExtractProcessor(
    assistant_id="my-assistant",
    prompt="Extract key information",
    batch_size=5,
    verbose=True
)

# Process files
results = await processor.process_batch(
    bucket_url="gs://my-bucket/",
    tools=["file-browser", "structured_extraction"],
    tool_configs={"structured_extraction": {"schema_name": "my_schema"}},
    max_files=100,
    file_pattern=r".*\.pdf$"  # raw string avoids invalid-escape warnings
)

# Save results
saved_files = processor.save_results("output_dir/")
processor.print_summary()

Direct Backend Usage

# For custom integrations
import sys
sys.path.insert(0, "/path/to/backend")

from assistant_utils import process_assistant_request

# Note: await requires a running event loop (e.g. inside an async function)
result = await process_assistant_request(
    assistant_id="my-assistant",
    user_input="Extract data from document",
    context="api",  # Critical for 5-minute timeout
    tools=["file-browser", "structured_extraction"],
    toolConfigs={...}
)