Batch Extraction Feature
Overview
The batch extraction feature provides high-performance document processing capabilities for extracting structured data from large collections of files in Google Cloud Storage (GCS). This feature bypasses the HTTP API layer and directly calls backend functions, enabling longer processing timeouts and better performance.
Key Benefits
1. Extended Timeouts
- UI Context: 60 seconds (standard web interface)
- API Context: 5 minutes (batch extraction)
- Email Context: 5 minutes (background processing)
The batch extraction uses the `api` context, providing 5 minutes per file instead of the 60-second UI timeout. This is critical for processing large PDFs or complex documents.
2. Direct Backend Integration
Instead of going through Flask/FastAPI HTTP endpoints, the batch extraction directly imports and calls backend functions (a sketch of the difference follows this list):
- Eliminates HTTP overhead
- Reduces serialization/deserialization cycles
- Avoids Flask response tuple formatting issues
- Direct access to backend error handling
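To make the contrast concrete, here is a minimal sketch of the two call styles. The HTTP endpoint URL and payload are hypothetical; `process_assistant_request` is the backend function shown in the API reference at the end of this document.

```python
# HTTP route: every request crosses the web layer and its 60-second limit
import httpx

resp = httpx.post(
    "https://backend.example.com/api/assistant",  # hypothetical endpoint
    json={"user_input": "Extract data from document", "context": "ui"},
    timeout=60,
)

# Direct route: import the backend function and await it in-process
from assistant_utils import process_assistant_request

result = await process_assistant_request(
    assistant_id="my-assistant",
    user_input="Extract data from document",
    context="api",  # api context allows 5 minutes per file
)
```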
3. Parallel Processing
Files are processed in configurable batches (default: 5 files simultaneously); a sketch of the batching loop follows this list:
- Dramatically reduces total processing time
- Maintains system stability with controlled concurrency
- Progress tracking for each batch
- Automatic retry on failures
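A minimal sketch of that batching loop, assuming an async `handler` coroutine per file (the real `BatchExtractProcessor` may structure its concurrency and retry policy differently):

```python
import asyncio

async def process_in_batches(files, handler, batch_size=5, max_retries=2):
    """Run `handler` over files in fixed-size batches with simple retries."""
    results = []
    for start in range(0, len(files), batch_size):
        batch = files[start:start + batch_size]
        print(f"Batch {start // batch_size + 1}: {len(batch)} file(s)")

        async def run_one(file_uri):
            for attempt in range(max_retries + 1):
                try:
                    return await handler(file_uri)
                except Exception as exc:
                    if attempt == max_retries:
                        return {"file": file_uri, "success": False, "error": str(exc)}
                    await asyncio.sleep(2 ** attempt)  # back off before retrying

        results.extend(await asyncio.gather(*(run_one(f) for f in batch)))
    return results
```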
4. Structured Data Export
Results are automatically exported in multiple formats:
- JSON: Complete results with all metadata
- CSV: Main data table with document information
- Obligations CSV: Normalized table for contractual obligations with document ID references
Architecture
Call Flow Comparison
Standard HTTP API Flow (60-second timeout):

```
Request:  CLI → HTTP Client → Flask/FastAPI → Backend → AI Model
Response: CLI ← JSON Response ← HTTP Response ← Result ← AI Model
```

Direct Backend Flow (5-minute timeout):

```
Request:  CLI → Backend Functions → AI Model
Response: CLI ← Direct Response ← AI Model
```
Context Parameter Propagation
The context parameter flows through the entire call chain:
- Entry Point (`batch_extract.py`):

  ```python
  # Pre-initialize GenAI client with API context
  from models.genai_client import genai_client

  genai_client(location="global", force_recreate=True, context="api")

  # Then call with API context
  result = await process_assistant_request(
      context="api"  # Ensures 5-minute timeout
  )
  ```

- Assistant Utils (`assistant_utils.py`):

  ```python
  async def process_assistant_request(..., context="ui"):
      # Passes context to vac_stream
  ```

- VAC Service (`vac_service.py`):

  ```python
  request_context = kwargs.get("context", "ui")
  # Creates appropriate GenAI client based on context
  ```

- GenAI Client Configuration (`genai_client.py`):

  ```python
  if context == "api":
      timeout_ms = 300000  # 5 minutes
  else:
      timeout_ms = 60000  # 1 minute
  ```
Installation
# Install using uv tool (recommended)
uv tool install ./aitana --force
# Or install with pip
pip install ./aitana
Usage
Command Line Interface
# Basic usage with any assistant
aitana batch-extract <assistant-id> <bucket-url> \
--prompt "Your extraction prompt" \
[options]
Examples
1. Contract Extraction
aitana batch-extract contract-assistant gs://legal-docs/contracts/ \
--prompt "Extract all obligations, parties, and key dates" \
--schema-name contract_extraction_schema \
--max-files 50 \
--batch-size 10 \
--output-dir results/contracts/
2. Invoice Processing
aitana batch-extract invoice-processor gs://finance/invoices/2024/ \
--prompt "Extract invoice number, amount, date, and line items" \
--schema-name invoice_schema \
--file-pattern ".*\.pdf$" \
--batch-size 5
3. Document Analysis
aitana batch-extract document-analyzer gs://research/papers/ \
--prompt "Summarize key findings and methodologies" \
--tools "file-browser,structured_extraction" \
--max-files 100 \
--output-dir analysis/
Output Format
Directory Structure
batch_results_20240910_143022/
├── results.json # Complete results with all data
├── summary.csv # Summary of all processed files
├── failed.csv # Details of failed files (if any)
├── structured_data.csv # Flattened structured extraction data
└── obligations.csv # Normalized obligations table
Summary CSV Format
filename,success,structured_fields,answer_length,error
contract1.pdf,true,12,4500,
contract2.pdf,true,15,5200,
contract3.pdf,false,0,0,"Timeout exceeded"
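Since `success` is recorded as a lowercase `true`/`false` string, the failed rows can be pulled out with a few lines of Python (assuming the columns shown above):

```python
import csv

# List the files that failed in a run, using the summary.csv columns shown above
with open("batch_results_20240910_143022/summary.csv", newline="") as fh:
    failed = [row for row in csv.DictReader(fh) if row["success"].lower() != "true"]

for row in failed:
    print(row["filename"], row["error"])
```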
Structured Data CSV Format
filename,file_uri,processed_at,party_1,party_2,effective_date,contract_type
contract1.pdf,gs://bucket/contract1.pdf,2024-09-10T14:30:22,"ABC Corp","XYZ Ltd","2024-01-01","Service Agreement"
contract2.pdf,gs://bucket/contract2.pdf,2024-09-10T14:30:25,"DEF Inc","GHI LLC","2024-02-01","Purchase Agreement"
Obligations CSV Format
filename,obligation_source,obligation_number,description,responsible_party,due_date,penalty
contract1.pdf,obligations,1,"Deliver goods","ABC Corp","2024-03-01","$1000/day"
contract1.pdf,obligations,2,"Payment terms","XYZ Ltd","2024-03-15","2% interest"
contract2.pdf,obligations,1,"Service level","DEF Inc","Ongoing","Service credits"
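As an illustration of the normalization, a sketch along these lines (a hypothetical helper; the real exporter may differ) turns each document's nested obligations list into the flat rows shown above:

```python
import csv

FIELDS = ["filename", "obligation_source", "obligation_number",
          "description", "responsible_party", "due_date", "penalty"]

def write_obligations_csv(results, path):
    """Flatten nested per-document obligations into one CSV row per obligation."""
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        for doc in results:
            for number, obligation in enumerate(doc.get("obligations", []), start=1):
                writer.writerow({
                    "filename": doc["filename"],
                    "obligation_source": "obligations",
                    "obligation_number": number,
                    "description": obligation.get("description", ""),
                    "responsible_party": obligation.get("responsible_party", ""),
                    "due_date": obligation.get("due_date", ""),
                    "penalty": obligation.get("penalty", ""),
                })
```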
Failed CSV Format (if failures occur)
filename,file_uri,error,processed_at
large_document.pdf,gs://bucket/large_document.pdf,"DEADLINE_EXCEEDED: Request timed out",2024-09-10T14:35:00
corrupted.pdf,gs://bucket/corrupted.pdf,"Invalid PDF format",2024-09-10T14:35:05
Configuration
Authentication Requirements
Important: User authentication is mandatory. Use one of these methods:
# Method 1: Use aitana auth (recommended)
aitana auth
# Method 2: Provide email via command line
aitana batch-extract --user-email your-email@example.com ...
There are no default fallback emails - authentication must be explicit.
Backend Setup Requirements
The batch extraction requires the backend to be properly configured:
# Ensure backend dependencies are installed
cd /path/to/aitana-labs/frontend/backend
uv sync
source .venv/bin/activate
Environment Variables
The batch extraction automatically sets default environment variables:
- `GOOGLE_CLOUD_PROJECT`: "aitana-multivac-dev"
- `GOOGLE_CLOUD_LOCATION`: "global"
These can be overridden if needed.
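In effect the defaults behave like `os.environ.setdefault` (a sketch; the exact mechanism inside the tool may differ): they are only applied when the variables are not already set, so values exported in the shell beforehand take precedence.

```python
import os

# Defaults are filled in only when the variables are absent from the environment,
# so anything exported in the shell before the run overrides them.
os.environ.setdefault("GOOGLE_CLOUD_PROJECT", "aitana-multivac-dev")
os.environ.setdefault("GOOGLE_CLOUD_LOCATION", "global")
```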
Tool Configuration
Default tools for batch extraction:
- `file-browser`: for accessing GCS files
- `structured_extraction`: for extracting structured data
Tool configurations can be customized:
--tools "file-browser,structured_extraction" \
--schema-name "custom_schema"
Performance Optimization
Batch Size Selection
Choose batch size based on:
- Document complexity: Complex PDFs need smaller batches
- System resources: Higher concurrency needs more memory
- API limits: Some assistants have rate limits
Recommended batch sizes:
- Simple text files: 10-20
- Standard PDFs: 5-10
- Complex documents: 2-5
File Filtering
Use regex patterns to process only relevant files:
--file-pattern ".*2024.*\.pdf$" # Only 2024 PDFs
--file-pattern "^contracts/.*" # Only contracts folder
--file-pattern ".*\.(pdf|docx)$" # Only PDFs and Word docs
Maximum Files
Limit processing for testing or to split a large bucket across runs:
--max-files 10 # Process only the first 10 files for testing
--max-files 100 # Cap a single run at 100 files
Troubleshooting
Common Issues
1. Backend Import Errors
Failed to import backend modules: No module named 'assistant_utils'
Solution: Ensure backend is set up:
cd backend && uv sync
2. Timeout Issues
Error: DEADLINE_EXCEEDED
Solution: This means even 5 minutes wasn’t enough. Consider:
- Reducing document size
- Simplifying extraction prompt
- Using smaller batch size
3. No Files Found
No files found in bucket
Solution: Check:
- Bucket URL format: `gs://bucket/path/`
- GCS permissions
- File pattern, if using `--file-pattern`
4. Schema Not Found
Schema 'contract_extraction_schema' not found
Solution: Verify schema name with:
aitana inspect <assistant-id>
Comparison with Other Methods
Batch Extract vs Process Files
| Feature | batch-extract | process files |
|---|---|---|
| Timeout | 5 minutes | 60 seconds |
| Method | Direct backend | HTTP API |
| Performance | Faster | Standard |
| CSV Export | Multiple formats | JSON only |
| Summary CSV | Yes | No |
| Failed Files CSV | Yes | No |
| Structured Data CSV | Yes | No |
| Obligations Table | Yes | No |
| Error Recovery | Better | Standard |
| Authentication | Required | Optional |
| Context Override | API (5 min) | UI (60 sec) |
When to Use Each
Use batch-extract when:
- Processing large PDFs (>10MB)
- Extracting complex structured data
- Need CSV export with normalization
- Processing >50 files
- Documents take >30 seconds each
Use process files when:
- Simple text extraction
- Small files (<1MB)
- Need HTTP API features
- Integration with other tools
Future Enhancements
Planned Features
- Resume capability: Continue from last processed file
- Incremental saves: Save results after each batch
- Custom schemas: Upload Pydantic schemas
- Multiple output formats: Excel, Parquet, BigQuery
- Progress webhooks: Send progress updates to external systems
Performance Improvements
- Adaptive batch sizing: Automatically adjust based on success rate
- Smart retries: Exponential backoff with jitter
- Caching: Skip already processed files
- Compression: Automatic result compression for large datasets
API Reference
Python API
```python
from aitana.processors.batch_extract import BatchExtractProcessor

# Initialize processor
processor = BatchExtractProcessor(
    assistant_id="my-assistant",
    prompt="Extract key information",
    batch_size=5,
    verbose=True
)

# Process files
results = await processor.process_batch(
    bucket_url="gs://my-bucket/",
    tools=["file-browser", "structured_extraction"],
    tool_configs={"structured_extraction": {"schema_name": "my_schema"}},
    max_files=100,
    file_pattern=r".*\.pdf$"
)

# Save results
saved_files = processor.save_results("output_dir/")
processor.print_summary()
```
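`process_batch` is a coroutine, so in a standalone script wrap the calls above in an `async` function and drive it with `asyncio.run`, for example:

```python
import asyncio

from aitana.processors.batch_extract import BatchExtractProcessor

async def main():
    processor = BatchExtractProcessor(
        assistant_id="my-assistant",
        prompt="Extract key information",
        batch_size=5,
    )
    await processor.process_batch(bucket_url="gs://my-bucket/")
    processor.save_results("output_dir/")
    processor.print_summary()

asyncio.run(main())
```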
Direct Backend Usage
```python
# For custom integrations
import sys
sys.path.insert(0, "/path/to/backend")

from assistant_utils import process_assistant_request

result = await process_assistant_request(
    assistant_id="my-assistant",
    user_input="Extract data from document",
    context="api",  # Critical for 5-minute timeout
    tools=["file-browser", "structured_extraction"],
    toolConfigs={...}
)
```