Document Search Agent
Overview
The Document Search Agent is an intelligent tool that combines AI search and file extraction capabilities to iteratively discover and extract relevant document content for answering user questions. Unlike standalone ai_search and file-browser tools, this agent provides an intelligent orchestration layer that can make autonomous decisions about content sufficiency and search refinement.
How It Works
The agent operates through an iterative 4-step process:
- AI Search Discovery: Uses Vertex AI Search to find potentially relevant documents based on the user's question
- Content Extraction: Extracts full content from identified documents using the file browser functionality
- Quality Assessment: Uses Gemini Flash to evaluate if sufficient information has been gathered
- Adaptive Refinement: If more information is needed, refines search strategy and repeats
This cycle continues until either:
- The AI determines sufficient information has been gathered
- Maximum iterations are reached (default: 3)
- No new documents are found
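A minimal sketch of this control loop, with the four steps passed in as callables; the helper names and signatures below are placeholders for the agent's internals, not its real API:
async def run_search_loop(
    question: str,
    search_documents,      # step 1: Vertex AI Search discovery
    extract_content,       # step 2: file-browser content extraction
    assess_sufficiency,    # step 3: Gemini Flash quality assessment
    refine_query,          # step 4: adaptive search refinement
    max_iterations: int = 3,
    max_documents_per_iteration: int = 5,
) -> list[str]:
    """Illustrative control loop for the document search agent."""
    gathered: list[str] = []
    processed_ids: set[str] = set()           # deduplication of already-seen documents
    query = question
    for _ in range(max_iterations):
        hits = await search_documents(query, limit=max_documents_per_iteration)
        new_hits = [h for h in hits if h["id"] not in processed_ids]
        if not new_hits:                      # stop: no new documents found
            break
        for hit in new_hits:
            processed_ids.add(hit["id"])
            gathered.append(await extract_content(hit))
        if await assess_sufficiency(question, gathered) == "SUFFICIENT":
            break                             # stop: the AI judged the content sufficient
        query = await refine_query(question, gathered)   # refine the search and repeat
    return gathered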
Key Features
- Intelligent Stopping: Uses AI to assess content quality and decide when to stop searching
- Deduplication: Tracks processed documents to avoid reprocessing the same content
- Content Summarization: Automatically summarizes large content to stay within token limits
- Search Refinement: Adapts search strategy based on gaps in found content
- Streaming Updates: Provides real-time progress updates to users
- Directory Filtering: Can focus searches on specific directory paths
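For the summarization feature, a simple length check could decide when the accumulated content is compressed before assessment; the character budget and the summarize_content callable are assumptions, not the agent's actual token accounting:
async def cap_content(chunks: list[str], summarize_content, max_chars: int = 100_000) -> str:
    """Join extracted content and summarize it if it exceeds the assumed budget."""
    combined = "\n\n".join(chunks)
    if len(combined) <= max_chars:
        return combined
    # summarize_content stands in for the agent's Gemini Flash summarization step
    return await summarize_content(combined)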
Configuration
Required Parameters
- datastore_id: Vertex AI Search datastore identifier (e.g., "aitana3")
- bucketUrl: GCS bucket URL for file extraction (e.g., "aitana-documents-bucket")
Optional Parameters
- selected_directories: List of directory paths to focus the search on
- max_iterations: Maximum search cycles (default: 3)
- max_documents_per_iteration: Document limit per cycle (default: 5)
Frontend Configuration
The tool supports preset configurations in the UI:
- Energy Documents: aitana-documents-bucket + aitana3 datastore
- Public Welcome: aitana-public-documents + aitana public welcome datastore
- Custom: User-defined bucket and datastore
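One plausible representation of these presets is a mapping from preset name to bucket/datastore pair; the keys reuse the preset names mentioned under Configuration Recommendations below (energy_docs, public_welcome), and the structure itself is an assumption about how the frontend passes them on:
# Assumed preset table; the actual frontend configuration may be stored differently.
PRESETS = {
    "energy_docs": {
        "bucketUrl": "aitana-documents-bucket",
        "datastore_id": "aitana3",
    },
    "public_welcome": {
        "bucketUrl": "aitana-public-documents",
        "datastore_id": "aitana public welcome",
    },
    # "custom": bucket and datastore supplied by the user in the UI
}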
Usage in first_impression.py
When the document_search_agent tool is available, the first impression system should include it in tool selection prompts. Here's the recommended integration:
Tool Selection Prompt Addition
# Add this exact prompt text to the system_prompt_tooler compilation in first_impression.py
document_search_agent_prompt = """
### document_search_agent
This tool can take 30 to 120 seconds, as it performs multiple AI search and extraction cycles.
You have access to an intelligent document discovery and extraction agent that combines AI search with automatic content extraction. Unlike standalone vertex_search or file-browser tools, this agent autonomously decides when sufficient information has been gathered and iteratively refines its search strategy.
The agent performs a 4-step cycle:
1. **AI Search**: Uses Vertex AI Search to find potentially relevant documents
2. **Content Extraction**: Automatically extracts full content from discovered documents
3. **Quality Assessment**: Uses AI to evaluate if sufficient information has been gathered
4. **Adaptive Refinement**: If more information is needed, refines search and repeats
This tool is ideal for comprehensive research questions, legal document analysis, policy reviews, and multi-source investigations where you need thorough coverage rather than targeted searches.
Usage: The document_search_agent tool has these configuration parameters:
- query: str: the user's question that drives the entire search and extraction process
- datastore_id: str: the Vertex AI Search datastore to search (e.g., "aitana3")
- bucketUrl: str: the GCS bucket containing documents for extraction (e.g., "aitana-documents-bucket")
- selected_directories: list[str]: optional directory paths to focus searches on specific areas
- max_iterations: int: maximum search cycles (default: 3, rarely needs changing)
- max_documents_per_iteration: int: document limit per cycle (default: 5, rarely needs changing)
**Key advantages over separate vertex_search + file-browser:**
- Autonomous stopping when sufficient content is found
- Intelligent search refinement between iterations
- Automatic deduplication of processed documents
- Comprehensive content extraction without manual file selection
- Quality assessment to ensure thorough coverage
**Use document_search_agent when:**
- Question requires analysis of multiple unknown documents
- Comprehensive research coverage is needed
- Legal/policy analysis spanning multiple sources
- Investigation requires thorough document discovery
- User wants autonomous research without manual file selection
**Use separate vertex_search + file-browser when:**
- Need fine-grained control over search and extraction phases
- Specific documents are already identified
- Want to separate discovery from analysis
- Working with very large document sets requiring manual curation
Examples:
Basic research query:
[{"name": "document_search_agent", "config": [
    {"parameter": "query", "value": "What are the key seller commitments in renewable energy PPAs?"},
    {"parameter": "datastore_id", "value": "aitana3"},
    {"parameter": "bucketUrl", "value": "aitana-documents-bucket"}
]}]
Research with directory focus:
[{"name": "document_search_agent", "config": [
    {"parameter": "query", "value": "How do EU regulations address carbon offset verification?"},
    {"parameter": "datastore_id", "value": "aitana3"},
    {"parameter": "bucketUrl", "value": "aitana-documents-bucket"},
    {"parameter": "selected_directories", "value": ["regulations/EU", "policies/carbon"]}
]}]
Multiple research queries (each runs independently):
[
    {"name": "document_search_agent", "config": [
        {"parameter": "query", "value": "Security requirements for energy trading platforms"},
        {"parameter": "datastore_id", "value": "aitana3"},
        {"parameter": "bucketUrl", "value": "aitana-documents-bucket"}
    ]},
    {"name": "document_search_agent", "config": [
        {"parameter": "query", "value": "Compliance frameworks for renewable energy certificates"},
        {"parameter": "datastore_id", "value": "aitana3"},
        {"parameter": "bucketUrl", "value": "aitana-documents-bucket"}
    ]}
]
"""
# Add to system_prompt_tooler compilation alongside other tool descriptions
tool_descriptions = {
    "document_search_agent": document_search_agent_prompt,
    # Contrast with standalone tools:
    "vertex_search": """
        Basic semantic search of document datastores. Returns search results but doesn't extract full content.
        Use when you only need to find relevant documents, not analyze their full content.
    """,
    "file-browser": """
        Direct file extraction from known document paths. Requires manual file selection.
        Use when you know exactly which documents to analyze.
    """,
}
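How these descriptions reach the model depends on the rest of first_impression.py; a minimal sketch, assuming the tooler prompt is assembled by concatenating the descriptions of whichever tools are enabled (build_system_prompt_tooler is a hypothetical helper):
def build_system_prompt_tooler(base_prompt: str, enabled_tools: list[str]) -> str:
    """Append the description of each enabled tool to the base tooler prompt (illustrative)."""
    sections = [tool_descriptions[name] for name in enabled_tools if name in tool_descriptions]
    return base_prompt + "\n\n" + "\n\n".join(sections)

# Example (assumed usage):
# system_prompt_tooler = build_system_prompt_tooler(base_prompt, ["document_search_agent", "vertex_search"])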
First Impression Selection Logic
The tool selector in first_impression.py should prefer document_search_agent over standalone ai_search + file-browser combinations when:
- The question requires comprehensive document analysis
- Multiple documents may contain relevant information
- The user hasn't specified exact documents to analyze
- The question is research-oriented or investigative in nature
Example first impression reasoning:
# In FirstImpressionResponse.thinking_why_tools_to_use
thinking_examples = {
    "comprehensive_research":
        "User is asking about complex legal/policy topics that likely span multiple documents. "
        "Document search agent will autonomously find and analyze all relevant documents, "
        "providing more thorough coverage than manual file selection.",
    "investigative_query":
        "This question requires gathering information from multiple sources. "
        "The document search agent will iteratively search and extract content "
        "until sufficient information is found to answer comprehensively.",
    "known_documents":
        "User has specified exact documents to analyze. "
        "File browser tool is more appropriate for targeted document extraction."
}
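Purely as an illustration of these criteria (not the actual first-impression code), the preference can be expressed as a small heuristic:
def prefer_document_search_agent(is_research_question: bool,
                                 multiple_documents_likely: bool,
                                 user_named_exact_documents: bool) -> bool:
    """Illustrative heuristic mirroring the selection criteria above."""
    if user_named_exact_documents:
        return False  # file-browser is the better fit for known documents
    return is_research_question or multiple_documents_likely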
Integration with Related Tools
Relationship to ai_search.py
The document search agent uses ai_search.py internally but adds:
- Automatic content extraction after finding relevant documents
- Quality assessment to determine search completeness
- Iterative refinement based on content gaps
- Deduplication to avoid reprocessing documents
Use ai_search directly when you only need document discovery, not full content analysis.
Relationship to extract_files.py
The document search agent uses extract_files.py internally but adds:
- Automatic document discovery (no manual file selection needed)
- Intelligent stopping when sufficient content is gathered
- Batch processing with iteration limits
- Search-driven workflow rather than user-driven file selection
Use extract_files directly when you have specific documents to analyze.
Combined vs. Separate Tool Usage
Use Document Search Agent when:
- Question is open-ended or research-oriented
- Relevant documents are unknown or spread across multiple files
- Comprehensive analysis is needed
- User wants autonomous document discovery
Use ai_search + file-browser separately when:
- Need fine-grained control over search and extraction
- Specific documents are already identified
- Want to separate discovery from analysis phases
- Working with very large document sets requiring manual curation
Technical Implementation
Core Function Signature
async def document_search_agent(
    question: str,                            # User's question
    selected_directories: List[str] = None,   # Directory filter
    datastore_id: str = "",                   # AI search datastore
    bucketUrl: str = "",                      # GCS bucket for files
    max_iterations: int = 3,                  # Search cycles
    max_documents_per_iteration: int = 5,     # Docs per cycle
    trace=None,                               # Langfuse tracing
    parent_observation_id=None,               # Tracing parent
    callback=None                             # Streaming callback
) -> str
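A minimal call based on this signature; the import path and the directory value are illustrative assumptions:
import asyncio

# from tools.document_search_agent import document_search_agent  # assumed module path

async def main() -> None:
    result = await document_search_agent(
        question="What are the key seller commitments in renewable energy PPAs?",
        datastore_id="aitana3",
        bucketUrl="aitana-documents-bucket",
        selected_directories=["contracts/ppa"],   # hypothetical directory path
        max_iterations=3,
        max_documents_per_iteration=5,
    )
    print(result)

asyncio.run(main())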
Tool Interface for Orchestrator
# Expected input format:
[
    {
        'datastore_id': 'aitana3',
        'query': 'user question',
        'selected_directories': ['path/to/dir1', 'path/to/dir2'],
        'bucketUrl': 'aitana-documents-bucket',
        'max_iterations': 3,
        'max_documents_per_iteration': 5
    }
]
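One way an orchestrator could translate this input format into calls to the function above; the key names and defaults follow the example, while run_from_tool_config itself is a hypothetical wrapper:
async def run_from_tool_config(configs: list[dict], trace=None, callback=None) -> list[str]:
    """Map orchestrator config dicts onto document_search_agent calls (illustrative)."""
    results = []
    for cfg in configs:
        results.append(await document_search_agent(
            question=cfg["query"],
            datastore_id=cfg["datastore_id"],
            bucketUrl=cfg["bucketUrl"],
            selected_directories=cfg.get("selected_directories"),
            max_iterations=cfg.get("max_iterations", 3),
            max_documents_per_iteration=cfg.get("max_documents_per_iteration", 5),
            trace=trace,
            callback=callback,
        ))
    return results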
Quality Assessment Logic
The agent uses Gemini Flash with this assessment prompt:
assessment_prompt = (
    "You are an information quality assessor. Analyze the provided content and determine "
    "if it contains sufficient information to comprehensively answer the given question.\n\n"
    "Respond with EXACTLY one of these:\n"
    "SUFFICIENT - if the content provides enough information to answer the question well\n"
    "INSUFFICIENT - if more information is needed to properly answer the question\n\n"
    "Consider:\n"
    "- Completeness of information relative to the question\n"
    "- Depth of relevant details\n"
    "- Coverage of different aspects of the topic\n"
    "- Quality and specificity of the content\n\n"
    "Be conservative - prefer INSUFFICIENT if there are clear gaps."
)
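A sketch of how this assessment could be wired up, assuming the google-generativeai client and the gemini-1.5-flash model name; the production agent may call Gemini Flash through Vertex AI instead:
import google.generativeai as genai  # assumes genai.configure(api_key=...) has been called elsewhere

def assess_content_quality(question: str, content: str) -> str:
    """Return "SUFFICIENT" or "INSUFFICIENT" for the gathered content (illustrative wiring)."""
    model = genai.GenerativeModel("gemini-1.5-flash", system_instruction=assessment_prompt)
    response = model.generate_content(f"Question:\n{question}\n\nContent:\n{content}")
    verdict = response.text.strip().upper()
    # Treat anything unexpected as INSUFFICIENT, matching the conservative instruction above
    return "SUFFICIENT" if verdict.startswith("SUFFICIENT") else "INSUFFICIENT"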
Example Usage Scenarios
Legal Document Analysis
Question: "What are the key seller commitments in renewable energy PPAs?"
Process:
1. Searches datastore for PPA documents
2. Extracts content from relevant contracts
3. Assesses if coverage is comprehensive
4. Continues until all major PPA types are analyzed
Result: Comprehensive analysis of seller commitments across multiple PPA documents
Policy Research
Question: "How do EU regulations address carbon offset verification?"
Process:
1. Finds EU regulatory documents on carbon offsets
2. Extracts verification requirements from each
3. Identifies gaps in coverage (e.g., missing recent updates)
4. Searches for additional regulatory guidance
Result: Complete regulatory framework analysis with proper citations
Technical Investigation
Question: "What security measures are required for energy trading platforms?"
Process:
1. Discovers security standards and guidelines
2. Extracts technical requirements from each document
3. Assesses completeness of security coverage
4. Finds additional cybersecurity frameworks if needed
Result: Comprehensive security requirements compilation
Best Practices for Assistants
When to Recommend Document Search Agent
- Research Questions: Complex questions requiring analysis of multiple sources
- Investigative Queries: When comprehensive information gathering is needed
- Policy/Legal Analysis: Questions about regulations, contracts, or compliance
- Technical Deep Dives: Detailed technical questions spanning multiple documents
- Comparative Analysis: Questions requiring comparison across multiple documents
Configuration Recommendations
- Energy/Legal Documents: Use the energy_docs preset for renewable energy legal analysis
- Public Information: Use the public_welcome preset for general research
- Custom Domains: Configure specific bucket/datastore combinations for specialized document sets
Question Refinement Tips
Help users formulate questions that work well with the agent:
- Specific but broad: "Key requirements for wind farm development permits" vs. "Tell me about wind farms"
- Analysis-oriented: "Compare different PPA structures" vs. "Show me a PPA"
- Research-focused: "How do EU carbon markets function?" vs. "What is carbon trading?"
Monitoring and Observability
The agent provides comprehensive tracing through Langfuse:
- Search iterations with quality assessments
- Document processing with extraction results
- Content summarization when limits are exceeded
- Strategy refinement decisions and reasoning
Use trace data to:
- Optimize search strategies for specific document types
- Adjust iteration limits based on question complexity
- Monitor content quality assessment accuracy
- Improve directory filtering effectiveness
Troubleshooting
Common Issues
- No Results Found: Check datastore_id configuration and document indexing
- Insufficient Content: May need more iterations or broader directory selection
- Too Many Documents: Reduce max_documents_per_iteration or add directory filters
- Quality Assessment Loops: Review question specificity and content relevance
Performance Optimization
- Use directory filtering to focus searches
- Adjust iteration limits based on question complexity
- Monitor document extraction times for large files
- Consider datastore optimization for frequently searched content
This intelligent document search agent represents a significant advancement over manual document discovery and analysis, providing autonomous research capabilities that scale with document collection size and complexity.