Document Search Agent
Overview
The Document Search Agent is an intelligent tool that combines AI search and file extraction capabilities to iteratively discover and extract relevant document content for answering user questions. Unlike standalone ai_search and file-browser tools, this agent provides an intelligent orchestration layer that can make autonomous decisions about content sufficiency and search refinement.
How It Works
The agent operates through an iterative 4-step process:
- AI Search Discovery: Uses Vertex AI Search to find potentially relevant documents based on the user's question
- Content Extraction: Extracts full content from identified documents using the file browser functionality
- Quality Assessment: Uses Gemini Flash to evaluate if sufficient information has been gathered
- Adaptive Refinement: If more information is needed, refines search strategy and repeats
This cycle continues until either:
- The AI determines sufficient information has been gathered
- Maximum iterations are reached (default: 3)
- No new documents are found
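A minimal sketch of this control loop, with the four steps passed in as callables; the helper names and signatures below are placeholders for the agent's internals, not its real API:
async def run_search_loop(
    question: str,
    search_documents,      # step 1: Vertex AI Search discovery
    extract_content,       # step 2: file-browser content extraction
    assess_sufficiency,    # step 3: Gemini Flash quality assessment
    refine_query,          # step 4: adaptive search refinement
    max_iterations: int = 3,
    max_documents_per_iteration: int = 5,
) -> list[str]:
    """Illustrative control loop for the document search agent."""
    gathered: list[str] = []
    processed_ids: set[str] = set()           # deduplication of already-seen documents
    query = question
    for _ in range(max_iterations):
        hits = await search_documents(query, limit=max_documents_per_iteration)
        new_hits = [h for h in hits if h["id"] not in processed_ids]
        if not new_hits:                      # stop: no new documents found
            break
        for hit in new_hits:
            processed_ids.add(hit["id"])
            gathered.append(await extract_content(hit))
        if await assess_sufficiency(question, gathered) == "SUFFICIENT":
            break                             # stop: the AI judged the content sufficient
        query = await refine_query(question, gathered)   # refine the search and repeat
    return gathered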
Key Features
- Intelligent Stopping: Uses AI to assess content quality and decide when to stop searching
- Deduplication: Tracks processed documents to avoid reprocessing the same content
- Content Summarization: Automatically summarizes large content to stay within token limits
- Search Refinement: Adapts search strategy based on gaps in found content
- Streaming Updates: Provides real-time progress updates to users
- Directory Filtering: Can focus searches on specific directory paths
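For the summarization feature, a simple length check could decide when the accumulated content is compressed before assessment; the character budget and the summarize_content callable are assumptions, not the agent's actual token accounting:
async def cap_content(chunks: list[str], summarize_content, max_chars: int = 100_000) -> str:
    """Join extracted content and summarize it if it exceeds the assumed budget."""
    combined = "\n\n".join(chunks)
    if len(combined) <= max_chars:
        return combined
    # summarize_content stands in for the agent's Gemini Flash summarization step
    return await summarize_content(combined)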
Configuration
Required Parameters
- datastore_id: Vertex AI Search datastore identifier (e.g., "aitana3")
- bucketUrl: GCS bucket URL for file extraction (e.g., "aitana-documents-bucket")
Optional Parameters
- selected_directories: List of directory paths to focus the search on
- max_iterations: Maximum search cycles (default: 3)
- max_documents_per_iteration: Document limit per cycle (default: 5)
Frontend Configuration
The tool supports preset configurations in the UI:
- Energy Documents: aitana-documents-bucket + aitana3 datastore
- Public Welcome: aitana-public-documents + aitana public welcome datastore
- Custom: User-defined bucket and datastore
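One plausible representation of these presets is a mapping from preset name to bucket/datastore pair; the keys reuse the preset names mentioned under Configuration Recommendations below (energy_docs, public_welcome), and the structure itself is an assumption about how the frontend passes them on:
# Assumed preset table; the actual frontend configuration may be stored differently.
PRESETS = {
    "energy_docs": {
        "bucketUrl": "aitana-documents-bucket",
        "datastore_id": "aitana3",
    },
    "public_welcome": {
        "bucketUrl": "aitana-public-documents",
        "datastore_id": "aitana public welcome",
    },
    # "custom": bucket and datastore supplied by the user in the UI
}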
Usage in first_impression.py
When the document_search_agent tool is available, the first impression system should include it in tool selection prompts. Here's the recommended integration:
Tool Selection Prompt Addition
# Add this exact prompt text to the system_prompt_tooler compilation in first_impression.py
document_search_agent_prompt = """
### document_search_agent
This tool can take 30 to 120 seconds, as it performs multiple AI search and extraction cycles.
You have access to an intelligent document discovery and extraction agent that combines AI search with automatic content extraction. Unlike standalone vertex_search or file-browser tools, this agent autonomously decides when sufficient information has been gathered and iteratively refines its search strategy.
The agent performs a 4-step cycle:
1. **AI Search**: Uses Vertex AI Search to find potentially relevant documents
2. **Content Extraction**: Automatically extracts full content from discovered documents
3. **Quality Assessment**: Uses AI to evaluate if sufficient information has been gathered
4. **Adaptive Refinement**: If more information is needed, refines search and repeats
This tool is ideal for comprehensive research questions, legal document analysis, policy reviews, and multi-source investigations where you need thorough coverage rather than targeted searches.
Usage: The document_search_agent tool has these configuration parameters:
- query: str: the user's question that drives the entire search and extraction process
- datastore_id: str: the Vertex AI Search datastore to search (e.g., "aitana3")
- bucketUrl: str: the GCS bucket containing documents for extraction (e.g., "aitana-documents-bucket")
- selected_directories: list[str]: optional directory paths to focus searches on specific areas
- max_iterations: int: maximum search cycles (default: 3, rarely needs changing)
- max_documents_per_iteration: int: document limit per cycle (default: 5, rarely needs changing)
**Key advantages over separate vertex_search + file-browser:**
- Autonomous stopping when sufficient content is found
- Intelligent search refinement between iterations
- Automatic deduplication of processed documents
- Comprehensive content extraction without manual file selection
- Quality assessment to ensure thorough coverage
**Use document_search_agent when:**
- Question requires analysis of multiple unknown documents
- Comprehensive research coverage is needed
- Legal/policy analysis spanning multiple sources
- Investigation requires thorough document discovery
- User wants autonomous research without manual file selection
**Use separate vertex_search + file-browser when:**
- Need fine-grained control over search and extraction phases
- Specific documents are already identified
- Want to separate discovery from analysis
- Working with very large document sets requiring manual curation
Examples:
Basic research query:
[{"name": "document_search_agent", "config": [
    {"parameter": "query", "value": "What are the key seller commitments in renewable energy PPAs?"},
    {"parameter": "datastore_id", "value": "aitana3"},
    {"parameter": "bucketUrl", "value": "aitana-documents-bucket"}
]}]
Research with directory focus:
[{"name": "document_search_agent", "config": [
    {"parameter": "query", "value": "How do EU regulations address carbon offset verification?"},
    {"parameter": "datastore_id", "value": "aitana3"},
    {"parameter": "bucketUrl", "value": "aitana-documents-bucket"},
    {"parameter": "selected_directories", "value": ["regulations/EU", "policies/carbon"]}
]}]
Multiple research queries (each runs independently):
[
    {"name": "document_search_agent", "config": [
        {"parameter": "query", "value": "Security requirements for energy trading platforms"},
        {"parameter": "datastore_id", "value": "aitana3"},
        {"parameter": "bucketUrl", "value": "aitana-documents-bucket"}
    ]},
    {"name": "document_search_agent", "config": [
        {"parameter": "query", "value": "Compliance frameworks for renewable energy certificates"},
        {"parameter": "datastore_id", "value": "aitana3"},
        {"parameter": "bucketUrl", "value": "aitana-documents-bucket"}
    ]}
]
"""
# Add to system_prompt_tooler compilation alongside other tool descriptions
tool_descriptions = {
    "document_search_agent": document_search_agent_prompt,
    # Contrast with standalone tools:
    "vertex_search": """
        Basic semantic search of document datastores. Returns search results but doesn't extract full content.
        Use when you only need to find relevant documents, not analyze their full content.
    """,
    "file-browser": """
        Direct file extraction from known document paths. Requires manual file selection.
        Use when you know exactly which documents to analyze.
    """,
}
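How these descriptions reach the model depends on the rest of first_impression.py; a minimal sketch, assuming the tooler prompt is assembled by concatenating the descriptions of whichever tools are enabled (build_system_prompt_tooler is a hypothetical helper):
def build_system_prompt_tooler(base_prompt: str, enabled_tools: list[str]) -> str:
    """Append the description of each enabled tool to the base tooler prompt (illustrative)."""
    sections = [tool_descriptions[name] for name in enabled_tools if name in tool_descriptions]
    return base_prompt + "\n\n" + "\n\n".join(sections)

# Example (assumed usage):
# system_prompt_tooler = build_system_prompt_tooler(base_prompt, ["document_search_agent", "vertex_search"])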
First Impression Selection Logic
The tool selector in first_impression.py should prefer document_search_agent over standalone ai_search + file-browser combinations when:
- The question requires comprehensive document analysis
- Multiple documents may contain relevant information
- The user hasn't specified exact documents to analyze
- The question is research-oriented or investigative in nature
Example first impression reasoning:
# In FirstImpressionResponse.thinking_why_tools_to_use
thinking_examples = {
    "comprehensive_research":
        "User is asking about complex legal/policy topics that likely span multiple documents. "
        "Document search agent will autonomously find and analyze all relevant documents, "
        "providing more thorough coverage than manual file selection.",
    "investigative_query":
        "This question requires gathering information from multiple sources. "
        "The document search agent will iteratively search and extract content "
        "until sufficient information is found to answer comprehensively.",
    "known_documents":
        "User has specified exact documents to analyze. "
        "File browser tool is more appropriate for targeted document extraction."
}
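Purely as an illustration of these criteria (not the actual first-impression code), the preference can be expressed as a small heuristic:
def prefer_document_search_agent(is_research_question: bool,
                                 multiple_documents_likely: bool,
                                 user_named_exact_documents: bool) -> bool:
    """Illustrative heuristic mirroring the selection criteria above."""
    if user_named_exact_documents:
        return False  # file-browser is the better fit for known documents
    return is_research_question or multiple_documents_likely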
Integration with Related Tools
Relationship to ai_search.py
The document search agent uses ai_search.py internally but adds:
- Automatic content extraction after finding relevant documents
- Quality assessment to determine search completeness
- Iterative refinement based on content gaps
- Deduplication to avoid reprocessing documents
Use ai_search directly when you only need document discovery, not full content analysis.
Relationship to extract_files.py
The document search agent uses extract_files.py internally but adds:
- Automatic document discovery (no manual file selection needed)
- Intelligent stopping when sufficient content is gathered
- Batch processing with iteration limits
- Search-driven workflow rather than user-driven file selection
Use extract_files directly when you have specific documents to analyze.
Combined vs. Separate Tool Usage
Use Document Search Agent when:
- Question is open-ended or research-oriented
- Relevant documents are unknown or spread across multiple files
- Comprehensive analysis is needed
- User wants autonomous document discovery
Use ai_search + file-browser separately when:
- Need fine-grained control over search and extraction
- Specific documents are already identified
- Want to separate discovery from analysis phases
- Working with very large document sets requiring manual curation
Technical Implementation
Core Function Signature
async def document_search_agent(
    question: str,                            # User's question
    selected_directories: List[str] = None,   # Directory filter
    datastore_id: str = "",                   # AI search datastore
    bucketUrl: str = "",                      # GCS bucket for files
    max_iterations: int = 3,                  # Search cycles
    max_documents_per_iteration: int = 5,     # Docs per cycle
    trace=None,                               # Langfuse tracing
    parent_observation_id=None,               # Tracing parent
    callback=None                             # Streaming callback
) -> str
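A minimal call based on this signature; the import path and the directory value are illustrative assumptions:
import asyncio

# from tools.document_search_agent import document_search_agent  # assumed module path

async def main() -> None:
    result = await document_search_agent(
        question="What are the key seller commitments in renewable energy PPAs?",
        datastore_id="aitana3",
        bucketUrl="aitana-documents-bucket",
        selected_directories=["contracts/ppa"],   # hypothetical directory path
        max_iterations=3,
        max_documents_per_iteration=5,
    )
    print(result)

asyncio.run(main())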
Tool Interface for Orchestrator
# Expected input format:
[
    {
        'datastore_id': 'aitana3',
        'query': 'user question',
        'selected_directories': ['path/to/dir1', 'path/to/dir2'],
        'bucketUrl': 'aitana-documents-bucket',
        'max_iterations': 3,
        'max_documents_per_iteration': 5
    }
]
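One way an orchestrator could translate this input format into calls to the function above; the key names and defaults follow the example, while run_from_tool_config itself is a hypothetical wrapper:
async def run_from_tool_config(configs: list[dict], trace=None, callback=None) -> list[str]:
    """Map orchestrator config dicts onto document_search_agent calls (illustrative)."""
    results = []
    for cfg in configs:
        results.append(await document_search_agent(
            question=cfg["query"],
            datastore_id=cfg["datastore_id"],
            bucketUrl=cfg["bucketUrl"],
            selected_directories=cfg.get("selected_directories"),
            max_iterations=cfg.get("max_iterations", 3),
            max_documents_per_iteration=cfg.get("max_documents_per_iteration", 5),
            trace=trace,
            callback=callback,
        ))
    return results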
Quality Assessment Logic
The agent uses Gemini Flash with this assessment prompt:
assessment_prompt = (
    "You are an information quality assessor. Analyze the provided content and determine "
    "if it contains sufficient information to comprehensively answer the given question.\n\n"
    "Respond with EXACTLY one of these:\n"
    "SUFFICIENT - if the content provides enough information to answer the question well\n"
    "INSUFFICIENT - if more information is needed to properly answer the question\n\n"
    "Consider:\n"
    "- Completeness of information relative to the question\n"
    "- Depth of relevant details\n"
    "- Coverage of different aspects of the topic\n"
    "- Quality and specificity of the content\n\n"
    "Be conservative - prefer INSUFFICIENT if there are clear gaps."
)
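A sketch of how this assessment could be wired up, assuming the google-generativeai client and the gemini-1.5-flash model name; the production agent may call Gemini Flash through Vertex AI instead:
import google.generativeai as genai  # assumes genai.configure(api_key=...) has been called elsewhere

def assess_content_quality(question: str, content: str) -> str:
    """Return "SUFFICIENT" or "INSUFFICIENT" for the gathered content (illustrative wiring)."""
    model = genai.GenerativeModel("gemini-1.5-flash", system_instruction=assessment_prompt)
    response = model.generate_content(f"Question:\n{question}\n\nContent:\n{content}")
    verdict = response.text.strip().upper()
    # Treat anything unexpected as INSUFFICIENT, matching the conservative instruction above
    return "SUFFICIENT" if verdict.startswith("SUFFICIENT") else "INSUFFICIENT"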
Example Usage Scenarios
Legal Document Analysis
Question: "What are the key seller commitments in renewable energy PPAs?"
Process:
1. Searches datastore for PPA documents
2. Extracts content from relevant contracts
3. Assesses if coverage is comprehensive
4. Continues until all major PPA types are analyzed
Result: Comprehensive analysis of seller commitments across multiple PPA documents
Policy Research
Question: "How do EU regulations address carbon offset verification?"
Process:
1. Finds EU regulatory documents on carbon offsets
2. Extracts verification requirements from each
3. Identifies gaps in coverage (e.g., missing recent updates)
4. Searches for additional regulatory guidance
Result: Complete regulatory framework analysis with proper citations
Technical Investigation
Question: "What security measures are required for energy trading platforms?"
Process:
1. Discovers security standards and guidelines
2. Extracts technical requirements from each document
3. Assesses completeness of security coverage
4. Finds additional cybersecurity frameworks if needed
Result: Comprehensive security requirements compilation
Best Practices for Assistants
When to Recommend Document Search Agent
- Research Questions: Complex questions requiring analysis of multiple sources
- Investigative Queries: When comprehensive information gathering is needed
- Policy/Legal Analysis: Questions about regulations, contracts, or compliance
- Technical Deep Dives: Detailed technical questions spanning multiple documents
- Comparative Analysis: Questions requiring comparison across multiple documents
Configuration Recommendations
- Energy/Legal Documents: Use the energy_docs preset for renewable energy legal analysis
- Public Information: Use the public_welcome preset for general research
- Custom Domains: Configure specific bucket/datastore combinations for specialized document sets
Question Refinement Tips
Help users formulate questions that work well with the agent:
- Specific but broad: "Key requirements for wind farm development permits" vs. "Tell me about wind farms"
- Analysis-oriented: "Compare different PPA structures" vs. "Show me a PPA"
- Research-focused: "How do EU carbon markets function?" vs. "What is carbon trading?"
Monitoring and Observability
The agent provides comprehensive tracing through Langfuse:
- Search iterations with quality assessments
- Document processing with extraction results
- Content summarization when limits are exceeded
- Strategy refinement decisions and reasoning
Use trace data to:
- Optimize search strategies for specific document types
- Adjust iteration limits based on question complexity
- Monitor content quality assessment accuracy
- Improve directory filtering effectiveness
Troubleshooting
Common Issues
- No Results Found: Check datastore_id configuration and document indexing
- Insufficient Content: May need more iterations or broader directory selection
- Too Many Documents: Reduce max_documents_per_iteration or add directory filters
- Quality Assessment Loops: Review question specificity and content relevance
Performance Optimization
- Use directory filtering to focus searches
- Adjust iteration limits based on question complexity
- Monitor document extraction times for large files
- Consider datastore optimization for frequently searched content
This intelligent document search agent represents a significant advancement over manual document discovery and analysis, providing autonomous research capabilities that scale with document collection size and complexity.