Prompt Optimization System
Overview
The prompt optimization system is an automated tool for improving AI assistant prompts through iterative testing and LLM-based evaluation. It currently targets the prompts used by first_impression.py but can be extended to other prompts in the system.
Table of Contents
- Architecture
- Setup & Configuration
- Creating Datasets
- Running Optimizations
- Understanding Results
- Best Practices
- Troubleshooting
- API Reference
Architecture
graph TD
A[Langfuse Dataset] --> B[PromptOptimizer]
C[Current Prompts] --> B
B --> D[Test Execution]
D --> E[LLM Judge Evaluation]
E --> F[Score & Feedback]
F --> G{Good Score?}
G -->|No| H[Generate Improved Prompt]
H --> D
G -->|Yes| I[Save Results]
I --> J[Generate Report]
Key Components
PromptOptimizer (prompt_optimization.py)
The main orchestrator that manages the optimization process:
- Loads test cases from Langfuse datasets
- Fetches current prompts using existing infrastructure
- Manages optimization iterations (max 5)
- Coordinates evaluation and improvement cycles
PromptEvaluator (prompt_evaluator.py)
LLM-based evaluation system that scores responses on:
- Relevance (25 points): Does the answer address the user’s question?
- Tool Selection (25 points): Are the selected tools appropriate?
- Clarity (20 points): Is the response clear and well-structured?
- Completeness (15 points): Does it address all aspects?
- Reasoning (10 points): Is the tool selection logic sound?
- Technical Accuracy (5 points): Are technical details correct?
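Because the six criteria sum to 100 points, a test case's overall score is simply the sum of its per-criterion scores. A minimal sketch of that arithmetic (the keys and maximum weights mirror the list above; the clamping helper is illustrative, not the evaluator's actual code):
# Maximum points per criterion, mirroring the list above (totals 100).
MAX_POINTS = {
    "relevance": 25,
    "tool_selection": 25,
    "clarity": 20,
    "completeness": 15,
    "reasoning": 10,
    "technical_accuracy": 5,
}

def overall_score(criteria_scores: dict[str, float]) -> float:
    """Sum per-criterion scores, clamping each to its maximum weight."""
    return sum(
        min(criteria_scores.get(name, 0.0), limit)
        for name, limit in MAX_POINTS.items()
    )

# Example: 23.5 + 24.0 + 18.5 + 14.0 + 9.0 + 4.5 = 93.5
print(overall_score({
    "relevance": 23.5, "tool_selection": 24.0, "clarity": 18.5,
    "completeness": 14.0, "reasoning": 9.0, "technical_accuracy": 4.5,
}))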
Optimization Utilities (optimization_utils.py)
Supporting classes for tracking and analysis:
- PromptVersionManager: Tracks prompt versions and performance
- DatasetAnalyzer: Analyzes test dataset characteristics
- PerformanceAnalyzer: Tracks optimization trends and suggests early stopping
- OptimizationReporter: Generates comprehensive reports
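The exact interfaces live in optimization_utils.py. Purely as an illustration of the kind of record PromptVersionManager keeps (a prompt hash, its score, a timestamp), here is a self-contained toy sketch, not the module's real API:
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class PromptVersion:
    """Illustrative record: one prompt version and how it scored."""
    prompt_hash: str
    score: float
    created_at: str

@dataclass
class SimpleVersionTracker:
    """Toy stand-in for PromptVersionManager; see optimization_utils.py for the real thing."""
    versions: list[PromptVersion] = field(default_factory=list)

    def record(self, prompt_text: str, score: float) -> PromptVersion:
        version = PromptVersion(
            prompt_hash=hashlib.sha256(prompt_text.encode()).hexdigest()[:12],
            score=score,
            created_at=datetime.now(timezone.utc).isoformat(),
        )
        self.versions.append(version)
        return version

    def best(self) -> PromptVersion:
        return max(self.versions, key=lambda v: v.score)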
Setup & Configuration
Prerequisites
- Environment Setup
cd backend
source .venv/bin/activate
uv sync
- Langfuse Configuration
The system uses environment-specific configuration files from the root directory:
- .env.local.dev - Development Langfuse instance
- .env.local.test - Test/staging Langfuse instance
- .env.local.prod - Production Langfuse instance (default)
Each file should contain:
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=https://analytics.aitana.chat  # or environment-specific hosts
- GCP/Vertex AI Access: The system uses Gemini for evaluation, requiring valid GCP credentials.
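How the scripts pick up the environment files above is handled internally; for reference, loading an environment-specific file with python-dotenv looks roughly like this sketch (the env-to-filename mapping is an assumption for illustration, not necessarily how the scripts implement it):
import os
from dotenv import load_dotenv  # provided by the python-dotenv package

# Assumed mapping from --env values to the files listed above.
ENV_FILES = {
    "dev": ".env.local.dev",
    "test": ".env.local.test",
    "prod": ".env.local.prod",  # default
}

def load_langfuse_env(env: str = "prod") -> None:
    """Load the Langfuse credentials for the chosen environment from the repo root."""
    load_dotenv(ENV_FILES[env])
    if not os.getenv("LANGFUSE_PUBLIC_KEY"):
        raise RuntimeError(f"LANGFUSE_PUBLIC_KEY not set after loading {ENV_FILES[env]}")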
Installation
The optimization scripts are located in backend/scripts/:
backend/scripts/
├── prompt_optimization.py # Core optimization engine
├── prompt_evaluator.py # LLM judge implementation
├── optimization_utils.py # Helper utilities
├── run_prompt_optimization.py # CLI interface
└── test_optimization.py # Test suite
Creating Datasets
Dataset Structure
Create a Langfuse dataset with items following this structure:
{
"input": {
"question": "What is the weather like in Paris today?"
},
"expected_output": {
"answer": "I'll help you check the current weather in Paris.",
"tools_to_use": [
{
"name": "google_search",
"config": [
{"parameter": "query", "value": "Paris weather today"}
]
}
],
"conversation_summary": {
"summary": "User asking about current weather in Paris"
},
"pause_to_confirm": false
},
"metadata": {
"type": "weather_query",
"complexity": "simple",
"priority": "high"
}
}
Dataset Best Practices
- Diverse Test Cases: Include various query types:
- Simple factual questions
- Complex analysis requests
- Multi-tool scenarios
- Edge cases and error conditions
- Realistic Expectations: Ensure expected_output reflects achievable responses
- Metadata Tags: Use metadata for categorization:
  {
    "type": "search|analysis|generation|factual",
    "complexity": "simple|medium|complex",
    "tools_required": ["vertex_search", "file-browser"],
    "priority": "high|medium|low"
  }
- Minimum Dataset Size: At least 10-20 test cases for meaningful optimization
Creating Datasets via Langfuse UI
- Navigate to your Langfuse project
- Go to Datasets → Create New Dataset
- Name it (e.g., first_impression_optimisation)
- Add items manually or via API
- Use production traces as inspiration
Creating Datasets Programmatically
from langfuse import Langfuse
langfuse = Langfuse()
# Create dataset
dataset = langfuse.create_dataset(
name="first_impression_optimisation",
description="Test cases for prompt optimization"
)
# Add items
langfuse.create_dataset_item(
dataset_name="first_impression_optimisation",
input={
"question": "Analyze our Q3 financial reports"
},
expected_output={
"answer": "I'll analyze your Q3 financial reports for you.",
"tools_to_use": [
{"name": "vertex_search", "config": [{"parameter": "query", "value": "Q3 financial reports"}]},
{"name": "file-browser", "config": []}
],
"conversation_summary": {"summary": "User requesting Q3 financial analysis"},
"pause_to_confirm": True
},
metadata={
"type": "document_analysis",
"complexity": "complex"
}
)
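To load many items at once (for example, items exported from production traces), the same create_dataset_item call can be driven from a file. A sketch, where items.json is a hypothetical file containing a list of objects shaped like the example above:
import json
from langfuse import Langfuse

langfuse = Langfuse()

# items.json: a JSON array of {"input": ..., "expected_output": ..., "metadata": ...}
with open("items.json") as f:
    items = json.load(f)

for item in items:
    langfuse.create_dataset_item(
        dataset_name="first_impression_optimisation",
        input=item["input"],
        expected_output=item["expected_output"],
        metadata=item.get("metadata", {}),
    )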
Running Optimizations
Basic Usage
cd backend
source .venv/bin/activate
# Run optimization on production dataset (default)
python scripts/run_prompt_optimization.py --dataset first_impression_optimisation
# Run optimization on development dataset
python scripts/run_prompt_optimization.py --dataset first_impression_optimisation --env dev
# Run optimization on test dataset
python scripts/run_prompt_optimization.py --dataset first_impression_optimisation --env test
Advanced Options
# Full command with all options
python scripts/run_prompt_optimization.py \
--dataset first_impression_optimisation \
--env dev \
--max-iterations 3 \
--output-dir ./optimization_results \
--report-format json \
--save-prompts \
--verbose
# Dry run (validation only) in test environment
python scripts/run_prompt_optimization.py \
--dataset first_impression_optimisation \
--env test \
--dry-run
# Quick test with fewer iterations in development
python scripts/run_prompt_optimization.py \
--dataset first_impression_optimisation \
--env dev \
--max-iterations 2
Command-Line Options
| Option | Description | Default |
|---|---|---|
| --dataset | Langfuse dataset name | Required |
| --env | Environment (dev/test/prod) | prod |
| --max-iterations | Maximum optimization iterations | 5 |
| --output-dir | Directory for results | ./optimization_results |
| --report-format | Output format (json/text) | text |
| --save-prompts | Save optimized prompts to files | False |
| --verbose | Enable detailed logging | False |
| --dry-run | Validate without running | False |
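For reference, these flags map onto a fairly standard argparse setup. A simplified sketch of how run_prompt_optimization.py could define them (the actual script may differ in details):
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Optimize first_impression prompts")
    parser.add_argument("--dataset", required=True, help="Langfuse dataset name")
    parser.add_argument("--env", choices=["dev", "test", "prod"], default="prod")
    parser.add_argument("--max-iterations", type=int, default=5)
    parser.add_argument("--output-dir", default="./optimization_results")
    parser.add_argument("--report-format", choices=["json", "text"], default="text")
    parser.add_argument("--save-prompts", action="store_true")
    parser.add_argument("--verbose", action="store_true")
    parser.add_argument("--dry-run", action="store_true")
    return parser

# Example: args.max_iterations defaults to 5, args.env to "prod"
args = build_parser().parse_args(["--dataset", "first_impression_optimisation"])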
What Happens During Optimization
- Initial Evaluation: Tests current prompts against all dataset items
- Iteration Loop (up to max_iterations):
- Analyzes failures and low scores
- Generates improved prompts using LLM
- Tests improved prompts
- Tracks performance trends
- Early Stopping: Automatically stops if:
- Score reaches 95+ (excellent)
- Performance plateaus
- Scores decline consistently
- Final Report: Generates comprehensive analysis
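Put together, the iteration loop and early-stopping rules above amount to roughly the following (an illustrative sketch, not the actual implementation; evaluate_current_prompts and improve_prompts are hypothetical helpers, and the plateau/decline checks stand in for PerformanceAnalyzer):
async def optimize(optimizer, max_iterations: int = 5, target: float = 95.0) -> list[float]:
    """Illustrative loop implementing the early-stopping rules described above."""
    scores: list[float] = []
    for _ in range(max_iterations):
        score = await optimizer.evaluate_current_prompts()  # hypothetical helper
        scores.append(score)
        if score >= target:
            break  # excellent score reached
        if len(scores) >= 3 and max(scores[-3:]) - min(scores[-3:]) < 1.0:
            break  # plateau: last three scores within one point of each other
        if len(scores) >= 3 and scores[-1] < scores[-2] < scores[-3]:
            break  # scores declining consistently
        await optimizer.improve_prompts()  # hypothetical helper: generate improved prompts
    return scores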
Understanding Results
Score Interpretation
| Score Range | Rating | Recommendation |
|---|---|---|
| 90-100 | Excellent | Ready for production |
| 80-89 | Good | Minor refinements optional |
| 70-79 | Fair | Additional optimization recommended |
| Below 70 | Poor | Major revision needed |
Reading the Summary Report
================================================================================
PROMPT OPTIMIZATION SUMMARY REPORT
================================================================================
Dataset: first_impression_optimisation
Generated: 2024-01-15 10:30:45
Total Test Cases: 15
PERFORMANCE METRICS
==================
Initial Score: 72.34/100
Final Score: 88.67/100
Best Score: 89.12/100
Total Improvement: +16.33 points
Iterations: 4
TREND ANALYSIS
==============
Trend: improving
Score Volatility: 2.45
Average Score: 82.15
ITERATION DETAILS
=================
Iteration 1: Score 72.34 (+0.00) Success Rate: 86.7%
Iteration 2: Score 81.23 (+8.89) Success Rate: 93.3%
Iteration 3: Score 87.45 (+6.22) Success Rate: 100.0%
Iteration 4: Score 88.67 (+1.22) Success Rate: 100.0%
Detailed Results Analysis
The system generates multiple output files:
- Summary Report (optimization_report_*.txt)
  - Overall performance metrics
  - Iteration-by-iteration progress
  - Recommendations
- JSON Results (optimization_results_*.json)
  - Detailed test results
  - Individual evaluations
  - Error details
- Optimized Prompts (optimized_*.txt)
  - Best-performing prompt versions
  - Performance metadata
  - Version hashes
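The JSON results can also be inspected programmatically. A sketch that assumes a top-level iterations list with score and success_rate fields; check the actual file for the exact schema before relying on it:
import glob
import json

# Assumption: results were written to the default output directory.
paths = sorted(glob.glob("optimization_results/optimization_results_*.json"))
if not paths:
    raise SystemExit("No results files found; run an optimization first")

with open(paths[-1]) as f:
    results = json.load(f)

# Assumed schema: {"iterations": [{"score": ..., "success_rate": ...}, ...]}
for number, iteration in enumerate(results.get("iterations", []), start=1):
    print(f"Iteration {number}: score={iteration.get('score')} "
          f"success rate={iteration.get('success_rate')}")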
Evaluation Criteria Breakdown
Each test case receives scores on:
{
"criteria_scores": {
"relevance": 23.5, // out of 25
"tool_selection": 24.0, // out of 25
"clarity": 18.5, // out of 20
"completeness": 14.0, // out of 15
"reasoning": 9.0, // out of 10
"technical_accuracy": 4.5 // out of 5
},
"overall_score": 93.5,
"overall_rating": "excellent"
}
Best Practices
1. Dataset Quality
Do:
- Include real user queries from production
- Cover edge cases and error scenarios
- Balance simple and complex queries
- Update datasets regularly
Don’t:
- Use overly synthetic examples
- Create impossible expectations
- Ignore tool limitations
2. Optimization Strategy
Iterative Approach:
- Start with small dataset (10-15 cases)
- Run initial optimization
- Analyze failures
- Add more targeted test cases
- Re-run optimization
Performance Monitoring:
# Monitor optimization progress
tail -f optimization_*.log
# Check for specific issues
grep "ERROR\|FAILED" optimization_*.log
3. Prompt Management
Version Control:
- Save successful prompts before major changes
- Document why changes were made
- Track performance over time
Testing Before Deployment:
# Validate optimized prompts
python scripts/run_prompt_optimization.py \
--dataset production_validation \
--dry-run
4. Common Patterns
Tool Selection Issues:
// Bad: Over-selecting tools
"tools_to_use": ["google_search", "vertex_search", "file-browser", "code_execution"]
// Good: Focused tool selection
"tools_to_use": ["vertex_search"]
Response Clarity:
// Bad: Technical jargon
"answer": "Initiating multi-modal search with semantic vectorization..."
// Good: Clear communication
"answer": "I'll search for that information in your documents."
Troubleshooting
Common Issues
1. Dataset Not Found
Error: Dataset 'first_impression_optimisation' not found
Solution: Verify dataset exists in Langfuse UI and check spelling
2. Low Initial Scores
Initial Score: 45.23/100
Solutions:
- Review expected outputs for realism
- Check if current prompts match dataset expectations
- Analyze specific failure patterns
3. Optimization Plateau
Performance has plateaued. Consider manual review.
Solutions:
- Add more diverse test cases
- Manually review problem areas
- Consider prompt structure changes
4. Module Import Errors
ModuleNotFoundError: No module named 'langfuse'
Solution:
cd backend
source .venv/bin/activate
uv sync
Debug Mode
Enable verbose logging for detailed diagnostics:
# Full debug output
python scripts/run_prompt_optimization.py \
--dataset first_impression_optimisation \
--verbose 2>&1 | tee optimization_debug.log
# Analyze specific test case
grep -A 10 -B 10 "test_case_id" optimization_*.log
Performance Issues
Slow Optimization:
- Reduce dataset size for testing
- Lower max iterations
- Check API rate limits
Memory Issues:
- Process datasets in batches
- Clear cache between runs
- Monitor system resources
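One way to act on the batching suggestion above is to evaluate the dataset a few cases at a time, using the batch_evaluate method shown in the API reference below. A sketch:
def chunked(items: list, size: int):
    """Yield successive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

async def evaluate_in_batches(evaluator, test_cases: list, batch_size: int = 5) -> list:
    """Evaluate test cases a few at a time to limit memory and API pressure."""
    all_results = []
    for batch in chunked(test_cases, batch_size):
        all_results.extend(await evaluator.batch_evaluate(batch))
    return all_results

# Usage: asyncio.run(evaluate_in_batches(evaluator, test_cases, batch_size=5))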
API Reference
PromptOptimizer
from scripts.prompt_optimization import PromptOptimizer
# Initialize optimizer
optimizer = PromptOptimizer(
dataset_name="first_impression_optimisation",
max_iterations=5
)
# Run optimization
results = await optimizer.run_optimization()
# Generate report
report = optimizer.generate_report()
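Because run_optimization is awaited, a standalone script needs an async entry point around the snippet above, for example:
import asyncio
from scripts.prompt_optimization import PromptOptimizer

async def main() -> None:
    optimizer = PromptOptimizer(
        dataset_name="first_impression_optimisation",
        max_iterations=5,
    )
    results = await optimizer.run_optimization()
    print(optimizer.generate_report())

asyncio.run(main())
The same applies to the evaluator calls in the next snippet.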
PromptEvaluator
from scripts.prompt_evaluator import PromptEvaluator
# Initialize evaluator
evaluator = PromptEvaluator(model_name="gemini")
# Evaluate single response
result = await evaluator.evaluate_first_impression_response(
user_question="What's the weather?",
expected_output={...},
actual_output={...}
)
# Batch evaluation
results = await evaluator.batch_evaluate(test_cases)
Utility Functions
from scripts.optimization_utils import (
validate_prompt_format,
calculate_similarity,
extract_prompt_variables
)
# Validate prompt
validation = validate_prompt_format(prompt_text)
if not validation["valid"]:
print(f"Issues: {validation['issues']}")
# Extract variables
variables = extract_prompt_variables("Hello {{name}}, {{greeting}}")
# Returns: ['name', 'greeting']
Integration with CI/CD
Automated Testing
Add to your CI pipeline:
- name: Run Prompt Optimization Tests
run: |
cd backend
source .venv/bin/activate
python scripts/run_prompt_optimization.py \
--dataset ci_test_dataset \
--max-iterations 2 \
--dry-run
Performance Tracking
Track prompt performance over time:
#!/bin/bash
# Performance tracking script: snapshot prompt performance by date
DATE=$(date +%Y%m%d)
python scripts/run_prompt_optimization.py \
--dataset production_dataset \
--output-dir ./performance_tracking/$DATE \
--report-format json
Extending the System
Adding New Evaluation Criteria
Edit prompt_evaluator.py:
class EvaluationCriteria(Enum):
RELEVANCE = "relevance"
TOOL_SELECTION = "tool_selection"
# Add new criteria
CULTURAL_SENSITIVITY = "cultural_sensitivity"
RESPONSE_TIME = "response_time"
Supporting Additional Prompts
Modify prompt_optimization.py:
prompt_names = ["first_impression", "first_impression-tools", "smart-context"]
Custom Scoring Logic
Implement custom evaluators:
class CustomEvaluator(PromptEvaluator):
    def score_response(self, response):
        # Custom scoring logic (placeholder example: penalize very long answers)
        custom_score = 100.0 if len(response.get("answer", "")) < 500 else 80.0
        return custom_score
Security Considerations
- Sensitive Data: Optimization logs may contain sensitive information
- Access Control: Restrict access to Langfuse datasets
- Prompt Injection: Validate all generated prompts
- API Keys: Never commit credentials to version control
- Environment Isolation: Keep development, test, and production environments separate
- Credential Management: Each environment uses separate API keys and datasets
Future Enhancements
Planned Features
- Dynamic prompt injection during testing
- Multi-model evaluation (Claude + Gemini)
- A/B testing framework
- Automated deployment pipeline
- Real-time performance monitoring
Contributing
To contribute improvements:
- Test changes with test_optimization.py
- Update documentation
- Follow existing code patterns
- Add appropriate logging
Conclusion
The prompt optimization system provides a robust framework for improving AI assistant prompts through automated testing and evaluation. By following the practices outlined in this guide, you can systematically enhance prompt performance and maintain high-quality AI interactions.