Prompt Optimization System

Overview

The prompt optimization system is an automated tool for improving AI assistant prompts through iterative testing and LLM-based evaluation. It specifically targets the prompts used by the first_impression.py function but can be extended to other prompts in the system.

Table of Contents

  1. Architecture
  2. Setup & Configuration
  3. Creating Datasets
  4. Running Optimizations
  5. Understanding Results
  6. Best Practices
  7. Troubleshooting
  8. API Reference

Architecture

graph TD
    A[Langfuse Dataset] --> B[PromptOptimizer]
    C[Current Prompts] --> B
    B --> D[Test Execution]
    D --> E[LLM Judge Evaluation]
    E --> F[Score & Feedback]
    F --> G{Good Score?}
    G -->|No| H[Generate Improved Prompt]
    H --> D
    G -->|Yes| I[Save Results]
    I --> J[Generate Report]

Key Components

PromptOptimizer (prompt_optimization.py)

The main orchestrator that manages the optimization process:

  • Loads test cases from Langfuse datasets
  • Fetches current prompts using existing infrastructure
  • Manages optimization iterations (max 5)
  • Coordinates evaluation and improvement cycles

PromptEvaluator (prompt_evaluator.py)

LLM-based evaluation system that scores responses on six weighted criteria (a scoring sketch follows the list):

  • Relevance (25 points): Does the answer address the user’s question?
  • Tool Selection (25 points): Are the selected tools appropriate?
  • Clarity (20 points): Is the response clear and well-structured?
  • Completeness (15 points): Does it address all aspects?
  • Reasoning (10 points): Is the tool selection logic sound?
  • Technical Accuracy (5 points): Are technical details correct?
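
The exact aggregation lives inside PromptEvaluator; conceptually, the overall score is the sum of the per-criterion scores, each bounded by its maximum weight. Below is a minimal sketch assuming this simple additive scheme; the weight table and function are illustrative, not the evaluator's actual API:

from typing import Dict

# Illustrative only: mirrors the criterion weights listed above; the evaluator's
# internal aggregation may differ.
CRITERIA_WEIGHTS = {
    "relevance": 25,
    "tool_selection": 25,
    "clarity": 20,
    "completeness": 15,
    "reasoning": 10,
    "technical_accuracy": 5,
}

def overall_score(criteria_scores: Dict[str, float]) -> float:
    """Sum per-criterion scores, capping each at its maximum weight."""
    return sum(
        min(criteria_scores.get(name, 0.0), weight)
        for name, weight in CRITERIA_WEIGHTS.items()
    )

print(overall_score({"relevance": 23.5, "tool_selection": 24.0, "clarity": 18.5,
                     "completeness": 14.0, "reasoning": 9.0, "technical_accuracy": 4.5}))
# 93.5, matching the breakdown shown in "Understanding Results"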

Optimization Utilities (optimization_utils.py)

Supporting classes for tracking and analysis:

  • PromptVersionManager: Tracks prompt versions and performance
  • DatasetAnalyzer: Analyzes test dataset characteristics
  • PerformanceAnalyzer: Tracks optimization trends and suggests early stopping
  • OptimizationReporter: Generates comprehensive reports

Setup & Configuration

Prerequisites

  1. Environment Setup
    cd backend
    source .venv/bin/activate
    uv sync
    
  2. Langfuse Configuration: The system uses environment-specific configuration files from the root directory (a loading sketch follows this list):
    • .env.local.dev - Development Langfuse instance
    • .env.local.test - Test/staging Langfuse instance
    • .env.local.prod - Production Langfuse instance (default)

    Each file should contain:

    LANGFUSE_PUBLIC_KEY=pk-lf-...
    LANGFUSE_SECRET_KEY=sk-lf-...
    LANGFUSE_HOST=https://analytics.aitana.chat
    # Or environment-specific hosts
    
  3. GCP/Vertex AI Access: The system uses Gemini for evaluation, so valid GCP credentials are required.
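
How the scripts resolve these files is an implementation detail, but a minimal sketch of environment-specific loading might look like this, assuming python-dotenv is available and the file lives in backend/scripts/ (the helper name is illustrative):

import os
from pathlib import Path

from dotenv import load_dotenv  # assumes python-dotenv is installed

def load_langfuse_env(env: str = "prod") -> None:
    """Load .env.local.<env> from the repository root (two levels above backend/scripts/)."""
    env_file = Path(__file__).resolve().parents[2] / f".env.local.{env}"
    load_dotenv(env_file)
    missing = [key for key in ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "LANGFUSE_HOST")
               if not os.getenv(key)]
    if missing:
        raise RuntimeError(f"Missing Langfuse settings: {missing}")

load_langfuse_env("dev")  # mirrors running with --env dev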

Installation

The optimization scripts are located in backend/scripts/:

backend/scripts/
├── prompt_optimization.py      # Core optimization engine
├── prompt_evaluator.py         # LLM judge implementation
├── optimization_utils.py       # Helper utilities
├── run_prompt_optimization.py  # CLI interface
└── test_optimization.py        # Test suite

Creating Datasets

Dataset Structure

Create a Langfuse dataset with items following this structure:

{
  "input": {
    "question": "What is the weather like in Paris today?"
  },
  "expected_output": {
    "answer": "I'll help you check the current weather in Paris.",
    "tools_to_use": [
      {
        "name": "google_search",
        "config": [
          {"parameter": "query", "value": "Paris weather today"}
        ]
      }
    ],
    "conversation_summary": {
      "summary": "User asking about current weather in Paris"
    },
    "pause_to_confirm": false
  },
  "metadata": {
    "type": "weather_query",
    "complexity": "simple",
    "priority": "high"
  }
}

Dataset Best Practices

  1. Diverse Test Cases: Include various query types:
    • Simple factual questions
    • Complex analysis requests
    • Multi-tool scenarios
    • Edge cases and error conditions
  2. Realistic Expectations: Ensure expected_output reflects achievable responses

  3. Metadata Tags: Use metadata for categorization:
    {
      "type": "search|analysis|generation|factual",
      "complexity": "simple|medium|complex",
      "tools_required": ["vertex_search", "file-browser"],
      "priority": "high|medium|low"
    }
    
  4. Minimum Dataset Size: At least 10-20 test cases for meaningful optimization (a quick coverage check is sketched below)
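
Before running an optimization, it can be worth sanity-checking dataset size and coverage. A small sketch using the Langfuse Python SDK; the dataset name and metadata keys follow the examples in this guide, and item metadata is assumed to come back as a dict:

from collections import Counter

from langfuse import Langfuse

langfuse = Langfuse()  # reads the LANGFUSE_* variables from the environment

dataset = langfuse.get_dataset("first_impression_optimisation")
items = dataset.items

print(f"Total items: {len(items)}")  # aim for at least 10-20
print("By type:", Counter((item.metadata or {}).get("type", "unknown") for item in items))
print("By complexity:", Counter((item.metadata or {}).get("complexity", "unknown") for item in items))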

Creating Datasets via Langfuse UI

  1. Navigate to your Langfuse project
  2. Go to Datasets → Create New Dataset
  3. Name it (e.g., first_impression_optimisation)
  4. Add items manually or via API
  5. Use production traces as inspiration

Creating Datasets Programmatically

from langfuse import Langfuse

langfuse = Langfuse()

# Create dataset
dataset = langfuse.create_dataset(
    name="first_impression_optimisation",
    description="Test cases for prompt optimization"
)

# Add items
langfuse.create_dataset_item(
    dataset_name="first_impression_optimisation",
    input={
        "question": "Analyze our Q3 financial reports"
    },
    expected_output={
        "answer": "I'll analyze your Q3 financial reports for you.",
        "tools_to_use": [
            {"name": "vertex_search", "config": [{"parameter": "query", "value": "Q3 financial reports"}]},
            {"name": "file-browser", "config": []}
        ],
        "conversation_summary": {"summary": "User requesting Q3 financial analysis"},
        "pause_to_confirm": True
    },
    metadata={
        "type": "document_analysis",
        "complexity": "complex"
    }
)

Running Optimizations

Basic Usage

cd backend
source .venv/bin/activate

# Run optimization on production dataset (default)
python scripts/run_prompt_optimization.py --dataset first_impression_optimisation

# Run optimization on development dataset
python scripts/run_prompt_optimization.py --dataset first_impression_optimisation --env dev

# Run optimization on test dataset
python scripts/run_prompt_optimization.py --dataset first_impression_optimisation --env test

Advanced Options

# Full command with all options
python scripts/run_prompt_optimization.py \
    --dataset first_impression_optimisation \
    --env dev \
    --max-iterations 3 \
    --output-dir ./optimization_results \
    --report-format json \
    --save-prompts \
    --verbose

# Dry run (validation only) in test environment
python scripts/run_prompt_optimization.py \
    --dataset first_impression_optimisation \
    --env test \
    --dry-run

# Quick test with fewer iterations in development
python scripts/run_prompt_optimization.py \
    --dataset first_impression_optimisation \
    --env dev \
    --max-iterations 2

Command-Line Options

Option            Description                       Default
--dataset         Langfuse dataset name             Required
--env             Environment (dev/test/prod)       prod
--max-iterations  Maximum optimization iterations   5
--output-dir      Directory for results             ./optimization_results
--report-format   Output format (json/text)         text
--save-prompts    Save optimized prompts to files   False
--verbose         Enable detailed logging           False
--dry-run         Validate without running          False

What Happens During Optimization

  1. Initial Evaluation: Tests current prompts against all dataset items
  2. Iteration Loop (up to max_iterations; sketched after this list):
    • Analyzes failures and low scores
    • Generates improved prompts using LLM
    • Tests improved prompts
    • Tracks performance trends
  3. Early Stopping: Automatically stops if:
    • Score reaches 95+ (excellent)
    • Performance plateaus
    • Scores decline consistently
  4. Final Report: Generates comprehensive analysis
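
Put together, the control flow looks roughly like the self-contained sketch below. Stub functions stand in for the real evaluation and improvement steps; every name here is illustrative, not the actual implementation in prompt_optimization.py:

def evaluate(prompts, cases):
    return 80.0  # stub: the real system runs the prompts and asks the LLM judge

def improve_prompts(prompts, feedback):
    return prompts  # stub: the real system asks an LLM for a revised prompt

def plateaued(history, window=3, eps=0.5):
    return len(history) >= window and max(history[-window:]) - min(history[-window:]) < eps

max_iterations = 5
prompts, cases = {"first_impression": "..."}, []

history = [evaluate(prompts, cases)]             # 1. initial evaluation
for _ in range(max_iterations):                  # 2. iteration loop
    if history[-1] >= 95 or plateaued(history):  # 3. early stopping
        break
    prompts = improve_prompts(prompts, feedback=history)
    history.append(evaluate(prompts, cases))

best_score = max(history)                        # 4. feeds the final report
print(best_score)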

Understanding Results

Score Interpretation

Score Range   Rating      Recommendation
90-100        Excellent   Ready for production
80-89         Good        Minor refinements optional
70-79         Fair        Additional optimization recommended
Below 70      Poor        Major revision needed
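
The same thresholds, expressed as a small helper for scripting around the results (illustrative; the reporter has its own mapping):

def rating(score: float) -> str:
    """Map an overall score to the rating used in the table above."""
    if score >= 90:
        return "Excellent"
    if score >= 80:
        return "Good"
    if score >= 70:
        return "Fair"
    return "Poor"

print(rating(88.67))  # "Good"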

Reading the Summary Report

================================================================================
                        PROMPT OPTIMIZATION SUMMARY REPORT
================================================================================

Dataset: first_impression_optimisation
Generated: 2024-01-15 10:30:45
Total Test Cases: 15

PERFORMANCE METRICS
==================
Initial Score:      72.34/100
Final Score:        88.67/100
Best Score:         89.12/100
Total Improvement:  +16.33 points
Iterations:         4

TREND ANALYSIS
==============
Trend:              improving
Score Volatility:   2.45
Average Score:      82.15

ITERATION DETAILS
=================
Iteration  1: Score  72.34 (+0.00) Success Rate: 86.7%
Iteration  2: Score  81.23 (+8.89) Success Rate: 93.3%
Iteration  3: Score  87.45 (+6.22) Success Rate: 100.0%
Iteration  4: Score  88.67 (+1.22) Success Rate: 100.0%

Detailed Results Analysis

The system generates multiple output files:

  1. Summary Report (optimization_report_*.txt)
    • Overall performance metrics
    • Iteration-by-iteration progress
    • Recommendations
  2. JSON Results (optimization_results_*.json); a loading sketch follows this list
    • Detailed test results
    • Individual evaluations
    • Error details
  3. Optimized Prompts (optimized_*.txt)
    • Best-performing prompt versions
    • Performance metadata
    • Version hashes
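
The JSON results can also be inspected programmatically. A minimal loading sketch, assuming the default ./optimization_results output directory and the filename pattern above (the structure of the parsed data depends on the run):

import glob
import json

# Pick the most recent results file in the default output directory.
paths = sorted(glob.glob("optimization_results/optimization_results_*.json"))
with open(paths[-1]) as f:
    results = json.load(f)

print(type(results))  # inspect the top-level structure before relying on specific fields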

Evaluation Criteria Breakdown

Each test case receives scores on:

{
  "criteria_scores": {
    "relevance": 23.5,        // out of 25
    "tool_selection": 24.0,   // out of 25
    "clarity": 18.5,          // out of 20
    "completeness": 14.0,     // out of 15
    "reasoning": 9.0,         // out of 10
    "technical_accuracy": 4.5 // out of 5
  },
  "overall_score": 93.5,
  "overall_rating": "excellent"
}

Best Practices

1. Dataset Quality

Do:

  • Include real user queries from production
  • Cover edge cases and error scenarios
  • Balance simple and complex queries
  • Update datasets regularly

Don’t:

  • Use overly synthetic examples
  • Create impossible expectations
  • Ignore tool limitations

2. Optimization Strategy

Iterative Approach:

  1. Start with small dataset (10-15 cases)
  2. Run initial optimization
  3. Analyze failures
  4. Add more targeted test cases
  5. Re-run optimization

Performance Monitoring:

# Monitor optimization progress
tail -f optimization_*.log

# Check for specific issues
grep "ERROR\|FAILED" optimization_*.log

3. Prompt Management

Version Control:

  • Save successful prompts before major changes
  • Document why changes were made
  • Track performance over time

Testing Before Deployment:

# Validate optimized prompts
python scripts/run_prompt_optimization.py \
    --dataset production_validation \
    --dry-run

4. Common Patterns

Tool Selection Issues:

// Bad: Over-selecting tools
"tools_to_use": ["google_search", "vertex_search", "file-browser", "code_execution"]

// Good: Focused tool selection
"tools_to_use": ["vertex_search"]

Response Clarity:

// Bad: Technical jargon
"answer": "Initiating multi-modal search with semantic vectorization..."

// Good: Clear communication
"answer": "I'll search for that information in your documents."

Troubleshooting

Common Issues

1. Dataset Not Found

Error: Dataset 'first_impression_optimisation' not found

Solution: Verify dataset exists in Langfuse UI and check spelling

2. Low Initial Scores

Initial Score: 45.23/100

Solutions:

  • Review expected outputs for realism
  • Check if current prompts match dataset expectations
  • Analyze specific failure patterns

3. Optimization Plateau

Performance has plateaued. Consider manual review.

Solutions:

  • Add more diverse test cases
  • Manually review problem areas
  • Consider prompt structure changes

4. Module Import Errors

ModuleNotFoundError: No module named 'langfuse'

Solution:

cd backend
source .venv/bin/activate
uv sync

Debug Mode

Enable verbose logging for detailed diagnostics:

# Full debug output
python scripts/run_prompt_optimization.py \
    --dataset first_impression_optimisation \
    --verbose 2>&1 | tee optimization_debug.log

# Analyze specific test case
grep -A 10 -B 10 "test_case_id" optimization_*.log

Performance Issues

Slow Optimization:

  • Reduce dataset size for testing
  • Lower max iterations
  • Check API rate limits

Memory Issues:

  • Process datasets in batches (see the sketch below)
  • Clear cache between runs
  • Monitor system resources
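
For the batching suggestion above, a minimal sketch that chunks the test cases before handing them to the evaluator, assuming batch_evaluate accepts a list of test cases and returns a list of results as in the API Reference:

import asyncio

async def evaluate_in_batches(evaluator, test_cases, batch_size=5):
    """Evaluate test cases in small chunks to bound memory use and API pressure."""
    results = []
    for start in range(0, len(test_cases), batch_size):
        batch = test_cases[start:start + batch_size]
        results.extend(await evaluator.batch_evaluate(batch))
    return results

# Usage: results = asyncio.run(evaluate_in_batches(evaluator, test_cases))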

API Reference

PromptOptimizer

from scripts.prompt_optimization import PromptOptimizer

# Initialize optimizer
optimizer = PromptOptimizer(
    dataset_name="first_impression_optimisation",
    max_iterations=5
)

# Run optimization (call from an async context, or wrap with asyncio.run)
results = await optimizer.run_optimization()

# Generate report
report = optimizer.generate_report()

PromptEvaluator

from scripts.prompt_evaluator import PromptEvaluator

# Initialize evaluator
evaluator = PromptEvaluator(model_name="gemini")

# Evaluate single response
result = await evaluator.evaluate_first_impression_response(
    user_question="What's the weather?",
    expected_output={...},
    actual_output={...}
)

# Batch evaluation
results = await evaluator.batch_evaluate(test_cases)
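
To summarise a batch run, the overall scores can be aggregated directly. This is a hypothetical continuation of the snippet above, assuming each result exposes the overall_score shown in the evaluation breakdown; adjust the access pattern to the actual result type:

# Hypothetical aggregation; adjust field access to the actual result objects.
scores = [result["overall_score"] for result in results]
print(f"Mean score: {sum(scores) / len(scores):.2f}")
print(f"Lowest score: {min(scores):.2f}")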

Utility Functions

from scripts.optimization_utils import (
    validate_prompt_format,
    calculate_similarity,
    extract_prompt_variables
)

# Validate prompt
validation = validate_prompt_format(prompt_text)
if not validation["valid"]:
    print(f"Issues: {validation['issues']}")

# Extract variables (placeholder syntax shown here as format-style braces)
variables = extract_prompt_variables("Hello {name}, {greeting}")
# Returns: ['name', 'greeting']

Integration with CI/CD

Automated Testing

Add to your CI pipeline:

- name: Run Prompt Optimization Tests
  run: |
    cd backend
    source .venv/bin/activate
    python scripts/run_prompt_optimization.py \
      --dataset ci_test_dataset \
      --max-iterations 2 \
      --dry-run

Performance Tracking

Track prompt performance over time:

#!/bin/bash
# Performance tracking script: archive results by date
DATE=$(date +%Y%m%d)
python scripts/run_prompt_optimization.py \
    --dataset production_dataset \
    --output-dir ./performance_tracking/$DATE \
    --report-format json

Extending the System

Adding New Evaluation Criteria

Edit prompt_evaluator.py:

class EvaluationCriteria(Enum):
    RELEVANCE = "relevance"
    TOOL_SELECTION = "tool_selection"
    # Add new criteria
    CULTURAL_SENSITIVITY = "cultural_sensitivity"
    RESPONSE_TIME = "response_time"
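
New criteria only change the overall score if they also receive weight. Assuming the evaluator keeps a weight table alongside the enum (an assumption about the internals), the weights need rebalancing so they still total 100, for example:

# Assumed weight table (not necessarily the evaluator's real structure);
# rebalance so the total stays at 100.
CRITERIA_WEIGHTS = {
    "relevance": 20,
    "tool_selection": 20,
    "clarity": 15,
    "completeness": 15,
    "reasoning": 10,
    "technical_accuracy": 5,
    "cultural_sensitivity": 10,  # new
    "response_time": 5,          # new
}
assert sum(CRITERIA_WEIGHTS.values()) == 100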

Supporting Additional Prompts

Modify prompt_optimization.py:

prompt_names = ["first_impression", "first_impression-tools", "smart-context"]

Custom Scoring Logic

Implement custom evaluators:

class CustomEvaluator(PromptEvaluator):
    def score_response(self, response):
        # Custom scoring logic: compute and return a 0-100 score for the response
        custom_score = 0.0
        return custom_score

Security Considerations

  1. Sensitive Data: Optimization logs may contain sensitive information
  2. Access Control: Restrict access to Langfuse datasets
  3. Prompt Injection: Validate all generated prompts
  4. API Keys: Never commit credentials to version control
  5. Environment Isolation: Keep development, test, and production environments separate
  6. Credential Management: Each environment uses separate API keys and datasets

Future Enhancements

Planned Features

  1. Dynamic prompt injection during testing
  2. Multi-model evaluation (Claude + Gemini)
  3. A/B testing framework
  4. Automated deployment pipeline
  5. Real-time performance monitoring

Contributing

To contribute improvements:

  1. Test changes with test_optimization.py
  2. Update documentation
  3. Follow existing code patterns
  4. Add appropriate logging

Conclusion

The prompt optimization system provides a robust framework for improving AI assistant prompts through automated testing and evaluation. By following the practices outlined in this guide, you can systematically enhance prompt performance and maintain high-quality AI interactions.