Performance Considerations
Guidelines for optimizing performance in time-sensitive operations.
Latency-Sensitive Operations
First Token Streaming Performance
Critical: first-token-to-user latency dominates perceived responsiveness in streaming AI responses.
check_and_display_thinking() Latency Impact
⚠️ WARNING: High Latency Function
The check_and_display_thinking() function adds significant latency to operations:
- Typical delay: 1-4 seconds per call
- Cumulative impact: Multiple calls can add 10+ seconds total
- User perception: Severely impacts perceived responsiveness
When to Avoid
Never use in:
- First token streaming paths (first_impression.py)
- Real-time user interactions
- API endpoints with SLA requirements
- Performance-critical loops
Examples of problematic usage:
# ❌ BAD - Three sequential calls can delay the first token by 3-12 seconds
await check_and_display_thinking("Added first impression prompts.", callback)
await check_and_display_thinking("Checking semantic memory.", callback)
await check_and_display_thinking("Making first impression.", callback)
When It’s Acceptable
OK to use in:
- Background processing
- Non-time-sensitive operations
- Debug/development environments
- Long-running batch operations
Best Practices
Option 1: Remove completely
# Comment out for production performance
# await check_and_display_thinking("Making first impression.", callback)
Option 2: Conditional usage
import os

# Only show thinking in debug mode
if os.getenv("DEBUG_THINKING", "false").lower() == "true":
    await check_and_display_thinking("Making first impression.", callback)
Option 3: Async/background
# Don't await if time-sensitive; keep a reference so the task isn't garbage-collected
thinking_task = asyncio.create_task(check_and_display_thinking("Background process", callback))
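One caveat with fire-and-forget tasks: the event loop keeps only weak references to tasks, so hold a reference and observe completion or failures can vanish silently. A minimal sketch of that idiom (the helper name is illustrative):
import asyncio

_background_tasks: set[asyncio.Task] = set()

def fire_and_forget(coro) -> None:
    """Schedule a coroutine without awaiting it, but keep it alive and observed."""
    task = asyncio.create_task(coro)
    _background_tasks.add(task)  # Prevent the task from being garbage-collected
    task.add_done_callback(_background_tasks.discard)  # Drop the reference when done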
Performance Testing
Measuring Latency
Key metrics to track:
- First token time: Time until first streaming token appears
- Total response time: Complete response delivery
- Perceived latency: User-facing response time
Tools for measurement:
import logging
import time

log = logging.getLogger(__name__)

start_time = time.perf_counter()  # Monotonic clock; unaffected by system clock changes
# ... operation ...
first_token_time = time.perf_counter() - start_time
log.info(f"First token in {first_token_time:.2f}s")
Performance Targets
Recommended targets (a budget-check sketch follows the list):
- First token: < 2 seconds
- Streaming response: < 10 seconds for typical queries
- Tool execution: < 30 seconds per tool
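These numbers translate directly into a latency budget check. The thresholds below mirror the targets above; the helper itself is hypothetical and reuses the logger from the measurement snippet:
TARGETS_S = {
    "first_token": 2.0,
    "streaming_response": 10.0,
    "tool_execution": 30.0,
}

def check_budget(metric: str, elapsed_s: float) -> None:
    """Warn when a measured latency exceeds its target."""
    budget = TARGETS_S[metric]
    if elapsed_s > budget:
        log.warning(f"{metric} took {elapsed_s:.2f}s (target {budget:.0f}s)")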
Architecture Patterns
Concurrent Processing
Use AsyncTaskRunner for parallel operations:
runner = AsyncTaskRunner(retry_enabled=True)
runner.add_task(fast_streamer) # Start immediately
runner.add_task(background_tooler) # Process in parallel
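AsyncTaskRunner's full interface isn't shown here, so for reference, the same pattern in plain asyncio, reusing fast_streamer and background_tooler from the snippet above:
import asyncio

async def handle_request():
    # Start the streamer immediately so tokens flow while tools run
    stream_task = asyncio.create_task(fast_streamer())
    tool_task = asyncio.create_task(background_tooler())
    # Wait for both; the user sees tokens while tool work proceeds
    await asyncio.gather(stream_task, tool_task)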
Early Response Pattern
Start streaming while processing:
# Send immediate response
await callback.async_on_llm_new_token("Let me look into that...")
# Process in background
async_task = asyncio.create_task(expensive_operation())
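Stitched together, the pattern looks roughly like this (callback and expensive_operation() are taken from the snippets above; the function name is illustrative):
async def respond(callback):
    # Acknowledge immediately so the user sees activity right away
    await callback.async_on_llm_new_token("Let me look into that...")
    # Start the heavy work; other coroutines can stream interim tokens meanwhile
    task = asyncio.create_task(expensive_operation())
    # ... stream any interim tokens here ...
    result = await task
    await callback.async_on_llm_new_token(str(result))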
Client Connection Warmup
Pre-warm expensive connections:
# In app startup
from models.genai_client import async_warm_up_genai
await async_warm_up_genai(["gemini-2.5-flash"])
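For reference, a plausible shape for such a warmup, assuming genai_client() returns a google-genai Client; this is a sketch, not the actual implementation in genai_client.py:
async def async_warm_up_genai(model_names: list[str]) -> None:
    """Create the shared client and open connections before user traffic arrives."""
    client = genai_client()  # First call constructs the singleton
    for model in model_names:
        # A trivial request forces TLS setup and connection pooling up front
        await client.aio.models.generate_content(model=model, contents="ping")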
Monitoring and Alerting
Performance Metrics
Track these metrics (a P95 sketch follows the list):
- P95 first token latency
- P95 total response time
- Error rates during performance optimization
- User abandonment rates (proxy for perceived slowness)
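A minimal in-process sketch of the P95 computation (a production setup would pull these from a metrics backend such as Prometheus):
def p95(samples: list[float]) -> float:
    """95th percentile of recorded latency samples."""
    ordered = sorted(samples)
    return ordered[max(0, int(len(ordered) * 0.95) - 1)]

first_token_samples: list[float] = []
# ... append first_token_time after each request ...
if first_token_samples:
    log.info(f"P95 first token: {p95(first_token_samples):.2f}s")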
Performance Regression Detection
Alert on:
- First token time > 5 seconds
- Total response time > 30 seconds
- Significant increase in abandonment rate
Common Performance Anti-Patterns
Blocking Operations in Streaming
# ❌ BAD - Synchronous call blocks the event loop before streaming starts
expensive_sync_operation()
await stream_response()
# ✅ GOOD - Stream while processing concurrently (sync work moved to a thread)
stream_task = asyncio.create_task(stream_response())
process_task = asyncio.create_task(asyncio.to_thread(expensive_sync_operation))
await asyncio.gather(stream_task, process_task)
Excessive Thinking Messages
# ❌ BAD - Multiple blocking calls
await check_and_display_thinking("Step 1", callback)
await check_and_display_thinking("Step 2", callback)
await check_and_display_thinking("Step 3", callback)
# ✅ GOOD - Single summary or none
await check_and_display_thinking("Processing complete", callback)
Cold Client Creation
# ❌ BAD - First call happens inside a request, so the user pays the cold-start cost
client = genai_client()
# ✅ GOOD - Same accessor, pre-warmed at startup via async_warm_up_genai()
client = genai_client()  # Already warmed via background init
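For illustration, the warmed-singleton accessor can be as small as a cached constructor. A sketch assuming the google-genai SDK, not the actual contents of genai_client.py:
from functools import lru_cache
from google import genai

@lru_cache(maxsize=1)
def genai_client() -> genai.Client:
    # First call constructs the client; every later call returns the same instance
    return genai.Client()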
File Locations
Related files:
- /backend/first_impression.py - Critical streaming performance
- /backend/models/genai_client.py - Client warmup and connection pooling
- /backend/my_utils.py - Contains check_and_display_thinking()
- /backend/models/gemini.py - Streaming implementation