Performance Considerations

Guidelines for optimizing performance in time-sensitive operations.

Latency-Sensitive Operations

First Token Streaming Performance

Critical: First-token-to-user latency drives perceived responsiveness in streaming AI responses.

check_and_display_thinking() Latency Impact

⚠️ WARNING: High Latency Function

The check_and_display_thinking() function adds significant latency to operations:

  • Typical delay: 1-4 seconds per call
  • Cumulative impact: Multiple calls can add 10+ seconds total
  • User perception: Severely impacts perceived responsiveness

When to Avoid

Never use in:

  • First token streaming paths (first_impression.py)
  • Real-time user interactions
  • API endpoints with SLA requirements
  • Performance-critical loops

Examples of problematic usage:

# ❌ BAD - Blocks first token by 4+ seconds
await check_and_display_thinking("Added first impression prompts.", callback)
await check_and_display_thinking("Checking semantic memory.", callback)
await check_and_display_thinking("Making first impression.", callback)

When It’s Acceptable

OK to use in:

  • Background processing
  • Non-time-sensitive operations
  • Debug/development environments
  • Long-running batch operations

Best Practices

Option 1: Remove completely

# Comment out for production performance
# await check_and_display_thinking("Making first impression.", callback)

Option 2: Conditional usage

import os

# Only show thinking in debug mode
if os.getenv("DEBUG_THINKING", "false").lower() == "true":
    await check_and_display_thinking("Making first impression.", callback)

Option 3: Async/background

import asyncio

# Don't await if the caller is time-sensitive; schedule it in the background instead
asyncio.create_task(check_and_display_thinking("Background process", callback))
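
If you fire and forget with asyncio.create_task, keep a reference to the returned task: the event loop only holds weak references to tasks, so an unreferenced task can be garbage-collected before it finishes. A minimal helper sketch:

import asyncio

_background_tasks = set()

def fire_and_forget(coro):
    # Hold a strong reference so the pending task is not garbage-collected
    task = asyncio.create_task(coro)
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
    return task

# Usage
fire_and_forget(check_and_display_thinking("Background process", callback))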

Performance Testing

Measuring Latency

Key metrics to track:

  • First token time: Time until first streaming token appears
  • Total response time: Complete response delivery
  • Perceived latency: User-facing response time

Tools for measurement:

import logging
import time

log = logging.getLogger(__name__)

start_time = time.perf_counter()  # monotonic clock, unaffected by system clock changes
# ... operation ...
first_token_time = time.perf_counter() - start_time
log.info(f"First token in {first_token_time:.2f}s")

Performance Targets

Recommended targets:

  • First token: < 2 seconds
  • Streaming response: < 10 seconds for typical queries
  • Tool execution: < 30 seconds per tool
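
The per-tool budget can be enforced as well as measured. A hedged sketch that caps a single tool call at 30 seconds (run_tool is a hypothetical coroutine function):

import asyncio

TOOL_TIMEOUT_S = 30  # per-tool budget from the targets above

async def run_tool_with_budget(run_tool, *tool_args):
    try:
        return await asyncio.wait_for(run_tool(*tool_args), timeout=TOOL_TIMEOUT_S)
    except asyncio.TimeoutError:
        # Fail fast instead of stalling the response past the budget
        return None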

Architecture Patterns

Concurrent Processing

Use AsyncTaskRunner for parallel operations:

runner = AsyncTaskRunner(retry_enabled=True)
runner.add_task(fast_streamer)    # Start immediately
runner.add_task(background_tooler) # Process in parallel

Early Response Pattern

Start streaming while processing:

# Send immediate response
await callback.async_on_llm_new_token("Let me look into that...")

# Process in background
async_task = asyncio.create_task(expensive_operation())
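
Putting the pattern together, a sketch of a handler that acknowledges immediately, runs the expensive work in the background, and joins it before composing the final answer (expensive_operation and format_answer are hypothetical placeholders):

import asyncio

async def respond(callback):
    # Acknowledge immediately so the user sees activity right away
    await callback.async_on_llm_new_token("Let me look into that...")

    # Start the expensive work without blocking the acknowledgment
    task = asyncio.create_task(expensive_operation())

    # ... stream any cheap or partial content here ...

    # Join the background work before sending the final answer
    result = await task
    await callback.async_on_llm_new_token(format_answer(result))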

Client Connection Warmup

Pre-warm expensive connections:

# In app startup
from models.genai_client import async_warm_up_genai
await async_warm_up_genai(["gemini-2.5-flash"])
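
One way to wire this into startup, assuming a FastAPI app with a lifespan hook (the framework here is an assumption, not something the codebase prescribes):

from contextlib import asynccontextmanager

from fastapi import FastAPI  # assumption: FastAPI is the serving framework

from models.genai_client import async_warm_up_genai

@asynccontextmanager
async def lifespan(app):
    # Warm the model connection once, before the app starts serving traffic
    await async_warm_up_genai(["gemini-2.5-flash"])
    yield

app = FastAPI(lifespan=lifespan)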

Monitoring and Alerting

Performance Metrics

Track these metrics:

  • P95 first token latency
  • P95 total response time
  • Error rates during performance optimization
  • User abandonment rates (proxy for perceived slowness)
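
Computing the P95 figures from recorded latency samples needs only the standard library; a sketch (first_token_samples is a hypothetical list of per-request timings in seconds):

import statistics

def p95(samples):
    # With n=20 buckets, the last cut point (index 18) is the 95th percentile
    return statistics.quantiles(samples, n=20)[18]

first_token_p95 = p95(first_token_samples)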

Performance Regression Detection

Alert on:

  • First token time > 5 seconds
  • Total response time > 30 seconds
  • Significant increase in abandonment rate
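
A minimal check against these alert thresholds (alert is a placeholder for whatever notification hook is in use):

FIRST_TOKEN_ALERT_S = 5
TOTAL_RESPONSE_ALERT_S = 30

def check_regressions(first_token_s, total_response_s, alert):
    # alert: hypothetical callable that sends a notification (pager, Slack, log, ...)
    if first_token_s > FIRST_TOKEN_ALERT_S:
        alert(f"First token latency {first_token_s:.2f}s exceeds {FIRST_TOKEN_ALERT_S}s")
    if total_response_s > TOTAL_RESPONSE_ALERT_S:
        alert(f"Total response time {total_response_s:.2f}s exceeds {TOTAL_RESPONSE_ALERT_S}s")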

Common Performance Anti-Patterns

Blocking Operations in Streaming

# ❌ BAD - Blocks streaming: expensive work completes before any token is sent
await expensive_operation()
await stream_response()

# ✅ GOOD - Stream while processing
stream_task = asyncio.create_task(stream_response())
process_task = asyncio.create_task(expensive_operation())
await asyncio.gather(stream_task, process_task)

Excessive Thinking Messages

# ❌ BAD - Multiple blocking calls
await check_and_display_thinking("Step 1", callback)
await check_and_display_thinking("Step 2", callback)
await check_and_display_thinking("Step 3", callback)

# ✅ GOOD - Single summary or none
await check_and_display_thinking("Processing complete", callback)

Cold Client Creation

# ❌ BAD - Cold client initialized on the request path; the first user pays the startup cost
client = genai_client()

# ✅ GOOD - Same call after startup warmup; the shared client is already initialized
client = genai_client()  # Already warmed via background init
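
For other expensive clients that do not yet have a warmed singleton, a generic per-process cache is one option (get_expensive_client and ExpensiveClient are hypothetical, not the project's genai_client):

from functools import lru_cache

@lru_cache(maxsize=1)
def get_expensive_client():
    # Constructed once per process; later calls return the same cached instance
    return ExpensiveClient()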

File Locations

Related files:

  • /backend/first_impression.py - Critical streaming performance
  • /backend/models/genai_client.py - Client warmup and connection pooling
  • /backend/my_utils.py - Contains check_and_display_thinking()
  • /backend/models/gemini.py - Streaming implementation