Text-to-Speech (TTS) Feature

Overview

The TTS feature converts chat messages into audio using a selection of AI voices. It is integrated into the chat interface and supports both real-time streaming playback and downloadable audio files.

Features

🎙️ Multiple Voice Options

Zephyr: Clear, professional voice
Puck: Friendly, casual tone
Charon: Deep, authoritative voice
Kore: Default balanced voice
Fenrir: Dynamic, energetic voice
Leda: Smooth, elegant voice

🔄 Audio Formats

Streaming Audio: Real-time audio playback
WAV Downloads: High-quality downloadable files
Progressive Loading: Audio streams as it’s generated

⚡ Integration Points

Chat Interface: Convert assistant responses to speech
Frontend API: /api/tts-stream/ endpoint for streaming
Backend Service: Direct TTS processing at localhost:1956

User Interface

TTS Controls in Chat

The TTS feature appears as audio controls in the chat interface:

Play Button: Convert message to speech and play
Voice Selector: Choose from available voices
Volume Control: Adjust playback volume
Download Option: Save audio as WAV file

Voice Selection

Users can select different voices for different use cases:

Presentations: Use Charon for authority
Casual Chat: Use Puck for friendliness
Default: Kore provides balanced tone

Technical Implementation

Frontend Integration

// TTS streaming endpoint
const response = await fetch('/api/tts-stream/', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    text: messageText,
    voice: selectedVoice
  })
});

Backend Processing

The backend provides multiple endpoints:

GET /tts/health - Service health check
GET /tts/voices - Available voice list
POST /tts - Generate downloadable WAV
POST /tts/stream - Real-time audio streaming
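
A small typed map can keep these routes in one place on the client. This is only a sketch: it assumes the base URL http://localhost:1956 used in the examples in this document, and the names TTS_BASE_URL, TTS_ENDPOINTS, and ttsUrl are hypothetical rather than part of the existing code.

```typescript
// Base URL taken from the examples in this document; adjust for your deployment.
const TTS_BASE_URL = "http://localhost:1956";

// The four backend routes listed above, keyed by purpose.
const TTS_ENDPOINTS = {
  health: "/tts/health",
  voices: "/tts/voices",
  generate: "/tts",
  stream: "/tts/stream",
} as const;

type TtsEndpoint = keyof typeof TTS_ENDPOINTS;

// Hypothetical helper: joins base URL and route without doubling slashes.
function ttsUrl(endpoint: TtsEndpoint, baseUrl: string = TTS_BASE_URL): string {
  return baseUrl.replace(/\/+$/, "") + TTS_ENDPOINTS[endpoint];
}
```

Typing the key as TtsEndpoint means a misspelled route name is caught at compile time instead of producing a 404 at runtime.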

Audio Pipeline

Text Processing: Clean and prepare text for TTS
Voice Synthesis: Generate audio using selected voice
Streaming: Progressive audio delivery to frontend
Playback: Browser audio playback with controls
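
This document does not specify which cleaning steps the Text Processing stage applies. As an illustration only, the sketch below assumes that assistant responses may contain Markdown and strips common markup before synthesis. prepareTextForTts is a hypothetical helper, not the service's actual implementation.

```typescript
// Hypothetical pre-processing step: remove markup that should not be read aloud.
function prepareTextForTts(raw: string): string {
  return raw
    .replace(/`{3}[\s\S]*?`{3}/g, " ")          // drop fenced code blocks entirely
    .replace(/`([^`]*)`/g, "$1")                // unwrap inline code, keep its text
    .replace(/\[([^\]]*)\]\([^)]*\)/g, "$1")    // keep link text, drop the URL
    .replace(/[*#>]+/g, " ")                    // strip emphasis, heading, and quote markers
    .replace(/\s+/g, " ")                       // collapse runs of whitespace
    .trim();
}
```

The order matters: fenced blocks are removed before inline code is unwrapped, so the backticks of a fence are never mistaken for inline code delimiters.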

API Reference

Get Available Voices

curl -X GET http://localhost:1956/tts/voices

Generate Speech

curl -X POST http://localhost:1956/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "voice": "Zephyr"}' \
  --output speech.wav

Stream Audio

curl -X POST http://localhost:1956/tts/stream \
  -H "Content-Type: application/json" \
  -d '{"text": "Streaming audio", "voice": "Kore"}' \
  --output stream.wav

Configuration

Voice Settings

Default voice can be configured per user:

const defaultVoice = 'Kore';
const userPreferredVoice = getUserPreference('tts-voice') || defaultVoice;

Quality Settings

Audio quality parameters:

Sample Rate: 22050 Hz (standard)
Bit Depth: 16-bit
Format: WAV (uncompressed)
Channels: Mono (single channel)
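
These parameters determine the size of the uncompressed audio: 22050 samples per second × 2 bytes per sample × 1 channel = 44100 bytes per second. The sketch below estimates file size from duration. It assumes a canonical 44-byte PCM WAV header, which may differ from what the service actually writes, and the helper names are hypothetical.

```typescript
interface PcmFormat {
  sampleRateHz: number;
  bitDepth: number;
  channels: number;
}

// Values from the quality settings listed above.
const TTS_FORMAT: PcmFormat = { sampleRateHz: 22050, bitDepth: 16, channels: 1 };

// Raw PCM data rate in bytes per second.
function bytesPerSecond(format: PcmFormat): number {
  return format.sampleRateHz * (format.bitDepth / 8) * format.channels;
}

// Estimated WAV size; headerBytes = 44 is an assumption (canonical PCM header).
function estimateWavBytes(
  seconds: number,
  format: PcmFormat = TTS_FORMAT,
  headerBytes: number = 44
): number {
  return headerBytes + Math.round(seconds * bytesPerSecond(format));
}
```

Under these assumptions, ten seconds of speech comes to 441044 bytes, or roughly 430 KiB.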

Usage Examples

Converting Chat Messages

// Convert assistant response to speech
// (generateTTS and playAudio are application helpers not shown in this document)
const convertToSpeech = async (messageText: string, voice: string) => {
  const audio = await generateTTS(messageText, voice);
  playAudio(audio);
};

Downloading Audio Files

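// Save message as audio file
const downloadSpeech = async (text: string, voice: string) => {
  const response = await fetch('/api/tts-stream/', {
    method: 'POST',
    // Declare the JSON body, matching the other requests in this document
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text, voice })
  });
  // Avoid saving an error payload as a .wav file
  if (!response.ok) {
    throw new Error(`TTS request failed with status ${response.status}`);
  }
  const audioBlob = await response.blob();

  // downloadFile is an application helper not shown in this document
  downloadFile(audioBlob, 'speech.wav');
};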

Performance Considerations

Optimization

Text Chunking: Long messages split into manageable chunks
Caching: Frequently used phrases cached for faster delivery
Progressive Loading: Audio streams while generating
Compression: Efficient audio encoding for faster transmission
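
How the service chunks text internally is not documented here. A minimal client-side sketch, assuming chunks are packed greedily at word boundaries and capped at the approximate 500-character limit noted under Limitations. chunkText is a hypothetical helper, and a production version might prefer sentence boundaries for more natural pauses.

```typescript
// Split text into chunks of at most maxLen characters, breaking at whitespace.
function chunkText(text: string, maxLen: number = 500): string[] {
  if (maxLen <= 0 || Math.floor(maxLen) !== maxLen) {
    throw new RangeError("maxLen must be a positive integer");
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  let current = "";

  for (let word of words) {
    // Hard-split any single word that cannot fit in a chunk by itself.
    while (word.length > maxLen) {
      if (current) {
        chunks.push(current);
        current = "";
      }
      chunks.push(word.slice(0, maxLen));
      word = word.slice(maxLen);
    }
    const candidate = current ? current + " " + word : word;
    if (candidate.length <= maxLen) {
      current = candidate;        // word still fits in the current chunk
    } else {
      chunks.push(current);       // flush and start a new chunk with this word
      current = word;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each chunk can then be sent as its own request and the resulting audio played back in order.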

Limitations

Text Length: Maximum ~500 characters per request
Rate Limiting: Requests are rate limited to prevent API abuse
Voice Availability: Some voices may have regional restrictions
Network Dependent: Streaming requires a stable network connection
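
Checking the text length on the client avoids a round trip for requests that would be rejected. The sketch below treats the approximate 500-character figure as a hard cap, which is an assumption since the exact server-side limit is not stated. validateTtsText and its result shape are hypothetical.

```typescript
// Approximate limit from the list above; the exact server-side value is not documented.
const MAX_TEXT_LENGTH = 500;

type ValidationResult = { ok: true } | { ok: false; reason: string };

// Hypothetical pre-flight check run before calling the TTS endpoint.
function validateTtsText(text: string, maxLen: number = MAX_TEXT_LENGTH): ValidationResult {
  const trimmed = text.trim();
  if (trimmed.length === 0) {
    return { ok: false, reason: "Text is empty" };
  }
  if (trimmed.length > maxLen) {
    return {
      ok: false,
      reason: `Text is ${trimmed.length} characters; limit is about ${maxLen}`,
    };
  }
  return { ok: true };
}
```

Returning a reason string rather than throwing lets the chat interface show the message next to the Play button.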

Troubleshooting

Common Issues

No Audio Playback

Check browser audio permissions
Verify volume settings
Test with a different voice

Slow Generation

Check network connection
Try shorter text segments
Use the default voice for faster processing

Voice Not Available

Verify the spelling of the voice name
Check the available voices endpoint (/tts/voices)
Fall back to the default voice (Kore)
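
The checks above can be combined into one client-side step. The sketch below matches the requested name case-insensitively against the known voices and falls back to Kore. Whether the backend itself accepts differently cased names is not stated here, so the helper returns the documented spelling. resolveVoice is a hypothetical name, and in practice the list could come from /tts/voices instead of being hard-coded.

```typescript
// Voice names as listed in this document.
const KNOWN_VOICES: string[] = ["Zephyr", "Puck", "Charon", "Kore", "Fenrir", "Leda"];
const DEFAULT_VOICE = "Kore";

// Hypothetical helper: returns the canonical voice name, or the fallback if unknown.
function resolveVoice(
  requested: string | null | undefined,
  available: string[] = KNOWN_VOICES,
  fallback: string = DEFAULT_VOICE
): string {
  if (!requested) return fallback;
  const wanted = requested.trim().toLowerCase();
  for (const name of available) {
    if (name.toLowerCase() === wanted) return name;
  }
  return fallback;
}
```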

API Test Commands

1. Health Check

curl -X GET http://localhost:1956/tts/health

2. Get Available Voices

curl -X GET http://localhost:1956/tts/voices

3. Generate Speech (Download WAV file)

# Basic test with default voice (Kore)
curl -X POST http://localhost:1956/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, this is a test of the text to speech system"}' \
  --output test_speech.wav

# Test with specific voice
curl -X POST http://localhost:1956/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world, this is Zephyr speaking", "voice": "Zephyr"}' \
  --output zephyr_speech.wav

# Test with longer text
curl -X POST http://localhost:1956/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "The quick brown fox jumps over the lazy dog. This is a longer sentence to test the text to speech capabilities.", "voice": "Puck"}' \
  --output long_speech.wav

4. Stream Audio (Get raw audio data)

# Stream audio directly
curl -X POST http://localhost:1956/tts/stream \
  -H "Content-Type: application/json" \
  -d '{"text": "This is streamed audio", "voice": "Kore"}' \
  --output stream_test.wav

# Stream with different voice
curl -X POST http://localhost:1956/tts/stream \
  -H "Content-Type: application/json" \
  -d '{"text": "Streaming with Charon voice", "voice": "Charon"}' \
  --output charon_stream.wav

5. Error Testing

# Test missing text field
curl -X POST http://localhost:1956/tts \
  -H "Content-Type: application/json" \
  -d '{"voice": "Kore"}'

# Test invalid voice
curl -X POST http://localhost:1956/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Testing invalid voice", "voice": "InvalidVoice"}'

# Test empty JSON
curl -X POST http://localhost:1956/tts \
  -H "Content-Type: application/json" \
  -d '{}'

# Test no JSON body
curl -X POST http://localhost:1956/tts

6. Batch Testing Different Voices

# Test multiple voices quickly (the ${voice,,} lowercase expansion requires bash 4 or later)
voices=("Zephyr" "Puck" "Charon" "Kore" "Fenrir" "Leda")
for voice in "${voices[@]}"; do
  echo "Testing voice: $voice"
  curl -X POST http://localhost:1956/tts \
    -H "Content-Type: application/json" \
    -d "{\"text\": \"Hello, this is $voice speaking\", \"voice\": \"$voice\"}" \
    --output "${voice,,}_test.wav"
  echo "Saved to ${voice,,}_test.wav"
done

7. Performance Testing

# Time the request
time curl -X POST http://localhost:1956/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Performance test message", "voice": "Kore"}' \
  --output performance_test.wav

# Test with verbose output to see response headers
curl -v -X POST http://localhost:1956/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Verbose test", "voice": "Kore"}' \
  --output verbose_test.wav

Expected Responses

Successful Voice List Response:

{
  "voices": ["Zephyr", "Puck", "Charon", ...],
  "default": "Kore"
}

Successful Health Check:

{
  "status": "healthy",
  "service": "TTS API"
}

Error Response (Invalid Voice):

{
  "error": "Invalid voice. Available voices: ['Zephyr', 'Puck', ...]"
}

Error Response (Missing Text):

{
  "error": "Text field is required"
}
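
On the client, these JSON shapes can be described with types and checked at runtime before use. The interfaces below mirror only the fields shown in the sample responses; the real payloads may carry additional fields, and the type guard names are hypothetical.

```typescript
interface VoiceListResponse {
  voices: string[];
  default: string;
}

interface HealthResponse {
  status: string;
  service: string;
}

interface ErrorResponse {
  error: string;
}

// True when the parsed JSON has a string "error" field.
function isErrorResponse(value: unknown): value is ErrorResponse {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as { error?: unknown }).error === "string"
  );
}

// True when the parsed JSON looks like the voice list sample above.
function isVoiceListResponse(value: unknown): value is VoiceListResponse {
  if (typeof value !== "object" || value === null) return false;
  const v = value as { voices?: unknown; default?: unknown };
  return (
    Array.isArray(v.voices) &&
    v.voices.every((x) => typeof x === "string") &&
    typeof v.default === "string"
  );
}
```

A caller would typically parse the body with JSON.parse or response.json(), test isErrorResponse first, and only then treat the value as a successful payload.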

Notes:

Replace localhost:1956 with your actual server address if different
WAV files will be saved to your current directory
Use the --silent flag to suppress curl progress output: curl --silent -X POST ...
Add the --fail flag to make curl return a non-zero exit code on HTTP errors
The generated WAV files should be playable in any audio player