Text-to-Speech (TTS) Feature

Overview

The TTS (Text-to-Speech) feature converts chat messages to audio using a choice of AI voices. It is integrated into the chat interface and supports both real-time streaming playback and downloadable WAV files.

Features

🎙️ Multiple Voice Options

  • Zephyr: Clear, professional voice
  • Puck: Friendly, casual tone
  • Charon: Deep, authoritative voice
  • Kore: Default balanced voice
  • Fenrir: Dynamic, energetic voice
  • Leda: Smooth, elegant voice
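
The voice list above can be mirrored on the client as a typed catalog, so that a voice selector and a request builder share one source of truth. This is only a sketch: `VOICES`, `Voice`, `DEFAULT_VOICE`, and `isVoice` are illustrative names that do not come from the service, and the authoritative list is whatever the voices endpoint returns at runtime.

```typescript
// Hypothetical client-side catalog mirroring the voices listed above.
const VOICES = {
  Zephyr: "Clear, professional voice",
  Puck: "Friendly, casual tone",
  Charon: "Deep, authoritative voice",
  Kore: "Default balanced voice",
  Fenrir: "Dynamic, energetic voice",
  Leda: "Smooth, elegant voice",
} as const;

// Union of the known voice names: "Zephyr" | "Puck" | ...
type Voice = keyof typeof VOICES;

const DEFAULT_VOICE: Voice = "Kore";

// Type guard: narrows an arbitrary string to a known voice name.
// Uses an own-property check so inherited keys like "toString" are rejected.
function isVoice(candidate: string): candidate is Voice {
  return Object.prototype.hasOwnProperty.call(VOICES, candidate);
}
```

A voice selector could render `Object.keys(VOICES)` as its options and use `isVoice` to validate a stored preference before sending it.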

🔄 Audio Formats

  • Streaming Audio: Real-time audio playback
  • WAV Downloads: High-quality downloadable files
  • Progressive Loading: Audio streams as it's generated

⚡ Integration Points

  • Chat Interface: Convert assistant responses to speech
  • Frontend API: /api/tts-stream/ endpoint for streaming
  • Backend Service: Direct TTS processing at localhost:1956

User Interface

TTS Controls in Chat

The TTS feature appears as audio controls in the chat interface:

  • Play Button: Convert message to speech and play
  • Voice Selector: Choose from available voices
  • Volume Control: Adjust playback volume
  • Download Option: Save audio as WAV file

Voice Selection

Users can select different voices for different use cases:

  • Presentations: Use Charon for authority
  • Casual Chat: Use Puck for friendliness
  • Default: Kore provides balanced tone

Technical Implementation

Frontend Integration

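// Request streamed speech for a message from the TTS streaming endpoint
const response = await fetch('/api/tts-stream/', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    text: messageText,
    voice: selectedVoice
  })
});

// Surface HTTP errors instead of trying to play an error payload
if (!response.ok) {
  throw new Error(`TTS request failed with status ${response.status}`);
}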
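
Because the endpoint streams, the response body can be consumed chunk by chunk rather than after the full payload has arrived. The sketch below assumes the body is a byte stream read through the standard `ReadableStream` reader interface. `ChunkReader`, `concatChunks`, and `collectChunks` are hypothetical helper names, not part of the frontend API.

```typescript
// Minimal structural type satisfied by the reader from response.body.getReader().
interface ChunkReader {
  read(): Promise<{ done: boolean; value?: Uint8Array }>;
}

// Join the received chunks into one contiguous buffer.
function concatChunks(chunks: Uint8Array[]): Uint8Array {
  const total = chunks.reduce((sum, chunk) => sum + chunk.length, 0);
  const joined = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    joined.set(chunk, offset);
    offset += chunk.length;
  }
  return joined;
}

// Read until the stream ends, reporting the running byte count after each chunk.
async function collectChunks(
  reader: ChunkReader,
  onProgress?: (bytesSoFar: number) => void,
): Promise<Uint8Array> {
  const chunks: Uint8Array[] = [];
  let received = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    if (value) {
      chunks.push(value);
      received += value.length;
      if (onProgress) onProgress(received);
    }
  }
  return concatChunks(chunks);
}
```

In the browser this could be called as `collectChunks(response.body.getReader())` once the response status has been checked. A player that starts before the stream ends would hand each chunk to the audio layer inside the loop instead of buffering everything.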

Backend Processing

The backend exposes four endpoints:

  • GET /tts/health - Service health check
  • GET /tts/voices - Available voice list
  • POST /tts - Generate downloadable WAV
  • POST /tts/stream - Real-time audio streaming
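
The endpoint list can be captured in a small typed map so that callers never hand-assemble URLs. The paths come from the list above and the HTTP methods from the curl commands in this document. The names `TTS_ENDPOINTS`, `EndpointName`, and `endpointUrl` are assumptions made for illustration, as is the default base URL, which reuses the localhost:1956 address shown elsewhere.

```typescript
// Paths from the endpoint list; methods as used by the curl examples.
const TTS_ENDPOINTS = {
  health: { method: "GET", path: "/tts/health" },
  voices: { method: "GET", path: "/tts/voices" },
  generate: { method: "POST", path: "/tts" },
  stream: { method: "POST", path: "/tts/stream" },
} as const;

type EndpointName = keyof typeof TTS_ENDPOINTS;

// Build a full URL for an endpoint. The default base URL is an assumption.
function endpointUrl(
  endpoint: EndpointName,
  baseUrl: string = "http://localhost:1956",
): string {
  // Strip trailing slashes so "http://host/" and "http://host" behave the same.
  return baseUrl.replace(/\/+$/, "") + TTS_ENDPOINTS[endpoint].path;
}
```

For example, `endpointUrl("stream")` yields `http://localhost:1956/tts/stream`, and passing a second argument points the same call at another host.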

Audio Pipeline

  1. Text Processing: Clean and prepare text for TTS
  2. Voice Synthesis: Generate audio using selected voice
  3. Streaming: Progressive audio delivery to frontend
  4. Playback: Browser audio playback with controls
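
Step 1 is the only stage that operates purely on text, so it is the easiest to illustrate. The cleanup rules the service actually applies are not documented here. The function below is a hypothetical `prepareText` helper showing the kind of normalization a chat message might need before synthesis, such as removing markdown markers that would otherwise be read aloud.

```typescript
// Hypothetical cleanup for the "Text Processing" step.
// The real service's rules may differ; these are illustrative only.
function prepareText(raw: string): string {
  return raw
    // Strip heading (#) and blockquote (>) markers at the start of a line.
    .replace(/^\s*(#{1,6}|>)\s*/gm, "")
    // Keep the visible text of markdown links, drop the URL.
    .replace(/\[([^\]]+)\]\([^)]*\)/g, "$1")
    // Remove emphasis asterisks.
    .replace(/\*+/g, "")
    // Collapse runs of whitespace, including newlines, into single spaces.
    .replace(/\s+/g, " ")
    .trim();
}
```

Applied to a message like a heading followed by bold text and a link, this yields one plain sentence-like string with the markup removed and the link text kept.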

API Reference

Get Available Voices

curl -X GET http://localhost:1956/tts/voices

Generate Speech

curl -X POST http://localhost:1956/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "voice": "Zephyr"}' \
  --output speech.wav

Stream Audio

curl -X POST http://localhost:1956/tts/stream \
  -H "Content-Type: application/json" \
  -d '{"text": "Streaming audio", "voice": "Kore"}' \
  --output stream.wav

Configuration

Voice Settings

Default voice can be configured per user:

const defaultVoice = 'Kore';
const userPreferredVoice = getUserPreference('tts-voice') || defaultVoice;

Quality Settings

Audio quality parameters:

  • Sample Rate: 22050 Hz (standard)
  • Bit Depth: 16-bit
  • Format: WAV (uncompressed)
  • Channels: Mono (single channel)
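
These parameters fix the data rate of the audio: 22050 samples per second at 2 bytes per sample on one channel is 44100 bytes per second. The sketch below turns that arithmetic into code and builds the canonical 44-byte RIFF/WAVE header for PCM data with these settings. It assumes the standard minimal header layout, whereas files produced by the service may carry additional chunks. All names are illustrative.

```typescript
const SAMPLE_RATE = 22050;   // Hz
const BITS_PER_SAMPLE = 16;
const CHANNELS = 1;          // mono

// Bytes of PCM data per second of audio: 22050 * 1 * 2 = 44100.
const BYTES_PER_SECOND = SAMPLE_RATE * CHANNELS * (BITS_PER_SAMPLE / 8);

// Build the canonical 44-byte RIFF/WAVE header for `dataLength` bytes of PCM.
function buildWavHeader(dataLength: number): Uint8Array {
  const header = new DataView(new ArrayBuffer(44));
  const writeTag = (offset: number, tag: string): void => {
    for (let i = 0; i < tag.length; i++) {
      header.setUint8(offset + i, tag.charCodeAt(i));
    }
  };
  writeTag(0, "RIFF");
  header.setUint32(4, 36 + dataLength, true);   // file size minus the first 8 bytes
  writeTag(8, "WAVE");
  writeTag(12, "fmt ");
  header.setUint32(16, 16, true);               // fmt chunk size for PCM
  header.setUint16(20, 1, true);                // audio format 1 = linear PCM
  header.setUint16(22, CHANNELS, true);
  header.setUint32(24, SAMPLE_RATE, true);
  header.setUint32(28, BYTES_PER_SECOND, true); // byte rate
  header.setUint16(32, CHANNELS * (BITS_PER_SAMPLE / 8), true); // block align
  header.setUint16(34, BITS_PER_SAMPLE, true);
  writeTag(36, "data");
  header.setUint32(40, dataLength, true);
  return new Uint8Array(header.buffer);
}

// Approximate playback duration of a WAV file of the given total size,
// assuming a 44-byte header.
function wavDurationSeconds(fileSizeBytes: number): number {
  return Math.max(0, fileSizeBytes - 44) / BYTES_PER_SECOND;
}
```

At this rate one minute of speech is about 2.6 MB uncompressed, which is worth keeping in mind when offering downloads of long messages.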

Usage Examples

Converting Chat Messages

// Convert assistant response to speech
const convertToSpeech = async (messageText: string, voice: string) => {
  const audio = await generateTTS(messageText, voice);
  playAudio(audio);
};

Downloading Audio Files

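// Save message as audio file
const downloadSpeech = async (text: string, voice: string) => {
  const response = await fetch('/api/tts-stream/', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text, voice })
  });
  if (!response.ok) {
    throw new Error(`TTS request failed with status ${response.status}`);
  }

  // Buffer the full response before handing it to the download helper
  const audioBlob = await response.blob();
  downloadFile(audioBlob, 'speech.wav');
};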

Performance Considerations

Optimization

  • Text Chunking: Long messages split into manageable chunks
  • Caching: Frequently used phrases cached for faster delivery
  • Progressive Loading: Audio streams while generating
  • Compression: Efficient audio encoding for faster transmission
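
How the service caches phrases is not described here. The caching idea can still be illustrated on the client, where replaying the same message in the same voice should not need a second request. The `AudioCache` class below is a hypothetical, in-memory, least-recently-used cache keyed by voice and text. Its name, its size limit, and its eviction policy are all assumptions.

```typescript
// Hypothetical client-side cache; not the service's own caching layer.
class AudioCache {
  private entries = new Map<string, Uint8Array>();
  private maxEntries: number;

  constructor(maxEntries: number = 50) {
    this.maxEntries = maxEntries;
  }

  // The NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
  private static key(text: string, voice: string): string {
    return voice + "\u0000" + text;
  }

  get(text: string, voice: string): Uint8Array | undefined {
    const key = AudioCache.key(text, voice);
    const hit = this.entries.get(key);
    if (hit !== undefined) {
      // Re-insert so the entry counts as most recently used.
      this.entries.delete(key);
      this.entries.set(key, hit);
    }
    return hit;
  }

  set(text: string, voice: string, audio: Uint8Array): void {
    const key = AudioCache.key(text, voice);
    this.entries.delete(key);
    this.entries.set(key, audio);
    if (this.entries.size > this.maxEntries) {
      // Map iterates in insertion order, so the first key is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

A caller would check `get(text, voice)` before issuing a request and call `set` with the received bytes afterwards. Because audio buffers are large, a byte-based limit may be a better fit than an entry count in practice.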

Limitations

  • Text Length: Maximum ~500 characters per request
  • Rate Limiting: Requests are rate limited to prevent API abuse
  • Voice Availability: Some voices may have regional restrictions
  • Network Dependent: Requires a stable network connection for streaming
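
Given the limit of roughly 500 characters per request, longer messages have to be split before they are sent. The sketch below shows one way to do that on the client: prefer sentence boundaries, fall back to word boundaries for an oversized sentence, and hard-cut only when a single word exceeds the limit. `MAX_CHARS`, `splitOversized`, and `chunkText` are illustrative names, and the service's own chunking may work differently.

```typescript
const MAX_CHARS = 500; // approximate per-request limit from the list above

// Break a piece that exceeds the limit into words, hard-cutting any word
// that is itself longer than the limit.
function splitOversized(piece: string, limit: number): string[] {
  if (piece.length <= limit) return [piece];
  const parts: string[] = [];
  for (const word of piece.split(/\s+/)) {
    for (let i = 0; i < word.length; i += limit) {
      parts.push(word.slice(i, i + limit));
    }
  }
  return parts;
}

// Split text into chunks of at most `limit` characters.
function chunkText(text: string, limit: number = MAX_CHARS): string[] {
  if (limit < 1) throw new RangeError("limit must be at least 1");
  // A sentence ends at . ! or ? followed by whitespace or end of input,
  // so decimals such as 3.14 are not split.
  const sentences = text.match(/\S[\s\S]*?(?:[.!?]+(?=\s|$)|$)/g) || [];
  const pieces: string[] = [];
  for (const sentence of sentences) {
    for (const part of splitOversized(sentence.trim(), limit)) pieces.push(part);
  }
  // Greedily pack pieces into chunks, joining with single spaces.
  const chunks: string[] = [];
  let current = "";
  for (const piece of pieces) {
    const candidate = current ? current + " " + piece : piece;
    if (candidate.length <= limit) {
      current = candidate;
    } else {
      chunks.push(current);
      current = piece;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each chunk can then be sent as its own request and the resulting audio played back in order. Whitespace between sentences is normalized to a single space in the output.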

Troubleshooting

Common Issues

No Audio Playback

  • Check browser audio permissions
  • Verify volume settings
  • Test with a different voice

Slow Generation

  • Check network connection
  • Try shorter text segments
  • Use the default voice for faster processing

Voice Not Available

  • Verify the spelling and capitalization of the voice name
  • Check the /tts/voices endpoint for the current list
  • Fall back to the default voice
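
The three checks above can be folded into one client-side helper that runs before a request is sent. This sketch assumes the server expects voice names exactly as they appear in the voice list, so it matches the requested name case-insensitively and returns the canonical spelling, or the fallback when nothing matches. `resolveVoice` is a hypothetical name.

```typescript
// Match a requested voice against the available list, ignoring case and
// surrounding whitespace. Returns the canonical spelling, or the fallback.
function resolveVoice(
  requested: string | null | undefined,
  available: string[],
  fallback: string,
): string {
  if (!requested) return fallback;
  const wanted = requested.trim().toLowerCase();
  for (const voice of available) {
    if (voice.toLowerCase() === wanted) return voice;
  }
  return fallback;
}
```

The `available` and `fallback` arguments would typically come from the `voices` and `default` fields of the voice list response.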

API Test Commands

1. Health Check

curl -X GET http://localhost:1956/tts/health

2. Get Available Voices

curl -X GET http://localhost:1956/tts/voices

3. Generate Speech (Download WAV file)

# Basic test with default voice (Kore)
curl -X POST http://localhost:1956/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, this is a test of the text to speech system"}' \
  --output test_speech.wav

# Test with specific voice
curl -X POST http://localhost:1956/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world, this is Zephyr speaking", "voice": "Zephyr"}' \
  --output zephyr_speech.wav

# Test with longer text
curl -X POST http://localhost:1956/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "The quick brown fox jumps over the lazy dog. This is a longer sentence to test the text to speech capabilities.", "voice": "Puck"}' \
  --output long_speech.wav

4. Stream Audio (Get raw audio data)

# Stream audio directly
curl -X POST http://localhost:1956/tts/stream \
  -H "Content-Type: application/json" \
  -d '{"text": "This is streamed audio", "voice": "Kore"}' \
  --output stream_test.wav

# Stream with different voice
curl -X POST http://localhost:1956/tts/stream \
  -H "Content-Type: application/json" \
  -d '{"text": "Streaming with Charon voice", "voice": "Charon"}' \
  --output charon_stream.wav

5. Error Testing

# Test missing text field
curl -X POST http://localhost:1956/tts \
  -H "Content-Type: application/json" \
  -d '{"voice": "Kore"}'

# Test invalid voice
curl -X POST http://localhost:1956/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Testing invalid voice", "voice": "InvalidVoice"}'

# Test empty JSON
curl -X POST http://localhost:1956/tts \
  -H "Content-Type: application/json" \
  -d '{}'

# Test no JSON body
curl -X POST http://localhost:1956/tts

6. Batch Testing Different Voices

# Test multiple voices quickly (the ${voice,,} lowercase expansion requires bash 4 or newer)
voices=("Zephyr" "Puck" "Charon" "Kore" "Fenrir" "Leda")
for voice in "${voices[@]}"; do
  echo "Testing voice: $voice"
  curl -X POST http://localhost:1956/tts \
    -H "Content-Type: application/json" \
    -d "{\"text\": \"Hello, this is $voice speaking\", \"voice\": \"$voice\"}" \
    --output "${voice,,}_test.wav"
  echo "Saved to ${voice,,}_test.wav"
done

7. Performance Testing

# Time the request
time curl -X POST http://localhost:1956/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Performance test message", "voice": "Kore"}' \
  --output performance_test.wav

# Test with verbose output to see response headers
curl -v -X POST http://localhost:1956/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Verbose test", "voice": "Kore"}' \
  --output verbose_test.wav

Expected Responses

Successful Voice List Response:

{
  "voices": ["Zephyr", "Puck", "Charon", ...],
  "default": "Kore"
}

Successful Health Check:

{
  "status": "healthy",
  "service": "TTS API"
}

Error Response (Invalid Voice):

{
  "error": "Invalid voice. Available voices: ['Zephyr', 'Puck', ...]"
}

Error Response (Missing Text):

{
  "error": "Text field is required"
}
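
A client can describe these payloads with types and check them at runtime before use. The interfaces below are inferred only from the sample responses above, and real payloads may carry additional fields. The type names and guard functions are illustrative.

```typescript
// Shapes inferred from the sample responses; real payloads may have more fields.
interface VoiceListResponse {
  voices: string[];
  default: string;
}

interface HealthResponse {
  status: string;
  service: string;
}

interface ErrorResponse {
  error: string;
}

// True when the payload looks like { "error": "..." }.
function isErrorResponse(payload: unknown): payload is ErrorResponse {
  return (
    typeof payload === "object" &&
    payload !== null &&
    typeof (payload as { error?: unknown }).error === "string"
  );
}

// True when the payload has a string array of voices and a string default.
function isVoiceListResponse(payload: unknown): payload is VoiceListResponse {
  if (typeof payload !== "object" || payload === null) return false;
  const candidate = payload as { voices?: unknown; default?: unknown };
  return (
    Array.isArray(candidate.voices) &&
    candidate.voices.every((v: unknown) => typeof v === "string") &&
    typeof candidate.default === "string"
  );
}

// True when the payload is a health response reporting "healthy".
function isHealthy(payload: unknown): payload is HealthResponse {
  if (typeof payload !== "object" || payload === null) return false;
  const candidate = payload as { status?: unknown; service?: unknown };
  return candidate.status === "healthy" && typeof candidate.service === "string";
}
```

Checking `isErrorResponse` first on any JSON body gives a single place to surface the `error` message to the user.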

Notes:

  • Replace localhost:1956 with your actual server address if it differs
  • WAV files are saved to your current directory
  • Use the --silent flag to suppress curl progress output: curl --silent -X POST ...
  • Add the --fail flag to make curl return a non-zero exit code on HTTP errors
  • The generated WAV files should play in any standard audio player