TTS

Table of Contents

  1. Overview
  2. Querying Available Voices
  3. Voice Preview
  4. Real-Time TTS (WebSocket)
  5. Historical TTS (SSE)
  6. Broadcast TTS
  7. Word Boundary Karaoke Effect
  8. TTS Settings Management
  9. Related Reference Documents

Overview

VAS provides text-to-speech (TTS) synthesis, which converts translated text into playable audio. The system supports 154 languages (locales) and 325 voices in total, fully aligned with the speech provider's Monolingual Neural Voice set.

Supported Languages

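| Language Code | Language Name | Number of Voices |
| --- | --- | --- |
| zh-TW | Traditional Chinese | 3 |
| zh-CN | Simplified Chinese | 4 |
| en-US | English (United States) | 6 |
| en-GB | English (United Kingdom) | 3 |
| ja-JP | Japanese | 4 |
| ko-KR | Korean | 4 |
| fr-FR | French | 3 |
| de-DE | German | 3 |
| es-ES | Spanish | 3 |
| it-IT | Italian | 3 |
| pt-BR | Portuguese (Brazil) | 3 |
| th-TH | Thai | 3 |
| vi-VN | Vietnamese | 2 |
| id-ID | Indonesian | 2 |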

The table above summarizes the most popular locales (14 locales, 46 voices in total). For the complete set of 154 locales and 325 voices, treat GET /api/v1/tts/voices?language={code} as the authoritative source.

Key Features

  • Multi-scenario support: real-time TTS (WebSocket), historical TTS (SSE), and broadcast TTS
  • Word Boundary: each word carries a precise timestamp, enabling a word-by-word karaoke highlight effect
  • Sync/async modes: sync mode automatically plays the latest translation, while async mode gives you manual control over playback
  • Multilingual broadcast TTS: in broadcast mode, you can configure a separate TTS voice for each translation language

Authentication

All TTS-related REST APIs require API Key authentication. See Authentication for details.


Querying Available Voices

Before using TTS, query which voices are available for a given language.

Supported Languages

VAS currently supports TTS in 154 languages. For the complete list, see Appendix - Supported Languages.

Getting the Voice List for a Specific Language

GET https://vas-poc.vurbo.ai/api/v1/tts/voices?language={language}

Example: querying English voices

curl -X GET "https://vas-poc.vurbo.ai/api/v1/tts/voices?language=en-US" \
  -H "X-API-Key: YOUR_API_KEY"

Key response fields:

| Field | Description |
| --- | --- |
| voice_name | Voice identifier, used in API calls |
| display_name | Voice display name |
| gender | Gender: Female / Male |
| is_default | Whether this is the default voice for the language |
| sample_url | Preview audio URL |

For the complete parameter and response formats, see TTS REST API.
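
As a rough sketch of how a client might use these fields, the helper below queries the list and picks a voice. The function names are hypothetical, and this guide does not show the response envelope, so pickVoice simply expects an array of voice objects with the fields listed above; adapt the unwrapping to the actual response described in TTS REST API.

```javascript
const VOICES_ENDPOINT = 'https://vas-poc.vurbo.ai/api/v1/tts/voices';

// Fetch the raw voice-list response for one language.
// The shape of the parsed JSON is not specified in this guide.
async function fetchVoices(language, apiKey) {
  const url = new URL(VOICES_ENDPOINT);
  url.searchParams.set('language', language);
  const response = await fetch(url, { headers: { 'X-API-Key': apiKey } });
  if (!response.ok) {
    throw new Error(`Voice query failed: HTTP ${response.status}`);
  }
  return response.json();
}

// Pick a voice from an array of voice objects (assumed shape): prefer the
// requested gender, then the language default, then the first entry.
function pickVoice(voices, preferredGender) {
  if (!Array.isArray(voices) || voices.length === 0) return null;
  const byGender = preferredGender
    ? voices.filter((v) => v.gender === preferredGender)
    : voices;
  const pool = byGender.length > 0 ? byGender : voices;
  return pool.find((v) => v.is_default) ?? pool[0];
}
```

The voice_name of the selected entry is the value to pass as tts_voice or as the voice field of tts_config.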


Voice Preview

After querying the voice list, you can use the sample URL to preview how each voice sounds.

GET https://vas-poc.vurbo.ai/api/v1/tts/voices/{voiceName}/sample

Key points:

  • The response is binary MP3 audio data (not JSON)
  • The first request synthesizes the audio on the fly and caches it; subsequent requests return directly from the cache
  • Does not count toward TTS usage charges
  • Rate limit: 30 requests per minute per user

Frontend preview example:

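// Play directly with an Audio element
const audio = new Audio(
  'https://vas-poc.vurbo.ai/api/v1/tts/voices/en-US-JennyNeural/sample'
);
// play() returns a Promise that rejects if the browser blocks playback
audio.play().catch((err) => console.error('Preview playback failed:', err));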
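
An Audio element cannot attach custom request headers. If the sample endpoint enforces the X-API-Key header like the other TTS REST APIs, one possible approach is to fetch the MP3 yourself and play it from an object URL. The sketch below assumes exactly that; buildSampleUrl, previewVoice, and the in-memory cache are illustrative names, not part of the VAS API. The cache also keeps repeated previews from counting against the 30 requests per minute limit.

```javascript
const SAMPLE_BASE = 'https://vas-poc.vurbo.ai/api/v1/tts/voices';

// Build the sample URL for a voice name, escaping it for use in a path.
function buildSampleUrl(voiceName) {
  return `${SAMPLE_BASE}/${encodeURIComponent(voiceName)}/sample`;
}

// Object URLs already fetched in this page, keyed by voice name.
const previewCache = new Map();

// Fetch the sample with the API key (assumed to be required) and play it.
// Browser-only: relies on Audio and URL.createObjectURL.
async function previewVoice(voiceName, apiKey) {
  let objectUrl = previewCache.get(voiceName);
  if (!objectUrl) {
    const response = await fetch(buildSampleUrl(voiceName), {
      headers: { 'X-API-Key': apiKey },
    });
    if (!response.ok) {
      throw new Error(`Preview failed: HTTP ${response.status}`);
    }
    objectUrl = URL.createObjectURL(await response.blob());
    previewCache.set(voiceName, objectUrl);
  }
  const audio = new Audio(objectUrl);
  await audio.play();
  return audio;
}
```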

Real-Time TTS (WebSocket)

Real-time TTS converts translation results into speech on the fly while speech recognition is in progress. It is sent and received over WebSocket.

Enabling TTS

Add the TTS-related parameters to the start action:

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "transcription_languages": ["zh-TW"],
    "translation_languages": ["en-US"],
    "realtime_translation": true,
    "type": "transcribe",
    "tts_enabled": true,
    "tts_language": "en-US",
    "tts_voice": "en-US-JennyNeural",
    "tts_mode": "sync"
  }
}

| Parameter | Description |
| --- | --- |
| tts_enabled | Set to true to enable TTS |
| tts_language | TTS output language (must be in translation_languages) |
| tts_voice | TTS voice name (e.g., en-US-JennyNeural) |
| tts_mode | sync (synchronous, automatic playback) or async (asynchronous, manual control) |
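
To catch configuration mistakes before they reach the server, a client could assemble the start message through a small builder. This is only a sketch: buildStartMessage and its option names are hypothetical, and the checks simply mirror the constraints listed in the table above.

```javascript
// Build a start message with TTS enabled, checking the documented constraints:
// tts_language must be in translation_languages, tts_mode is sync or async.
function buildStartMessage({
  transcriptionLanguages,
  translationLanguages,
  ttsLanguage,
  ttsVoice,
  ttsMode = 'sync',
}) {
  if (!translationLanguages.includes(ttsLanguage)) {
    throw new Error(`tts_language ${ttsLanguage} is not in translation_languages`);
  }
  if (ttsMode !== 'sync' && ttsMode !== 'async') {
    throw new Error(`tts_mode must be "sync" or "async", got ${ttsMode}`);
  }
  return {
    type: 'voice-translation',
    data: {
      action: 'start',
      transcription_languages: transcriptionLanguages,
      translation_languages: translationLanguages,
      realtime_translation: true,
      type: 'transcribe',
      tts_enabled: true,
      tts_language: ttsLanguage,
      tts_voice: ttsVoice,
      tts_mode: ttsMode,
    },
  };
}
```

The result can be sent with ws.send(JSON.stringify(message)) once the WebSocket is open.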

Synchronous Mode (sync)

  • The system automatically plays the latest translated sentence with is_final=true
  • If the previous sentence is still playing, subsequent sentences enter a queue and wait
  • Suitable for scenarios that do not require manual control
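
This guide does not say whether the queue described above lives on the server or in the client. If tts_ready events can arrive while earlier audio is still playing, a client-side queue along these lines keeps sentences from overlapping. PlaybackQueue is a hypothetical helper, and the play function it wraps is supplied by the caller.

```javascript
// Serializes playback: each item starts only after the previous one finished.
// `play` is a caller-supplied function that returns a Promise which resolves
// when playback of one item has ended.
class PlaybackQueue {
  constructor(play) {
    this.play = play;
    this.tail = Promise.resolve();
  }

  // Add an item; the returned Promise settles when this item has finished.
  enqueue(item) {
    const run = this.tail.then(() => this.play(item));
    // Keep the chain alive even if one item fails to play.
    this.tail = run.catch(() => {});
    return run;
  }
}
```

In a browser, play could create an Audio element for the decoded MP3 and resolve on its ended event; each tts_ready handler would then call queue.enqueue(msg.data) instead of playing immediately.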

Asynchronous Mode (async)

You can manually select any translated sentence for TTS playback. Repeated requests for the same sid (replay) are supported.

Conversation mode (conversation) also supports tts_mode: "async". Once set, tts_ready is not pushed automatically when translation completes; you must trigger it manually with tts_play. In conversation mode, the system automatically synthesizes the corresponding language based on tts_config.

Playing a specific sentence:

{
  "type": "voice-translation",
  "data": {
    "action": "tts_play",
    "sid": 5
  }
}

Playing multiple sentences (play 3 sentences starting from sid 5):

{
  "type": "voice-translation",
  "data": {
    "action": "tts_play",
    "sid": 5,
    "length": 3
  }
}

The maximum value of length is 20 (controlled by the backend TTS_SSE_MAX_LENGTH).
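
A small builder can enforce the length limit on the client before sending. The sketch below is an assumption-laden convenience, not part of the API: buildTtsPlay is a hypothetical name, and it clamps length to the 1 to 20 range and omits the field for a single sentence, mirroring the first example above.

```javascript
const TTS_PLAY_MAX_LENGTH = 20; // documented upper bound for `length`

// Build a tts_play message. `length` is clamped to 1..20 and left out when it
// is 1, matching the single-sentence form of the message.
function buildTtsPlay(sid, length = 1) {
  if (!Number.isInteger(sid)) {
    throw new TypeError('sid must be an integer');
  }
  if (!Number.isInteger(length)) {
    throw new TypeError('length must be an integer');
  }
  const clamped = Math.min(Math.max(length, 1), TTS_PLAY_MAX_LENGTH);
  const data = { action: 'tts_play', sid };
  if (clamped > 1) data.length = clamped;
  return { type: 'voice-translation', data };
}
```

For example, ws.send(JSON.stringify(buildTtsPlay(5, 3))) sends the multi-sentence request shown above.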

Stopping playback:

{
  "type": "voice-translation",
  "data": {
    "action": "tts_stop"
  }
}

Receiving TTS Audio

When TTS synthesis completes, the server pushes a tts_ready event:

{
  "type": "voice-translation",
  "data": {
    "action": "tts_ready",
    "sid": 1,
    "language": "en-US",
    "transcript": "你好,很高興認識你",
    "text": "Hello, nice to meet you",
    "audio": "Base64EncodedMP3...",
    "format": "mp3",
    "duration_ms": 2500,
    "boundaries": [
      {"offset_ms": 0, "duration_ms": 350, "text_offset": 0, "word_length": 5, "text": "Hello"},
      {"offset_ms": 500, "duration_ms": 250, "text_offset": 7, "word_length": 4, "text": "nice"},
      {"offset_ms": 750, "duration_ms": 200, "text_offset": 12, "word_length": 2, "text": "to"},
      {"offset_ms": 950, "duration_ms": 350, "text_offset": 15, "word_length": 4, "text": "meet"},
      {"offset_ms": 1300, "duration_ms": 300, "text_offset": 20, "word_length": 3, "text": "you"}
    ]
  }
}

Frontend playback example:

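ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.data?.action === 'tts_ready') {
    // boundaries and text drive the karaoke effect (see Word Boundary Karaoke Effect)
    const { audio, boundaries, text } = msg.data;

    // Decode the Base64 payload into raw bytes, then wrap them in a Blob
    const byteChars = atob(audio);
    const byteArray = new Uint8Array(byteChars.length);
    for (let i = 0; i < byteChars.length; i++) {
      byteArray[i] = byteChars.charCodeAt(i);
    }
    // audio/mpeg is the registered MIME type for MP3
    const blob = new Blob([byteArray], { type: 'audio/mpeg' });
    const objectUrl = URL.createObjectURL(blob);
    const audioEl = new Audio(objectUrl);
    // Release the object URL once playback finishes
    audioEl.addEventListener('ended', () => URL.revokeObjectURL(objectUrl));
    // play() rejects if the browser blocks autoplay
    audioEl.play().catch((err) => console.error('TTS playback failed:', err));
  }
};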

Switching TTS Mode

You can dynamically switch between synchronous and asynchronous mode while recording is in progress:

{
  "type": "voice-translation",
  "data": {
    "action": "tts_mode",
    "tts_mode": "async"
  }
}

Historical TTS (SSE)

Historical TTS is used to play back the translated audio of a completed recording, streaming the audio sentence by sentence over SSE.

Request Format

GET https://vas-poc.vurbo.ai/api/v1/sse/tts/{taskId}?language={language}&voice={voice}&sid={sid}&length={length}

| Parameter | Required | Description |
| --- | --- | --- |
| taskId | Yes | Recording ID |
| language | Yes | TTS output language (e.g., en-US) |
| voice | No | Specific voice name (e.g., en-US-JennyNeural) |
| sid | No | Starting sentence ID (default 1) |
| length | No | Number of sentences to return (default 1, max 20) |

Event Sequence

connected  ->  tts_audio (repeated N times)  ->  tts_done
  1. connected: connection confirmation, including voice information
  2. tts_audio: TTS audio sent sentence by sentence (including Word Boundary)
  3. tts_done: all sentences have been sent

Multi-Sentence Playback Example

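async function playTTS(taskId, language, apiKey, startSid = 1, length = 3) {
  const url = new URL(
    `https://vas-poc.vurbo.ai/api/v1/sse/tts/${encodeURIComponent(taskId)}`
  );
  url.searchParams.set('language', language);
  url.searchParams.set('sid', String(startSid));
  url.searchParams.set('length', String(length));

  const response = await fetch(url, {
    headers: { 'X-API-Key': apiKey }
  });
  if (!response.ok) {
    throw new Error(`TTS request failed: HTTP ${response.status}`);
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // One SSE event can span several network chunks, so buffer the stream and
    // parse only complete events (each is terminated by a blank line).
    buffer = (buffer + decoder.decode(value, { stream: true })).replace(/\r\n/g, '\n');
    const end = buffer.lastIndexOf('\n\n');
    if (end === -1) continue;
    const events = parseSSE(buffer.slice(0, end + 2));
    buffer = buffer.slice(end + 2);

    for (const event of events) {
      if (event.type === 'tts_done') {
        await reader.cancel();
        return;
      }
      if (event.type === 'tts_audio') {
        // Play the audio and set up the karaoke effect
        const blob = base64ToBlob(event.data.audio, 'audio/mpeg');
        const audio = new Audio(URL.createObjectURL(blob));
        setupKaraoke(audio, event.data.boundaries, event.data.text);
        // Wait for this sentence to finish so that sentences do not overlap
        await new Promise((resolve) => {
          audio.addEventListener('ended', resolve);
          audio.addEventListener('error', resolve);
          audio.play().catch(resolve);
        });
      }
    }
  }
}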

Note: the browser's native EventSource does not support custom headers, so use the fetch API together with ReadableStream.
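
The example above calls parseSSE and base64ToBlob without defining them. One possible implementation is sketched here. It assumes the server names each event with an event: field and sends a JSON payload in the data: field, which this guide implies but does not spell out, and it expects parseSSE to be given only complete events.

```javascript
// Parse text holding one or more complete SSE events (separated by blank
// lines) into { type, data } objects. JSON payloads are parsed; anything
// else is returned as the raw string.
function parseSSE(text) {
  const events = [];
  for (const block of text.split(/\r?\n\r?\n/)) {
    let type = 'message'; // SSE default when no event: field is present
    const dataLines = [];
    for (const line of block.split(/\r?\n/)) {
      if (line.startsWith('event:')) {
        type = line.slice(6).trim();
      } else if (line.startsWith('data:')) {
        dataLines.push(line.slice(5).replace(/^ /, ''));
      }
    }
    if (dataLines.length === 0) continue; // comment or empty block
    const raw = dataLines.join('\n');
    let data;
    try {
      data = JSON.parse(raw);
    } catch {
      data = raw;
    }
    events.push({ type, data });
  }
  return events;
}

// Decode a Base64 string into a Blob with the given MIME type.
function base64ToBlob(base64, mimeType) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return new Blob([bytes], { type: mimeType });
}
```

Because a network chunk can end in the middle of an event, callers should buffer the decoded stream and pass parseSSE only the portion that ends with a blank line.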


Broadcast TTS

TTS in broadcast mode lets the viewer side receive translated audio. TTS audio is pushed to viewers over SSE.

Host-Side Configuration

Use tts_config to specify which languages have TTS enabled. You can set it either when creating the broadcast through the REST API or in the WebSocket start action.

Configuring when creating a broadcast (REST API):

curl -X POST "https://vas-poc.vurbo.ai/api/v1/broadcasts" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "transcription_language": "zh-TW",
    "translation_languages": ["en-US", "ja-JP"],
    "tts_config": {
      "en-US": {"voice": "en-US-JennyNeural", "speaking_rate": 1.0},
      "ja-JP": {"voice": "ja-JP-NanamiNeural", "speaking_rate": 1.0}
    }
  }'

Configuring at WebSocket start:

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "type": "broadcast",
    "broadcast_token": "YOUR_BROADCAST_TOKEN",
    "audio_format": "pcm",
    "tts_config": {
      "en-US": {
        "voice": "en-US-JennyNeural",
        "speaking_rate": 1.0
      },
      "ja-JP": {
        "voice": "ja-JP-NanamiNeural",
        "speaking_rate": 1.0
      }
    }
  }
}

tts_config Parameters

| Field | Type | Description |
| --- | --- | --- |
| voice | string | TTS voice name |
| speaking_rate | number | Speaking rate (0.5 to 2.0, default 1.0) |
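
Before sending a tts_config, a client might normalize it locally. The sketch below uses a hypothetical helper, normalizeTtsConfig, that fills in the default speaking rate, rejects rates outside the documented 0.5 to 2.0 range, and drops languages that are not in translation_languages, which the server is documented to ignore anyway.

```javascript
const MIN_RATE = 0.5;
const MAX_RATE = 2.0;

// Return a cleaned copy of ttsConfig that only contains languages present in
// translationLanguages, with speaking_rate defaulted to 1.0 and range-checked.
function normalizeTtsConfig(ttsConfig, translationLanguages) {
  const result = {};
  for (const [language, settings] of Object.entries(ttsConfig)) {
    if (!translationLanguages.includes(language)) continue;
    const rate = settings.speaking_rate ?? 1.0;
    if (typeof rate !== 'number' || rate < MIN_RATE || rate > MAX_RATE) {
      throw new RangeError(
        `speaking_rate for ${language} must be between ${MIN_RATE} and ${MAX_RATE}`
      );
    }
    result[language] = { voice: settings.voice, speaking_rate: rate };
  }
  return result;
}
```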

Receiving TTS on the Viewer Side

When viewers connect via SSE, they add the tts=true parameter:

// {token} is a placeholder; substitute the token for your broadcast
const eventSource = new EventSource(
  'https://vas-poc.vurbo.ai/broadcast/{token}/text?lang=en-US&tts=true'
);

eventSource.addEventListener('tts_ready', (e) => {
  const data = JSON.parse(e.data);
  // data.audio is the Base64-encoded MP3
  // data.boundaries is the Word Boundary array
  const blob = base64ToBlob(data.audio, 'audio/mpeg');
  const audio = new Audio(URL.createObjectURL(blob));
  // play() rejects if the browser blocks autoplay before a user gesture
  audio.play().catch((err) => console.error('TTS playback failed:', err));
});
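
Rather than concatenating the viewer URL by hand, the token and query parameters can be escaped with the URL API. buildViewerUrl below is a hypothetical helper that follows the URL pattern from the example above.

```javascript
// Build the viewer SSE URL for a broadcast token and language.
// Pass { tts: false } to subscribe to text only.
function buildViewerUrl(token, lang, { tts = true } = {}) {
  const url = new URL(
    `https://vas-poc.vurbo.ai/broadcast/${encodeURIComponent(token)}/text`
  );
  url.searchParams.set('lang', lang);
  if (tts) url.searchParams.set('tts', 'true');
  return url.toString();
}
```

The returned string can be passed straight to new EventSource(...).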

Important Notes

  • The TTS language must be in translation_languages; invalid languages are automatically ignored
  • The host (WebSocket) does not receive TTS audio; only SSE viewers receive the tts_ready event
  • TTS is only sent during the live phase; it is not sent during the standby phase

Word Boundary Karaoke Effect

TTS responses include a boundaries array that records the precise time position of each word within the audio. You can use this information to implement a word-by-word karaoke highlight effect.

Word Boundary Data Structure

| Field | Type | Description |
| --- | --- | --- |
| offset_ms | int | The word's start time within the audio (milliseconds) |
| duration_ms | int | The word's duration (milliseconds) |
| text_offset | int | The start position within the text string (character index) |
| word_length | int | Word length (number of characters) |
| text | string | Word content |

Example Data

Using "Hello, nice to meet you" as an example:

[
  {"offset_ms": 0,    "duration_ms": 350, "text_offset": 0,  "word_length": 5, "text": "Hello"},
  {"offset_ms": 350,  "duration_ms": 100, "text_offset": 5,  "word_length": 1, "text": ","},
  {"offset_ms": 500,  "duration_ms": 250, "text_offset": 7,  "word_length": 4, "text": "nice"},
  {"offset_ms": 750,  "duration_ms": 200, "text_offset": 12, "word_length": 2, "text": "to"},
  {"offset_ms": 950,  "duration_ms": 350, "text_offset": 15, "word_length": 4, "text": "meet"},
  {"offset_ms": 1300, "duration_ms": 300, "text_offset": 20, "word_length": 3, "text": "you"}
]
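
To make the relationship between these fields concrete, the sketch below looks up the word that is active at a given playback time and checks that each boundary slices its own word out of the text. Both function names are hypothetical. The lookup assumes boundaries are sorted by offset_ms and do not overlap, and the slice check assumes text_offset and word_length count JavaScript string (UTF-16) units, which this guide does not state.

```javascript
// Return the boundary whose [offset_ms, offset_ms + duration_ms) window
// contains timeMs, or null when timeMs falls in a gap between words.
// Binary search; assumes boundaries are sorted and non-overlapping.
function findActiveBoundary(boundaries, timeMs) {
  let lo = 0;
  let hi = boundaries.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const b = boundaries[mid];
    if (timeMs < b.offset_ms) {
      hi = mid - 1;
    } else if (timeMs >= b.offset_ms + b.duration_ms) {
      lo = mid + 1;
    } else {
      return b;
    }
  }
  return null;
}

// Check that text_offset and word_length select each boundary's own word.
function boundariesMatchText(boundaries, text) {
  return boundaries.every(
    (b) => text.slice(b.text_offset, b.text_offset + b.word_length) === b.text
  );
}
```

With the example data, a playback time of 600 ms falls inside "nice" (500 to 750 ms), while 460 ms falls in the gap after the comma and yields null.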

Implementing the Karaoke Effect

function setupKaraoke(audioElement, boundaries, text) {
  const updateHighlight = () => {
    const currentTimeMs = audioElement.currentTime * 1000;

    // Find the word currently being played. A word stays highlighted until
    // the next word starts, so short gaps between words do not flicker.
    const currentWord = boundaries.find((b, i) => {
      const nextOffset = boundaries[i + 1]?.offset_ms ?? Infinity;
      return currentTimeMs >= b.offset_ms && currentTimeMs < nextOffset;
    });

    if (currentWord) {
      // Highlight the current word
      highlightWord(text, currentWord.text_offset, currentWord.word_length);
    }
  };

  // Update the highlight position every 50ms
  const interval = setInterval(updateHighlight, 50);
  // Stop polling when playback ends or fails so the timer does not leak
  const stop = () => clearInterval(interval);
  audioElement.addEventListener('ended', stop);
  audioElement.addEventListener('error', stop);
}

function highlightWord(text, offset, length) {
  const container = document.getElementById('tts-text');
  const span = document.createElement('span');
  span.className = 'highlight';
  span.textContent = text.substring(offset, offset + length);

  // Update the DOM (adjust to your actual UI framework). Text nodes are used
  // instead of innerHTML so the translated text is never parsed as HTML.
  container.replaceChildren(
    text.substring(0, offset),
    span,
    text.substring(offset + length)
  );
}

CSS Style Reference

.highlight {
  background-color: #FFD700;
  color: #000;
  padding: 2px 4px;
  border-radius: 3px;
  transition: background-color 0.1s ease;
}

TTS Settings Management

Switching TTS Mode

You can switch between synchronous and asynchronous mode at any time while recording is in progress:

{
  "type": "voice-translation",
  "data": {
    "action": "tts_mode",
    "tts_mode": "async"
  }
}

Success response:

{
  "type": "voice-translation",
  "data": {
    "action": "tts_mode_changed",
    "tts_mode": "async"
  }
}

Dynamically Updating TTS Settings in Broadcast Mode

While a broadcast is in progress, you can update the TTS settings via the REST API:

curl -X PATCH "https://vas-poc.vurbo.ai/api/v1/broadcasts/{id}" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "tts_config": {
      "zh-TW": {"voice": "zh-TW-HsiaoChenNeural", "speaking_rate": 1.0},
      "ja-JP": {"voice": "ja-JP-NanamiNeural", "speaking_rate": 1.2}
    }
  }'

Clearing the TTS settings (pass null):

curl -X PATCH "https://vas-poc.vurbo.ai/api/v1/broadcasts/{id}" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "tts_config": null
  }'

TTS Error Handling

| Error Code | Description | Recommended Action |
| --- | --- | --- |
| tts_not_enabled | TTS not enabled | Enable TTS at start |
| tts_segment_not_found | Specified sentence not found | Verify that the SID exists |
| tts_translation_not_found | Translation for the language is missing | Verify that the translation exists |
| translation_not_found | Translation not found | Verify that the translation is complete |
| tts_synthesis_failed | TTS synthesis failed | Retry later |
| tts_quota_exceeded | TTS usage limit reached | Retry later |
| invalid_data | Invalid mode | Use sync or async |
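
One way a client could act on these codes is to separate the ones marked "Retry later" from those that need a corrected request. The sketch below is illustrative only: classifyTtsError and retryDelayMs are hypothetical helpers, the exponential backoff schedule is an assumption rather than a documented requirement, and the shape of the error message that carries the code is not covered in this guide.

```javascript
// Recommended handling per error code, taken from the table above.
const TTS_ERROR_ACTIONS = new Map([
  ['tts_not_enabled', { retry: false, hint: 'Enable TTS at start' }],
  ['tts_segment_not_found', { retry: false, hint: 'Verify that the SID exists' }],
  ['tts_translation_not_found', { retry: false, hint: 'Verify that the translation exists' }],
  ['translation_not_found', { retry: false, hint: 'Verify that the translation is complete' }],
  ['tts_synthesis_failed', { retry: true, hint: 'Retry later' }],
  ['tts_quota_exceeded', { retry: true, hint: 'Retry later' }],
  ['invalid_data', { retry: false, hint: 'Use sync or async' }],
]);

// Look up how to react to an error code; unknown codes are not retried.
function classifyTtsError(code) {
  return TTS_ERROR_ACTIONS.get(code) ?? { retry: false, hint: 'Unknown error code' };
}

// Exponential backoff delay for the given retry attempt (0-based), capped
// at maxMs. The schedule itself is an assumption.
function retryDelayMs(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}
```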


Version: V1.5.7 Last Updated: 2026-05-20

Copyright © 2026