TTS

Overview

VAS provides text-to-speech (TTS) synthesis, which converts translated text into playable audio. The system supports 154 languages (locales) with 325 voices in total, fully aligned with the speech provider's Monolingual Neural Voice set.

Supported Languages

Language Code	Language Name	Number of Voices
zh-TW	Traditional Chinese	3
zh-CN	Simplified Chinese	4
en-US	English (United States)	6
en-GB	English (United Kingdom)	3
ja-JP	Japanese	4
ko-KR	Korean	4
fr-FR	French	3
de-DE	German	3
es-ES	Spanish	3
it-IT	Italian	3
pt-BR	Portuguese (Brazil)	3
th-TH	Thai	3
vi-VN	Vietnamese	2
id-ID	Indonesian	2

The table above summarizes popular locales only (14 locales, 46 voices in total). For the complete set of 154 locales and 325 voices, treat GET /api/v1/tts/voices?language={code} as the authoritative source.

Key Features

Multi-scenario support: real-time TTS (WebSocket), historical TTS (SSE), and broadcast TTS
Word Boundary: each word carries a precise timestamp, enabling a word-by-word karaoke highlight effect
Sync/async modes: sync mode automatically plays the latest translation, while async mode gives you manual control over playback
Multilingual broadcast TTS: in broadcast mode, you can configure a separate TTS voice for each translation language

Authentication

All TTS-related REST APIs require API Key authentication. See Authentication for details.

Querying Available Voices

Before using TTS, query which voices are available for a given language.

Supported Languages

VAS currently supports TTS speech synthesis in 154 languages. For the complete list, see Appendix - Supported Languages.

Getting the Voice List for a Specific Language

GET https://vas-poc.vurbo.ai/api/v1/tts/voices?language={language}

Example: querying English voices

curl -X GET "https://vas-poc.vurbo.ai/api/v1/tts/voices?language=en-US" \
  -H "X-API-Key: YOUR_API_KEY"

Key response fields:

Field	Description
`voice_name`	Voice identifier, used in API calls
`display_name`	Voice display name
`gender`	Gender: `Female` / `Male`
`is_default`	Whether this is the default voice for the language
`sample_url`	Preview audio URL

For the complete parameter and response formats, see TTS REST API.
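
As a rough sketch of how the fields above might be used on the client, the helper below picks a voice from a voice list. It assumes the list is already available as a plain array of objects carrying the fields in the table; the actual response envelope is not shown in this guide, so `pickVoice` and the sample data are illustrative only.

```javascript
// Hypothetical helper: choose a voice from an array of voice objects.
// Assumes each object has the voice_name, gender, and is_default fields
// described above; the real response envelope may differ.
function pickVoice(voices, preferredGender) {
  if (!Array.isArray(voices) || voices.length === 0) return null;

  // Narrow by gender when requested, but fall back to the full list
  // if no voice of that gender exists for the language.
  const byGender = preferredGender
    ? voices.filter((v) => v.gender === preferredGender)
    : voices;
  const candidates = byGender.length > 0 ? byGender : voices;

  // Prefer the language's default voice when it is among the candidates.
  const chosen = candidates.find((v) => v.is_default) ?? candidates[0];
  return chosen.voice_name;
}

// Made-up sample data, for illustration only.
const sampleVoices = [
  { voice_name: 'en-US-JennyNeural', gender: 'Female', is_default: true },
  { voice_name: 'en-US-GuyNeural', gender: 'Male', is_default: false },
];
```

The returned `voice_name` is the value you would pass as `tts_voice` or as the `voice` field of `tts_config`.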

Voice Preview

After querying the voice list, you can use the sample URL to preview how each voice sounds.

GET https://vas-poc.vurbo.ai/api/v1/tts/voices/{voiceName}/sample

Key points:

The response is binary MP3 audio data (not JSON)
The first request synthesizes the audio on the fly and caches it; subsequent requests return directly from the cache
Does not count toward TTS usage charges
Rate limit: 30 requests per minute per user

Frontend preview example:

// Play directly with an Audio element
const audio = new Audio(
  'https://vas-poc.vurbo.ai/api/v1/tts/voices/en-US-JennyNeural/sample'
);
audio.play();
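
An Audio element cannot attach custom request headers. This guide states that all TTS-related REST APIs require API Key authentication but does not say whether the sample URL is exempt, so if your requests are rejected, a fetch-based variant such as the sketch below may help. `buildSampleRequest` and `previewVoice` are hypothetical names, and sending `X-API-Key` to this endpoint is an assumption.

```javascript
// Hypothetical helper: describe the preview request for a voice.
// Assumes the sample endpoint accepts the same X-API-Key header as the
// other TTS REST APIs.
function buildSampleRequest(voiceName, apiKey) {
  const base = 'https://vas-poc.vurbo.ai/api/v1/tts/voices/';
  return {
    url: `${base}${encodeURIComponent(voiceName)}/sample`,
    headers: { 'X-API-Key': apiKey },
  };
}

// Browser-only: fetch the MP3 with the header, then play it from a Blob URL.
async function previewVoice(voiceName, apiKey) {
  const { url, headers } = buildSampleRequest(voiceName, apiKey);
  const response = await fetch(url, { headers });
  if (!response.ok) {
    throw new Error(`Preview failed: HTTP ${response.status}`);
  }
  const objectUrl = URL.createObjectURL(await response.blob());
  const audio = new Audio(objectUrl);
  // Release the Blob URL once playback has finished.
  audio.addEventListener('ended', () => URL.revokeObjectURL(objectUrl));
  await audio.play();
}
```

Because previews are rate limited per user, you may also want to keep the Blob around and reuse it rather than requesting the same sample repeatedly.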

Real-Time TTS (WebSocket)

Real-time TTS converts translation results into speech on the fly while speech recognition is in progress. TTS requests and the resulting audio are both exchanged over WebSocket.

Enabling TTS

Add the TTS-related parameters to the start action:

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "transcription_languages": ["zh-TW"],
    "translation_languages": ["en-US"],
    "realtime_translation": true,
    "type": "transcribe",
    "tts_enabled": true,
    "tts_language": "en-US",
    "tts_voice": "en-US-JennyNeural",
    "tts_mode": "sync"
  }
}

Parameter	Description
`tts_enabled`	Set to `true` to enable TTS
`tts_language`	TTS output language (must be in `translation_languages`)
`tts_voice`	TTS voice name (e.g., `en-US-JennyNeural`)
`tts_mode`	`sync` (synchronous, automatic playback) or `async` (asynchronous, manual control)
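
The constraints in the table can be checked on the client before the start action is sent. The following is a minimal sketch; `buildTtsStartMessage` is a hypothetical helper rather than part of any VAS SDK, and it only covers the fields shown in the example above.

```javascript
// Hypothetical helper: build a start action with TTS enabled, rejecting
// combinations that the parameter table describes as invalid.
function buildTtsStartMessage({
  transcriptionLanguages,
  translationLanguages,
  ttsLanguage,
  ttsVoice,
  ttsMode = 'sync',
}) {
  if (!translationLanguages.includes(ttsLanguage)) {
    throw new Error(`tts_language ${ttsLanguage} must be in translation_languages`);
  }
  if (ttsMode !== 'sync' && ttsMode !== 'async') {
    throw new Error('tts_mode must be "sync" or "async"');
  }
  return {
    type: 'voice-translation',
    data: {
      action: 'start',
      transcription_languages: transcriptionLanguages,
      translation_languages: translationLanguages,
      realtime_translation: true,
      type: 'transcribe',
      tts_enabled: true,
      tts_language: ttsLanguage,
      tts_voice: ttsVoice,
      tts_mode: ttsMode,
    },
  };
}
```

The result can be sent with `ws.send(JSON.stringify(message))` once the WebSocket is open.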

Synchronous Mode (sync)

The system automatically plays the latest translated sentence with is_final=true
If the previous sentence is still playing, subsequent sentences enter a queue and wait
Suitable for scenarios that do not require manual control
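
This guide does not spell out whether the queueing described above happens on the server or in the client. If your client can receive a new tts_ready event while earlier audio is still playing, one option is a small client-side queue. The sketch below is only an illustration under that assumption; `createPlaybackQueue` and its `startPlayback` callback are hypothetical names, not part of the VAS API.

```javascript
// Hypothetical client-side queue: plays one item at a time, in arrival order.
// startPlayback(item, done) must begin playback and call done() when the
// item has finished (or failed).
function createPlaybackQueue(startPlayback) {
  const pending = [];
  let playing = false;

  function playNext() {
    if (playing || pending.length === 0) return;
    playing = true;
    const item = pending.shift();
    let finished = false;
    startPlayback(item, () => {
      if (finished) return; // ignore duplicate completion signals
      finished = true;
      playing = false;
      playNext();
    });
  }

  return {
    // Add an item; it starts immediately if nothing else is playing.
    enqueue(item) {
      pending.push(item);
      playNext();
    },
    // Drop everything that has not started yet.
    clear() {
      pending.length = 0;
    },
    // Number of items still waiting to start.
    get size() {
      return pending.length;
    },
  };
}
```

In a browser, `startPlayback` would typically create an Audio element from the decoded payload and call `done` from both its `ended` and `error` events.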

Asynchronous Mode (async)

You can manually select any translated sentence for TTS playback. Repeated requests for the same sid (replay) are supported.

Conversation mode (conversation) also supports tts_mode: "async". Once set, tts_ready is not pushed automatically when translation completes; you must trigger it manually with tts_play. In conversation mode, the system automatically synthesizes the corresponding language based on tts_config.

Playing a specific sentence:

{
  "type": "voice-translation",
  "data": {
    "action": "tts_play",
    "sid": 5
  }
}

Playing multiple sentences (play 3 sentences starting from sid 5):

{
  "type": "voice-translation",
  "data": {
    "action": "tts_play",
    "sid": 5,
    "length": 3
  }
}

The maximum value of length is 20 (controlled by the backend TTS_SSE_MAX_LENGTH).
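
A small builder can enforce the length limit before the message is sent. This is a sketch only: `buildTtsPlayMessage` is a hypothetical helper, the constant simply mirrors the documented maximum of 20, and the assumption that `sid` is an integer comes from the examples above.

```javascript
// Mirrors the documented maximum; the real limit is enforced by the backend.
const MAX_TTS_PLAY_LENGTH = 20;

// Hypothetical helper: build a tts_play action for one or more sentences.
function buildTtsPlayMessage(sid, length = 1) {
  if (!Number.isInteger(sid)) {
    throw new TypeError('sid must be an integer');
  }
  if (!Number.isInteger(length) || length < 1 || length > MAX_TTS_PLAY_LENGTH) {
    throw new RangeError(`length must be an integer from 1 to ${MAX_TTS_PLAY_LENGTH}`);
  }
  const data = { action: 'tts_play', sid };
  // Omit length for a single sentence, matching the first example above.
  if (length > 1) data.length = length;
  return { type: 'voice-translation', data };
}
```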

Stopping playback:

{
  "type": "voice-translation",
  "data": {
    "action": "tts_stop"
  }
}

Receiving TTS Audio

When TTS synthesis completes, the server pushes a tts_ready event:

{
  "type": "voice-translation",
  "data": {
    "action": "tts_ready",
    "sid": 1,
    "language": "en-US",
    "transcript": "你好，很高興認識你",
    "text": "Hello, nice to meet you",
    "audio": "Base64EncodedMP3...",
    "format": "mp3",
    "duration_ms": 2500,
    "boundaries": [
      {"offset_ms": 0, "duration_ms": 350, "text_offset": 0, "word_length": 5, "text": "Hello"},
      {"offset_ms": 500, "duration_ms": 250, "text_offset": 7, "word_length": 4, "text": "nice"},
      {"offset_ms": 750, "duration_ms": 200, "text_offset": 12, "word_length": 2, "text": "to"},
      {"offset_ms": 950, "duration_ms": 350, "text_offset": 15, "word_length": 4, "text": "meet"},
      {"offset_ms": 1300, "duration_ms": 300, "text_offset": 20, "word_length": 3, "text": "you"}
    ]
  }
}

Frontend playback example:

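ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.data?.action === 'tts_ready') {
    // boundaries and text are available for the karaoke effect
    const { audio, boundaries, text } = msg.data;

    // Decode the Base64 payload into raw bytes
    const byteChars = atob(audio);
    const byteArray = new Uint8Array(byteChars.length);
    for (let i = 0; i < byteChars.length; i++) {
      byteArray[i] = byteChars.charCodeAt(i);
    }
    const blob = new Blob([byteArray], { type: 'audio/mp3' });
    const objectUrl = URL.createObjectURL(blob);
    const audioEl = new Audio(objectUrl);
    // Release the object URL once playback finishes
    audioEl.addEventListener('ended', () => URL.revokeObjectURL(objectUrl));
    audioEl.play();
  }
};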

Switching TTS Mode

You can dynamically switch between synchronous and asynchronous mode while recording is in progress:

{
  "type": "voice-translation",
  "data": {
    "action": "tts_mode",
    "tts_mode": "async"
  }
}

Historical TTS (SSE)

Historical TTS is used to play back the translated audio of a completed recording, streaming the audio sentence by sentence over SSE.

Request Format

GET https://vas-poc.vurbo.ai/api/v1/sse/tts/{taskId}?language={language}&voice={voice}&sid={sid}&length={length}

Parameter	Required	Description
`taskId`	Yes	Recording ID
`language`	Yes	TTS output language (e.g., `en-US`)
`voice`	No	Specific voice name (e.g., `en-US-JennyNeural`)
`sid`	No	Starting sentence ID (default 1)
`length`	No	Number of sentences to return (default 1, max 20)

Event Sequence

connected  ->  tts_audio (repeated N times)  ->  tts_done

connected: connection confirmation, including voice information
tts_audio: TTS audio sent sentence by sentence (including Word Boundary)
tts_done: all sentences have been sent

Multi-Sentence Playback Example

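// parseSSE, base64ToBlob, and setupKaraoke are helper functions you supply
async function playTTS(taskId, language, apiKey, startSid = 1, length = 3) {
  const url = new URL(`https://vas-poc.vurbo.ai/api/v1/sse/tts/${taskId}`);
  url.searchParams.set('language', language);
  url.searchParams.set('sid', startSid);
  url.searchParams.set('length', length);

  const response = await fetch(url, {
    headers: { 'X-API-Key': apiKey }
  });
  if (!response.ok) {
    throw new Error(`TTS request failed: HTTP ${response.status}`);
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // A network chunk can end mid-event, so buffer the text and parse only
    // complete events (SSE events are terminated by a blank line)
    buffer = (buffer + decoder.decode(value, { stream: true })).replace(/\r\n/g, '\n');
    const end = buffer.lastIndexOf('\n\n');
    if (end === -1) continue;
    const events = parseSSE(buffer.slice(0, end + 2));
    buffer = buffer.slice(end + 2);

    for (const event of events) {
      if (event.type === 'tts_audio') {
        // Play the audio and set up the karaoke effect
        const blob = base64ToBlob(event.data.audio, 'audio/mp3');
        const audio = new Audio(URL.createObjectURL(blob));
        setupKaraoke(audio, event.data.boundaries, event.data.text);
        // Wait for each sentence to finish so sentences do not overlap
        await new Promise((resolve) => {
          audio.addEventListener('ended', resolve);
          audio.addEventListener('error', resolve);
          audio.play().catch(resolve);
        });
      }
    }
  }
}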

Note: the browser's native EventSource does not support custom headers, so use the fetch API together with ReadableStream.
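
The example above calls parseSSE and base64ToBlob, which are not defined in this guide. The sketch below shows one possible implementation. It assumes the server uses standard SSE framing, with an `event:` line naming the event and `data:` lines carrying a JSON payload; the exact wire format is not shown here, so treat the field handling as an assumption.

```javascript
// Hypothetical helper: parse a block of SSE text into { type, data } objects.
// Assumes standard framing: "event: <name>" and "data: <json>" lines, with a
// blank line between events. Pass only complete events; a partial event at
// the end of the text would be parsed as if it were complete.
function parseSSE(text) {
  const events = [];
  for (const block of text.split(/\r?\n\r?\n/)) {
    let type = 'message'; // SSE default when no event: line is present
    const dataLines = [];
    for (const line of block.split(/\r?\n/)) {
      if (line.startsWith('event:')) {
        type = line.slice(6).trim();
      } else if (line.startsWith('data:')) {
        // The spec strips a single leading space after the colon.
        dataLines.push(line.slice(5).replace(/^ /, ''));
      }
    }
    if (dataLines.length === 0) continue; // comments, keep-alives, empty blocks
    const raw = dataLines.join('\n');
    let data;
    try {
      data = JSON.parse(raw);
    } catch {
      data = raw; // leave non-JSON payloads as plain text
    }
    events.push({ type, data });
  }
  return events;
}

// Decode a Base64 string into raw bytes.
function base64ToBytes(base64) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Wrap the decoded bytes in a Blob that an Audio element can play.
function base64ToBlob(base64, mimeType) {
  return new Blob([base64ToBytes(base64)], { type: mimeType });
}
```

Because a network chunk can end in the middle of an event, callers should accumulate text and hand parseSSE only the portion that ends with a blank line.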

Broadcast TTS

TTS in broadcast mode lets the viewer side receive translated audio. TTS audio is pushed to viewers over SSE.

Host-Side Configuration

When creating a broadcast (REST API) or starting the WebSocket, use tts_config to specify which languages have TTS enabled.

Configuring when creating a broadcast (REST API):

curl -X POST "https://vas-poc.vurbo.ai/api/v1/broadcasts" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "transcription_language": "zh-TW",
    "translation_languages": ["en-US", "ja-JP"],
    "tts_config": {
      "en-US": {"voice": "en-US-JennyNeural", "speaking_rate": 1.0},
      "ja-JP": {"voice": "ja-JP-NanamiNeural", "speaking_rate": 1.0}
    }
  }'

Configuring at WebSocket start:

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "type": "broadcast",
    "broadcast_token": "YOUR_BROADCAST_TOKEN",
    "audio_format": "pcm",
    "tts_config": {
      "en-US": {
        "voice": "en-US-JennyNeural",
        "speaking_rate": 1.0
      },
      "ja-JP": {
        "voice": "ja-JP-NanamiNeural",
        "speaking_rate": 1.0
      }
    }
  }
}

tts_config Parameters

Field	Type	Description
`voice`	string	TTS voice name
`speaking_rate`	number	Speaking rate (0.5 ~ 2.0, default 1.0)
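
Before sending tts_config, you can check it against the documented range on the client. The helper below is a hypothetical sketch, not part of the VAS API: it applies the default rate, rejects rates outside 0.5 to 2.0, and drops languages that are not in translation_languages, which this guide says the server would ignore anyway.

```javascript
// Hypothetical helper: validate and normalize a tts_config object.
function normalizeTtsConfig(ttsConfig, translationLanguages) {
  const result = {};
  for (const [language, settings] of Object.entries(ttsConfig)) {
    // Languages outside translation_languages are ignored by the server,
    // so leave them out of the request.
    if (!translationLanguages.includes(language)) continue;

    if (typeof settings.voice !== 'string' || settings.voice === '') {
      throw new TypeError(`voice is required for ${language}`);
    }
    const rate = settings.speaking_rate ?? 1.0; // documented default
    // Written as a positive range check so that NaN is rejected too.
    if (typeof rate !== 'number' || !(rate >= 0.5 && rate <= 2.0)) {
      throw new RangeError(`speaking_rate for ${language} must be between 0.5 and 2.0`);
    }
    result[language] = { voice: settings.voice, speaking_rate: rate };
  }
  return result;
}
```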

Receiving TTS on the Viewer Side

When viewers connect via SSE, they add the tts=true parameter:

const eventSource = new EventSource(
  'https://vas-poc.vurbo.ai/broadcast/{token}/text?lang=en-US&tts=true'
);

eventSource.addEventListener('tts_ready', (e) => {
  const data = JSON.parse(e.data);
  // data.audio is the Base64-encoded MP3
  // data.boundaries is the Word Boundary array
  const blob = base64ToBlob(data.audio, 'audio/mp3');
  const audio = new Audio(URL.createObjectURL(blob));
  audio.play();
});
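
The `{token}` segment in the URL above is a placeholder that must be replaced with a real broadcast token. As a small sketch, the hypothetical helper below builds the viewer URL from its parts; the path and the `lang` and `tts` parameters are taken from the example above.

```javascript
// Hypothetical helper: build the viewer SSE URL for a broadcast.
function buildViewerUrl(token, lang, { tts = false } = {}) {
  const url = new URL(
    `https://vas-poc.vurbo.ai/broadcast/${encodeURIComponent(token)}/text`
  );
  url.searchParams.set('lang', lang);
  // Only request TTS audio when the viewer has opted in.
  if (tts) url.searchParams.set('tts', 'true');
  return url.toString();
}
```

The result can be passed directly to `new EventSource(...)`.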

Important Notes

The TTS language must be in translation_languages; invalid languages are automatically ignored
The host (WebSocket) does not receive TTS audio; only SSE viewers receive the tts_ready event
TTS is only sent during the live phase; it is not sent during the standby phase

Word Boundary Karaoke Effect

TTS responses include a boundaries array that records the precise time position of each word within the audio. You can use this information to implement a word-by-word karaoke highlight effect.

Word Boundary Data Structure

Field	Type	Description
`offset_ms`	int	The word's start time within the audio (milliseconds)
`duration_ms`	int	The word's duration (milliseconds)
`text_offset`	int	The start position within the text string (character index)
`word_length`	int	Word length (number of characters)
`text`	string	Word content

Example Data

Using "Hello, nice to meet you" as an example:

[
  {"offset_ms": 0,    "duration_ms": 350, "text_offset": 0,  "word_length": 5, "text": "Hello"},
  {"offset_ms": 350,  "duration_ms": 100, "text_offset": 5,  "word_length": 1, "text": ","},
  {"offset_ms": 500,  "duration_ms": 250, "text_offset": 7,  "word_length": 4, "text": "nice"},
  {"offset_ms": 750,  "duration_ms": 200, "text_offset": 12, "word_length": 2, "text": "to"},
  {"offset_ms": 950,  "duration_ms": 350, "text_offset": 15, "word_length": 4, "text": "meet"},
  {"offset_ms": 1300, "duration_ms": 300, "text_offset": 20, "word_length": 3, "text": "you"}
]
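
In this example, each entry's text_offset and word_length select exactly the entry's text from the sentence. A quick consistency check along those lines can catch offset problems before they show up as misplaced highlights. The sketch below uses a hypothetical helper name and assumes text_offset counts UTF-16 code units, as JavaScript string indices do; this guide does not state the unit.

```javascript
// Hypothetical helper: report boundaries whose offsets do not select their
// own text. Assumes text_offset is measured in UTF-16 code units.
function findBoundaryMismatches(text, boundaries) {
  const mismatches = [];
  boundaries.forEach((boundary, index) => {
    const actual = text.substring(
      boundary.text_offset,
      boundary.text_offset + boundary.word_length
    );
    if (actual !== boundary.text) {
      mismatches.push({ index, expected: boundary.text, actual });
    }
  });
  return mismatches;
}
```

An empty result means every boundary lines up with the sentence text.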

Implementing the Karaoke Effect

function setupKaraoke(audioElement, boundaries, text) {
  const updateHighlight = () => {
    const currentTimeMs = audioElement.currentTime * 1000;

    // Find the word currently being played
    const currentWord = boundaries.find((b, i) => {
      const nextOffset = boundaries[i + 1]?.offset_ms ?? Infinity;
      return currentTimeMs >= b.offset_ms && currentTimeMs < nextOffset;
    });

    if (currentWord) {
      // Highlight the current word
      highlightWord(text, currentWord.text_offset, currentWord.word_length);
    }
  };

  // Update the highlight position every 50ms
  const interval = setInterval(updateHighlight, 50);
  audioElement.addEventListener('ended', () => clearInterval(interval));
}

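function highlightWord(text, offset, length) {
  const before = text.substring(0, offset);
  const word = text.substring(offset, offset + length);
  const after = text.substring(offset + length);

  // Build DOM nodes instead of assigning innerHTML, so that characters such
  // as "<" or "&" in the translated text are shown literally
  const span = document.createElement('span');
  span.className = 'highlight';
  span.textContent = word;

  // Update the DOM (adjust to your actual UI framework)
  document.getElementById('tts-text').replaceChildren(before, span, after);
}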
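
The lookup inside setupKaraoke scans the boundaries array on every tick. For long sentences, or simply to make the lookup testable without an Audio element, it can be pulled out into a pure function. The sketch below uses binary search and assumes the boundaries are sorted by offset_ms, as in the example data; `findActiveBoundaryIndex` is a hypothetical name.

```javascript
// Hypothetical helper: index of the boundary active at timeMs, or -1 if
// playback has not reached the first word yet.
// Assumes boundaries are sorted by ascending offset_ms.
function findActiveBoundaryIndex(boundaries, timeMs) {
  let low = 0;
  let high = boundaries.length - 1;
  let result = -1;
  while (low <= high) {
    const mid = (low + high) >> 1;
    if (boundaries[mid].offset_ms <= timeMs) {
      result = mid; // candidate: started at or before timeMs
      low = mid + 1; // look for a later boundary that also qualifies
    } else {
      high = mid - 1;
    }
  }
  return result;
}
```

Like the find-based version, this keeps a word highlighted until the next word starts, so short gaps between words do not cause flicker.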

CSS Style Reference

.highlight {
  background-color: #FFD700;
  color: #000;
  padding: 2px 4px;
  border-radius: 3px;
  transition: background-color 0.1s ease;
}

TTS Settings Management

Switching TTS Mode

You can switch between synchronous and asynchronous mode at any time while recording is in progress:

{
  "type": "voice-translation",
  "data": {
    "action": "tts_mode",
    "tts_mode": "async"
  }
}

Success response:

{
  "type": "voice-translation",
  "data": {
    "action": "tts_mode_changed",
    "tts_mode": "async"
  }
}

Dynamically Updating TTS Settings in Broadcast Mode

While a broadcast is in progress, you can update the TTS settings via the REST API:

curl -X PATCH "https://vas-poc.vurbo.ai/api/v1/broadcasts/{id}" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "tts_config": {
      "en-US": {"voice": "en-US-JennyNeural", "speaking_rate": 1.0},
      "ja-JP": {"voice": "ja-JP-NanamiNeural", "speaking_rate": 1.2}
    }
  }'

Clearing the TTS settings (pass null):

curl -X PATCH "https://vas-poc.vurbo.ai/api/v1/broadcasts/{id}" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "tts_config": null
  }'

TTS Error Handling

Error Code	Description	Recommended Action
`tts_not_enabled`	TTS not enabled	Enable TTS at start
`tts_segment_not_found`	Specified sentence not found	Verify that the SID exists
`tts_translation_not_found`	Translation for the language is missing	Verify that the translation exists
`translation_not_found`	Translation not found	Verify that the translation is complete
`tts_synthesis_failed`	TTS synthesis failed	Retry later
`tts_quota_exceeded`	TTS usage limit reached	Retry later
`invalid_data`	Invalid mode	Use `sync` or `async`

Version: V1.5.7 Last Updated: 2026-05-20
