TTS
Table of Contents
- Overview
- Querying Available Voices
- Voice Preview
- Real-Time TTS (WebSocket)
- Historical TTS (SSE)
- Broadcast TTS
- Word Boundary Karaoke Effect
- TTS Settings Management
- Related Reference Documents
Overview
VAS provides text-to-speech (TTS) synthesis, which converts translated text into playable audio. The system supports 154 languages with 325 voices in total (fully aligned with the speech provider's Monolingual Neural Voice set).
Supported Languages
| Language Code | Language Name | Number of Voices |
|---|---|---|
| zh-TW | Traditional Chinese | 3 |
| zh-CN | Simplified Chinese | 4 |
| en-US | English (United States) | 6 |
| en-GB | English (United Kingdom) | 3 |
| ja-JP | Japanese | 4 |
| ko-KR | Korean | 4 |
| fr-FR | French | 3 |
| de-DE | German | 3 |
| es-ES | Spanish | 3 |
| it-IT | Italian | 3 |
| pt-BR | Portuguese (Brazil) | 3 |
| th-TH | Thai | 3 |
| vi-VN | Vietnamese | 2 |
| id-ID | Indonesian | 2 |
The table above summarizes popular locales (14 locales, 46 voices in total). For the complete set of 154 locales and 325 voices, treat GET /api/v1/tts/voices?language={code} as the authoritative source.
Key Features
- Multi-scenario support: real-time TTS (WebSocket), historical TTS (SSE), and broadcast TTS
- Word Boundary: each word carries a precise timestamp, enabling a word-by-word karaoke highlight effect
- Sync/async modes: sync mode automatically plays the latest translation, while async mode gives you manual control over playback
- Multilingual broadcast TTS: in broadcast mode, you can configure a separate TTS voice for each translation language
Authentication
All TTS-related REST APIs require API Key authentication. See Authentication for details.
Querying Available Voices
Before using TTS, query which voices are available for a given language.
Supported Languages
VAS currently supports TTS in 154 languages. For the complete list, see Appendix - Supported Languages.
Getting the Voice List for a Specific Language
GET https://vas-poc.vurbo.ai/api/v1/tts/voices?language={language}
Example: querying English voices
curl -X GET "https://vas-poc.vurbo.ai/api/v1/tts/voices?language=en-US" \
  -H "X-API-Key: YOUR_API_KEY"
Key response fields:
| Field | Description |
|---|---|
| voice_name | Voice identifier, used in API calls |
| display_name | Voice display name |
| gender | Gender: Female / Male |
| is_default | Whether this is the default voice for the language |
| sample_url | Preview audio URL |
For the complete parameter and response formats, see TTS REST API.
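A minimal sketch of querying the voice list and choosing the default voice. It assumes the endpoint returns JSON containing an array of voice objects with the fields listed above, either as the top-level value or under a `voices` key. That envelope shape and the helper names `pickDefaultVoice` and `fetchVoices` are assumptions for illustration, not part of the documented API.

```javascript
// Hypothetical helper: choose the default voice from a voice list.
// Falls back to the first entry when no voice is flagged is_default.
function pickDefaultVoice(voices) {
  if (!Array.isArray(voices) || voices.length === 0) return null;
  return voices.find((v) => v.is_default === true) ?? voices[0];
}

// Hypothetical helper: fetch the voices for one language.
// Assumption: the body is either an array or an object with a `voices` array.
async function fetchVoices(language, apiKey) {
  const url = new URL('https://vas-poc.vurbo.ai/api/v1/tts/voices');
  url.searchParams.set('language', language);
  const response = await fetch(url, { headers: { 'X-API-Key': apiKey } });
  if (!response.ok) {
    throw new Error(`Voice query failed with HTTP ${response.status}`);
  }
  const body = await response.json();
  return Array.isArray(body) ? body : (body.voices ?? []);
}
```

Typical use would be `pickDefaultVoice(await fetchVoices('en-US', apiKey))`, after which the returned `voice_name` can be passed as `tts_voice`.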
Voice Preview
After querying the voice list, you can use the sample URL to preview how each voice sounds.
GET https://vas-poc.vurbo.ai/api/v1/tts/voices/{voiceName}/sample
Key points:
- The response is binary MP3 audio data (not JSON)
- The first request synthesizes the audio on the fly and caches it; subsequent requests return directly from the cache
- Does not count toward TTS usage charges
- Rate limit: 30 requests per minute per user
Frontend preview example:
// Play directly with an Audio element
const audio = new Audio(
  'https://vas-poc.vurbo.ai/api/v1/tts/voices/en-US-JennyNeural/sample'
);
audio.play();
Real-Time TTS (WebSocket)
Real-time TTS converts translation results into speech on the fly while speech recognition is in progress. It is sent and received over WebSocket.
Enabling TTS
Add the TTS-related parameters to the start action:
{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "transcription_languages": ["zh-TW"],
    "translation_languages": ["en-US"],
    "realtime_translation": true,
    "type": "transcribe",
    "tts_enabled": true,
    "tts_language": "en-US",
    "tts_voice": "en-US-JennyNeural",
    "tts_mode": "sync"
  }
}
| Parameter | Description |
|---|---|
| tts_enabled | Set to true to enable TTS |
| tts_language | TTS output language (must be in translation_languages) |
| tts_voice | TTS voice name (e.g., en-US-JennyNeural) |
| tts_mode | sync (synchronous, automatic playback) or async (asynchronous, manual control) |
Synchronous Mode (sync)
- The system automatically plays the latest translated sentence with is_final=true
- If the previous sentence is still playing, subsequent sentences enter a queue and wait
- Suitable for scenarios that do not require manual control
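If the client plays each `tts_ready` clip itself, the queueing rule above can be mirrored with a small first-in, first-out player. The sketch below is illustrative only: the `PlaybackQueue` class and its `playFn` callback are hypothetical names, and it does not claim to describe how VAS queues audio internally.

```javascript
// Hypothetical client-side FIFO queue: plays one item at a time.
// `playFn` is any function returning a Promise that settles when playback ends.
class PlaybackQueue {
  constructor(playFn) {
    this.playFn = playFn;
    this.items = [];
    this.playing = false;
  }

  enqueue(item) {
    this.items.push(item);
    // Start draining only when nothing is currently playing
    if (!this.playing) this.drain();
  }

  async drain() {
    this.playing = true;
    while (this.items.length > 0) {
      const next = this.items.shift();
      try {
        await this.playFn(next);
      } catch (err) {
        // Skip a failed item so one bad clip does not block the queue
        console.error('Playback failed:', err);
      }
    }
    this.playing = false;
  }
}
```

In a browser, `playFn` could wrap an Audio element and resolve on its `ended` event; each incoming `tts_ready` payload would then be passed to `enqueue`.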
Asynchronous Mode (async)
You can manually select any translated sentence for TTS playback. Repeated requests for the same sid (replay) are supported.
Conversation mode (conversation) also supports tts_mode: "async". Once it is set, tts_ready is not pushed automatically when translation completes; you must trigger it manually with tts_play. In conversation mode, the system automatically synthesizes the corresponding language based on tts_config.
Playing a specific sentence:
{
  "type": "voice-translation",
  "data": {
    "action": "tts_play",
    "sid": 5
  }
}
Playing multiple sentences (play 3 sentences starting from sid 5):
{
  "type": "voice-translation",
  "data": {
    "action": "tts_play",
    "sid": 5,
    "length": 3
  }
}
The maximum value of length is 20 (controlled by the backend setting TTS_SSE_MAX_LENGTH).
Stopping playback:
{
  "type": "voice-translation",
  "data": {
    "action": "tts_stop"
  }
}
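The control messages above can be produced by small builder functions, which keeps the `length` limit in one place. This is a sketch under stated assumptions: `buildTtsPlay`, `buildTtsStop`, and `MAX_TTS_LENGTH` are hypothetical client-side names, and clamping on the client is a design choice rather than documented server behavior.

```javascript
const MAX_TTS_LENGTH = 20; // documented maximum for `length`

// Hypothetical helper: build a tts_play message for `length` sentences starting at `sid`.
function buildTtsPlay(sid, length = 1) {
  if (!Number.isInteger(sid)) {
    throw new TypeError('sid must be an integer');
  }
  // Clamp to the documented range; non-numeric input falls back to 1
  const requested = Number.isFinite(length) ? Math.trunc(length) : 1;
  const clamped = Math.min(Math.max(requested, 1), MAX_TTS_LENGTH);
  const data = { action: 'tts_play', sid };
  if (clamped > 1) data.length = clamped; // omit length for a single sentence
  return { type: 'voice-translation', data };
}

// Hypothetical helper: build the tts_stop message.
function buildTtsStop() {
  return { type: 'voice-translation', data: { action: 'tts_stop' } };
}
```

With an open WebSocket `ws`, a call might look like `ws.send(JSON.stringify(buildTtsPlay(5, 3)))`.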
Receiving TTS Audio
When TTS synthesis completes, the server pushes a tts_ready event:
{
  "type": "voice-translation",
  "data": {
    "action": "tts_ready",
    "sid": 1,
    "language": "en-US",
    "transcript": "你好,很高興認識你",
    "text": "Hello, nice to meet you",
    "audio": "Base64EncodedMP3...",
    "format": "mp3",
    "duration_ms": 2500,
    "boundaries": [
      {"offset_ms": 0, "duration_ms": 350, "text_offset": 0, "word_length": 5, "text": "Hello"},
      {"offset_ms": 500, "duration_ms": 250, "text_offset": 7, "word_length": 4, "text": "nice"},
      {"offset_ms": 750, "duration_ms": 200, "text_offset": 12, "word_length": 2, "text": "to"},
      {"offset_ms": 950, "duration_ms": 350, "text_offset": 15, "word_length": 4, "text": "meet"},
      {"offset_ms": 1300, "duration_ms": 300, "text_offset": 20, "word_length": 3, "text": "you"}
    ]
  }
}
Frontend playback example:
ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.data?.action === 'tts_ready') {
    const { audio, boundaries, text } = msg.data;
    // Base64 to Blob
    const byteChars = atob(audio);
    const byteArray = new Uint8Array(byteChars.length);
    for (let i = 0; i < byteChars.length; i++) {
      byteArray[i] = byteChars.charCodeAt(i);
    }
    const blob = new Blob([byteArray], { type: 'audio/mp3' });
    const audioEl = new Audio(URL.createObjectURL(blob));
    audioEl.play();
  }
};
Switching TTS Mode
You can dynamically switch between synchronous and asynchronous mode while recording is in progress:
{
  "type": "voice-translation",
  "data": {
    "action": "tts_mode",
    "tts_mode": "async"
  }
}
Historical TTS (SSE)
Historical TTS is used to play back the translated audio of a completed recording, streaming the audio sentence by sentence over SSE.
Request Format
GET https://vas-poc.vurbo.ai/api/v1/sse/tts/{taskId}?language={language}&sid={sid}&length={length}
| Parameter | Required | Description |
|---|---|---|
| taskId | Yes | Recording ID |
| language | Yes | TTS output language (e.g., en-US) |
| voice | No | Specific voice name (e.g., en-US-JennyNeural) |
| sid | No | Starting sentence ID (default 1) |
| length | No | Number of sentences to return (default 1, max 20) |
Event Sequence
connected -> tts_audio (repeated N times) -> tts_done
- connected: connection confirmation, including voice information
- tts_audio: TTS audio sent sentence by sentence (including Word Boundary)
- tts_done: all sentences have been sent
Multi-Sentence Playback Example
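async function playTTS(taskId, language, apiKey, startSid = 1, length = 3) {
  const url = new URL(`https://vas-poc.vurbo.ai/api/v1/sse/tts/${taskId}`);
  url.searchParams.set('language', language);
  url.searchParams.set('sid', startSid);
  url.searchParams.set('length', length);
  const response = await fetch(url, {
    headers: { 'X-API-Key': apiKey }
  });
  if (!response.ok) throw new Error(`TTS request failed: HTTP ${response.status}`);
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // A network chunk can end mid-event, so buffer until a blank line closes an event
    // (this assumes the stream uses LF line endings)
    buffer += decoder.decode(value, { stream: true });
    const boundary = buffer.lastIndexOf('\n\n');
    if (boundary === -1) continue;
    // parseSSE and base64ToBlob are helper functions you need to supply
    const events = parseSSE(buffer.slice(0, boundary + 2));
    buffer = buffer.slice(boundary + 2);
    for (const event of events) {
      if (event.type === 'tts_audio') {
        // Play the audio and set up the karaoke effect
        const blob = base64ToBlob(event.data.audio, 'audio/mp3');
        const audio = new Audio(URL.createObjectURL(blob));
        setupKaraoke(audio, event.data.boundaries, event.data.text);
        await audio.play();
        // Wait for this sentence to finish so consecutive sentences do not overlap
        await new Promise((resolve) => {
          audio.addEventListener('ended', resolve, { once: true });
          audio.addEventListener('error', resolve, { once: true });
        });
      }
    }
  }
}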
Note: the browser's native EventSource does not support custom headers, so use the fetch API together with ReadableStream.
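The playTTS example calls `parseSSE` and `base64ToBlob`, which this guide does not define. One possible implementation is sketched below. It assumes that each event arrives as an `event:` line followed by one or more `data:` lines containing JSON, and that the input passed to `parseSSE` ends on an event boundary. Both helper bodies are illustrative, not the official client code.

```javascript
// Hypothetical helper: parse complete SSE event blocks into { type, data } objects.
// Assumption: each event has an `event:` line and JSON in its `data:` line(s).
function parseSSE(chunk) {
  const events = [];
  for (const block of chunk.split(/\r?\n\r?\n/)) {
    let type = 'message'; // SSE default event name
    const dataLines = [];
    for (const line of block.split(/\r?\n/)) {
      if (line.startsWith('event:')) {
        type = line.slice(6).trim();
      } else if (line.startsWith('data:')) {
        // The SSE format allows one optional space after the colon
        dataLines.push(line.slice(5).replace(/^ /, ''));
      }
    }
    if (dataLines.length === 0) continue; // comments, keep-alives, empty blocks
    const raw = dataLines.join('\n');
    let data;
    try {
      data = JSON.parse(raw);
    } catch {
      data = raw; // keep non-JSON payloads as plain text
    }
    events.push({ type, data });
  }
  return events;
}

// Hypothetical helper: decode a Base64 string into a Blob of the given MIME type.
function base64ToBlob(base64, mimeType) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return new Blob([bytes], { type: mimeType });
}
```

Because `parseSSE` only understands whole events, the caller is responsible for buffering partial network chunks before handing text to it.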
Broadcast TTS
TTS in broadcast mode lets the viewer side receive translated audio. TTS audio is pushed to viewers over SSE.
Host-Side Configuration
When creating a broadcast (REST API) or starting the WebSocket, use tts_config to specify which languages have TTS enabled:
Configuring when creating a broadcast (REST API):
curl -X POST "https://vas-poc.vurbo.ai/api/v1/broadcasts" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "transcription_language": "zh-TW",
    "translation_languages": ["en-US", "ja-JP"],
    "tts_config": {
      "en-US": {"voice": "en-US-JennyNeural", "speaking_rate": 1.0},
      "ja-JP": {"voice": "ja-JP-NanamiNeural", "speaking_rate": 1.0}
    }
  }'
Configuring at WebSocket start:
{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "type": "broadcast",
    "broadcast_token": "YOUR_BROADCAST_TOKEN",
    "audio_format": "pcm",
    "tts_config": {
      "en-US": {
        "voice": "en-US-JennyNeural",
        "speaking_rate": 1.0
      },
      "ja-JP": {
        "voice": "ja-JP-NanamiNeural",
        "speaking_rate": 1.0
      }
    }
  }
}
tts_config Parameters
| Field | Type | Description |
|---|---|---|
| voice | string | TTS voice name |
| speaking_rate | number | Speaking rate (0.5 ~ 2.0, default 1.0) |
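Before sending a tts_config, a client may want to check it against the documented constraints. The following is a rough sketch, not the server's validation logic: `validateTtsConfig` is a hypothetical helper, and it only checks the speaking_rate range, the voice type, and whether each language key appears in translation_languages.

```javascript
// Hypothetical validator for a tts_config object.
// Returns a list of human-readable problems; an empty list means no issue was found.
function validateTtsConfig(ttsConfig, translationLanguages) {
  const problems = [];
  for (const [language, settings] of Object.entries(ttsConfig)) {
    if (!translationLanguages.includes(language)) {
      problems.push(`${language}: not in translation_languages`);
    }
    if (settings.voice !== undefined && typeof settings.voice !== 'string') {
      problems.push(`${language}: voice must be a string`);
    }
    const rate = settings.speaking_rate ?? 1.0; // documented default
    // The negated form also rejects NaN
    if (typeof rate !== 'number' || !(rate >= 0.5 && rate <= 2.0)) {
      problems.push(`${language}: speaking_rate must be between 0.5 and 2.0`);
    }
  }
  return problems;
}
```

Calling it with the broadcast example, `validateTtsConfig(config, ['en-US', 'ja-JP'])`, yields an empty array when both entries are within range.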
Receiving TTS on the Viewer Side
When viewers connect via SSE, they add the tts=true parameter:
const eventSource = new EventSource(
  'https://vas-poc.vurbo.ai/broadcast/{token}/text?lang=en-US&tts=true'
);
eventSource.addEventListener('tts_ready', (e) => {
  const data = JSON.parse(e.data);
  // data.audio is the Base64-encoded MP3
  // data.boundaries is the Word Boundary array
  const blob = base64ToBlob(data.audio, 'audio/mp3');
  const audio = new Audio(URL.createObjectURL(blob));
  audio.play();
});
Important Notes
- The TTS language must be in translation_languages; invalid languages are automatically ignored
- The host (WebSocket) does not receive TTS audio; only SSE viewers receive the tts_ready event
- TTS is only sent during the live phase; it is not sent during the standby phase
Word Boundary Karaoke Effect
TTS responses include a boundaries array that records the precise time position of each word within the audio. You can use this information to implement a word-by-word karaoke highlight effect.
Word Boundary Data Structure
| Field | Type | Description |
|---|---|---|
| offset_ms | int | The word's start time within the audio (milliseconds) |
| duration_ms | int | The word's duration (milliseconds) |
| text_offset | int | The start position within the text string (character index) |
| word_length | int | Word length (number of characters) |
| text | string | Word content |
Example Data
Using "Hello, nice to meet you" as an example:
[
  {"offset_ms": 0, "duration_ms": 350, "text_offset": 0, "word_length": 5, "text": "Hello"},
  {"offset_ms": 350, "duration_ms": 100, "text_offset": 5, "word_length": 1, "text": ","},
  {"offset_ms": 500, "duration_ms": 250, "text_offset": 7, "word_length": 4, "text": "nice"},
  {"offset_ms": 750, "duration_ms": 200, "text_offset": 12, "word_length": 2, "text": "to"},
  {"offset_ms": 950, "duration_ms": 350, "text_offset": 15, "word_length": 4, "text": "meet"},
  {"offset_ms": 1300, "duration_ms": 300, "text_offset": 20, "word_length": 3, "text": "you"}
]
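The relationship between text_offset, word_length, and text can be checked mechanically: slicing the sentence at each boundary should reproduce that boundary's word. The sketch below uses a hypothetical helper, `findMisalignedBoundaries`, and assumes the offsets count JavaScript string indices (UTF-16 code units). Whether that holds for every language is not stated in this guide.

```javascript
// Hypothetical check: return the boundaries whose offsets do not match their word.
// Assumption: text_offset and word_length are expressed in JavaScript string indices.
function findMisalignedBoundaries(text, boundaries) {
  return boundaries.filter(
    (b) => text.slice(b.text_offset, b.text_offset + b.word_length) !== b.text
  );
}

// The example data from this section
const exampleText = 'Hello, nice to meet you';
const exampleBoundaries = [
  { offset_ms: 0, duration_ms: 350, text_offset: 0, word_length: 5, text: 'Hello' },
  { offset_ms: 350, duration_ms: 100, text_offset: 5, word_length: 1, text: ',' },
  { offset_ms: 500, duration_ms: 250, text_offset: 7, word_length: 4, text: 'nice' },
  { offset_ms: 750, duration_ms: 200, text_offset: 12, word_length: 2, text: 'to' },
  { offset_ms: 950, duration_ms: 350, text_offset: 15, word_length: 4, text: 'meet' },
  { offset_ms: 1300, duration_ms: 300, text_offset: 20, word_length: 3, text: 'you' }
];

// Every boundary in the example lines up, so nothing is reported
console.log(findMisalignedBoundaries(exampleText, exampleBoundaries).length); // 0
```

A check like this is useful during development to catch offset mismatches before they show up as misplaced highlights.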
Implementing the Karaoke Effect
function setupKaraoke(audioElement, boundaries, text) {
  const updateHighlight = () => {
    const currentTimeMs = audioElement.currentTime * 1000;
    // Find the word currently being played
    const currentWord = boundaries.find((b, i) => {
      const nextOffset = boundaries[i + 1]?.offset_ms ?? Infinity;
      return currentTimeMs >= b.offset_ms && currentTimeMs < nextOffset;
    });
    if (currentWord) {
      // Highlight the current word
      highlightWord(text, currentWord.text_offset, currentWord.word_length);
    }
  };
  // Update the highlight position every 50ms
  const interval = setInterval(updateHighlight, 50);
  audioElement.addEventListener('ended', () => clearInterval(interval));
}

function highlightWord(text, offset, length) {
  const highlighted = document.createElement('span');
  highlighted.className = 'highlight';
  highlighted.textContent = text.substring(offset, offset + length);
  // Update the DOM (adjust to your actual UI framework).
  // Text nodes are used instead of innerHTML so the translated text is never parsed as HTML.
  document.getElementById('tts-text').replaceChildren(
    text.substring(0, offset),
    highlighted,
    text.substring(offset + length)
  );
}
CSS Style Reference
.highlight {
  background-color: #FFD700;
  color: #000;
  padding: 2px 4px;
  border-radius: 3px;
  transition: background-color 0.1s ease;
}
TTS Settings Management
Switching TTS Mode
You can switch between synchronous and asynchronous mode at any time while recording is in progress:
{
  "type": "voice-translation",
  "data": {
    "action": "tts_mode",
    "tts_mode": "async"
  }
}
Success response:
{
  "type": "voice-translation",
  "data": {
    "action": "tts_mode_changed",
    "tts_mode": "async"
  }
}
Dynamically Updating TTS Settings in Broadcast Mode
While a broadcast is in progress, you can update the TTS settings via the REST API:
curl -X PATCH "https://vas-poc.vurbo.ai/api/v1/broadcasts/{id}" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "tts_config": {
      "zh-TW": {"voice": "zh-TW-HsiaoChenNeural", "speaking_rate": 1.0},
      "ja-JP": {"voice": "ja-JP-NanamiNeural", "speaking_rate": 1.2}
    }
  }'
Clearing the TTS settings (pass null):
curl -X PATCH "https://vas-poc.vurbo.ai/api/v1/broadcasts/{id}" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "tts_config": null
  }'
TTS Error Handling
| Error Code | Description | Recommended Action |
|---|---|---|
| tts_not_enabled | TTS not enabled | Enable TTS at start |
| tts_segment_not_found | Specified sentence not found | Verify that the SID exists |
| tts_translation_not_found | Translation for the language is missing | Verify that the translation exists |
| translation_not_found | Translation not found | Verify that the translation is complete |
| tts_synthesis_failed | TTS synthesis failed | Retry later |
| tts_quota_exceeded | TTS usage limit reached | Retry later |
| invalid_data | Invalid mode | Use sync or async |
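On the client, the recommended actions in the table can be grouped into a few handling strategies. The mapping below is one possible sketch: the category names (`retry_later`, `fix_request`, `check_data`) and the `classifyTtsError` helper are hypothetical, and how the error code is delivered in a message is not covered here.

```javascript
// Hypothetical mapping from the documented error codes to a client-side strategy.
const TTS_ERROR_STRATEGIES = new Map([
  ['tts_not_enabled', 'fix_request'], // enable TTS in the start action
  ['invalid_data', 'fix_request'], // use sync or async
  ['tts_segment_not_found', 'check_data'], // verify that the SID exists
  ['tts_translation_not_found', 'check_data'], // verify that the translation exists
  ['translation_not_found', 'check_data'], // verify that the translation is complete
  ['tts_synthesis_failed', 'retry_later'],
  ['tts_quota_exceeded', 'retry_later']
]);

// Returns the strategy for a code, or 'unknown' for codes not in the table.
function classifyTtsError(code) {
  return TTS_ERROR_STRATEGIES.get(code) ?? 'unknown';
}
```

A caller could then branch on the result, for example scheduling a delayed retry only when `classifyTtsError(code) === 'retry_later'`.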
Related Reference Documents
- REST API - TTS Voices
- WebSocket - Voice Translation
- SSE - TTS Audio Streaming
- REST API - Broadcasts (tts_config settings)
Version: V1.5.7 Last Updated: 2026-05-20