Voice Translation
Table of Contents
- Overview
- Prerequisites
- Starting Voice Translation
- Sending Audio
- Receiving Recognition and Translation Results
- Operation Controls
- Advanced Features
- Conversation Mode
- Stopping and Summary
- Complete Flow Diagram
- Related Documents
Overview
The VAS real-time voice translation service provides low-latency speech-to-text (STT) and real-time translation over WebSocket. The complete flow is:
- The client captures audio from the microphone
- The audio stream is sent to the VAS server
- The server performs speech recognition and returns the transcript
- Multi-language translation is performed in parallel and the results are returned
- (Optional) TTS speech is synthesized to play back the translation results
Use Cases
| Scenario | Recording Type (type) |
|---|---|
| Meeting notes, interview records | transcribe |
| Bilingual real-time interpretation, cross-language conversation | conversation |
| Voice memos, quick notes | record |
| Lectures, presentations, live streaming | broadcast (see the Broadcast Guide) |
Prerequisites
1. Obtain an API Key
Make sure you have a valid API Key (format: vas_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx). For authentication details, see Authentication.
2. Obtain a Ticket
WebSocket connections are authenticated using a Ticket mechanism. First, exchange your API Key for a one-time Ticket:
curl -X POST "https://vas-poc.vurbo.ai/api/v1/auth/ticket" \
  -H "X-API-Key: vas_your_api_key_here"
Response:
{
  "ticket": "aBcDeFgHiJkLmNoPqRsTuVwXyZ012345",
  "expires_in": 60
}
Note: A Ticket is valid for 60 seconds and can be used only once.
3. Establish a WebSocket Connection
Place the Ticket in Sec-WebSocket-Protocol using the format ticket.{TICKET_VALUE}:
const ws = new WebSocket('wss://vas-poc.vurbo.ai/ws', [`ticket.${ticket}`]);
ws.onopen = () => {
  console.log('WebSocket connected');
};
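Steps 2 and 3 can be combined into one helper. The following is a minimal sketch, not an official client: the function name `openVasSocket` and its `fetchImpl` / `WebSocketImpl` options are hypothetical and exist only so the logic can be exercised without a network. The endpoint, the `X-API-Key` header, and the `ticket.{TICKET_VALUE}` subprotocol format are taken from the steps above.

```javascript
// Sketch: exchange an API key for a one-time ticket, then open the WebSocket.
// openVasSocket and its option names are hypothetical (not part of the VAS API).
const TICKET_URL = 'https://vas-poc.vurbo.ai/api/v1/auth/ticket';
const WS_URL = 'wss://vas-poc.vurbo.ai/ws';

async function openVasSocket(apiKey, options = {}) {
  // Dependencies are injectable; by default the platform globals are used.
  const fetchImpl = options.fetchImpl || globalThis.fetch;
  const WebSocketImpl = options.WebSocketImpl || globalThis.WebSocket;

  const res = await fetchImpl(TICKET_URL, {
    method: 'POST',
    headers: { 'X-API-Key': apiKey },
  });
  if (!res.ok) {
    throw new Error(`Ticket request failed with HTTP ${res.status}`);
  }
  const { ticket } = await res.json();

  // The ticket is single-use and valid for 60 seconds, so connect right away
  // and request a fresh ticket for every new connection.
  return new WebSocketImpl(WS_URL, [`ticket.${ticket}`]);
}
```

Because a Ticket cannot be reused, reconnect logic built on this sketch would call `openVasSocket` again rather than caching the ticket.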
4. Maintain a Heartbeat
We recommend sending a ping every 30 seconds to ensure the connection does not time out:
{
  "type": "health",
  "data": { "action": "ping" }
}
The server responds with pong.
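One way to schedule the heartbeat is a small timer helper. This is a sketch under assumptions: `startHeartbeat` is a hypothetical name, and `socket` stands for any object with a `send(string)` method, such as the WebSocket opened in the previous step. Only the ping message itself comes from this guide.

```javascript
// Sketch of a heartbeat helper; startHeartbeat is a hypothetical name.
const PING_MESSAGE = JSON.stringify({ type: 'health', data: { action: 'ping' } });

function startHeartbeat(socket, intervalMs = 30000) {
  // Send the documented ping message at a fixed interval.
  const timer = setInterval(() => socket.send(PING_MESSAGE), intervalMs);
  // Return a function that stops the heartbeat (call it when the socket closes).
  return () => clearInterval(timer);
}
```

A typical use would be `const stop = startHeartbeat(ws);` inside `ws.onopen`, with `stop()` called from `ws.onclose`.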
Starting Voice Translation
After the connection is established, send the start action to launch a voice translation session.
Basic Request
{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "transcription_languages": ["zh-TW"],
    "translation_languages": ["en-US"],
    "type": "transcribe",
    "audio_format": "pcm",
    "summary_template": "meeting"
  }
}
Core Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| transcription_languages | string[] | Yes | Speech recognition languages, up to 2 (e.g., ["zh-TW"]) |
| translation_languages | string[] | No | Target translation languages; an empty array or omitting the parameter means no translation |
| type | string | Yes | Recording type: transcribe, conversation, record, broadcast |
| audio_format | string | No | Audio format: pcm (default) or webm |
| summary_template | string | Conditional | Summary template (required for the transcribe type, e.g., meeting, interview) |
| realtime_translation | boolean | No | Real-time translation mode (default false) |
| recognition_mode | string | No | single (single speaker, default) or multi_speaker (multi-speaker diarization). Under multi_speaker, transcription_languages must contain exactly 1 language; otherwise a diarization_multilang_conflict error is returned and the session is refused |
| name | string | No | Initial default recording name (max 60 characters). The system may still override it; if not provided, a name such as Transcription #1 is generated automatically |
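The constraints in the table can be checked on the client before sending start, which avoids a round trip for requests the server would refuse. The sketch below is illustrative only: `validateStartData` is a hypothetical helper, the server remains the authority, and counting the 60-character limit in Unicode code points is an assumption.

```javascript
// Sketch: client-side pre-check of the `data` object of a start request.
// validateStartData is a hypothetical helper; the rules mirror the table above.
const RECORDING_TYPES = ['transcribe', 'conversation', 'record', 'broadcast'];
const AUDIO_FORMATS = ['pcm', 'webm'];

function validateStartData(data) {
  const errors = [];
  const langs = data.transcription_languages;
  const langsOk = Array.isArray(langs) && langs.length >= 1 && langs.length <= 2;

  if (!langsOk) {
    errors.push('transcription_languages must list 1 or 2 languages');
  }
  if (!RECORDING_TYPES.includes(data.type)) {
    errors.push(`type must be one of: ${RECORDING_TYPES.join(', ')}`);
  }
  if (data.type === 'transcribe' && !data.summary_template) {
    errors.push('summary_template is required for the transcribe type');
  }
  if (data.recognition_mode === 'multi_speaker' && langsOk && langs.length !== 1) {
    errors.push('multi_speaker requires exactly 1 transcription language');
  }
  if (data.audio_format !== undefined && !AUDIO_FORMATS.includes(data.audio_format)) {
    errors.push('audio_format must be pcm or webm');
  }
  // Assumption: the 60-character limit is counted in Unicode code points.
  if (typeof data.name === 'string' && Array.from(data.name).length > 60) {
    errors.push('name must be at most 60 characters');
  }
  return errors; // an empty array means no problem was found
}
```

An empty result does not guarantee acceptance; it only means none of the documented constraints was violated.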
Request with TTS
To enable speech synthesis of the translation results, add the TTS-related parameters:
{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "transcription_languages": ["zh-TW"],
    "translation_languages": ["en-US"],
    "type": "transcribe",
    "audio_format": "pcm",
    "summary_template": "meeting",
    "tts_enabled": true,
    "tts_language": "en-US",
    "tts_voice": "en-US-JennyNeural",
    "tts_mode": "sync"
  }
}
| TTS Parameter | Description |
|---|---|
| tts_enabled | Whether to enable TTS (default false) |
| tts_language | TTS output language (must be in translation_languages) |
| tts_voice | TTS voice name (e.g., en-US-JennyNeural) |
| tts_mode | sync (automatic playback, default) or async (manual control) |
Success Response
After a successful start, the server returns a session_started event:
{
  "type": "voice-translation",
  "data": {
    "action": "session_started",
    "session_id": "550e8400-e29b-41d4-a716-446655440000",
    "recording_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
    "recording_type": "transcribe",
    "recognition_mode": "single",
    "message": "Speech recognition started"
  }
}
Save the session_id and recording_id; they will be used in subsequent API operations.
Sending Audio
Once the session has started, continuously send audio data to the server.
Audio Format Requirements
PCM format (default, recommended):
| Item | Specification |
|---|---|
| Sample rate | 16000 Hz |
| Bit depth | 16-bit |
| Channels | Mono |
| Byte order | Little-endian |
WebM/Opus format: Any sample rate and number of channels; the server converts automatically.
Sending Format
Audio data must be Base64-encoded and sent with the audio action:
{
  "type": "voice-translation",
  "data": {
    "action": "audio",
    "payload": "Base64-encoded audio data..."
  }
}
Front-End Audio Capture Example
// Capture microphone audio and stream it as Base64-encoded 16-bit PCM.
// Run this inside an async function or an ES module (it uses await).
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
// Request a 16 kHz context so the samples match the required sample rate.
const audioContext = new AudioContext({ sampleRate: 16000 });
const source = audioContext.createMediaStreamSource(stream);
// ScriptProcessorNode is deprecated in favor of AudioWorklet, but it keeps the example short.
const processor = audioContext.createScriptProcessor(4096, 1, 1);

processor.onaudioprocess = (e) => {
  const float32 = e.inputBuffer.getChannelData(0);

  // Convert to 16-bit PCM (clamped to the signed 16-bit range)
  const int16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    int16[i] = Math.max(-32768, Math.min(32767, float32[i] * 32768));
  }

  // Base64-encode and send
  const base64 = btoa(String.fromCharCode(...new Uint8Array(int16.buffer)));
  ws.send(JSON.stringify({
    type: 'voice-translation',
    data: { action: 'audio', payload: base64 }
  }));
};

source.connect(processor);
processor.connect(audioContext.destination);
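The conversion step in the capture example can also be factored into a standalone function, which makes it easier to test and reuse (for example from an AudioWorklet). The sketch below is one possible variant, not the reference implementation: `floatToPcm16Base64` is a hypothetical name, it writes each sample through a DataView so the little-endian byte order is explicit, and it builds the Base64 input in slices so large buffers do not hit the argument limit of `String.fromCharCode`.

```javascript
// Sketch: convert Float32 samples in [-1, 1] to Base64-encoded 16-bit
// little-endian PCM. floatToPcm16Base64 is a hypothetical helper name.
function floatToPcm16Base64(samples) {
  const bytes = new Uint8Array(samples.length * 2);
  const view = new DataView(bytes.buffer);

  for (let i = 0; i < samples.length; i++) {
    // Scale to the signed 16-bit range and clamp; +1.0 saturates at 32767.
    const scaled = Math.round(samples[i] * 32768);
    const value = Math.max(-32768, Math.min(32767, scaled));
    view.setInt16(i * 2, value, true); // true = little-endian
  }

  // Build the binary string in slices to stay under engine argument limits.
  const CHUNK = 0x2000;
  let binary = '';
  for (let i = 0; i < bytes.length; i += CHUNK) {
    binary += String.fromCharCode.apply(null, bytes.subarray(i, i + CHUNK));
  }
  return btoa(binary);
}
```

At 16000 Hz, 16-bit mono, one second of audio is 32000 bytes before Base64 encoding, so a 4096-sample buffer carries 256 ms of audio in 8192 bytes.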
Receiving Recognition and Translation Results
The server pushes recognition and translation results via the result event.
Speech Recognition Result (Origin)
{
  "type": "voice-translation",
  "data": {
    "action": "result",
    "origin": {
      "sid": 1,
      "language": "zh-TW",
      "text": "你好,很高興認識你",
      "is_final": true,
      "speaker_id": "0",
      "detected_language": "zh-TW",
      "start_time": "00:05"
    }
  }
}
| Field | Description |
|---|---|
| sid | Sentence number, incrementing from 1 |
| text | The recognized text |
| is_final | false for intermediate results (which will be overwritten); true for final results |
| speaker_id | Speaker ID (meaningful in multi-speaker mode) |
| start_time | Sentence start time (format mm:ss) |
Translation Result (Translations)
{
  "type": "voice-translation",
  "data": {
    "action": "result",
    "translations": {
      "en-US": {
        "sid": 1,
        "text": "Hello, nice to meet you",
        "is_final": true
      }
    }
  }
}
Important: origin and translations may arrive in the same result event, or they may be pushed separately. The front end should match them by sid.
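Matching by sid can be done with a small keyed store. The following sketch assumes only the result shapes shown above; the `TranscriptStore` class and its method names are hypothetical. Each new origin or translation for a sid replaces the previous one, which also handles intermediate (is_final: false) results being overwritten.

```javascript
// Sketch: merge origin and translation results by sid.
// TranscriptStore is a hypothetical helper, not part of the VAS API.
class TranscriptStore {
  constructor() {
    this.sentences = new Map(); // sid -> { origin, translations }
  }

  // Return the entry for a sid, creating it on first use.
  entryFor(sid) {
    if (!this.sentences.has(sid)) {
      this.sentences.set(sid, { origin: null, translations: {} });
    }
    return this.sentences.get(sid);
  }

  // Apply the `data` object of one result event.
  applyResult(data) {
    if (data.origin) {
      // A later result for the same sid replaces the earlier (intermediate) one.
      this.entryFor(data.origin.sid).origin = data.origin;
    }
    for (const [lang, translation] of Object.entries(data.translations || {})) {
      this.entryFor(translation.sid).translations[lang] = translation;
    }
  }
}
```

A message handler would call `store.applyResult(msg.data)` whenever `msg.data.action === 'result'` and re-render the sentence with the affected sid.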
TTS Audio Ready (TTS Ready)
If TTS is enabled, you receive a tts_ready event after the translation completes:
{
  "type": "voice-translation",
  "data": {
    "action": "tts_ready",
    "sid": 1,
    "language": "en-US",
    "text": "Hello, nice to meet you",
    "audio": "Base64EncodedMP3...",
    "format": "mp3",
    "duration_ms": 2500,
    "boundaries": [...]
  }
}
The boundaries array contains Word Boundary information, which can be used to implement karaoke-style synchronized highlighting.
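Before playback, the Base64 audio field has to be decoded back into bytes. The sketch below shows that step only; `decodeTtsAudio` is a hypothetical helper, and it deliberately leaves boundaries untouched because the structure of its entries is not shown in this guide.

```javascript
// Sketch: decode the Base64 `audio` field of a tts_ready event into bytes.
// decodeTtsAudio is a hypothetical helper name.
function decodeTtsAudio(data) {
  const binary = atob(data.audio);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return {
    sid: data.sid,
    language: data.language,
    format: data.format,         // "mp3" in the example above
    durationMs: data.duration_ms,
    bytes,                       // raw encoded audio, ready to wrap in a Blob
  };
}
```

In a browser, the bytes could then be wrapped as `new Blob([bytes], { type: 'audio/mpeg' })` and played through an `Audio` element via `URL.createObjectURL`.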
Operation Controls
Pause
Temporarily stop speech recognition processing:
{
  "type": "voice-translation",
  "data": { "action": "pause" }
}
Resume
Resume paused speech recognition:
{
  "type": "voice-translation",
  "data": { "action": "resume" }
}
Set Recording Name
There are two ways to set the recording name:
Method 1: Specify the name parameter at start (initial default name)
{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "transcription_languages": ["zh-TW"],
    "type": "transcribe",
    "summary_template": "meeting",
    "name": "Product Planning Meeting"
  }
}
This name is an initial default; when the session ends, the system may still override it based on the transcript content.
Method 2: Use set_name during recording (fixed name)
{
  "type": "voice-translation",
  "data": {
    "action": "set_name",
    "name": "Product Planning Meeting"
  }
}
A name set via set_name will not be overridden by the system.
If no name is set, the system automatically uses a "type + sequence number" format (e.g., Transcription #1, Broadcast #3). After the session ends, the system attempts to automatically generate a more meaningful name based on the transcript content (but it will not override a name set via set_name).
Switch Translation Language
Switch the target language during recording; the system automatically retranslates all previously translated sentences:
{
  "type": "voice-translation",
  "data": {
    "action": "switch_language",
    "translation_languages": ["ja-JP"]
  }
}
The system returns a language_switch_start event, followed by multiple batch_retranslation events, and finally a language_switch_done event, in order.
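Because retranslation arrives in several batches, a client may want to show a "retranslating" indicator between the first and last event. The sketch below tracks only the three action names listed above; `createSwitchTracker` is a hypothetical helper, and the payloads of these events are not assumed.

```javascript
// Sketch: track the language-switch event sequence by action name.
// createSwitchTracker is a hypothetical helper; event payloads are not inspected.
function createSwitchTracker() {
  let state = 'idle'; // 'idle' or 'switching'
  let batches = 0;    // batch_retranslation events seen in the current switch

  return {
    handle(action) {
      if (action === 'language_switch_start') {
        state = 'switching';
        batches = 0;
      } else if (action === 'batch_retranslation' && state === 'switching') {
        batches += 1;
      } else if (action === 'language_switch_done') {
        state = 'idle';
      }
      return { state, batches };
    },
  };
}
```

The UI can display the indicator while `state` is `'switching'` and hide it once `language_switch_done` arrives.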
Retranslate a Specific Sentence
After correcting a recognition error, you can retranslate a single sentence:
{
  "type": "voice-translation",
  "data": {
    "action": "retranslate",
    "sid": 1,
    "translation_languages": ["en-US"],
    "text": "Corrected source text"
  }
}
Advanced Features
Multi-Language Translation
Specify multiple target languages in translation_languages to translate into several languages at once:
{
  "transcription_languages": ["zh-TW"],
  "translation_languages": ["en-US", "ja-JP", "ko-KR"]
}
Translation results are returned together, keyed by language code.
Speaker Recognition (Multi Speaker)
Set recognition_mode to multi_speaker to enable speaker recognition:
{
  "recognition_mode": "multi_speaker"
}
Note: In multi_speaker mode, transcription_languages must contain exactly 1 language. If you provide multiple languages, you will receive a diarization_multilang_conflict error and the session will be refused.
Once enabled, the speaker_id in the recognition results automatically distinguishes different speakers (e.g., Guest-1, Guest-2). You can manage speakers with the following operations:
- rename_speaker: Globally rename a speaker (e.g., change Guest-1 to Manager Wang)
- reassign_speaker: Change the speaker identity of a single sentence
- merge_speakers: Merge two speakers (assign all sentences from one to the other)
TTS Playback Control
In async mode, you can manually control TTS playback:
Play a specific sentence:
{
  "type": "voice-translation",
  "data": {
    "action": "tts_play",
    "sid": 5,
    "length": 3
  }
}
Stop playback:
{
  "type": "voice-translation",
  "data": { "action": "tts_stop" }
}
Switch playback mode:
{
  "type": "voice-translation",
  "data": {
    "action": "tts_mode",
    "tts_mode": "async"
  }
}
| Mode | Behavior |
|---|---|
| sync | Automatically plays the latest is_final=true translation; the next sentence plays only after the previous one finishes |
| async | Manually controls playback via tts_play |
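If the client is responsible for playing the received audio, the "next sentence plays only after the previous one finishes" rule can be modeled as a simple queue. This is a sketch under that assumption: `SequentialPlayer` is a hypothetical class, and `playFn(item, onEnded)` stands for whatever playback mechanism the application uses, calling `onEnded` when the clip finishes.

```javascript
// Sketch: play queued items one at a time, in arrival order.
// SequentialPlayer and playFn are hypothetical, not part of the VAS API.
class SequentialPlayer {
  constructor(playFn) {
    this.playFn = playFn; // playFn(item, onEnded) starts playback of one item
    this.queue = [];
    this.playing = false;
  }

  // Add an item; playback starts immediately if nothing is playing.
  enqueue(item) {
    this.queue.push(item);
    if (!this.playing) this.playNext();
  }

  // Start the next queued item, or go idle when the queue is empty.
  playNext() {
    const item = this.queue.shift();
    if (item === undefined) {
      this.playing = false;
      return;
    }
    this.playing = true;
    this.playFn(item, () => this.playNext());
  }

  // Drop everything that has not started yet (for example after tts_stop).
  clear() {
    this.queue.length = 0;
  }
}
```

Each tts_ready event would be passed to `enqueue`, so overlapping sentences never play on top of each other.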
Text Processing Parameters (Config)
Before start or during recording, you can use the config action to set the terminology list, fuzzy-term correction, and the translation dictionary:
{
  "type": "voice-translation",
  "data": {
    "action": "config",
    "terminology": {
      "zh-TW": [
        { "term": "語者分離", "boost": 1.5 },
        { "term": "CVD製程", "boost": 1.5 }
      ]
    },
    "translation_dict": [
      {
        "source": "語者分離",
        "translations": { "en-US": "Speaker Diarization" }
      }
    ]
  }
}
| Setting | Description |
|---|---|
| terminology | Terminology list -- improves recognition accuracy for specific terms (up to 500 per language) |
| fuzzy_correction | Fuzzy-term correction -- automatically corrects homophone errors (usually does not need to be set manually; the system generates it automatically from terminology) |
| translation_dict | Translation dictionary -- ensures consistent translation of proper nouns (we recommend no more than 50 entries) |
Recommended practice: Set only terminology; the system will automatically generate correction rules for homophones, near-homophones, and Traditional/Simplified Chinese variants of each term.
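The limits in the table can be enforced when the config message is assembled. The sketch below is illustrative: `buildConfigMessage` and its argument names are hypothetical, it treats the 500-term limit as a hard error and the 50-entry recommendation as a warning, and it follows the recommended practice of not setting fuzzy_correction manually.

```javascript
// Sketch: build a config message while checking the documented limits.
// buildConfigMessage is a hypothetical helper name.
const MAX_TERMS_PER_LANGUAGE = 500;
const RECOMMENDED_DICT_ENTRIES = 50;

function buildConfigMessage({ terminology = {}, translationDict = [] } = {}) {
  for (const [lang, terms] of Object.entries(terminology)) {
    if (terms.length > MAX_TERMS_PER_LANGUAGE) {
      throw new RangeError(`${lang}: ${terms.length} terms exceeds the limit of ${MAX_TERMS_PER_LANGUAGE}`);
    }
  }

  const warnings = [];
  if (translationDict.length > RECOMMENDED_DICT_ENTRIES) {
    warnings.push(`translation_dict has ${translationDict.length} entries; ${RECOMMENDED_DICT_ENTRIES} or fewer is recommended`);
  }

  return {
    message: {
      type: 'voice-translation',
      data: { action: 'config', terminology, translation_dict: translationDict },
    },
    warnings,
  };
}
```

The returned `message` can be sent with `ws.send(JSON.stringify(message))` before start or during recording.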
Conversation Mode
Conversation mode lets two people who speak different languages hold a real-time interpreted conversation over a single WebSocket connection. The system automatically detects the language of each utterance, translates it into the other person's language, and returns the translation result as TTS audio. Language detection is fully automatic; no manual switching is required.
Start a Conversation
{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "type": "conversation",
    "transcription_languages": ["zh-TW", "en-US"],
    "audio_format": "pcm",
    "realtime_translation": true,
    "tts_config": {
      "zh-TW": { "voice": "zh-TW-HsiaoChenNeural", "speaking_rate": 1.0 },
      "en-US": { "voice": "en-US-JennyNeural", "speaking_rate": 1.0 }
    }
  }
}
- transcription_languages must contain exactly 2 languages
- active_language is optional and specifies the initial preferred language (language detection is still automatic)
- tts_config can be omitted; the system uses default voices automatically
- tts_enabled defaults to true; set it to false to return text translations only
Automatic Language Detection
The system automatically detects the language of each utterance. The origin.language of each utterance directly reflects the detected language, and the translation target is automatically the other of the two languages.
Note: You do not need to call switch_language manually to switch languages; the system detects them automatically. switch_language can still be used, but it only updates the internal preference state.
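On the client, the same "other of the two languages" rule is useful for deciding which listener a result or TTS clip belongs to. A minimal sketch, where `translationTarget` is a hypothetical helper and the language pair is the transcription_languages array sent at start:

```javascript
// Sketch: given the configured pair and the detected language of an utterance,
// return the language it is translated into. translationTarget is hypothetical.
function translationTarget(languagePair, detectedLanguage) {
  if (!Array.isArray(languagePair) || languagePair.length !== 2) {
    throw new Error('conversation mode requires exactly 2 languages');
  }
  const [first, second] = languagePair;
  if (detectedLanguage === first) return second;
  if (detectedLanguage === second) return first;
  return null; // detected language is outside the configured pair
}
```

For example, with the pair `['zh-TW', 'en-US']`, a result whose origin.language is `'zh-TW'` maps to `'en-US'`.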
Switching TTS Settings Mid-Conversation
During a conversation, you can use set_tts to toggle TTS on or off or to update voice settings:
{
"type": "voice-translation",
"data": {
"action": "set_tts",
"tts_enabled": true,
"tts_config": {
"en-US": { "voice": "en-US-GuyNeural", "speaking_rate": 1.2 }
}
}
}
On success, you receive a tts_updated event containing the full updated settings.
Complete Conversation Flow
1. start (conversation, zh-TW + en-US)
2. session_started
3. Send audio (Person A speaks Chinese)
4. result (origin.language: "zh-TW", translations: en-US) ← automatic detection
5. tts_ready (en-US audio → played to Person B)
6. Send audio (Person B speaks English, no switching needed!)
7. result (origin.language: "en-US", translations: zh-TW) ← automatic detection
8. tts_ready (zh-TW audio → played to Person A)
9. stop
10. task_complete
Stopping and Summary
Stop Recording
Send the stop action to end the voice translation session:
{
  "type": "voice-translation",
  "data": { "action": "stop" }
}
Event Flow
After stopping, the system performs the following steps in order and pushes events:
- status -- confirms that speech recognition has stopped
- (Background processing) -- uploads the audio file and saves the transcript
- task_complete -- task processing is complete, including the task_id
{
  "type": "voice-translation",
  "data": {
    "action": "task_complete",
    "task_id": "550e8400-e29b-41d4-a716-446655440000",
    "message": "Task processing complete"
  }
}
- (If a summary template was set) -- the system automatically generates a summary
Save the task_id so you can later query the results via the Tasks API or load the history via the SSE API.
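Since task_complete arrives only after background processing, the client should keep the connection open after sending stop. The sketch below wraps that wait in a Promise; `waitForTaskComplete` is a hypothetical helper, the 60-second default timeout is an arbitrary assumption rather than a documented value, and `socket` is any object with `addEventListener` / `removeEventListener` for `message` events, such as a browser WebSocket.

```javascript
// Sketch: resolve with the task_id once task_complete arrives.
// waitForTaskComplete and the default timeout are assumptions.
function waitForTaskComplete(socket, timeoutMs = 60000) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => {
      socket.removeEventListener('message', onMessage);
      reject(new Error('Timed out waiting for task_complete'));
    }, timeoutMs);

    function onMessage(event) {
      let msg;
      try {
        msg = JSON.parse(event.data);
      } catch {
        return; // ignore frames that are not JSON
      }
      const data = msg && msg.data;
      if (msg.type === 'voice-translation' && data && data.action === 'task_complete') {
        clearTimeout(timer);
        socket.removeEventListener('message', onMessage);
        resolve(data.task_id);
      }
    }

    socket.addEventListener('message', onMessage);
  });
}
```

A caller would send the stop action, then `const taskId = await waitForTaskComplete(ws);` and close the socket afterwards.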
Complete Flow Diagram
               Prerequisites
                     │
      ┌──────────────┼──────────────┐
      │              │              │
 Get API Key    Get Ticket   Open WebSocket
      │              │              │
      └──────────────┼──────────────┘
                     │
           ┌─────────▼─────────┐
           │ config (optional) │  Set terminology / correction rules
           └─────────┬─────────┘
                     │
           ┌─────────▼─────────┐
           │       start       │  Start voice translation
           └─────────┬─────────┘
                     │
              session_started
                     │
        ┌────────────▼────────────┐
        │                         │
  ┌─────▼─────┐             ┌─────▼─────┐
  │   audio   │─────────────│  result   │
  │ (ongoing) │ Send audio  │  Results  │
  └─────┬─────┘             └─────┬─────┘
        │                         │
        │     ┌───────────────────┤
        │     │                   │
        │  origin           translations
        │ (source)          (translation)
        │                         │
        │              ┌──────────▼───────────┐
        │              │ tts_ready (optional) │
        │              └──────────────────────┘
        │
  ┌─────▼─────┐    ┌──────────┐
  │  pause /  │◄──►│  resume  │  Operation controls
  │  resume   │    └──────────┘
  └─────┬─────┘
        │
  ┌─────▼─────┐
  │   stop    │  Stop translation
  └─────┬─────┘
        │
  ┌─────▼──────────┐
  │ task_complete  │  Task complete (with task_id)
  └─────┬──────────┘
        │
  ┌─────▼─────┐
  │  summary  │  Summary generation (if a template is set)
  └───────────┘
Related Documents
| Document | Description |
|---|---|
| Authentication | Detailed description of API Key and Ticket authentication |
| Voice Translation Reference | Complete API specification for all actions |
| Response Events Reference | Reference for all response event formats |
| History and Playback | How to load history after stopping |
| TTS Speech Synthesis | Complete guide to the TTS feature |
| Speaker Management | Renaming, reassigning, and merging speakers |
Version: V1.5.7 Last Updated: 2026-05-20