Voice Translation

Table of Contents

  1. Overview
  2. Prerequisites
  3. Starting Voice Translation
  4. Sending Audio
  5. Receiving Recognition and Translation Results
  6. Operation Controls
  7. Advanced Features
  8. Conversation Mode
  9. Stopping and Summary
  10. Complete Flow Diagram
  11. Related Documents

Overview

The VAS real-time voice translation service provides low-latency speech-to-text (STT) and real-time translation over WebSocket. The complete flow is:

  1. The client captures audio from the microphone
  2. The audio stream is sent to the VAS server
  3. The server performs speech recognition and returns the transcript
  4. Multi-language translation is performed in parallel and the results are returned
  5. (Optional) TTS speech is synthesized to play back the translation results

Use Cases

| Scenario | Recording Type (type) |
|---|---|
| Meeting notes, interview records | transcribe |
| Bilingual real-time interpretation, cross-language conversation | conversation |
| Voice memos, quick notes | record |
| Lectures, presentations, live streaming | broadcast (see the Broadcast Guide) |

Prerequisites

1. Obtain an API Key

Make sure you have a valid API Key (format: vas_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx). For authentication details, see Authentication.

2. Obtain a Ticket

WebSocket connections are authenticated using a Ticket mechanism. First, exchange your API Key for a one-time Ticket:

curl -X POST "https://vas-poc.vurbo.ai/api/v1/auth/ticket" \
  -H "X-API-Key: vas_your_api_key_here"

Response:

{
  "ticket": "aBcDeFgHiJkLmNoPqRsTuVwXyZ012345",
  "expires_in": 60
}

Note: A Ticket is valid for 60 seconds and can be used only once.
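
Because a Ticket expires after 60 seconds and can be used only once, a client would typically request a fresh one immediately before each connection attempt. The following is a minimal sketch, assuming a runtime with a global fetch (browsers, Node 18+). The helper names buildTicketRequest, isTicketFresh, and requestTicket, and the 5-second safety margin, are illustrative choices and not part of the VAS API:

```javascript
const TICKET_URL = 'https://vas-poc.vurbo.ai/api/v1/auth/ticket';

// Build the arguments for fetch(); kept separate so they can be inspected.
function buildTicketRequest(apiKey) {
  return {
    url: TICKET_URL,
    options: { method: 'POST', headers: { 'X-API-Key': apiKey } },
  };
}

// True while the ticket is still inside its validity window.
// marginMs is a hypothetical safety margin for network latency.
function isTicketFresh(issuedAtMs, expiresInSec, nowMs = Date.now(), marginMs = 5000) {
  return nowMs - issuedAtMs < expiresInSec * 1000 - marginMs;
}

// Exchange the API Key for a one-time Ticket.
async function requestTicket(apiKey, fetchImpl = fetch) {
  const { url, options } = buildTicketRequest(apiKey);
  const res = await fetchImpl(url, options);
  if (!res.ok) throw new Error(`Ticket request failed: HTTP ${res.status}`);
  const { ticket, expires_in } = await res.json();
  return { ticket, expiresIn: expires_in, issuedAt: Date.now() };
}
```

Since the Ticket is single-use, a reconnect after a dropped connection would need to call requestTicket again rather than reuse the previous value.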

3. Establish a WebSocket Connection

Pass the Ticket as a WebSocket subprotocol (sent in the Sec-WebSocket-Protocol header) using the format ticket.{TICKET_VALUE}:

const ws = new WebSocket('wss://vas-poc.vurbo.ai/ws', [`ticket.${ticket}`]);

ws.onopen = () => {
  console.log('WebSocket connected');
};

4. Maintain a Heartbeat

We recommend sending a ping every 30 seconds to ensure the connection does not time out:

{
  "type": "health",
  "data": { "action": "ping" }
}

The server responds with pong.
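
One way to schedule the ping is a timer that is cleared when the connection closes. This is a sketch only: sendPing and startHeartbeat are hypothetical helper names, and ws is assumed to be any object with the standard WebSocket readyState and send members.

```javascript
const PING = JSON.stringify({ type: 'health', data: { action: 'ping' } });
const OPEN = 1; // numeric value of WebSocket.OPEN

// Send one ping if the socket is open; returns whether a ping was sent.
function sendPing(ws) {
  if (ws.readyState !== OPEN) return false;
  ws.send(PING);
  return true;
}

// Send a ping every intervalMs; returns a function that stops the timer.
function startHeartbeat(ws, intervalMs = 30000) {
  const timer = setInterval(() => sendPing(ws), intervalMs);
  return () => clearInterval(timer);
}
```

A typical use would be `const stop = startHeartbeat(ws); ws.onclose = stop;` so that the timer does not outlive the connection.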


Starting Voice Translation

After the connection is established, send the start action to launch a voice translation session.

Basic Request

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "transcription_languages": ["zh-TW"],
    "translation_languages": ["en-US"],
    "type": "transcribe",
    "audio_format": "pcm",
    "summary_template": "meeting"
  }
}

Core Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| transcription_languages | string[] | Yes | Speech recognition languages, up to 2 (e.g., ["zh-TW"]) |
| translation_languages | string[] | No | Target translation languages; an empty array or omitting the parameter means no translation |
| type | string | Yes | Recording type: transcribe, conversation, record, broadcast |
| audio_format | string | No | Audio format: pcm (default) or webm |
| summary_template | string | Conditional | Summary template (required for the transcribe type, e.g., meeting, interview) |
| realtime_translation | boolean | No | Real-time translation mode (default false) |
| recognition_mode | string | No | single (single speaker, default) or multi_speaker (multi-speaker diarization); under multi_speaker, transcription_languages must contain exactly 1 language, otherwise a diarization_multilang_conflict error is returned and the session is refused |
| name | string | No | Initial default recording name (max 60 characters; the system may still override it; if not provided, a name such as Transcription #1 is generated automatically) |
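
The constraints in the Core Parameters table can be checked on the client before sending start, which avoids a round trip for obviously invalid requests. The sketch below is illustrative: validateStartParams is a hypothetical helper, and it counts the name length in JavaScript string units, which may differ from how the server counts characters.

```javascript
const RECORDING_TYPES = ['transcribe', 'conversation', 'record', 'broadcast'];
const AUDIO_FORMATS = ['pcm', 'webm'];

// Returns a list of problems; an empty list means the payload satisfies
// the documented constraints for the start action.
function validateStartParams(data) {
  const problems = [];
  const langs = data.transcription_languages;
  const langsOk = Array.isArray(langs) && langs.length >= 1 && langs.length <= 2;
  if (!langsOk) {
    problems.push('transcription_languages must list 1 or 2 languages');
  }
  if (!RECORDING_TYPES.includes(data.type)) {
    problems.push(`type must be one of: ${RECORDING_TYPES.join(', ')}`);
  }
  if (data.audio_format !== undefined && !AUDIO_FORMATS.includes(data.audio_format)) {
    problems.push('audio_format must be pcm or webm');
  }
  if (data.type === 'transcribe' && !data.summary_template) {
    problems.push('summary_template is required for the transcribe type');
  }
  if (data.recognition_mode === 'multi_speaker' && langsOk && langs.length !== 1) {
    problems.push('multi_speaker requires exactly 1 transcription language');
  }
  if (typeof data.name === 'string' && data.name.length > 60) {
    problems.push('name must be at most 60 characters');
  }
  return problems;
}
```

The server remains the authority on validity; a check like this only catches mistakes early.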

Request with TTS

To enable speech synthesis of the translation results, add the TTS-related parameters:

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "transcription_languages": ["zh-TW"],
    "translation_languages": ["en-US"],
    "type": "transcribe",
    "audio_format": "pcm",
    "summary_template": "meeting",
    "tts_enabled": true,
    "tts_language": "en-US",
    "tts_voice": "en-US-JennyNeural",
    "tts_mode": "sync"
  }
}

| TTS Parameter | Description |
|---|---|
| tts_enabled | Whether to enable TTS (default false) |
| tts_language | TTS output language (must be in translation_languages) |
| tts_voice | TTS voice name (e.g., en-US-JennyNeural) |
| tts_mode | sync (automatic playback, default) or async (manual control) |
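
Because tts_language must be one of the translation_languages, it can help to add the TTS fields through a small helper that enforces that rule. This is a sketch under that assumption only; withTts is a hypothetical name, not a VAS function.

```javascript
// Returns a copy of the start payload with the TTS fields added.
// Throws if the TTS language is not among the translation targets.
function withTts(startData, { language, voice, mode = 'sync' }) {
  const targets = startData.translation_languages || [];
  if (!targets.includes(language)) {
    throw new Error(`tts_language ${language} is not in translation_languages`);
  }
  if (mode !== 'sync' && mode !== 'async') {
    throw new Error('tts_mode must be sync or async');
  }
  return {
    ...startData,
    tts_enabled: true,
    tts_language: language,
    tts_voice: voice,
    tts_mode: mode,
  };
}
```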

Success Response

After a successful start, the server returns a session_started event:

{
  "type": "voice-translation",
  "data": {
    "action": "session_started",
    "session_id": "550e8400-e29b-41d4-a716-446655440000",
    "recording_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
    "recording_type": "transcribe",
    "recognition_mode": "single",
    "message": "Speech recognition started"
  }
}

Save the session_id and recording_id; they will be used in subsequent API operations.
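
A minimal sketch of capturing these identifiers from an incoming message follows. parseSessionStarted is a hypothetical helper; the field names are taken from the session_started example.

```javascript
// Extract the identifiers from a raw WebSocket message, or return null
// if the message is not a session_started event.
function parseSessionStarted(raw) {
  let msg;
  try {
    msg = JSON.parse(raw);
  } catch {
    return null; // not valid JSON
  }
  const data = msg && msg.data;
  if (!data || msg.type !== 'voice-translation' || data.action !== 'session_started') {
    return null;
  }
  return {
    sessionId: data.session_id,
    recordingId: data.recording_id,
    recordingType: data.recording_type,
    recognitionMode: data.recognition_mode,
  };
}
```

In a browser this would typically be called from ws.onmessage with event.data.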


Sending Audio

Once the session has started, continuously send audio data to the server.

Audio Format Requirements

PCM format (default, recommended):

| Item | Specification |
|---|---|
| Sample rate | 16000 Hz |
| Bit depth | 16-bit |
| Channels | Mono |
| Byte order | Little-endian |

WebM/Opus format: Any sample rate and number of channels; the server converts automatically.
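
The PCM specification implies a fixed data rate of 16000 samples × 2 bytes = 32000 bytes per second. The arithmetic for one chunk can be worked out as follows; pcmChunkInfo is an illustrative helper, and the 4096-sample chunk size is only an example value.

```javascript
const SAMPLE_RATE = 16000;  // Hz
const BYTES_PER_SAMPLE = 2; // 16-bit
const CHANNELS = 1;         // mono

// Size and duration of one PCM chunk, plus its length once Base64-encoded.
function pcmChunkInfo(samples) {
  const bytes = samples * BYTES_PER_SAMPLE * CHANNELS;
  return {
    bytes,
    durationMs: (samples * 1000) / SAMPLE_RATE,
    base64Chars: 4 * Math.ceil(bytes / 3), // Base64 emits 4 characters per 3 bytes
  };
}

console.log(pcmChunkInfo(4096));
// { bytes: 8192, durationMs: 256, base64Chars: 10924 }
```

Base64 adds roughly one third to the payload size, which is worth accounting for when estimating upstream bandwidth.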

Sending Format

Audio data must be Base64-encoded and sent with the audio action:

{
  "type": "voice-translation",
  "data": {
    "action": "audio",
    "payload": "Base64-encoded audio data..."
  }
}

Front-End Audio Capture Example

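// Assumes `ws` is the open WebSocket from the Prerequisites and that the session has started.
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
// Request a 16 kHz context so the captured samples match the required PCM sample rate.
const audioContext = new AudioContext({ sampleRate: 16000 });
const source = audioContext.createMediaStreamSource(stream);
// ScriptProcessorNode is deprecated in favor of AudioWorklet; it is used here for brevity.
// Arguments: buffer size 4096 samples, 1 input channel, 1 output channel.
const processor = audioContext.createScriptProcessor(4096, 1, 1);

processor.onaudioprocess = (e) => {
  const float32 = e.inputBuffer.getChannelData(0);
  // Convert to 16-bit PCM (Int16Array uses the platform byte order,
  // which is little-endian on common browser platforms)
  const int16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    int16[i] = Math.max(-32768, Math.min(32767, float32[i] * 32768));
  }
  // Base64-encode and send
  const base64 = btoa(String.fromCharCode(...new Uint8Array(int16.buffer)));
  ws.send(JSON.stringify({
    type: 'voice-translation',
    data: { action: 'audio', payload: base64 }
  }));
};

// The processor only fires while it is connected to a destination.
source.connect(processor);
processor.connect(audioContext.destination);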
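
The conversion and encoding steps can also be isolated into pure functions, which makes the byte order explicit and keeps large buffers from exceeding the argument limit of String.fromCharCode. This is a sketch: floatToPcm16, bytesToBase64, and audioMessage are hypothetical names, and a global btoa is assumed (browsers, Node 16+).

```javascript
// Convert Float32 samples in [-1, 1] to 16-bit little-endian PCM bytes.
function floatToPcm16(float32) {
  const view = new DataView(new ArrayBuffer(float32.length * 2));
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i])); // clamp out-of-range samples
    const v = Math.round(s < 0 ? s * 32768 : s * 32767);
    view.setInt16(i * 2, v, true); // true = little-endian on every platform
  }
  return new Uint8Array(view.buffer);
}

// Base64-encode bytes in slices of 32768 to stay under call argument limits.
function bytesToBase64(bytes) {
  let binary = '';
  for (let i = 0; i < bytes.length; i += 0x8000) {
    binary += String.fromCharCode.apply(null, bytes.subarray(i, i + 0x8000));
  }
  return btoa(binary);
}

// Build the JSON text for one audio action.
function audioMessage(float32) {
  return JSON.stringify({
    type: 'voice-translation',
    data: { action: 'audio', payload: bytesToBase64(floatToPcm16(float32)) },
  });
}
```

Inside an onaudioprocess or AudioWorklet callback, the send would then reduce to `ws.send(audioMessage(samples))`.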

Receiving Recognition and Translation Results

The server pushes recognition and translation results via the result event.

Speech Recognition Result (Origin)

{
  "type": "voice-translation",
  "data": {
    "action": "result",
    "origin": {
      "sid": 1,
      "language": "zh-TW",
      "text": "你好,很高興認識你",
      "is_final": true,
      "speaker_id": "0",
      "detected_language": "zh-TW",
      "start_time": "00:05"
    }
  }
}

| Field | Description |
|---|---|
| sid | Sentence number, incrementing from 1 |
| text | The recognized text |
| is_final | false for intermediate results (which will be overwritten); true for final results |
| speaker_id | Speaker ID (meaningful in multi-speaker mode) |
| start_time | Sentence start time (format mm:ss) |

Translation Result (Translations)

{
  "type": "voice-translation",
  "data": {
    "action": "result",
    "translations": {
      "en-US": {
        "sid": 1,
        "text": "Hello, nice to meet you",
        "is_final": true
      }
    }
  }
}

Important: origin and translations may arrive in the same result event, or they may be pushed separately. The front end should match them by sid.
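
One way to do this matching is a store keyed by sid that accepts origin and translations in any order. The sketch below is illustrative; TranscriptStore is a hypothetical class, and it simply lets later events overwrite earlier ones for the same sid and language.

```javascript
// Collects sentences keyed by sid, merging origin and translations
// whether they arrive together or in separate result events.
class TranscriptStore {
  constructor() {
    this.sentences = new Map();
  }

  // Get the entry for a sid, creating it on first use.
  entry(sid) {
    if (!this.sentences.has(sid)) {
      this.sentences.set(sid, { sid, origin: null, translations: {} });
    }
    return this.sentences.get(sid);
  }

  // `data` is the data object of a message whose action is "result".
  apply(data) {
    if (data.origin) {
      // Intermediate results (is_final: false) are replaced by later ones.
      this.entry(data.origin.sid).origin = data.origin;
    }
    for (const [lang, t] of Object.entries(data.translations || {})) {
      this.entry(t.sid).translations[lang] = t;
    }
  }
}
```

A UI layer could then render store.sentences in sid order and show a sentence's translation as soon as it arrives.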

TTS Audio Ready (TTS Ready)

If TTS is enabled, you receive a tts_ready event after the translation completes:

{
  "type": "voice-translation",
  "data": {
    "action": "tts_ready",
    "sid": 1,
    "language": "en-US",
    "text": "Hello, nice to meet you",
    "audio": "Base64EncodedMP3...",
    "format": "mp3",
    "duration_ms": 2500,
    "boundaries": [...]
  }
}

The boundaries array contains Word Boundary information, which can be used to implement karaoke-style synchronized highlighting.
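
To play the returned audio in a browser, the Base64 payload can be wrapped in a data: URL. This is a sketch: ttsDataUrl is a hypothetical helper, and it only handles the mp3 format shown in the example. The structure of boundaries entries is not shown in this guide, so it is not touched here.

```javascript
const MIME_BY_FORMAT = { mp3: 'audio/mpeg' };

// Build a data: URL from the data object of a tts_ready event.
function ttsDataUrl(data) {
  const mime = MIME_BY_FORMAT[data.format];
  if (!mime) throw new Error(`Unsupported TTS format: ${data.format}`);
  return `data:${mime};base64,${data.audio}`;
}
```

In a browser, `new Audio(ttsDataUrl(msg.data)).play()` would then start playback, subject to the browser's autoplay policy.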


Operation Controls

Pause

Temporarily stop speech recognition processing:

{
  "type": "voice-translation",
  "data": { "action": "pause" }
}

Resume

Resume paused speech recognition:

{
  "type": "voice-translation",
  "data": { "action": "resume" }
}

Set Recording Name

There are two ways to set the recording name:

Method 1: Specify the name parameter at start (initial default name)

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "transcription_languages": ["zh-TW"],
    "type": "transcribe",
    "summary_template": "meeting",
    "name": "Product Planning Meeting"
  }
}

This name is an initial default; when the session ends, the system may still override it based on the transcript content.

Method 2: Use set_name during recording (fixed name)

{
  "type": "voice-translation",
  "data": {
    "action": "set_name",
    "name": "Product Planning Meeting"
  }
}

A name set via set_name will not be overridden by the system.

If no name is set, the system automatically uses a "type + sequence number" format (e.g., Transcription #1, Broadcast #3). After the session ends, the system attempts to automatically generate a more meaningful name based on the transcript content (but it will not override a name set via set_name).

Switch Translation Language

Switch the target language during recording; the system automatically retranslates all previously translated sentences:

{
  "type": "voice-translation",
  "data": {
    "action": "switch_language",
    "translation_languages": ["ja-JP"]
  }
}

The system returns a language_switch_start event, followed by multiple batch_retranslation events, and finally a language_switch_done event, in order.
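
A client may want to show a progress indicator while the retranslation runs. The sketch below tracks only the order of the three event names mentioned above; createSwitchTracker is a hypothetical helper, and the payload fields of these events are not assumed.

```javascript
// Tracks a language switch from language_switch_start to language_switch_done.
function createSwitchTracker() {
  const state = { switching: false, batches: 0 };
  return {
    state,
    // Feed every incoming data.action; returns true when a switch just finished.
    onAction(action) {
      if (action === 'language_switch_start') {
        state.switching = true;
        state.batches = 0;
      } else if (action === 'batch_retranslation' && state.switching) {
        state.batches += 1;
      } else if (action === 'language_switch_done' && state.switching) {
        state.switching = false;
        return true;
      }
      return false;
    },
  };
}
```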

Retranslate a Specific Sentence

After correcting a recognition error, you can retranslate a single sentence:

{
  "type": "voice-translation",
  "data": {
    "action": "retranslate",
    "sid": 1,
    "translation_languages": ["en-US"],
    "text": "Corrected source text"
  }
}
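
Since every control message in this section shares the same envelope, a thin builder can reduce repetition. This is a sketch; the buildAction function and the control object are hypothetical conveniences, and the fields mirror the JSON examples in this section.

```javascript
// Wrap an action name and its fields in the voice-translation envelope.
function buildAction(name, fields = {}) {
  return JSON.stringify({
    type: 'voice-translation',
    data: { action: name, ...fields },
  });
}

const control = {
  pause: () => buildAction('pause'),
  resume: () => buildAction('resume'),
  setName: (name) => buildAction('set_name', { name }),
  switchLanguage: (languages) =>
    buildAction('switch_language', { translation_languages: languages }),
  retranslate: (sid, text, languages) =>
    buildAction('retranslate', { sid, translation_languages: languages, text }),
};
```

For example, `ws.send(control.retranslate(1, 'Corrected source text', ['en-US']))` would produce the retranslate request shown above.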

Advanced Features

Multi-Language Translation

Specify multiple target languages in translation_languages to translate into several languages at once:

{
  "transcription_languages": ["zh-TW"],
  "translation_languages": ["en-US", "ja-JP", "ko-KR"]
}

Translation results are returned together, keyed by language code.

Speaker Recognition (Multi Speaker)

Set recognition_mode to multi_speaker to enable speaker recognition:

{
  "recognition_mode": "multi_speaker"
}

Note: In multi_speaker mode, transcription_languages must contain exactly 1 language. If you provide multiple languages, you will receive a diarization_multilang_conflict error and the session will be refused.

Once enabled, the speaker_id in the recognition results automatically distinguishes different speakers (e.g., Guest-1, Guest-2). You can manage speakers with the following operations:

  • rename_speaker: Globally rename a speaker (e.g., change Guest-1 to Manager Wang)
  • reassign_speaker: Change the speaker identity of a single sentence
  • merge_speakers: Merge two speakers (assign all sentences from one to the other)

TTS Playback Control

In async mode, you can manually control TTS playback:

Play a specific sentence:

{
  "type": "voice-translation",
  "data": {
    "action": "tts_play",
    "sid": 5,
    "length": 3
  }
}

Stop playback:

{
  "type": "voice-translation",
  "data": { "action": "tts_stop" }
}

Switch playback mode:

{
  "type": "voice-translation",
  "data": {
    "action": "tts_mode",
    "tts_mode": "async"
  }
}

| Mode | Behavior |
|---|---|
| sync | Automatically plays the latest is_final=true translation; the next sentence plays only after the previous one finishes |
| async | Manually controls playback via tts_play |

Text Processing Parameters (Config)

Before start or during recording, you can use the config action to set the terminology list, fuzzy-term correction, and the translation dictionary:

{
  "type": "voice-translation",
  "data": {
    "action": "config",
    "terminology": {
      "zh-TW": [
        { "term": "語者分離", "boost": 1.5 },
        { "term": "CVD製程", "boost": 1.5 }
      ]
    },
    "translation_dict": [
      {
        "source": "語者分離",
        "translations": { "en-US": "Speaker Diarization" }
      }
    ]
  }
}

| Setting | Description |
|---|---|
| terminology | Terminology list -- improves recognition accuracy for specific terms (up to 500 per language) |
| fuzzy_correction | Fuzzy-term correction -- automatically corrects homophone errors (usually does not need to be set manually; the system generates it automatically from terminology) |
| translation_dict | Translation dictionary -- ensures consistent translation of proper nouns (we recommend no more than 50 entries) |

Recommended practice: Set only terminology; the system will automatically generate correction rules for homophones, near-homophones, and Traditional/Simplified Chinese variants of each term.
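
The documented limits can be checked before sending config. In this sketch, checkConfig is a hypothetical helper that treats the 500-term limit as an error and the 50-entry dictionary recommendation as a warning; how the server itself reacts to oversized lists is not specified here.

```javascript
const MAX_TERMS_PER_LANGUAGE = 500;  // documented limit per language
const RECOMMENDED_DICT_ENTRIES = 50; // documented recommendation

// Returns errors for exceeded limits and warnings for exceeded recommendations.
function checkConfig(config) {
  const errors = [];
  const warnings = [];
  for (const [lang, terms] of Object.entries(config.terminology || {})) {
    if (terms.length > MAX_TERMS_PER_LANGUAGE) {
      errors.push(`${lang}: ${terms.length} terms exceeds the limit of ${MAX_TERMS_PER_LANGUAGE}`);
    }
  }
  const dictSize = (config.translation_dict || []).length;
  if (dictSize > RECOMMENDED_DICT_ENTRIES) {
    warnings.push(`translation_dict has ${dictSize} entries; ${RECOMMENDED_DICT_ENTRIES} or fewer is recommended`);
  }
  return { errors, warnings };
}
```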


Conversation Mode

Conversation mode lets two people who speak different languages hold a real-time interpreted conversation over a single WebSocket connection. The system automatically detects the language of each utterance, translates it into the other person's language, and returns the translation result as TTS audio. Language detection is fully automatic; no manual switching is required.

Start a Conversation

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "type": "conversation",
    "transcription_languages": ["zh-TW", "en-US"],
    "audio_format": "pcm",
    "realtime_translation": true,
    "tts_config": {
      "zh-TW": { "voice": "zh-TW-HsiaoChenNeural", "speaking_rate": 1.0 },
      "en-US": { "voice": "en-US-JennyNeural", "speaking_rate": 1.0 }
    }
  }
}

  • transcription_languages must contain exactly 2 languages
  • active_language is optional and specifies the initial preferred language (language detection is still automatic)
  • tts_config can be omitted; the system uses default voices automatically
  • tts_enabled defaults to true; set it to false to return text translations only

Automatic Language Detection

The system automatically detects the language of each utterance. The origin.language of each utterance directly reflects the detected language, and the translation target is automatically the other of the two languages.

Note: You do not need to call switch_language manually to switch languages; the system detects them automatically. switch_language can still be used, but it only updates the internal preference state.
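
On the client, the same "other language" rule can be used to decide which participant a translation or TTS clip is meant for. This is a sketch; translationTarget is a hypothetical helper that mirrors the rule described above.

```javascript
// Given the detected language and the configured pair, return the
// language the utterance is translated into (the other one of the pair).
function translationTarget(detected, pair) {
  if (pair.length !== 2) {
    throw new Error('conversation mode uses exactly 2 languages');
  }
  const [a, b] = pair;
  if (detected === a) return b;
  if (detected === b) return a;
  return null; // detected language is outside the configured pair
}
```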

Switching TTS Settings Mid-Conversation

During a conversation, you can use set_tts to toggle TTS on or off or to update voice settings:

{
  "type": "voice-translation",
  "data": {
    "action": "set_tts",
    "tts_enabled": true,
    "tts_config": {
      "en-US": { "voice": "en-US-GuyNeural", "speaking_rate": 1.2 }
    }
  }
}

On success, you receive a tts_updated event containing the full updated settings.

Complete Conversation Flow

1. start (conversation, zh-TW + en-US)
2. session_started
3. Send audio (Person A speaks Chinese)
4. result (origin.language: "zh-TW", translations: en-US)  ← automatic detection
5. tts_ready (en-US audio → played to Person B)
6. Send audio (Person B speaks English, no switching needed!)
7. result (origin.language: "en-US", translations: zh-TW)  ← automatic detection
8. tts_ready (zh-TW audio → played to Person A)
9. stop
10. task_complete

Stopping and Summary

Stop Recording

Send the stop action to end the voice translation session:

{
  "type": "voice-translation",
  "data": { "action": "stop" }
}

Event Flow

After stopping, the system performs the following steps in order and pushes events:

  1. status -- confirms that speech recognition has stopped
  2. (Background processing) -- uploads the audio file and saves the transcript
  3. task_complete -- task processing is complete, including the task_id
{
  "type": "voice-translation",
  "data": {
    "action": "task_complete",
    "task_id": "550e8400-e29b-41d4-a716-446655440000",
    "message": "Task processing complete"
  }
}
  4. (If a summary template was set) -- the system automatically generates a summary

Save the task_id so you can later query the results via the Tasks API or load the history via the SSE API.
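
The stop sequence can be followed on the client with a small state tracker. This sketch relies only on the event names listed above; createStopTracker and its phase names are hypothetical, and the payload of the status event is not assumed.

```javascript
// Phases: recording -> stopping -> processing -> complete
function createStopTracker() {
  let phase = 'recording';
  let taskId = null;
  return {
    get phase() { return phase; },
    get taskId() { return taskId; },
    // Call right after sending the stop action.
    stopRequested() {
      if (phase === 'recording') phase = 'stopping';
    },
    // Feed the data object of every incoming voice-translation message.
    onEvent(data) {
      if (phase === 'stopping' && data.action === 'status') {
        phase = 'processing'; // recognition stopped, background work running
      } else if (data.action === 'task_complete') {
        phase = 'complete';
        taskId = data.task_id;
      }
    },
  };
}
```

Closing the WebSocket only after the tracker reaches the complete phase helps ensure the task_id is not missed.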


Complete Flow Diagram

                    Prerequisites
                       │
        ┌──────────────┼──────────────┐
        │              │              │
   Get API Key     Get Ticket    Open WebSocket
        │              │              │
        └──────────────┼──────────────┘
                       │
             ┌─────────▼─────────┐
             │ config (optional) │  Set terminology / correction rules
             └─────────┬─────────┘
                       │
               ┌───────▼───────┐
               │     start     │  Start voice translation
               └───────┬───────┘
                       │
               session_started
                       │
          ┌────────────▼────────────┐
          │                         │
    ┌─────▼─────┐            ┌─────▼─────┐
    │   audio   │────────────│   result  │
    │ (ongoing) │  Send audio │  Results  │
    └─────┬─────┘            └─────┬─────┘
          │                        │
          │    ┌───────────────────┤
          │    │                   │
          │  origin           translations
          │  (source)          (translation)
          │                        │
          │             ┌──────────▼───────────┐
          │             │ tts_ready (optional) │
          │             └──────────────────────┘
          │
    ┌─────▼─────┐    ┌──────────┐
    │  pause /  │◄──►│ resume   │  Operation controls
    │  resume   │    └──────────┘
    └─────┬─────┘
          │
    ┌─────▼─────┐
    │   stop    │  Stop translation
    └─────┬─────┘
          │
    ┌─────▼──────────┐
    │ task_complete  │  Task complete (with task_id)
    └─────┬──────────┘
          │
    ┌─────▼─────┐
    │  summary  │  Summary generation (if a template is set)
    └───────────┘

Related Documents

| Document | Description |
|---|---|
| Authentication | Detailed description of API Key and Ticket authentication |
| Voice Translation Reference | Complete API specification for all actions |
| Response Events Reference | Reference for all response event formats |
| History and Playback | How to load history after stopping |
| TTS Speech Synthesis | Complete guide to the TTS feature |
| Speaker Management | Renaming, reassigning, and merging speakers |

Version: V1.5.7 Last Updated: 2026-05-20

Copyright © 2026