Voice Translation

Overview

The VAS real-time voice translation service provides low-latency speech-to-text (STT) and real-time translation over WebSocket. The complete flow is:

1. The client captures audio from the microphone.
2. The audio stream is sent to the VAS server.
3. The server performs speech recognition and returns the transcript.
4. Multi-language translation is performed in parallel and the results are returned.
5. (Optional) TTS speech is synthesized to play back the translation results.

Use Cases

Scenario	Recording Type (`type`)
Meeting notes, interview records	`transcribe`
Bilingual real-time interpretation, cross-language conversation	`conversation`
Voice memos, quick notes	`record`
Lectures, presentations, live streaming	`broadcast` (see the Broadcast Guide)

Starting Voice Translation

After the connection is established, send the start action to launch a voice translation session.

Basic Request

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "transcription_languages": ["zh-TW"],
    "translation_languages": ["en-US"],
    "type": "transcribe",
    "audio_format": "pcm",
    "summary_template": "meeting"
  }
}

Core Parameters

Parameter	Type	Required	Description
`transcription_languages`	string[]	Yes	Speech recognition languages, as an array of 1 or 2 language codes (e.g., `["zh-TW"]`)
`translation_languages`	string[]	No	Target translation languages, as an array of language codes; an empty array or omitting it means no translation
`type`	string	Yes	Recording type: `transcribe`, `conversation`, `record`, `broadcast`
`audio_format`	string	No	Audio format: `pcm` (default) or `webm`
`summary_template`	string	Conditional	Summary template (required for the `transcribe` type, e.g., `meeting`, `interview`)
`realtime_translation`	boolean	No	Real-time translation mode (default `false`)
`recognition_mode`	string	No	`single` (single speaker, default) or `multi_speaker` (multi-speaker diarization); under `multi_speaker`, `transcription_languages` must contain exactly 1 language, otherwise a `diarization_multilang_conflict` error is returned and the session is refused
`name`	string	No	Initial default recording name (max 60 characters; the system may still override it; if not provided, a name such as `Transcription #1` is generated automatically)
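
The constraints listed in this guide can be checked on the client before `start` is sent. The following is a minimal sketch, not part of the VAS API: `buildStartMessage` and its error messages are hypothetical, and it encodes only the rules stated in this guide (including the two-language rule for conversation mode). The server remains the authority on validation.

```javascript
// Hypothetical client-side helper; the name and error messages are illustrative.
const RECORDING_TYPES = ['transcribe', 'conversation', 'record', 'broadcast'];

function buildStartMessage(options) {
  const {
    transcription_languages: langs,
    type,
    summary_template: template,
    recognition_mode: mode = 'single',
    name,
  } = options;

  if (!Array.isArray(langs) || langs.length < 1 || langs.length > 2) {
    throw new Error('transcription_languages must list 1 or 2 languages');
  }
  if (!RECORDING_TYPES.includes(type)) {
    throw new Error(`unknown recording type: ${type}`);
  }
  if (type === 'transcribe' && !template) {
    throw new Error('summary_template is required for the transcribe type');
  }
  if (type === 'conversation' && langs.length !== 2) {
    throw new Error('conversation mode requires exactly 2 transcription languages');
  }
  if (mode === 'multi_speaker' && langs.length !== 1) {
    // The server would refuse this with diarization_multilang_conflict.
    throw new Error('multi_speaker requires exactly 1 transcription language');
  }
  // Assumption: the 60-character limit is counted in Unicode code points.
  if (name !== undefined && [...name].length > 60) {
    throw new Error('name must be at most 60 characters');
  }

  return { type: 'voice-translation', data: { action: 'start', ...options } };
}
```

The returned object can be passed to `JSON.stringify` and sent over the open WebSocket.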

Request with TTS

To enable speech synthesis of the translation results, add the TTS-related parameters:

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "transcription_languages": ["zh-TW"],
    "translation_languages": ["en-US"],
    "type": "transcribe",
    "audio_format": "pcm",
    "summary_template": "meeting",
    "tts_enabled": true,
    "tts_language": "en-US",
    "tts_voice": "en-US-JennyNeural",
    "tts_mode": "sync"
  }
}

TTS Parameter	Description
`tts_enabled`	Whether to enable TTS (default `false`)
`tts_language`	TTS output language (must be in `translation_languages`)
`tts_voice`	TTS voice name (e.g., `en-US-JennyNeural`)
`tts_mode`	`sync` (automatic playback, default) or `async` (manual control)
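
The rule that `tts_language` must appear in `translation_languages` is easy to check before sending. A small sketch follows; `withTts` is a hypothetical helper name, and whether `tts_voice` may be omitted is not stated in this guide, so the helper simply passes through whatever it is given.

```javascript
// Hypothetical helper: returns a copy of the start request's data object with TTS fields added.
function withTts(data, { language, voice, mode = 'sync' }) {
  const targets = data.translation_languages || [];
  if (!targets.includes(language)) {
    throw new Error(`tts_language ${language} is not in translation_languages`);
  }
  if (mode !== 'sync' && mode !== 'async') {
    throw new Error(`unknown tts_mode: ${mode}`);
  }
  return {
    ...data,
    tts_enabled: true,
    tts_language: language,
    tts_voice: voice, // Dropped by JSON.stringify when undefined.
    tts_mode: mode,
  };
}
```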

Success Response

After a successful start, the server returns a session_started event:

{
  "type": "voice-translation",
  "data": {
    "action": "session_started",
    "session_id": "550e8400-e29b-41d4-a716-446655440000",
    "recording_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
    "recording_type": "transcribe",
    "recognition_mode": "single",
    "message": "Speech recognition started"
  }
}

Save the session_id and recording_id; they will be used in subsequent API operations.
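
One way to capture these identifiers is to route incoming messages by `data.action`. The sketch below is illustrative only: `createDispatcher` and the `session` object are hypothetical names, and the handler keys simply mirror the event names used in this guide.

```javascript
// Routes a raw WebSocket message to the handler registered for its data.action.
// Returns true when a handler ran, false otherwise.
function createDispatcher(handlers) {
  return function onMessage(raw) {
    let message;
    try {
      message = JSON.parse(raw);
    } catch {
      return false; // Not JSON; ignore.
    }
    if (!message || message.type !== 'voice-translation' || !message.data) {
      return false;
    }
    const action = message.data.action;
    if (!Object.prototype.hasOwnProperty.call(handlers, action)) {
      return false;
    }
    handlers[action](message.data);
    return true;
  };
}

const session = { sessionId: null, recordingId: null };
const onMessage = createDispatcher({
  session_started: (data) => {
    session.sessionId = data.session_id;
    session.recordingId = data.recording_id;
  },
});
// In a browser: ws.addEventListener('message', (event) => onMessage(event.data));
```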

Sending Audio

Once the session has started, continuously send audio data to the server.

Audio Format Requirements

PCM format (default, recommended):

Item	Specification
Sample rate	16000 Hz
Bit depth	16-bit
Channels	Mono
Byte order	Little-endian

WebM/Opus format: Any sample rate and number of channels; the server converts automatically.
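
For the PCM specification above, the data rate and message size follow from simple arithmetic. The sketch below works them out; the function names are illustrative, and the Base64 figure is the standard padded length, before the JSON envelope is added.

```javascript
const SAMPLE_RATE_HZ = 16000;
const BYTES_PER_SAMPLE = 2; // 16-bit
const CHANNELS = 1;         // mono

// Raw PCM bytes produced per second of audio.
const bytesPerSecond = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS; // 32000

// Duration of a chunk containing `sampleCount` mono samples.
function chunkDurationMs(sampleCount) {
  return (sampleCount * 1000) / SAMPLE_RATE_HZ;
}

// Length of the padded Base64 string for `byteLength` raw bytes.
function base64Length(byteLength) {
  return 4 * Math.ceil(byteLength / 3);
}

chunkDurationMs(4096);                   // 256 (ms)
base64Length(4096 * BYTES_PER_SAMPLE);   // 10924 (characters)
```

With a 4096-sample buffer, each `audio` message therefore carries about a quarter of a second of speech.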

Sending Format

Audio data must be Base64-encoded and sent with the audio action:

{
  "type": "voice-translation",
  "data": {
    "action": "audio",
    "payload": "Base64-encoded audio data..."
  }
}

Front-End Audio Capture Example

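// Assumes `ws` is an open WebSocket on which the session has already started,
// and that this code runs in a module or async function (it uses `await`).
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioContext = new AudioContext({ sampleRate: 16000 });
const source = audioContext.createMediaStreamSource(stream);
// ScriptProcessorNode is deprecated in favor of AudioWorklet; it is used here for brevity.
const processor = audioContext.createScriptProcessor(4096, 1, 1);

processor.onaudioprocess = (e) => {
  const float32 = e.inputBuffer.getChannelData(0);
  // Convert float samples in [-1, 1] to 16-bit PCM, clamping out-of-range values
  const int16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    int16[i] = Math.max(-32768, Math.min(32767, float32[i] * 32768));
  }
  // Base64-encode byte by byte (avoids spreading thousands of arguments) and send.
  // Typed arrays use the platform byte order, which is little-endian on common hardware.
  const bytes = new Uint8Array(int16.buffer);
  let binary = '';
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  ws.send(JSON.stringify({
    type: 'voice-translation',
    data: { action: 'audio', payload: btoa(binary) }
  }));
};

source.connect(processor);
// Connecting to the destination keeps the processor running in browsers that require it.
processor.connect(audioContext.destination);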

Receiving Recognition and Translation Results

The server pushes recognition and translation results via the result event.

Speech Recognition Result (Origin)

{
  "type": "voice-translation",
  "data": {
    "action": "result",
    "origin": {
      "sid": 1,
      "language": "zh-TW",
      "text": "你好，很高興認識你",
      "is_final": true,
      "speaker_id": "0",
      "detected_language": "zh-TW",
      "start_time": "00:05"
    }
  }
}

Field	Description
`sid`	Sentence number, incrementing from 1
`text`	The recognized text
`is_final`	`false` for intermediate results (which will be overwritten); `true` for final results
`speaker_id`	Speaker ID (meaningful in multi-speaker mode)
`start_time`	Sentence start time (format `mm:ss`)

Translation Result (Translations)

{
  "type": "voice-translation",
  "data": {
    "action": "result",
    "translations": {
      "en-US": {
        "sid": 1,
        "text": "Hello, nice to meet you",
        "is_final": true
      }
    }
  }
}

Important: origin and translations may arrive in the same result event, or they may be pushed separately. The front end should match them by sid.
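
Matching by `sid` can be sketched with a small in-memory store. `TranscriptStore` is a hypothetical class, not part of any VAS SDK; it accepts the `data` object of a `result` event whether it carries `origin`, `translations`, or both.

```javascript
// Hypothetical in-memory store that merges result events by sid.
class TranscriptStore {
  constructor() {
    this.sentences = new Map(); // sid -> { sid, origin, translations }
  }

  entryFor(sid) {
    if (!this.sentences.has(sid)) {
      this.sentences.set(sid, { sid, origin: null, translations: {} });
    }
    return this.sentences.get(sid);
  }

  // Apply the data object of one result event.
  applyResult(data) {
    if (data.origin) {
      // Intermediate results (is_final: false) are overwritten by later ones.
      this.entryFor(data.origin.sid).origin = data.origin;
    }
    for (const [language, translation] of Object.entries(data.translations || {})) {
      this.entryFor(translation.sid).translations[language] = translation;
    }
  }
}
```

A UI can then render `store.sentences` in `sid` order and restyle a sentence once its `is_final` flag becomes `true`.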

TTS Audio Ready (TTS Ready)

If TTS is enabled, you receive a tts_ready event after the translation completes:

{
  "type": "voice-translation",
  "data": {
    "action": "tts_ready",
    "sid": 1,
    "language": "en-US",
    "text": "Hello, nice to meet you",
    "audio": "Base64EncodedMP3...",
    "format": "mp3",
    "duration_ms": 2500,
    "boundaries": [...]
  }
}

The boundaries array contains word-boundary information, which can be used to implement karaoke-style synchronized highlighting.
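
Before playback, the Base64 `audio` field has to be decoded into bytes. The following is a minimal sketch with a hypothetical helper name; it assumes that `"mp3"` corresponds to the `audio/mpeg` MIME type and leaves `boundaries` untouched, because its element structure is not described here.

```javascript
// Decodes the Base64 audio field of a tts_ready event into raw bytes.
function decodeTtsAudio(data) {
  const binary = atob(data.audio);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  // Assumption: "mp3" maps to audio/mpeg; other formats are passed through as audio/<format>.
  const mimeType = data.format === 'mp3' ? 'audio/mpeg' : `audio/${data.format}`;
  return { sid: data.sid, bytes, mimeType };
}

// In a browser, the result can be played like this:
//   const { bytes, mimeType } = decodeTtsAudio(data);
//   new Audio(URL.createObjectURL(new Blob([bytes], { type: mimeType }))).play();
```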

Operation Controls

Pause

Temporarily stop speech recognition processing:

{
  "type": "voice-translation",
  "data": { "action": "pause" }
}

Resume

Resume paused speech recognition:

{
  "type": "voice-translation",
  "data": { "action": "resume" }
}

Set Recording Name

There are two ways to set the recording name:

Method 1: Specify the name parameter at start (initial default name)

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "transcription_languages": ["zh-TW"],
    "type": "transcribe",
    "summary_template": "meeting",
    "name": "Product Planning Meeting"
  }
}

This name is an initial default; when the session ends, the system may still override it based on the transcript content.

Method 2: Use set_name during recording (fixed name)

{
  "type": "voice-translation",
  "data": {
    "action": "set_name",
    "name": "Product Planning Meeting"
  }
}

A name set via set_name will not be overridden by the system.

If no name is set, the system automatically uses a "type + sequence number" format (e.g., Transcription #1, Broadcast #3). After the session ends, the system attempts to automatically generate a more meaningful name based on the transcript content (but it will not override a name set via set_name).

Switch Translation Language

Switch the target language during recording; the system automatically retranslates all previously translated sentences:

{
  "type": "voice-translation",
  "data": {
    "action": "switch_language",
    "translation_languages": ["ja-JP"]
  }
}

The system returns a language_switch_start event, followed by multiple batch_retranslation events, and finally a language_switch_done event, in order.
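
A client may want to show a "retranslating" indicator while this sequence is in progress. The sketch below tracks the sequence by action name only; the payload fields of these three events are not covered in this section, so none are assumed, and `createSwitchTracker` is a hypothetical name.

```javascript
// Tracks language_switch_start -> batch_retranslation* -> language_switch_done.
function createSwitchTracker() {
  let state = 'idle';
  let batches = 0;
  return {
    // Call with data.action for every incoming event; returns the current status.
    handle(action) {
      if (action === 'language_switch_start') {
        state = 'switching';
        batches = 0;
      } else if (action === 'batch_retranslation' && state === 'switching') {
        batches += 1;
      } else if (action === 'language_switch_done' && state === 'switching') {
        state = 'idle';
      }
      return { state, batches };
    },
  };
}
```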

Retranslate a Specific Sentence

After correcting a recognition error, you can retranslate a single sentence:

{
  "type": "voice-translation",
  "data": {
    "action": "retranslate",
    "sid": 1,
    "translation_languages": ["en-US"],
    "text": "Corrected source text"
  }
}

Advanced Features

Multi-Language Translation

Specify multiple target languages in translation_languages to translate into several languages at once:

{
  "transcription_languages": ["zh-TW"],
  "translation_languages": ["en-US", "ja-JP", "ko-KR"]
}

Translation results are returned together, keyed by language code.

Speaker Recognition (Multi Speaker)

Set recognition_mode to multi_speaker to enable speaker recognition:

{
  "recognition_mode": "multi_speaker"
}

Note: In multi_speaker mode, transcription_languages must contain exactly 1 language. If you provide multiple languages, you will receive a diarization_multilang_conflict error and the session will be refused.

Once enabled, the speaker_id in the recognition results automatically distinguishes different speakers (e.g., Guest-1, Guest-2). You can manage speakers with the following operations:

rename_speaker: Globally rename a speaker (e.g., change Guest-1 to Manager Wang)
reassign_speaker: Change the speaker identity of a single sentence
merge_speakers: Merge two speakers (assign all sentences from one to the other)

TTS Playback Control

In async mode, you can manually control TTS playback:

Play a specific sentence:

{
  "type": "voice-translation",
  "data": {
    "action": "tts_play",
    "sid": 5,
    "length": 3
  }
}

Stop playback:

{
  "type": "voice-translation",
  "data": { "action": "tts_stop" }
}

Switch playback mode:

{
  "type": "voice-translation",
  "data": {
    "action": "tts_mode",
    "tts_mode": "async"
  }
}

Mode	Behavior
`sync`	Automatically plays the latest `is_final=true` translation; the next sentence plays only after the previous one finishes
`async`	Manually controls playback via `tts_play`
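
If the client is responsible for playing the audio it receives, the one-at-a-time behavior described for `sync` can be reproduced with a small queue. This is a sketch under that assumption; `createPlaybackQueue` is hypothetical, and `play` is an injected function that returns a Promise which resolves when one item has finished playing.

```javascript
// Plays queued items strictly one after another.
function createPlaybackQueue(play) {
  const pending = [];
  let playing = false;

  async function drain() {
    if (playing) return; // A drain loop is already running.
    playing = true;
    while (pending.length > 0) {
      const item = pending.shift();
      try {
        await play(item);
      } catch (err) {
        console.error('playback failed', err); // Skip the item and continue.
      }
    }
    playing = false;
  }

  return {
    enqueue(item) {
      pending.push(item);
      return drain();
    },
    // Drops items that have not started yet (for example after tts_stop).
    clear() {
      pending.length = 0;
    },
  };
}
```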

Text Processing Parameters (Config)

Before start or during recording, you can use the config action to set the terminology list, fuzzy-term correction, and the translation dictionary:

{
  "type": "voice-translation",
  "data": {
    "action": "config",
    "terminology": {
      "zh-TW": [
        { "term": "語者分離", "boost": 1.5 },
        { "term": "CVD製程", "boost": 1.5 }
      ]
    },
    "translation_dict": [
      {
        "source": "語者分離",
        "translations": { "en-US": "Speaker Diarization" }
      }
    ]
  }
}

Setting	Description
`terminology`	Terminology list -- improves recognition accuracy for specific terms (up to 500 per language)
`fuzzy_correction`	Fuzzy-term correction -- automatically corrects homophone errors (usually does not need to be set manually; the system generates it automatically from `terminology`)
`translation_dict`	Translation dictionary -- ensures consistent translation of proper nouns (we recommend no more than 50 entries)

Recommended practice: Set only terminology; the system will automatically generate correction rules for homophones, near-homophones, and Traditional/Simplified Chinese variants of each term.
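
The size limits above can be checked before a `config` message is sent. The sketch below treats the 500-term limit as an error and the 50-entry recommendation as a warning; `checkConfig` is a hypothetical helper, and how the server reacts to oversized lists is not described here.

```javascript
const MAX_TERMS_PER_LANGUAGE = 500;  // Limit stated in the table above.
const RECOMMENDED_DICT_ENTRIES = 50; // Recommendation, not a hard limit.

// Takes the data object of a config message and reports limit violations.
function checkConfig(config) {
  const errors = [];
  const warnings = [];
  for (const [language, terms] of Object.entries(config.terminology || {})) {
    if (terms.length > MAX_TERMS_PER_LANGUAGE) {
      errors.push(`${language}: ${terms.length} terms exceeds ${MAX_TERMS_PER_LANGUAGE}`);
    }
  }
  const dictSize = (config.translation_dict || []).length;
  if (dictSize > RECOMMENDED_DICT_ENTRIES) {
    warnings.push(`translation_dict has ${dictSize} entries; ${RECOMMENDED_DICT_ENTRIES} or fewer is recommended`);
  }
  return { errors, warnings };
}
```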

Conversation Mode

Conversation mode lets two people who speak different languages hold a real-time interpreted conversation over a single WebSocket connection. The system automatically detects the language of each utterance, translates it into the other person's language, and returns the translation result as TTS audio. Language detection is fully automatic; no manual switching is required.

Start a Conversation

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "type": "conversation",
    "transcription_languages": ["zh-TW", "en-US"],
    "audio_format": "pcm",
    "realtime_translation": true,
    "tts_config": {
      "zh-TW": { "voice": "zh-TW-HsiaoChenNeural", "speaking_rate": 1.0 },
      "en-US": { "voice": "en-US-JennyNeural", "speaking_rate": 1.0 }
    }
  }
}

transcription_languages must contain exactly 2 languages
active_language is optional and specifies the initial preferred language (language detection is still automatic)
tts_config can be omitted; the system uses default voices automatically
tts_enabled defaults to true in conversation mode (it defaults to false for a regular start request); set it to false to return text translations only

Automatic Language Detection

The system automatically detects the language of each utterance. The origin.language of each utterance directly reflects the detected language, and the translation target is automatically the other of the two languages.

Note: You do not need to call switch_language manually to switch languages; the system detects them automatically. switch_language can still be used, but it only updates the internal preference state.
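
On the client, the same rule tells you which participant a translation or TTS clip is meant for. A minimal sketch, with a hypothetical function name:

```javascript
// Given the two conversation languages and the detected origin.language,
// returns the language the utterance is translated into.
function translationTarget(pair, detectedLanguage) {
  if (pair.length !== 2) {
    throw new Error('conversation mode needs exactly 2 languages');
  }
  const index = pair.indexOf(detectedLanguage);
  if (index === -1) {
    throw new Error(`${detectedLanguage} is not one of ${pair.join(', ')}`);
  }
  return pair[1 - index];
}

translationTarget(['zh-TW', 'en-US'], 'zh-TW'); // 'en-US'
translationTarget(['zh-TW', 'en-US'], 'en-US'); // 'zh-TW'
```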

Switching TTS Settings Mid-Conversation

During a conversation, you can use set_tts to toggle TTS on or off or to update voice settings:

{
  "type": "voice-translation",
  "data": {
    "action": "set_tts",
    "tts_enabled": true,
    "tts_config": {
      "en-US": { "voice": "en-US-GuyNeural", "speaking_rate": 1.2 }
    }
  }
}

On success, you receive a tts_updated event containing the full updated settings.

Complete Conversation Flow

1. start (conversation, zh-TW + en-US)
2. session_started
3. Send audio (Person A speaks Chinese)
4. result (origin.language: "zh-TW", translations: en-US)  ← automatic detection
5. tts_ready (en-US audio → played to Person B)
6. Send audio (Person B speaks English, no switching needed!)
7. result (origin.language: "en-US", translations: zh-TW)  ← automatic detection
8. tts_ready (zh-TW audio → played to Person A)
9. stop
10. task_complete

Stopping and Summary

Stop Recording

Send the stop action to end the voice translation session:

{
  "type": "voice-translation",
  "data": { "action": "stop" }
}

Event Flow

After stopping, the system performs the following steps in order and pushes events:

1. status -- confirms that speech recognition has stopped
2. (Background processing) -- uploads the audio file and saves the transcript
3. task_complete -- task processing is complete; the event includes the task_id
4. (If a summary template was set) -- the system automatically generates a summary

The task_complete event:

{
  "type": "voice-translation",
  "data": {
    "action": "task_complete",
    "task_id": "550e8400-e29b-41d4-a716-446655440000",
    "message": "Task processing complete"
  }
}

Save the task_id so you can later query the results via the Tasks API or load the history via the SSE API.
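
Stopping and waiting for the `task_id` can be wrapped in a single Promise. The sketch below assumes `socket` behaves like a browser WebSocket (`send`, `addEventListener`, `removeEventListener`); `stopAndWait` and the 30-second default timeout are illustrative choices, not values taken from the VAS specification.

```javascript
// Sends stop, then resolves with the task_id carried by task_complete.
function stopAndWait(socket, timeoutMs = 30000) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => {
      socket.removeEventListener('message', onMessage);
      reject(new Error('timed out waiting for task_complete'));
    }, timeoutMs);

    function onMessage(event) {
      let message;
      try {
        message = JSON.parse(event.data);
      } catch {
        return; // Ignore frames that are not JSON.
      }
      const data = message && message.data;
      if (!data || message.type !== 'voice-translation' || data.action !== 'task_complete') {
        return; // status and other events are skipped here.
      }
      clearTimeout(timer);
      socket.removeEventListener('message', onMessage);
      resolve(data.task_id);
    }

    socket.addEventListener('message', onMessage);
    socket.send(JSON.stringify({ type: 'voice-translation', data: { action: 'stop' } }));
  });
}
```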

Complete Flow Diagram

                    Prerequisites
                       │
        ┌──────────────┼──────────────┐
        │              │              │
   Get API Key     Get Ticket    Open WebSocket
        │              │              │
        └──────────────┼──────────────┘
                       │
             ┌─────────▼─────────┐
             │ config (optional) │  Set terminology / correction rules
             └─────────┬─────────┘
                       │
               ┌───────▼───────┐
               │     start     │  Start voice translation
               └───────┬───────┘
                       │
               session_started
                       │
          ┌────────────▼───────────┐
          │                        │
    ┌─────▼─────┐            ┌─────▼─────┐
    │   audio   │────────────│  result   │
    │ (ongoing) │ Send audio │  Results  │
    └─────┬─────┘            └─────┬─────┘
          │                        │
          │    ┌───────────────────┤
          │    │                   │
          │  origin           translations
          │  (source)          (translation)
          │                        │
          │            ┌───────────▼───────────┐
          │            │ tts_ready (optional)  │
          │            └───────────────────────┘
          │
    ┌─────▼─────┐    ┌──────────┐
    │  pause /  │◄──►│ resume   │  Operation controls
    │  resume   │    └──────────┘
    └─────┬─────┘
          │
    ┌─────▼─────┐
    │   stop    │  Stop translation
    └─────┬─────┘
          │
    ┌─────▼───────────┐
    │  task_complete  │  Task complete (with task_id)
    └─────┬───────────┘
          │
    ┌─────▼─────┐
    │  summary  │  Summary generation (if a template is set)
    └───────────┘

Related Documents

Document	Description
Authentication	Detailed description of API Key and Ticket authentication
Voice Translation Reference	Complete API specification for all actions
Response Events Reference	Reference for all response event formats
History and Playback	How to load history after stopping
TTS Speech Synthesis	Complete guide to the TTS feature
Speaker Management	Renaming, reassigning, and merging speakers

Version: V1.5.7 Last Updated: 2026-05-20
