WebSocket API

Voice Translation

Overview

This page lists every action available under the `voice-translation` type. For connection and authentication, see Connection and Authentication; for response event formats, see Response Events.

start - Start Voice Translation
config - Configure Terminology / Correction Rules
audio - Send Audio
pause - Pause Translation
resume - Resume Translation
stop - Stop Translation
retranslate - Retranslate a Single Sentence
switch_language - Switch Language
set_name - Set Recording Name
rename_speaker - Globally Rename a Speaker
reassign_speaker - Change the Speaker of a Single Sentence
merge_speakers - Merge Speakers
tts_play - Play TTS
tts_stop - Stop TTS
tts_mode - Switch TTS Mode
set_tts - Two-Way Translation TTS Settings
start_speaking - Start Speaking (Manual Mode)
stop_speaking - Stop Speaking (Manual Mode)
switch_conversation_mode - Switch Conversation Mode
set_speaker_language - Set Speaker Language
broadcast_go_live - Switch to the Live Phase
broadcast_announcement - Send an Announcement
set_standby_message - Set the Standby Phase Message

start - Start Voice Translation

Description

Start a new voice translation session and begin processing audio according to the configured parameters.

Request Parameters

Parameter	Type	Required	Description
`action`	string	Yes	Fixed value `start`
`transcription_languages`	string[]	Yes	Speech recognition languages (up to 2); can be omitted for the `broadcast` type, which reads them from the broadcast settings
`translation_languages`	string[]	No	Translation target languages (empty = no translation)
`realtime_translation`	boolean	No	Real-time translation mode (default `false`)
`recognition_mode`	string	No	Recognition mode: `single` (single speaker, default), `multi_speaker` (multiple speakers); under `multi_speaker`, `transcription_languages` must contain exactly 1 language, otherwise a `diarization_multilang_conflict` error is returned and the session is refused
`type`	string	Yes	Recording type: `transcribe`, `conversation`, `record`, `broadcast`
`audio_format`	string	No	Audio format: `pcm` (default), `webm`
`summary_template`	string	Conditional	Summary template (required for `transcribe`, optional for `conversation`/`broadcast`)
`options`	object	No	Speech recognition options
`tts_enabled`	boolean	No	Whether to enable TTS speech synthesis (default `false`)
`tts_language`	string	No	TTS output language (must be in `translation_languages`)
`tts_voice`	string	No	TTS voice name (e.g. `en-US-JennyNeural`)
`tts_mode`	string	No	TTS playback mode: `sync` (synchronous, default), `async` (asynchronous)
`broadcast_token`	string	Conditional	Broadcast token (required for `broadcast` type, obtained from the REST API)
`active_language`	string	No	Initial active language in two-way translation mode (default `transcription_languages[0]`)
`tts_config`	object	No	Multi-language TTS settings (broadcast / two-way translation mode)
`broadcast_phase`	string	No	Initial broadcast phase: `standby`, `live` (default)
`standby_message`	string	No	Message viewers see during the standby phase (default: "Preparing, please wait...")
`name`	string	No	Initial default recording name (max 60 characters; the system may still override it; if not provided, one is generated automatically, e.g. `Transcription #1`)
`summary_language`	string	No	Summary output language (defaults to the recognition language when not specified; in broadcast mode it is read automatically from the channel settings)
`summary_mode`	string	No	Summary mode enum: `builtin` (apply the built-in template, default) / `custom` (the customer prompt fully replaces the default). When omitted, `builtin` is inferred automatically
`summary_prompt`	string	Conditional	Required in `custom` mode; treated as supplementary instructions in `builtin` mode. Max 2000 characters
`summary_prompt_slug`	string	Conditional	Required in `custom` mode; must not be provided in `builtin` mode. The customer's own identifier (max 64 characters, Unicode, no control characters; passed through and stored in the backend record for historical lookup)
`summary_plain_text`	boolean	No	Request plain-text summary output (default `false`; when enabled, the backend performs Markdown post-processing)
`speakers`	object[]	Conditional	Speaker language settings for two-way translation mode (required for `conversation` type, exactly 2 entries, see below)
`conversation_mode`	string	No	Two-way conversation mode: `auto` (automatic detection, default), `manual` (manual PTT)
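
The conditional rules in the table above can be checked on the client before sending `start`. The following is a minimal sketch, not the server's validation logic: the function name `validate_start` and the wording of the messages are illustrative assumptions, and only rules stated in the table are covered.

```python
from typing import Any

RECORDING_TYPES = {"transcribe", "conversation", "record", "broadcast"}


def validate_start(data: dict[str, Any]) -> list[str]:
    """Return rule violations for the `data` object of a start message (empty = OK)."""
    problems: list[str] = []
    rec_type = data.get("type")
    if rec_type not in RECORDING_TYPES:
        problems.append("type must be transcribe, conversation, record or broadcast")

    # Broadcast reads its languages from the channel settings, so the
    # language checks only apply to the other recording types.
    if rec_type != "broadcast":
        langs = data.get("transcription_languages") or []
        if not langs:
            problems.append("transcription_languages is required")
        elif len(langs) > 2:
            problems.append("at most 2 transcription_languages are allowed")
        if data.get("recognition_mode") == "multi_speaker" and len(langs) != 1:
            problems.append("multi_speaker requires exactly 1 transcription language")
        tts_language = data.get("tts_language")
        if tts_language and tts_language not in (data.get("translation_languages") or []):
            problems.append("tts_language must be in translation_languages")

    if rec_type == "transcribe" and not data.get("summary_template"):
        problems.append("summary_template is required for transcribe")
    if rec_type == "broadcast" and not data.get("broadcast_token"):
        problems.append("broadcast_token is required for broadcast")
    if rec_type == "conversation" and not data.get("speakers"):
        problems.append("speakers is required for conversation")

    name = data.get("name")
    if name is not None and len(name) > 60:
        problems.append("name must be at most 60 characters")
    return problems
```

An empty list means the payload passes these local checks; the server remains the authority and may still reject the request.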

Request Example (Basic)

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "transcription_languages": ["zh-TW"],
    "translation_languages": ["en-US"],
    "realtime_translation": false,
    "type": "transcribe",
    "audio_format": "pcm",
    "summary_template": "meeting",
    "options": {
      "speaking_speed": "normal",
      "segmentation_mode": "auto",
      "profanity_handling": "mask"
    }
  }
}

Request Example (Initial Default Name)

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "transcription_languages": ["zh-TW"],
    "translation_languages": ["en-US"],
    "type": "transcribe",
    "audio_format": "pcm",
    "summary_template": "meeting",
    "name": "Product Planning Meeting"
  }
}

Recording Name Rules

Scenario	Name	name_source	Overridden by system?
`start` with a `name` parameter	Initial default name	`default`	Yes
`start` without a `name`	Auto-generated (e.g. `Transcription #1`, `Broadcast #3`)	`default`	Yes
Set via `set_name`	Name explicitly set by the user	`user`	No
Auto-generated by the system after the session ends	Summary name generated from the transcript content	`llm`	—

Note: The `name` passed in `start` is only an initial default name; the system may still override it when the session ends. If you need a fixed name, use `set_name`.

Default name formats (fixed English):

Recording Type	Default Name Format
`transcribe`	`Transcription #N`
`conversation`	`Conversation #N`
`record`	`Recording #N`
`broadcast`	`Broadcast #N`

N is the sequential number of recordings of the same type for that user. Name priority: user > llm > default. Once the user sets a name, the system will not override it when the session ends.
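
The default name format and the `user > llm > default` priority can be modelled as follows. This is a sketch for client-side display logic only; `default_name` and `resolve_name` are hypothetical helpers, and the real naming happens on the server.

```python
# Prefixes from the default name format table above.
DEFAULT_PREFIX = {
    "transcribe": "Transcription",
    "conversation": "Conversation",
    "record": "Recording",
    "broadcast": "Broadcast",
}

# Higher number wins; mirrors "user > llm > default".
PRIORITY = {"default": 0, "llm": 1, "user": 2}


def default_name(recording_type: str, n: int) -> str:
    """Build the fixed-English default name for the n-th recording of a type."""
    return f"{DEFAULT_PREFIX[recording_type]} #{n}"


def resolve_name(current: tuple[str, str], candidate: tuple[str, str]) -> tuple[str, str]:
    """Each value is (name, name_source). Keep whichever source ranks higher.

    A candidate of equal rank replaces the current value, so a second
    set_name call still takes effect.
    """
    if PRIORITY[candidate[1]] >= PRIORITY[current[1]]:
        return candidate
    return current
```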

Request Example (with TTS)

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "transcription_languages": ["zh-TW"],
    "translation_languages": ["en-US"],
    "realtime_translation": true,
    "type": "transcribe",
    "summary_template": "meeting",
    "tts_enabled": true,
    "tts_language": "en-US",
    "tts_voice": "en-US-JennyNeural",
    "tts_mode": "sync"
  }
}

Request Example (Two-Way Translation Mode - Automatic Detection)

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "type": "conversation",
    "transcription_languages": ["zh-TW", "en-US"],
    "active_language": "zh-TW",
    "audio_format": "pcm",
    "realtime_translation": true,
    "speakers": [
      { "id": 1, "language": "zh-TW" },
      { "id": 2, "language": "en-US" }
    ],
    "tts_config": {
      "zh-TW": { "voice": "zh-TW-HsiaoChenNeural", "speaking_rate": 1.0 },
      "en-US": { "voice": "en-US-JennyNeural", "speaking_rate": 1.0 }
    }
  }
}

Request Example (Two-Way Translation Mode - Manual Mode)

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "type": "conversation",
    "transcription_languages": ["zh-TW", "en-US"],
    "conversation_mode": "manual",
    "audio_format": "pcm",
    "realtime_translation": true,
    "speakers": [
      { "id": 1, "language": "zh-TW" },
      { "id": 2, "language": "en-US" }
    ],
    "tts_config": {
      "zh-TW": { "voice": "zh-TW-HsiaoChenNeural", "speaking_rate": 1.0 },
      "en-US": { "voice": "en-US-JennyNeural", "speaking_rate": 1.0 }
    }
  }
}

Special rules for two-way translation mode:

Item	Description
`transcription_languages`	Must contain exactly 2 languages, and they must differ
`translation_languages`	Not required (automatically derived as the non-active language)
`active_language`	Optional, defaults to `transcription_languages[0]`
`recognition_mode`	Forced to `single` (`speaker_diarization` is ignored)
`tts_enabled`	Defaults to `true`; set to `false` to return text translation only
`tts_config`	Optional; configures the TTS voice for each of the two languages; leave empty to use the default voices automatically
`summary_template`	Optional; when provided, a summary is generated automatically after stopping
`speakers`	Required in two-way translation mode; specifies each user's language (exactly 2 entries)
`conversation_mode`	Optional, `auto` (automatic detection, default) or `manual` (manual PTT)

speakers field description:

Field	Type	Required	Description
`id`	int	Yes	User number (1 or 2)
`language`	string	Yes	That user's language code (must be in `transcription_languages`)

conversation_mode description:

Mode	Description
`auto` (default)	The system automatically detects the spoken language and segments sentences automatically
`manual`	The user controls the speaking interval via `start_speaking` / `stop_speaking`; audio during that interval is merged into a single sentence
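
The two-way translation rules (exactly 2 distinct languages, one `speakers` entry per language, `auto` or `manual` mode) can be enforced by a small message builder. A minimal sketch; `build_conversation_start` is a hypothetical helper and it emits only the fields discussed in this section.

```python
from typing import Any


def build_conversation_start(lang_1: str, lang_2: str, mode: str = "auto") -> dict[str, Any]:
    """Build a start message for the `conversation` recording type."""
    if lang_1 == lang_2:
        raise ValueError("two-way translation needs 2 different languages")
    if mode not in ("auto", "manual"):
        raise ValueError("conversation_mode must be 'auto' or 'manual'")
    return {
        "type": "voice-translation",
        "data": {
            "action": "start",
            "type": "conversation",
            "transcription_languages": [lang_1, lang_2],
            "conversation_mode": mode,
            # Speaker ids are fixed to 1 and 2, one per language.
            "speakers": [
                {"id": 1, "language": lang_1},
                {"id": 2, "language": lang_2},
            ],
        },
    }
```

`translation_languages` is left out on purpose, since it is derived automatically in this mode.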

Successful Response

After a successful start, a `session_started` event is returned, containing the complete initial session information.

General recording (transcribe / conversation / record):

{
  "type": "voice-translation",
  "data": {
    "action": "session_started",
    "session_id": "550e8400-e29b-41d4-a716-446655440000",
    "task_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
    "recording_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
    "recording_type": "transcribe",
    "recognition_mode": "single",
    "message": "Speech recognition started"
  }
}

Broadcast mode (broadcast):

{
  "type": "voice-translation",
  "data": {
    "action": "session_started",
    "session_id": "550e8400-e29b-41d4-a716-446655440000",
    "task_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
    "recording_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
    "recording_type": "broadcast",
    "recognition_mode": "multi_speaker",
    "phase": "standby",
    "viewer_count": 0,
    "queue_count": 0,
    "peak_viewers": 0,
    "total_viewers": 0,
    "message": "Speech recognition started"
  }
}

For response field descriptions, see the session_started event.
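
A client typically needs to pick the session identifiers out of this event. The sketch below parses the two payload shapes shown above; the `SessionStarted` class and `parse_session_started` function are illustrative names, and only fields that appear in the examples are read.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionStarted:
    session_id: str
    task_id: str
    recording_id: str
    recording_type: str
    recognition_mode: str
    phase: Optional[str] = None  # present only in broadcast mode


def parse_session_started(raw: str) -> Optional[SessionStarted]:
    """Return the parsed event, or None if the message is something else."""
    msg = json.loads(raw)
    data = msg.get("data") or {}
    if msg.get("type") != "voice-translation" or data.get("action") != "session_started":
        return None
    return SessionStarted(
        session_id=data["session_id"],
        task_id=data["task_id"],
        recording_id=data["recording_id"],
        recording_type=data["recording_type"],
        recognition_mode=data["recognition_mode"],
        phase=data.get("phase"),
    )
```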

Recording Type Descriptions

type	Description	Use Case
`transcribe`	Speech-to-text	Meeting minutes, interview records
`conversation`	Conversation log	Two-way communication, customer service dialogues
`record`	Plain recording	Voice memos, quick notes
`broadcast`	Broadcast / live stream	Lectures, speeches, live content

Broadcast Mode Description (type: "broadcast")

In broadcast mode, the language settings are obtained automatically from the broadcast channel settings and do not need to be sent in the WebSocket message.

Required parameters:

Parameter	Type	Description
`type`	string	Must be `"broadcast"`
`broadcast_token`	string	Broadcast token (obtained after creating a broadcast via the REST API)
`audio_format`	string	Audio format (`pcm` or `webm`)

Optional parameters (override broadcast channel settings):

Parameter	Type	Description
`tts_config`	object	Multi-language TTS settings (override the settings used at creation)
`summary_template`	string	Summary template slug (overrides the settings used at creation; if not provided, the broadcast channel default is used)

Automatically configured parameters (can be omitted):

transcription_languages: read automatically from the broadcast settings
translation_languages: read automatically from the broadcast settings
realtime_translation: enabled by default in broadcast mode
summary_template: read automatically from the broadcast settings (the value passed via WebSocket takes precedence)
summary_language: read automatically from the broadcast settings (the value passed via WebSocket takes precedence)

Broadcast phase description:

broadcast_phase	Description	Behavior
`live` (default)	Live phase	STT/translation results are broadcast to viewers and written to the transcript
`standby`	Standby phase	STT/translation results go only to the host; viewers see the standby_message

Purpose of the standby phase: Lets the host run STT/translation warm-up tests before going live, confirming the equipment works before switching to the live phase.
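
Because the language settings come from the broadcast channel, a broadcast `start` message stays small. The following sketch builds one and checks the phase value locally; `build_broadcast_start` is a hypothetical helper, not part of the API.

```python
from typing import Any, Optional


def build_broadcast_start(
    broadcast_token: str,
    audio_format: str = "pcm",
    phase: str = "live",
    standby_message: Optional[str] = None,
) -> dict[str, Any]:
    """Build a start message for the `broadcast` recording type."""
    if not broadcast_token:
        raise ValueError("broadcast_token is required for broadcast")
    if audio_format not in ("pcm", "webm"):
        raise ValueError("audio_format must be 'pcm' or 'webm'")
    if phase not in ("standby", "live"):
        raise ValueError("broadcast_phase must be 'standby' or 'live'")

    data: dict[str, Any] = {
        "action": "start",
        "type": "broadcast",
        "broadcast_token": broadcast_token,
        "audio_format": audio_format,
    }
    # `live` is the default, so the phase is only sent for standby.
    if phase == "standby":
        data["broadcast_phase"] = "standby"
        if standby_message is not None:
            data["standby_message"] = standby_message
    return {"type": "voice-translation", "data": data}
```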

Broadcast mode request example:

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "type": "broadcast",
    "broadcast_token": "a3f9",
    "audio_format": "pcm"
  }
}

Broadcast mode request example (standby phase + override summary template):

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "type": "broadcast",
    "broadcast_token": "a3f9",
    "audio_format": "pcm",
    "broadcast_phase": "standby",
    "standby_message": "The talk is about to begin, please wait...",
    "summary_template": "lecture"
  }
}

Summary template priority: the value passed in the WebSocket start > the default set when creating the broadcast channel. If neither is set, no summary is generated automatically.
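
The priority rule can be written as a one-line fallback chain. This is only an illustration of the documented order; `effective_summary_template` is a hypothetical name, and the resolution itself is performed by the server.

```python
from typing import Optional


def effective_summary_template(
    ws_value: Optional[str], channel_default: Optional[str]
) -> Optional[str]:
    """WebSocket start value first, then the channel default, else no summary."""
    return ws_value or channel_default or None
```

A result of `None` corresponds to the case where no summary is generated automatically.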

Broadcast mode TTS settings (tts_config):

Use the tts_config parameter to specify which translation languages should produce TTS audio for viewers.

tts_config field	Type	Description
voice	string	TTS voice name
speaking_rate	number	Speaking rate (0.5–2.0, default 1.0)

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "type": "broadcast",
    "broadcast_token": "a3f9",
    "audio_format": "pcm",
    "tts_config": {
      "en-US": {
        "voice": "en-US-JennyNeural",
        "speaking_rate": 1.0
      },
      "ja-JP": {
        "voice": "ja-JP-NanamiNeural",
        "speaking_rate": 1.0
      }
    }
  }
}

Note:
The TTS language must be a valid language in translation_languages; invalid languages are ignored automatically
The host (WebSocket) does not receive TTS audio; only SSE viewers receive the tts_ready event
TTS is sent only during the live phase; it is not sent during the standby phase
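
Since entries for languages outside `translation_languages` are ignored, it can help to filter `tts_config` on the client and to catch an out-of-range `speaking_rate` early. A minimal sketch under the assumption that the caller already knows the channel's translation languages; `clean_tts_config` is a hypothetical helper.

```python
from typing import Any


def clean_tts_config(
    tts_config: dict[str, dict[str, Any]], translation_languages: list[str]
) -> dict[str, dict[str, Any]]:
    """Drop languages the server would ignore and validate speaking_rate."""
    cleaned: dict[str, dict[str, Any]] = {}
    for lang, cfg in tts_config.items():
        if lang not in translation_languages:
            continue  # not a translation language, so it would be ignored
        rate = cfg.get("speaking_rate", 1.0)  # documented default is 1.0
        if not 0.5 <= rate <= 2.0:
            raise ValueError(f"speaking_rate for {lang} must be between 0.5 and 2.0")
        cleaned[lang] = {**cfg, "speaking_rate": rate}
    return cleaned
```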

TTS Playback Mode Description

Mode	Description	Behavior
`sync`	Synchronous mode (default)	Automatically plays the most recent `is_final=true` translated sentence; if the previous sentence is still playing, it enters the queue and waits
`async`	Asynchronous mode (manual control)	The user can select any translated sentence for TTS, controlled with the `tts_play` command
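
The queueing behavior of `sync` mode can be pictured with a small model. This is a conceptual sketch of the behavior described in the table, not the service's implementation; the class and method names are invented for illustration.

```python
from collections import deque
from typing import Optional


class SyncTtsQueue:
    """Toy model of sync mode: final sentences play in arrival order."""

    def __init__(self) -> None:
        self.playing: Optional[str] = None
        self.waiting: deque[str] = deque()

    def on_translation(self, text: str, is_final: bool) -> None:
        # Only sentences with is_final=true are spoken.
        if not is_final:
            return
        if self.playing is None:
            self.playing = text
        else:
            # The previous sentence is still playing, so this one waits.
            self.waiting.append(text)

    def on_playback_finished(self) -> None:
        # Start the next queued sentence, or go idle.
        self.playing = self.waiting.popleft() if self.waiting else None
```

In `async` mode no such queue applies, because playback is triggered explicitly with `tts_play`.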

Error Codes

Error Code	HTTP Status	Description	Recommended Action
`missing_transcription_languages`	400	No language parameter provided	Make sure the request includes `transcription_languages`
`invalid_transcription_language`	400	Invalid language code	Make sure the language code format is correct (e.g. `zh-TW`)
`too_many_languages`	400	Number of languages exceeds the limit	You can specify at most 2 languages
`invalid_recording_type`	400	Invalid recording type	Use a valid type value
`invalid_summary_template`	400	Invalid summary template	Make sure the template identifier is correct
`stt_init_failed`	503	Service initialization failed	Retry later
`auth_budget_exceeded`	402	Monthly budget exceeded	Wait for the next month's budget reset or adjust the budget
`tts_init_failed`	503	TTS service initialization failed	Retry later
`tts_invalid_language`	400	TTS language is not in the translation languages	Make sure `tts_language` is in `translation_languages`
`broadcast_token_required`	400	Broadcast mode requires a token	A `broadcast` type must provide `broadcast_token`
`broadcast_token_invalid`	400	Invalid broadcast token	Make sure the token is correct and has not expired
`broadcast_not_ready`	503	Broadcast service not yet started	Retry later
`summary_invalid_mode`	400	`summary_mode` is not `builtin` / `custom`	Change to a valid mode
`summary_mode_field_mismatch`	400	The mode and field combination do not match (a required field is missing / a forbidden field was provided)	Adjust the fields according to the mode rules
`summary_prompt_too_long`	400	`summary_prompt` exceeds 2000 characters	Shorten the custom prompt
`summary_prompt_slug_too_long`	400	`summary_prompt_slug` exceeds 64 characters	Shorten the identifier
`summary_prompt_slug_invalid`	400	`summary_prompt_slug` contains control characters (`\n` / `\r` / `\t` / `\0`, etc.)	Remove the control characters
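
The recommended actions fall into three groups: retry later (503), wait for or adjust the budget (402), and fix the request (400). A client can branch on the error code accordingly. The sketch below is an assumption about how one might organize that; the function name and the returned labels are illustrative, and codes not listed in the table are treated as non-retryable.

```python
# Codes whose recommended action in the table above is "Retry later" (503).
RETRYABLE = {"stt_init_failed", "tts_init_failed", "broadcast_not_ready"}

# Budget-related code (402): retrying will not help until the budget changes.
BUDGET = {"auth_budget_exceeded"}


def next_step(error_code: str) -> str:
    """Map an error code from a failed start to a coarse client action."""
    if error_code in RETRYABLE:
        return "retry_later"
    if error_code in BUDGET:
        return "check_budget"
    # Everything else in the table is a 400: the request itself must change.
    return "fix_request"
```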

config - Configure Terminology / Correction Rules

Error Codes

Error Code	HTTP Status	Description	Recommended Action
`config_empty`	400	No settings provided	Provide at least one setting item
`config_term_too_long`	400	Term exceeds 100 characters	Shorten the term
`config_too_many_entries`	400	Number of terms exceeds 500	Reduce the number of terms
`config_too_many_dict_entries`	400	Translation dictionary exceeds 50 entries	Reduce the number of dictionary entries
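
The limits behind these error codes (100 characters per term, 500 terms, 50 dictionary entries) can be checked before sending. The request fields of `config` are not shown in this section, so the sketch below takes plain Python values: the parameter names `terms` and `dictionary` are assumptions, not API field names.

```python
def check_config_limits(terms: list[str], dictionary: dict[str, str]) -> list[str]:
    """Return the error codes the given settings would be expected to trigger."""
    if not terms and not dictionary:
        return ["config_empty"]
    codes: list[str] = []
    if any(len(term) > 100 for term in terms):
        codes.append("config_term_too_long")
    if len(terms) > 500:
        codes.append("config_too_many_entries")
    if len(dictionary) > 50:
        codes.append("config_too_many_dict_entries")
    return codes
```

Whether the 100-character limit also applies to dictionary entries is not stated here, so only `terms` are length-checked.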

audio - Send Audio

Audio format: `pcm`

Item	Specification
Format	PCM (raw audio)
Sample rate	16000 Hz
Bit depth	16-bit
Channels	Mono
Byte order	Little-endian
Transfer encoding	Base64
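
Producing audio that matches this specification means packing samples as signed 16-bit little-endian integers and Base64-encoding the bytes. A minimal sketch using only the standard library; the helper names are illustrative, and the message field that carries the encoded chunk is not shown in this section, so only the encoding step is covered.

```python
import base64
import struct

SAMPLE_RATE = 16000  # Hz, mono
BYTES_PER_SAMPLE = 2  # 16-bit


def encode_pcm16(samples: list[float]) -> str:
    """Convert floats in [-1.0, 1.0] to Base64 of 16-bit little-endian PCM."""
    ints = [max(-32768, min(32767, int(round(s * 32767)))) for s in samples]
    raw = struct.pack(f"<{len(ints)}h", *ints)  # "<" = little-endian, "h" = int16
    return base64.b64encode(raw).decode("ascii")


def chunk_duration_ms(b64_chunk: str) -> float:
    """Duration of one encoded mono chunk in milliseconds."""
    n_bytes = len(base64.b64decode(b64_chunk))
    return n_bytes * 1000 / (BYTES_PER_SAMPLE * SAMPLE_RATE)
```

For example, 1600 samples encode to 3200 bytes of PCM, which is 100 ms of audio at 16000 Hz.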

Audio format: `webm`

Item	Specification
Format	WebM container + Opus codec
Sample rate	Any (the server converts automatically)
Channels	Mono or Stereo (the server converts automatically)
Transfer encoding	Base64