WebSocket API
Note: This is a consolidated document. For detailed specifications, refer to the individual documents under reference/websocket/.
Note: The URL used in this document (vas-poc.vurbo.ai) is the planned deployment address. A separate notice will be issued after the official launch.
Table of Contents
- Connection Info
- Authentication
- Message Format
- Health - Heartbeat Service
- Voice Translation - start
- Voice Translation - config
- Voice Translation - audio
- Voice Translation - pause
- Voice Translation - resume
- Voice Translation - stop
- Voice Translation - retranslate
- Voice Translation - switch_language
- Voice Translation - set_name
- Voice Translation - rename_speaker
- Voice Translation - reassign_speaker
- Voice Translation - merge_speakers
- Voice Translation - tts_play
- Voice Translation - tts_stop
- Voice Translation - tts_mode
- Voice Translation - set_tts
- Voice Translation - start_speaking
- Voice Translation - stop_speaking
- Voice Translation - switch_conversation_mode
- Voice Translation - set_speaker_language
- Voice Translation - broadcast_go_live
- Voice Translation - broadcast_announcement
- Voice Translation - set_standby_message
- Response Events
Connection Info
| Item | Value |
|---|---|
| Endpoint | wss://vas-poc.vurbo.ai/ws |
| Protocol | WebSocket |
| Data Format | JSON |
| Auth Method | Ticket (see below) |
Authentication
The VAS WebSocket uses a Ticket mechanism for authentication, passing a one-time Ticket via Sec-WebSocket-Protocol. For details, refer to Authentication.
Step 1: Obtain a Ticket
Exchange your API Key for a one-time Ticket via the REST API:
POST /api/v1/auth/ticket
X-API-Key: vas_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Response:
{
"ticket": "aBcDeFgHiJkLmNoPqRsTuVwXyZ012345",
"expires_in": 60
}
| Field | Type | Description |
|---|---|---|
| ticket | string | One-time Ticket (32 chars) |
| expires_in | int | Validity period (seconds) |
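Step 1 can be sketched as follows. This is a minimal illustration, not the official client: the function name requestTicket, the injectable fetchImpl parameter, and the REST base URL passed by the caller are all assumptions (this document does not state which host serves the REST API). It assumes a global fetch, as found in browsers and Node.js 18+.

```javascript
// Sketch: exchange an API key for a one-time ticket.
// Assumption: the caller supplies the REST base URL; it is not stated in this document.
async function requestTicket(baseUrl, apiKey, fetchImpl = fetch) {
  const res = await fetchImpl(`${baseUrl}/api/v1/auth/ticket`, {
    method: 'POST',
    headers: { 'X-API-Key': apiKey },
  });
  if (!res.ok) {
    throw new Error(`Ticket request failed with HTTP ${res.status}`);
  }
  const { ticket, expires_in } = await res.json();
  // Record a local deadline so callers can avoid connecting with a stale ticket.
  return { ticket, expiresAt: Date.now() + expires_in * 1000 };
}
```

Because the request carries the API key, one reasonable design is to run this exchange on your own backend and hand only the ticket to the browser.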
Step 2: Connect to the WebSocket using the Ticket
Place the Ticket into Sec-WebSocket-Protocol in the format ticket.{TICKET_VALUE}:
// Native browser support
const ws = new WebSocket('wss://vas-poc.vurbo.ai/ws', [`ticket.${ticket}`]);
ws.onopen = () => {
console.log('Connected! Protocol:', ws.protocol);
// Start using the WebSocket...
};
ws.onerror = (error) => {
console.error('Connection failed:', error);
};
Node.js example:
const WebSocket = require('ws');
const ws = new WebSocket('wss://vas-poc.vurbo.ai/ws', [`ticket.${ticket}`]);
Ticket Characteristics
| Characteristic | Description |
|---|---|
| Validity period | 60 seconds |
| Usage count | Can be used only once (deleted immediately after) |
| Security | The API Key is never exposed in the WebSocket connection |
| Replay protection | Uses an atomic operation to guarantee single use |
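Because a Ticket is single-use and short-lived, every connection attempt (including reconnects) needs a new one. The sketch below illustrates that pattern; openSocket, getTicket, and WebSocketImpl are illustrative names, and the WebSocket constructor is injected so the same code can run with the browser WebSocket or the Node.js ws package.

```javascript
// Sketch: fetch a fresh ticket for every (re)connection attempt.
// `getTicket` is any async function that returns a new ticket string.
async function openSocket(url, getTicket, WebSocketImpl) {
  const ticket = await getTicket(); // never reuse a ticket from an earlier attempt
  return new WebSocketImpl(url, [`ticket.${ticket}`]);
}
```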
Ticket Error Codes
| Error Code | HTTP Status | Description |
|---|---|---|
| ticket_invalid | 401 | Ticket invalid or expired |
| ticket_expired | 401 | Ticket expired |
| ticket_already_used | 401 | Ticket already used |
| ticket_validation_failed | 500 | Ticket validation failed |
For the full API specification, refer to Auth Ticket API.
Message Format
All messages use a unified nested structure:
{
"type": "service type",
"data": { ... }
}
Service Types
| type | Description |
|---|---|
| health | Heartbeat mechanism |
| voice-translation | Voice translation service |
| error | Error message |
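The envelope described above can be wrapped in two small helpers. This is a sketch only: encodeMessage, decodeMessage, and SERVICE_TYPES are illustrative names, and rejecting unknown types on the client is a design choice of this example, not documented server behavior.

```javascript
// Service types listed in the table above.
const SERVICE_TYPES = new Set(['health', 'voice-translation', 'error']);

// Wrap a payload in the { type, data } envelope and serialize it.
function encodeMessage(type, data) {
  if (!SERVICE_TYPES.has(type)) throw new Error(`Unknown service type: ${type}`);
  return JSON.stringify({ type, data });
}

// Parse an incoming frame; returns null for anything that is not a valid envelope.
function decodeMessage(raw) {
  let msg;
  try {
    msg = JSON.parse(raw);
  } catch {
    return null;
  }
  if (!msg || typeof msg !== 'object') return null;
  if (!SERVICE_TYPES.has(msg.type)) return null;
  if (typeof msg.data !== 'object' || msg.data === null) return null;
  return msg;
}
```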
Error Message Format
When an error occurs, the server returns a message with type: "error":
{
"type": "error",
"data": {
"error_code": "auth_invalid_api_key",
"severity": "fatal",
"message": "Invalid API key",
"context": "auth",
"request_id": "req_abc123xyz789",
"timestamp": "2026-01-15T10:30:45.123Z"
}
}
Sentence-level errors (such as a translation failure for one language of a sentence) additionally carry sid and details:
{
"type": "error",
"data": {
"error_code": "llm_content_filtered",
"severity": "warning",
"message": "Content filtered",
"context": "translation",
"sid": 5,
"request_id": "req_abc123xyz789",
"timestamp": "2026-01-15T10:30:45.123Z",
"details": {
"provider": "azure_openai",
"translation_language": "ja"
}
}
}
Session-level translation service errors (escalated after consecutive failures reach a threshold) do not carry sid. The frontend should display a global notice but does not need to disconnect:
{
"type": "error",
"data": {
"error_code": "translation_service_unavailable",
"severity": "error",
"message": "Translation service unavailable",
"context": "translation",
"request_id": "req_abc123xyz789",
"timestamp": "2026-01-15T10:30:45.123Z",
"details": {
"provider": "azure_openai",
"last_error_code": "llm_provider_error",
"fail_count": 5
}
}
}
For the full trigger rules (consecutive failure threshold, error code classification), refer to the translation_service_unavailable section in Error Code Reference.
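The three error shapes above can be told apart on the client roughly as follows. This is a sketch under assumptions: classifyError and the category names it returns are hypothetical, and the check order is this example's choice.

```javascript
// Sketch: map the `data` object of an error message to a UI-level handling category.
function classifyError(data) {
  if (data.severity === 'fatal') return 'disconnect';   // e.g. auth_invalid_api_key
  if (typeof data.sid === 'number') return 'sentence';  // attach the error to sentence `sid`
  if (data.error_code === 'translation_service_unavailable') {
    return 'global-notice';                              // show a notice, keep the connection
  }
  return 'generic';
}
```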
Single-Message Error (per-message panic recovered)
When the server encounters an internal error (panic) while handling a single WebSocket message (such as set_name, switch_language, tts_play, etc.), it returns internal_error. This error indicates only that the specific message failed to process; the connection is not terminated. The frontend should keep the connection open and may retry the operation:
{
"type": "error",
"data": {
"error_code": "internal_error",
"severity": "error",
"message": "Internal server error",
"context": "general",
"request_id": "req_abc123xyz789",
"timestamp": "2026-05-08T10:30:45.123Z",
"details": {
"message_type": "voice-translation",
"action": "set_name"
}
}
}
details Fields
| Field | Type | Description |
|---|---|---|
| message_type | string | Service type: voice-translation / health |
| action | string | (Optional) The specific operation that failed, such as set_name, switch_language, tts_play, tts_mode, retranslate, config, speaker.rename, etc. This field is absent when the message payload has no action field (such as a plain init message). |
What the Frontend Should Do
- Keep the WebSocket connection open: Do not call ws.close(), navigate away, or return to a history page because of this error. The recording is still in progress.
- Decide on follow-up handling based on details.action:

| Scenario | Recommended Action |
|---|---|
| Idempotent operations such as set_name / switch_language / tts_mode / config | Simply resend the same message. These operations use a "last write wins" approach, so retrying has no side effects. |
| tts_play / tts_stop / retranslate | Usually safe to retry directly. If the user is waiting for TTS playback, consider showing a transient toast indicating the retry is in progress. |
| speaker.rename / speaker.merge | Before retrying, use the REST API (speakers) to confirm the current DB state and avoid duplicate operations (for example, the rename already succeeded and only the response frame failed). |
| details.action is absent | The server panicked after parsing the message payload and cannot infer the specific operation. The frontend can infer it from "the most recent message the user sent," or display a generic error message such as "Operation failed, please retry." |

- User experience: Show a transient toast / inline error. Do not interrupt the user flow with a modal or a redirect.
- Telemetry / reporting: Report request_id + details to your frontend error tracking (Sentry, Datadog, etc.) to make it easier to correlate with backend logs during troubleshooting.
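The per-action guidance above can be encoded as a small lookup. This is a sketch: retryPolicy and the policy names it returns are hypothetical, and actions not mentioned in this section fall back to the generic path.

```javascript
// Actions grouped by the guidance in this section.
const RESEND = new Set(['set_name', 'switch_language', 'tts_mode', 'config']);
const RETRY = new Set(['tts_play', 'tts_stop', 'retranslate']);
const VERIFY_FIRST = new Set(['speaker.rename', 'speaker.merge']);

// Returns a policy name for an internal_error, based on details.action.
function retryPolicy(action) {
  if (RESEND.has(action)) return 'resend';                   // last write wins, safe to resend
  if (RETRY.has(action)) return 'retry';                     // usually safe to retry directly
  if (VERIFY_FIRST.has(action)) return 'verify-then-retry';  // check state via the REST API first
  return 'generic-error';                                    // action absent or not listed
}
```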
What Will Not Happen (Guarantees)
- The recording will not be interrupted: segment_uploaded, result, origin, and other messages keep arriving.
- The connection will not be actively closed by the server.
- The session state will not be reset (session_id stays the same).
- State already written to the DB will not be rolled back (for example, if set_name was written to the DB successfully and only the response frame failed, the name still takes effect).
Client Handling Example
ws.onmessage = (event) => {
const msg = JSON.parse(event.data);
if (msg.type !== 'error') {
handleNormalMessage(msg);
return;
}
const { error_code, severity, request_id, details } = msg.data;
// Single-message panic: keep the connection, decide whether to retry based on action
if (error_code === 'internal_error') {
console.warn('[ws] message handler panic recovered', {
request_id,
message_type: details?.message_type,
action: details?.action,
});
showTransientToast(`Failed to process "${details?.action ?? 'operation'}", please retry`);
// Note: do not call ws.close() and do not navigate away from the current page
return;
}
// Handle other errors with your existing logic (only fatal errors require disconnecting)
handleErrorBySeverity(severity, msg.data);
};
| Field | Type | Description |
|---|---|---|
| error_code | string | Error code (for programmatic handling) |
| severity | string | Severity: fatal / error / warning |
| message | string | Human-readable error message |
| context | string | Error source category |
| sid | int | Optional. The sentence number for sentence-level errors (such as a translation failure); absent for non-sentence-level errors |
| request_id | string | Request tracking ID |
| timestamp | string | Time the error occurred (ISO 8601) |
| details | object | Optional. Error context; common keys: provider, translation_language, source_lang, etc. |
For the full list of error codes, refer to Error Code Reference.
Health (Heartbeat Service)
Description
Used to confirm that the WebSocket connection is healthy. We recommend sending a ping every 30 seconds; if no pong is received, treat the connection as dropped and reconnect.
Use Cases
- Maintaining a long-lived connection
- Detecting connection status
- Preventing connection timeouts
Request - Ping
{
"type": "health",
"data": {
"action": "ping"
}
}
Response - Pong
{
"type": "health",
"data": {
"action": "pong"
}
}
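A heartbeat along these lines could be sketched as below. The names createHeartbeat, tick, and pong are illustrative, and the pong deadline is an assumption: this document gives no explicit timeout, so the sketch treats a ping that is still unanswered at the next tick as a dropped connection.

```javascript
// Sketch: timer-free heartbeat state; the caller drives tick() from a timer.
// `send` transmits a string over the socket; `onDead` handles a dropped connection.
function createHeartbeat(send, onDead) {
  let awaitingPong = false;
  return {
    // Call once per interval (30 seconds is the recommended value).
    tick() {
      if (awaitingPong) {
        onDead(); // previous ping was never answered
        return;
      }
      awaitingPong = true;
      send(JSON.stringify({ type: 'health', data: { action: 'ping' } }));
    },
    // Call when a health message with action "pong" arrives.
    pong() {
      awaitingPong = false;
    },
  };
}
```

In use, you would call setInterval(() => hb.tick(), 30000) after the socket opens, call hb.pong() from the message handler, and clear the interval and reconnect inside onDead.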
Voice Translation - start (Start Voice Translation)
Description
Starts a new voice translation session and begins processing audio according to the configured parameters.
Use Cases
- Starting a meeting record
- Starting real-time translation
- Starting a voice memo
Request Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| action | string | Yes | Fixed value start |
| transcription_languages | string[] | Yes | Speech recognition languages (up to 2) |
| translation_languages | string[] | No | Translation target languages (empty = no translation) |
| realtime_translation | boolean | No | Real-time translation mode (default false) |
| recognition_mode | string | No | Recognition mode: single (single speaker, default), multi_speaker (multiple speakers). Under multi_speaker, transcription_languages must contain exactly 1 language; otherwise the server returns a diarization_multilang_conflict error and refuses to start. |
| type | string | Yes | Recording type: transcribe, conversation, record, broadcast |
| audio_format | string | No | Audio format: pcm (default), webm |
| summary_template | string | Conditional | Summary template. Required for transcribe when summary_mode=builtin; forbidden when summary_mode=custom; optional for conversation/broadcast. |
| options | object | No | Speech recognition options |
| tts_enabled | boolean | No | Whether to enable TTS speech synthesis (default false) |
| tts_language | string | No | TTS output language (must be in translation_languages) |
| tts_voice | string | No | TTS voice name (such as en-US-JennyNeural) |
| tts_mode | string | No | TTS playback mode: sync (synchronous, default), async (asynchronous) |
| broadcast_token | string | Conditional | Broadcast token (required for the broadcast type, obtained from the REST API) |
| active_language | string | No | Initial active language for two-way mode (default transcription_languages[0]) |
| speakers | array | Conditional | User-to-language mapping for two-way mode (required in two-way mode, exactly 2 users) |
| conversation_mode | string | No | Two-way conversation mode: auto (auto-detect, default), manual (push-to-talk) |
| speaker_diarization | boolean | No | Speaker diarization (forcibly ignored in two-way mode) |
| tts_config | object | No | Multi-language TTS settings (applies to both broadcast mode and two-way mode) |
| broadcast_phase | string | No | Initial broadcast phase: standby, live (default) |
| standby_message | string | No | The message viewers see during the standby phase (default: "Getting ready, please wait...") |
| name | string | No | Initial default recording name (max 60 chars; the system may still override it; if not provided, auto-generated such as Transcription #1) |
| summary_language | string | No | Summary output language (defaults to the recognition language when unspecified; in broadcast mode, read automatically from the channel settings) |
| summary_mode | string | No | Summary mode enum: builtin (default) / custom. Inferred as builtin when omitted. |
| summary_prompt | string | No | Required in custom mode; supplemental instructions in builtin mode. <= 2000 characters. |
| summary_prompt_slug | string | No | Required in custom mode; forbidden in builtin mode. Your own identifier (<= 64 characters, Unicode, no control characters; passed through and stored in the backend record for historical lookup). |
| summary_plain_text | boolean | No | Request plain-text summary output (default false; when enabled, the backend performs Markdown post-processing). |
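Several of the constraints in this table can be checked on the client before sending start. The sketch below is illustrative only: precheckStart is a hypothetical helper, it returns the error codes this document attributes to the server, and the order of the checks is an assumption (the server may evaluate them differently or apply further rules).

```javascript
const RECORDING_TYPES = ['transcribe', 'conversation', 'record', 'broadcast'];

// Returns the documented error code for the first violated rule, or null if none is found.
function precheckStart(data) {
  if (!RECORDING_TYPES.includes(data.type)) return 'invalid_recording_type';
  if (data.type === 'broadcast') {
    // Languages come from the broadcast channel settings; only the token is mandatory here.
    return data.broadcast_token ? null : 'broadcast_token_required';
  }
  const langs = data.transcription_languages;
  if (!Array.isArray(langs) || langs.length === 0) return 'missing_transcription_languages';
  if (langs.length > 2) return 'too_many_languages';
  if (data.recognition_mode === 'multi_speaker' && langs.length !== 1) {
    return 'diarization_multilang_conflict';
  }
  const targets = data.translation_languages || [];
  if (data.tts_language && !targets.includes(data.tts_language)) {
    return 'tts_invalid_language';
  }
  return null;
}
```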
Recording Type Descriptions
| type | Description | Use Cases |
|---|---|---|
| transcribe | Speech-to-text | Meeting minutes, interview notes |
| conversation | Conversation record | Two-way communication, customer service conversations |
| record | Plain recording | Voice memos, quick notes |
| broadcast | Broadcast/live | Lectures, talks, live content |
Request Example (Basic)
{
"type": "voice-translation",
"data": {
"action": "start",
"transcription_languages": ["zh-TW"],
"translation_languages": ["en-US"],
"realtime_translation": false,
"type": "transcribe",
"audio_format": "pcm",
"summary_template": "meeting",
"options": {
"speaking_speed": "normal",
"segmentation_mode": "auto",
"profanity_handling": "mask"
}
}
}
Request Example (Initial Default Name)
{
"type": "voice-translation",
"data": {
"action": "start",
"transcription_languages": ["zh-TW"],
"translation_languages": ["en-US"],
"type": "transcribe",
"audio_format": "pcm",
"summary_template": "meeting",
"name": "Product Planning Meeting"
}
}
Recording Name Rules
| Scenario | Name | name_source | System Override? |
|---|---|---|---|
| start with a name parameter | Initial default name | default | Yes |
| start without a name | Auto-generated (such as Transcription #1, Broadcast #3) | default | Yes |
| Set via set_name | The name explicitly set by the user | user | No |
| Auto-generated by the system after the session ends | A summary name generated from the transcript content | llm | — |
Note: The name in start is the initial default name; the system may still override it when the session ends. If you need a fixed name, use set_name.
Default name format (fixed English):
| Recording Type | Default Name Format |
|---|---|
| transcribe | Transcription #N |
| conversation | Conversation #N |
| record | Recording #N |
| broadcast | Broadcast #N |
N is the sequential number for that user's recordings of the same type. Name priority: user > llm > default. Once the user sets a name, the system will not override it when the session ends.
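If the client mirrors the recording name locally, the priority rule can be expressed as below. This is a sketch: pickName and NAME_PRIORITY are hypothetical, and letting a candidate of equal priority replace the current name (for example, a second set_name) is an assumption of this example.

```javascript
// Priority from this section: user > llm > default.
const NAME_PRIORITY = { user: 3, llm: 2, default: 1 };

// Decide whether a candidate { name, name_source } may replace the current one.
function pickName(current, candidate) {
  const keepCurrent = NAME_PRIORITY[candidate.name_source] < NAME_PRIORITY[current.name_source];
  return keepCurrent ? current : candidate;
}
```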
Request Example (With TTS)
{
"type": "voice-translation",
"data": {
"action": "start",
"transcription_languages": ["zh-TW"],
"translation_languages": ["en-US"],
"realtime_translation": true,
"type": "transcribe",
"tts_enabled": true,
"tts_language": "en-US",
"tts_voice": "en-US-JennyNeural",
"tts_mode": "sync"
}
}
Request Example (Two-Way Mode - Auto-Detect)
{
"type": "voice-translation",
"data": {
"action": "start",
"type": "conversation",
"transcription_languages": ["zh-TW", "en-US"],
"active_language": "zh-TW",
"audio_format": "pcm",
"realtime_translation": true,
"speakers": [
{ "id": 1, "language": "zh-TW" },
{ "id": 2, "language": "en-US" }
],
"tts_config": {
"zh-TW": { "voice": "zh-TW-HsiaoChenNeural", "speaking_rate": 1.0 },
"en-US": { "voice": "en-US-JennyNeural", "speaking_rate": 1.0 }
}
}
}
Request Example (Two-Way Mode - Manual Mode)
{
"type": "voice-translation",
"data": {
"action": "start",
"type": "conversation",
"transcription_languages": ["zh-TW", "en-US"],
"conversation_mode": "manual",
"audio_format": "pcm",
"realtime_translation": true,
"speakers": [
{ "id": 1, "language": "zh-TW" },
{ "id": 2, "language": "en-US" }
],
"tts_config": {
"zh-TW": { "voice": "zh-TW-HsiaoChenNeural", "speaking_rate": 1.0 },
"en-US": { "voice": "en-US-JennyNeural", "speaking_rate": 1.0 }
}
}
}
Request Example (Custom Summary Prompt - custom mode)
In summary_mode=custom, your summary_prompt content completely replaces the built-in template rules, and the backend adds prompt-injection protection. The summary_prompt_slug is metadata for your own identification (stored in the backend record) and is not included in the prompt content.
If you want to keep the built-in template and add your own supplemental instructions afterward, use summary_mode=builtin + summary_template=<slug> + summary_prompt=<supplemental instructions> instead (in builtin mode, summary_prompt is treated as supplemental and appended after the built-in template).
{
"type": "voice-translation",
"data": {
"action": "start",
"transcription_languages": ["zh-TW"],
"translation_languages": ["en-US"],
"type": "transcribe",
"audio_format": "pcm",
"summary_language": "zh-TW",
"summary_mode": "custom",
"summary_prompt": "You are a meeting-minutes assistant. List every amount and committed date discussed in bullet points, and note the responsible person for each.",
"summary_prompt_slug": "client_x_finance_v3",
"summary_plain_text": false
}
}
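The field rules for the two summary modes can be checked before sending start, roughly as follows. This is a sketch under assumptions: precheckSummary is a hypothetical helper, lengths are counted in Unicode code points (the document does not say how the server counts characters), and the rule that summary_template is required for transcribe in builtin mode is deliberately left to the server.

```javascript
// Returns the documented error code for the first violated rule, or null.
function precheckSummary(data) {
  const mode = data.summary_mode ?? 'builtin';
  if (mode !== 'builtin' && mode !== 'custom') return 'summary_invalid_mode';

  const { summary_template: tpl, summary_prompt: prompt, summary_prompt_slug: slug } = data;
  // custom: prompt and slug are required, template is forbidden.
  if (mode === 'custom' && (tpl !== undefined || !prompt || !slug)) {
    return 'summary_mode_field_mismatch';
  }
  // builtin: slug is forbidden.
  if (mode === 'builtin' && slug !== undefined) return 'summary_mode_field_mismatch';

  if (prompt !== undefined && [...prompt].length > 2000) return 'summary_prompt_too_long';
  if (slug !== undefined) {
    if ([...slug].length > 64) return 'summary_prompt_slug_too_long';
    if (/\p{Cc}/u.test(slug)) return 'summary_prompt_slug_invalid'; // control characters
  }
  return null;
}
```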
Important — How to Retrieve the Summary Result: In WebSocket mode, summaries are non-streaming by design; final_content is not pushed back via a WebSocket event (the summary_done event only signals completion and does not contain the content). The client must retrieve it afterward over HTTP:
- After receiving the summary_done event, call GET /api/v1/sse/history/transcribe/{taskId} to retrieve the summary (the init_summary event carries a top-level summary plain string + summary_mode / summary_template / summary_plain_text / summary_prompt_snapshot + the two content-filter fallback audit fields summary_fallback_level / summary_dropped_segments added in v1.5.5).
- Or query the summary_mode / summary_template / summary_prompt_slug columns of the recordings table via the REST API.
v1.5.5 Content-Filter Automatic Downgrade: If your prompt or transcript content triggers the LLM service's content filter, the system automatically downgrades (standard mode → neutral mode → segment-omission mode). The summary_fallback_level field of the summary_done event (value 2 or 3; omitted when standard mode succeeds directly) tells the client which path was actually taken, so the frontend can display hints such as "neutral mode in use" / "N segments omitted." See reference/websocket/events.md – summary_done and the V1.5.5 changelog.
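A frontend hint based on summary_fallback_level might look like the sketch below. Two things here are assumptions rather than documented facts: that level 2 means neutral mode and level 3 means segment-omission mode (inferred from the order of the downgrade chain), and that the caller can supply a count of omitted segments. fallbackHint is a hypothetical helper; confirm the field semantics in reference/websocket/events.md.

```javascript
// Sketch: turn the fallback level into a short UI hint.
// Assumption: 2 = neutral mode, 3 = segment-omission mode.
function fallbackHint(level, droppedCount = 0) {
  switch (level) {
    case 2:
      return 'neutral mode in use';
    case 3:
      return `${droppedCount} segments omitted`;
    default:
      return null; // field omitted: standard mode succeeded directly
  }
}
```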
Two-Way Mode Special Rules:
| Item | Description |
|---|---|
| transcription_languages | Must contain exactly 2 languages, and they cannot be the same. |
| translation_languages | Not required (automatically derived as the non-active language). |
| active_language | Optional, defaults to transcription_languages[0]. |
| recognition_mode | Forced to single (ignores speaker_diarization). |
| tts_enabled | Defaults to true; set to false to return text translations only. |
| tts_config | Optional; sets the TTS voice for each of the two languages; leave empty to use the default voices automatically. |
| summary_template | Optional; when provided, a summary is automatically generated after stopping. |
| speakers | Required in two-way mode; specifies each user's language (exactly 2 users). |
| conversation_mode | Optional; auto (auto-detect, default) or manual (push-to-talk). |
speakers Field Descriptions:
| Field | Type | Required | Description |
|---|---|---|---|
| id | int | Yes | User number (1 or 2) |
| language | string | Yes | The user's language code (must be in transcription_languages) |
conversation_mode Descriptions:
| Mode | Description |
|---|---|
| auto (default) | The system automatically detects the spoken language and segments sentences automatically. |
| manual | The user controls speaking periods via start_speaking / stop_speaking, during which the audio is merged into a single sentence. |
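The two-way mode rules above can be collected into a client-side check. This is a sketch: precheckTwoWay is a hypothetical helper that returns human-readable problems rather than server error codes, and requiring the two speakers to use ids 1 and 2 exactly once each is an assumption based on the speakers field description.

```javascript
// Returns a list of problems found in a two-way (conversation) start payload.
function precheckTwoWay(data) {
  const problems = [];
  const langs = data.transcription_languages ?? [];
  if (langs.length !== 2 || langs[0] === langs[1]) {
    problems.push('transcription_languages must contain exactly 2 different languages');
  }
  const speakers = data.speakers ?? [];
  const ids = speakers.map((s) => s.id).sort();
  if (speakers.length !== 2 || ids[0] !== 1 || ids[1] !== 2) {
    problems.push('speakers must list exactly two users with id 1 and id 2');
  }
  if (speakers.some((s) => !langs.includes(s.language))) {
    problems.push('each speaker language must be in transcription_languages');
  }
  const mode = data.conversation_mode ?? 'auto';
  if (mode !== 'auto' && mode !== 'manual') {
    problems.push('conversation_mode must be auto or manual');
  }
  return problems;
}
```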
Broadcast Mode Description (type: "broadcast")
In broadcast mode, the language settings are automatically obtained from the broadcast channel settings and do not need to be sent in the WebSocket message.
Required parameters:
| Parameter | Type | Description |
|---|---|---|
| type | string | Must be "broadcast" |
| broadcast_token | string | Broadcast token (obtained after creating the broadcast via the REST API) |
| audio_format | string | Audio format (pcm or webm) |
Optional parameters (override the broadcast channel settings):
| Parameter | Type | Description |
|---|---|---|
| tts_config | object | Multi-language TTS settings (overrides the settings from creation time) |
| summary_template | string | Summary template slug (overrides the settings from creation time; if not provided, the broadcast channel default is used) |
Auto-configured parameters (can be omitted):
- transcription_languages: read automatically from the broadcast settings
- translation_languages: read automatically from the broadcast settings
- realtime_translation: enabled by default in broadcast mode
- summary_template: read automatically from the broadcast settings (the value passed via WebSocket takes precedence)
- summary_language: read automatically from the broadcast settings (the value passed via WebSocket takes precedence)
Broadcast Phase Descriptions:
| broadcast_phase | Description | Behavior |
|---|---|---|
| live (default) | Live phase | STT/translation results are broadcast to viewers and written to the transcript. |
| standby | Standby phase | STT/translation results go only to the host; viewers see the standby_message. |
Standby phase purpose: Lets the host warm up STT/translation before going live, confirm that equipment is working, and then switch to the live phase.
Broadcast Mode Request Example:
{
"type": "voice-translation",
"data": {
"action": "start",
"type": "broadcast",
"broadcast_token": "a3f9",
"audio_format": "pcm"
}
}
Broadcast Mode Request Example (Standby Phase + Override Summary Template):
{
"type": "voice-translation",
"data": {
"action": "start",
"type": "broadcast",
"broadcast_token": "a3f9",
"audio_format": "pcm",
"broadcast_phase": "standby",
"standby_message": "The talk is about to begin, please wait...",
"summary_template": "lecture"
}
}
Summary template priority: The value passed in the WebSocket start > the default set when the broadcast channel was created. If neither is set, no summary is automatically generated.
Broadcast Mode TTS Settings (tts_config):
Use the tts_config parameter to specify which translation languages should produce TTS audio for viewers.
| tts_config Field | Type | Description |
|---|---|---|
| voice | string | TTS voice name |
| speaking_rate | number | Speaking rate (0.5–2.0, default 1.0) |
{
"type": "voice-translation",
"data": {
"action": "start",
"type": "broadcast",
"broadcast_token": "a3f9",
"audio_format": "pcm",
"tts_config": {
"en-US": {
"voice": "en-US-JennyNeural",
"speaking_rate": 1.0
},
"ja-JP": {
"voice": "ja-JP-NanamiNeural",
"speaking_rate": 1.0
}
}
}
}
Note:
- TTS languages must be valid languages in translation_languages; invalid languages are automatically ignored.
- The host (WebSocket) does not receive TTS audio; only SSE viewers receive the tts_ready event.
- TTS is sent only during the live phase; nothing is sent during the standby phase.
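Since languages outside translation_languages are ignored, a client can tidy tts_config before sending it. This is a sketch: sanitizeTtsConfig is a hypothetical helper, and clamping speaking_rate into the documented 0.5 to 2.0 range is a client-side choice of this example (the document does not say how the server treats out-of-range values).

```javascript
// Drop languages the server would ignore and keep speaking_rate within 0.5..2.0.
function sanitizeTtsConfig(ttsConfig, translationLanguages) {
  const out = {};
  for (const [lang, cfg] of Object.entries(ttsConfig)) {
    if (!translationLanguages.includes(lang)) continue; // not a translation language
    const rate = cfg.speaking_rate ?? 1.0;               // documented default
    out[lang] = {
      voice: cfg.voice,
      speaking_rate: Math.min(2.0, Math.max(0.5, rate)),
    };
  }
  return out;
}
```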
TTS Playback Mode Descriptions
| Mode | Description | Behavior |
|---|---|---|
| sync | Synchronous mode (default) | Automatically plays the latest is_final=true translated sentence; if the previous sentence is still playing, it enters the queue and waits. |
| async | Asynchronous mode (manual control) | The user can choose any translated sentence for TTS, controlled with the tts_play command. |
Success Response
After a successful start, a session_started event is returned containing complete session initialization info.
General recordings (transcribe / conversation / record):
{
"type": "voice-translation",
"data": {
"action": "session_started",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"recording_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
"recording_type": "transcribe",
"recognition_mode": "single",
"message": "Speech recognition started"
}
}
Broadcast mode (broadcast):
{
"type": "voice-translation",
"data": {
"action": "session_started",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"recording_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
"recording_type": "broadcast",
"recognition_mode": "multi_speaker",
"phase": "standby",
"viewer_count": 0,
"queue_count": 0,
"peak_viewers": 0,
"total_viewers": 0,
"message": "Speech recognition started"
}
}
| Field | Type | Description |
|---|---|---|
| session_id | string | Session ID |
| recording_id | string | Recording ID (can be used for subsequent API queries) |
| recording_type | string | Recording type: transcribe, conversation, record, broadcast |
| recognition_mode | string | Recognition mode: single, multi_speaker |
| phase | string | Broadcast phase: standby or live (broadcast mode only) |
| viewer_count | int | Current number of online viewers (broadcast mode only) |
| queue_count | int | Number of viewers waiting in the queue (broadcast mode only) |
| peak_viewers | int | Peak number of viewers for this broadcast (broadcast mode only) |
| total_viewers | int | Total cumulative number of viewers who have connected (broadcast mode only) |
| message | string | Status description message |
Error Responses
| Error Code | HTTP Status | Description | Recommended Action |
|---|---|---|---|
| missing_transcription_languages | 400 | No language parameter provided | Make sure the request includes transcription_languages |
| invalid_transcription_language | 400 | Invalid language code | Confirm the language code format is correct (such as zh-TW) |
| too_many_languages | 400 | Number of languages exceeds the limit | At most 2 languages can be specified |
| invalid_recording_type | 400 | Invalid recording type | Use a valid type value |
| invalid_summary_template | 400 | Invalid summary template | Confirm the template identifier is correct |
| stt_init_failed | 503 | Service initialization failed | Retry later |
| auth_budget_exceeded | 402 | Monthly budget exceeded | Wait for the next month's budget reset or adjust the budget |
| tts_init_failed | 503 | TTS service initialization failed | Retry later |
| tts_invalid_language | 400 | TTS language not in the translation languages | Confirm tts_language is in translation_languages |
| broadcast_token_required | 400 | Broadcast mode requires a token | The broadcast type must provide a broadcast_token |
| broadcast_token_invalid | 400 | Invalid broadcast token | Confirm the token is correct and not expired |
| broadcast_not_ready | 503 | Broadcast service not yet started | Retry later |
| summary_invalid_mode | 400 | summary_mode is not builtin / custom | Use a valid mode |
| summary_mode_field_mismatch | 400 | The mode and field combination does not match (a required field is missing / a forbidden field was included) | Adjust fields per the mode rules |
| summary_prompt_too_long | 400 | summary_prompt exceeds 2000 characters | Shorten the custom prompt |
| summary_prompt_slug_too_long | 400 | summary_prompt_slug exceeds 64 characters | Shorten the identifier |
| summary_prompt_slug_invalid | 400 | summary_prompt_slug contains control characters (\n / \r / \t / \0, etc.) | Remove the control characters |
Voice Translation - config (Set Terminology / Correction Rules)
Description
Before or during recording, pass in terminology, fuzzy-word correction rules, and translation dictionary settings. These settings improve STT accuracy, fix homophone errors, and ensure translation consistency.
Auto-generated correction rules: When terminology is passed in, the system automatically generates fuzzy-word correction rules for each term (homophones, near-homophones, Traditional/Simplified variants). The frontend does not need to define fuzzy_correction manually, greatly simplifying the setup process.
Use Cases
- Pass in professional terminology (Phrase List) before recording starts
- Set fuzzy-word correction rules (homophone correction) - optional, the system generates them automatically
- Set a translation dictionary (ensure consistent terminology translation)
Timing
| Setting Type | Recommended Timing | Update During Recording |
|---|---|---|
| Terminology | Before or during start | Supported (takes effect on the next turn) |
| Fuzzy-word correction | Before or during start | Supported |
| Translation dictionary | Before or during start | Supported |
Note: When you update terminology during recording, the new terms automatically take effect at the next recognition turn boundary, with no need to reconnect. The response includes a terminology_effective: "next_turn" field as a hint.
Request Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| action | string | Yes | Fixed value config |
| terminology | object | No | Terminology settings |
| fuzzy_correction | object | No | Fuzzy-word correction rules |
| translation_dict | object | No | Translation dictionary |
Note: At least one setting item must be provided.
Terminology Format (terminology)
Keyed by language code, with an array of terms as the value:
{
"zh-TW": [
{ "term": "語者分離", "boost": 1.5 },
{ "term": "WebSocket", "boost": 2.0 }
],
"en-US": [
{ "term": "diarization", "boost": 1.5 }
]
}
| Field | Type | Required | Description |
|---|---|---|---|
| term | string | Yes | The term (max 100 characters) |
| boost | number | No | Weight (default 1.0, range 0.5–5.0) |
Limit: Up to 500 terms per language.
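The terminology limits can be checked locally before sending config. This is a sketch: precheckTerminology is a hypothetical helper, term length is counted in Unicode code points (an assumption), and boost_out_of_range is a local label invented for this example; the other two codes are the ones this document lists for the config action.

```javascript
// Returns an error code for the first violated limit, or null.
function precheckTerminology(terminology) {
  for (const terms of Object.values(terminology)) {
    if (terms.length > 500) return 'config_too_many_entries';
    for (const { term, boost = 1.0 } of terms) {
      if ([...term].length > 100) return 'config_term_too_long';
      // Hypothetical local code: the document gives the range but no server error for it.
      if (boost < 0.5 || boost > 5.0) return 'boost_out_of_range';
    }
  }
  return null;
}
```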
Fuzzy-Word Correction Format (fuzzy_correction)
Note: This field usually does not need to be set manually. The system automatically generates correction rules based on terminology. Use it only when you need custom special rules.
Keyed by language code, with an array of correction rules as the value:
{
"zh-TW": [
{ "correct": "語者分離", "incorrect": ["語這分離", "語者分力"] }
]
}
| Field | Type | Required | Description |
|---|---|---|---|
| correct | string | Yes | The correct word |
| incorrect | string[] | Yes | List of incorrect variants |
Auto-Generated Correction Rule Description
When terminology is passed in, the system automatically generates fuzzy-word correction rules for each term:
| Generation Type | Description | Example |
|---|---|---|
| Homophone | Alternative characters with the same pinyin | 語者 → 語這, 語折 |
| Near-homophone | Alternative characters with similar tones | 媽 → 麻, 馬 |
| Traditional/Simplified | Traditional/Simplified conversion | 製程 → 制程 |
Mixed Chinese-English term support: For mixed terms like "CVD製程," the system generates variants only for the Chinese portion and leaves the English unchanged.
| Original Term | Auto-Generated Variants |
|---|---|
| CVD製程 | CVD制程, CVD之程, CVD製城 |
| wafer良率 | wafer量率, wafer涼率 |
| 5nm製程 | 5nm制程, 5nm製成 |
Translation Dictionary Format (translation_dict)
Use an array of entries directly:
[
{
"source": "語者分離",
"translations": {
"en-US": "Speaker Diarization",
"ja-JP": "話者分離"
}
}
]
| Field | Type | Required | Description |
|---|---|---|---|
| source | string | Yes | The source word (in the STT language) |
| translations | object | Yes | Translation mapping { "language code": "translation" } |
Limit: We recommend no more than 50 entries (to avoid degrading processing performance).
Request Example (Recommended: Terminology Only)
{
"type": "voice-translation",
"data": {
"action": "config",
"terminology": {
"zh-TW": [
{ "term": "語者分離", "boost": 1.5 },
{ "term": "CVD製程", "boost": 1.5 },
{ "term": "wafer良率", "boost": 1.5 }
]
}
}
}
Request Example (Full Settings, Including Manual Correction Rules)
{
"type": "voice-translation",
"data": {
"action": "config",
"terminology": {
"zh-TW": [
{ "term": "語者分離", "boost": 1.5 },
{ "term": "即時轉錄", "boost": 1.5 }
]
},
"fuzzy_correction": {
"zh-TW": [
{ "correct": "語者分離", "incorrect": ["語這分離", "語者分力"] }
]
},
"translation_dict": [
{ "source": "語者分離", "translations": { "en-US": "Speaker Diarization" } }
]
}
}
Success Response
{
"type": "voice-translation",
"data": {
"action": "config_updated",
"updated": ["terminology", "fuzzy_correction", "translation_dict"],
"message": "Settings updated"
}
}
| Field | Type | Description |
|---|---|---|
| `updated` | string[] | The setting types that were updated |
| `message` | string | Status message |
Error Responses
| Error Code | HTTP Status | Description | Recommended Action |
|---|---|---|---|
| `config_empty` | 400 | No settings provided | Provide at least one setting item |
| `config_term_too_long` | 400 | Term exceeds 100 characters | Shorten the term length |
| `config_too_many_entries` | 400 | More than 500 terms | Reduce the number of terms |
| `config_too_many_dict_entries` | 400 | Translation dictionary exceeds 50 entries | Reduce the dictionary entries |
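A client can check these limits before sending, which avoids a round trip for obviously invalid settings. The following is a minimal Python sketch, not part of the API: `build_config_message` is a hypothetical helper, the limits are copied from the error table above, and it assumes the 500-term limit counts terms across all languages (the source does not say).

```python
import json

MAX_TERM_CHARS = 100    # config_term_too_long
MAX_TERMS = 500         # config_too_many_entries (assumed: total across languages)
MAX_DICT_ENTRIES = 50   # config_too_many_dict_entries


def build_config_message(terminology=None, fuzzy_correction=None, translation_dict=None):
    """Return the JSON text of a `config` action, or raise ValueError.

    The ValueError message reuses the documented error code that the
    server would be expected to return for the same input.
    """
    data = {"action": "config"}
    if terminology:
        terms = [entry["term"] for entries in terminology.values() for entry in entries]
        if any(len(term) > MAX_TERM_CHARS for term in terms):
            raise ValueError("config_term_too_long")
        if len(terms) > MAX_TERMS:
            raise ValueError("config_too_many_entries")
        data["terminology"] = terminology
    if fuzzy_correction:
        data["fuzzy_correction"] = fuzzy_correction
    if translation_dict:
        if len(translation_dict) > MAX_DICT_ENTRIES:
            raise ValueError("config_too_many_dict_entries")
        data["translation_dict"] = translation_dict
    if len(data) == 1:  # only "action" is present
        raise ValueError("config_empty")
    # ensure_ascii=False keeps CJK terms readable in logs
    return json.dumps({"type": "voice-translation", "data": data}, ensure_ascii=False)
```

The returned string can be passed directly to whatever WebSocket client is in use.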
Voice Translation - audio (Send Audio)
Description
Sends audio data to the server for speech recognition. The audio must be Base64-encoded before sending.
Use Cases
- Continuously sending microphone audio
- Sending recorded audio segments
Request Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `action` | string | Yes | Fixed value audio |
| `payload` | string | Yes | Base64-encoded audio data |
Audio Format Requirements
PCM format (default):
| Item | Specification |
|---|---|
| Format | PCM (raw audio) |
| Sample rate | 16000 Hz |
| Bit depth | 16-bit |
| Channels | Mono |
| Byte order | Little-endian |
| Transport encoding | Base64 |
WebM/Opus format:
| Item | Specification |
|---|---|
| Format | WebM container + Opus codec |
| Sample rate | Any (the server converts automatically) |
| Channels | Mono or Stereo (the server converts automatically) |
| Transport encoding | Base64 |
Request Example
{
"type": "voice-translation",
"data": {
"action": "audio",
"payload": "Base64-encoded PCM audio data"
}
}
Error Responses
| Error Code | HTTP Status | Description | Recommended Action |
|---|---|---|---|
| `session_not_started` | 400 | Speech recognition has not started | Call the start action first |
| `audio_invalid_format` | 400 | Invalid audio data format | Confirm the Base64 encoding is correct |
| `audio_format_unsupported` | 400 | Unsupported audio format | Use the pcm or webm format |
| `audio_decode_failed` | 400 | Audio decoding failed | Confirm the audio format is correct |
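As an illustration of the PCM requirements above, the sketch below packs 16-bit little-endian mono samples and wraps them in `audio` messages. It is a Python example under stated assumptions: the helper names are hypothetical, and the 100 ms chunk size is an arbitrary choice, since the source does not specify a chunk length.

```python
import base64
import json
import struct

SAMPLE_RATE = 16000    # Hz, from the PCM table above
BYTES_PER_SAMPLE = 2   # 16-bit, mono


def pcm_from_samples(samples):
    """Pack signed 16-bit integers as little-endian PCM bytes."""
    return struct.pack("<%dh" % len(samples), *samples)


def audio_messages(pcm, chunk_ms=100):
    """Yield one `audio` message (JSON text) per chunk of raw PCM.

    chunk_ms is an assumption of this sketch, not a documented value.
    """
    chunk_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    for offset in range(0, len(pcm), chunk_bytes):
        chunk = pcm[offset:offset + chunk_bytes]
        payload = base64.b64encode(chunk).decode("ascii")
        yield json.dumps({
            "type": "voice-translation",
            "data": {"action": "audio", "payload": payload},
        })
```

Each yielded string is one WebSocket text frame; the last chunk may be shorter than the others.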
Voice Translation - pause (Pause Translation)
Description
Pauses speech recognition processing. Audio received during the pause is buffered and is processed after the session resumes.
Use Cases
- The user steps away temporarily
- You need to pause recording
Request Example
{
"type": "voice-translation",
"data": {
"action": "pause"
}
}
Success Response
{
"type": "voice-translation",
"data": {
"action": "status",
"message": "Speech recognition paused"
}
}
Error Responses
| Error Code | HTTP Status | Description | Recommended Action |
|---|---|---|---|
| `session_not_started` | 400 | Speech recognition has not started | Call start first |
| `session_already_paused` | 400 | Already paused | You can ignore this error |
Voice Translation - resume (Resume Translation)
Description
Resumes paused speech recognition processing.
Use Cases
- The user returns to continue
- You need to continue recording
Request Example
{
"type": "voice-translation",
"data": {
"action": "resume"
}
}
Success Response
{
"type": "voice-translation",
"data": {
"action": "status",
"message": "Speech recognition resumed"
}
}
Error Responses
| Error Code | HTTP Status | Description | Recommended Action |
|---|---|---|---|
| `session_not_started` | 400 | Speech recognition has not started | Call start first |
| `session_not_paused` | 400 | Not paused | You can ignore this error |
Voice Translation - stop (Stop Translation)
Description
Stops speech recognition and ends the session. The system automatically uploads the audio file and transcript, and generates a summary (if configured).
Use Cases
- The meeting ends
- Recording is complete
Request Example
{
"type": "voice-translation",
"data": {
"action": "stop"
}
}
Success Response
{
"type": "voice-translation",
"data": {
"action": "status",
"message": "Speech recognition stopped"
}
}
Task Complete Event
This event is sent after the audio file and transcript have been uploaded:
{
"type": "voice-translation",
"data": {
"action": "task_complete",
"task_id": "550e8400-e29b-41d4-a716-446655440000",
"message": "Task processing complete"
}
}
| Field | Type | Description |
|---|---|---|
| `task_id` | string | Recording UUID, can be used for subsequent API queries |
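Because `task_complete` arrives some time after the `status` response to `stop`, a client typically keeps reading messages until it sees that event. A minimal Python sketch follows; `find_task_id` is a hypothetical helper that scans any iterable of raw JSON messages, so it does not depend on a particular WebSocket library.

```python
import json


def find_task_id(raw_messages):
    """Return the task_id of the first task_complete event, or None.

    raw_messages is any iterable of JSON text frames received after
    the `stop` action was sent.
    """
    for raw in raw_messages:
        message = json.loads(raw)
        data = message.get("data", {})
        if (message.get("type") == "voice-translation"
                and data.get("action") == "task_complete"):
            return data.get("task_id")
    return None
```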
Voice Translation - retranslate (Retranslate)
Description
Retranslates a specified sentence, useful when the original text has been corrected and the translation needs to be updated.
Use Cases
- The user edits the original text and the translation needs updating
- Correcting recognition errors
Request Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `action` | string | Yes | Fixed value retranslate |
| `sid` | int | Yes | The sentence number to retranslate |
| `translation_languages` | string[] | Yes | Array of translation language codes |
| `text` | string | Yes | The original text to translate (the user-corrected text) |
Request Example
{
"type": "voice-translation",
"data": {
"action": "retranslate",
"sid": 1,
"translation_languages": ["en-US"],
"text": "The user-corrected original text"
}
}
Success Response
{
"type": "voice-translation",
"data": {
"action": "result",
"translations": {
"en-US": {
"sid": 1,
"text": "The new translation result",
"is_final": true,
"is_retranslation": true
}
}
}
}
Error Responses
| Error Code | HTTP Status | Description | Recommended Action |
|---|---|---|---|
| `retranslate_sid_not_found` | 400 | The specified SID was not found | Confirm the SID exists |
| `retranslate_session_not_active` | 400 | The session is not started or has ended | Confirm the session state |
| `retranslate_no_target_lang` | 400 | No target language provided | Provide translation_languages |
| `retranslate_no_text` | 400 | No text to translate provided | Provide the text parameter |
| `retranslate_llm_failed` | 500 | Translation service failed | Retry later |
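The request and response shapes above can be sketched in Python as follows. Both function names are hypothetical, and the local store (a dict keyed by sentence number, then by language) is only one possible way for a client to keep translations.

```python
def build_retranslate(sid, text, languages):
    """Build a `retranslate` request as a dict, checking required fields."""
    if not text:
        raise ValueError("retranslate_no_text")
    if not languages:
        raise ValueError("retranslate_no_target_lang")
    return {
        "type": "voice-translation",
        "data": {
            "action": "retranslate",
            "sid": sid,
            "translation_languages": list(languages),
            "text": text,
        },
    }


def apply_result(store, event):
    """Copy translations from a `result` event into store[sid][language]."""
    for language, item in event["data"].get("translations", {}).items():
        store.setdefault(item["sid"], {})[language] = item["text"]
```

A retranslation overwrites the earlier entry for the same sentence and language, which matches the intent of replacing an outdated translation.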
Voice Translation - switch_language (Switch Language)
Description
Switches the language while real-time translation is in progress. The behavior varies by recording type:
- General mode (transcribe, etc.): switches the translation target language and automatically batch-retranslates all already-translated sentences.
- Two-way mode (conversation): switches the STT source language (spoken language); the translation target automatically switches to the other language.
Use Cases
- Switching the translation target language
- Language needs change mid-meeting
Request Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `action` | string | Yes | Fixed value switch_language |
| `translation_languages` | string[] | Conditional | Array of translation language codes (required in general mode) |
| `transcription_languages` | string[] | Conditional | The target language to switch to (two-way mode; if omitted, automatically toggles to the other language) |
Request Example (General Mode)
{
"type": "voice-translation",
"data": {
"action": "switch_language",
"translation_languages": ["ja-JP"]
}
}
Request Example (Two-Way Mode)
Specify the target to switch to:
{
"type": "voice-translation",
"data": {
"action": "switch_language",
"transcription_languages": ["en-US"]
}
}
Automatic toggle (no parameters):
{
"type": "voice-translation",
"data": {
"action": "switch_language"
}
}
Two-Way Mode Special Behavior:
- Two-way mode uses automatic language detection and usually does not require manually switching the language.
- `switch_language` only updates the internal preference state.
- After a successful switch, a `language_switched` event is returned (not a `language_switch_start`/`language_switch_done` sequence).
- Switching to the same language returns a `conversation_same_language` warning.
Response Sequence (General Mode)
After switching the language, you receive the following events in order:
- language_switch_start: notifies that the switch has begun
{
"type": "voice-translation",
"data": {
"action": "language_switch_start",
"translation_language": "ja-JP",
"total_segments": 15,
"message": "Starting language switch and retranslation"
}
}
- batch_retranslation (multiple): returns retranslation results sentence by sentence
{
"type": "voice-translation",
"data": {
"action": "batch_retranslation",
"sid": 3,
"translations": {
"ja-JP": {
"sid": 3,
"text": "今日はプロジェクトの進捗について話し合いましょう",
"is_final": true,
"is_retranslation": true
}
}
}
}
- language_switch_done: notifies that the switch is complete
{
"type": "voice-translation",
"data": {
"action": "language_switch_done",
"translation_language": "ja-JP",
"success_count": 15,
"failed_count": 0,
"message": "Language switch complete"
}
}
Error Responses
| Error Code | HTTP Status | Description | Recommended Action |
|---|---|---|---|
| `switch_language_no_target` | 400 | No target language provided | Provide translation_languages |
| `switch_language_in_progress` | 400 | The previous switch is not yet complete | Wait for the switch to complete |
| `switch_language_same_target` | 400 | The target language is the same as the current one | You can ignore this error |
| `conversation_requires_two_languages` | 400 | Two-way mode requires exactly two languages | Confirm transcription_languages has 2 |
| `conversation_languages_identical` | 400 | The two two-way languages cannot be the same | Provide two different languages |
| `conversation_invalid_language` | 400 | Invalid two-way language | Confirm the language is in transcription_languages |
| `conversation_same_language` | 400 | Already the current language | You can ignore this warning |
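The general-mode response sequence (language_switch_start, then zero or more batch_retranslation events, then language_switch_done) can be followed with a small tracker. This Python sketch is illustrative only; `LanguageSwitchTracker` and its attributes are hypothetical names, and it consumes the `data` object of each event.

```python
class LanguageSwitchTracker:
    """Follow one general-mode language switch from start to done."""

    def __init__(self):
        self.in_progress = False
        self.total = 0          # total_segments announced at the start
        self.received = 0       # batch_retranslation events seen so far
        self.failed = 0         # failed_count reported at the end
        self.retranslated = {}  # (sid, language) -> new text

    def handle(self, data):
        """Update state from one event's data; return True while switching."""
        action = data.get("action")
        if action == "language_switch_start":
            self.in_progress = True
            self.total = data["total_segments"]
            self.received = 0
        elif action == "batch_retranslation":
            self.received += 1
            for language, item in data["translations"].items():
                self.retranslated[(item["sid"], language)] = item["text"]
        elif action == "language_switch_done":
            self.in_progress = False
            self.failed = data["failed_count"]
        return self.in_progress
```

While `in_progress` is true, a client can hold back a second switch_language request, which would otherwise be answered with switch_language_in_progress.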
Voice Translation - set_name (Set Recording Name)
Description
Sets the name while recording is in progress. After it is set, this name is used when the recording ends and will not be auto-generated.
Tip: You can also set an initial default name via the `name` parameter at `start`, but that name may still be overridden by the system when the session ends. If you need a fixed name, use `set_name`.
Use Cases
- Customizing the recording title after recording starts
- Overriding an auto-generated name or a previously set name
Request Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `action` | string | Yes | Fixed value set_name |
| `name` | string | Yes | Recording name (max 60 chars) |
Request Example
{
"type": "voice-translation",
"data": {
"action": "set_name",
"name": "Product Planning Meeting"
}
}
Success Response
{
"type": "voice-translation",
"data": {
"action": "status",
"message": "Recording name updated"
}
}
Error Responses
| Error Code | HTTP Status | Description | Recommended Action |
|---|---|---|---|
| `name_too_long` | 400 | Recording name exceeds the limit | Shorten the name |
| `session_not_started` | 400 | Speech recognition has not started | Call start first |
Voice Translation - rename_speaker (Globally Rename a Speaker)
Description
In multi-speaker diarization mode (multi_speaker), globally renames a speaker. All sentences using that speaker ID are updated in sync.
Use Cases
- Changing a system-assigned speaker ID (such as `Guest-1`) to a meaningful name (such as `Manager Wang`)
- Naming a newly recognized speaker during a meeting
Request Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `action` | string | Yes | Fixed value rename_speaker |
| `speaker_id` | string | Yes | The original speaker ID (such as Guest-1); the current display label is also accepted for consecutive renaming; max 100 characters |
| `new_label` | string | Yes | The new display label; max 100 characters, must not contain control characters (\x00-\x1F, \x7F) or line breaks |
Request Example
{
"type": "voice-translation",
"data": {
"action": "rename_speaker",
"speaker_id": "Guest-1",
"new_label": "Manager Wang"
}
}
Success Response
{
"type": "voice-translation",
"data": {
"action": "speaker_renamed",
"speaker_id": "Guest-1",
"new_label": "Manager Wang",
"affected_sids": [1, 3, 5, 8]
}
}
| Field | Type | Description |
|---|---|---|
| `speaker_id` | string | The resolved original speaker ID (even if the input was a display label, the event returns the original ID) |
| `new_label` | string | The new display label |
| `affected_sids` | int[] | The list of affected sentence numbers |
Error Responses
| Error Code | HTTP Status | Description | Recommended Action |
|---|---|---|---|
| `speaker_not_found` | 400 | The specified speaker was not found | Confirm the speaker_id or display label exists |
| `speaker_name_empty` | 400 | new_label is empty | Provide a valid label |
| `speaker_name_duplicate` | 422 | The display label is already in use | Use a different label, or first change the conflicting speaker |
| `session_not_started` | 400 | Speech recognition has not started | Call start first |
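The constraints on `new_label` (non-empty, at most 100 characters, no control characters or line breaks) can be checked on the client before sending. A minimal Python sketch, assuming the limit is counted in characters rather than bytes; `validate_new_label` is a hypothetical helper and the returned reasons are illustrative strings, not server error codes.

```python
import re

# \x00-\x1F already includes \n and \r, so ordinary line breaks are covered.
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")
MAX_LABEL_CHARS = 100


def validate_new_label(label):
    """Return None if the label looks acceptable, else a short reason."""
    if not label:
        return "label is empty"
    if len(label) > MAX_LABEL_CHARS:
        return "label is longer than 100 characters"
    if _CONTROL_CHARS.search(label):
        return "label contains a control character or line break"
    return None
```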
Voice Translation - reassign_speaker (Change the Speaker of a Single Sentence)
Description
Changes the speaker identity (OriginalSpeakerID) of a specific sentence, assigning the sentence to an existing speaker.
Use Cases
- Correcting a speaker identity that the system recognized incorrectly
- Reassigning a sentence to another known speaker
Request Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `action` | string | Yes | Fixed value reassign_speaker |
| `sid` | int | Yes | The sentence number to change |
| `target_speaker_id` | string | Yes | The target speaker's original ID (taken from init_sentence.speaker_id; reassign does not accept display labels) |
Request Example
{
"type": "voice-translation",
"data": {
"action": "reassign_speaker",
"sid": 5,
"target_speaker_id": "Guest-2"
}
}
Success Response
{
"type": "voice-translation",
"data": {
"action": "speaker_reassigned",
"sid": 5,
"old_speaker_id": "Guest-1",
"new_speaker_id": "Guest-2",
"new_speaker_label": "Lee Hsiao-hua"
}
}
| Field | Type | Description |
|---|---|---|
| `sid` | int | The changed sentence number |
| `old_speaker_id` | string | The original speaker ID |
| `new_speaker_id` | string | The new original speaker ID |
| `new_speaker_label` | string | The new speaker display label (after applying speaker_aliases; equals new_speaker_id when no alias exists) |
Error Responses
| Error Code | HTTP Status | Description | Recommended Action |
|---|---|---|---|
| `speaker_sid_not_found` | 400 | The specified sentence was not found | Confirm the SID exists |
| `speaker_not_found` | 400 | The target speaker does not exist | Use an existing speaker ID |
| `speaker_name_empty` | 400 | The target speaker ID cannot be empty | Provide a valid speaker ID |
| `session_not_started` | 400 | Speech recognition has not started | Call start first |
| `invalid_parameter` | 400 | Creating a new speaker is not supported | Use an existing speaker ID |
Voice Translation - merge_speakers (Merge Speakers)
Description
Merges all sentences of one speaker into another speaker. After the merge, future recognition results for that speaker are also automatically converted to the target speaker.
Use Cases
- The speech recognition engine sometimes misidentifies the same person's voice as multiple speakers (for example, Guest-1 and Guest-2 are actually the same person)
- Use this feature to merge all of Guest-2's sentences into Guest-1
- After the merge, future Guest-2 recognition results are automatically displayed as Guest-1
Difference from reassign_speaker
| Feature | Scope | Future Impact |
|---|---|---|
| `reassign_speaker` | A single sentence (1 SID) | None |
| `merge_speakers` | All sentences of that speaker | Future appearances of the source are also automatically converted to the target |
Request Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `action` | string | Yes | Fixed value merge_speakers |
| `source_speaker_id` | string | Yes | The speaker ID to be merged (such as Guest-2) |
| `target_speaker_id` | string | Yes | The merge target speaker ID (such as Guest-1) |
Request Example
{
"type": "voice-translation",
"data": {
"action": "merge_speakers",
"source_speaker_id": "Guest-2",
"target_speaker_id": "Guest-1"
}
}
Success Response
{
"type": "voice-translation",
"data": {
"action": "speakers_merged",
"source_speaker_id": "Guest-2",
"target_speaker_id": "Guest-1",
"target_speaker_label": "Manager Wang",
"affected_sids": [3, 5, 7]
}
}
| Field | Type | Description |
|---|---|---|
| `source_speaker_id` | string | The original ID of the merged speaker |
| `target_speaker_id` | string | The original ID of the merge target |
| `target_speaker_label` | string | The target speaker display label (after applying speaker_aliases; equals the original ID when no alias exists) |
| `affected_sids` | int[] | The list of affected sentence IDs |
Error Responses
| Error Code | HTTP Status | Description | Recommended Action |
|---|---|---|---|
| `speaker_not_found` | 400 | The speaker does not exist | Confirm the speaker ID exists |
| `merge_speakers_same_id` | 400 | The source and target speaker are the same | Use different speaker IDs |
| `speaker_name_empty` | 400 | The speaker ID cannot be empty | Provide a valid speaker ID |
| `session_not_started` | 400 | Speech recognition has not started | Call start first |
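The three speaker events (speaker_renamed, speaker_reassigned, speakers_merged) differ mainly in which sentences they touch. The Python sketch below shows one way a client could keep a local transcript in sync with them; `SpeakerState` and its attributes are hypothetical, and the sketch assumes the client already tracks which original speaker ID each sentence belongs to.

```python
class SpeakerState:
    """Client-side view of who said which sentence."""

    def __init__(self):
        self.speaker_of = {}  # sid -> original speaker ID
        self.labels = {}      # original speaker ID -> display label

    def label_for(self, sid):
        """Display label for a sentence; falls back to the original ID."""
        speaker_id = self.speaker_of[sid]
        return self.labels.get(speaker_id, speaker_id)

    def handle(self, data):
        """Apply one speaker event (the `data` object of the message)."""
        action = data.get("action")
        if action == "speaker_renamed":
            # Global rename: every sentence of this speaker shows the new label.
            self.labels[data["speaker_id"]] = data["new_label"]
        elif action == "speaker_reassigned":
            # Single sentence moves to an existing speaker.
            self.speaker_of[data["sid"]] = data["new_speaker_id"]
            self.labels[data["new_speaker_id"]] = data["new_speaker_label"]
        elif action == "speakers_merged":
            # All listed sentences move to the merge target.
            for sid in data["affected_sids"]:
                self.speaker_of[sid] = data["target_speaker_id"]
            self.labels[data["target_speaker_id"]] = data["target_speaker_label"]
```

Keeping the label map separate from the per-sentence speaker IDs means a rename needs no per-sentence update on the client.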
Voice Translation - tts_play (Play TTS)
Description
In async mode, manually plays the TTS audio for a specified sentence.
Use Cases
- The user selects a specific sentence for TTS playback
- Playing multiple consecutive sentences
Request Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `action` | string | Yes | Fixed value tts_play |
| `sid` | int | Yes | The starting sentence ID |
| `length` | int | No | Number of sentences to play (default 1, max 20) |
Note: The maximum value of `length` is controlled by the backend environment variable `TTS_SSE_MAX_LENGTH` (default 20).
Request Example (Single Sentence)
{
"type": "voice-translation",
"data": {
"action": "tts_play",
"sid": 5
}
}
Request Example (Multiple Sentences)
{
"type": "voice-translation",
"data": {
"action": "tts_play",
"sid": 5,
"length": 3
}
}
Error Responses
| Error Code | HTTP Status | Description | Recommended Action |
|---|---|---|---|
| `tts_not_enabled` | 400 | TTS not enabled | Confirm TTS was enabled at start |
| `tts_sid_not_found` | 400 | The specified sentence was not found | Confirm the SID exists |
| `tts_translation_not_found` | 400 | The sentence has no translation in the specified language | Confirm the translation exists |
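A request builder for tts_play might look like the Python sketch below. `build_tts_play` is a hypothetical helper; it uses the default maximum of 20 from the note above, although the real limit is set by the server and may differ per deployment.

```python
DEFAULT_MAX_LENGTH = 20  # default of TTS_SSE_MAX_LENGTH; the server value may differ


def build_tts_play(sid, length=None, max_length=DEFAULT_MAX_LENGTH):
    """Build a `tts_play` request; `length` is omitted when not given."""
    data = {"action": "tts_play", "sid": sid}
    if length is not None:
        if not 1 <= length <= max_length:
            raise ValueError("length must be between 1 and %d" % max_length)
        data["length"] = length
    return {"type": "voice-translation", "data": data}
```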
Voice Translation - tts_stop (Stop TTS)
Description
Stops the TTS audio that is currently playing.
Use Cases
- The user manually stops TTS playback
- Stopping the current playback before switching to another sentence
Request Example
{
"type": "voice-translation",
"data": {
"action": "tts_stop"
}
}
Success Response
{
"type": "voice-translation",
"data": {
"action": "status",
"message": "TTS playback stopped"
}
}
Voice Translation - tts_mode (Switch TTS Mode)
Description
Switches the TTS playback mode (synchronous/asynchronous) while recording is in progress.
Use Cases
- Switching from automatic playback to manual control
- Switching from manual control to automatic playback
Request Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `action` | string | Yes | Fixed value tts_mode |
| `tts_mode` | string | Yes | Mode: sync (synchronous) or async (asynchronous) |
Request Example
{
"type": "voice-translation",
"data": {
"action": "tts_mode",
"tts_mode": "async"
}
}
Success Response
{
"type": "voice-translation",
"data": {
"action": "tts_mode_changed",
"tts_mode": "async"
}
}
Error Responses
| Error Code | HTTP Status | Description | Recommended Action |
|---|---|---|---|
| `tts_not_enabled` | 400 | TTS not enabled | Confirm TTS was enabled at start |
| `tts_invalid_mode` | 400 | Invalid mode | Use sync or async |
Voice Translation - set_tts (Two-Way TTS Settings)
Description
While a two-way mode (conversation) recording is in progress, dynamically toggles TTS on/off or updates the TTS voice settings. Available only in two-way mode.
Use Cases
- Turning the TTS audio response off/on mid-conversation in two-way mode
- Changing the TTS voice or speaking rate for a specific language
Request Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `action` | string | Yes | Fixed value set_tts |
| `tts_enabled` | boolean | No | Whether to enable two-way TTS (true / false) |
| `tts_config` | object | No | TTS settings per language; the key is the language code, and the value is {voice, speaking_rate} |
Note: At least one of `tts_enabled` and `tts_config` must be provided. `tts_config` updates only the settings for the specified languages; unspecified languages remain unchanged.
Request Example (Disable TTS)
{
"type": "voice-translation",
"data": {
"action": "set_tts",
"tts_enabled": false
}
}
Request Example (Update Voice Settings)
{
"type": "voice-translation",
"data": {
"action": "set_tts",
"tts_enabled": true,
"tts_config": {
"en-US": {
"voice": "en-US-GuyNeural",
"speaking_rate": 1.2
}
}
}
}
Success Response
{
"type": "voice-translation",
"data": {
"action": "tts_updated",
"tts_enabled": true,
"tts_config": {
"zh-TW": { "voice": "zh-TW-HsiaoChenNeural", "speaking_rate": 1.0 },
"en-US": { "voice": "en-US-GuyNeural", "speaking_rate": 1.2 }
}
}
}
| Field | Type | Description |
|---|---|---|
| `tts_enabled` | boolean | The current TTS enabled state |
| `tts_config` | object | The current complete TTS settings (all languages) |
Error Responses
| Error Code | HTTP Status | Description | Recommended Action |
|---|---|---|---|
| `invalid_action` | 400 | Not two-way mode | This action is available only in conversation mode |
| `session_not_started` | 400 | Speech recognition has not started | Call start first |
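The rule that at least one of tts_enabled and tts_config must be present is easy to enforce when building the request. A minimal Python sketch with a hypothetical `build_set_tts` helper:

```python
def build_set_tts(tts_enabled=None, tts_config=None):
    """Build a `set_tts` request containing only the fields that were given."""
    if tts_enabled is None and tts_config is None:
        raise ValueError("provide tts_enabled, tts_config, or both")
    data = {"action": "set_tts"}
    if tts_enabled is not None:
        data["tts_enabled"] = bool(tts_enabled)
    if tts_config is not None:
        # Only the listed languages are updated; others keep their settings.
        data["tts_config"] = tts_config
    return {"type": "voice-translation", "data": data}
```

Note that `tts_enabled=False` is a valid value, which is why the check compares against `None` rather than testing truthiness.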
Voice Translation - start_speaking (Start Speaking / Manual Mode)
Description
In two-way manual mode (conversation_mode: "manual"), notifies the system that the user has started speaking. From this point on, audio is sent to STT for recognition, and all recognition results accumulate into a single sentence (no automatic segmentation).
Request Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `action` | string | Yes | Fixed value start_speaking |
| `speaker` | int | Yes | User number (1 or 2) |
Request Example
{
"type": "voice-translation",
"data": {
"action": "start_speaking",
"speaker": 1
}
}
Success Response
{
"type": "voice-translation",
"data": {
"action": "status",
"message": "Speaking started"
}
}
Error Responses
| Error Code | HTTP Status | Description | Recommended Action |
|---|---|---|---|
| `invalid_action` | 400 | Not two-way mode | Use only under the conversation type |
| `conversation_not_manual_mode` | 400 | Not manual mode | Use only in manual mode |
| `conversation_speaking` | 400 | Already speaking | Call stop_speaking first |
| `conversation_invalid_speaker` | 400 | Invalid user number | Use 1 or 2 |
Voice Translation - stop_speaking (Stop Speaking / Manual Mode)
Description
In two-way manual mode, notifies the system that the user has stopped speaking. The system merges the recognition results accumulated during the period into a single complete sentence and performs translation and TTS synthesis.
Request Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `action` | string | Yes | Fixed value stop_speaking |
Request Example
{
"type": "voice-translation",
"data": {
"action": "stop_speaking"
}
}
Success Response
After stopping speaking, the system sends a complete result event (containing origin and translations):
{
"type": "voice-translation",
"data": {
"action": "result",
"origin": {
"sid": 1,
"language": "zh-TW",
"text": "The complete sentence merged from all recognition during this period",
"is_final": true,
"speaker_id": "Speaker-1",
"start_time": "00:05"
},
"translations": {
"en-US": {
"sid": 1,
"text": "The complete merged sentence from this speaking period",
"is_final": true
}
}
}
}
Error Responses
| Error Code | HTTP Status | Description | Recommended Action |
|---|---|---|---|
| `invalid_action` | 400 | Not two-way mode | Use only under the conversation type |
| `conversation_not_speaking` | 400 | Not in a speaking state | Call start_speaking first |
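In manual mode, start_speaking and stop_speaking must alternate. A push-to-talk client can mirror that rule locally so that it never sends a request the server would reject. The Python sketch below is illustrative; `ManualTurn` is a hypothetical class, and its ValueError messages reuse the documented error codes.

```python
class ManualTurn:
    """Local guard for the start_speaking / stop_speaking sequence."""

    def __init__(self):
        self.speaking = False

    def start(self, speaker):
        """Return a start_speaking request, or raise if it would be rejected."""
        if speaker not in (1, 2):
            raise ValueError("conversation_invalid_speaker")
        if self.speaking:
            raise ValueError("conversation_speaking")
        self.speaking = True
        return {"type": "voice-translation",
                "data": {"action": "start_speaking", "speaker": speaker}}

    def stop(self):
        """Return a stop_speaking request, or raise if nobody is speaking."""
        if not self.speaking:
            raise ValueError("conversation_not_speaking")
        self.speaking = False
        return {"type": "voice-translation", "data": {"action": "stop_speaking"}}
```

Audio messages are sent between the two requests; the merged `result` event arrives after stop_speaking.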
Voice Translation - switch_conversation_mode (Switch Conversation Mode)
Description
While two-way mode is in progress, switches between auto-detect mode (auto) and manual mode (manual). If the user is currently speaking when the switch happens, speaking is ended automatically.
Request Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `action` | string | Yes | Fixed value switch_conversation_mode |
| `conversation_mode` | string | Yes | The target mode: auto or manual |
Request Example
{
"type": "voice-translation",
"data": {
"action": "switch_conversation_mode",
"conversation_mode": "manual"
}
}
Success Response
{
"type": "voice-translation",
"data": {
"action": "conversation_mode_changed",
"conversation_mode": "manual"
}
}
Error Responses
| Error Code | HTTP Status | Description | Recommended Action |
|---|---|---|---|
| `invalid_action` | 400 | Not two-way mode | Use only under the conversation type |
| `conversation_invalid_mode` | 400 | Invalid conversation mode | Use auto or manual |
Voice Translation - set_speaker_language (Set User Language)
Description
While two-way mode is in progress, changes a specified user's language in real time. The system rebuilds the STT connection to accommodate the new language, and the translation target is also updated automatically. The transcript content before the change keeps its original language, and the timestamp continues to count without resetting.
Request Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `action` | string | Yes | Fixed value set_speaker_language |
| `speaker` | int | Yes | User number (1 or 2) |
| `language` | string | Yes | The new language code (such as ja-JP) |
Request Example
{
"type": "voice-translation",
"data": {
"action": "set_speaker_language",
"speaker": 1,
"language": "ja-JP"
}
}
Success Response
{
"type": "voice-translation",
"data": {
"action": "speaker_language_changed",
"speaker_language_map": {
"1": "ja-JP",
"2": "en-US"
}
}
}
Error Responses
| Error Code | HTTP Status | Description | Recommended Action |
|---|---|---|---|
| `invalid_action` | 400 | Not two-way mode | Use only under the conversation type |
| `conversation_invalid_speaker` | 400 | Invalid user number | Use 1 or 2 |
| `conversation_invalid_language` | 400 | Invalid language code | Use a valid BCP 47 language code |
| `conversation_same_language` | 400 | Same as the current language | You can ignore this warning |
| `conversation_language_same_as_peer` | 400 | The new language is the same as the other user | The two users cannot have the same language |
| `conversation_speaking` | 400 | Currently speaking, cannot change language | End speaking before changing |
| `conversation_language_change_failed` | 500 | Language change failed (STT rebuild failed) | Retry later |
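Several of these errors can be predicted from state the client already has. The Python sketch below is a hypothetical pre-check: `check_language_change` takes a map shaped like `speaker_language_map` in the success response (string keys "1" and "2") and returns the error code the request would likely trigger, or None. The order of the checks is an assumption; the source does not say which error the server reports first.

```python
def check_language_change(language_map, speaker, language, speaking=False):
    """Predict the documented error code for set_speaker_language, or None."""
    if speaker not in (1, 2):
        return "conversation_invalid_speaker"
    if speaking:
        return "conversation_speaking"
    if language_map[str(speaker)] == language:
        return "conversation_same_language"
    peer = 2 if speaker == 1 else 1
    if language_map[str(peer)] == language:
        return "conversation_language_same_as_peer"
    return None
```

The check does not validate the language code itself, so conversation_invalid_language can still come back from the server.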
Voice Translation - broadcast_go_live (Switch to the Live Phase)
Description
Switches from the broadcast standby phase (standby) to the live phase (live). After switching, STT/translation results begin broadcasting to viewers and start being written to the transcript.
Use Cases
- The host confirms the equipment is working and starts the official broadcast
- Switching from the warm-up phase to live streaming
Request Example
{
"type": "voice-translation",
"data": {
"action": "broadcast_go_live"
}
}
Success Response
{
"type": "voice-translation",
"data": {
"action": "broadcast_phase_changed",
"phase": "live",
"message": "Broadcast started"
}
}
| Field | Type | Description |
|---|---|---|
| `phase` | string | The new phase (live) |
| `message` | string | Status description message |
Error Responses
| Error Code | HTTP Status | Description | Recommended Action |
|---|---|---|---|
| `broadcast_not_enabled` | 400 | Not broadcast mode | Confirm type: "broadcast" |
| `session_not_started` | 400 | Speech recognition has not started | Call start first |
Note: If already in the live phase, a status message "Broadcast is already in progress" is returned and is not treated as an error.
Voice Translation - broadcast_announcement (Send an Announcement)
Description
The host sends a custom message announcement to all viewers. Viewers receive an announcement event via SSE. The announcement message is automatically translated into all translation languages, and the SSE event viewers receive includes a translations field.
Use Cases
- Notifying viewers that the meeting is about to end
- Sending an important reminder or announcement
- One-way communication with viewers
Request Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `action` | string | Yes | Fixed value broadcast_announcement |
| `message` | string | Yes | The announcement message content |
Request Example
{
"type": "voice-translation",
"data": {
"action": "broadcast_announcement",
"message": "The meeting will end in 5 minutes"
}
}
Success Response
{
"type": "voice-translation",
"data": {
"action": "status",
"message": "Announcement sent"
}
}
The SSE event viewers receive (with translations):
event: announcement
data: {"message":"The meeting will end in 5 minutes","translations":{"en-US":"The meeting will end in 5 minutes","ja-JP":"会議は5分後に終了します"}}
Error Responses
| Error Code | HTTP Status | Description | Recommended Action |
|---|---|---|---|
| `broadcast_not_enabled` | 400 | Not broadcast mode | Confirm type: "broadcast" |
| `invalid_parameter` | 400 | Message is empty | Provide a valid message parameter |
Voice Translation - set_standby_message (Set the Standby Phase Message)
Description
During the broadcast standby phase (standby), dynamically sets the message shown to viewers. This allows the host to enter standby mode and then set the waiting message, rather than being required to provide it at start.
The message is automatically translated into all translation languages, and the SSE event viewers receive includes a translations field.
Use Cases
- After entering standby mode, dynamically set the waiting message shown to viewers
- Update the text on the standby screen before going live
- Reduce the required fields before starting the broadcast
Request Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `action` | string | Yes | Fixed value set_standby_message |
| `message` | string | Yes | The text displayed during the standby phase (translated for viewers of each language via the existing translation pipeline) |
Request Example
{
"type": "voice-translation",
"data": {
"action": "set_standby_message",
"message": "The talk is about to begin, please wait..."
}
}
Success Response
{
"type": "voice-translation",
"data": {
"action": "status",
"message": "Standby phase text updated"
}
}
Event Viewers Receive
After a successful setting, all viewers in the standby phase receive an updated standby event via SSE:
event: standby
data: {"message":"The talk is about to begin, please wait...","translations":{"en-US":"The presentation is about to begin, please wait...","ja-JP":"プレゼンテーションがまもなく始まります。お待ちください..."}}
Note: The `translations` field contains the translation results for all translation languages. The frontend can display the corresponding translation based on the language the viewer selects.
Error Responses
| Error Code | HTTP Status | Description | Recommended Action |
|---|---|---|---|
| `broadcast_not_enabled` | 400 | Not broadcast mode | Confirm type: "broadcast" |
| `broadcast_not_in_standby` | 400 | Not in the standby phase | Can be used only during the standby phase |
Note: This action can be used only during the standby phase (standby). If the broadcast has already entered the live phase (live), an error is returned.
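On the viewer side, the standby and announcement events arrive as SSE blocks with an `event:` line and a `data:` line carrying JSON. The Python sketch below parses one such block and picks the text for the viewer's language. Both function names are hypothetical, and falling back to the untranslated `message` when a language is missing is an assumption of this sketch.

```python
import json


def parse_sse_event(block):
    """Parse one SSE block into (event_name, payload_dict)."""
    name = "message"  # SSE default when no `event:` line is present
    data_lines = []
    for line in block.splitlines():
        if line.startswith("event:"):
            name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # SSE strips a single leading space after the colon.
            data_lines.append(value[1:] if value.startswith(" ") else value)
    return name, json.loads("\n".join(data_lines))


def text_for_viewer(payload, language):
    """Return the translation for `language`, else the original message."""
    return payload.get("translations", {}).get(language, payload["message"])
```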
Response Events
The following are all the response events you may receive over the WebSocket.
session_started - Session Started Successfully
After a start action succeeds, the server returns an event containing complete session initialization info. The frontend can distinguish the recording type via recording_type.
General recordings (transcribe / conversation / record):
{
"type": "voice-translation",
"data": {
"action": "session_started",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"recording_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
"recording_type": "transcribe",
"recognition_mode": "single",
"message": "Speech recognition started"
}
}
Broadcast mode (broadcast):
{
"type": "voice-translation",
"data": {
"action": "session_started",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"recording_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
"recording_type": "broadcast",
"recognition_mode": "multi_speaker",
"phase": "standby",
"viewer_count": 0,
"queue_count": 0,
"peak_viewers": 0,
"total_viewers": 0,
"message": "Speech recognition started"
}
}
| Field | Type | Description |
|---|---|---|
session_id | string | Session ID |
recording_id | string | Recording ID (can be used for subsequent API queries) |
recording_type | string | Recording type: transcribe, conversation, record, broadcast |
recognition_mode | string | Recognition mode: single, multi_speaker |
message | string | Status description message |
phase | string | Broadcast phase: standby or live (broadcast mode only) |
viewer_count | int | Current number of online viewers (broadcast mode only) |
queue_count | int | Number of viewers waiting in the queue (broadcast mode only) |
peak_viewers | int | Peak number of viewers for this broadcast (broadcast mode only) |
total_viewers | int | Total cumulative number of viewers who have connected (broadcast mode only) |
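One way to model the two shapes of session_started is to keep the common fields in one type and attach the broadcast-only fields when recording_type is broadcast. This is only a sketch; the SessionInfo and BroadcastInfo types and the parse_session_started function are hypothetical names, not part of the API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BroadcastInfo:
    phase: str
    viewer_count: int
    queue_count: int
    peak_viewers: int
    total_viewers: int


@dataclass
class SessionInfo:
    session_id: str
    recording_id: str
    recording_type: str
    recognition_mode: str
    broadcast: Optional[BroadcastInfo] = None  # set only in broadcast mode


def parse_session_started(data: dict) -> SessionInfo:
    """Build a SessionInfo from the data object of a session_started event."""
    broadcast = None
    if data["recording_type"] == "broadcast":
        broadcast = BroadcastInfo(
            phase=data["phase"],
            viewer_count=data["viewer_count"],
            queue_count=data["queue_count"],
            peak_viewers=data["peak_viewers"],
            total_viewers=data["total_viewers"],
        )
    return SessionInfo(
        session_id=data["session_id"],
        recording_id=data["recording_id"],
        recording_type=data["recording_type"],
        recognition_mode=data["recognition_mode"],
        broadcast=broadcast,
    )


general = parse_session_started({
    "action": "session_started",
    "session_id": "550e8400-e29b-41d4-a716-446655440000",
    "recording_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
    "recording_type": "transcribe",
    "recognition_mode": "single",
    "message": "Speech recognition started",
})
live = parse_session_started({
    "action": "session_started",
    "session_id": "550e8400-e29b-41d4-a716-446655440000",
    "recording_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
    "recording_type": "broadcast",
    "recognition_mode": "multi_speaker",
    "phase": "standby",
    "viewer_count": 0,
    "queue_count": 0,
    "peak_viewers": 0,
    "total_viewers": 0,
    "message": "Speech recognition started",
})
```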
result - Recognition/Translation Result
Speech recognition and translation results. A single result event may contain origin (recognition result) and/or translations (translation results).
origin (speech recognition result):
{
"type": "voice-translation",
"data": {
"action": "result",
"origin": {
"sid": 1,
"language": "zh-TW",
"text": "你好,很高興認識你",
"is_final": true,
"speaker_id": "0",
"detected_language": "zh-TW",
"start_time": "00:05"
}
}
}
| Field | Type | Description |
|---|---|---|
sid | int | Sentence number, starting from 1 |
language | string | Source language code. In two-way mode, this is the automatically detected language. |
text | string | The recognized text |
is_final | boolean | Whether it is the final result |
speaker_id | string | Speaker ID |
detected_language | string | The detected language. In two-way mode, this is determined automatically by the system. |
start_time | string | Sentence start time (mm:ss); not sent during the broadcast standby phase; after going live, counts from 00:00. |
translations (translation results):
{
"type": "voice-translation",
"data": {
"action": "result",
"translations": {
"en-US": {
"sid": 1,
"text": "Hello, nice to meet you",
"is_final": true
}
}
}
}
Translation results are keyed by language code, and each language's translation object contains:
| Field | Type | Description |
|---|---|---|
sid | int | Sentence number |
text | string | The translated text |
is_final | boolean | Whether it is the final result |
is_retranslation | boolean | Whether it is a retranslation result (only during retranslate) |
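Because origin and translations may arrive in separate result events, a client typically merges them by sid. The sketch below shows one possible approach; the Transcript class is hypothetical, and it assumes that a later result with the same sid replaces the earlier text (for example, an interim result followed by the final one).

```python
class Transcript:
    """Keeps the latest text and translations for each sentence number (sid)."""

    def __init__(self) -> None:
        self.sentences: dict[int, dict] = {}

    def _entry(self, sid: int) -> dict:
        return self.sentences.setdefault(
            sid, {"text": "", "is_final": False, "speaker_id": None, "translations": {}}
        )

    def apply_result(self, data: dict) -> None:
        """Merge the data object of one result event into the transcript."""
        origin = data.get("origin")
        if origin is not None:
            entry = self._entry(origin["sid"])
            # Assumption: a newer result for the same sid overwrites the older text.
            entry["text"] = origin["text"]
            entry["is_final"] = origin["is_final"]
            entry["speaker_id"] = origin.get("speaker_id")
        for language, translation in data.get("translations", {}).items():
            entry = self._entry(translation["sid"])
            entry["translations"][language] = translation["text"]


transcript = Transcript()
transcript.apply_result({
    "action": "result",
    "origin": {"sid": 1, "language": "zh-TW", "text": "你好", "is_final": False, "speaker_id": "0"},
})
transcript.apply_result({
    "action": "result",
    "origin": {"sid": 1, "language": "zh-TW", "text": "你好,很高興認識你", "is_final": True, "speaker_id": "0"},
})
transcript.apply_result({
    "action": "result",
    "translations": {"en-US": {"sid": 1, "text": "Hello, nice to meet you", "is_final": True}},
})
```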
status - Generic Status Response
Used to confirm operations such as pause, resume, stop, and set_name.
{
"type": "voice-translation",
"data": {
"action": "status",
"message": "Speech recognition paused"
}
}
| Field | Type | Description |
|---|---|---|
message | string | Status description |
task_complete - Task Processing Complete
Triggered after stop when the audio file and transcript have been uploaded. task_id can be used to query task details via the REST API afterward.
{
"type": "voice-translation",
"data": {
"action": "task_complete",
"task_id": "550e8400-e29b-41d4-a716-446655440000",
"message": "Task processing complete"
}
}
| Field | Type | Description |
|---|---|---|
task_id | string | Recording UUID, can be used for subsequent API queries |
message | string | Status description |
config_updated - Settings Update Complete
Triggered after the config action succeeds.
{
"type": "voice-translation",
"data": {
"action": "config_updated",
"updated": ["terminology", "fuzzy_correction", "translation_dict"],
"message": "Settings updated"
}
}
| Field | Type | Description |
|---|---|---|
updated | array | The setting types that were updated (terminology, fuzzy_correction, translation_dict) |
message | string | Status message |
tts_ready - TTS Audio Ready
Sent when TTS speech synthesis completes. The event contains the audio data and Word Boundary information, which can be used for a karaoke-style highlighting effect.
{
"type": "voice-translation",
"data": {
"action": "tts_ready",
"sid": 1,
"language": "en-US",
"transcript": "你好,很高興認識你",
"text": "Hello, nice to meet you",
"audio": "Base64EncodedMP3...",
"format": "mp3",
"duration_ms": 2500,
"boundaries": [
{"offset_ms": 0, "duration_ms": 350, "text_offset": 0, "word_length": 5, "text": "Hello"},
{"offset_ms": 350, "duration_ms": 100, "text_offset": 5, "word_length": 1, "text": ","},
{"offset_ms": 500, "duration_ms": 250, "text_offset": 7, "word_length": 4, "text": "nice"},
{"offset_ms": 750, "duration_ms": 200, "text_offset": 12, "word_length": 2, "text": "to"},
{"offset_ms": 950, "duration_ms": 350, "text_offset": 15, "word_length": 4, "text": "meet"},
{"offset_ms": 1300, "duration_ms": 300, "text_offset": 20, "word_length": 3, "text": "you"}
]
}
}
| Field | Type | Description |
|---|---|---|
sid | int | Sentence number |
language | string | TTS language |
transcript | string | The original transcript (STT recognition result) |
text | string | The translated text (TTS synthesis source) |
audio | string | Base64-encoded MP3 audio |
format | string | Audio format (fixed value mp3) |
duration_ms | int | Total audio duration (milliseconds) |
boundaries | array | Array of Word Boundaries |
Word Boundary Field Descriptions
| Field | Type | Description |
|---|---|---|
offset_ms | int | The word's start time in the audio (milliseconds) |
duration_ms | int | The word's duration (milliseconds) |
text_offset | int | Position in the original string (character index) |
word_length | int | Word length (number of characters) |
text | string | The word content |
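To illustrate the karaoke effect, the sketch below decodes the Base64 audio and looks up which word is being spoken at a given playback position. The function names are hypothetical, and it assumes the boundaries array is sorted by offset_ms, as in the example above.

```python
import base64
from bisect import bisect_right
from typing import Optional


def decode_audio(data: dict) -> bytes:
    """Decode the Base64 audio field of a tts_ready event into raw MP3 bytes."""
    return base64.b64decode(data["audio"])


def active_boundary(boundaries: list[dict], position_ms: int) -> Optional[dict]:
    """Return the boundary spoken at position_ms, or None between words.

    Assumption: boundaries are sorted by offset_ms.
    """
    starts = [b["offset_ms"] for b in boundaries]
    index = bisect_right(starts, position_ms) - 1
    if index < 0:
        return None
    candidate = boundaries[index]
    if position_ms < candidate["offset_ms"] + candidate["duration_ms"]:
        return candidate
    return None


text = "Hello, nice to meet you"
boundaries = [
    {"offset_ms": 0, "duration_ms": 350, "text_offset": 0, "word_length": 5, "text": "Hello"},
    {"offset_ms": 350, "duration_ms": 100, "text_offset": 5, "word_length": 1, "text": ","},
    {"offset_ms": 500, "duration_ms": 250, "text_offset": 7, "word_length": 4, "text": "nice"},
    {"offset_ms": 750, "duration_ms": 200, "text_offset": 12, "word_length": 2, "text": "to"},
]

current = active_boundary(boundaries, 600)
# The highlighted span in the translated text is text_offset .. text_offset + word_length.
highlighted = text[current["text_offset"]:current["text_offset"] + current["word_length"]]
```

A player would call active_boundary on each playback time update and highlight the returned span in the text field.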
tts_error - TTS Synthesis Failed
TTS synthesis failure event.
{
"type": "voice-translation",
"data": {
"action": "tts_error",
"sid": 1,
"language": "en-US",
"error": "translation_not_found",
"message": "No translation available for language: en-US"
}
}
| Field | Type | Description |
|---|---|---|
sid | int | Sentence number |
language | string | TTS language |
error | string | Error code |
message | string | Error message |
TTS Error Codes
| Error Code | Description |
|---|---|
translation_not_found | No translation found for that language |
tts_synthesis_failed | TTS synthesis failed |
tts_quota_exceeded | TTS usage has reached the limit |
viewer_count - Viewer Count Update
Broadcast mode only
During a broadcast, the system checks the viewer count every 3 seconds and pushes this event to the host if it changes.
{
"type": "voice-translation",
"data": {
"action": "viewer_count",
"viewer_count": 45,
"queue_count": 8,
"peak_viewers": 50,
"total_viewers": 123
}
}
| Field | Type | Description |
|---|---|---|
viewer_count | int | Current number of online viewers |
queue_count | int | Number of viewers waiting in the queue |
peak_viewers | int | Peak number of viewers for this broadcast |
total_viewers | int | Total cumulative number of viewers who have connected |
Note: This event is pushed only when the viewer count or queue count changes, to avoid unnecessary message traffic.
viewer_joined - Viewer Joined
Broadcast mode only
When a viewer joins the broadcast, the host receives this event.
{
"type": "voice-translation",
"data": {
"action": "viewer_joined",
"viewer_count": 5,
"queue_count": 2
}
}
| Field | Type | Description |
|---|---|---|
viewer_count | int | Current number of viewers |
queue_count | int | Number waiting in the queue |
viewer_left - Viewer Left
Broadcast mode only
When a viewer leaves the broadcast, the host receives this event.
{
"type": "voice-translation",
"data": {
"action": "viewer_left",
"viewer_count": 4,
"queue_count": 1
}
}
| Field | Type | Description |
|---|---|---|
viewer_count | int | Current number of viewers |
queue_count | int | Number waiting in the queue |
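The three viewer events (viewer_count, viewer_joined, viewer_left) carry overlapping fields, so a host UI can fold them into a single state object. The sketch below is one possible approach; the ViewerStats class is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass
class ViewerStats:
    viewer_count: int = 0
    queue_count: int = 0
    peak_viewers: int = 0
    total_viewers: int = 0

    def apply(self, data: dict) -> None:
        """Update the stats from the data object of any viewer event.

        viewer_joined and viewer_left carry only viewer_count and queue_count;
        viewer_count events carry all four fields.
        """
        for name in ("viewer_count", "queue_count", "peak_viewers", "total_viewers"):
            if name in data:
                setattr(self, name, data[name])


stats = ViewerStats()
stats.apply({"action": "viewer_count", "viewer_count": 45, "queue_count": 8,
             "peak_viewers": 50, "total_viewers": 123})
stats.apply({"action": "viewer_left", "viewer_count": 44, "queue_count": 8})
```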
broadcast_phase_changed - Broadcast Phase Changed
Triggered when the broadcast phase switches from standby to live.
{
"type": "voice-translation",
"data": {
"action": "broadcast_phase_changed",
"phase": "live",
"message": "Broadcast started"
}
}
| Field | Type | Description |
|---|---|---|
phase | string | The new phase: standby or live |
message | string | Status description message |
speaker_renamed - Speaker Renamed
Sent when a global speaker rename completes.
{
"type": "voice-translation",
"data": {
"action": "speaker_renamed",
"speaker_id": "Guest-1",
"new_label": "Manager Wang",
"affected_sids": [1, 3, 5, 8]
}
}
| Field | Type | Description |
|---|---|---|
speaker_id | string | The resolved original speaker ID (even if the input was a display label, the event returns the original ID) |
new_label | string | The new display label |
affected_sids | array | The list of affected sentence numbers |
speaker_reassigned - Speaker Identity Changed
Sent when the speaker of a single sentence has been changed.
{
"type": "voice-translation",
"data": {
"action": "speaker_reassigned",
"sid": 5,
"old_speaker_id": "Guest-1",
"new_speaker_id": "Guest-2",
"new_speaker_label": "Lee Hsiao-hua"
}
}
| Field | Type | Description |
|---|---|---|
sid | int | The changed sentence number |
old_speaker_id | string | The original speaker ID |
new_speaker_id | string | The original ID of the new speaker |
new_speaker_label | string | The new speaker display label (after applying speaker_aliases; equals new_speaker_id when no alias exists) |
speakers_merged - Speakers Merged
Sent when a speaker merge completes. After the merge, future recognition results from the source speaker are automatically attributed to the target speaker.
{
"type": "voice-translation",
"data": {
"action": "speakers_merged",
"source_speaker_id": "Guest-2",
"target_speaker_id": "Guest-1",
"target_speaker_label": "Manager Wang",
"affected_sids": [3, 5, 7]
}
}
| Field | Type | Description |
|---|---|---|
source_speaker_id | string | The original ID of the merged speaker |
target_speaker_id | string | The original ID of the merge target |
target_speaker_label | string | The target speaker display label (after applying speaker_aliases; equals the original ID when no alias exists) |
affected_sids | array | The list of affected sentence numbers |
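The three speaker events (speaker_renamed, speaker_reassigned, speakers_merged) can be applied to a locally cached transcript as sketched below. The sentences structure (a dict keyed by sid with speaker_id and speaker_label entries) and the function name are assumptions made for illustration.

```python
def apply_speaker_event(sentences: dict[int, dict], data: dict) -> None:
    """Apply a speaker event to a local transcript keyed by sid."""
    action = data["action"]
    if action == "speaker_renamed":
        # Only the display label changes; the original speaker ID is kept.
        for sid in data["affected_sids"]:
            if sid in sentences:
                sentences[sid]["speaker_label"] = data["new_label"]
    elif action == "speaker_reassigned":
        sentence = sentences.get(data["sid"])
        if sentence is not None:
            sentence["speaker_id"] = data["new_speaker_id"]
            sentence["speaker_label"] = data["new_speaker_label"]
    elif action == "speakers_merged":
        for sid in data["affected_sids"]:
            if sid in sentences:
                sentences[sid]["speaker_id"] = data["target_speaker_id"]
                sentences[sid]["speaker_label"] = data["target_speaker_label"]


sentences = {
    1: {"speaker_id": "Guest-1", "speaker_label": "Guest-1"},
    3: {"speaker_id": "Guest-2", "speaker_label": "Guest-2"},
    5: {"speaker_id": "Guest-1", "speaker_label": "Guest-1"},
}
apply_speaker_event(sentences, {
    "action": "speaker_renamed", "speaker_id": "Guest-1",
    "new_label": "Manager Wang", "affected_sids": [1, 5],
})
apply_speaker_event(sentences, {
    "action": "speakers_merged", "source_speaker_id": "Guest-2",
    "target_speaker_id": "Guest-1", "target_speaker_label": "Manager Wang",
    "affected_sids": [3],
})
```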
language_switch_start - Language Switch Started
Language switch start event, sent after the switch_language action is triggered.
{
"type": "voice-translation",
"data": {
"action": "language_switch_start",
"translation_language": "ja-JP",
"total_segments": 15,
"message": "Starting language switch and retranslation"
}
}
| Field | Type | Description |
|---|---|---|
translation_language | string | The new translation target language |
total_segments | int | The number of sentences that need retranslation |
message | string | Status description |
batch_retranslation - Batch Retranslation Result
Batch retranslation result event, sent sentence by sentence during the language switch process.
{
"type": "voice-translation",
"data": {
"action": "batch_retranslation",
"sid": 3,
"translations": {
"ja-JP": {
"sid": 3,
"text": "今日はプロジェクトの進捗について話し合いましょう",
"is_final": true,
"is_retranslation": true
}
}
}
}
| Field | Type | Description |
|---|---|---|
sid | int | Sentence number |
translations | object | Translation results (same format as result's translations) |
language_switch_done - Language Switch Complete
Language switch completion event.
{
"type": "voice-translation",
"data": {
"action": "language_switch_done",
"translation_language": "ja-JP",
"success_count": 15,
"failed_count": 0,
"message": "Language switch complete"
}
}
| Field | Type | Description |
|---|---|---|
translation_language | string | The translation target language |
success_count | int | The number of successfully translated sentences |
failed_count | int | The number of sentences that failed to translate |
message | string | Status description |
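The language switch flow (language_switch_start, then one batch_retranslation per sentence, then language_switch_done) lends itself to a simple progress tracker. The sketch below is hypothetical; the class name and the way progress is computed are assumptions, not part of the API.

```python
from typing import Optional


class LanguageSwitchProgress:
    """Tracks retranslation progress across the three language switch events."""

    def __init__(self) -> None:
        self.language: Optional[str] = None
        self.total = 0
        self.received = 0
        self.failed = 0
        self.done = False

    def handle(self, data: dict) -> None:
        action = data["action"]
        if action == "language_switch_start":
            self.language = data["translation_language"]
            self.total = data["total_segments"]
            self.received = 0
            self.failed = 0
            self.done = False
        elif action == "batch_retranslation":
            self.received += 1
        elif action == "language_switch_done":
            self.failed = data["failed_count"]
            self.done = True

    @property
    def fraction(self) -> float:
        """Progress between 0.0 and 1.0."""
        if self.done:
            return 1.0
        if self.total == 0:
            return 0.0
        return min(self.received / self.total, 1.0)


progress = LanguageSwitchProgress()
progress.handle({"action": "language_switch_start", "translation_language": "ja-JP",
                 "total_segments": 4, "message": "Starting language switch and retranslation"})
progress.handle({"action": "batch_retranslation", "sid": 1, "translations": {}})
halfway_after_one = progress.fraction
progress.handle({"action": "language_switch_done", "translation_language": "ja-JP",
                 "success_count": 4, "failed_count": 0, "message": "Language switch complete"})
```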
tts_mode_changed - TTS Mode Changed
TTS playback mode change event.
{
"type": "voice-translation",
"data": {
"action": "tts_mode_changed",
"tts_mode": "async"
}
}
| Field | Type | Description |
|---|---|---|
tts_mode | string | The new mode: sync or async |
language_switched - Two-Way Language Switch Complete
Two-way mode (conversation) language switch completion event. Triggered after switch_language successfully switches the STT source language in two-way mode.
{
"type": "voice-translation",
"data": {
"action": "language_switched",
"language": "en-US",
"translation_language": "zh-TW",
"message": "Language switched"
}
}
| Field | Type | Description |
|---|---|---|
language | string | The new active language (STT source) |
translation_language | string | The new translation target language |
message | string | Status message |
tts_updated - Two-Way TTS Settings Updated
Two-way mode (conversation) TTS settings update event. Triggered after set_tts successfully updates the TTS toggle or voice settings.
{
"type": "voice-translation",
"data": {
"action": "tts_updated",
"tts_enabled": true,
"tts_config": {
"zh-TW": { "voice": "zh-TW-HsiaoChenNeural", "speaking_rate": 1.0 },
"en-US": { "voice": "en-US-GuyNeural", "speaking_rate": 1.2 }
}
}
}
| Field | Type | Description |
|---|---|---|
tts_enabled | boolean | Whether TTS is enabled |
tts_config | object | The TTS settings for each language (voice, speaking_rate) |
conversation_mode_changed - Conversation Mode Changed
Two-way mode (conversation) conversation mode change event. Triggered after switch_conversation_mode successfully switches between auto/manual mode.
{
"type": "voice-translation",
"data": {
"action": "conversation_mode_changed",
"conversation_mode": "manual"
}
}
| Field | Type | Description |
|---|---|---|
conversation_mode | string | The new conversation mode: auto or manual |
speaker_language_changed - User Language Changed
Two-way mode (conversation) user language change event. Triggered after set_speaker_language successfully changes a user's language. The event includes the complete language mapping after the change.
{
"type": "voice-translation",
"data": {
"action": "speaker_language_changed",
"speaker_language_map": {
"1": "ja-JP",
"2": "en-US"
}
}
}
| Field | Type | Description |
|---|---|---|
speaker_language_map | object | The user language mapping after the change (keys are user number strings) |
segment_uploaded - Audio Segment Upload Complete
Audio segment upload completion event. Triggered each time an audio segment is successfully uploaded to cloud storage; can be used to show upload progress on the frontend.
{
"type": "voice-translation",
"data": {
"action": "segment_uploaded",
"segment_index": 0,
"duration_sec": 30.5
}
}
| Field | Type | Description |
|---|---|---|
segment_index | int | Segment index (starting from 0) |
duration_sec | number | The duration of this segment (seconds) |
stt_event - STT Connection Status Event
STT connection status event. Triggered when the connection status of the speech recognition service changes; can be used to show the STT service status on the frontend.
{
"type": "voice-translation",
"data": {
"action": "stt_event",
"event": "connected",
"message": "STT service connected"
}
}
| Field | Type | Description |
|---|---|---|
event | string | Event type: connected / disconnected / error |
message | string | Event description message |
error - Error Event
Triggered when an operation fails or a system anomaly occurs.
{
"type": "error",
"data": {
"error_code": "session_not_started",
"severity": "error",
"message": "Session not started",
"context": "voice-translation",
"request_id": "req_abc123xyz789",
"timestamp": "2026-01-15T10:30:45.123Z"
}
}
| Field | Type | Description |
|---|---|---|
error_code | string | Error code (for programmatic handling) |
severity | string | Severity: fatal / error / warning |
message | string | Human-readable error message |
context | string | Error source category |
request_id | string | Request tracking ID |
timestamp | string | Time the error occurred (ISO 8601) |
Severity Descriptions
| severity | Description | Recommended Action |
|---|---|---|
fatal | Fatal error | Stop the service and require reconnection |
error | Operation failed | Show an error notice and allow retry |
warning | Warning | Show a warning without blocking the operation |
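The recommended actions in the severity table can be encoded as a small lookup so that UI code does not branch on raw strings. This is a sketch under assumptions: the ErrorPlan enum and plan_for_error function are hypothetical, and treating an unknown or missing severity like error is a choice made here, not documented behavior.

```python
from enum import Enum


class ErrorPlan(Enum):
    RECONNECT = "reconnect"                # fatal: stop and require reconnection
    NOTIFY_AND_RETRY = "notify_and_retry"  # error: show a notice and allow retry
    WARN_ONLY = "warn_only"                # warning: show a warning, do not block


_SEVERITY_PLAN = {
    "fatal": ErrorPlan.RECONNECT,
    "error": ErrorPlan.NOTIFY_AND_RETRY,
    "warning": ErrorPlan.WARN_ONLY,
}


def plan_for_error(data: dict) -> ErrorPlan:
    """Map the data object of an error event to a client-side reaction.

    Assumption: unknown or missing severities are handled like "error".
    """
    return _SEVERITY_PLAN.get(data.get("severity"), ErrorPlan.NOTIFY_AND_RETRY)


plan = plan_for_error({
    "error_code": "session_not_started",
    "severity": "error",
    "message": "Session not started",
})
```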
For the full list of error codes, refer to Error Code Reference.
Version: V1.5.7 Last Updated: 2026-05-20