API Docs

WebSocket API

Note: This is a consolidated document. For detailed specifications, refer to the individual documents under reference/websocket/.

Note: The URL used in this document (vas-poc.vurbo.ai) is the planned deployment address. A separate notice will be issued after the official launch.

Connection Info
Authentication
Message Format
Health - Heartbeat Service
Voice Translation - start
Voice Translation - config
Voice Translation - audio
Voice Translation - pause
Voice Translation - resume
Voice Translation - stop
Voice Translation - retranslate
Voice Translation - switch_language
Voice Translation - set_name
Voice Translation - rename_speaker
Voice Translation - reassign_speaker
Voice Translation - merge_speakers
Voice Translation - tts_play
Voice Translation - tts_stop
Voice Translation - tts_mode
Voice Translation - set_tts
Voice Translation - start_speaking
Voice Translation - stop_speaking
Voice Translation - switch_conversation_mode
Voice Translation - set_speaker_language
Voice Translation - broadcast_go_live
Voice Translation - broadcast_announcement
Voice Translation - set_standby_message
Response Events

Connection Info

Item	Value
Endpoint	`wss://vas-poc.vurbo.ai/ws`
Protocol	WebSocket
Data Format	JSON
Auth Method	Ticket (see below)

Authentication

The VAS WebSocket authenticates with a Ticket mechanism: the client passes a one-time Ticket in the Sec-WebSocket-Protocol header. For details, refer to Authentication.

Step 1: Obtain a Ticket

Exchange your API Key for a one-time Ticket via the REST API:

POST /api/v1/auth/ticket
X-API-Key: vas_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Response:

{
  "ticket": "aBcDeFgHiJkLmNoPqRsTuVwXyZ012345",
  "expires_in": 60
}

Field	Type	Description
`ticket`	string	One-time Ticket (32 chars)
`expires_in`	int	Validity period (seconds)
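The ticket exchange above could be sketched in JavaScript as follows. This is a minimal sketch, not the official client: it assumes the REST API is served from the same host as the WebSocket endpoint (this document does not state the REST base URL) and that a global fetch is available (browsers, Node.js 18+). The names REST_BASE_URL, buildTicketRequest, and fetchTicket are illustrative.

```javascript
// Assumption: the REST base URL is not stated in this document; the
// WebSocket host is reused here purely for illustration.
const REST_BASE_URL = 'https://vas-poc.vurbo.ai';

// Build the request for POST /api/v1/auth/ticket without sending it.
function buildTicketRequest(apiKey, baseUrl = REST_BASE_URL) {
  return {
    url: `${baseUrl}/api/v1/auth/ticket`,
    init: { method: 'POST', headers: { 'X-API-Key': apiKey } },
  };
}

// Exchange the API Key for a one-time Ticket.
// fetchImpl is injectable so the function can be exercised without a network.
async function fetchTicket(apiKey, fetchImpl = fetch) {
  const { url, init } = buildTicketRequest(apiKey);
  const res = await fetchImpl(url, init);
  if (!res.ok) throw new Error(`Ticket request failed: HTTP ${res.status}`);
  const { ticket, expires_in } = await res.json();
  return { ticket, expiresIn: expires_in };
}
```

Because the Ticket is short-lived, it is probably simplest to call fetchTicket immediately before opening the WebSocket rather than caching the result.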

Step 2: Connect to the WebSocket using the Ticket

Place the Ticket into Sec-WebSocket-Protocol in the format ticket.{TICKET_VALUE}:

// Native browser support
const ws = new WebSocket('wss://vas-poc.vurbo.ai/ws', [`ticket.${ticket}`]);

ws.onopen = () => {
  console.log('Connected! Protocol:', ws.protocol);
  // Start using the WebSocket...
};

ws.onerror = (error) => {
  console.error('Connection failed:', error);
};

Node.js example:

const WebSocket = require('ws');

// `ticket` is the one-time Ticket obtained in Step 1
const ws = new WebSocket('wss://vas-poc.vurbo.ai/ws', [`ticket.${ticket}`]);

Ticket Characteristics

Characteristic	Description
Validity period	60 seconds
Usage count	Can be used only once (deleted immediately after)
Security	The API Key is never exposed in the WebSocket connection
Replay protection	Uses an atomic operation to guarantee single use
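Since a Ticket is valid for a limited time and can be used only once, a client may want to check its age before connecting. The helpers below are a hedged sketch: ticketProtocol and isTicketUsable are illustrative names, and the safety margin is an assumption rather than part of the API.

```javascript
// Format the one-time Ticket as a Sec-WebSocket-Protocol value.
function ticketProtocol(ticket) {
  return `ticket.${ticket}`;
}

// Returns true while the Ticket is still inside its validity window.
// issuedAtMs is the local time at which the Ticket response arrived;
// safetyMarginMs is an illustrative buffer, not part of the API.
function isTicketUsable(issuedAtMs, expiresInSec, nowMs = Date.now(), safetyMarginMs = 5000) {
  return nowMs + safetyMarginMs < issuedAtMs + expiresInSec * 1000;
}
```

A Ticket that fails this check, or that has already been used for one connection attempt, should be discarded and a new one requested.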

Ticket Error Codes

Error Code	HTTP Status	Description
`ticket_invalid`	401	Ticket invalid or expired
`ticket_expired`	401	Ticket expired
`ticket_already_used`	401	Ticket already used
`ticket_validation_failed`	500	Ticket validation failed

For the full API specification, refer to Auth Ticket API.
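One way a client might react to the ticket error codes is sketched below. The reactions are assumptions inferred from the descriptions in the table (the document does not prescribe them), and actionForTicketError is an illustrative name.

```javascript
// Suggested client reactions. These are assumptions, not documented behavior.
const TICKET_ERROR_ACTIONS = new Map([
  ['ticket_invalid', 'request_new_ticket'],
  ['ticket_expired', 'request_new_ticket'],
  ['ticket_already_used', 'request_new_ticket'],
  ['ticket_validation_failed', 'retry_later'], // HTTP 500: server-side failure
]);

// Map a ticket error code to a reaction; unknown codes are reported.
function actionForTicketError(errorCode) {
  return TICKET_ERROR_ACTIONS.get(errorCode) ?? 'report';
}
```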

Message Format

All messages use a unified nested structure:

{
  "type": "service type",
  "data": { ... }
}

Service Types

type	Description
`health`	Heartbeat mechanism
`voice-translation`	Voice translation service
`error`	Error message

Error Message Format

When an error occurs, the server returns a message with type: "error":

{
  "type": "error",
  "data": {
    "error_code": "auth_invalid_api_key",
    "severity": "fatal",
    "message": "Invalid API key",
    "context": "auth",
    "request_id": "req_abc123xyz789",
    "timestamp": "2026-01-15T10:30:45.123Z"
  }
}

Sentence-level errors (such as a translation failure for one language of a sentence) additionally carry sid and details:

{
  "type": "error",
  "data": {
    "error_code": "llm_content_filtered",
    "severity": "warning",
    "message": "Content filtered",
    "context": "translation",
    "sid": 5,
    "request_id": "req_abc123xyz789",
    "timestamp": "2026-01-15T10:30:45.123Z",
    "details": {
      "provider": "azure_openai",
      "translation_language": "ja"
    }
  }
}

Session-level translation service errors (escalated after consecutive failures reach a threshold) do not carry sid. The frontend should display a global notice but does not need to disconnect:

{
  "type": "error",
  "data": {
    "error_code": "translation_service_unavailable",
    "severity": "error",
    "message": "Translation service unavailable",
    "context": "translation",
    "request_id": "req_abc123xyz789",
    "timestamp": "2026-01-15T10:30:45.123Z",
    "details": {
      "provider": "azure_openai",
      "last_error_code": "llm_provider_error",
      "fail_count": 5
    }
  }
}

For the full trigger rules (consecutive failure threshold, error code classification), refer to the translation_service_unavailable section in Error Code Reference.
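The difference between sentence-level and session-level errors can be detected from the presence of sid. The function below is a minimal sketch; describeError and the scope labels it returns are illustrative, not part of the API.

```javascript
// Decide how an error should be surfaced, based on severity and sid.
// The scope labels ('fatal', 'sentence', 'global') are illustrative.
function describeError(data) {
  if (data.severity === 'fatal') return { scope: 'fatal' };
  // Sentence-level errors carry the sentence number in sid.
  if (Number.isInteger(data.sid)) return { scope: 'sentence', sid: data.sid };
  // Errors without sid apply to the whole session: show a global notice.
  return { scope: 'global' };
}
```

With the examples above, llm_content_filtered yields { scope: 'sentence', sid: 5 }, and translation_service_unavailable yields { scope: 'global' }.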

Single-Message Error (per-message panic recovered)

When the server encounters an internal error (panic) while handling a single WebSocket message (such as set_name, switch_language, or tts_play), it returns internal_error. This error indicates only that the specific message failed to process; the connection is not terminated. The frontend should keep the connection open and may retry the operation:

{
  "type": "error",
  "data": {
    "error_code": "internal_error",
    "severity": "error",
    "message": "Internal server error",
    "context": "general",
    "request_id": "req_abc123xyz789",
    "timestamp": "2026-05-08T10:30:45.123Z",
    "details": {
      "message_type": "voice-translation",
      "action": "set_name"
    }
  }
}

`details` Fields

Field	Type	Description
`message_type`	string	Service type: `voice-translation` / `health`
`action`	string	(Optional) The specific operation that failed, such as `set_name`, `switch_language`, `tts_play`, `tts_mode`, `retranslate`, `config`, `speaker.rename`, etc. This field is absent when the message payload has no `action` field (such as a plain init message).

What the Frontend Should Do

Keep the WebSocket connection open: Do not call ws.close(), navigate away, or return to a history page because of this error. The recording is still in progress.

Decide on follow-up handling based on details.action:

Scenario	Recommended Action
Idempotent operations such as `set_name` / `switch_language` / `tts_mode` / `config`	Simply resend the same message. These operations use a "last write wins" approach, so retrying has no side effects.
`tts_play` / `tts_stop` / `retranslate`	Usually safe to retry directly. If the user is waiting for TTS playback, consider showing a transient toast indicating the retry is in progress.
`speaker.rename` / `speaker.merge`	Before retrying, use the REST API (speakers) to confirm the current DB state and avoid duplicate operations (for example, the rename already succeeded and only the response frame failed).
`details.action` is absent	The server panicked after parsing the message payload and cannot infer the specific operation. The frontend can infer it from "the most recent message the user sent," or display a generic error message such as "Operation failed, please retry."
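The table above can be expressed as a lookup on details.action. This is a sketch: followUpFor and the returned labels are illustrative, and actions not listed in the table fall back to the generic error path.

```javascript
const RESEND = new Set(['set_name', 'switch_language', 'tts_mode', 'config']);
const RETRY_WITH_NOTICE = new Set(['tts_play', 'tts_stop', 'retranslate']);
const VERIFY_FIRST = new Set(['speaker.rename', 'speaker.merge']);

// Choose a follow-up for an internal_error based on details.action.
function followUpFor(details) {
  const action = details?.action;
  if (action === undefined) return 'generic_error';
  if (RESEND.has(action)) return 'resend';
  if (RETRY_WITH_NOTICE.has(action)) return 'retry_with_toast';
  if (VERIFY_FIRST.has(action)) return 'verify_via_rest_then_retry';
  return 'generic_error'; // action not covered by the table
}
```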

User experience: Show a transient toast / inline error. Do not interrupt the user flow with a modal or a redirect.
Telemetry / reporting: Report request_id + details to your frontend error tracking (Sentry, Datadog, etc.) to make it easier to correlate with backend logs during troubleshooting.

What Will Not Happen (Guarantees)

The recording will not be interrupted: segment_uploaded, result, origin, and other messages keep arriving.
The connection will not be actively closed by the server.
The session state will not be reset (session_id stays the same).
State already written to the DB will not be rolled back (for example, if set_name was written to the DB successfully and only the response frame failed, the name still takes effect).

Client Handling Example

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type !== 'error') {
    handleNormalMessage(msg);
    return;
  }

  const { error_code, severity, request_id, details } = msg.data;

  // Single-message panic: keep the connection, decide whether to retry based on action
  if (error_code === 'internal_error') {
    console.warn('[ws] message handler panic recovered', {
      request_id,
      message_type: details?.message_type,
      action: details?.action,
    });
    showTransientToast(`Failed to process "${details?.action ?? 'operation'}", please retry`);
    // Note: do not call ws.close() and do not navigate away from the current page
    return;
  }

  // Handle other errors with your existing logic (only fatal errors require disconnecting)
  handleErrorBySeverity(severity, msg.data);
};

Error Message Fields

Field	Type	Description
`error_code`	string	Error code (for programmatic handling)
`severity`	string	Severity: `fatal` / `error` / `warning`
`message`	string	Human-readable error message
`context`	string	Error source category
`sid`	int	Optional. The sentence number for sentence-level errors (such as a translation failure); absent for non-sentence-level errors
`request_id`	string	Request tracking ID
`timestamp`	string	Time the error occurred (ISO 8601)
`details`	object	Optional. Error context; common keys: `provider`, `translation_language`, `source_lang`, etc.

For the full list of error codes, refer to Error Code Reference.

Health (Heartbeat Service)

Description

Used to confirm that the WebSocket connection is healthy. We recommend sending a ping every 30 seconds; if no pong is received, treat the connection as dropped and reconnect.

Use Cases

Maintaining a long-lived connection
Detecting connection status
Preventing connection timeouts

Request - Ping

{
  "type": "health",
  "data": {
    "action": "ping"
  }
}

Response - Pong

{
  "type": "health",
  "data": {
    "action": "pong"
  }
}
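A heartbeat loop based on the ping/pong messages above might look like the following sketch. The 30-second interval comes from the recommendation in this section; the pong deadline (pongTimeoutMs) is an assumption, since the document does not say how long to wait. createHeartbeat and its option names are illustrative.

```javascript
// send: function that transmits a string frame (for example ws.send.bind(ws)).
// onDead: called when no pong arrives in time; the caller should reconnect.
function createHeartbeat(send, onDead, opts = {}) {
  const {
    intervalMs = 30000,    // recommended ping interval
    pongTimeoutMs = 10000, // assumption: no pong deadline is documented
    timers = {             // injectable for testing
      setInterval: (fn, ms) => setInterval(fn, ms),
      clearInterval: (id) => clearInterval(id),
      setTimeout: (fn, ms) => setTimeout(fn, ms),
      clearTimeout: (id) => clearTimeout(id),
    },
  } = opts;
  let intervalId = null;
  let timeoutId = null;

  function stop() {
    timers.clearInterval(intervalId);
    timers.clearTimeout(timeoutId);
    intervalId = null;
    timeoutId = null;
  }

  function ping() {
    send(JSON.stringify({ type: 'health', data: { action: 'ping' } }));
    timers.clearTimeout(timeoutId);
    timeoutId = timers.setTimeout(() => { stop(); onDead(); }, pongTimeoutMs);
  }

  return {
    start() { stop(); intervalId = timers.setInterval(ping, intervalMs); },
    // Call with every parsed incoming message; a pong cancels the deadline.
    handleMessage(msg) {
      if (msg.type === 'health' && msg.data?.action === 'pong') {
        timers.clearTimeout(timeoutId);
        timeoutId = null;
      }
    },
    stop,
  };
}
```

A typical wiring would call start() in ws.onopen, pass each parsed message to handleMessage() in ws.onmessage, and call stop() in ws.onclose.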

Voice Translation - start (Start Voice Translation)

Description

Starts a new voice translation session and begins processing audio according to the configured parameters.

Use Cases

Starting a meeting record
Starting real-time translation
Starting a voice memo

Request Parameters

Parameter	Type	Required	Description
`action`	string	Yes	Fixed value `start`
`transcription_languages`	string[]	Yes	Speech recognition languages (array of language codes, up to 2)
`translation_languages`	string[]	No	Translation target languages (array of language codes; empty = no translation)
`realtime_translation`	boolean	No	Real-time translation mode (default `false`)
`recognition_mode`	string	No	Recognition mode: `single` (single speaker, default), `multi_speaker` (multiple speakers). Under `multi_speaker`, `transcription_languages` must contain exactly 1 language; otherwise the server returns a `diarization_multilang_conflict` error and refuses to start.
`type`	string	Yes	Recording type: `transcribe`, `conversation`, `record`, `broadcast`
`audio_format`	string	No	Audio format: `pcm` (default), `webm`
`summary_template`	string	Conditional	Summary template. Required for `transcribe` when `summary_mode=builtin`; forbidden when `summary_mode=custom`; optional for `conversation`/`broadcast`.
`options`	object	No	Speech recognition options
`tts_enabled`	boolean	No	Whether to enable TTS speech synthesis (default `false`)
`tts_language`	string	No	TTS output language (must be in `translation_languages`)
`tts_voice`	string	No	TTS voice name (such as `en-US-JennyNeural`)
`tts_mode`	string	No	TTS playback mode: `sync` (synchronous, default), `async` (asynchronous)
`broadcast_token`	string	Conditional	Broadcast token (required for the `broadcast` type, obtained from the REST API)
`active_language`	string	No	Initial active language for two-way mode (default `transcription_languages[0]`)
`speakers`	array	Conditional	User-to-language mapping for two-way mode (required in two-way mode, exactly 2 users)
`conversation_mode`	string	No	Two-way conversation mode: `auto` (auto-detect, default), `manual` (push-to-talk)
`speaker_diarization`	boolean	No	Speaker diarization (forcibly ignored in two-way mode)
`tts_config`	object	No	Multi-language TTS settings (applies to both broadcast mode and two-way mode)
`broadcast_phase`	string	No	Initial broadcast phase: `standby`, `live` (default)
`standby_message`	string	No	The message viewers see during the standby phase (default: "Getting ready, please wait...")
`name`	string	No	Initial default recording name (max 60 chars; the system may still override it; if not provided, auto-generated such as `Transcription #1`)
`summary_language`	string	No	Summary output language (defaults to the recognition language when unspecified; in broadcast mode, read automatically from the channel settings)
`summary_mode`	string	No	Summary mode enum: `builtin` (default) / `custom`. Inferred as `builtin` when omitted.
`summary_prompt`	string	Conditional	Required in custom mode; optional supplemental instructions in builtin mode. <= 2000 characters.
`summary_prompt_slug`	string	Conditional	Required in custom mode; forbidden in builtin mode. Your own identifier (<= 64 characters, Unicode, no control characters; passed through and stored in the backend record for historical lookup).
`summary_plain_text`	boolean	No	Request plain-text summary output (default `false`; when enabled, the backend performs Markdown post-processing).

Recording Type Descriptions

type	Description	Use Cases
`transcribe`	Speech-to-text	Meeting minutes, interview notes
`conversation`	Conversation record	Two-way communication, customer service conversations
`record`	Plain recording	Voice memos, quick notes
`broadcast`	Broadcast/live	Lectures, talks, live content

Request Example (Basic)

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "transcription_languages": ["zh-TW"],
    "translation_languages": ["en-US"],
    "realtime_translation": false,
    "type": "transcribe",
    "audio_format": "pcm",
    "summary_template": "meeting",
    "options": {
      "speaking_speed": "normal",
      "segmentation_mode": "auto",
      "profanity_handling": "mask"
    }
  }
}

Request Example (Initial Default Name)

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "transcription_languages": ["zh-TW"],
    "translation_languages": ["en-US"],
    "type": "transcribe",
    "audio_format": "pcm",
    "summary_template": "meeting",
    "name": "Product Planning Meeting"
  }
}

Recording Name Rules

Scenario	Name	name_source	System Override?
`start` with a `name` parameter	Initial default name	`default`	Yes
`start` without a `name`	Auto-generated (such as `Transcription #1`, `Broadcast #3`)	`default`	Yes
Set via `set_name`	The name explicitly set by the user	`user`	No
Auto-generated by the system after the session ends	A summary name generated from the transcript content	`llm`	—

Note: The name in start is the initial default name; the system may still override it when the session ends. If you need a fixed name, use set_name.

Default name format (fixed English):

Recording Type	Default Name Format
`transcribe`	`Transcription #N`
`conversation`	`Conversation #N`
`record`	`Recording #N`
`broadcast`	`Broadcast #N`

N is the sequential number for that user's recordings of the same type. Name priority: user > llm > default. Once the user sets a name, the system will not override it when the session ends.
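The default name format and the priority rule (user > llm > default) can be illustrated with a short sketch. The helper names defaultName and pickName, and the shape of the candidate objects, are assumptions made for illustration.

```javascript
const DEFAULT_NAME_PREFIX = {
  transcribe: 'Transcription',
  conversation: 'Conversation',
  record: 'Recording',
  broadcast: 'Broadcast',
};

// Format the fixed-English default name for a recording type.
function defaultName(type, n) {
  return `${DEFAULT_NAME_PREFIX[type]} #${n}`;
}

const NAME_PRIORITY = ['user', 'llm', 'default'];

// Pick the winning name among candidates of the form { name, name_source }.
function pickName(candidates) {
  for (const source of NAME_PRIORITY) {
    const hit = candidates.find((c) => c.name_source === source && c.name);
    if (hit) return hit;
  }
  return null;
}
```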

Request Example (With TTS)

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "transcription_languages": ["zh-TW"],
    "translation_languages": ["en-US"],
    "realtime_translation": true,
    "type": "transcribe",
    "tts_enabled": true,
    "tts_language": "en-US",
    "tts_voice": "en-US-JennyNeural",
    "tts_mode": "sync"
  }
}

Request Example (Two-Way Mode - Auto-Detect)

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "type": "conversation",
    "transcription_languages": ["zh-TW", "en-US"],
    "active_language": "zh-TW",
    "audio_format": "pcm",
    "realtime_translation": true,
    "speakers": [
      { "id": 1, "language": "zh-TW" },
      { "id": 2, "language": "en-US" }
    ],
    "tts_config": {
      "zh-TW": { "voice": "zh-TW-HsiaoChenNeural", "speaking_rate": 1.0 },
      "en-US": { "voice": "en-US-JennyNeural", "speaking_rate": 1.0 }
    }
  }
}

Request Example (Two-Way Mode - Manual Mode)

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "type": "conversation",
    "transcription_languages": ["zh-TW", "en-US"],
    "conversation_mode": "manual",
    "audio_format": "pcm",
    "realtime_translation": true,
    "speakers": [
      { "id": 1, "language": "zh-TW" },
      { "id": 2, "language": "en-US" }
    ],
    "tts_config": {
      "zh-TW": { "voice": "zh-TW-HsiaoChenNeural", "speaking_rate": 1.0 },
      "en-US": { "voice": "en-US-JennyNeural", "speaking_rate": 1.0 }
    }
  }
}

Request Example (Custom Summary Prompt - custom mode)

In summary_mode=custom, your summary_prompt content completely replaces the built-in template rules; the backend applies prompt injection protection. The summary_prompt_slug is metadata for your own identification (stored in the backend record) and is not included in the prompt content.
If you want to keep the built-in template and add your own supplemental instructions afterward, use summary_mode=builtin + summary_template=<slug> + summary_prompt=<supplemental instructions> instead (in builtin mode, summary_prompt is treated as supplemental and appended after the built-in template).

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "transcription_languages": ["zh-TW"],
    "translation_languages": ["en-US"],
    "type": "transcribe",
    "audio_format": "pcm",
    "summary_language": "zh-TW",
    "summary_mode": "custom",
    "summary_prompt": "You are a meeting-minutes assistant. List every amount and committed date discussed in bullet points, and note the responsible person for each.",
    "summary_prompt_slug": "client_x_finance_v3",
    "summary_plain_text": false
  }
}

Important — How to Retrieve the Summary Result: In WebSocket mode, summaries are non-streaming by design; final_content is not pushed back via a WebSocket event (the summary_done event only signals completion and does not contain the content). The client must retrieve it afterward over HTTP:

After receiving the summary_done event, call GET /api/v1/sse/history/transcribe/{taskId} to retrieve the summary. The init_summary event carries the summary as a top-level plain string, together with summary_mode, summary_template, summary_plain_text, summary_prompt_snapshot, and the two content-filter fallback audit fields added in v1.5.5: summary_fallback_level and summary_dropped_segments.
Or query the summary_mode / summary_template / summary_prompt_slug columns of the recordings table via the REST API.

v1.5.5 Content-Filter Automatic Downgrade: If your prompt or transcript content triggers the LLM service's content filter, the system automatically downgrades (standard mode → neutral mode → segment-omission mode). The summary_fallback_level field of the summary_done event (value 2 or 3; omitted when standard mode succeeds directly) tells the client which path was actually taken, so the frontend can display hints such as "neutral mode in use" or "N segments omitted." See the summary_done section of reference/websocket/events.md and the v1.5.5 changelog.
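A client could turn summary_done into a user-facing hint and a follow-up HTTP path as sketched below. Two assumptions are made: that level 2 corresponds to neutral mode and level 3 to segment-omission mode (following the order of the downgrade steps), and that summary_fallback_level sits directly on the event's data object. How the client obtains taskId is not covered here, so it is left as a parameter. summaryFallbackHint and historyPath are illustrative names.

```javascript
// Assumption: 2 = neutral mode, 3 = segment-omission mode, following the
// order in which the downgrade steps are listed.
function summaryFallbackHint(summaryDoneData) {
  switch (summaryDoneData.summary_fallback_level) {
    case undefined: return null; // standard mode succeeded directly
    case 2: return 'neutral mode in use';
    case 3: return 'segments omitted';
    default: return 'unknown fallback level';
  }
}

// Path of the HTTP endpoint that returns the summary after summary_done.
function historyPath(taskId) {
  return `/api/v1/sse/history/transcribe/${encodeURIComponent(taskId)}`;
}
```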

Two-Way Mode Special Rules:

Item	Description
`transcription_languages`	Must contain exactly 2 languages, and they cannot be the same.
`translation_languages`	Not required (automatically derived as the non-active language).
`active_language`	Optional, defaults to `transcription_languages[0]`.
`recognition_mode`	Forced to `single` (ignores `speaker_diarization`).
`tts_enabled`	Defaults to `true`; set to `false` to return text translations only.
`tts_config`	Optional; sets the TTS voice for each of the two languages; leave empty to use the default voices automatically.
`summary_template`	Optional; when provided, a summary is automatically generated after stopping.
`speakers`	Required in two-way mode; specifies each user's language (exactly 2 users).
`conversation_mode`	Optional; `auto` (auto-detect, default) or `manual` (push-to-talk).

speakers Field Descriptions:

Field	Type	Required	Description
`id`	int	Yes	User number (1 or 2)
`language`	string	Yes	The user's language code (must be in `transcription_languages`)

conversation_mode Descriptions:

Mode	Description
`auto` (default)	The system automatically detects the spoken language and segments sentences automatically.
`manual`	The user controls speaking periods via `start_speaking` / `stop_speaking`, during which the audio is merged into a single sentence.
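The two-way mode rules above can be folded into a small builder that derives speakers from the two languages. This is a sketch with an illustrative name (buildConversationStart); it assigns user 1 to the first language and user 2 to the second, which is one possible mapping rather than a requirement.

```javascript
// Build a start message for two-way mode from exactly two distinct languages.
function buildConversationStart({ languages, conversationMode = 'auto', activeLanguage }) {
  if (languages.length !== 2 || languages[0] === languages[1]) {
    throw new Error('Two-way mode needs exactly 2 different transcription languages');
  }
  if (conversationMode !== 'auto' && conversationMode !== 'manual') {
    throw new Error(`Unknown conversation_mode: ${conversationMode}`);
  }
  const active = activeLanguage ?? languages[0]; // documented default
  if (!languages.includes(active)) {
    throw new Error('active_language must be one of the transcription languages');
  }
  return {
    type: 'voice-translation',
    data: {
      action: 'start',
      type: 'conversation',
      transcription_languages: languages,
      active_language: active,
      conversation_mode: conversationMode,
      // One user per language; ids must be 1 and 2.
      speakers: languages.map((language, i) => ({ id: i + 1, language })),
    },
  };
}
```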

Broadcast Mode Description (type: "broadcast")

In broadcast mode, the language settings are automatically obtained from the broadcast channel settings and do not need to be sent in the WebSocket message.

Required parameters:

Parameter	Type	Description
`type`	string	Must be `"broadcast"`
`broadcast_token`	string	Broadcast token (obtained after creating the broadcast via the REST API)
`audio_format`	string	Audio format (`pcm` or `webm`)

Optional parameters (override the broadcast channel settings):

Parameter	Type	Description
`tts_config`	object	Multi-language TTS settings (overrides the settings from creation time)
`summary_template`	string	Summary template slug (overrides the settings from creation time; if not provided, the broadcast channel default is used)

Auto-configured parameters (can be omitted):

transcription_languages: read automatically from the broadcast settings
translation_languages: read automatically from the broadcast settings
realtime_translation: enabled by default in broadcast mode
summary_template: read automatically from the broadcast settings (the value passed via WebSocket takes precedence)
summary_language: read automatically from the broadcast settings (the value passed via WebSocket takes precedence)

Broadcast Phase Descriptions:

broadcast_phase	Description	Behavior
`live` (default)	Live phase	STT/translation results are broadcast to viewers and written to the transcript.
`standby`	Standby phase	STT/translation results go only to the host; viewers see the standby_message.

Standby phase purpose: Lets the host warm up STT/translation before going live, confirm that equipment is working, and then switch to the live phase.

Broadcast Mode Request Example:

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "type": "broadcast",
    "broadcast_token": "a3f9",
    "audio_format": "pcm"
  }
}

Broadcast Mode Request Example (Standby Phase + Override Summary Template):

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "type": "broadcast",
    "broadcast_token": "a3f9",
    "audio_format": "pcm",
    "broadcast_phase": "standby",
    "standby_message": "The talk is about to begin, please wait...",
    "summary_template": "lecture"
  }
}

Summary template priority: The value passed in the WebSocket start > the default set when the broadcast channel was created. If neither is set, no summary is automatically generated.

Broadcast Mode TTS Settings (tts_config):

Use the tts_config parameter to specify which translation languages should produce TTS audio for viewers.

tts_config Field	Type	Description
voice	string	TTS voice name
speaking_rate	number	Speaking rate (0.5–2.0, default 1.0)

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "type": "broadcast",
    "broadcast_token": "a3f9",
    "audio_format": "pcm",
    "tts_config": {
      "en-US": {
        "voice": "en-US-JennyNeural",
        "speaking_rate": 1.0
      },
      "ja-JP": {
        "voice": "ja-JP-NanamiNeural",
        "speaking_rate": 1.0
      }
    }
  }
}

Note:
TTS languages must be valid languages in translation_languages; invalid languages are automatically ignored.
The host (WebSocket) does not receive TTS audio; only SSE viewers receive the tts_ready event.
TTS is sent only during the live phase; nothing is sent during the standby phase.
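If the client already knows the channel's translation languages (for example from the REST API), it could pre-filter tts_config to match the rules above. The sketch below drops languages that the server would ignore and rejects out-of-range speaking rates; rejecting rather than clamping is a design choice of this sketch, and sanitizeTtsConfig is an illustrative name.

```javascript
// Keep only tts_config entries whose language is a translation language,
// and verify that speaking_rate is within the documented 0.5-2.0 range.
function sanitizeTtsConfig(ttsConfig, translationLanguages) {
  const allowed = new Set(translationLanguages);
  const result = {};
  for (const [lang, { voice, speaking_rate = 1.0 }] of Object.entries(ttsConfig)) {
    if (!allowed.has(lang)) continue; // the server would ignore this entry anyway
    if (speaking_rate < 0.5 || speaking_rate > 2.0) {
      throw new RangeError(`speaking_rate for ${lang} must be within 0.5-2.0`);
    }
    result[lang] = { voice, speaking_rate };
  }
  return result;
}
```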

TTS Playback Mode Descriptions

Mode	Description	Behavior
`sync`	Synchronous mode (default)	Automatically plays the latest `is_final=true` translated sentence; if the previous sentence is still playing, it enters the queue and waits.
`async`	Asynchronous mode (manual control)	The user can choose any translated sentence for TTS, controlled with the `tts_play` command.

Success Response

After a successful start, a session_started event is returned containing complete session initialization info.

General recordings (transcribe / conversation / record):

{
  "type": "voice-translation",
  "data": {
    "action": "session_started",
    "session_id": "550e8400-e29b-41d4-a716-446655440000",
    "recording_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
    "recording_type": "transcribe",
    "recognition_mode": "single",
    "message": "Speech recognition started"
  }
}

Broadcast mode (broadcast):

{
  "type": "voice-translation",
  "data": {
    "action": "session_started",
    "session_id": "550e8400-e29b-41d4-a716-446655440000",
    "recording_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
    "recording_type": "broadcast",
    "recognition_mode": "multi_speaker",
    "phase": "standby",
    "viewer_count": 0,
    "queue_count": 0,
    "peak_viewers": 0,
    "total_viewers": 0,
    "message": "Speech recognition started"
  }
}

Field	Type	Description
`session_id`	string	Session ID
`recording_id`	string	Recording ID (can be used for subsequent API queries)
`recording_type`	string	Recording type: `transcribe`, `conversation`, `record`, `broadcast`
`recognition_mode`	string	Recognition mode: `single`, `multi_speaker`
`phase`	string	Broadcast phase: `standby` or `live` (broadcast mode only)
`viewer_count`	int	Current number of online viewers (broadcast mode only)
`queue_count`	int	Number of viewers waiting in the queue (broadcast mode only)
`peak_viewers`	int	Peak number of viewers for this broadcast (broadcast mode only)
`total_viewers`	int	Total cumulative number of viewers who have connected (broadcast mode only)
`message`	string	Status description message
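A client might normalize session_started into a local session object, keeping the broadcast-only fields separate. This is a sketch; readSessionStarted and the camelCase property names are illustrative.

```javascript
// Extract session info from a parsed session_started message, or return
// null if the message is a different event.
function readSessionStarted(msg) {
  if (msg.type !== 'voice-translation' || msg.data?.action !== 'session_started') return null;
  const d = msg.data;
  const session = {
    sessionId: d.session_id,
    recordingId: d.recording_id,
    recordingType: d.recording_type,
    recognitionMode: d.recognition_mode,
  };
  // These fields are present in broadcast mode only.
  if (d.recording_type === 'broadcast') {
    session.broadcast = {
      phase: d.phase,
      viewerCount: d.viewer_count,
      queueCount: d.queue_count,
      peakViewers: d.peak_viewers,
      totalViewers: d.total_viewers,
    };
  }
  return session;
}
```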

Error Responses

Error Code	HTTP Status	Description	Recommended Action
`missing_transcription_languages`	400	No language parameter provided	Make sure the request includes `transcription_languages`
`invalid_transcription_language`	400	Invalid language code	Confirm the language code format is correct (such as `zh-TW`)
`too_many_languages`	400	Number of languages exceeds the limit	At most 2 languages can be specified
`invalid_recording_type`	400	Invalid recording type	Use a valid type value
`invalid_summary_template`	400	Invalid summary template	Confirm the template identifier is correct
`stt_init_failed`	503	Service initialization failed	Retry later
`auth_budget_exceeded`	402	Monthly budget exceeded	Wait for the next month's budget reset or adjust the budget
`tts_init_failed`	503	TTS service initialization failed	Retry later
`tts_invalid_language`	400	TTS language not in the translation languages	Confirm `tts_language` is in `translation_languages`
`broadcast_token_required`	400	Broadcast mode requires a token	The `broadcast` type must provide a `broadcast_token`
`broadcast_token_invalid`	400	Invalid broadcast token	Confirm the token is correct and not expired
`broadcast_not_ready`	503	Broadcast service not yet started	Retry later
`summary_invalid_mode`	400	`summary_mode` is not `builtin` / `custom`	Use a valid mode
`summary_mode_field_mismatch`	400	The mode and field combination does not match (a required field is missing / a forbidden field was included)	Adjust fields per the mode rules
`summary_prompt_too_long`	400	`summary_prompt` exceeds 2000 characters	Shorten the custom prompt
`summary_prompt_slug_too_long`	400	`summary_prompt_slug` exceeds 64 characters	Shorten the identifier
`summary_prompt_slug_invalid`	400	`summary_prompt_slug` contains control characters (`\n` / `\r` / `\t` / `\0`, etc.)	Remove the control characters

Voice Translation - config (Set Terminology / Correction Rules)

Description

Before or during recording, pass in terminology, fuzzy-word correction rules, and translation dictionary settings. These settings improve STT accuracy, fix homophone errors, and ensure translation consistency.

Auto-generated correction rules: When terminology is passed in, the system automatically generates fuzzy-word correction rules for each term (homophones, near-homophones, Traditional/Simplified variants). The frontend does not need to define fuzzy_correction manually, greatly simplifying the setup process.

Use Cases

Pass in professional terminology (Phrase List) before recording starts
Set fuzzy-word correction rules (homophone correction) - optional, the system generates them automatically
Set a translation dictionary (ensure consistent terminology translation)

Timing

Setting Type	Recommended Timing	Update During Recording
Terminology	Before or during `start`	Supported (takes effect on the next turn)
Fuzzy-word correction	Before or during `start`	Supported
Translation dictionary	Before or during `start`	Supported

Note: When you update terminology during recording, the new terms automatically take effect at the next recognition turn boundary, with no need to reconnect. The response includes a terminology_effective: "next_turn" field as a hint.

Request Parameters

Parameter	Type	Required	Description
`action`	string	Yes	Fixed value `config`
`terminology`	object	No	Terminology settings
`fuzzy_correction`	object	No	Fuzzy-word correction rules
`translation_dict`	object	No	Translation dictionary

Note: At least one setting item must be provided.
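The "at least one setting" rule can be enforced when building the message. The sketch below uses an illustrative helper name (buildConfigMessage) and raises a local error that mirrors the documented config_empty code.

```javascript
const CONFIG_KEYS = ['terminology', 'fuzzy_correction', 'translation_dict'];

// Build a config message, copying only the settings that were provided.
function buildConfigMessage(settings) {
  const data = { action: 'config' };
  for (const key of CONFIG_KEYS) {
    if (settings[key] !== undefined) data[key] = settings[key];
  }
  if (Object.keys(data).length === 1) {
    throw new Error('config_empty: provide at least one setting item');
  }
  return { type: 'voice-translation', data };
}
```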

Terminology Format (terminology)

Keyed by language code, with an array of terms as the value:

{
  "zh-TW": [
    { "term": "語者分離", "boost": 1.5 },
    { "term": "WebSocket", "boost": 2.0 }
  ],
  "en-US": [
    { "term": "diarization", "boost": 1.5 }
  ]
}

Field	Type	Required	Description
`term`	string	Yes	The term (max 100 characters)
`boost`	number	No	Weight (default 1.0, range 0.5–5.0)

Limit: Up to 500 terms per language.
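The terminology limits above can be pre-checked on the client. In this sketch, checkTerminology is an illustrative name, counting term length in Unicode code points is an assumption, and boost_out_of_range is a hypothetical local label (this document does not list a server error code for an out-of-range boost).

```javascript
// Returns a list of problems found in a terminology object
// of the form { "<language code>": [{ term, boost }, ...] }.
function checkTerminology(terminology) {
  const problems = [];
  for (const [lang, terms] of Object.entries(terminology)) {
    if (terms.length > 500) problems.push({ lang, code: 'config_too_many_entries' });
    for (const { term, boost = 1.0 } of terms) {
      // Assumption: the 100-character limit counts Unicode code points.
      if (Array.from(term).length > 100) problems.push({ lang, term, code: 'config_term_too_long' });
      // Hypothetical label: no documented error code for this case.
      if (boost < 0.5 || boost > 5.0) problems.push({ lang, term, code: 'boost_out_of_range' });
    }
  }
  return problems;
}
```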

Fuzzy-Word Correction Format (fuzzy_correction)

Note: This field usually does not need to be set manually. The system automatically generates correction rules based on terminology. Use it only when you need custom special rules.

Keyed by language code, with an array of correction rules as the value:

{
  "zh-TW": [
    { "correct": "語者分離", "incorrect": ["語這分離", "語者分力"] }
  ]
}

Field	Type	Required	Description
`correct`	string	Yes	The correct word
`incorrect`	string[]	Yes	List of incorrect variants

Auto-Generated Correction Rule Description

When terminology is passed in, the system automatically generates fuzzy-word correction rules for each term:

Generation Type	Description	Example
Homophone	Alternative characters with the same pinyin	語者 → 語這, 語折
Near-homophone	Alternative characters with similar tones	媽 → 麻, 馬
Traditional/Simplified	Traditional/Simplified conversion	製程 → 制程

Mixed Chinese-English term support: For mixed terms like "CVD製程," the system generates variants only for the Chinese portion and leaves the English unchanged.

Original Term	Auto-Generated Variants
CVD製程	CVD制程, CVD之程, CVD製城
wafer良率	wafer量率, wafer涼率
5nm製程	5nm制程, 5nm製成

Translation Dictionary Format (translation_dict)

Use an array of entries directly:

[
  {
    "source": "語者分離",
    "translations": {
      "en-US": "Speaker Diarization",
      "ja-JP": "話者分離"
    }
  }
]

Field	Type	Required	Description
`source`	string	Yes	The source word (in the STT language)
`translations`	object	Yes	Translation mapping `{ "language code": "translation" }`

Limit: We recommend no more than 50 entries (to avoid degrading processing performance).

Request Example (Recommended: Terminology Only)

{
  "type": "voice-translation",
  "data": {
    "action": "config",
    "terminology": {
      "zh-TW": [
        { "term": "語者分離", "boost": 1.5 },
        { "term": "CVD製程", "boost": 1.5 },
        { "term": "wafer良率", "boost": 1.5 }
      ]
    }
  }
}

Request Example (Full Settings, Including Manual Correction Rules)

{
  "type": "voice-translation",
  "data": {
    "action": "config",
    "terminology": {
      "zh-TW": [
        { "term": "語者分離", "boost": 1.5 },
        { "term": "即時轉錄", "boost": 1.5 }
      ]
    },
    "fuzzy_correction": {
      "zh-TW": [
        { "correct": "語者分離", "incorrect": ["語這分離", "語者分力"] }
      ]
    },
    "translation_dict": [
      { "source": "語者分離", "translations": { "en-US": "Speaker Diarization" } }
    ]
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "config_updated",
    "updated": ["terminology", "fuzzy_correction", "translation_dict"],
    "message": "Settings updated"
  }
}

Field	Type	Description
`updated`	string[]	The setting types that were updated
`message`	string	Status message

Error Responses

Error Code	HTTP Status	Description	Recommended Action
`config_empty`	400	No settings provided	Provide at least one setting item
`config_term_too_long`	400	Term exceeds 100 characters	Shorten the term length
`config_too_many_entries`	400	More than 500 terms	Reduce the number of terms
`config_too_many_dict_entries`	400	Translation dictionary exceeds 50 entries	Reduce the dictionary entries
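
The limits in the table above can be checked on the client before sending. The sketch below is a hypothetical pre-flight check, not part of the API: the function name checkConfig, the decision to count terms across all languages, and measuring term length with JavaScript's string length (UTF-16 code units) are all assumptions.

```javascript
// Limits taken from the error table above.
const MAX_TERM_LENGTH = 100;
const MAX_TERMS = 500;
const MAX_DICT_ENTRIES = 50;

// Hypothetical client-side check: returns the error code the server would be
// expected to send for this `data` object, or null when no limit is exceeded.
function checkConfig(data) {
  const { terminology, fuzzy_correction, translation_dict } = data;
  if (!terminology && !fuzzy_correction && !translation_dict) return "config_empty";
  // Assumption: the 500-term limit is counted across all languages combined.
  const terms = Object.values(terminology || {}).flat();
  if (terms.some((entry) => (entry.term || "").length > MAX_TERM_LENGTH)) {
    return "config_term_too_long";
  }
  if (terms.length > MAX_TERMS) return "config_too_many_entries";
  if ((translation_dict || []).length > MAX_DICT_ENTRIES) {
    return "config_too_many_dict_entries";
  }
  return null;
}

console.log(checkConfig({ terminology: { "zh-TW": [{ term: "語者分離", boost: 1.5 }] } }));
```

The server remains the source of truth; a check like this only saves a round trip for obviously invalid input.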

Voice Translation - audio (Send Audio)

Description

Sends audio data to the server for speech recognition. The audio must be Base64-encoded before sending.

Use Cases

Continuously sending microphone audio
Sending recorded audio segments

Request Parameters

Parameter	Type	Required	Description
`action`	string	Yes	Fixed value `audio`
`payload`	string	Yes	Base64-encoded audio data

Audio Format Requirements

PCM format (default):

Item	Specification
Format	PCM (raw audio)
Sample rate	16000 Hz
Bit depth	16-bit
Channels	Mono
Byte order	Little-endian
Transport encoding	Base64

WebM/Opus format:

Item	Specification
Format	WebM container + Opus codec
Sample rate	Any (the server converts automatically)
Channels	Mono or Stereo (the server converts automatically)
Transport encoding	Base64

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "audio",
    "payload": "Base64-encoded PCM audio data"
  }
}
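
As an illustration of the PCM requirements above, this sketch packs signed 16-bit samples as little-endian bytes and wraps the Base64 result in an `audio` message. The helper names and the 100 ms chunk size are assumptions; this section does not prescribe a chunk duration.

```javascript
// Convert signed 16-bit samples to little-endian bytes, then to Base64.
function pcm16ToBase64(samples) {
  const bytes = new Uint8Array(samples.length * 2);
  const view = new DataView(bytes.buffer);
  // The third argument `true` forces little-endian byte order on every platform.
  samples.forEach((sample, i) => view.setInt16(i * 2, sample, true));
  let binary = "";
  for (const byte of bytes) binary += String.fromCharCode(byte);
  return btoa(binary); // btoa is available in browsers and in recent Node.js versions
}

// Wrap one chunk of samples in the documented message envelope.
function buildAudioMessage(samples) {
  return JSON.stringify({
    type: "voice-translation",
    data: { action: "audio", payload: pcm16ToBase64(samples) },
  });
}

// 100 ms of 16 kHz mono 16-bit audio = 1600 samples = 3200 bytes before Base64.
const chunk = new Int16Array(1600);
console.log(buildAudioMessage(chunk).length);
```

The returned string can be passed directly to `ws.send(...)` on an open connection.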

Error Responses

Error Code	HTTP Status	Description	Recommended Action
`session_not_started`	400	Speech recognition has not started	Call the `start` action first
`audio_invalid_format`	400	Invalid audio data format	Confirm the Base64 encoding is correct
`audio_format_unsupported`	400	Unsupported audio format	Use the `pcm` or `webm` format
`audio_decode_failed`	400	Audio decoding failed	Confirm the audio format is correct

Voice Translation - pause (Pause Translation)

Description

Pauses speech recognition processing. Audio received during the pause is buffered and processed after the session resumes.

Use Cases

The user steps away temporarily
You need to pause recording

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "pause"
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "status",
    "message": "Speech recognition paused"
  }
}

Error Responses

Error Code	HTTP Status	Description	Recommended Action
`session_not_started`	400	Speech recognition has not started	Call `start` first
`session_already_paused`	400	Already paused	You can ignore this error

Voice Translation - resume (Resume Translation)

Description

Resumes paused speech recognition processing.

Use Cases

The user returns to continue
You need to continue recording

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "resume"
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "status",
    "message": "Speech recognition resumed"
  }
}

Error Responses

Error Code	HTTP Status	Description	Recommended Action
`session_not_started`	400	Speech recognition has not started	Call `start` first
`session_not_paused`	400	Not paused	You can ignore this error
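
The pause and resume error tables imply a small state model on the client side. The sketch below maps a locally tracked state to the error code each table lists; the state names ("idle", "running", "paused") and the function name are illustrative assumptions, not API values.

```javascript
// Hypothetical local state: "idle" (before start), "running", or "paused".
// Returns the documented error code a pause/resume call would be expected to
// trigger in that state, or null when the call should succeed.
function expectedPauseResumeError(state, action) {
  if (state === "idle") return "session_not_started";
  if (action === "pause" && state === "paused") return "session_already_paused";
  if (action === "resume" && state === "running") return "session_not_paused";
  return null;
}

console.log(expectedPauseResumeError("paused", "pause"));
```

Since both `session_already_paused` and `session_not_paused` can be ignored, a client may also skip the check and simply discard those two errors.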

Voice Translation - stop (Stop Translation)

Description

Stops speech recognition and ends the session. The system automatically uploads the audio file and transcript, and generates a summary (if configured).

Use Cases

The meeting ends
Recording is complete

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "stop"
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "status",
    "message": "Speech recognition stopped"
  }
}

Task Complete Event

This event is sent after the audio file and transcript have been uploaded:

{
  "type": "voice-translation",
  "data": {
    "action": "task_complete",
    "task_id": "550e8400-e29b-41d4-a716-446655440000",
    "message": "Task processing complete"
  }
}

Field	Type	Description
`task_id`	string	Recording UUID, can be used for subsequent API queries
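
Because `task_complete` arrives after the `status` response to `stop`, a client may want to keep listening until it appears. A minimal sketch, assuming the hypothetical helper name taskIdFromFrame and that each WebSocket frame carries one JSON message:

```javascript
// Returns the task_id when the raw frame is a task_complete event, otherwise null.
function taskIdFromFrame(rawFrame) {
  const message = JSON.parse(rawFrame);
  if (message.type !== "voice-translation") return null;
  const data = message.data || {};
  return data.action === "task_complete" ? data.task_id : null;
}

const frame = JSON.stringify({
  type: "voice-translation",
  data: {
    action: "task_complete",
    task_id: "550e8400-e29b-41d4-a716-446655440000",
    message: "Task processing complete",
  },
});
console.log(taskIdFromFrame(frame));
```

In a message handler, the connection could be closed once this function returns a non-null value.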

Voice Translation - retranslate (Retranslate)

Description

Retranslates a specified sentence, useful when the original text has been corrected and the translation needs to be updated.

Use Cases

The user edits the original text and the translation needs updating
Correcting recognition errors

Request Parameters

Parameter	Type	Required	Description
`action`	string	Yes	Fixed value `retranslate`
`sid`	int	Yes	The sentence number to retranslate
`translation_languages`	string[]	Yes	Array of translation language codes
`text`	string	Yes	The original text to translate (the user-corrected text)

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "retranslate",
    "sid": 1,
    "translation_languages": ["en-US"],
    "text": "The user-corrected original text"
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "result",
    "translations": {
      "en-US": {
        "sid": 1,
        "text": "The new translation result",
        "is_final": true,
        "is_retranslation": true
      }
    }
  }
}
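
One way a client might fold this response into its own transcript is sketched below. The store layout (a Map from `sid` to an object keyed by language code) and the function name are assumptions for illustration.

```javascript
// Hypothetical local store: Map of sid -> { [languageCode]: translatedText }.
// Works for any event whose data carries a `translations` object of this shape.
function applyTranslations(store, data) {
  for (const [language, entry] of Object.entries(data.translations || {})) {
    const existing = store.get(entry.sid) || {};
    // A retranslation simply overwrites the previous text for that language.
    store.set(entry.sid, { ...existing, [language]: entry.text });
  }
  return store;
}

const transcript = new Map([[1, { "en-US": "Old translation" }]]);
applyTranslations(transcript, {
  action: "result",
  translations: {
    "en-US": { sid: 1, text: "The new translation result", is_final: true, is_retranslation: true },
  },
});
console.log(transcript.get(1));
```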

Error Responses

Error Code	HTTP Status	Description	Recommended Action
`retranslate_sid_not_found`	400	The specified SID was not found	Confirm the SID exists
`retranslate_session_not_active`	400	The session is not started or has ended	Confirm the session state
`retranslate_no_target_lang`	400	No target language provided	Provide `translation_languages`
`retranslate_no_text`	400	No text to translate provided	Provide the `text` parameter
`retranslate_llm_failed`	500	Translation service failed	Retry later

Voice Translation - switch_language (Switch Language)

Description

Switches the language while real-time translation is in progress. The behavior varies by recording type:

General mode (transcribe, etc.): switches the translation target language and automatically batch-retranslates all already-translated sentences.
Two-way mode (conversation): switches the STT source language (spoken language); the translation target automatically switches to the other language.

Use Cases

Switching the translation target language
Language requirements change mid-meeting

Request Parameters

Parameter	Type	Required	Description
`action`	string	Yes	Fixed value `switch_language`
`translation_languages`	string[]	Conditional	Array of translation language codes (required in general mode)
`transcription_languages`	string[]	Conditional	Array containing the language to switch to (two-way mode; if omitted, automatically toggles to the other language)

Request Example (General Mode)

{
  "type": "voice-translation",
  "data": {
    "action": "switch_language",
    "translation_languages": ["ja-JP"]
  }
}

Request Example (Two-Way Mode)

Specify the target to switch to:

{
  "type": "voice-translation",
  "data": {
    "action": "switch_language",
    "transcription_languages": ["en-US"]
  }
}

Automatic toggle (no parameters):

{
  "type": "voice-translation",
  "data": {
    "action": "switch_language"
  }
}

Two-Way Mode Special Behavior:

Two-way mode uses automatic language detection and usually does not require manually switching the language.
switch_language only updates the internal preference state.
After a successful switch, a language_switched event is returned (not a language_switch_start/done sequence).
Switching to the same language returns a conversation_same_language warning.

Response Sequence (General Mode)

After switching the language, you receive the following events in order:

language_switch_start: notifies that the switch has begun

{
  "type": "voice-translation",
  "data": {
    "action": "language_switch_start",
    "translation_language": "ja-JP",
    "total_segments": 15,
    "message": "Starting language switch and retranslation"
  }
}

batch_retranslation (multiple): returns retranslation results sentence by sentence

{
  "type": "voice-translation",
  "data": {
    "action": "batch_retranslation",
    "sid": 3,
    "translations": {
      "ja-JP": {
        "sid": 3,
        "text": "今日はプロジェクトの進捗について話し合いましょう",
        "is_final": true,
        "is_retranslation": true
      }
    }
  }
}

language_switch_done: notifies that the switch is complete

{
  "type": "voice-translation",
  "data": {
    "action": "language_switch_done",
    "translation_language": "ja-JP",
    "success_count": 15,
    "failed_count": 0,
    "message": "Language switch complete"
  }
}
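
The three-event sequence above lends itself to a small progress tracker, for example to drive a progress bar. The reducer below is a sketch; the state shape and the function name are assumptions.

```javascript
// Hypothetical reducer: feed it the `data` object of each incoming event.
function reduceLanguageSwitch(state, data) {
  switch (data.action) {
    case "language_switch_start":
      return {
        language: data.translation_language,
        total: data.total_segments,
        received: 0,
        done: false,
      };
    case "batch_retranslation":
      // One event per retranslated sentence
      return state ? { ...state, received: state.received + 1 } : state;
    case "language_switch_done":
      return state ? { ...state, done: true, failed: data.failed_count } : state;
    default:
      return state; // unrelated events leave the progress untouched
  }
}

let progress = null;
progress = reduceLanguageSwitch(progress, {
  action: "language_switch_start",
  translation_language: "ja-JP",
  total_segments: 15,
});
console.log(progress);
```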

Error Responses

Error Code	HTTP Status	Description	Recommended Action
`switch_language_no_target`	400	No target language provided	Provide `translation_languages`
`switch_language_in_progress`	400	The previous switch is not yet complete	Wait for the switch to complete
`switch_language_same_target`	400	The target language is the same as the current one	You can ignore this error
`conversation_requires_two_languages`	400	Two-way mode requires exactly two languages	Confirm `transcription_languages` contains exactly two languages
`conversation_languages_identical`	400	The two two-way languages cannot be the same	Provide two different languages
`conversation_invalid_language`	400	Invalid two-way language	Confirm the language is in transcription_languages
`conversation_same_language`	400	Already the current language	You can ignore this warning

Voice Translation - set_name (Set Recording Name)

Description

Sets the recording name while recording is in progress. Once set, this name is kept when the recording ends and is not replaced by an auto-generated one.

Tip: You can also set an initial default name via the name parameter at start, but that name may still be overridden by the system when the session ends. If you need a fixed name, use set_name.

Use Cases

Customizing the recording title after recording starts
Overriding an auto-generated name or a previously set name

Request Parameters

Parameter	Type	Required	Description
`action`	string	Yes	Fixed value `set_name`
`name`	string	Yes	Recording name (max 60 chars)

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "set_name",
    "name": "Product Planning Meeting"
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "status",
    "message": "Recording name updated"
  }
}

Error Responses

Error Code	HTTP Status	Description	Recommended Action
`name_too_long`	400	Recording name exceeds the limit	Shorten the name
`session_not_started`	400	Speech recognition has not started	Call `start` first

Voice Translation - rename_speaker (Globally Rename a Speaker)

Description

In multi-speaker diarization mode (multi_speaker), globally renames a speaker. All sentences using that speaker ID are updated in sync.

Use Cases

Changing a system-assigned speaker ID (such as Guest-1) to a meaningful name (such as Manager Wang)
Naming a newly recognized speaker during a meeting

Request Parameters

Parameter	Type	Required	Description
`action`	string	Yes	Fixed value `rename_speaker`
`speaker_id`	string	Yes	The original speaker ID (such as `Guest-1`); the current display label is also accepted for consecutive renaming; max 100 characters
`new_label`	string	Yes	The new display label; max 100 characters, must not contain control characters (`\x00-\x1F`, `\x7F`) or line breaks
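
The `new_label` constraints can be checked before sending. This is a minimal sketch under two assumptions: the function name is hypothetical, and the 100-character limit is measured with JavaScript's string length (UTF-16 code units), which may not match how the server counts.

```javascript
// Control characters listed in the parameter table; \n and \r fall inside \x00-\x1F.
const CONTROL_CHARS = /[\x00-\x1F\x7F]/;

// Hypothetical client-side check for a speaker display label.
function isValidSpeakerLabel(label) {
  return (
    typeof label === "string" &&
    label.length > 0 &&
    label.length <= 100 &&
    !CONTROL_CHARS.test(label)
  );
}

console.log(isValidSpeakerLabel("Manager Wang"));
```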

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "rename_speaker",
    "speaker_id": "Guest-1",
    "new_label": "Manager Wang"
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "speaker_renamed",
    "speaker_id": "Guest-1",
    "new_label": "Manager Wang",
    "affected_sids": [1, 3, 5, 8]
  }
}

Field	Type	Description
`speaker_id`	string	The resolved original speaker ID (even if the input was a display label, the event returns the original ID)
`new_label`	string	The new display label
`affected_sids`	int[]	The list of affected sentence numbers

Error Responses

Error Code	HTTP Status	Description	Recommended Action
`speaker_not_found`	400	The specified speaker was not found	Confirm the `speaker_id` or display label exists
`speaker_name_empty`	400	`new_label` is empty	Provide a valid label
`speaker_name_duplicate`	422	The display label is already in use	Use a different label, or first change the conflicting speaker
`session_not_started`	400	Speech recognition has not started	Call `start` first

Voice Translation - reassign_speaker (Change the Speaker of a Single Sentence)

Description

Changes the speaker identity (OriginalSpeakerID) of a specific sentence, assigning the sentence to an existing speaker.

Use Cases

Correcting a speaker identity that the system recognized incorrectly
Reassigning a sentence to another known speaker

Request Parameters

Parameter	Type	Required	Description
`action`	string	Yes	Fixed value `reassign_speaker`
`sid`	int	Yes	The sentence number to change
`target_speaker_id`	string	Yes	The target speaker's original ID (taken from `init_sentence.speaker_id`; reassign does not accept display labels)

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "reassign_speaker",
    "sid": 5,
    "target_speaker_id": "Guest-2"
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "speaker_reassigned",
    "sid": 5,
    "old_speaker_id": "Guest-1",
    "new_speaker_id": "Guest-2",
    "new_speaker_label": "Lee Hsiao-hua"
  }
}

Field	Type	Description
`sid`	int	The changed sentence number
`old_speaker_id`	string	The original speaker ID
`new_speaker_id`	string	The new original speaker ID
`new_speaker_label`	string	The new speaker display label (after applying `speaker_aliases`; equals `new_speaker_id` when no alias exists)

Error Responses

Error Code	HTTP Status	Description	Recommended Action
`speaker_sid_not_found`	400	The specified sentence was not found	Confirm the SID exists
`speaker_not_found`	400	The target speaker does not exist	Use an existing speaker ID
`speaker_name_empty`	400	The target speaker ID cannot be empty	Provide a valid speaker ID
`session_not_started`	400	Speech recognition has not started	Call `start` first
`invalid_parameter`	400	Creating a new speaker is not supported	Use an existing speaker ID

Voice Translation - merge_speakers (Merge Speakers)

Description

Merges all sentences of one speaker into another speaker. After the merge, future recognition results for that speaker are also automatically converted to the target speaker.

Use Cases

The speech recognition engine sometimes misidentifies the same person's voice as multiple speakers (for example, Guest-1 and Guest-2 are actually the same person)
Use this feature to merge all of Guest-2's sentences into Guest-1
After the merge, future Guest-2 recognition results are automatically displayed as Guest-1

Difference from `reassign_speaker`

Feature	Scope	Future Impact
`reassign_speaker`	A single sentence (1 SID)	None
`merge_speakers`	All sentences of that speaker	Future appearances of the source are also automatically converted to the target

Request Parameters

Parameter	Type	Required	Description
`action`	string	Yes	Fixed value `merge_speakers`
`source_speaker_id`	string	Yes	The speaker ID to be merged (such as `Guest-2`)
`target_speaker_id`	string	Yes	The merge target speaker ID (such as `Guest-1`)

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "merge_speakers",
    "source_speaker_id": "Guest-2",
    "target_speaker_id": "Guest-1"
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "speakers_merged",
    "source_speaker_id": "Guest-2",
    "target_speaker_id": "Guest-1",
    "target_speaker_label": "Manager Wang",
    "affected_sids": [3, 5, 7]
  }
}

Field	Type	Description
`source_speaker_id`	string	The original ID of the merged speaker
`target_speaker_id`	string	The original ID of the merge target
`target_speaker_label`	string	The target speaker display label (after applying `speaker_aliases`; equals the original ID when no alias exists)
`affected_sids`	int[]	The list of affected sentence numbers
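
A client that keeps its own copy of the transcript could mirror `speaker_reassigned` and `speakers_merged` as sketched below. The sentence model (`{ sid, speakerId }`) and both function names are assumptions for illustration.

```javascript
// Hypothetical local model: sentences is an array of { sid, speakerId }.

// speaker_reassigned touches exactly one sentence.
function applyReassign(sentences, event) {
  return sentences.map((s) =>
    s.sid === event.sid ? { ...s, speakerId: event.new_speaker_id } : s
  );
}

// speakers_merged touches every sentence listed in affected_sids.
function applyMerge(sentences, event) {
  const affected = new Set(event.affected_sids);
  return sentences.map((s) =>
    affected.has(s.sid) ? { ...s, speakerId: event.target_speaker_id } : s
  );
}

const sentences = [
  { sid: 3, speakerId: "Guest-2" },
  { sid: 4, speakerId: "Guest-1" },
  { sid: 5, speakerId: "Guest-2" },
];
console.log(
  applyMerge(sentences, {
    action: "speakers_merged",
    source_speaker_id: "Guest-2",
    target_speaker_id: "Guest-1",
    affected_sids: [3, 5],
  })
);
```

Both functions return new arrays rather than mutating the input, which keeps them easy to use with UI frameworks that compare references.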

Error Responses

Error Code	HTTP Status	Description	Recommended Action
`speaker_not_found`	400	The speaker does not exist	Confirm the speaker ID exists
`merge_speakers_same_id`	400	The source and target speaker are the same	Use different speaker IDs
`speaker_name_empty`	400	The speaker ID cannot be empty	Provide a valid speaker ID
`session_not_started`	400	Speech recognition has not started	Call `start` first

Voice Translation - tts_play (Play TTS)

Description

In async mode, manually plays the TTS audio for a specified sentence.

Use Cases

The user selects a specific sentence for TTS playback
Playing multiple consecutive sentences

Request Parameters

Parameter	Type	Required	Description
`action`	string	Yes	Fixed value `tts_play`
`sid`	int	Yes	The starting sentence ID
`length`	int	No	Number of sentences to play (default 1, max 20)

Note: The maximum value of length is controlled by the backend environment variable TTS_SSE_MAX_LENGTH (default 20).
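
A request builder can clamp `length` to the documented range before sending. In this sketch the function name is hypothetical and the constant mirrors the documented default of 20; a deployment that changes TTS_SSE_MAX_LENGTH would need a different value.

```javascript
// Documented default of TTS_SSE_MAX_LENGTH; an actual deployment may differ.
const TTS_MAX_LENGTH = 20;

// Hypothetical builder for a tts_play message with `length` clamped to 1..20.
function buildTtsPlay(sid, length = 1) {
  const clamped = Math.min(Math.max(Math.trunc(length), 1), TTS_MAX_LENGTH);
  const data = { action: "tts_play", sid };
  // `length` defaults to 1 on the server, so it is omitted for a single sentence.
  if (clamped > 1) data.length = clamped;
  return { type: "voice-translation", data };
}

console.log(JSON.stringify(buildTtsPlay(5, 3)));
```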

Request Example (Single Sentence)

{
  "type": "voice-translation",
  "data": {
    "action": "tts_play",
    "sid": 5
  }
}

Request Example (Multiple Sentences)

{
  "type": "voice-translation",
  "data": {
    "action": "tts_play",
    "sid": 5,
    "length": 3
  }
}

Error Responses

Error Code	HTTP Status	Description	Recommended Action
`tts_not_enabled`	400	TTS not enabled	Confirm TTS was enabled at start
`tts_sid_not_found`	400	The specified sentence was not found	Confirm the SID exists
`tts_translation_not_found`	400	The sentence has no translation in the specified language	Confirm the translation exists

Voice Translation - tts_stop (Stop TTS)

Description

Stops the TTS audio that is currently playing.

Use Cases

The user manually stops TTS playback
Stopping the current playback before switching to another sentence

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "tts_stop"
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "status",
    "message": "TTS playback stopped"
  }
}

Voice Translation - tts_mode (Switch TTS Mode)

Description

Switches the TTS playback mode (synchronous/asynchronous) while recording is in progress.

Use Cases

Switching from automatic playback to manual control
Switching from manual control to automatic playback

Request Parameters

Parameter	Type	Required	Description
`action`	string	Yes	Fixed value `tts_mode`
`tts_mode`	string	Yes	Mode: `sync` (synchronous) or `async` (asynchronous)

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "tts_mode",
    "tts_mode": "async"
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "tts_mode_changed",
    "tts_mode": "async"
  }
}

Error Responses

Error Code	HTTP Status	Description	Recommended Action
`tts_not_enabled`	400	TTS not enabled	Confirm TTS was enabled at start
`tts_invalid_mode`	400	Invalid mode	Use `sync` or `async`

Voice Translation - set_tts (Two-Way TTS Settings)

Description

While a two-way mode (conversation) recording is in progress, dynamically toggles TTS on/off or updates the TTS voice settings. Available only in two-way mode.

Use Cases

Turning the TTS audio response off/on mid-conversation in two-way mode
Changing the TTS voice or speaking rate for a specific language

Request Parameters

Parameter	Type	Required	Description
`action`	string	Yes	Fixed value `set_tts`
`tts_enabled`	boolean	No	Whether to enable two-way TTS (`true` / `false`)
`tts_config`	object	No	TTS settings per language; the key is the language code, and the value is `{voice, speaking_rate}`

Note: At least one of tts_enabled and tts_config must be provided. tts_config updates only the settings for the specified languages; unspecified languages remain unchanged.
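
The rule in the note can be enforced when building the request. The sketch below uses hypothetical helper names; mergeTtsConfig shows one reading of the per-language update, in which a language named in the update replaces its previous entry as a whole. Whether the server merges individual fields within a language is not stated here.

```javascript
// Hypothetical builder: rejects a set_tts request that carries neither field.
function buildSetTts({ ttsEnabled, ttsConfig } = {}) {
  if (ttsEnabled === undefined && ttsConfig === undefined) {
    throw new Error("set_tts needs tts_enabled, tts_config, or both");
  }
  const data = { action: "set_tts" };
  if (ttsEnabled !== undefined) data.tts_enabled = ttsEnabled;
  if (ttsConfig !== undefined) data.tts_config = ttsConfig;
  return { type: "voice-translation", data };
}

// Languages not named in the update keep their current settings.
// Assumption: a named language is replaced as a whole entry.
function mergeTtsConfig(current, update) {
  return { ...current, ...update };
}

console.log(JSON.stringify(buildSetTts({ ttsEnabled: false })));
```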

Request Example (Disable TTS)

{
  "type": "voice-translation",
  "data": {
    "action": "set_tts",
    "tts_enabled": false
  }
}

Request Example (Update Voice Settings)

{
  "type": "voice-translation",
  "data": {
    "action": "set_tts",
    "tts_enabled": true,
    "tts_config": {
      "en-US": {
        "voice": "en-US-GuyNeural",
        "speaking_rate": 1.2
      }
    }
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "tts_updated",
    "tts_enabled": true,
    "tts_config": {
      "zh-TW": { "voice": "zh-TW-HsiaoChenNeural", "speaking_rate": 1.0 },
      "en-US": { "voice": "en-US-GuyNeural", "speaking_rate": 1.2 }
    }
  }
}

Field	Type	Description
`tts_enabled`	boolean	The current TTS enabled state
`tts_config`	object	The current complete TTS settings (all languages)

Error Responses

Error Code	HTTP Status	Description	Recommended Action
`invalid_action`	400	Not two-way mode	This action is available only in `conversation` mode
`session_not_started`	400	Speech recognition has not started	Call `start` first

Voice Translation - start_speaking (Start Speaking / Manual Mode)

Description

In two-way manual mode (conversation_mode: "manual"), notifies the system that the user has started speaking. From this point on, audio is sent to STT for recognition, and all recognition results accumulate into a single sentence (no automatic segmentation).

Request Parameters

Parameter	Type	Required	Description
`action`	string	Yes	Fixed value `start_speaking`
`speaker`	int	Yes	User number (1 or 2)

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "start_speaking",
    "speaker": 1
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "status",
    "message": "Speaking started"
  }
}

Error Responses

Error Code	HTTP Status	Description	Recommended Action
`invalid_action`	400	Not two-way mode	Use only with the `conversation` recording type
`conversation_not_manual_mode`	400	Not manual mode	Use only in manual mode
`conversation_speaking`	400	Already speaking	Call `stop_speaking` first
`conversation_invalid_speaker`	400	Invalid user number	Use 1 or 2

Voice Translation - stop_speaking (Stop Speaking / Manual Mode)

Description

In two-way manual mode, notifies the system that the user has stopped speaking. The system merges the recognition results accumulated during the period into a single complete sentence and performs translation and TTS synthesis.

Request Parameters

Parameter	Type	Required	Description
`action`	string	Yes	Fixed value `stop_speaking`

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "stop_speaking"
  }
}

Success Response

After stopping speaking, the system sends a complete result event (containing origin and translations):

{
  "type": "voice-translation",
  "data": {
    "action": "result",
    "origin": {
      "sid": 1,
      "language": "zh-TW",
      "text": "The complete sentence merged from all recognition during this period",
      "is_final": true,
      "speaker_id": "Speaker-1",
      "start_time": "00:05"
    },
    "translations": {
      "en-US": {
        "sid": 1,
        "text": "The complete merged sentence from this speaking period",
        "is_final": true
      }
    }
  }
}

Error Responses

Error Code	HTTP Status	Description	Recommended Action
`invalid_action`	400	Not two-way mode	Use only with the `conversation` recording type
`conversation_not_speaking`	400	Not in a speaking state	Call `start_speaking` first
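
Taken together, `start_speaking` and `stop_speaking` behave like a push-to-talk button. The helper below is a sketch of a client-side guard that avoids the `conversation_speaking`, `conversation_invalid_speaker`, and `conversation_not_speaking` errors; its name and interface are assumptions.

```javascript
// Hypothetical push-to-talk helper. Each method returns the message to send,
// or null when the call would be rejected by the documented error rules.
function createPushToTalk() {
  let speaking = false;
  return {
    press(speaker) {
      if (speaking || (speaker !== 1 && speaker !== 2)) return null;
      speaking = true;
      return { type: "voice-translation", data: { action: "start_speaking", speaker } };
    },
    release() {
      if (!speaking) return null;
      speaking = false;
      return { type: "voice-translation", data: { action: "stop_speaking" } };
    },
  };
}

const button = createPushToTalk();
console.log(JSON.stringify(button.press(1)));
console.log(JSON.stringify(button.release()));
```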

Voice Translation - switch_conversation_mode (Switch Conversation Mode)

Description

While two-way mode is in progress, switches between auto-detect mode (auto) and manual mode (manual). If the user is currently speaking when the switch happens, speaking is ended automatically.

Request Parameters

Parameter	Type	Required	Description
`action`	string	Yes	Fixed value `switch_conversation_mode`
`conversation_mode`	string	Yes	The target mode: `auto` or `manual`

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "switch_conversation_mode",
    "conversation_mode": "manual"
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "conversation_mode_changed",
    "conversation_mode": "manual"
  }
}

Error Responses

Error Code	HTTP Status	Description	Recommended Action
`invalid_action`	400	Not two-way mode	Use only with the `conversation` recording type
`conversation_invalid_mode`	400	Invalid conversation mode	Use `auto` or `manual`

Voice Translation - set_speaker_language (Set User Language)

Description

While two-way mode is in progress, changes a specified user's language in real time. The system rebuilds the STT connection to accommodate the new language, and the translation target is also updated automatically. The transcript content before the change keeps its original language, and the timestamp continues to count without resetting.

Request Parameters

Parameter	Type	Required	Description
`action`	string	Yes	Fixed value `set_speaker_language`
`speaker`	int	Yes	User number (1 or 2)
`language`	string	Yes	The new language code (such as `ja-JP`)

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "set_speaker_language",
    "speaker": 1,
    "language": "ja-JP"
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "speaker_language_changed",
    "speaker_language_map": {
      "1": "ja-JP",
      "2": "en-US"
    }
  }
}

Error Responses

Error Code	HTTP Status	Description	Recommended Action
`invalid_action`	400	Not two-way mode	Use only with the `conversation` recording type
`conversation_invalid_speaker`	400	Invalid user number	Use 1 or 2
`conversation_invalid_language`	400	Invalid language code	Use a valid BCP 47 language code
`conversation_same_language`	400	Same as the current language	You can ignore this warning
`conversation_language_same_as_peer`	400	The new language is the same as the other user	The two users cannot have the same language
`conversation_speaking`	400	Currently speaking, cannot change language	End speaking before changing
`conversation_language_change_failed`	500	Language change failed (STT rebuild failed)	Retry later
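
Several of the errors above can be anticipated on the client from the most recent `speaker_language_map`. The sketch below is hypothetical: the function name is invented, and the order in which the checks run is an assumption, since the server's evaluation order is not documented here.

```javascript
// Returns the documented error code a set_speaker_language call would be
// expected to trigger, or null when none of these client-side checks fail.
function expectedSetSpeakerLanguageError(speakerLanguageMap, speaker, language, isSpeaking) {
  if (speaker !== 1 && speaker !== 2) return "conversation_invalid_speaker";
  if (isSpeaking) return "conversation_speaking";
  const peer = speaker === 1 ? 2 : 1;
  // speaker_language_map uses string keys ("1", "2") in the success response
  if (speakerLanguageMap[String(speaker)] === language) return "conversation_same_language";
  if (speakerLanguageMap[String(peer)] === language) return "conversation_language_same_as_peer";
  return null;
}

const languageMap = { "1": "zh-TW", "2": "en-US" };
console.log(expectedSetSpeakerLanguageError(languageMap, 1, "en-US", false));
```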

Voice Translation - broadcast_go_live (Switch to the Live Phase)

Description

Switches from the broadcast standby phase (standby) to the live phase (live). After switching, STT/translation results begin broadcasting to viewers and start being written to the transcript.

Use Cases

The host confirms the equipment is working and starts the official broadcast
Switching from the warm-up phase to live streaming

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "broadcast_go_live"
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "broadcast_phase_changed",
    "phase": "live",
    "message": "Broadcast started"
  }
}

Field	Type	Description
`phase`	string	The new phase (`live`)
`message`	string	Status description message

Error Responses

Error Code	HTTP Status	Description	Recommended Action
`broadcast_not_enabled`	400	Not broadcast mode	Confirm `type: "broadcast"`
`session_not_started`	400	Speech recognition has not started	Call `start` first

Note: If already in the live phase, a status message "Broadcast is already in progress" is returned and is not treated as an error.

Voice Translation - broadcast_announcement (Send an Announcement)

Description

The host sends a custom message announcement to all viewers. Viewers receive an announcement event via SSE. The announcement message is automatically translated into all translation languages, and the SSE event viewers receive includes a translations field.

Use Cases

Notifying viewers that the meeting is about to end
Sending an important reminder or announcement
One-way communication with viewers

Request Parameters

Parameter	Type	Required	Description
`action`	string	Yes	Fixed value `broadcast_announcement`
`message`	string	Yes	The announcement message content

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "broadcast_announcement",
    "message": "The meeting will end in 5 minutes"
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "status",
    "message": "Announcement sent"
  }
}

The SSE event viewers receive (with translations):

event: announcement
data: {"message":"The meeting will end in 5 minutes","translations":{"en-US":"The meeting will end in 5 minutes","ja-JP":"会議は5分後に終了します"}}

Error Responses

Error Code	HTTP Status	Description	Recommended Action
`broadcast_not_enabled`	400	Not broadcast mode	Confirm `type: "broadcast"`
`invalid_parameter`	400	Message is empty	Provide a valid `message` parameter

Voice Translation - set_standby_message (Set the Standby Phase Message)

Description

During the broadcast standby phase (standby), dynamically sets the message shown to viewers. This allows the host to enter standby mode and then set the waiting message, rather than being required to provide it at start.

The message is automatically translated into all translation languages, and the SSE event viewers receive includes a translations field.

Use Cases

After entering standby mode, dynamically set the waiting message shown to viewers
Update the text on the standby screen before going live
Reduce the required fields before starting the broadcast

Request Parameters

Parameter	Type	Required	Description
`action`	string	Yes	Fixed value `set_standby_message`
`message`	string	Yes	The text displayed during the standby phase (translated for viewers of each language via the existing translation pipeline)

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "set_standby_message",
    "message": "The talk is about to begin, please wait..."
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "status",
    "message": "Standby phase text updated"
  }
}

Event Viewers Receive

After the message is set successfully, all viewers in the standby phase receive an updated standby event via SSE:

event: standby
data: {"message":"The talk is about to begin, please wait...","translations":{"en-US":"The presentation is about to begin, please wait...","ja-JP":"プレゼンテーションがまもなく始まります。お待ちください..."}}

Note: The translations field contains the translation results for all translation languages. The frontend can display the corresponding translation based on the language the viewer selects.
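
On the viewer side, picking the text to display could look like the sketch below. The function name is hypothetical, and falling back to the untranslated `message` when the viewer's language is missing is an assumption rather than documented behavior.

```javascript
// `eventData` is the string after "data:" in the SSE event.
function textForViewer(eventData, viewerLanguage) {
  const payload = JSON.parse(eventData);
  const translations = payload.translations || {};
  // Assumption: show the original message when no translation is available.
  return translations[viewerLanguage] || payload.message;
}

const standbyData =
  '{"message":"The talk is about to begin, please wait...","translations":{"en-US":"The presentation is about to begin, please wait..."}}';
console.log(textForViewer(standbyData, "en-US"));
```

The same helper applies to the `announcement` event, which carries the same two fields.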

Error Responses

Error Code	HTTP Status	Description	Recommended Action
`broadcast_not_enabled`	400	Not broadcast mode	Confirm `type: "broadcast"`
`broadcast_not_in_standby`	400	Not in the standby phase	Can be used only during the standby phase

Note: This action can be used only during the standby phase (standby). If the broadcast has already entered the live phase (live), an error is returned.

Response Events

The following are all the response events you may receive over the WebSocket.

session_started - Session Started Successfully

After a start action succeeds, the server returns an event containing complete session initialization info. The frontend can distinguish the recording type via recording_type.

General recordings (transcribe / conversation / record):

{
  "type": "voice-translation",
  "data": {
    "action": "session_started",
    "session_id": "550e8400-e29b-41d4-a716-446655440000",
    "recording_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
    "recording_type": "transcribe",
    "recognition_mode": "single",
    "message": "Speech recognition started"
  }
}

Broadcast mode (broadcast):

{
  "type": "voice-translation",
  "data": {
    "action": "session_started",
    "session_id": "550e8400-e29b-41d4-a716-446655440000",
    "recording_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
    "recording_type": "broadcast",
    "recognition_mode": "multi_speaker",
    "phase": "standby",
    "viewer_count": 0,
    "queue_count": 0,
    "peak_viewers": 0,
    "total_viewers": 0,
    "message": "Speech recognition started"
  }
}

Field	Type	Description
`session_id`	string	Session ID
`recording_id`	string	Recording ID (can be used for subsequent API queries)
`recording_type`	string	Recording type: `transcribe`, `conversation`, `record`, `broadcast`
`recognition_mode`	string	Recognition mode: `single`, `multi_speaker`
`message`	string	Status description message
`phase`	string	Broadcast phase: `standby` or `live` (broadcast mode only)
`viewer_count`	int	Current number of online viewers (broadcast mode only)
`queue_count`	int	Number of viewers waiting in the queue (broadcast mode only)
`peak_viewers`	int	Peak number of viewers for this broadcast (broadcast mode only)
`total_viewers`	int	Total cumulative number of viewers who have connected (broadcast mode only)

result - Recognition/Translation Result

Speech recognition and translation results. A single result event may contain origin (recognition result) and/or translations (translation results).

origin (speech recognition result):

{
  "type": "voice-translation",
  "data": {
    "action": "result",
    "origin": {
      "sid": 1,
      "language": "zh-TW",
      "text": "Hello, nice to meet you",
      "is_final": true,
      "speaker_id": "0",
      "detected_language": "zh-TW",
      "start_time": "00:05"
    }
  }
}

Field	Type	Description
`sid`	int	Sentence number, starting from 1
`language`	string	Source language code. In two-way mode, this is the automatically detected language.
`text`	string	The recognized text
`is_final`	boolean	Whether it is the final result
`speaker_id`	string	Speaker ID
`detected_language`	string	The detected language. In two-way mode, this is determined automatically by the system.
`start_time`	string	Sentence start time (mm:ss); not sent during the broadcast standby phase; after going live, counts from `00:00`.

translations (translation results):

{
  "type": "voice-translation",
  "data": {
    "action": "result",
    "translations": {
      "en-US": {
        "sid": 1,
        "text": "Hello, nice to meet you",
        "is_final": true
      }
    }
  }
}

Translation results are keyed by language code, and each language's translation object contains:

Field	Type	Description
`sid`	int	Sentence number
`text`	string	The translated text
`is_final`	boolean	Whether it is the final result
`is_retranslation`	boolean	Whether it is a retranslation result (only during retranslate)
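Since origin and translations can arrive in separate result events, a client usually needs to merge them by sid. The sketch below shows one way to do that. The names applyResult and getEntry and the shape of the local store are hypothetical, and it assumes that a later event for the same sid (for example, a final result following an interim one) replaces the earlier value.

```javascript
// Hypothetical client-side store: one entry per `sid`, updated as result events arrive.
function getEntry(transcript, sid) {
  if (!transcript.has(sid)) {
    transcript.set(sid, { origin: null, translations: {} });
  }
  return transcript.get(sid);
}

// Merges the `data` object of one result event into the store.
// Assumption: a later result for the same sid overwrites the earlier one.
function applyResult(transcript, data) {
  if (data.origin) {
    getEntry(transcript, data.origin.sid).origin = data.origin;
  }
  for (const [language, translation] of Object.entries(data.translations || {})) {
    getEntry(transcript, translation.sid).translations[language] = translation;
  }
  return transcript;
}

const transcript = new Map();

// The two sample events above, applied in order.
applyResult(transcript, {
  action: "result",
  origin: { sid: 1, language: "zh-TW", text: "Hello, nice to meet you", is_final: true, speaker_id: "0" },
});
applyResult(transcript, {
  action: "result",
  translations: { "en-US": { sid: 1, text: "Hello, nice to meet you", is_final: true } },
});

console.log(transcript.get(1));
```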

status - Generic Status Response

Used to confirm operations such as pause, resume, stop, and set_name.

{
  "type": "voice-translation",
  "data": {
    "action": "status",
    "message": "Speech recognition paused"
  }
}

Field	Type	Description
`message`	string	Status description

task_complete - Task Processing Complete

Triggered after stop when the audio file and transcript have been uploaded. task_id can be used to query task details via the REST API afterward.

{
  "type": "voice-translation",
  "data": {
    "action": "task_complete",
    "task_id": "550e8400-e29b-41d4-a716-446655440000",
    "message": "Task processing complete"
  }
}

Field	Type	Description
`task_id`	string	Recording UUID, can be used for subsequent API queries
`message`	string	Status description

config_updated - Settings Update Complete

Triggered after the config action succeeds.

{
  "type": "voice-translation",
  "data": {
    "action": "config_updated",
    "updated": ["terminology", "fuzzy_correction", "translation_dict"],
    "message": "Settings updated"
  }
}

Field	Type	Description
`updated`	array	The setting types that were updated (`terminology`, `fuzzy_correction`, `translation_dict`)
`message`	string	Status message

tts_ready - TTS Audio Ready

TTS speech synthesis completion event. Contains the audio data and Word Boundary information (which can be used for a karaoke effect).

{
  "type": "voice-translation",
  "data": {
    "action": "tts_ready",
    "sid": 1,
    "language": "en-US",
    "transcript": "你好，很高興認識你",
    "text": "Hello, nice to meet you",
    "audio": "Base64EncodedMP3...",
    "format": "mp3",
    "duration_ms": 2500,
    "boundaries": [
      {"offset_ms": 0, "duration_ms": 350, "text_offset": 0, "word_length": 5, "text": "Hello"},
      {"offset_ms": 350, "duration_ms": 100, "text_offset": 5, "word_length": 1, "text": ","},
      {"offset_ms": 500, "duration_ms": 250, "text_offset": 7, "word_length": 4, "text": "nice"},
      {"offset_ms": 750, "duration_ms": 200, "text_offset": 12, "word_length": 2, "text": "to"},
      {"offset_ms": 950, "duration_ms": 350, "text_offset": 15, "word_length": 4, "text": "meet"},
      {"offset_ms": 1300, "duration_ms": 300, "text_offset": 20, "word_length": 3, "text": "you"}
    ]
  }
}

Field	Type	Description
`sid`	int	Sentence number
`language`	string	TTS language
`transcript`	string	The original transcript (STT recognition result)
`text`	string	The translated text (TTS synthesis source)
`audio`	string	Base64-encoded MP3 audio
`format`	string	Audio format (fixed value `mp3`)
`duration_ms`	int	Total audio duration (milliseconds)
`boundaries`	array	Array of Word Boundaries

Word Boundary Field Descriptions

Field	Type	Description
`offset_ms`	int	The word's start time in the audio (milliseconds)
`duration_ms`	int	The word's duration (milliseconds)
`text_offset`	int	Start position of the word in `text` (character index)
`word_length`	int	Word length (number of characters)
`text`	string	The word content
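The karaoke effect mentioned above can be sketched by locating the boundary that covers the current playback position and splitting text around it. The helper names activeBoundaryIndex and splitAtBoundary are hypothetical, and the sketch assumes that text_offset and word_length can be used directly as JavaScript string indices.

```javascript
// Hypothetical helper: index of the word being spoken at `positionMs`, or -1
// when playback is in a gap between words or outside the listed boundaries.
function activeBoundaryIndex(boundaries, positionMs) {
  return boundaries.findIndex(
    (b) => positionMs >= b.offset_ms && positionMs < b.offset_ms + b.duration_ms
  );
}

// Splits `text` into the parts before, inside, and after one boundary.
// Assumption: `text_offset` and `word_length` map directly to string indices.
function splitAtBoundary(text, boundary) {
  const start = boundary.text_offset;
  const end = start + boundary.word_length;
  return {
    before: text.slice(0, start),
    word: text.slice(start, end),
    after: text.slice(end),
  };
}

// Values copied from the tts_ready sample above.
const text = "Hello, nice to meet you";
const boundaries = [
  { offset_ms: 0, duration_ms: 350, text_offset: 0, word_length: 5, text: "Hello" },
  { offset_ms: 350, duration_ms: 100, text_offset: 5, word_length: 1, text: "," },
  { offset_ms: 500, duration_ms: 250, text_offset: 7, word_length: 4, text: "nice" },
  { offset_ms: 750, duration_ms: 200, text_offset: 12, word_length: 2, text: "to" },
  { offset_ms: 950, duration_ms: 350, text_offset: 15, word_length: 4, text: "meet" },
  { offset_ms: 1300, duration_ms: 300, text_offset: 20, word_length: 3, text: "you" },
];

const index = activeBoundaryIndex(boundaries, 600); // 600 ms falls inside "nice"
console.log(index, splitAtBoundary(text, boundaries[index]));
```

In a browser, positionMs would typically come from the audio element's currentTime multiplied by 1000, and the three returned parts can be rendered with the middle part highlighted.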

tts_error - TTS Synthesis Failed

TTS synthesis failure event.

{
  "type": "voice-translation",
  "data": {
    "action": "tts_error",
    "sid": 1,
    "language": "en-US",
    "error": "translation_not_found",
    "message": "No translation available for language: en-US"
  }
}

Field	Type	Description
`sid`	int	Sentence number
`language`	string	TTS language
`error`	string	Error code
`message`	string	Error message

TTS Error Codes

Error Code	Description
`translation_not_found`	No translation found for that language
`tts_synthesis_failed`	TTS synthesis failed
`tts_quota_exceeded`	TTS usage has reached the limit

viewer_count - Viewer Count Update

Broadcast mode only

During a broadcast, the system checks the viewer count every 3 seconds and pushes this event to the host if it changes.

{
  "type": "voice-translation",
  "data": {
    "action": "viewer_count",
    "viewer_count": 45,
    "queue_count": 8,
    "peak_viewers": 50,
    "total_viewers": 123
  }
}

Field	Type	Description
`viewer_count`	int	Current number of online viewers
`queue_count`	int	Number of viewers waiting in the queue
`peak_viewers`	int	Peak number of viewers for this broadcast
`total_viewers`	int	Total cumulative number of viewers who have connected

Note: This event is pushed only when the viewer count or queue count changes, to avoid unnecessary message traffic.

viewer_joined - Viewer Joined

Broadcast mode only

When a viewer joins the broadcast, the host receives this event.

{
  "type": "voice-translation",
  "data": {
    "action": "viewer_joined",
    "viewer_count": 5,
    "queue_count": 2
  }
}

Field	Type	Description
`viewer_count`	int	Current number of viewers
`queue_count`	int	Number waiting in the queue

viewer_left - Viewer Left

Broadcast mode only

When a viewer leaves the broadcast, the host receives this event.

{
  "type": "voice-translation",
  "data": {
    "action": "viewer_left",
    "viewer_count": 4,
    "queue_count": 1
  }
}

Field	Type	Description
`viewer_count`	int	Current number of viewers
`queue_count`	int	Number waiting in the queue

broadcast_phase_changed - Broadcast Phase Changed

Triggered when the broadcast phase switches from standby to live.

{
  "type": "voice-translation",
  "data": {
    "action": "broadcast_phase_changed",
    "phase": "live",
    "message": "Broadcast started"
  }
}

Field	Type	Description
`phase`	string	The new phase: `standby` or `live`
`message`	string	Status description message

speaker_renamed - Speaker Renamed

Speaker global rename completion event.

{
  "type": "voice-translation",
  "data": {
    "action": "speaker_renamed",
    "speaker_id": "Guest-1",
    "new_label": "Manager Wang",
    "affected_sids": [1, 3, 5, 8]
  }
}

Field	Type	Description
`speaker_id`	string	The resolved original speaker ID (even if the input was a display label, the event returns the original ID)
`new_label`	string	The new display label
`affected_sids`	array	The list of affected sentence numbers

speaker_reassigned - Speaker Identity Changed

Single-sentence speaker identity change completion event.

{
  "type": "voice-translation",
  "data": {
    "action": "speaker_reassigned",
    "sid": 5,
    "old_speaker_id": "Guest-1",
    "new_speaker_id": "Guest-2",
    "new_speaker_label": "Lee Hsiao-hua"
  }
}

Field	Type	Description
`sid`	int	The changed sentence number
`old_speaker_id`	string	The original ID of the previous speaker
`new_speaker_id`	string	The original ID of the new speaker
`new_speaker_label`	string	The new speaker display label (after applying `speaker_aliases`; equals `new_speaker_id` when no alias exists)

speakers_merged - Speakers Merged

Speaker merge completion event. After the merge, future recognition results from the source speaker are automatically attributed to the target speaker.

{
  "type": "voice-translation",
  "data": {
    "action": "speakers_merged",
    "source_speaker_id": "Guest-2",
    "target_speaker_id": "Guest-1",
    "target_speaker_label": "Manager Wang",
    "affected_sids": [3, 5, 7]
  }
}

Field	Type	Description
`source_speaker_id`	string	The original ID of the merged speaker
`target_speaker_id`	string	The original ID of the merge target
`target_speaker_label`	string	The target speaker display label (after applying `speaker_aliases`; equals the original ID when no alias exists)
`affected_sids`	array	The list of affected sentence numbers
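The three speaker events (speaker_renamed, speaker_reassigned, speakers_merged) all describe changes to sentences the client has already rendered. The sketch below applies them to a local store keyed by sid. The store shape (speaker_id and speaker_label per sentence) and the names applySpeakerEvent and patchSentence are assumptions made for illustration.

```javascript
// Hypothetical helper: patch one locally stored sentence, ignoring unknown sids.
function patchSentence(sentences, sid, patch) {
  const sentence = sentences.get(sid);
  if (sentence) {
    Object.assign(sentence, patch);
  }
}

// Applies the `data` object of one speaker event to the local store.
// Returns false for actions this sketch does not handle.
function applySpeakerEvent(sentences, data) {
  switch (data.action) {
    case "speaker_renamed":
      for (const sid of data.affected_sids) {
        patchSentence(sentences, sid, { speaker_label: data.new_label });
      }
      return true;
    case "speaker_reassigned":
      patchSentence(sentences, data.sid, {
        speaker_id: data.new_speaker_id,
        speaker_label: data.new_speaker_label,
      });
      return true;
    case "speakers_merged":
      for (const sid of data.affected_sids) {
        patchSentence(sentences, sid, {
          speaker_id: data.target_speaker_id,
          speaker_label: data.target_speaker_label,
        });
      }
      return true;
    default:
      return false;
  }
}

// Local store with two sentences; sid 7 from the sample is not present locally.
const sentences = new Map([
  [3, { speaker_id: "Guest-2", speaker_label: "Guest-2" }],
  [5, { speaker_id: "Guest-1", speaker_label: "Guest-1" }],
]);

applySpeakerEvent(sentences, {
  action: "speakers_merged",
  source_speaker_id: "Guest-2",
  target_speaker_id: "Guest-1",
  target_speaker_label: "Manager Wang",
  affected_sids: [3, 5, 7],
});
console.log(sentences.get(3), sentences.get(5));
```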

language_switch_start - Language Switch Started

Language switch start event, sent after the switch_language action is triggered.

{
  "type": "voice-translation",
  "data": {
    "action": "language_switch_start",
    "translation_language": "ja-JP",
    "total_segments": 15,
    "message": "Starting language switch and retranslation"
  }
}

Field	Type	Description
`translation_language`	string	The new translation target language
`total_segments`	int	The number of sentences that need retranslation
`message`	string	Status description

batch_retranslation - Batch Retranslation Result

Batch retranslation result event, sent sentence by sentence during the language switch process.

{
  "type": "voice-translation",
  "data": {
    "action": "batch_retranslation",
    "sid": 3,
    "translations": {
      "ja-JP": {
        "sid": 3,
        "text": "今日はプロジェクトの進捗について話し合いましょう",
        "is_final": true,
        "is_retranslation": true
      }
    }
  }
}

Field	Type	Description
`sid`	int	Sentence number
`translations`	object	Translation results (same format as result's translations)

language_switch_done - Language Switch Complete

Language switch completion event.

{
  "type": "voice-translation",
  "data": {
    "action": "language_switch_done",
    "translation_language": "ja-JP",
    "success_count": 15,
    "failed_count": 0,
    "message": "Language switch complete"
  }
}

Field	Type	Description
`translation_language`	string	The translation target language
`success_count`	int	The number of successfully translated sentences
`failed_count`	int	The number of sentences that failed to translate
`message`	string	Status description
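The language switch sequence (language_switch_start, then one batch_retranslation per sentence, then language_switch_done) lends itself to a progress indicator. The following sketch tracks that sequence. The name createSwitchTracker and the returned object shape are hypothetical, and it assumes exactly one batch_retranslation event arrives per retranslated sentence.

```javascript
// Hypothetical tracker for the language switch sequence.
// Assumption: one batch_retranslation event is sent per retranslated sentence.
function createSwitchTracker() {
  let total = 0;
  let received = 0;
  let done = false;
  let failed = 0;
  return {
    // Feed the `data` object of each incoming event to this method.
    handle(data) {
      if (data.action === "language_switch_start") {
        total = data.total_segments;
        received = 0;
        failed = 0;
        done = false;
      } else if (data.action === "batch_retranslation") {
        received += 1;
      } else if (data.action === "language_switch_done") {
        failed = data.failed_count;
        done = true;
      }
    },
    // Returns a snapshot suitable for driving a progress bar.
    progress() {
      const ratio = done ? 1 : total > 0 ? Math.min(received / total, 1) : 0;
      return { received, total, failed, done, ratio };
    },
  };
}

const tracker = createSwitchTracker();
tracker.handle({ action: "language_switch_start", translation_language: "ja-JP", total_segments: 15 });
for (let sid = 1; sid <= 3; sid += 1) {
  tracker.handle({ action: "batch_retranslation", sid, translations: {} });
}
console.log(tracker.progress());
```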

tts_mode_changed - TTS Mode Changed

TTS playback mode change event.

{
  "type": "voice-translation",
  "data": {
    "action": "tts_mode_changed",
    "tts_mode": "async"
  }
}

Field	Type	Description
`tts_mode`	string	The new mode: `sync` or `async`

language_switched - Two-Way Language Switch Complete

Two-way mode (conversation) language switch completion event. Triggered after switch_language successfully switches the STT source language in two-way mode.

{
  "type": "voice-translation",
  "data": {
    "action": "language_switched",
    "language": "en-US",
    "translation_language": "zh-TW",
    "message": "Language switched"
  }
}

Field	Type	Description
`language`	string	The new active language (STT source)
`translation_language`	string	The new translation target language
`message`	string	Status message

tts_updated - Two-Way TTS Settings Updated

Two-way mode (conversation) TTS settings update event. Triggered after set_tts successfully updates the TTS toggle or voice settings.

{
  "type": "voice-translation",
  "data": {
    "action": "tts_updated",
    "tts_enabled": true,
    "tts_config": {
      "zh-TW": { "voice": "zh-TW-HsiaoChenNeural", "speaking_rate": 1.0 },
      "en-US": { "voice": "en-US-GuyNeural", "speaking_rate": 1.2 }
    }
  }
}

Field	Type	Description
`tts_enabled`	boolean	Whether TTS is enabled
`tts_config`	object	The TTS settings for each language (voice, speaking_rate)

conversation_mode_changed - Conversation Mode Changed

Conversation mode change event for two-way mode (conversation). Triggered after switch_conversation_mode successfully switches between auto and manual mode.

{
  "type": "voice-translation",
  "data": {
    "action": "conversation_mode_changed",
    "conversation_mode": "manual"
  }
}

Field	Type	Description
`conversation_mode`	string	The new conversation mode: `auto` or `manual`

speaker_language_changed - User Language Changed

Two-way mode (conversation) user language change event. Triggered after set_speaker_language successfully changes a user's language. The event contains the complete language mapping after the change.

{
  "type": "voice-translation",
  "data": {
    "action": "speaker_language_changed",
    "speaker_language_map": {
      "1": "ja-JP",
      "2": "en-US"
    }
  }
}

Field	Type	Description
`speaker_language_map`	object	The user language mapping after the change (keys are user number strings)

segment_uploaded - Audio Segment Upload Complete

Audio segment upload completion event. Triggered each time an audio segment is successfully uploaded to cloud storage; can be used to show upload progress on the frontend.

{
  "type": "voice-translation",
  "data": {
    "action": "segment_uploaded",
    "segment_index": 0,
    "duration_sec": 30.5
  }
}

Field	Type	Description
`segment_index`	int	Segment index (starting from 0)
`duration_sec`	number	The duration of this segment (seconds)
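As a minimal sketch of the upload progress display mentioned above, a client can accumulate the uploaded duration per segment. The name createUploadTracker is hypothetical. Keying by segment_index so that a repeated event for the same segment is not counted twice is a defensive assumption, since this document does not state whether events can repeat.

```javascript
// Hypothetical tracker: records the duration of each uploaded segment.
// Keying by segment_index keeps the total correct if an event is seen twice.
function createUploadTracker() {
  const durations = new Map();
  return {
    // Feed the `data` object of each segment_uploaded event to this method.
    handle(data) {
      durations.set(data.segment_index, data.duration_sec);
    },
    // Returns the number of uploaded segments and their combined duration.
    summary() {
      let totalSec = 0;
      for (const sec of durations.values()) {
        totalSec += sec;
      }
      return { segments: durations.size, totalSec };
    },
  };
}

const uploads = createUploadTracker();
uploads.handle({ action: "segment_uploaded", segment_index: 0, duration_sec: 30.5 });
uploads.handle({ action: "segment_uploaded", segment_index: 1, duration_sec: 30.5 });
console.log(uploads.summary());
```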

stt_event - STT Connection Status Event

STT connection status event. Triggered when the connection status of the speech recognition service changes; can be used to show the STT service status on the frontend.

{
  "type": "voice-translation",
  "data": {
    "action": "stt_event",
    "event": "connected",
    "message": "STT service connected"
  }
}

Field	Type	Description
`event`	string	Event type: `connected` / `disconnected` / `error`
`message`	string	Event description message

error - Error Event

Triggered when an operation fails or a system anomaly occurs.

{
  "type": "error",
  "data": {
    "error_code": "session_not_started",
    "severity": "error",
    "message": "Session not started",
    "context": "voice-translation",
    "request_id": "req_abc123xyz789",
    "timestamp": "2026-01-15T10:30:45.123Z"
  }
}

Field	Type	Description
`error_code`	string	Error code (for programmatic handling)
`severity`	string	Severity: `fatal` / `error` / `warning`
`message`	string	Human-readable error message
`context`	string	Error source category
`request_id`	string	Request tracking ID
`timestamp`	string	Time the error occurred (ISO 8601)

Severity Descriptions

severity	Description	Recommended Action
`fatal`	Fatal error	Stop the service and require reconnection
`error`	Operation failed	Show an error notice and allow retry
`warning`	Warning	Show a warning without blocking the operation
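The severity table can be turned into a small decision function on the client. The sketch below maps each severity to the recommended action. The name planErrorHandling and the returned field names are hypothetical, and treating an unrecognized severity like error is an assumption of this example.

```javascript
// Hypothetical mapping from an error event's `data` object to a UI decision,
// following the recommended actions in the severity table.
function planErrorHandling(errorData) {
  switch (errorData.severity) {
    case "fatal":
      // Stop the service and require reconnection.
      return { notice: "error", blockOperation: true, allowRetry: false, requireReconnect: true };
    case "warning":
      // Show a warning without blocking the operation.
      return { notice: "warning", blockOperation: false, allowRetry: false, requireReconnect: false };
    case "error":
    default:
      // Show an error notice and allow retry.
      // Assumption: unknown severities are handled like "error".
      return { notice: "error", blockOperation: true, allowRetry: true, requireReconnect: false };
  }
}

// Sample error event data from above.
const plan = planErrorHandling({
  error_code: "session_not_started",
  severity: "error",
  message: "Session not started",
  request_id: "req_abc123xyz789",
});
console.log(plan);
```

Logging request_id alongside error_code when an error is reported can make it easier to trace the failing request.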

For the full list of error codes, refer to Error Code Reference.

Version: V1.5.7 Last Updated: 2026-05-20

details Fields

What the Frontend Should Do

What Will Not Happen (Guarantees)

Client Handling Example

`details` Fields