API Docs

WebSocket API

Note: This is a consolidated document. For detailed specifications, refer to the individual documents under reference/websocket/.

Note: The URL used in this document (vas-poc.vurbo.ai) is the planned deployment address. A separate notice will be issued after the official launch.


Table of Contents

  1. Connection Info
  2. Authentication
  3. Message Format
  4. Health - Heartbeat Service
  5. Voice Translation - start
  6. Voice Translation - config
  7. Voice Translation - audio
  8. Voice Translation - pause
  9. Voice Translation - resume
  10. Voice Translation - stop
  11. Voice Translation - retranslate
  12. Voice Translation - switch_language
  13. Voice Translation - set_name
  14. Voice Translation - rename_speaker
  15. Voice Translation - reassign_speaker
  16. Voice Translation - merge_speakers
  17. Voice Translation - tts_play
  18. Voice Translation - tts_stop
  19. Voice Translation - tts_mode
  20. Voice Translation - set_tts
  21. Voice Translation - start_speaking
  22. Voice Translation - stop_speaking
  23. Voice Translation - switch_conversation_mode
  24. Voice Translation - set_speaker_language
  25. Voice Translation - broadcast_go_live
  26. Voice Translation - broadcast_announcement
  27. Voice Translation - set_standby_message
  28. Response Events

Connection Info

| Item | Value |
| --- | --- |
| Endpoint | wss://vas-poc.vurbo.ai/ws |
| Protocol | WebSocket |
| Data Format | JSON |
| Auth Method | Ticket (see below) |

Authentication

The VAS WebSocket uses a Ticket mechanism for authentication, passing a one-time Ticket via Sec-WebSocket-Protocol. For details, refer to Authentication.

Step 1: Obtain a Ticket

Exchange your API Key for a one-time Ticket via the REST API:

POST /api/v1/auth/ticket
X-API-Key: vas_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Response:

{
  "ticket": "aBcDeFgHiJkLmNoPqRsTuVwXyZ012345",
  "expires_in": 60
}

| Field | Type | Description |
| --- | --- | --- |
| ticket | string | One-time Ticket (32 chars) |
| expires_in | int | Validity period (seconds) |
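
A minimal sketch of Step 1 in JavaScript. It assumes the REST API is served from the same host as the WebSocket endpoint (this document only states the path, so BASE_URL is an assumption) and that the runtime provides a global fetch (browsers, Node.js 18+). The helper names buildTicketRequest and fetchTicket are illustrative, not part of the API.

```javascript
// Assumption: the REST API shares the WebSocket host; adjust BASE_URL if it does not.
const BASE_URL = 'https://vas-poc.vurbo.ai';

// Build the request separately so it can be inspected without network access.
function buildTicketRequest(apiKey, baseUrl = BASE_URL) {
  return {
    url: `${baseUrl}/api/v1/auth/ticket`,
    init: { method: 'POST', headers: { 'X-API-Key': apiKey } },
  };
}

// Exchange the API Key for a one-time Ticket.
async function fetchTicket(apiKey, baseUrl = BASE_URL) {
  const { url, init } = buildTicketRequest(apiKey, baseUrl);
  const res = await fetch(url, init);
  if (!res.ok) {
    throw new Error(`Ticket request failed with HTTP ${res.status}`);
  }
  const { ticket, expires_in } = await res.json();
  return { ticket, expiresIn: expires_in };
}
```

Because the Ticket is single-use and short-lived, it is probably simplest to request it immediately before opening the WebSocket rather than caching it.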

Step 2: Connect to the WebSocket using the Ticket

Place the Ticket into Sec-WebSocket-Protocol in the format ticket.{TICKET_VALUE}:

// Native browser support
const ws = new WebSocket('wss://vas-poc.vurbo.ai/ws', [`ticket.${ticket}`]);

ws.onopen = () => {
  console.log('Connected! Protocol:', ws.protocol);
  // Start using the WebSocket...
};

ws.onerror = (error) => {
  console.error('Connection failed:', error);
};

Node.js example:

const WebSocket = require('ws');

const ws = new WebSocket('wss://vas-poc.vurbo.ai/ws', [`ticket.${ticket}`]);

Ticket Characteristics

| Characteristic | Description |
| --- | --- |
| Validity period | 60 seconds |
| Usage count | Can be used only once (deleted immediately after) |
| Security | The API Key is never exposed in the WebSocket connection |
| Replay protection | Uses an atomic operation to guarantee single use |

Ticket Error Codes

| Error Code | HTTP Status | Description |
| --- | --- | --- |
| ticket_invalid | 401 | Ticket invalid or expired |
| ticket_expired | 401 | Ticket expired |
| ticket_already_used | 401 | Ticket already used |
| ticket_validation_failed | 500 | Ticket validation failed |
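
One way a client might react to these codes is sketched below. The action names and the mapping itself are assumptions derived from the table (a 401 means the Ticket can no longer be used, so a fresh one is needed; a 500 suggests a server-side problem worth retrying later). They are not behavior prescribed by this document.

```javascript
// Hypothetical client-side actions for each Ticket error code.
const TICKET_ERROR_ACTIONS = new Map([
  ['ticket_invalid', 'request_new_ticket'],
  ['ticket_expired', 'request_new_ticket'],
  ['ticket_already_used', 'request_new_ticket'],
  ['ticket_validation_failed', 'retry_later'],
]);

// Returns the suggested action, or 'unknown' for codes not in the table.
function actionForTicketError(errorCode) {
  return TICKET_ERROR_ACTIONS.get(errorCode) ?? 'unknown';
}
```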

For the full API specification, refer to Auth Ticket API.


Message Format

All messages use a unified nested structure:

{
  "type": "service type",
  "data": { ... }
}

Service Types

| type | Description |
| --- | --- |
| health | Heartbeat mechanism |
| voice-translation | Voice translation service |
| error | Error message |
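
The envelope can be wrapped in two small helpers, sketched here. The function names makeMessage and parseMessage are illustrative; the only facts taken from this document are the { type, data } shape and the three service types.

```javascript
// Service types listed in the table above.
const SERVICE_TYPES = new Set(['health', 'voice-translation', 'error']);

// Serialize an outgoing message in the unified nested structure.
function makeMessage(type, data) {
  if (!SERVICE_TYPES.has(type)) {
    throw new Error(`Unknown service type: ${type}`);
  }
  return JSON.stringify({ type, data });
}

// Parse an incoming frame and reject anything that is not a known envelope.
function parseMessage(raw) {
  const msg = JSON.parse(raw);
  if (typeof msg !== 'object' || msg === null || !SERVICE_TYPES.has(msg.type)) {
    throw new Error('Unexpected message envelope');
  }
  return msg;
}
```

For example, ws.send(makeMessage('health', { action: 'ping' })) would send the heartbeat request described later in this document.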

Error Message Format

When an error occurs, the server returns a message with type: "error":

{
  "type": "error",
  "data": {
    "error_code": "auth_invalid_api_key",
    "severity": "fatal",
    "message": "Invalid API key",
    "context": "auth",
    "request_id": "req_abc123xyz789",
    "timestamp": "2026-01-15T10:30:45.123Z"
  }
}

Sentence-level errors (such as a translation failure for one language of a sentence) additionally carry sid and details:

{
  "type": "error",
  "data": {
    "error_code": "llm_content_filtered",
    "severity": "warning",
    "message": "Content filtered",
    "context": "translation",
    "sid": 5,
    "request_id": "req_abc123xyz789",
    "timestamp": "2026-01-15T10:30:45.123Z",
    "details": {
      "provider": "azure_openai",
      "translation_language": "ja"
    }
  }
}

Session-level translation service errors (escalated after consecutive failures reach a threshold) do not carry sid. The frontend should display a global notice but does not need to disconnect:

{
  "type": "error",
  "data": {
    "error_code": "translation_service_unavailable",
    "severity": "error",
    "message": "Translation service unavailable",
    "context": "translation",
    "request_id": "req_abc123xyz789",
    "timestamp": "2026-01-15T10:30:45.123Z",
    "details": {
      "provider": "azure_openai",
      "last_error_code": "llm_provider_error",
      "fail_count": 5
    }
  }
}

For the full trigger rules (consecutive failure threshold, error code classification), refer to the translation_service_unavailable section in Error Code Reference.
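
The three error shapes above can be told apart on the client roughly as follows. This is a sketch: the scope labels are made up for illustration, and the rule "disconnect only on fatal" follows the guidance in this section rather than a complete specification.

```javascript
// Classify an error payload (msg.data) by scope and decide whether to disconnect.
function classifyError(data) {
  // Sentence-level errors carry sid; everything else is treated as a global notice.
  const scope = typeof data.sid === 'number' ? 'sentence' : 'global';
  return {
    scope,
    sid: scope === 'sentence' ? data.sid : null,
    // Assumption: only severity "fatal" requires closing the connection.
    disconnect: data.severity === 'fatal',
  };
}
```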

Single-Message Error (per-message panic recovered)

When the server encounters an internal error (panic) while handling a single WebSocket message (such as set_name, switch_language, or tts_play), it returns internal_error. This error indicates only that the specific message failed to process; the connection is not terminated. The frontend should keep the connection open and may retry the operation:

{
  "type": "error",
  "data": {
    "error_code": "internal_error",
    "severity": "error",
    "message": "Internal server error",
    "context": "general",
    "request_id": "req_abc123xyz789",
    "timestamp": "2026-05-08T10:30:45.123Z",
    "details": {
      "message_type": "voice-translation",
      "action": "set_name"
    }
  }
}

details Fields

| Field | Type | Description |
| --- | --- | --- |
| message_type | string | Service type: voice-translation / health |
| action | string | (Optional) The specific operation that failed, such as set_name, switch_language, tts_play, tts_mode, retranslate, config, speaker.rename, etc. This field is absent when the message payload has no action field (such as a plain init message). |

What the Frontend Should Do
  1. Keep the WebSocket connection open: Do not call ws.close(), navigate away, or return to a history page because of this error. The recording is still in progress.
  2. Decide on follow-up handling based on details.action:
    | Scenario | Recommended Action |
    | --- | --- |
    | Idempotent operations such as set_name / switch_language / tts_mode / config | Simply resend the same message. These operations use a "last write wins" approach, so retrying has no side effects. |
    | tts_play / tts_stop / retranslate | Usually safe to retry directly. If the user is waiting for TTS playback, consider showing a transient toast indicating the retry is in progress. |
    | speaker.rename / speaker.merge | Before retrying, use the REST API (speakers) to confirm the current DB state and avoid duplicate operations (for example, the rename already succeeded and only the response frame failed). |
    | details.action is absent | The server panicked after parsing the message payload and cannot infer the specific operation. The frontend can infer it from "the most recent message the user sent," or display a generic error message such as "Operation failed, please retry." |
  3. User experience: Show a transient toast / inline error. Do not interrupt the user flow with a modal or a redirect.
  4. Telemetry / reporting: Report request_id + details to your frontend error tracking (Sentry, Datadog, etc.) to make it easier to correlate with backend logs during troubleshooting.
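
The scenario table in step 2 can be expressed as a small lookup, sketched below. The strategy labels (resend, retry_with_toast, verify_then_retry, generic_error) are illustrative names, and treating unlisted actions as generic_error is an assumption.

```javascript
// Action groups taken from the scenario table above.
const RESEND = new Set(['set_name', 'switch_language', 'tts_mode', 'config']);
const RETRY = new Set(['tts_play', 'tts_stop', 'retranslate']);
const VERIFY_FIRST = new Set(['speaker.rename', 'speaker.merge']);

// Map details.action from an internal_error to a follow-up strategy.
function retryStrategy(action) {
  if (RESEND.has(action)) return 'resend';             // idempotent, last write wins
  if (RETRY.has(action)) return 'retry_with_toast';    // usually safe to retry
  if (VERIFY_FIRST.has(action)) return 'verify_then_retry'; // check DB state via REST first
  return 'generic_error';                              // action absent or not listed
}
```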
What Will Not Happen (Guarantees)
  • The recording will not be interrupted: segment_uploaded, result, origin, and other messages keep arriving.
  • The connection will not be actively closed by the server.
  • The session state will not be reset (session_id stays the same).
  • State already written to the DB will not be rolled back (for example, if set_name was written to the DB successfully and only the response frame failed, the name still takes effect).
Client Handling Example
ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type !== 'error') {
    handleNormalMessage(msg);
    return;
  }

  const { error_code, severity, request_id, details } = msg.data;

  // Single-message panic: keep the connection, decide whether to retry based on action
  if (error_code === 'internal_error') {
    console.warn('[ws] message handler panic recovered', {
      request_id,
      message_type: details?.message_type,
      action: details?.action,
    });
    showTransientToast(`Failed to process "${details?.action ?? 'operation'}", please retry`);
    // Note: do not call ws.close() and do not navigate away from the current page
    return;
  }

  // Handle other errors with your existing logic (only fatal errors require disconnecting)
  handleErrorBySeverity(severity, msg.data);
};

Error Message Field Descriptions:

| Field | Type | Description |
| --- | --- | --- |
| error_code | string | Error code (for programmatic handling) |
| severity | string | Severity: fatal / error / warning |
| message | string | Human-readable error message |
| context | string | Error source category |
| sid | int | Optional. The sentence number for sentence-level errors (such as a translation failure); absent for non-sentence-level errors |
| request_id | string | Request tracking ID |
| timestamp | string | Time the error occurred (ISO 8601) |
| details | object | Optional. Error context; common keys: provider, translation_language, source_lang, etc. |

For the full list of error codes, refer to Error Code Reference.


Health (Heartbeat Service)

Description

Used to confirm that the WebSocket connection is healthy. We recommend sending a ping every 30 seconds; if no pong is received, treat the connection as dropped and reconnect.

Use Cases

  • Maintaining a long-lived connection
  • Detecting connection status
  • Preventing connection timeouts

Request - Ping

{
  "type": "health",
  "data": {
    "action": "ping"
  }
}

Response - Pong

{
  "type": "health",
  "data": {
    "action": "pong"
  }
}
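
A heartbeat along these lines could be tracked with a small class, sketched below. The class name, the send and onDead callbacks, and the rule "a ping still unanswered at the next interval means the connection is dead" are assumptions; this document only recommends the 30-second ping interval.

```javascript
// Minimal heartbeat tracker. `send` and `onDead` are caller-supplied callbacks.
class Heartbeat {
  constructor(send, onDead, intervalMs = 30000) {
    this.send = send;
    this.onDead = onDead;
    this.intervalMs = intervalMs;
    this.awaitingPong = false;
    this.timer = null;
  }

  start() {
    this.timer = setInterval(() => this.tick(), this.intervalMs);
  }

  stop() {
    clearInterval(this.timer);
    this.timer = null;
    this.awaitingPong = false;
  }

  // Runs once per interval: report a dead connection if the previous ping
  // is still unanswered, otherwise send the next ping.
  tick() {
    if (this.awaitingPong) {
      this.stop();
      this.onDead();
      return;
    }
    this.awaitingPong = true;
    this.send(JSON.stringify({ type: 'health', data: { action: 'ping' } }));
  }

  // Feed every parsed incoming message here; returns true when it was a pong.
  handleMessage(msg) {
    if (msg.type === 'health' && msg.data?.action === 'pong') {
      this.awaitingPong = false;
      return true;
    }
    return false;
  }
}
```

A typical wiring would create new Heartbeat((m) => ws.send(m), reconnect), call start() in ws.onopen, pass parsed messages to handleMessage() in ws.onmessage, and call stop() in ws.onclose, where reconnect is your own function that obtains a fresh Ticket and opens a new connection.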

Voice Translation - start (Start Voice Translation)

Description

Starts a new voice translation session and begins processing audio according to the configured parameters.

Use Cases

  • Starting a meeting record
  • Starting real-time translation
  • Starting a voice memo

Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| action | string | Yes | Fixed value start |
| transcription_languages | array | Yes | Speech recognition languages (array of language codes, up to 2) |
| translation_languages | array | No | Translation target languages (array of language codes; empty = no translation) |
| realtime_translation | boolean | No | Real-time translation mode (default false) |
| recognition_mode | string | No | Recognition mode: single (single speaker, default), multi_speaker (multiple speakers). Under multi_speaker, transcription_languages must contain exactly 1 language; otherwise the server returns a diarization_multilang_conflict error and refuses to start. |
| type | string | Yes | Recording type: transcribe, conversation, record, broadcast |
| audio_format | string | No | Audio format: pcm (default), webm |
| summary_template | string | Conditional | Summary template. Required for transcribe when summary_mode=builtin; forbidden when summary_mode=custom; optional for conversation/broadcast. |
| options | object | No | Speech recognition options |
| tts_enabled | boolean | No | Whether to enable TTS speech synthesis (default false) |
| tts_language | string | No | TTS output language (must be in translation_languages) |
| tts_voice | string | No | TTS voice name (such as en-US-JennyNeural) |
| tts_mode | string | No | TTS playback mode: sync (synchronous, default), async (asynchronous) |
| broadcast_token | string | Conditional | Broadcast token (required for the broadcast type, obtained from the REST API) |
| active_language | string | No | Initial active language for two-way mode (default transcription_languages[0]) |
| speakers | array | Conditional | User-to-language mapping for two-way mode (required in two-way mode, exactly 2 users) |
| conversation_mode | string | No | Two-way conversation mode: auto (auto-detect, default), manual (push-to-talk) |
| speaker_diarization | boolean | No | Speaker diarization (forcibly ignored in two-way mode) |
| tts_config | object | No | Multi-language TTS settings (applies to both broadcast mode and two-way mode) |
| broadcast_phase | string | No | Initial broadcast phase: standby, live (default) |
| standby_message | string | No | The message viewers see during the standby phase (default: "Getting ready, please wait...") |
| name | string | No | Initial default recording name (max 60 chars; the system may still override it; if not provided, auto-generated such as Transcription #1) |
| summary_language | string | No | Summary output language (defaults to the recognition language when unspecified; in broadcast mode, read automatically from the channel settings) |
| summary_mode | string | No | Summary mode enum: builtin (default) / custom. Inferred as builtin when omitted. |
| summary_prompt | string | Conditional | Required in custom mode; supplemental instructions in builtin mode. <= 2000 characters. |
| summary_prompt_slug | string | Conditional | Required in custom mode; forbidden in builtin mode. Your own identifier (<= 64 characters, Unicode, no control characters; passed through and stored in the backend record for historical lookup). |
| summary_plain_text | boolean | No | Request plain-text summary output (default false; when enabled, the backend performs Markdown post-processing). |

Recording Type Descriptions

| type | Description | Use Cases |
| --- | --- | --- |
| transcribe | Speech-to-text | Meeting minutes, interview notes |
| conversation | Conversation record | Two-way communication, customer service conversations |
| record | Plain recording | Voice memos, quick notes |
| broadcast | Broadcast/live | Lectures, talks, live content |
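
Some of the parameter rules can be checked on the client before sending start, as in the sketch below. The function name validateStart is illustrative, the returned strings reuse error codes that appear in this section, and the checks are a partial approximation of server-side validation, not a substitute for it. For broadcast, the languages are assumed to come from the channel settings, so they are not required here.

```javascript
const RECORDING_TYPES = ['transcribe', 'conversation', 'record', 'broadcast'];

// Returns a list of error codes the server would likely report for this start payload.
function validateStart(data) {
  const errors = [];
  const langs = data.transcription_languages ?? [];

  if (!RECORDING_TYPES.includes(data.type)) errors.push('invalid_recording_type');

  if (data.type === 'broadcast') {
    if (!data.broadcast_token) errors.push('broadcast_token_required');
  } else if (langs.length === 0) {
    errors.push('missing_transcription_languages');
  }

  if (langs.length > 2) errors.push('too_many_languages');

  // multi_speaker requires exactly one recognition language.
  if (data.recognition_mode === 'multi_speaker' && langs.length > 1) {
    errors.push('diarization_multilang_conflict');
  }

  // tts_language must be one of the translation targets.
  const targets = data.translation_languages ?? [];
  if (data.tts_language && !targets.includes(data.tts_language)) {
    errors.push('tts_invalid_language');
  }
  return errors;
}
```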

Request Example (Basic)

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "transcription_languages": ["zh-TW"],
    "translation_languages": ["en-US"],
    "realtime_translation": false,
    "type": "transcribe",
    "audio_format": "pcm",
    "summary_template": "meeting",
    "options": {
      "speaking_speed": "normal",
      "segmentation_mode": "auto",
      "profanity_handling": "mask"
    }
  }
}

Request Example (Initial Default Name)

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "transcription_languages": ["zh-TW"],
    "translation_languages": ["en-US"],
    "type": "transcribe",
    "audio_format": "pcm",
    "summary_template": "meeting",
    "name": "Product Planning Meeting"
  }
}

Recording Name Rules

| Scenario | Name | name_source | System Override? |
| --- | --- | --- | --- |
| start with a name parameter | Initial default name | default | Yes |
| start without a name | Auto-generated (such as Transcription #1, Broadcast #3) | default | Yes |
| Set via set_name | The name explicitly set by the user | user | No |
| Auto-generated by the system after the session ends | A summary name generated from the transcript content | llm | |

Note: The name in start is the initial default name; the system may still override it when the session ends. If you need a fixed name, use set_name.

Default name format (fixed English):

| Recording Type | Default Name Format |
| --- | --- |
| transcribe | Transcription #N |
| conversation | Conversation #N |
| record | Recording #N |
| broadcast | Broadcast #N |

N is the sequential number for that user's recordings of the same type. Name priority: user > llm > default. Once the user sets a name, the system will not override it when the session ends.
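
The naming rules can be modeled on the client as follows, for instance to predict which name will be displayed. This is a sketch: defaultName and shouldOverride are hypothetical helpers, and allowing a source to replace a name of equal priority is an assumption (the document only fixes the order user > llm > default).

```javascript
// Default name prefixes from the table above.
const DEFAULT_LABELS = {
  transcribe: 'Transcription',
  conversation: 'Conversation',
  record: 'Recording',
  broadcast: 'Broadcast',
};

// Format the fixed-English default name for the N-th recording of a type.
function defaultName(type, n) {
  return `${DEFAULT_LABELS[type]} #${n}`;
}

// Name priority: user > llm > default.
const NAME_PRIORITY = { user: 3, llm: 2, default: 1 };

// Decide whether a name from `incomingSource` may replace one from `currentSource`.
function shouldOverride(currentSource, incomingSource) {
  return NAME_PRIORITY[incomingSource] >= NAME_PRIORITY[currentSource];
}
```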

Request Example (With TTS)

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "transcription_languages": ["zh-TW"],
    "translation_languages": ["en-US"],
    "realtime_translation": true,
    "type": "transcribe",
    "tts_enabled": true,
    "tts_language": "en-US",
    "tts_voice": "en-US-JennyNeural",
    "tts_mode": "sync"
  }
}

Request Example (Two-Way Mode - Auto-Detect)

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "type": "conversation",
    "transcription_languages": ["zh-TW", "en-US"],
    "active_language": "zh-TW",
    "audio_format": "pcm",
    "realtime_translation": true,
    "speakers": [
      { "id": 1, "language": "zh-TW" },
      { "id": 2, "language": "en-US" }
    ],
    "tts_config": {
      "zh-TW": { "voice": "zh-TW-HsiaoChenNeural", "speaking_rate": 1.0 },
      "en-US": { "voice": "en-US-JennyNeural", "speaking_rate": 1.0 }
    }
  }
}

Request Example (Two-Way Mode - Manual Mode)

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "type": "conversation",
    "transcription_languages": ["zh-TW", "en-US"],
    "conversation_mode": "manual",
    "audio_format": "pcm",
    "realtime_translation": true,
    "speakers": [
      { "id": 1, "language": "zh-TW" },
      { "id": 2, "language": "en-US" }
    ],
    "tts_config": {
      "zh-TW": { "voice": "zh-TW-HsiaoChenNeural", "speaking_rate": 1.0 },
      "en-US": { "voice": "en-US-JennyNeural", "speaking_rate": 1.0 }
    }
  }
}

Request Example (Custom Summary Prompt - custom mode)

In summary_mode=custom, your summary_prompt content completely replaces the built-in template rules; the backend applies prompt-injection protection. The summary_prompt_slug is metadata for your own identification (stored in the backend record) and is not included in the prompt content.

If you want to keep the built-in template and add your own supplemental instructions afterward, use summary_mode=builtin + summary_template=<slug> + summary_prompt=<supplemental instructions> instead (in builtin mode, summary_prompt is treated as supplemental and appended after the built-in template).

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "transcription_languages": ["zh-TW"],
    "translation_languages": ["en-US"],
    "type": "transcribe",
    "audio_format": "pcm",
    "summary_language": "zh-TW",
    "summary_mode": "custom",
    "summary_prompt": "You are a meeting-minutes assistant. List every amount and committed date discussed in bullet points, and note the responsible person for each.",
    "summary_prompt_slug": "client_x_finance_v3",
    "summary_plain_text": false
  }
}
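
The mode and field rules for summaries can be pre-checked on the client, as sketched below. validateSummaryFields is a hypothetical helper; the returned strings reuse the summary_* error codes from the Error Responses table in this section. The sketch counts Unicode code points for the length limits, which is an assumption about how the server counts characters, and it deliberately omits the rule that transcribe requires summary_template in builtin mode.

```javascript
// Returns the first matching summary_* error code, or null when the fields look valid.
function validateSummaryFields(data) {
  const mode = data.summary_mode ?? 'builtin'; // inferred as builtin when omitted
  if (mode !== 'builtin' && mode !== 'custom') return 'summary_invalid_mode';

  const prompt = data.summary_prompt;
  const slug = data.summary_prompt_slug;

  if (mode === 'custom') {
    // custom: prompt and slug required, template forbidden
    if (!prompt || !slug || data.summary_template !== undefined) {
      return 'summary_mode_field_mismatch';
    }
  } else if (slug !== undefined) {
    // builtin: slug forbidden
    return 'summary_mode_field_mismatch';
  }

  if (prompt !== undefined && [...prompt].length > 2000) return 'summary_prompt_too_long';
  if (slug !== undefined) {
    if ([...slug].length > 64) return 'summary_prompt_slug_too_long';
    if (/[\u0000-\u001f\u007f]/.test(slug)) return 'summary_prompt_slug_invalid';
  }
  return null;
}
```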

Important — How to Retrieve the Summary Result: In WebSocket mode, summaries are non-streaming by design; final_content is not pushed back via a WebSocket event (the summary_done event only signals completion and does not contain the content). The client must retrieve it afterward over HTTP:

  1. After receiving the summary_done event, call GET /api/v1/sse/history/transcribe/{taskId} to retrieve the summary (the init_summary event carries a top-level summary plain string + summary_mode / summary_template / summary_plain_text / summary_prompt_snapshot + the two content-filter fallback audit fields summary_fallback_level / summary_dropped_segments added in v1.5.5).
  2. Or query the summary_mode / summary_template / summary_prompt_slug columns of the recordings table via the REST API.

v1.5.5 Content-Filter Automatic Downgrade: If your prompt or transcript content triggers the LLM service's content filter, the system automatically downgrades (standard mode → neutral mode → segment-omission mode). The summary_fallback_level field of the summary_done event (value 2 or 3; omitted when standard mode succeeds directly) tells the client which path was actually taken, so the frontend can display hints such as "neutral mode in use" / "N segments omitted." See reference/websocket/events.md – summary_done and the V1.5.5 changelog.
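
A frontend could translate summary_fallback_level into a display hint as sketched below. The mapping of 2 to neutral mode and 3 to segment-omission mode is inferred from the downgrade order described above and should be confirmed against reference/websocket/events.md; the function name and return labels are illustrative.

```javascript
// Map the summary_done payload to the summary path that was taken.
function describeSummaryFallback(eventData) {
  switch (eventData.summary_fallback_level) {
    case undefined:
      return 'standard';          // field omitted: standard mode succeeded directly
    case 2:
      return 'neutral';           // assumed: first downgrade step
    case 3:
      return 'segment_omission';  // assumed: second downgrade step
    default:
      return 'unknown';
  }
}
```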

Two-Way Mode Special Rules:

| Item | Description |
| --- | --- |
| transcription_languages | Must contain exactly 2 languages, and they cannot be the same. |
| translation_languages | Not required (automatically derived as the non-active language). |
| active_language | Optional, defaults to transcription_languages[0]. |
| recognition_mode | Forced to single (ignores speaker_diarization). |
| tts_enabled | Defaults to true; set to false to return text translations only. |
| tts_config | Optional; sets the TTS voice for each of the two languages; leave empty to use the default voices automatically. |
| summary_template | Optional; when provided, a summary is automatically generated after stopping. |
| speakers | Required in two-way mode; specifies each user's language (exactly 2 users). |
| conversation_mode | Optional; auto (auto-detect, default) or manual (push-to-talk). |

speakers Field Descriptions:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| id | int | Yes | User number (1 or 2) |
| language | string | Yes | The user's language code (must be in transcription_languages) |

conversation_mode Descriptions:

| Mode | Description |
| --- | --- |
| auto (default) | The system automatically detects the spoken language and segments sentences automatically. |
| manual | The user controls speaking periods via start_speaking / stop_speaking, during which the audio is merged into a single sentence. |
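
The two-way rules can be folded into a small builder, sketched below. buildConversationStart is a hypothetical helper that assigns user 1 to the first language and user 2 to the second and makes the first language active; those choices are assumptions, not requirements of the API.

```javascript
// Build a two-way (conversation) start message for two distinct languages.
function buildConversationStart(langA, langB, { mode = 'auto', audioFormat = 'pcm' } = {}) {
  if (!langA || !langB || langA === langB) {
    throw new Error('Two-way mode needs exactly 2 different languages');
  }
  if (mode !== 'auto' && mode !== 'manual') {
    throw new Error(`Unknown conversation_mode: ${mode}`);
  }
  return {
    type: 'voice-translation',
    data: {
      action: 'start',
      type: 'conversation',
      transcription_languages: [langA, langB],
      active_language: langA,
      conversation_mode: mode,
      audio_format: audioFormat,
      realtime_translation: true,
      // Exactly 2 users, each mapped to one of the two languages.
      speakers: [
        { id: 1, language: langA },
        { id: 2, language: langB },
      ],
    },
  };
}
```

For example, ws.send(JSON.stringify(buildConversationStart('zh-TW', 'en-US', { mode: 'manual' }))) would start a push-to-talk session using the default TTS voices.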

Broadcast Mode Description (type: "broadcast")

In broadcast mode, the language settings are automatically obtained from the broadcast channel settings and do not need to be sent in the WebSocket message.

Required parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | Must be "broadcast" |
| broadcast_token | string | Broadcast token (obtained after creating the broadcast via the REST API) |
| audio_format | string | Audio format (pcm or webm) |

Optional parameters (override the broadcast channel settings):

| Parameter | Type | Description |
| --- | --- | --- |
| tts_config | object | Multi-language TTS settings (overrides the settings from creation time) |
| summary_template | string | Summary template slug (overrides the settings from creation time; if not provided, the broadcast channel default is used) |

Auto-configured parameters (can be omitted):

  • transcription_languages: read automatically from the broadcast settings
  • translation_languages: read automatically from the broadcast settings
  • realtime_translation: enabled by default in broadcast mode
  • summary_template: read automatically from the broadcast settings (the value passed via WebSocket takes precedence)
  • summary_language: read automatically from the broadcast settings (the value passed via WebSocket takes precedence)

Broadcast Phase Descriptions:

| broadcast_phase | Description | Behavior |
| --- | --- | --- |
| live (default) | Live phase | STT/translation results are broadcast to viewers and written to the transcript. |
| standby | Standby phase | STT/translation results go only to the host; viewers see the standby_message. |

Standby phase purpose: Lets the host warm up STT/translation before going live, confirm that equipment is working, and then switch to the live phase.

Broadcast Mode Request Example:

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "type": "broadcast",
    "broadcast_token": "a3f9",
    "audio_format": "pcm"
  }
}

Broadcast Mode Request Example (Standby Phase + Override Summary Template):

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "type": "broadcast",
    "broadcast_token": "a3f9",
    "audio_format": "pcm",
    "broadcast_phase": "standby",
    "standby_message": "The talk is about to begin, please wait...",
    "summary_template": "lecture"
  }
}

Summary template priority: The value passed in the WebSocket start > the default set when the broadcast channel was created. If neither is set, no summary is automatically generated.

Broadcast Mode TTS Settings (tts_config):

Use the tts_config parameter to specify which translation languages should produce TTS audio for viewers.

| tts_config Field | Type | Description |
| --- | --- | --- |
| voice | string | TTS voice name |
| speaking_rate | number | Speaking rate (0.5–2.0, default 1.0) |

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "type": "broadcast",
    "broadcast_token": "a3f9",
    "audio_format": "pcm",
    "tts_config": {
      "en-US": {
        "voice": "en-US-JennyNeural",
        "speaking_rate": 1.0
      },
      "ja-JP": {
        "voice": "ja-JP-NanamiNeural",
        "speaking_rate": 1.0
      }
    }
  }
}

Note:

  • TTS languages must be valid languages in translation_languages; invalid languages are automatically ignored.
  • The host (WebSocket) does not receive TTS audio; only SSE viewers receive the tts_ready event.
  • TTS is sent only during the live phase; nothing is sent during the standby phase.
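
If the client knows the channel's translation languages, it can tidy tts_config before sending it, as in this sketch. sanitizeTtsConfig is a hypothetical helper; dropping unknown languages mirrors the server behavior noted above, while clamping speaking_rate into 0.5–2.0 is a client-side choice (how the server treats out-of-range values is not documented here).

```javascript
// Keep only languages that are translation targets and clamp speaking_rate to 0.5-2.0.
function sanitizeTtsConfig(ttsConfig, translationLanguages) {
  const allowed = new Set(translationLanguages);
  const result = {};
  for (const [lang, cfg] of Object.entries(ttsConfig)) {
    if (!allowed.has(lang)) continue; // the server would ignore this language anyway
    const rate = Math.min(2.0, Math.max(0.5, cfg.speaking_rate ?? 1.0));
    result[lang] = { voice: cfg.voice, speaking_rate: rate };
  }
  return result;
}
```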

TTS Playback Mode Descriptions

| Mode | Description | Behavior |
| --- | --- | --- |
| sync | Synchronous mode (default) | Automatically plays the latest is_final=true translated sentence; if the previous sentence is still playing, it enters the queue and waits. |
| async | Asynchronous mode (manual control) | The user can choose any translated sentence for TTS, controlled with the tts_play command. |

Success Response

After a successful start, a session_started event is returned containing complete session initialization info.

General recordings (transcribe / conversation / record):

{
  "type": "voice-translation",
  "data": {
    "action": "session_started",
    "session_id": "550e8400-e29b-41d4-a716-446655440000",
    "recording_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
    "recording_type": "transcribe",
    "recognition_mode": "single",
    "message": "Speech recognition started"
  }
}

Broadcast mode (broadcast):

{
  "type": "voice-translation",
  "data": {
    "action": "session_started",
    "session_id": "550e8400-e29b-41d4-a716-446655440000",
    "recording_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
    "recording_type": "broadcast",
    "recognition_mode": "multi_speaker",
    "phase": "standby",
    "viewer_count": 0,
    "queue_count": 0,
    "peak_viewers": 0,
    "total_viewers": 0,
    "message": "Speech recognition started"
  }
}

| Field | Type | Description |
| --- | --- | --- |
| session_id | string | Session ID |
| recording_id | string | Recording ID (can be used for subsequent API queries) |
| recording_type | string | Recording type: transcribe, conversation, record, broadcast |
| recognition_mode | string | Recognition mode: single, multi_speaker |
| phase | string | Broadcast phase: standby or live (broadcast mode only) |
| viewer_count | int | Current number of online viewers (broadcast mode only) |
| queue_count | int | Number of viewers waiting in the queue (broadcast mode only) |
| peak_viewers | int | Peak number of viewers for this broadcast (broadcast mode only) |
| total_viewers | int | Total cumulative number of viewers who have connected (broadcast mode only) |
| message | string | Status description message |

Error Responses

| Error Code | HTTP Status | Description | Recommended Action |
| --- | --- | --- | --- |
| missing_transcription_languages | 400 | No language parameter provided | Make sure the request includes transcription_languages |
| invalid_transcription_language | 400 | Invalid language code | Confirm the language code format is correct (such as zh-TW) |
| too_many_languages | 400 | Number of languages exceeds the limit | At most 2 languages can be specified |
| invalid_recording_type | 400 | Invalid recording type | Use a valid type value |
| invalid_summary_template | 400 | Invalid summary template | Confirm the template identifier is correct |
| stt_init_failed | 503 | Service initialization failed | Retry later |
| auth_budget_exceeded | 402 | Monthly budget exceeded | Wait for the next month's budget reset or adjust the budget |
| tts_init_failed | 503 | TTS service initialization failed | Retry later |
| tts_invalid_language | 400 | TTS language not in the translation languages | Confirm tts_language is in translation_languages |
| broadcast_token_required | 400 | Broadcast mode requires a token | The broadcast type must provide a broadcast_token |
| broadcast_token_invalid | 400 | Invalid broadcast token | Confirm the token is correct and not expired |
| broadcast_not_ready | 503 | Broadcast service not yet started | Retry later |
| summary_invalid_mode | 400 | summary_mode is not builtin / custom | Use a valid mode |
| summary_mode_field_mismatch | 400 | The mode and field combination does not match (a required field is missing / a forbidden field was included) | Adjust fields per the mode rules |
| summary_prompt_too_long | 400 | summary_prompt exceeds 2000 characters | Shorten the custom prompt |
| summary_prompt_slug_too_long | 400 | summary_prompt_slug exceeds 64 characters | Shorten the identifier |
| summary_prompt_slug_invalid | 400 | summary_prompt_slug contains control characters (\n / \r / \t / \0, etc.) | Remove the control characters |

Voice Translation - config (Set Terminology / Correction Rules)

Description

Before or during recording, pass in terminology, fuzzy-word correction rules, and translation dictionary settings. These settings improve STT accuracy, fix homophone errors, and ensure translation consistency.

Auto-generated correction rules: When terminology is passed in, the system automatically generates fuzzy-word correction rules for each term (homophones, near-homophones, Traditional/Simplified variants). The frontend does not need to define fuzzy_correction manually, greatly simplifying the setup process.

Use Cases

  • Pass in professional terminology (Phrase List) before recording starts
  • Set fuzzy-word correction rules (homophone correction) - optional, the system generates them automatically
  • Set a translation dictionary (ensure consistent terminology translation)

Timing

| Setting Type | Recommended Timing | Update During Recording |
| --- | --- | --- |
| Terminology | Before or during start | Supported (takes effect on the next turn) |
| Fuzzy-word correction | Before or during start | Supported |
| Translation dictionary | Before or during start | Supported |

Note: When you update terminology during recording, the new terms automatically take effect at the next recognition turn boundary, with no need to reconnect. The response includes a terminology_effective: "next_turn" field as a hint.

Request Parameters

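| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| action | string | Yes | Fixed value config |
| terminology | object | No | Terminology settings |
| fuzzy_correction | object | No | Fuzzy-word correction rules |
| translation_dict | array | No | Translation dictionary (array of entries) |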

Note: At least one setting item must be provided.

Terminology Format (terminology)

Keyed by language code, with an array of terms as the value:

{
  "zh-TW": [
    { "term": "語者分離", "boost": 1.5 },
    { "term": "WebSocket", "boost": 2.0 }
  ],
  "en-US": [
    { "term": "diarization", "boost": 1.5 }
  ]
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| term | string | Yes | The term (max 100 characters) |
| boost | number | No | Weight (default 1.0, range 0.5–5.0) |

Limit: Up to 500 terms per language.

Fuzzy-Word Correction Format (fuzzy_correction)

Note: This field usually does not need to be set manually. The system automatically generates correction rules based on terminology. Use it only when you need custom special rules.

Keyed by language code, with an array of correction rules as the value:

{
  "zh-TW": [
    { "correct": "語者分離", "incorrect": ["語這分離", "語者分力"] }
  ]
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| correct | string | Yes | The correct word |
| incorrect | array | Yes | List of incorrect variants (array of strings) |

Auto-Generated Correction Rule Description

When terminology is passed in, the system automatically generates fuzzy-word correction rules for each term:

| Generation Type | Description | Example |
| --- | --- | --- |
| Homophone | Alternative characters with the same pinyin | 語者 → 語這, 語折 |
| Near-homophone | Alternative characters with similar tones | 媽 → 麻, 馬 |
| Traditional/Simplified | Traditional/Simplified conversion | 製程 → 制程 |

Mixed Chinese-English term support: For mixed terms like "CVD製程," the system generates variants only for the Chinese portion and leaves the English unchanged.

| Original Term | Auto-Generated Variants |
| --- | --- |
| CVD製程 | CVD制程, CVD之程, CVD製城 |
| wafer良率 | wafer量率, wafer涼率 |
| 5nm製程 | 5nm制程, 5nm製成 |

Translation Dictionary Format (translation_dict)

Use an array of entries directly:

[
  {
    "source": "語者分離",
    "translations": {
      "en-US": "Speaker Diarization",
      "ja-JP": "話者分離"
    }
  }
]

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| source | string | Yes | The source word (in the STT language) |
| translations | object | Yes | Translation mapping { "language code": "translation" } |

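Limit: No more than 50 entries, to avoid degrading processing performance. Larger dictionaries are rejected with config_too_many_dict_entries (see Error Responses below).

Request Example (Terminology Only)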

{
  "type": "voice-translation",
  "data": {
    "action": "config",
    "terminology": {
      "zh-TW": [
        { "term": "語者分離", "boost": 1.5 },
        { "term": "CVD製程", "boost": 1.5 },
        { "term": "wafer良率", "boost": 1.5 }
      ]
    }
  }
}

Request Example (Full Settings, Including Manual Correction Rules)

{
  "type": "voice-translation",
  "data": {
    "action": "config",
    "terminology": {
      "zh-TW": [
        { "term": "語者分離", "boost": 1.5 },
        { "term": "即時轉錄", "boost": 1.5 }
      ]
    },
    "fuzzy_correction": {
      "zh-TW": [
        { "correct": "語者分離", "incorrect": ["語這分離", "語者分力"] }
      ]
    },
    "translation_dict": [
      { "source": "語者分離", "translations": { "en-US": "Speaker Diarization" } }
    ]
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "config_updated",
    "updated": ["terminology", "fuzzy_correction", "translation_dict"],
    "message": "Settings updated"
  }
}

| Field | Type | Description |
| --- | --- | --- |
| updated | array | The setting types that were updated (array of strings) |
| message | string | Status message |

Error Responses

| Error Code | HTTP Status | Description | Recommended Action |
| --- | --- | --- | --- |
| config_empty | 400 | No settings provided | Provide at least one setting item |
| config_term_too_long | 400 | Term exceeds 100 characters | Shorten the term length |
| config_too_many_entries | 400 | More than 500 terms | Reduce the number of terms |
| config_too_many_dict_entries | 400 | Translation dictionary exceeds 50 entries | Reduce the dictionary entries |
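
The limits in the table above can be checked on the client before sending a config action. The following is a minimal sketch, not the server's validation logic: the function name check_config is hypothetical, and it assumes that the 500-term limit is counted across all languages and that term length is measured in Unicode code points.

```python
from typing import Any

MAX_TERM_CHARS = 100     # config_term_too_long
MAX_TERMS = 500          # config_too_many_entries (assumed to be a total across languages)
MAX_DICT_ENTRIES = 50    # config_too_many_dict_entries


def check_config(data: dict[str, Any]) -> list[str]:
    """Return the documented error codes this config payload would likely trigger."""
    errors: list[str] = []
    # Everything except "action" counts as a setting item.
    settings = {key: value for key, value in data.items() if key != "action"}
    if not any(settings.values()):
        errors.append("config_empty")
    terms = [
        entry["term"]
        for entries in data.get("terminology", {}).values()
        for entry in entries
    ]
    if any(len(term) > MAX_TERM_CHARS for term in terms):
        errors.append("config_term_too_long")
    if len(terms) > MAX_TERMS:
        errors.append("config_too_many_entries")
    if len(data.get("translation_dict", [])) > MAX_DICT_ENTRIES:
        errors.append("config_too_many_dict_entries")
    return errors
```

An empty list means none of the documented limits were hit; the server remains the authority on what is accepted.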

Voice Translation - audio (Send Audio)

Description

Sends audio data to the server for speech recognition. The audio must be Base64-encoded before sending.

Use Cases

  • Continuously sending microphone audio
  • Sending recorded audio segments

Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| action | string | Yes | Fixed value audio |
| payload | string | Yes | Base64-encoded audio data |

Audio Format Requirements

PCM format (default):

| Item | Specification |
| --- | --- |
| Format | PCM (raw audio) |
| Sample rate | 16000 Hz |
| Bit depth | 16-bit |
| Channels | Mono |
| Byte order | Little-endian |
| Transport encoding | Base64 |

WebM/Opus format:

| Item | Specification |
| --- | --- |
| Format | WebM container + Opus codec |
| Sample rate | Any (the server converts automatically) |
| Channels | Mono or Stereo (the server converts automatically) |
| Transport encoding | Base64 |

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "audio",
    "payload": "Base64-encoded PCM audio data"
  }
}
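
As an illustration of the PCM requirements above, the sketch below wraps raw 16 kHz, 16-bit, mono, little-endian samples in an audio message. The helper name audio_message and the 100 ms chunk size are assumptions for this example; the document does not prescribe a chunk size.

```python
import base64
import json
import struct


def audio_message(pcm: bytes) -> str:
    """Wrap raw 16-bit PCM bytes in a JSON audio action message."""
    if len(pcm) % 2:
        raise ValueError("16-bit PCM must contain an even number of bytes")
    payload = base64.b64encode(pcm).decode("ascii")
    return json.dumps(
        {"type": "voice-translation", "data": {"action": "audio", "payload": payload}}
    )


# 100 ms of silence: 16000 samples/s * 0.1 s = 1600 samples, 2 bytes each.
# "<" selects little-endian and "h" a signed 16-bit sample.
silence = struct.pack("<1600h", *([0] * 1600))
message = audio_message(silence)
```

The resulting string can be sent as a text frame over an open WebSocket connection once the start action has succeeded.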

Error Responses

| Error Code | HTTP Status | Description | Recommended Action |
| --- | --- | --- | --- |
| session_not_started | 400 | Speech recognition has not started | Call the start action first |
| audio_invalid_format | 400 | Invalid audio data format | Confirm the Base64 encoding is correct |
| audio_format_unsupported | 400 | Unsupported audio format | Use the pcm or webm format |
| audio_decode_failed | 400 | Audio decoding failed | Confirm the audio format is correct |

Voice Translation - pause (Pause Translation)

Description

Pauses speech recognition processing. Audio received during the pause is buffered and continues to be processed after resuming.

Use Cases

  • The user steps away temporarily
  • You need to pause recording

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "pause"
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "status",
    "message": "Speech recognition paused"
  }
}

Error Responses

| Error Code | HTTP Status | Description | Recommended Action |
| --- | --- | --- | --- |
| session_not_started | 400 | Speech recognition has not started | Call start first |
| session_already_paused | 400 | Already paused | You can ignore this error |

Voice Translation - resume (Resume Translation)

Description

Resumes paused speech recognition processing.

Use Cases

  • The user returns to continue
  • You need to continue recording

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "resume"
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "status",
    "message": "Speech recognition resumed"
  }
}

Error Responses

| Error Code | HTTP Status | Description | Recommended Action |
| --- | --- | --- | --- |
| session_not_started | 400 | Speech recognition has not started | Call start first |
| session_not_paused | 400 | Not paused | You can ignore this error |

Voice Translation - stop (Stop Translation)

Description

Stops speech recognition and ends the session. The system automatically uploads the audio file and transcript, and generates a summary (if configured).

Use Cases

  • The meeting ends
  • Recording is complete

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "stop"
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "status",
    "message": "Speech recognition stopped"
  }
}

Task Complete Event

This event is sent after the audio file and transcript have been uploaded:

{
  "type": "voice-translation",
  "data": {
    "action": "task_complete",
    "task_id": "550e8400-e29b-41d4-a716-446655440000",
    "message": "Task processing complete"
  }
}

| Field | Type | Description |
| --- | --- | --- |
| task_id | string | Recording UUID, can be used for subsequent API queries |
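
Because task_complete arrives some time after the stop confirmation, a client typically keeps reading messages until it appears. A minimal sketch, assuming the incoming frames are available as an iterable of JSON strings (the function name find_task_id is hypothetical):

```python
import json
from typing import Iterable, Optional


def find_task_id(raw_messages: Iterable[str]) -> Optional[str]:
    """Scan incoming messages and return the task_id of the first task_complete event."""
    for raw in raw_messages:
        event = json.loads(raw)
        data = event.get("data", {})
        if event.get("type") == "voice-translation" and data.get("action") == "task_complete":
            return data.get("task_id")
    return None  # the stream ended before task_complete arrived
```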

Voice Translation - retranslate (Retranslate)

Description

Retranslates a specified sentence, useful when the original text has been corrected and the translation needs to be updated.

Use Cases

  • The user edits the original text and the translation needs updating
  • Correcting recognition errors

Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| action | string | Yes | Fixed value retranslate |
| sid | int | Yes | The sentence number to retranslate |
| translation_languages | string[] | Yes | Array of translation language codes |
| text | string | Yes | The original text to translate (the user-corrected text) |

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "retranslate",
    "sid": 1,
    "translation_languages": ["en-US"],
    "text": "The user-corrected original text"
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "result",
    "translations": {
      "en-US": {
        "sid": 1,
        "text": "The new translation result",
        "is_final": true,
        "is_retranslation": true
      }
    }
  }
}

Error Responses

| Error Code | HTTP Status | Description | Recommended Action |
| --- | --- | --- | --- |
| retranslate_sid_not_found | 400 | The specified SID was not found | Confirm the SID exists |
| retranslate_session_not_active | 400 | The session is not started or has ended | Confirm the session state |
| retranslate_no_target_lang | 400 | No target language provided | Provide translation_languages |
| retranslate_no_text | 400 | No text to translate provided | Provide the text parameter |
| retranslate_llm_failed | 500 | Translation service failed | Retry later |

Voice Translation - switch_language (Switch Language)

Description

Switches the language while real-time translation is in progress. The behavior varies by recording type:

  • General mode (transcribe, etc.): switches the translation target language and automatically batch-retranslates all already-translated sentences.
  • Two-way mode (conversation): switches the STT source language (spoken language); the translation target automatically switches to the other language.

Use Cases

  • Switching the translation target language
  • Language needs change mid-meeting

Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| action | string | Yes | Fixed value switch_language |
| translation_languages | string[] | Conditional | Array of translation language codes (required in general mode) |
| transcription_languages | string[] | Conditional | The spoken language to switch to, as an array (two-way mode; if omitted, automatically toggles to the other language) |

Request Example (General Mode)

{
  "type": "voice-translation",
  "data": {
    "action": "switch_language",
    "translation_languages": ["ja-JP"]
  }
}

Request Example (Two-Way Mode)

Specify the target to switch to:

{
  "type": "voice-translation",
  "data": {
    "action": "switch_language",
    "transcription_languages": ["en-US"]
  }
}

Automatic toggle (no parameters):

{
  "type": "voice-translation",
  "data": {
    "action": "switch_language"
  }
}

Two-Way Mode Special Behavior:

  • Two-way mode uses automatic language detection and usually does not require manually switching the language.
  • switch_language only updates the internal preference state.
  • After a successful switch, a language_switched event is returned (not a language_switch_start/done sequence).
  • Switching to the same language returns a conversation_same_language warning.

Response Sequence (General Mode)

After switching the language, you receive the following events in order:

  1. language_switch_start: notifies that the switch has begun
{
  "type": "voice-translation",
  "data": {
    "action": "language_switch_start",
    "translation_language": "ja-JP",
    "total_segments": 15,
    "message": "Starting language switch and retranslation"
  }
}
  2. batch_retranslation (multiple): returns retranslation results sentence by sentence
{
  "type": "voice-translation",
  "data": {
    "action": "batch_retranslation",
    "sid": 3,
    "translations": {
      "ja-JP": {
        "sid": 3,
        "text": "今日はプロジェクトの進捗について話し合いましょう",
        "is_final": true,
        "is_retranslation": true
      }
    }
  }
}
  3. language_switch_done: notifies that the switch is complete
{
  "type": "voice-translation",
  "data": {
    "action": "language_switch_done",
    "translation_language": "ja-JP",
    "success_count": 15,
    "failed_count": 0,
    "message": "Language switch complete"
  }
}
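
The three-step sequence above can be followed on the client with a small state holder. This is a sketch under stated assumptions: the class name LanguageSwitchTracker is hypothetical, and it is fed the data object of each incoming event.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LanguageSwitchTracker:
    """Tracks one language_switch_start / batch_retranslation / language_switch_done run."""

    language: Optional[str] = None
    total: int = 0
    done: bool = False
    retranslated: dict[int, str] = field(default_factory=dict)

    def handle(self, data: dict) -> None:
        action = data.get("action")
        if action == "language_switch_start":
            # A new switch begins: reset progress for the new target language.
            self.language = data["translation_language"]
            self.total = data["total_segments"]
            self.done = False
            self.retranslated.clear()
        elif action == "batch_retranslation" and self.language is not None:
            entry = data["translations"].get(self.language)
            if entry is not None:
                self.retranslated[entry["sid"]] = entry["text"]
        elif action == "language_switch_done":
            self.done = True

    @property
    def progress(self) -> float:
        """Fraction of sentences retranslated so far (1.0 when there is nothing to do)."""
        return len(self.retranslated) / self.total if self.total else 1.0
```

A UI could use progress to render a progress bar and wait for done before re-enabling the language selector, since a second switch during this window returns switch_language_in_progress.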

Error Responses

| Error Code | HTTP Status | Description | Recommended Action |
| --- | --- | --- | --- |
| switch_language_no_target | 400 | No target language provided | Provide translation_languages |
| switch_language_in_progress | 400 | The previous switch is not yet complete | Wait for the switch to complete |
| switch_language_same_target | 400 | The target language is the same as the current one | You can ignore this error |
| conversation_requires_two_languages | 400 | Two-way mode requires exactly two languages | Confirm transcription_languages contains exactly two languages |
| conversation_languages_identical | 400 | The two languages in two-way mode cannot be the same | Provide two different languages |
| conversation_invalid_language | 400 | Invalid two-way language | Confirm the language is in transcription_languages |
| conversation_same_language | 400 | Already the current language | You can ignore this warning |

Voice Translation - set_name (Set Recording Name)

Description

Sets the recording name while recording is in progress. Once set, this name is used when the recording ends instead of an auto-generated one.

Tip: You can also set an initial default name via the name parameter at start, but that name may still be overridden by the system when the session ends. If you need a fixed name, use set_name.

Use Cases

  • Customizing the recording title after recording starts
  • Overriding an auto-generated name or a previously set name

Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| action | string | Yes | Fixed value set_name |
| name | string | Yes | Recording name (max 60 chars) |

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "set_name",
    "name": "Product Planning Meeting"
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "status",
    "message": "Recording name updated"
  }
}

Error Responses

| Error Code | HTTP Status | Description | Recommended Action |
| --- | --- | --- | --- |
| name_too_long | 400 | Recording name exceeds the limit | Shorten the name |
| session_not_started | 400 | Speech recognition has not started | Call start first |

Voice Translation - rename_speaker (Globally Rename a Speaker)

Description

In multi-speaker diarization mode (multi_speaker), globally renames a speaker. All sentences using that speaker ID are updated in sync.

Use Cases

  • Changing a system-assigned speaker ID (such as Guest-1) to a meaningful name (such as Manager Wang)
  • Naming a newly recognized speaker during a meeting

Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| action | string | Yes | Fixed value rename_speaker |
| speaker_id | string | Yes | The original speaker ID (such as Guest-1); the current display label is also accepted for consecutive renaming; max 100 characters |
| new_label | string | Yes | The new display label; max 100 characters, must not contain control characters (\x00-\x1F, \x7F) or line breaks |

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "rename_speaker",
    "speaker_id": "Guest-1",
    "new_label": "Manager Wang"
  }
}
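
The new_label constraints listed in the request parameters can be checked before sending. A minimal sketch, assuming length is counted in Unicode code points; the function name label_problem is hypothetical, and only speaker_name_empty is a documented error code (the other return values are descriptive strings made up for this example).

```python
import re
from typing import Optional

# \x00-\x1F already covers line breaks (\n and \r); \x7F is DEL.
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")
MAX_LABEL_CHARS = 100


def label_problem(new_label: str) -> Optional[str]:
    """Return a reason the label would be rejected, or None if it looks valid."""
    if not new_label:
        return "speaker_name_empty"
    if len(new_label) > MAX_LABEL_CHARS:
        return "label longer than 100 characters"
    if _CONTROL_CHARS.search(new_label):
        return "label contains a control character or line break"
    return None
```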

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "speaker_renamed",
    "speaker_id": "Guest-1",
    "new_label": "Manager Wang",
    "affected_sids": [1, 3, 5, 8]
  }
}

| Field | Type | Description |
| --- | --- | --- |
| speaker_id | string | The resolved original speaker ID (even if the input was a display label, the event returns the original ID) |
| new_label | string | The new display label |
| affected_sids | int[] | The list of affected sentence numbers |

Error Responses

| Error Code | HTTP Status | Description | Recommended Action |
| --- | --- | --- | --- |
| speaker_not_found | 400 | The specified speaker was not found | Confirm the speaker_id or display label exists |
| speaker_name_empty | 400 | new_label is empty | Provide a valid label |
| speaker_name_duplicate | 422 | The display label is already in use | Use a different label, or first change the conflicting speaker |
| session_not_started | 400 | Speech recognition has not started | Call start first |

Voice Translation - reassign_speaker (Change the Speaker of a Single Sentence)

Description

Changes the speaker identity (OriginalSpeakerID) of a specific sentence, assigning the sentence to an existing speaker.

Use Cases

  • Correcting a speaker identity that the system recognized incorrectly
  • Reassigning a sentence to another known speaker

Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| action | string | Yes | Fixed value reassign_speaker |
| sid | int | Yes | The sentence number to change |
| target_speaker_id | string | Yes | The target speaker's original ID (taken from init_sentence.speaker_id; reassign does not accept display labels) |

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "reassign_speaker",
    "sid": 5,
    "target_speaker_id": "Guest-2"
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "speaker_reassigned",
    "sid": 5,
    "old_speaker_id": "Guest-1",
    "new_speaker_id": "Guest-2",
    "new_speaker_label": "Lee Hsiao-hua"
  }
}

| Field | Type | Description |
| --- | --- | --- |
| sid | int | The changed sentence number |
| old_speaker_id | string | The original speaker ID |
| new_speaker_id | string | The new original speaker ID |
| new_speaker_label | string | The new speaker display label (after applying speaker_aliases; equals new_speaker_id when no alias exists) |

Error Responses

| Error Code | HTTP Status | Description | Recommended Action |
| --- | --- | --- | --- |
| speaker_sid_not_found | 400 | The specified sentence was not found | Confirm the SID exists |
| speaker_not_found | 400 | The target speaker does not exist | Use an existing speaker ID |
| speaker_name_empty | 400 | The target speaker ID cannot be empty | Provide a valid speaker ID |
| session_not_started | 400 | Speech recognition has not started | Call start first |
| invalid_parameter | 400 | Creating a new speaker is not supported | Use an existing speaker ID |

Voice Translation - merge_speakers (Merge Speakers)

Description

Merges all sentences of one speaker into another speaker. After the merge, future recognition results for that speaker are also automatically converted to the target speaker.

Use Cases

  • The speech recognition engine sometimes misidentifies the same person's voice as multiple speakers (for example, Guest-1 and Guest-2 are actually the same person)
  • Use this feature to merge all of Guest-2's sentences into Guest-1
  • After the merge, future Guest-2 recognition results are automatically displayed as Guest-1

Difference from reassign_speaker

| Feature | Scope | Future Impact |
| --- | --- | --- |
| reassign_speaker | A single sentence (1 SID) | None |
| merge_speakers | All sentences of that speaker | Future appearances of the source are also automatically converted to the target |

Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| action | string | Yes | Fixed value merge_speakers |
| source_speaker_id | string | Yes | The speaker ID to be merged (such as Guest-2) |
| target_speaker_id | string | Yes | The merge target speaker ID (such as Guest-1) |

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "merge_speakers",
    "source_speaker_id": "Guest-2",
    "target_speaker_id": "Guest-1"
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "speakers_merged",
    "source_speaker_id": "Guest-2",
    "target_speaker_id": "Guest-1",
    "target_speaker_label": "Manager Wang",
    "affected_sids": [3, 5, 7]
  }
}

| Field | Type | Description |
| --- | --- | --- |
| source_speaker_id | string | The original ID of the merged speaker |
| target_speaker_id | string | The original ID of the merge target |
| target_speaker_label | string | The target speaker display label (after applying speaker_aliases; equals the original ID when no alias exists) |
| affected_sids | int[] | The list of affected sentence numbers |
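
To illustrate the difference in scope between the two actions, the sketch below applies speaker_reassigned and speakers_merged events to a client-side map of sentence number to speaker ID. The map and the function name apply_speaker_event are assumptions for this example, not part of the API.

```python
def apply_speaker_event(sentence_speakers: dict[int, str], data: dict) -> None:
    """Update a local {sid: speaker_id} map from a speaker event's data object."""
    action = data.get("action")
    if action == "speaker_reassigned":
        # Exactly one sentence changes.
        sentence_speakers[data["sid"]] = data["new_speaker_id"]
    elif action == "speakers_merged":
        # Every listed sentence moves to the merge target.
        for sid in data["affected_sids"]:
            sentence_speakers[sid] = data["target_speaker_id"]
```

Sentences recognized after a merge need no client-side remapping in this sketch, since the description above states that future results for the source speaker are converted to the target automatically.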

Error Responses

| Error Code | HTTP Status | Description | Recommended Action |
| --- | --- | --- | --- |
| speaker_not_found | 400 | The speaker does not exist | Confirm the speaker ID exists |
| merge_speakers_same_id | 400 | The source and target speaker are the same | Use different speaker IDs |
| speaker_name_empty | 400 | The speaker ID cannot be empty | Provide a valid speaker ID |
| session_not_started | 400 | Speech recognition has not started | Call start first |

Voice Translation - tts_play (Play TTS)

Description

In async mode, manually plays the TTS audio for a specified sentence.

Use Cases

  • The user selects a specific sentence for TTS playback
  • Playing multiple consecutive sentences

Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| action | string | Yes | Fixed value tts_play |
| sid | int | Yes | The starting sentence ID |
| length | int | No | Number of sentences to play (default 1, max 20) |

Note: The maximum value of length is controlled by the backend environment variable TTS_SSE_MAX_LENGTH (default 20).

Request Example (Single Sentence)

{
  "type": "voice-translation",
  "data": {
    "action": "tts_play",
    "sid": 5
  }
}

Request Example (Multiple Sentences)

{
  "type": "voice-translation",
  "data": {
    "action": "tts_play",
    "sid": 5,
    "length": 3
  }
}

Error Responses

| Error Code | HTTP Status | Description | Recommended Action |
| --- | --- | --- | --- |
| tts_not_enabled | 400 | TTS not enabled | Confirm TTS was enabled at start |
| tts_sid_not_found | 400 | The specified sentence was not found | Confirm the SID exists |
| tts_translation_not_found | 400 | The sentence has no translation in the specified language | Confirm the translation exists |

Voice Translation - tts_stop (Stop TTS)

Description

Stops the TTS audio that is currently playing.

Use Cases

  • The user manually stops TTS playback
  • Stopping the current playback before switching to another sentence

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "tts_stop"
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "status",
    "message": "TTS playback stopped"
  }
}

Voice Translation - tts_mode (Switch TTS Mode)

Description

Switches the TTS playback mode (synchronous/asynchronous) while recording is in progress.

Use Cases

  • Switching from automatic playback to manual control
  • Switching from manual control to automatic playback

Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| action | string | Yes | Fixed value tts_mode |
| tts_mode | string | Yes | Mode: sync (synchronous) or async (asynchronous) |

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "tts_mode",
    "tts_mode": "async"
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "tts_mode_changed",
    "tts_mode": "async"
  }
}

Error Responses

| Error Code | HTTP Status | Description | Recommended Action |
| --- | --- | --- | --- |
| tts_not_enabled | 400 | TTS not enabled | Confirm TTS was enabled at start |
| tts_invalid_mode | 400 | Invalid mode | Use sync or async |

Voice Translation - set_tts (Two-Way TTS Settings)

Description

While a two-way mode (conversation) recording is in progress, dynamically toggles TTS on/off or updates the TTS voice settings. Available only in two-way mode.

Use Cases

  • Turning the TTS audio response off/on mid-conversation in two-way mode
  • Changing the TTS voice or speaking rate for a specific language

Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| action | string | Yes | Fixed value set_tts |
| tts_enabled | boolean | No | Whether to enable two-way TTS (true / false) |
| tts_config | object | No | TTS settings per language; the key is the language code, and the value is {voice, speaking_rate} |

Note: At least one of tts_enabled and tts_config must be provided. tts_config updates only the settings for the specified languages; unspecified languages remain unchanged.

Request Example (Disable TTS)

{
  "type": "voice-translation",
  "data": {
    "action": "set_tts",
    "tts_enabled": false
  }
}

Request Example (Update Voice Settings)

{
  "type": "voice-translation",
  "data": {
    "action": "set_tts",
    "tts_enabled": true,
    "tts_config": {
      "en-US": {
        "voice": "en-US-GuyNeural",
        "speaking_rate": 1.2
      }
    }
  }
}
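
Since at least one of tts_enabled and tts_config must be present, a message builder can enforce that rule locally. A minimal sketch; the function name set_tts_message is hypothetical.

```python
from typing import Optional


def set_tts_message(
    tts_enabled: Optional[bool] = None,
    tts_config: Optional[dict] = None,
) -> dict:
    """Build a set_tts message, requiring at least one of the two optional fields."""
    if tts_enabled is None and tts_config is None:
        raise ValueError("provide tts_enabled, tts_config, or both")
    data: dict = {"action": "set_tts"}
    if tts_enabled is not None:
        data["tts_enabled"] = tts_enabled
    if tts_config is not None:
        # Only the listed languages are updated; others stay unchanged on the server.
        data["tts_config"] = tts_config
    return {"type": "voice-translation", "data": data}
```

For example, set_tts_message(tts_config={"en-US": {"voice": "en-US-GuyNeural", "speaking_rate": 1.2}}) changes only the en-US voice settings and leaves the enabled state untouched.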

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "tts_updated",
    "tts_enabled": true,
    "tts_config": {
      "zh-TW": { "voice": "zh-TW-HsiaoChenNeural", "speaking_rate": 1.0 },
      "en-US": { "voice": "en-US-GuyNeural", "speaking_rate": 1.2 }
    }
  }
}

| Field | Type | Description |
| --- | --- | --- |
| tts_enabled | boolean | The current TTS enabled state |
| tts_config | object | The current complete TTS settings (all languages) |

Error Responses

| Error Code | HTTP Status | Description | Recommended Action |
| --- | --- | --- | --- |
| invalid_action | 400 | Not two-way mode | This action is available only in conversation mode |
| session_not_started | 400 | Speech recognition has not started | Call start first |

Voice Translation - start_speaking (Start Speaking / Manual Mode)

Description

In two-way manual mode (conversation_mode: "manual"), notifies the system that the user has started speaking. From this point on, audio is sent to STT for recognition, and all recognition results accumulate into a single sentence (no automatic segmentation).

Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| action | string | Yes | Fixed value start_speaking |
| speaker | int | Yes | User number (1 or 2) |

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "start_speaking",
    "speaker": 1
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "status",
    "message": "Speaking started"
  }
}

Error Responses

| Error Code | HTTP Status | Description | Recommended Action |
| --- | --- | --- | --- |
| invalid_action | 400 | Not two-way mode | Use only under the conversation type |
| conversation_not_manual_mode | 400 | Not manual mode | Use only in manual mode |
| conversation_speaking | 400 | Already speaking | Call stop_speaking first |
| conversation_invalid_speaker | 400 | Invalid user number | Use 1 or 2 |

Voice Translation - stop_speaking (Stop Speaking / Manual Mode)

Description

In two-way manual mode, notifies the system that the user has stopped speaking. The system merges the recognition results accumulated during the period into a single complete sentence and performs translation and TTS synthesis.

Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| action | string | Yes | Fixed value stop_speaking |

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "stop_speaking"
  }
}

Success Response

After stopping speaking, the system sends a complete result event (containing origin and translations):

{
  "type": "voice-translation",
  "data": {
    "action": "result",
    "origin": {
      "sid": 1,
      "language": "zh-TW",
      "text": "The complete sentence merged from all recognition during this period",
      "is_final": true,
      "speaker_id": "Speaker-1",
      "start_time": "00:05"
    },
    "translations": {
      "en-US": {
        "sid": 1,
        "text": "The complete merged sentence from this speaking period",
        "is_final": true
      }
    }
  }
}

Error Responses

| Error Code | HTTP Status | Description | Recommended Action |
| --- | --- | --- | --- |
| invalid_action | 400 | Not two-way mode | Use only under the conversation type |
| conversation_not_speaking | 400 | Not in a speaking state | Call start_speaking first |
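
The start_speaking / stop_speaking pairing can be mirrored on the client so that the conversation_speaking and conversation_not_speaking errors are caught before a message is sent. This is a sketch with a hypothetical class name (ManualTurn); it tracks only local state and does not replace the server's checks.

```python
from typing import Optional


class ManualTurn:
    """Client-side guard for manual-mode speaking turns."""

    def __init__(self) -> None:
        self.speaker: Optional[int] = None  # None means nobody is speaking

    def start(self, speaker: int) -> dict:
        if speaker not in (1, 2):
            raise ValueError("conversation_invalid_speaker")
        if self.speaker is not None:
            raise RuntimeError("conversation_speaking")
        self.speaker = speaker
        return {
            "type": "voice-translation",
            "data": {"action": "start_speaking", "speaker": speaker},
        }

    def stop(self) -> dict:
        if self.speaker is None:
            raise RuntimeError("conversation_not_speaking")
        self.speaker = None
        return {"type": "voice-translation", "data": {"action": "stop_speaking"}}
```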

Voice Translation - switch_conversation_mode (Switch Conversation Mode)

Description

While two-way mode is in progress, switches between auto-detect mode (auto) and manual mode (manual). If the user is currently speaking when the switch happens, speaking is ended automatically.

Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| action | string | Yes | Fixed value switch_conversation_mode |
| conversation_mode | string | Yes | The target mode: auto or manual |

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "switch_conversation_mode",
    "conversation_mode": "manual"
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "conversation_mode_changed",
    "conversation_mode": "manual"
  }
}

Error Responses

| Error Code | HTTP Status | Description | Recommended Action |
| --- | --- | --- | --- |
| invalid_action | 400 | Not two-way mode | Use only under the conversation type |
| conversation_invalid_mode | 400 | Invalid conversation mode | Use auto or manual |

Voice Translation - set_speaker_language (Set User Language)

Description

While two-way mode is in progress, changes a specified user's language in real time. The system rebuilds the STT connection to accommodate the new language, and the translation target is also updated automatically. The transcript content before the change keeps its original language, and the timestamp continues to count without resetting.

Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| action | string | Yes | Fixed value set_speaker_language |
| speaker | int | Yes | User number (1 or 2) |
| language | string | Yes | The new language code (such as ja-JP) |

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "set_speaker_language",
    "speaker": 1,
    "language": "ja-JP"
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "speaker_language_changed",
    "speaker_language_map": {
      "1": "ja-JP",
      "2": "en-US"
    }
  }
}

Error Responses

| Error Code | HTTP Status | Description | Recommended Action |
| --- | --- | --- | --- |
| invalid_action | 400 | Not two-way mode | Use only under the conversation type |
| conversation_invalid_speaker | 400 | Invalid user number | Use 1 or 2 |
| conversation_invalid_language | 400 | Invalid language code | Use a valid BCP 47 language code |
| conversation_same_language | 400 | Same as the current language | You can ignore this warning |
| conversation_language_same_as_peer | 400 | The new language is the same as the other user | The two users cannot have the same language |
| conversation_speaking | 400 | Currently speaking, cannot change language | End speaking before changing |
| conversation_language_change_failed | 500 | Language change failed (STT rebuild failed) | Retry later |
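
Several of these errors can be anticipated from the most recent speaker_language_map. The sketch below is a client-side pre-check with a hypothetical function name (precheck_language_change); it assumes the map uses the string keys "1" and "2" shown in the success response.

```python
from typing import Optional


def precheck_language_change(
    language_map: dict[str, str], speaker: int, language: str
) -> Optional[str]:
    """Return the error code a set_speaker_language request would likely hit, or None."""
    if speaker not in (1, 2):
        return "conversation_invalid_speaker"
    peer = 2 if speaker == 1 else 1
    if language_map.get(str(speaker)) == language:
        return "conversation_same_language"
    if language_map.get(str(peer)) == language:
        return "conversation_language_same_as_peer"
    return None
```

The check does not validate the language code itself and cannot detect an in-progress speaking turn, so the server may still reject a request that passes it.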

Voice Translation - broadcast_go_live (Switch to the Live Phase)

Description

Switches from the broadcast standby phase (standby) to the live phase (live). After switching, STT/translation results begin broadcasting to viewers and start being written to the transcript.

Use Cases

  • The host confirms the equipment is working and starts the official broadcast
  • Switching from the warm-up phase to live streaming

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "broadcast_go_live"
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "broadcast_phase_changed",
    "phase": "live",
    "message": "Broadcast started"
  }
}

| Field | Type | Description |
| --- | --- | --- |
| phase | string | The new phase (live) |
| message | string | Status description message |

Error Responses

| Error Code | HTTP Status | Description | Recommended Action |
| --- | --- | --- | --- |
| broadcast_not_enabled | 400 | Not broadcast mode | Confirm type: "broadcast" |
| session_not_started | 400 | Speech recognition has not started | Call start first |

Note: If already in the live phase, a status message "Broadcast is already in progress" is returned and is not treated as an error.


Voice Translation - broadcast_announcement (Send an Announcement)

Description

The host sends a custom message announcement to all viewers. Viewers receive an announcement event via SSE. The announcement message is automatically translated into all translation languages, and the SSE event viewers receive includes a translations field.

Use Cases

  • Notifying viewers that the meeting is about to end
  • Sending an important reminder or announcement
  • One-way communication with viewers

Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| action | string | Yes | Fixed value broadcast_announcement |
| message | string | Yes | The announcement message content |

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "broadcast_announcement",
    "message": "The meeting will end in 5 minutes"
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "status",
    "message": "Announcement sent"
  }
}

The SSE event viewers receive (with translations):

event: announcement
data: {"message":"The meeting will end in 5 minutes","translations":{"en-US":"The meeting will end in 5 minutes","ja-JP":"会議は5分後に終了します"}}

Error Responses

| Error Code | HTTP Status | Description | Recommended Action |
| --- | --- | --- | --- |
| broadcast_not_enabled | 400 | Not broadcast mode | Confirm type: "broadcast" |
| invalid_parameter | 400 | Message is empty | Provide a valid message parameter |

Voice Translation - set_standby_message (Set the Standby Phase Message)

Description

During the broadcast standby phase (standby), dynamically sets the message shown to viewers. This allows the host to enter standby mode and then set the waiting message, rather than being required to provide it at start.

The message is automatically translated into all translation languages, and the SSE event viewers receive includes a translations field.

Use Cases

  • After entering standby mode, dynamically set the waiting message shown to viewers
  • Update the text on the standby screen before going live
  • Reduce the required fields before starting the broadcast

Request Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| action | string | Yes | Fixed value set_standby_message |
| message | string | Yes | The text displayed during the standby phase (translated for viewers of each language via the existing translation pipeline) |

Request Example

{
  "type": "voice-translation",
  "data": {
    "action": "set_standby_message",
    "message": "The talk is about to begin, please wait..."
  }
}

Success Response

{
  "type": "voice-translation",
  "data": {
    "action": "status",
    "message": "Standby phase text updated"
  }
}

Event Viewers Receive

After the message is set successfully, all viewers in the standby phase receive an updated standby event via SSE:

event: standby
data: {"message":"The talk is about to begin, please wait...","translations":{"en-US":"The presentation is about to begin, please wait...","ja-JP":"プレゼンテーションがまもなく始まります。お待ちください..."}}

Note: The translations field contains the translation results for all translation languages. The frontend can display the corresponding translation based on the language the viewer selects.
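
On the viewer side, selecting the text to display comes down to a lookup in translations with a fallback to the original message. A minimal sketch, assuming the data line of the SSE event has already been read as a string; the function name standby_text is hypothetical and requires Python 3.9 or later for str.removeprefix.

```python
import json


def standby_text(sse_data_line: str, viewer_language: str) -> str:
    """Pick the standby text for a viewer from the data line of a standby SSE event."""
    payload = json.loads(sse_data_line.removeprefix("data:").strip())
    translations = payload.get("translations", {})
    # Fall back to the host's original message when no translation matches.
    return translations.get(viewer_language, payload["message"])
```

The same lookup applies to announcement events, which carry the same message and translations fields.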

Error Responses

| Error Code | HTTP Status | Description | Recommended Action |
| --- | --- | --- | --- |
| broadcast_not_enabled | 400 | Not broadcast mode | Confirm type: "broadcast" |
| broadcast_not_in_standby | 400 | Not in the standby phase | Can be used only during the standby phase |

Note: This action can be used only during the standby phase (standby). If the broadcast has already entered the live phase (live), an error is returned.


Response Events

The following are all the response events you may receive over the WebSocket.

session_started - Session Started Successfully

After a start action succeeds, the server returns an event containing complete session initialization info. The frontend can distinguish the recording type via recording_type.

General recordings (transcribe / conversation / record):

{
  "type": "voice-translation",
  "data": {
    "action": "session_started",
    "session_id": "550e8400-e29b-41d4-a716-446655440000",
    "recording_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
    "recording_type": "transcribe",
    "recognition_mode": "single",
    "message": "Speech recognition started"
  }
}

Broadcast mode (broadcast):

{
  "type": "voice-translation",
  "data": {
    "action": "session_started",
    "session_id": "550e8400-e29b-41d4-a716-446655440000",
    "recording_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
    "recording_type": "broadcast",
    "recognition_mode": "multi_speaker",
    "phase": "standby",
    "viewer_count": 0,
    "queue_count": 0,
    "peak_viewers": 0,
    "total_viewers": 0,
    "message": "Speech recognition started"
  }
}

| Field | Type | Description |
| --- | --- | --- |
| session_id | string | Session ID |
| recording_id | string | Recording ID (can be used for subsequent API queries) |
| recording_type | string | Recording type: transcribe, conversation, record, broadcast |
| recognition_mode | string | Recognition mode: single, multi_speaker |
| message | string | Status description message |
| phase | string | Broadcast phase: standby or live (broadcast mode only) |
| viewer_count | int | Current number of online viewers (broadcast mode only) |
| queue_count | int | Number of viewers waiting in the queue (broadcast mode only) |
| peak_viewers | int | Peak number of viewers for this broadcast (broadcast mode only) |
| total_viewers | int | Total cumulative number of viewers who have connected (broadcast mode only) |
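
A client can parse session_started into a typed record and branch on recording_type, treating the broadcast-only fields as optional. This is a sketch with hypothetical names (SessionInfo, parse_session_started) that keeps only a subset of the fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionInfo:
    session_id: str
    recording_id: str
    recording_type: str
    recognition_mode: str
    phase: Optional[str] = None          # broadcast mode only
    viewer_count: Optional[int] = None   # broadcast mode only

    @property
    def is_broadcast(self) -> bool:
        return self.recording_type == "broadcast"


def parse_session_started(data: dict) -> SessionInfo:
    """Build a SessionInfo from the data object of a session_started event."""
    if data.get("action") != "session_started":
        raise ValueError("not a session_started event")
    return SessionInfo(
        session_id=data["session_id"],
        recording_id=data["recording_id"],
        recording_type=data["recording_type"],
        recognition_mode=data["recognition_mode"],
        phase=data.get("phase"),
        viewer_count=data.get("viewer_count"),
    )
```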

result - Recognition/Translation Result

Speech recognition and translation results. A single result event may contain origin (recognition result) and/or translations (translation results).

origin (speech recognition result):

{
  "type": "voice-translation",
  "data": {
    "action": "result",
    "origin": {
      "sid": 1,
      "language": "zh-TW",
      "text": "Hello, nice to meet you",
      "is_final": true,
      "speaker_id": "0",
      "detected_language": "zh-TW",
      "start_time": "00:05"
    }
  }
}
Field | Type | Description
sid | int | Sentence number, starting from 1
language | string | Source language code. In two-way mode, this is the automatically detected language.
text | string | The recognized text
is_final | boolean | Whether it is the final result
speaker_id | string | Speaker ID
detected_language | string | The detected language. In two-way mode, this is determined automatically by the system.
start_time | string | Sentence start time (mm:ss). Omitted during the broadcast standby phase; after going live, timing counts from 00:00.
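Because start_time is a mm:ss string and may be absent during the broadcast standby phase, a client will usually want to normalize it before sorting or seeking. The helper below is a small sketch; the name start_time_to_seconds is an assumption, not an API function.

```python
from typing import Optional


def start_time_to_seconds(start_time: Optional[str]) -> Optional[int]:
    """Convert a mm:ss string to whole seconds; pass through a missing value."""
    if start_time is None:
        # The field is not sent during the broadcast standby phase.
        return None
    minutes, seconds = start_time.split(":")
    return int(minutes) * 60 + int(seconds)
```

For the example above, start_time_to_seconds("00:05") yields 5.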

translations (translation results):

{
  "type": "voice-translation",
  "data": {
    "action": "result",
    "translations": {
      "en-US": {
        "sid": 1,
        "text": "Hello, nice to meet you",
        "is_final": true
      }
    }
  }
}

Translation results are keyed by language code, and each language's translation object contains:

Field | Type | Description
sid | int | Sentence number
text | string | The translated text
is_final | boolean | Whether it is the final result
is_retranslation | boolean | Whether it is a retranslation result (present only in results produced by retranslate)
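Since origin and translations can arrive in the same result event or in separate ones, a client typically merges them by sid. The following sketch keeps one entry per sentence; the transcript structure and the name apply_result are illustrative choices, not something the API prescribes.

```python
def apply_result(transcript: dict, data: dict) -> None:
    """Merge one result event's data into a transcript keyed by sid."""
    origin = data.get("origin")
    if origin is not None:
        entry = transcript.setdefault(origin["sid"], {"translations": {}})
        # Later (for example final) recognition results replace earlier ones.
        entry["origin"] = origin
    for lang, translation in (data.get("translations") or {}).items():
        entry = transcript.setdefault(translation["sid"], {"translations": {}})
        entry["translations"][lang] = translation
```

Applying the two example events above in either order leaves a single entry for sid 1 holding both the zh-TW recognition result and the en-US translation.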

status - Generic Status Response

Used to confirm operations such as pause, resume, stop, and set_name.

{
  "type": "voice-translation",
  "data": {
    "action": "status",
    "message": "Speech recognition paused"
  }
}
Field | Type | Description
message | string | Status description

task_complete - Task Processing Complete

Triggered after stop when the audio file and transcript have been uploaded. task_id can be used to query task details via the REST API afterward.

{
  "type": "voice-translation",
  "data": {
    "action": "task_complete",
    "task_id": "550e8400-e29b-41d4-a716-446655440000",
    "message": "Task processing complete"
  }
}
Field | Type | Description
task_id | string | Recording UUID, can be used for subsequent API queries
message | string | Status description

config_updated - Settings Update Complete

Triggered after the config action succeeds.

{
  "type": "voice-translation",
  "data": {
    "action": "config_updated",
    "updated": ["terminology", "fuzzy_correction", "translation_dict"],
    "message": "Settings updated"
  }
}
Field | Type | Description
updated | array of string | The setting types that were updated (terminology, fuzzy_correction, translation_dict)
message | string | Status message

tts_ready - TTS Audio Ready

Sent when TTS synthesis completes. Contains the audio data and Word Boundary information, which can be used for a karaoke-style highlight effect.

{
  "type": "voice-translation",
  "data": {
    "action": "tts_ready",
    "sid": 1,
    "language": "en-US",
    "transcript": "你好,很高興認識你",
    "text": "Hello, nice to meet you",
    "audio": "Base64EncodedMP3...",
    "format": "mp3",
    "duration_ms": 2500,
    "boundaries": [
      {"offset_ms": 0, "duration_ms": 350, "text_offset": 0, "word_length": 5, "text": "Hello"},
      {"offset_ms": 350, "duration_ms": 100, "text_offset": 5, "word_length": 1, "text": ","},
      {"offset_ms": 500, "duration_ms": 250, "text_offset": 7, "word_length": 4, "text": "nice"},
      {"offset_ms": 750, "duration_ms": 200, "text_offset": 12, "word_length": 2, "text": "to"},
      {"offset_ms": 950, "duration_ms": 350, "text_offset": 15, "word_length": 4, "text": "meet"},
      {"offset_ms": 1300, "duration_ms": 300, "text_offset": 20, "word_length": 3, "text": "you"}
    ]
  }
}
Field | Type | Description
sid | int | Sentence number
language | string | TTS language
transcript | string | The original transcript (STT recognition result)
text | string | The translated text (TTS synthesis source)
audio | string | Base64-encoded MP3 audio
format | string | Audio format (fixed value mp3)
duration_ms | int | Total audio duration (milliseconds)
boundaries | array | Array of Word Boundaries

Word Boundary Field Descriptions

Field | Type | Description
offset_ms | int | The word's start time in the audio (milliseconds)
duration_ms | int | The word's duration (milliseconds)
text_offset | int | Start position of the word in text (character index)
word_length | int | Word length (number of characters)
text | string | The word content
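To illustrate the karaoke use case, the sketch below decodes the audio field and looks up which word boundary covers a given playback position. The function names are hypothetical, and the playback position is assumed to come from whatever audio player the client uses.

```python
import base64
from typing import Optional


def decode_tts_audio(data: dict) -> bytes:
    """Return the raw MP3 bytes carried in a tts_ready event's audio field."""
    return base64.b64decode(data["audio"])


def active_boundary(boundaries: list, position_ms: int) -> Optional[dict]:
    """Find the word being spoken at position_ms, or None between words."""
    for boundary in boundaries:
        start = boundary["offset_ms"]
        if start <= position_ms < start + boundary["duration_ms"]:
            return boundary
    return None


def highlight_span(boundary: dict) -> tuple:
    """Return the (start, end) character range of the word within text."""
    start = boundary["text_offset"]
    return start, start + boundary["word_length"]
```

With the example boundaries above, a playback position of 600 ms falls inside "nice", whose span (7, 11) selects that word in the text field. A position inside a gap between words, such as 460 ms, returns None, so the client can keep the previous highlight or clear it.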

tts_error - TTS Synthesis Failed

TTS synthesis failure event.

{
  "type": "voice-translation",
  "data": {
    "action": "tts_error",
    "sid": 1,
    "language": "en-US",
    "error": "translation_not_found",
    "message": "No translation available for language: en-US"
  }
}
Field | Type | Description
sid | int | Sentence number
language | string | TTS language
error | string | Error code
message | string | Error message

TTS Error Codes

Error Code | Description
translation_not_found | No translation found for that language
tts_synthesis_failed | TTS synthesis failed
tts_quota_exceeded | TTS usage has reached the limit

viewer_count - Viewer Count Update

Broadcast mode only

During a broadcast, the system checks the viewer count every 3 seconds and pushes this event to the host if it changes.

{
  "type": "voice-translation",
  "data": {
    "action": "viewer_count",
    "viewer_count": 45,
    "queue_count": 8,
    "peak_viewers": 50,
    "total_viewers": 123
  }
}
Field | Type | Description
viewer_count | int | Current number of online viewers
queue_count | int | Number of viewers waiting in the queue
peak_viewers | int | Peak number of viewers for this broadcast
total_viewers | int | Total cumulative number of viewers who have connected

Note: This event is pushed only when the viewer count or queue count changes, to avoid unnecessary message traffic.


viewer_joined - Viewer Joined

Broadcast mode only

When a viewer joins the broadcast, the host receives this event.

{
  "type": "voice-translation",
  "data": {
    "action": "viewer_joined",
    "viewer_count": 5,
    "queue_count": 2
  }
}
Field | Type | Description
viewer_count | int | Current number of viewers
queue_count | int | Number of viewers waiting in the queue

viewer_left - Viewer Left

Broadcast mode only

When a viewer leaves the broadcast, the host receives this event.

{
  "type": "voice-translation",
  "data": {
    "action": "viewer_left",
    "viewer_count": 4,
    "queue_count": 1
  }
}
Field | Type | Description
viewer_count | int | Current number of viewers
queue_count | int | Number of viewers waiting in the queue
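The viewer_count, viewer_joined, and viewer_left events carry overlapping subsets of the same statistics, so a host client can fold all three into one state object. The sketch below is one possible approach; update_viewer_stats and the shape of the stats dictionary are assumptions.

```python
VIEWER_ACTIONS = {"viewer_count", "viewer_joined", "viewer_left"}
STAT_FIELDS = ("viewer_count", "queue_count", "peak_viewers", "total_viewers")


def update_viewer_stats(stats: dict, data: dict) -> dict:
    """Return a copy of stats updated with whichever fields the event carries."""
    if data.get("action") not in VIEWER_ACTIONS:
        return stats
    updated = dict(stats)
    for name in STAT_FIELDS:
        # viewer_joined / viewer_left only include the first two fields.
        if name in data:
            updated[name] = data[name]
    return updated
```

Fields that an event does not include, such as peak_viewers in viewer_left, keep their last known value.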

broadcast_phase_changed - Broadcast Phase Changed

Triggered when the broadcast phase switches from standby to live.

{
  "type": "voice-translation",
  "data": {
    "action": "broadcast_phase_changed",
    "phase": "live",
    "message": "Broadcast started"
  }
}
Field | Type | Description
phase | string | The new phase: standby or live
message | string | Status description message

speaker_renamed - Speaker Renamed

Sent when a global speaker rename completes.

{
  "type": "voice-translation",
  "data": {
    "action": "speaker_renamed",
    "speaker_id": "Guest-1",
    "new_label": "Manager Wang",
    "affected_sids": [1, 3, 5, 8]
  }
}
Field | Type | Description
speaker_id | string | The resolved original speaker ID (even if the input was a display label, the event returns the original ID)
new_label | string | The new display label
affected_sids | array of int | The list of affected sentence numbers

speaker_reassigned - Speaker Identity Changed

Sent when the speaker identity of a single sentence has been changed.

{
  "type": "voice-translation",
  "data": {
    "action": "speaker_reassigned",
    "sid": 5,
    "old_speaker_id": "Guest-1",
    "new_speaker_id": "Guest-2",
    "new_speaker_label": "Lee Hsiao-hua"
  }
}
Field | Type | Description
sid | int | The changed sentence number
old_speaker_id | string | The original speaker ID
new_speaker_id | string | The new original speaker ID
new_speaker_label | string | The new speaker display label (after applying speaker_aliases; equals new_speaker_id when no alias exists)

speakers_merged - Speakers Merged

Speaker merge completion event. After the merge, future recognition results from the source speaker are automatically attributed to the target speaker.

{
  "type": "voice-translation",
  "data": {
    "action": "speakers_merged",
    "source_speaker_id": "Guest-2",
    "target_speaker_id": "Guest-1",
    "target_speaker_label": "Manager Wang",
    "affected_sids": [3, 5, 7]
  }
}
Field | Type | Description
source_speaker_id | string | The original ID of the merged speaker
target_speaker_id | string | The original ID of the merge target
target_speaker_label | string | The target speaker display label (after applying speaker_aliases; equals the original ID when no alias exists)
affected_sids | array of int | The list of affected sentence numbers
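The three speaker events (speaker_renamed, speaker_reassigned, speakers_merged) each identify the sentences to update, so a client can apply them to its local transcript with one handler. This is a sketch under the assumption that the client stores, per sid, a dictionary with speaker_id and speaker_label keys; that storage shape and the function name are illustrative.

```python
def apply_speaker_event(sentences: dict, data: dict) -> None:
    """Update locally cached sentences (keyed by sid) from a speaker event."""
    action = data.get("action")
    if action == "speaker_renamed":
        # Only the display label changes; the original speaker ID stays.
        for sid in data["affected_sids"]:
            if sid in sentences:
                sentences[sid]["speaker_label"] = data["new_label"]
    elif action == "speaker_reassigned":
        sentence = sentences.get(data["sid"])
        if sentence is not None:
            sentence["speaker_id"] = data["new_speaker_id"]
            sentence["speaker_label"] = data["new_speaker_label"]
    elif action == "speakers_merged":
        for sid in data["affected_sids"]:
            if sid in sentences:
                sentences[sid]["speaker_id"] = data["target_speaker_id"]
                sentences[sid]["speaker_label"] = data["target_speaker_label"]
```

Sentence numbers listed in affected_sids that the client has not cached are skipped rather than treated as errors.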

language_switch_start - Language Switch Started

Language switch start event, sent after the switch_language action is triggered.

{
  "type": "voice-translation",
  "data": {
    "action": "language_switch_start",
    "translation_language": "ja-JP",
    "total_segments": 15,
    "message": "Starting language switch and retranslation"
  }
}
Field | Type | Description
translation_language | string | The new translation target language
total_segments | int | The number of sentences that need retranslation
message | string | Status description

batch_retranslation - Batch Retranslation Result

Batch retranslation result event, sent sentence by sentence during the language switch process.

{
  "type": "voice-translation",
  "data": {
    "action": "batch_retranslation",
    "sid": 3,
    "translations": {
      "ja-JP": {
        "sid": 3,
        "text": "今日はプロジェクトの進捗について話し合いましょう",
        "is_final": true,
        "is_retranslation": true
      }
    }
  }
}
Field | Type | Description
sid | int | Sentence number
translations | object | Translation results (same format as result's translations)

language_switch_done - Language Switch Complete

Language switch completion event.

{
  "type": "voice-translation",
  "data": {
    "action": "language_switch_done",
    "translation_language": "ja-JP",
    "success_count": 15,
    "failed_count": 0,
    "message": "Language switch complete"
  }
}
Field | Type | Description
translation_language | string | The translation target language
success_count | int | The number of successfully translated sentences
failed_count | int | The number of sentences that failed to translate
message | string | Status description
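The language_switch_start, batch_retranslation, and language_switch_done events form a sequence that a client can use to show retranslation progress. The sketch below assumes each batch_retranslation event corresponds to one sentence, as the description above suggests; the class name and attributes are hypothetical.

```python
class LanguageSwitchProgress:
    """Track progress of a language switch from its three events."""

    def __init__(self) -> None:
        self.language = None
        self.total = 0
        self.received = 0
        self.failed = 0
        self.done = False

    def handle(self, data: dict) -> None:
        action = data.get("action")
        if action == "language_switch_start":
            self.language = data["translation_language"]
            self.total = data["total_segments"]
            self.received = 0
            self.failed = 0
            self.done = False
        elif action == "batch_retranslation":
            self.received += 1
        elif action == "language_switch_done":
            self.failed = data["failed_count"]
            self.done = True

    def fraction(self) -> float:
        """Completed share in [0, 1]; 1.0 when nothing needs retranslation."""
        if self.done or self.total == 0:
            return 1.0
        return min(self.received / self.total, 1.0)
```

A progress bar can read fraction() after every event, and failed tells the UI whether to flag sentences that could not be retranslated.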

tts_mode_changed - TTS Mode Changed

TTS playback mode change event.

{
  "type": "voice-translation",
  "data": {
    "action": "tts_mode_changed",
    "tts_mode": "async"
  }
}
Field | Type | Description
tts_mode | string | The new mode: sync or async

language_switched - Two-Way Language Switch Complete

Two-way mode (conversation) language switch completion event. Triggered after switch_language successfully switches the STT source language in two-way mode.

{
  "type": "voice-translation",
  "data": {
    "action": "language_switched",
    "language": "en-US",
    "translation_language": "zh-TW",
    "message": "Language switched"
  }
}
Field | Type | Description
language | string | The new active language (STT source)
translation_language | string | The new translation target language
message | string | Status message

tts_updated - Two-Way TTS Settings Updated

Two-way mode (conversation) TTS settings update event. Triggered after set_tts successfully updates the TTS toggle or voice settings.

{
  "type": "voice-translation",
  "data": {
    "action": "tts_updated",
    "tts_enabled": true,
    "tts_config": {
      "zh-TW": { "voice": "zh-TW-HsiaoChenNeural", "speaking_rate": 1.0 },
      "en-US": { "voice": "en-US-GuyNeural", "speaking_rate": 1.2 }
    }
  }
}
Field | Type | Description
tts_enabled | boolean | Whether TTS is enabled
tts_config | object | The TTS settings for each language, keyed by language code (voice, speaking_rate)

conversation_mode_changed - Conversation Mode Changed

Conversation mode change event for two-way mode (conversation). Triggered after switch_conversation_mode successfully switches between auto and manual mode.

{
  "type": "voice-translation",
  "data": {
    "action": "conversation_mode_changed",
    "conversation_mode": "manual"
  }
}
Field | Type | Description
conversation_mode | string | The new conversation mode: auto or manual

speaker_language_changed - User Language Changed

User language change event for two-way mode (conversation). Triggered after set_speaker_language successfully changes a user's language. The event carries the complete language mapping after the change.

{
  "type": "voice-translation",
  "data": {
    "action": "speaker_language_changed",
    "speaker_language_map": {
      "1": "ja-JP",
      "2": "en-US"
    }
  }
}
Field | Type | Description
speaker_language_map | object | The user language mapping after the change (keys are user number strings)
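Because the keys of speaker_language_map are user numbers encoded as strings, a client that indexes users by integer may want to convert them on receipt. A minimal sketch, with the helper name chosen for illustration:

```python
def parse_speaker_language_map(data: dict) -> dict:
    """Convert the string user-number keys of speaker_language_map to ints."""
    return {int(user): language for user, language in data["speaker_language_map"].items()}
```

Since the event carries the complete mapping, the result can replace the client's previous mapping rather than being merged into it.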

segment_uploaded - Audio Segment Upload Complete

Audio segment upload completion event. Triggered each time an audio segment is successfully uploaded to cloud storage; can be used to show upload progress on the frontend.

{
  "type": "voice-translation",
  "data": {
    "action": "segment_uploaded",
    "segment_index": 0,
    "duration_sec": 30.5
  }
}
Field | Type | Description
segment_index | number | Segment index (starting from 0)
duration_sec | number | The duration of this segment (seconds)
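One way to show upload progress is to sum the durations reported so far. The sketch below assumes the client keeps the data objects of the events it has received in a list; the function name is illustrative.

```python
def total_uploaded_seconds(events: list) -> float:
    """Sum duration_sec over all segment_uploaded events seen so far."""
    return sum(
        event["duration_sec"]
        for event in events
        if event.get("action") == "segment_uploaded"
    )
```

Comparing this total with the elapsed recording time gives a rough indication of how much audio is still waiting to be uploaded.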

stt_event - STT Connection Status Event

STT connection status event. Triggered when the connection status of the speech recognition service changes; can be used to show the STT service status on the frontend.

{
  "type": "voice-translation",
  "data": {
    "action": "stt_event",
    "event": "connected",
    "message": "STT service connected"
  }
}
Field | Type | Description
event | string | Event type: connected / disconnected / error
message | string | Event description message

error - Error Event

Triggered when an operation fails or a system anomaly occurs.

{
  "type": "error",
  "data": {
    "error_code": "session_not_started",
    "severity": "error",
    "message": "Session not started",
    "context": "voice-translation",
    "request_id": "req_abc123xyz789",
    "timestamp": "2026-01-15T10:30:45.123Z"
  }
}
Field | Type | Description
error_code | string | Error code (for programmatic handling)
severity | string | Severity: fatal / error / warning
message | string | Human-readable error message
context | string | Error source category
request_id | string | Request tracking ID
timestamp | string | Time the error occurred (ISO 8601)

Severity Descriptions

Severity | Description | Recommended Action
fatal | Fatal error | Stop the service and require reconnection
error | Operation failed | Show an error notice and allow retry
warning | Warning | Show a warning without blocking the operation
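The recommended actions in the table above map naturally onto a small dispatch function. In this sketch the returned action names (reconnect, notify_and_retry, warn, ignore) are hypothetical labels for the client's own handlers, and treating an unrecognized severity like error is an assumption rather than documented behavior.

```python
SEVERITY_ACTIONS = {
    "fatal": "reconnect",          # stop the service and require reconnection
    "error": "notify_and_retry",   # show an error notice and allow retry
    "warning": "warn",             # show a warning without blocking
}


def action_for_message(msg: dict) -> str:
    """Pick a client-side action for an incoming message based on severity."""
    if msg.get("type") != "error":
        # Error events use type "error" instead of "voice-translation".
        return "ignore"
    severity = msg.get("data", {}).get("severity")
    # Assumption: unknown severities are handled like "error".
    return SEVERITY_ACTIONS.get(severity, "notify_and_retry")
```

Logging request_id and error_code alongside the chosen action makes it easier to correlate client reports with server-side traces.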

For the full list of error codes, refer to Error Code Reference.


Version: V1.5.7 Last Updated: 2026-05-20

Copyright © 2026