WebSocket API

Events

Overview

A reference for all response event formats you may receive over WebSocket. For connection and authentication, see Connection and Authentication; for request operations, see Voice Translation Actions.


Table of Contents

  1. session_started - Session started successfully
  2. result - Recognition/translation result
  3. status - Generic status response
  4. task_complete - Task processing complete
  5. config_updated - Configuration update complete
  6. tts_ready - TTS audio ready
  7. tts_error - TTS synthesis failed
  8. viewer_count - Viewer count update
  9. broadcast_phase_changed - Broadcast phase changed
  10. speaker_renamed - Speaker renamed
  11. speaker_reassigned - Speaker identity reassigned
  12. speakers_merged - Speakers merged
  13. language_switch_start - Language switch started
  14. batch_retranslation - Batch retranslation result
  15. language_switch_done - Language switch complete
  16. tts_mode_changed - TTS mode changed
  17. language_switched - Conversation language switch complete
  18. tts_updated - Conversation TTS settings updated
  19. conversation_mode_changed - Conversation mode changed
  20. speaker_language_changed - Speaker language changed
  21. segment_uploaded - Audio segment upload complete
  22. stt_event - STT connection status event
  23. viewer_joined - Viewer joined event
  24. viewer_left - Viewer left event
  25. error - Error event
  26. upload_error - Upload error
  27. summary_done - Summary generation complete
  28. summary_error - Summary generation failed

session_started

Description

After a start action succeeds, the server returns an event containing the complete initial session information. The frontend can use recording_type to distinguish the recording type.

Standard recording (transcribe / conversation / record)

{
  "type": "voice-translation",
  "data": {
    "action": "session_started",
    "session_id": "550e8400-e29b-41d4-a716-446655440000",
    "task_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
    "recording_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
    "recording_type": "transcribe",
    "recognition_mode": "single",
    "message": "Speech recognition started"
  }
}

Broadcast mode (broadcast)

{
  "type": "voice-translation",
  "data": {
    "action": "session_started",
    "session_id": "550e8400-e29b-41d4-a716-446655440000",
    "task_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
    "recording_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
    "recording_type": "broadcast",
    "recognition_mode": "multi_speaker",
    "phase": "standby",
    "viewer_count": 0,
    "queue_count": 0,
    "peak_viewers": 0,
    "total_viewers": 0,
    "message": "Speech recognition started"
  }
}

Field descriptions

| Field | Type | Description |
| --- | --- | --- |
| session_id | string | Session ID (WS connection scope; invalid once the connection ends) |
| task_id | string | Task ID (the same identifier as REST /api/v1/tasks/{taskId} and Webhook data.task_id) |
| recording_id | string | Deprecated (since V1.4.1): same value as task_id, will be removed in V2.0.0; use task_id instead |
| recording_type | string | Recording type: transcribe, conversation, record, broadcast |
| recognition_mode | string | Recognition mode: single, multi_speaker |
| message | string | Status description message |
| phase | string | Broadcast phase: standby or live (broadcast mode only) |
| viewer_count | int | Current number of online viewers (broadcast mode only) |
| queue_count | int | Number of viewers waiting in the queue (broadcast mode only) |
| peak_viewers | int | Peak viewer count for this broadcast (broadcast mode only) |
| total_viewers | int | Cumulative total of viewers that have ever connected (broadcast mode only) |

ID alignment tip: The WebSocket, REST, and Webhook interfaces all use task_id (a UUID) as the unified identifier for a task. session_id is a WS connection-scope identifier and is a different concept from a task. recording_id is the old name used before V1.4.1; its value is exactly the same as task_id and it is retained only for backward compatibility.
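
As a minimal sketch of how a client might read this event, the snippet below parses a raw frame, checks for session_started, and resolves the task identifier. The names SessionStartedData, parseSessionStarted, resolveTaskId, and isBroadcast are illustrative and not part of the API; falling back to recording_id when task_id is absent is an assumption, intended only for clients that may still receive the pre-V1.4.1 field name.

```typescript
// Hypothetical client-side type; field names follow the table above.
interface SessionStartedData {
  action: "session_started";
  session_id: string;
  task_id?: string;
  recording_id?: string; // deprecated alias of task_id
  recording_type: "transcribe" | "conversation" | "record" | "broadcast";
  recognition_mode: "single" | "multi_speaker";
  message: string;
  // Broadcast-only fields.
  phase?: "standby" | "live";
  viewer_count?: number;
  queue_count?: number;
  peak_viewers?: number;
  total_viewers?: number;
}

// Returns the payload if the raw frame is a session_started event, else null.
function parseSessionStarted(raw: string): SessionStartedData | null {
  const msg = JSON.parse(raw) as { type?: string; data?: { action?: string } };
  if (msg.type !== "voice-translation" || msg.data === undefined) {
    return null;
  }
  if (msg.data.action !== "session_started") {
    return null;
  }
  return msg.data as SessionStartedData;
}

// Prefer task_id; fall back to the deprecated recording_id (assumed fallback).
function resolveTaskId(data: SessionStartedData): string | undefined {
  return data.task_id ?? data.recording_id;
}

// Broadcast sessions carry the extra phase and viewer fields.
function isBroadcast(data: SessionStartedData): boolean {
  return data.recording_type === "broadcast";
}
```

Keeping session_id and the task identifier in separate variables avoids accidentally passing a connection-scope ID to REST endpoints that expect task_id.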


result

Description

Speech recognition and translation results. A single result event may contain origin (the recognition result) and/or translations (the translation results).

origin (speech recognition result)

{
  "type": "voice-translation",
  "data": {
    "action": "result",
    "origin": {
      "sid": 1,
      "language": "zh-TW",
      "text": "Hello, nice to meet you",
      "is_final": true,
      "speaker_id": "0",
      "detected_language": "zh-TW",
      "start_time": "00:05"
    }
  }
}

origin field descriptions

| Field | Type | Description |
| --- | --- | --- |
| sid | int | Sentence number, starting from 1 |
| language | string | Source language code. In conversation mode, this is the automatically detected language |
| text | string | The recognized text |
| is_final | boolean | Whether this is the final result |
| speaker_id | string | Original speaker ID |
| speaker_label | string | (Multi-speaker mode) Display label (after applying the alias; equals speaker_id when no alias exists) |
| detected_language | string | The detected language. In conversation mode, determined automatically by the system |
| start_time | string | Sentence start time (mm:ss); not sent during the broadcast standby phase, and counted from 00:00 once live begins |

translations (translation result)

{
  "type": "voice-translation",
  "data": {
    "action": "result",
    "translations": {
      "en-US": {
        "sid": 1,
        "text": "Hello, nice to meet you",
        "is_final": true
      }
    }
  }
}

translations field descriptions

Translation results are keyed by language code; each language's translation object contains:

| Field | Type | Description |
| --- | --- | --- |
| sid | int | Sentence number |
| text | string | The translated text |
| is_final | boolean | Whether this is the final result |
| is_retranslation | boolean | Whether this is a retranslation result (only for retranslate) |
| speaker_id | string | (Multi-speaker mode) Original speaker ID (aligned with origin since v1.5.3) |
| speaker_label | string | (Multi-speaker mode) Display label (after applying the alias; equals speaker_id when no alias exists) |

Important: The success response for the retranslate action uses a separate action: "translation" event (not result); the payload structure is the same as the table above. See voice-translation.md retranslate success response.
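
Because origin and translations can arrive in separate result events, a client usually needs to join them by sid. The following is a minimal sketch of one way to do that; TranscriptLine and applyResult are hypothetical names, and the rule that a later event for the same sid overwrites earlier interim text is an assumption inferred from is_final rather than something this reference states.

```typescript
// Only the fields used by this sketch are declared.
interface OriginPart {
  sid: number;
  language: string;
  text: string;
  is_final: boolean;
  speaker_id?: string;
  speaker_label?: string;
  start_time?: string;
}

interface TranslationPart {
  sid: number;
  text: string;
  is_final: boolean;
  is_retranslation?: boolean;
}

interface ResultData {
  action: "result";
  origin?: OriginPart;
  translations?: Record<string, TranslationPart>;
}

// Hypothetical client-side view of one sentence.
interface TranscriptLine {
  sid: number;
  originText: string;
  originFinal: boolean;
  speaker: string | undefined;
  translations: Map<string, TranslationPart>; // keyed by language code
}

// Applies one result event to a transcript keyed by sentence number.
function applyResult(lines: Map<number, TranscriptLine>, data: ResultData): void {
  const lineFor = (sid: number): TranscriptLine => {
    let line = lines.get(sid);
    if (line === undefined) {
      line = { sid, originText: "", originFinal: false, speaker: undefined, translations: new Map() };
      lines.set(sid, line);
    }
    return line;
  };

  if (data.origin !== undefined) {
    const line = lineFor(data.origin.sid);
    // Assumption: newer text for the same sid replaces the interim text.
    line.originText = data.origin.text;
    line.originFinal = data.origin.is_final;
    // speaker_label already has the alias applied; fall back to the raw ID.
    line.speaker = data.origin.speaker_label ?? data.origin.speaker_id;
  }

  const translations: Record<string, TranslationPart> = data.translations ?? {};
  for (const [lang, part] of Object.entries(translations)) {
    lineFor(part.sid).translations.set(lang, part);
  }
}
```

Keying the store by sid means a translation that arrives before or after its origin text still lands on the same line.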


status

Description

A generic status response, used to confirm operations such as pause, resume, stop, and set_name.

{
  "type": "voice-translation",
  "data": {
    "action": "status",
    "message": "Speech recognition paused"
  }
}

Field descriptions

| Field | Type | Description |
| --- | --- | --- |
| message | string | Status description |

task_complete

Description

Triggered after stop, once the audio file and transcript have finished uploading. Use task_id to query the task details through the REST API afterwards.

{
  "type": "voice-translation",
  "data": {
    "action": "task_complete",
    "task_id": "550e8400-e29b-41d4-a716-446655440000",
    "message": "Task processing complete"
  }
}

Field descriptions

| Field | Type | Description |
| --- | --- | --- |
| task_id | string | Recording UUID, usable for subsequent API queries |
| message | string | Status description |

config_updated

Description

Configuration update complete event, triggered after a config action succeeds.

{
  "type": "voice-translation",
  "data": {
    "action": "config_updated",
    "updated": ["terminology", "fuzzy_correction", "translation_dict"],
    "message": "Configuration updated"
  }
}

Field descriptions

| Field | Type | Description |
| --- | --- | --- |
| updated | string[] | The configuration types that were updated (terminology, fuzzy_correction, translation_dict) |
| terminology_effective | string | (Optional) If the terminology is updated during recording, a value of "next_turn" indicates the new terminology takes effect from the next sentence; this field does not appear in the initial config |
| message | string | Status message |

tts_ready

Description

TTS speech synthesis complete event. It contains the audio data and Word Boundary information (which can be used for a karaoke effect).

{
  "type": "voice-translation",
  "data": {
    "action": "tts_ready",
    "sid": 1,
    "language": "en-US",
    "transcript": "Hello, nice to meet you",
    "text": "Hello, nice to meet you",
    "audio": "Base64EncodedMP3...",
    "format": "mp3",
    "duration_ms": 2500,
    "boundaries": [
      {"offset_ms": 0, "duration_ms": 350, "text_offset": 0, "word_length": 5, "text": "Hello", "boundary_type": "WordBoundary"},
      {"offset_ms": 350, "duration_ms": 100, "text_offset": 5, "word_length": 1, "text": ",", "boundary_type": "PunctuationBoundary"},
      {"offset_ms": 500, "duration_ms": 250, "text_offset": 7, "word_length": 4, "text": "nice", "boundary_type": "WordBoundary"},
      {"offset_ms": 750, "duration_ms": 200, "text_offset": 12, "word_length": 2, "text": "to", "boundary_type": "WordBoundary"},
      {"offset_ms": 950, "duration_ms": 350, "text_offset": 15, "word_length": 4, "text": "meet", "boundary_type": "WordBoundary"},
      {"offset_ms": 1300, "duration_ms": 300, "text_offset": 20, "word_length": 3, "text": "you", "boundary_type": "WordBoundary"}
    ]
  }
}

Field descriptions

| Field | Type | Description |
| --- | --- | --- |
| sid | int | Sentence number |
| language | string | TTS language |
| transcript | string | Original transcript (STT recognition result) |
| text | string | Translated text (the source for TTS synthesis) |
| audio | string | Base64-encoded MP3 audio |
| format | string | Audio format (always mp3) |
| duration_ms | int | Total audio duration (milliseconds) |
| boundaries | array | Word Boundary array |

Word Boundary field descriptions

| Field | Type | Description |
| --- | --- | --- |
| offset_ms | int | The word's start time within the audio (milliseconds) |
| duration_ms | int | The word's duration (milliseconds) |
| text_offset | int | The position within the original string (character index) |
| word_length | int | Word length (number of characters) |
| text | string | The word content |
| boundary_type | string | Boundary type; common values: WordBoundary, PunctuationBoundary, SentenceBoundary, etc. |
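
For the karaoke effect mentioned above, a player needs to know which boundary is active at the current playback position. The sketch below shows one possible lookup; activeBoundaryIndex and highlightSlices are hypothetical helpers. It assumes text_offset indexes into the text field of the same tts_ready event and that the character index matches JavaScript string indexing, neither of which this reference states explicitly.

```typescript
interface WordBoundary {
  offset_ms: number;
  duration_ms: number;
  text_offset: number;
  word_length: number;
  text: string;
  boundary_type: string;
}

// Returns the index of the boundary being spoken at playback time tMs,
// or -1 if tMs falls before the first boundary, after the last one,
// or in a gap between two boundaries.
function activeBoundaryIndex(boundaries: WordBoundary[], tMs: number): number {
  return boundaries.findIndex(
    (b) => tMs >= b.offset_ms && tMs < b.offset_ms + b.duration_ms
  );
}

// Splits the synthesized text into [before, current, after] so the UI can
// style the current word differently.
function highlightSlices(text: string, b: WordBoundary): [string, string, string] {
  const end = b.text_offset + b.word_length;
  return [text.slice(0, b.text_offset), text.slice(b.text_offset, end), text.slice(end)];
}
```

A player would typically call activeBoundaryIndex from its time-update callback, converting the current playback time from seconds to milliseconds first.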

tts_error

Description

TTS synthesis failed event.

{
  "type": "voice-translation",
  "data": {
    "action": "tts_error",
    "sid": 1,
    "language": "en-US",
    "error": "translation_not_found",
    "message": "No translation available for language: en-US"
  }
}

Field descriptions

| Field | Type | Description |
| --- | --- | --- |
| sid | int | Sentence number |
| language | string | TTS language |
| error | string | Error code |
| message | string | Error message |
| transcript | string | (Optional) The corresponding original transcript, to help the frontend locate the point of failure |

TTS error codes

| Error code | Description |
| --- | --- |
| translation_not_found | No translation found for this language |
| tts_synthesis_failed | TTS synthesis failed |
| tts_quota_exceeded | TTS usage has reached its limit |

viewer_count

Broadcast mode only

Description

While a broadcast is in progress, the system checks the viewer count every 3 seconds and pushes this event to the host whenever it changes.

{
  "type": "voice-translation",
  "data": {
    "action": "viewer_count",
    "viewer_count": 45,
    "queue_count": 8,
    "peak_viewers": 50,
    "total_viewers": 123
  }
}

Field descriptions

| Field | Type | Description |
| --- | --- | --- |
| viewer_count | int | Current number of online viewers |
| queue_count | int | Number of viewers waiting in the queue |
| peak_viewers | int | Peak viewer count for this broadcast |
| total_viewers | int | Cumulative total of viewers that have ever connected |

Note: This event is pushed only when the viewer count or queue count changes, to avoid unnecessary message transmission.


broadcast_phase_changed

Description

Triggered when the broadcast phase switches from standby to live.

{
  "type": "voice-translation",
  "data": {
    "action": "broadcast_phase_changed",
    "phase": "live",
    "message": "Broadcast started"
  }
}

Field descriptions

| Field | Type | Description |
| --- | --- | --- |
| phase | string | The new phase: standby or live |
| message | string | Status description message |

speaker_renamed

Description

Global speaker rename complete event.

{
  "type": "voice-translation",
  "data": {
    "action": "speaker_renamed",
    "speaker_id": "Guest-1",
    "new_label": "Manager Wang",
    "affected_sids": [1, 3, 5, 8]
  }
}

Field descriptions

| Field | Type | Description |
| --- | --- | --- |
| speaker_id | string | The resolved original speaker ID (even if the input was a display label, the event returns the original ID) |
| new_label | string | The new display label |
| affected_sids | int[] | List of affected sentence numbers |

speaker_reassigned

Description

Single-sentence speaker reassignment complete event.

{
  "type": "voice-translation",
  "data": {
    "action": "speaker_reassigned",
    "sid": 5,
    "old_speaker_id": "Guest-1",
    "new_speaker_id": "Guest-2",
    "new_speaker_label": "Lisa Lee"
  }
}

Field descriptions

| Field | Type | Description |
| --- | --- | --- |
| sid | int | The sentence number that was changed |
| old_speaker_id | string | The original speaker ID |
| new_speaker_id | string | The new original speaker ID |
| new_speaker_label | string | The new speaker display label (after applying speaker_aliases; equals new_speaker_id when no alias exists) |

speakers_merged

Description

Speaker merge complete event. After the merge, future recognition results produced by the source speaker are also automatically converted to the target speaker.

{
  "type": "voice-translation",
  "data": {
    "action": "speakers_merged",
    "source_speaker_id": "Guest-2",
    "target_speaker_id": "Guest-1",
    "affected_sids": [3, 5, 7]
  }
}

Field descriptions

| Field | Type | Description |
| --- | --- | --- |
| source_speaker_id | string | The original ID of the merged-away speaker |
| target_speaker_id | string | The original ID of the merge target speaker |
| affected_sids | int[] | List of affected sentence numbers |

To obtain the target speaker's display label, query speaker_aliases or the next init_metadata event.
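
The three speaker events (speaker_renamed, speaker_reassigned, speakers_merged) can be folded into one piece of client state. The following is a sketch under assumptions: SpeakerState, applySpeakerEvent, and labelFor are hypothetical names, and the client is assumed to keep its own map from sid to original speaker ID plus a map from speaker ID to display label.

```typescript
interface SpeakerRenamedData {
  action: "speaker_renamed";
  speaker_id: string;
  new_label: string;
  affected_sids: number[];
}

interface SpeakerReassignedData {
  action: "speaker_reassigned";
  sid: number;
  old_speaker_id: string;
  new_speaker_id: string;
  new_speaker_label: string;
}

interface SpeakersMergedData {
  action: "speakers_merged";
  source_speaker_id: string;
  target_speaker_id: string;
  affected_sids: number[];
}

type SpeakerEvent = SpeakerRenamedData | SpeakerReassignedData | SpeakersMergedData;

// Hypothetical client-side state.
interface SpeakerState {
  speakerBySid: Map<number, string>; // sid -> original speaker ID
  labels: Map<string, string>; // original speaker ID -> display label
}

function applySpeakerEvent(state: SpeakerState, ev: SpeakerEvent): void {
  if (ev.action === "speaker_renamed") {
    // A rename is global, so updating the label map covers every sentence.
    state.labels.set(ev.speaker_id, ev.new_label);
  } else if (ev.action === "speaker_reassigned") {
    state.speakerBySid.set(ev.sid, ev.new_speaker_id);
    state.labels.set(ev.new_speaker_id, ev.new_speaker_label);
  } else {
    // speakers_merged: move the listed sentences to the target speaker.
    // Later results are converted by the server, so no extra rule is kept here.
    for (const sid of ev.affected_sids) {
      state.speakerBySid.set(sid, ev.target_speaker_id);
    }
  }
}

// Display label for a sentence; falls back to the original ID when no label is known.
function labelFor(state: SpeakerState, sid: number): string | undefined {
  const id = state.speakerBySid.get(sid);
  if (id === undefined) {
    return undefined;
  }
  return state.labels.get(id) ?? id;
}
```

Storing labels per speaker rather than per sentence keeps a rename to a single map update, regardless of how many sentences are listed in affected_sids.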


language_switch_start

Description

Language switch started event, sent after the switch_language action is triggered.

{
  "type": "voice-translation",
  "data": {
    "action": "language_switch_start",
    "translation_language": "ja-JP",
    "total_segments": 15,
    "message": "Starting language switch and retranslation"
  }
}

Field descriptions

| Field | Type | Description |
| --- | --- | --- |
| translation_language | string | The new translation target language |
| total_segments | int | The number of sentences to retranslate |
| message | string | Status description |

batch_retranslation

Description

Batch retranslation result event, sent sentence by sentence during the language switch process.

{
  "type": "voice-translation",
  "data": {
    "action": "batch_retranslation",
    "sid": 3,
    "translations": {
      "ja-JP": {
        "sid": 3,
        "text": "今日はプロジェクトの進捗について話し合いましょう",
        "is_final": true,
        "is_retranslation": true
      }
    }
  }
}

Field descriptions

| Field | Type | Description |
| --- | --- | --- |
| sid | int | Sentence number |
| translations | object | Translation result (same format as the translations in result) |

language_switch_done

Description

Language switch complete event.

{
  "type": "voice-translation",
  "data": {
    "action": "language_switch_done",
    "translation_language": "ja-JP",
    "success_count": 13,
    "failed_count": 2,
    "failed_sids": [3, 7],
    "message": "Language switch complete"
  }
}

Field descriptions

| Field | Type | Description |
| --- | --- | --- |
| translation_language | string | The translation target language |
| success_count | int | The number of sentences successfully translated |
| failed_count | int | The number of sentences that failed to translate |
| failed_sids | int[] | List of sentence numbers that failed to translate (included only when failed_count > 0) |
| message | string | Status description |
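
The language_switch_start, batch_retranslation, and language_switch_done events together describe one retranslation run, which a client can track for a progress indicator. The sketch below is one possible reducer; SwitchProgress, reduceSwitch, and progressRatio are hypothetical names, and ignoring batch_retranslation events that arrive when no switch is being tracked is an assumption.

```typescript
// Only the fields used by this sketch are declared.
type SwitchEvent =
  | { action: "language_switch_start"; translation_language: string; total_segments: number }
  | { action: "batch_retranslation"; sid: number }
  | {
      action: "language_switch_done";
      translation_language: string;
      success_count: number;
      failed_count: number;
      failed_sids?: number[];
    };

interface SwitchProgress {
  language: string;
  total: number;
  received: Set<number>; // sids retranslated so far
  done: boolean;
  failedSids: number[];
}

function reduceSwitch(p: SwitchProgress | null, ev: SwitchEvent): SwitchProgress | null {
  if (ev.action === "language_switch_start") {
    return {
      language: ev.translation_language,
      total: ev.total_segments,
      received: new Set<number>(),
      done: false,
      failedSids: [],
    };
  }
  if (p === null) {
    return null; // no switch in progress (assumed safe to ignore)
  }
  if (ev.action === "batch_retranslation") {
    p.received.add(ev.sid);
    return p;
  }
  // language_switch_done: failed_sids is present only when failed_count > 0.
  return { ...p, done: true, failedSids: ev.failed_sids ?? [] };
}

// Fraction of sentences retranslated so far, in the range 0 to 1.
function progressRatio(p: SwitchProgress): number {
  return p.total === 0 ? 1 : Math.min(1, p.received.size / p.total);
}
```

Using a set of sids rather than a counter keeps the ratio stable if the same sentence is delivered more than once.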

tts_mode_changed

Description

TTS playback mode changed event.

{
  "type": "voice-translation",
  "data": {
    "action": "tts_mode_changed",
    "tts_mode": "async"
  }
}

Field descriptions

| Field | Type | Description |
| --- | --- | --- |
| tts_mode | string | The new mode: sync or async |

language_switched

Description

Conversation-mode (conversation) language switch complete event. Triggered when switch_language successfully switches the STT source language in conversation mode.

{
  "type": "voice-translation",
  "data": {
    "action": "language_switched",
    "language": "en-US",
    "translation_language": "zh-TW",
    "message": "Language switched"
  }
}

Field descriptions

| Field | Type | Description |
| --- | --- | --- |
| language | string | The new active language (STT source) |
| translation_language | string | The new translation target language |
| message | string | Status message |

tts_updated

Description

Conversation-mode (conversation) TTS settings updated event. Triggered when set_tts successfully updates the TTS toggle or voice settings.

{
  "type": "voice-translation",
  "data": {
    "action": "tts_updated",
    "tts_enabled": true,
    "tts_config": {
      "zh-TW": { "voice": "zh-TW-HsiaoChenNeural", "speaking_rate": 1.0 },
      "en-US": { "voice": "en-US-GuyNeural", "speaking_rate": 1.2 }
    }
  }
}

Field descriptions

| Field | Type | Description |
| --- | --- | --- |
| tts_enabled | boolean | Whether TTS is enabled |
| tts_config | object | The TTS settings for each language (voice, speaking_rate) |

conversation_mode_changed

Description

Conversation-mode (conversation) mode changed event. Triggered when switch_conversation_mode successfully switches between auto and manual mode.

{
  "type": "voice-translation",
  "data": {
    "action": "conversation_mode_changed",
    "conversation_mode": "manual"
  }
}

Field descriptions

| Field | Type | Description |
| --- | --- | --- |
| conversation_mode | string | The new conversation mode: auto or manual |

speaker_language_changed

Description

Conversation-mode (conversation) speaker language changed event. Triggered when set_speaker_language successfully changes a speaker's language; it includes the complete language map after the change.

{
  "type": "voice-translation",
  "data": {
    "action": "speaker_language_changed",
    "speaker_language_map": {
      "1": "ja-JP",
      "2": "en-US"
    }
  }
}

Field descriptions

| Field | Type | Description |
| --- | --- | --- |
| speaker_language_map | object | The speaker language map after the change (the key is the speaker number as a string) |

segment_uploaded

Description

Audio segment upload complete event. Triggered whenever an audio segment is successfully uploaded to cloud storage; it can be used to display upload progress in the frontend.

{
  "type": "voice-translation",
  "data": {
    "action": "segment_uploaded",
    "segment_index": 0,
    "duration_sec": 30.5
  }
}

Field descriptions

| Field | Type | Description |
| --- | --- | --- |
| segment_index | number | Segment index (starting from 0) |
| duration_sec | number | The duration of this segment (seconds) |

stt_event

Description

STT connection status event. Triggered when the connection status of the speech recognition service changes; it can be used to display the STT service status in the frontend.

{
  "type": "voice-translation",
  "data": {
    "action": "stt_event",
    "event": "connected",
    "message": "STT service connected"
  }
}

Field descriptions

| Field | Type | Description |
| --- | --- | --- |
| event | string | Event type: connected / disconnected / error |
| message | string | Event description message |

viewer_joined

Description

Viewer joined event (broadcast mode only). When a viewer joins the broadcast, the host receives this event.

{
  "type": "voice-translation",
  "data": {
    "action": "viewer_joined",
    "viewer": {
      "id": "viewer_abc123",
      "ip": "192.168.1.100",
      "language": "zh-TW"
    },
    "viewer_count": 5,
    "queue_count": 2
  }
}

Field descriptions

| Field | Type | Description |
| --- | --- | --- |
| viewer | object | Information about the viewer who joined |
| viewer.id | string | Viewer ID |
| viewer.ip | string | Viewer IP address |
| viewer.language | string | The language the viewer selected |
| viewer_count | number | Current viewer count |
| queue_count | number | Number of viewers in the queue |

viewer_left

Description

Viewer left event (broadcast mode only). When a viewer leaves the broadcast, the host receives this event.

{
  "type": "voice-translation",
  "data": {
    "action": "viewer_left",
    "viewer_id": "viewer_abc123",
    "viewer_count": 4,
    "queue_count": 1
  }
}

Field descriptions

| Field | Type | Description |
| --- | --- | --- |
| viewer_id | string | The ID of the viewer who left |
| viewer_count | number | Current viewer count |
| queue_count | number | Number of viewers in the queue |
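
A host client that wants to show a viewer list can combine viewer_joined and viewer_left into a single roster. This is a minimal sketch with hypothetical names (ViewerRoster, applyViewerEvent); it assumes the counts carried by each event are authoritative and simply copies them, rather than deriving them from the roster size.

```typescript
interface ViewerInfo {
  id: string;
  ip: string;
  language: string;
}

type ViewerEvent =
  | { action: "viewer_joined"; viewer: ViewerInfo; viewer_count: number; queue_count: number }
  | { action: "viewer_left"; viewer_id: string; viewer_count: number; queue_count: number };

// Hypothetical host-side state.
interface ViewerRoster {
  viewers: Map<string, ViewerInfo>; // keyed by viewer ID
  viewerCount: number;
  queueCount: number;
}

function applyViewerEvent(roster: ViewerRoster, ev: ViewerEvent): void {
  if (ev.action === "viewer_joined") {
    roster.viewers.set(ev.viewer.id, ev.viewer);
  } else {
    roster.viewers.delete(ev.viewer_id);
  }
  // Assumption: server-reported counts win over the local roster size.
  roster.viewerCount = ev.viewer_count;
  roster.queueCount = ev.queue_count;
}
```

The roster may hold fewer entries than viewer_count if the host connected after some viewers had already joined, which is why the counts are stored separately.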

error

Description

Error event. Triggered when an operation fails or a system error occurs.

{
  "type": "error",
  "data": {
    "error_code": "session_not_started",
    "severity": "error",
    "message": "Session not started",
    "context": "voice-translation",
    "request_id": "req_abc123xyz789",
    "timestamp": "2026-01-15T10:30:45.123Z"
  }
}

A sentence-level error (such as a translation failure for one language of a sentence) additionally carries sid and details:

{
  "type": "error",
  "data": {
    "error_code": "llm_content_filtered",
    "severity": "warning",
    "message": "Content filtered",
    "context": "translation",
    "sid": 5,
    "request_id": "req_abc123xyz789",
    "timestamp": "2026-04-26T10:30:45.123Z",
    "details": {
      "provider": "llm_service",
      "source_lang": "zh-TW",
      "translation_language": "ja"
    }
  }
}

Field descriptions

| Field | Type | Description |
| --- | --- | --- |
| error_code | string | Error code (for programmatic handling) |
| severity | string | Severity: fatal / error / warning |
| message | string | Human-readable error message |
| context | string | Error source category |
| sid | int | Optional. The sentence number for a sentence-level error (such as a translation failure for that sentence); not included for non-sentence-level errors |
| request_id | string | Request tracing ID |
| timestamp | string | The time the error occurred (ISO 8601) |
| details | object | Optional. Error context; common keys: provider, translation_language, source_lang. internal_error (a panic recovered for a single message) additionally carries message_type (always present) and action (best-effort; absent when parsing fails). See websocket-api.md Per-Message Errors |

Severity descriptions

| Severity | Description | Recommended handling |
| --- | --- | --- |
| fatal | Fatal error | Stop the service and require reconnection |
| error | Operation failed | Show an error prompt and allow retry |
| warning | Warning | Show a warning without blocking the operation |

For the complete list of error codes, see Error Code Reference.
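
The recommended handling per severity can be expressed as a small pure function that turns an error envelope into a client-side decision. The sketch below is illustrative only: ClientAction and planErrorHandling are hypothetical names, and how the client actually reconnects or renders prompts is left out.

```typescript
type Severity = "fatal" | "error" | "warning";

interface ErrorEnvelope {
  type: "error";
  data: {
    error_code: string;
    severity: Severity;
    message: string;
    context: string;
    sid?: number;
    request_id: string;
    timestamp: string;
    details?: Record<string, unknown>;
  };
}

// Hypothetical decisions a client might take for each severity.
type ClientAction =
  | { kind: "reconnect"; reason: string }
  | { kind: "show_error"; text: string; retryable: true }
  | { kind: "show_warning"; text: string; sid: number | undefined };

function planErrorHandling(env: ErrorEnvelope): ClientAction {
  const { severity, message, sid } = env.data;
  if (severity === "fatal") {
    // Stop the service and require reconnection.
    return { kind: "reconnect", reason: message };
  }
  if (severity === "error") {
    // Show an error prompt and allow retry.
    return { kind: "show_error", text: message, retryable: true };
  }
  // Warning: inform the user without blocking; sid locates sentence-level errors.
  return { kind: "show_warning", text: message, sid };
}
```

Returning a decision object instead of performing side effects keeps the mapping easy to test and lets the UI layer decide how each action is presented.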


upload_error

v1.5.6 documentation fix: Earlier documentation described a standalone event format of type: "voice-translation" + action: "upload_error", but in practice this format was never sent on the wire. Storage upload failures always use the unified error envelope, with one of the three error codes in the table below.

If your client listens for action === "upload_error", switch to listening for type === "error" and matching on error_code.

Storage-layer error codes (sent via the error event)

| Error code | Description |
| --- | --- |
| storage_connection_failed | Storage service connection failed |
| storage_upload_failed | File upload failed |
| storage_queue_full | Upload queue full |
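
For clients migrating away from the old action === "upload_error" check, the replacement test can look like the sketch below. The helper name isStorageUploadError and the WireMessage shape are hypothetical; the three codes are taken from the table above.

```typescript
// Storage-layer error codes delivered through the unified error envelope.
const STORAGE_ERROR_CODES = new Set<string>([
  "storage_connection_failed",
  "storage_upload_failed",
  "storage_queue_full",
]);

// Loose shape of any parsed frame; only the fields inspected here are declared.
interface WireMessage {
  type?: string;
  data?: { error_code?: string; [key: string]: unknown };
}

// True when the frame is an error envelope carrying a storage upload failure.
function isStorageUploadError(msg: WireMessage): boolean {
  const code = msg.data?.error_code;
  return msg.type === "error" && code !== undefined && STORAGE_ERROR_CODES.has(code);
}
```

A frame in the never-sent legacy format (type "voice-translation" with action "upload_error") returns false here, so the old branch can be deleted once this check is in place.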

summary_done

Description

An event pushed after recording stops, once server-side non-streaming summary generation is complete. After receiving this event, the client can call GET /api/v1/sse/history/transcribe/{taskId} to retrieve the summary content (the payload does not include final_content, to avoid bloating the WebSocket message).

v1.5.5 adds two fallback audit fields, summary_fallback_level and summary_dropped_segments. When a custom prompt or the transcript content triggers the LLM service content filter, the backend automatically falls back from L1 (the request as submitted) to L2 (a neutral prompt) and then to L3 (segment trimming), and uses these two fields to tell the client which path was actually taken.

Examples

L1 succeeds directly (no fallback, no filtering triggered):

{
  "type": "voice-translation",
  "data": {
    "action": "summary_done",
    "task_id": "550e8400-e29b-41d4-a716-446655440000",
    "summary_id": "sum_a1b2c3d4e5f6g7h8",
    "summary_mode": "custom",
    "summary_template": "skin-clinic-acme-v2",
    "summary_plain_text": true,
    "tokens_used": { "input": 1234, "output": 567 }
  }
}

L3 triggered (summary produced after transcript segments were trimmed):

{
  "type": "voice-translation",
  "data": {
    "action": "summary_done",
    "task_id": "550e8400-e29b-41d4-a716-446655440000",
    "summary_id": "sum_a1b2c3d4e5f6g7h8",
    "summary_mode": "custom",
    "summary_template": "skin-clinic-acme-v2",
    "summary_plain_text": true,
    "tokens_used": { "input": 3456, "output": 789 },
    "summary_fallback_level": 3,
    "summary_dropped_segments": [3, 7]
  }
}

Field descriptions

| Field | Type | Description |
| --- | --- | --- |
| action | string | Always summary_done |
| task_id | string | Recording UUID |
| summary_id | string | The internal ID of this summary |
| summary_mode | string | "builtin" or "custom" |
| summary_template | string | The effective slug: for builtin, the built-in template slug (such as meeting); for custom, the customer slug |
| summary_plain_text | boolean | Whether the output is plain text |
| tokens_used.input / tokens_used.output | int | Token usage (a cumulative value across all calls when an L2/L3 fallback is triggered) |
| summary_fallback_level | int (optional) | Present only when a fallback was triggered (2 or 3); omitted when L1 succeeds directly. 2 = L2 neutral prompt; 3 = L3 segment trimming |
| summary_dropped_segments | int[] (optional) | Present only when summary_fallback_level is 3; the indices of the trimmed transcript segments (in original order) |

Interpreting the fallback level (for frontend UI hints)

| summary_fallback_level | Meaning | Suggested UI hint |
| --- | --- | --- |
| (field omitted) | L1 succeeds directly, no fallback | Do not show a hint |
| 2 | The customer prompt triggered filtering, so a neutral fallback prompt was used instead | "Your custom instructions contained terms the content filter could not process; the summary was generated using neutral mode" |
| 3 | The transcript content triggered filtering; the offending segments were trimmed before producing the summary | "The transcript contained N segments that could not be processed; the summary was generated after omitting the relevant content" (N = summary_dropped_segments.length) |

If L3 fails, summary_done is not sent; instead, summary_error is sent with error_code=llm_content_filtered (see §summary_error below).

Note: The payload deliberately does not include final_content. The client must call GET /api/v1/sse/history/transcribe/{taskId} itself to retrieve the full summary text. summary_fallback_level and summary_dropped_segments are also provided as top-level fields of the init_summary event during history playback.
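
The fallback table translates directly into a helper that picks the UI hint, and the follow-up request only needs the task_id from the event. The sketch below uses hypothetical names (summaryHint, summaryHistoryPath); it returns only the request path, since the API host and authentication are outside the scope of this page.

```typescript
// Only the fields used by this sketch are declared.
interface SummaryDoneData {
  action: "summary_done";
  task_id: string;
  summary_fallback_level?: number; // 2 or 3; omitted when L1 succeeds directly
  summary_dropped_segments?: number[];
}

// Returns the suggested hint text, or null when no hint should be shown.
function summaryHint(data: SummaryDoneData): string | null {
  if (data.summary_fallback_level === 2) {
    return "Your custom instructions contained terms the content filter could not process; the summary was generated using neutral mode";
  }
  if (data.summary_fallback_level === 3) {
    const n = (data.summary_dropped_segments ?? []).length;
    return `The transcript contained ${n} segments that could not be processed; the summary was generated after omitting the relevant content`;
  }
  return null;
}

// Path of the REST call that retrieves the full summary text.
function summaryHistoryPath(taskId: string): string {
  return `/api/v1/sse/history/transcribe/${encodeURIComponent(taskId)}`;
}
```

On summary_done, a client would typically request summaryHistoryPath(data.task_id) and show the result together with summaryHint(data) when it is not null.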


summary_error

Description

An event pushed when summary generation fails, so the client does not need to keep polling to find out.

Example

{
  "type": "voice-translation",
  "data": {
    "action": "summary_error",
    "task_id": "550e8400-e29b-41d4-a716-446655440000",
    "error_code": "summary_failed",
    "message": "Summary generation failed"
  }
}

Field descriptions

| Field | Type | Description |
| --- | --- | --- |
| action | string | Always summary_error |
| task_id | string | Recording UUID |
| error_code | string | Summary error code (such as summary_failed / summary_timeout / summary_mode_field_mismatch, etc.) |
| message | string | Human-readable error message (already sanitized; does not include the LLM raw error) |

Version: V1.5.7
Last Updated: 2026-05-20

Copyright © 2026