Concepts
Task
A Task is the top-level concept in VAS, representing one complete speech-processing job. Each Task has a unique task_id (UUID) and, once completed, can be queried and managed through the REST API.
A Task can be created in the following ways:
- Live recording: Created automatically when a real-time voice translation session over WebSocket finishes
- Audio import: Created automatically after an uploaded audio file finishes processing
Recording
A Recording is the core data of a Task, containing the audio file and the transcript. Each Task corresponds to one Recording.
Recording Types
| Type | Description | Use Case |
|---|---|---|
| transcribe | Speech-to-text | Meeting notes, interview records |
| conversation | Bilingual real-time interpretation | Two-person cross-language conversations, live interpretation |
| record | Plain recording | Voice memos, quick notes |
| broadcast | Broadcast / live streaming | Lectures, talks, live content |
Type Differences
| Feature | transcribe | conversation | record | broadcast |
|---|---|---|---|---|
| Speech recognition | ✓ | ✓ | ✓ | ✓ |
| Translation | ✓ | ✓ (two-way) | - | ✓ |
| TTS audio output | ✓ (manual trigger) | ✓ (automatic) | - | ✓ (broadcast viewers) |
| Summary | ✓ (template required) | ✓ (optional) | - | ✓ (optional) |
| Broadcast | - | - | - | ✓ |
Processing Status
A Recording goes through several processing statuses from creation to completion. You can query active tasks with GET /api/v1/tasks?status=active.
| Status | Description | Trigger |
|---|---|---|
| recording | Live recording in progress | WebSocket recording started, broadcast started |
| importing | Audio import being processed | After audio file is uploaded |
| uploading | Uploading to cloud storage | After recording stops, after import completes |
| processing | Speech recognition and translation in progress | After upload completes |
| completed | Processing complete | Final status |
| failed | Processing failed | Final status |
Live recording / broadcast: recording → uploading → processing → completed / failed
Audio import: importing → uploading → processing → completed / failed
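The two flows above can be read as a small state machine. The following is a minimal sketch that transcribes them literally, so `failed` is reachable only from `processing`; the service itself may report failures from earlier stages. The helper names and the base URL passed to `active_tasks_url` are illustrative assumptions, not part of the VAS API.

```python
from urllib.parse import urlencode

# Allowed transitions, copied from the two documented flows.
# Assumption: no other transitions exist (e.g. uploading -> failed).
TRANSITIONS = {
    "recording": {"uploading"},
    "importing": {"uploading"},
    "uploading": {"processing"},
    "processing": {"completed", "failed"},
    "completed": set(),
    "failed": set(),
}


def is_final(status: str) -> bool:
    """A status is final when it has no outgoing transition."""
    return not TRANSITIONS[status]


def can_transition(current: str, nxt: str) -> bool:
    """Return True if the documented flows allow current -> nxt."""
    return nxt in TRANSITIONS.get(current, set())


def active_tasks_url(base_url: str) -> str:
    """Build the active-task query URL; base_url is a placeholder host."""
    query = urlencode({"status": "active"})
    return base_url.rstrip("/") + "/api/v1/tasks?" + query
```

A client that polls the active-task endpoint could stop polling a task as soon as `is_final` returns true for its status.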
Session
A Session is the active working period within a single WebSocket connection, running from start to stop.
Session Lifecycle
Connect WebSocket
│
▼
start ──→ session_started (obtain session_id + recording_id)
│
▼
Send audio (audio) ──→ Receive recognition results (result)
│
├── pause / resume (optional)
├── config (optional, update terminology)
├── retranslate (optional, re-translate)
├── switch_language (optional, switch language)
├── set_tts (interpretation mode, toggle TTS on/off and settings)
│
▼
stop ──→ status (stop confirmation)
│
▼
task_complete (audio and transcript upload complete, obtain task_id)
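The lifecycle above can be tracked on the client as a transition table. This is a sketch under stated assumptions: the message names come from the diagram, but the state names (`connected`, `starting`, and so on) are invented for illustration, and the table does not cover cases the diagram leaves open, such as stopping while paused or results that arrive late.

```python
# (state, message) -> next state.
# Message names follow the lifecycle diagram; state names are hypothetical.
LIFECYCLE = {
    ("connected", "start"): "starting",
    ("starting", "session_started"): "active",
    ("active", "pause"): "paused",
    ("paused", "resume"): "active",
    ("active", "stop"): "stopping",
    ("stopping", "status"): "stopped",
    ("stopped", "task_complete"): "complete",
}

# Messages exchanged during an active session that leave the state unchanged.
IN_SESSION = {"audio", "result", "config", "retranslate",
              "switch_language", "set_tts"}


def advance(state: str, message: str) -> str:
    """Return the next state, or raise if the message is out of order."""
    if state == "active" and message in IN_SESSION:
        return state
    try:
        return LIFECYCLE[(state, message)]
    except KeyError:
        raise ValueError(
            f"{message!r} is not expected in state {state!r}"
        ) from None


def run(messages) -> str:
    """Replay a message sequence starting from a fresh connection."""
    state = "connected"
    for message in messages:
        state = advance(state, message)
    return state
```

Replaying `start`, `session_started`, `audio`, `result`, `stop`, `status`, `task_complete` ends in the `complete` state, while sending `audio` before `start` raises `ValueError`.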
Speaker Recognition Modes
| Mode | Description | Use Case |
|---|---|---|
| single | Single-speaker mode (default) | Solo talks, memos |
| multi_speaker | Multi-speaker conversation mode | Meetings, interviews, conversations (supports 31 languages) |
In multi-speaker mode, the system automatically recognizes different speakers and labels them as Guest-1, Guest-2, and so on. You can manage speakers with the following operations:
- Rename: Rename Guest-1 to the speaker's actual name
- Reassign: Change which speaker a particular sentence belongs to
- Merge: Merge two speakers into one
For details, see the Speaker Management Guide.
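To make the three operations concrete, here is a minimal local model of their effect on a transcript. It is only an illustration of the semantics: the `Transcript` class and its fields are hypothetical and do not reflect the actual Speaker Management API or its data format.

```python
from dataclasses import dataclass, field


@dataclass
class Transcript:
    """Hypothetical in-memory model of speaker labels on a transcript."""
    # sentence index -> speaker label, e.g. {0: "Guest-1"}
    sentences: dict = field(default_factory=dict)
    # speaker label -> display name chosen by the user
    names: dict = field(default_factory=dict)

    def rename(self, speaker: str, name: str) -> None:
        """Rename: give a speaker label a display name."""
        self.names[speaker] = name

    def reassign(self, sentence: int, speaker: str) -> None:
        """Reassign: change which speaker one sentence belongs to."""
        self.sentences[sentence] = speaker

    def merge(self, source: str, target: str) -> None:
        """Merge: move every sentence of `source` to `target`."""
        for index, speaker in self.sentences.items():
            if speaker == source:
                self.sentences[index] = target
        self.names.pop(source, None)

    def display(self, sentence: int) -> str:
        """Return the display name for a sentence's speaker."""
        speaker = self.sentences[sentence]
        return self.names.get(speaker, speaker)
```

For example, after renaming Guest-1 to "Alice" and merging Guest-3 into Guest-1, every sentence formerly labeled Guest-3 displays as "Alice".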
Data Flow
Microphone audio
│
▼
Speech recognition (STT) ──→ Original text (origin)
│
▼
Translation ──→ Translation result (translation)
│
├── Real-time translation: translate immediately after each sentence is recognized
└── Non-real-time translation: translate only after is_final
│
▼
TTS speech synthesis (optional) ──→ Audio (tts_ready)
│
▼
Meeting summary (optional) ──→ Summary result
Real-Time vs. Non-Real-Time Translation
| Mode | realtime_translation | Behavior |
|---|---|---|
| Non-real-time (default) | false | Translate after the sentence is confirmed (is_final: true); results are more accurate |
| Real-time | true | Trigger translation on every recognition update; lower latency but may update multiple times |
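The difference between the two modes comes down to one condition. The sketch below models it over a list of recognition updates; the `(text, is_final)` tuple shape and the function names are assumptions made for illustration, not the format of the actual result messages.

```python
def should_translate(realtime_translation: bool, is_final: bool) -> bool:
    """Real-time mode translates every update; otherwise only final ones."""
    return realtime_translation or is_final


def translated_updates(updates, realtime_translation: bool = False):
    """Return the texts that would trigger a translation.

    `updates` is a list of (text, is_final) pairs, a hypothetical
    stand-in for successive recognition results of one sentence.
    """
    return [
        text
        for text, is_final in updates
        if should_translate(realtime_translation, is_final)
    ]


updates = [("hel", False), ("hello wor", False), ("hello world", True)]
final_only = translated_updates(updates)            # default mode
every_update = translated_updates(updates, True)    # real-time mode
```

With the default setting only the confirmed sentence is translated once; with `realtime_translation` enabled the same sentence triggers three translations, which is why the displayed translation may change several times.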
Interpretation Mode
Interpretation mode lets two people who speak different languages hold a real-time interpreted conversation over a single WebSocket connection. The system automatically detects the language of each sentence and maps the translation direction accordingly, so neither speaker has to switch languages manually.
Person A speaks Chinese → auto-detect zh-TW → translate to en-US → TTS (en-US voice) → tts_ready
Person B speaks English → auto-detect en-US → translate to zh-TW → TTS (zh-TW voice) → tts_ready
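The direction mapping in these two examples can be expressed as a small function: given the session's two languages and the detected language of a sentence, the target is simply the other one. This is an illustrative sketch; `target_language` is a hypothetical helper, and how the service handles a sentence detected as a third language is not documented here, so the sketch just raises an error.

```python
def target_language(detected: str, languages: tuple) -> str:
    """Return the translation target for a sentence in `detected`.

    `languages` must hold exactly two distinct language codes,
    e.g. ("zh-TW", "en-US").
    """
    if len(languages) != 2 or languages[0] == languages[1]:
        raise ValueError("interpretation mode needs exactly 2 distinct languages")
    first, second = languages
    if detected == first:
        return second
    if detected == second:
        return first
    # Assumption: a language outside the configured pair is an error.
    raise ValueError(f"{detected!r} is not one of {languages!r}")


pair = ("zh-TW", "en-US")
a_to_b = target_language("zh-TW", pair)
b_to_a = target_language("en-US", pair)
```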
Interpretation Mode Characteristics
| Item | Description |
|---|---|
| Connection model | Single WebSocket connection, two people sharing one device |
| Number of languages | Exactly 2 (e.g., zh-TW + en-US) |
| Recognition mode | Automatic language detection |
| Language detection | Detected automatically per sentence; no manual switching needed |
| TTS output | Translation results are synthesized to TTS automatically (can be disabled, and toggled mid-session via set_tts) |
| Summary | Supports optional automatic summary (summary_template) |
Interpretation Flow Overview
- WebSocket: Start with the conversation type and specify the two languages
- Send audio: The system automatically detects the language, translates to the other language, and returns TTS
- Automatic language detection: Each sentence is detected independently; origin.language reflects the detected language
- Stop: Stop recording; a Task is created automatically once processing completes
For details, see the interpretation mode section of the Real-Time Voice Translation Guide.
Broadcast Architecture
The broadcast feature delivers one presenter's speech to many viewers in real time.
Presenter (WebSocket)
│
├── Send audio ──→ STT ──→ Translation ──→ TTS (optional)
│
▼
VAS server
│
▼
Viewer 1 (SSE) ──→ Live subtitles + TTS audio
Viewer 2 (SSE) ──→ Live subtitles + TTS audio
Viewer N (SSE) ──→ Live subtitles + TTS audio
Broadcast Phases
| Phase | Description |
|---|---|
| standby | The presenter can test equipment; viewers see a waiting message |
| live | Live subtitles are broadcast to all viewers |
Broadcast Flow Overview
- REST API: Create a broadcast channel and obtain a broadcast_token
- WebSocket: The presenter starts recording with the broadcast type
- SSE: Viewers connect through the share link to receive subtitles
- WebSocket: The presenter stops recording, and the broadcast ends
For details, see the Broadcast Guide.
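As a conceptual aid, the one-to-many delivery and the two phases can be modeled in a few lines. This is a toy in-memory model, not how the VAS server is implemented: the `Channel` class, the `go_live` method, and the message dictionaries are all hypothetical, and real viewers receive events over SSE rather than through Python lists.

```python
class Channel:
    """Toy model of a broadcast channel with standby and live phases."""

    def __init__(self):
        self.phase = "standby"
        self.viewers = {}  # viewer id -> list of delivered messages

    def join(self, viewer_id: str) -> None:
        """Register a viewer; in standby they get a waiting message."""
        self.viewers[viewer_id] = []
        if self.phase == "standby":
            self.viewers[viewer_id].append({"type": "waiting"})

    def go_live(self) -> None:
        """Switch from standby to live (hypothetical trigger)."""
        self.phase = "live"

    def publish(self, subtitle: str) -> int:
        """Fan one subtitle out to every viewer; return delivery count.

        In standby nothing is delivered, so the presenter can test
        equipment without viewers seeing the output.
        """
        if self.phase != "live":
            return 0
        for inbox in self.viewers.values():
            inbox.append({"type": "subtitle", "text": subtitle})
        return len(self.viewers)
```

The point of the model is the asymmetry: one presenter-side `publish` call produces one delivery per connected viewer, and only while the channel is in the live phase.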
Summary Templates
A summary template defines the output format for automatic summaries. In transcribe recordings, summary_template is a required parameter; in conversation and broadcast recordings, it is optional.
You can query the list of available templates via the Summary Templates API.
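A client can check the rule above before starting a recording. The sketch below is an assumption-laden illustration: `check_summary_template` is a hypothetical helper, and rejecting a template for `record` recordings is inferred from the fact that summaries are not listed for that type, not from documented server behavior.

```python
# Whether summary_template is required, optional, or not applicable,
# per recording type. The "unsupported" entry for record is an inference.
SUMMARY_RULE = {
    "transcribe": "required",
    "conversation": "optional",
    "broadcast": "optional",
    "record": "unsupported",
}


def check_summary_template(recording_type: str, summary_template=None):
    """Validate summary_template against the recording type."""
    rule = SUMMARY_RULE.get(recording_type)
    if rule is None:
        raise ValueError(f"unknown recording type: {recording_type!r}")
    if rule == "required" and not summary_template:
        raise ValueError("transcribe recordings require summary_template")
    if rule == "unsupported" and summary_template:
        raise ValueError("record recordings do not produce a summary")
    return summary_template
```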
Terminology and Text Processing
VAS provides three text-processing settings that can be configured before recording starts or updated dynamically while it is in progress:
| Setting | Description |
|---|---|
| Terminology (terminology) | Improves recognition accuracy for specific terms |
| Fuzzy correction (fuzzy_correction) | Automatically corrects homophone and near-homophone errors (can be generated automatically by the system) |
| Translation dictionary (translation_dict) | Ensures consistent translation of proper nouns |
For details, see the config action in the WebSocket voice-translation reference.
Webhook Callbacks
VAS supports Webhook notifications that proactively push events to a URL you specify when recording processing completes or fails, eliminating the need for polling.
Supported Events
| Event | Description |
|---|---|
| recording.completed | Recording processing complete |
| recording.failed | Recording processing failed |
| import.completed | Audio import complete |
| import.failed | Audio import failed |
Configuration
- Per request: Add the callback_url parameter to the API request (audio import, broadcast)
- Per API Key: Specify webhook_url in the API Key settings (applies to all requests)
For details, see the Webhook Callback Guide.
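On the receiving side, a webhook endpoint typically parses the request body and routes on the event name. The sketch below shows only that routing step for the four supported events. The payload shape is an assumption: the `event` field name is hypothetical, and the real field names and any signature verification are described in the Webhook Callback Guide.

```python
import json

SUPPORTED_EVENTS = {
    "recording.completed",
    "recording.failed",
    "import.completed",
    "import.failed",
}


def route_event(body: str):
    """Parse a webhook body and return (source, outcome).

    Assumes the event name is carried in a top-level "event" field,
    which is a hypothetical payload shape. Unknown events are ignored
    rather than rejected, so new event types do not break the receiver.
    """
    payload = json.loads(body)
    event = payload.get("event")
    if event not in SUPPORTED_EVENTS:
        return ("ignored", None)
    source, outcome = event.split(".")
    return (source, outcome)
```

A receiver could use the returned pair to decide, for instance, whether to fetch the finished Task or to log a failure for retry.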
Version: V1.5.7 Last Updated: 2026-05-20