Getting Started

Concepts

Task

A Task is the top-level concept in VAS, representing one complete speech-processing job. Each Task has a unique task_id (UUID) and, once completed, can be queried and managed through the REST API.

A Task can be created in the following ways:

  • Live recording: Created automatically after a voice-translation session over WebSocket finishes
  • Audio import: Created automatically after an uploaded audio file finishes processing

Recording

A Recording is the core data of a Task, containing the audio file and the transcript. Each Task corresponds to one Recording.

Recording Types

Type          Description                         Use Case
transcribe    Speech-to-text                      Meeting notes, interview records
conversation  Bilingual real-time interpretation  Two-person cross-language conversations, live interpretation
record        Plain recording                     Voice memos, quick notes
broadcast     Broadcast / live streaming          Lectures, talks, live content

Type Differences

Feature             transcribe               conversation     record  broadcast
Speech recognition  Yes                      Yes              Yes     Yes
Translation         Yes                      Yes (two-way)    -       Yes
TTS audio output    Yes (manual trigger)     Yes (automatic)  -       Yes (broadcast viewers)
Summary             Yes (template required)  Yes (optional)   -       Yes (optional)
Broadcast           -                        -                -       Yes
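
A client can mirror the matrix above to decide which options to offer for a given recording type. The following is a minimal sketch: the feature keys, the dictionary layout, and the supports helper are illustrative names, not part of the VAS API.

```python
# Feature matrix transcribed from the Type Differences table.
# The keys and the helper below are hypothetical, for illustration only.
FEATURES = {
    "transcribe": {"speech_recognition", "translation", "tts", "summary"},
    "conversation": {"speech_recognition", "translation", "tts", "summary"},
    "record": {"speech_recognition"},
    "broadcast": {"speech_recognition", "translation", "tts", "summary", "broadcast"},
}


def supports(recording_type: str, feature: str) -> bool:
    """Return True if the given recording type offers the feature."""
    if recording_type not in FEATURES:
        raise ValueError(f"unknown recording type: {recording_type}")
    return feature in FEATURES[recording_type]
```

For example, supports("record", "translation") is False, so a client would hide translation settings for plain recordings.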

Processing Status

A Recording goes through several processing statuses from creation to completion. You can query active tasks with GET /api/v1/tasks?status=active.

Status      Description                                     Trigger
recording   Live recording in progress                      WebSocket recording started, broadcast started
importing   Audio import being processed                    After audio file is uploaded
uploading   Uploading to cloud storage                      After recording stops, after import completes
processing  Speech recognition and translation in progress  After upload completes
completed   Processing complete                             Final status
failed      Processing failed                               Final status

Live recording / broadcast: recording → uploading → processing → completed / failed
Audio import:               importing → uploading → processing → completed / failed
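
The two flows above can be expressed as a small transition table, which is handy for sanity-checking a sequence of statuses observed while polling. This is a sketch under one assumption: it reads the flow lines literally, so failed is reachable only from processing. The server may well report failed from an earlier stage.

```python
# Allowed transitions, read literally from the two flows above.
# Assumption: "failed" is only entered from "processing".
TRANSITIONS = {
    "recording": {"uploading"},
    "importing": {"uploading"},
    "uploading": {"processing"},
    "processing": {"completed", "failed"},
    "completed": set(),
    "failed": set(),
}


def is_final(status: str) -> bool:
    """A status is final when no further transition is listed for it."""
    return not TRANSITIONS[status]


def validate_flow(statuses: list[str]) -> bool:
    """Check that each consecutive pair of statuses is an allowed transition."""
    return all(b in TRANSITIONS[a] for a, b in zip(statuses, statuses[1:]))
```

A polling loop could stop as soon as is_final(status) returns True.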

Session

A Session is an active working session within a single WebSocket connection, from start to stop.

Session Lifecycle

Connect WebSocket
    │
    ▼
  start  ──→  session_started (obtain session_id + recording_id)
    │
    ▼
  Send audio (audio) ──→ Receive recognition results (result)
    │
    ├── pause / resume (optional)
    ├── config (optional, update terminology)
    ├── retranslate (optional, re-translate)
    ├── switch_language (optional, switch language)
    ├── set_tts (interpretation mode, toggle TTS on/off and settings)
    │
    ▼
  stop   ──→  status (stop confirmation)
    │
    ▼
  task_complete (audio and transcript upload complete, obtain task_id)
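
The lifecycle above can be tracked on the client with a small state holder that rejects out-of-order actions. Only the action and event names come from the diagram. The class, its state names, and the rule that audio may not be sent before session_started arrives are assumptions made for this sketch, and the wire format of the messages is not modeled.

```python
class SessionTracker:
    """Tracks where a session is in the lifecycle shown above.

    The state names ("connected", "starting", "active", "stopping",
    "done") are hypothetical; only action/event names are from the docs.
    """

    IN_SESSION_ACTIONS = {
        "audio", "pause", "resume", "config",
        "retranslate", "switch_language", "set_tts", "stop",
    }

    def __init__(self) -> None:
        self.state = "connected"
        self.task_id = None

    def can_send(self, action: str) -> bool:
        """Return True if the action fits the current lifecycle state."""
        if self.state == "connected":
            return action == "start"
        if self.state == "active":
            return action in self.IN_SESSION_ACTIONS
        return False

    def on_sent(self, action: str) -> None:
        """Record that the client sent an action, rejecting out-of-order ones."""
        if not self.can_send(action):
            raise RuntimeError(f"{action!r} not allowed in state {self.state!r}")
        if action == "start":
            self.state = "starting"
        elif action == "stop":
            self.state = "stopping"

    def on_event(self, event: str, task_id=None) -> None:
        """Advance the state when a server event arrives."""
        if event == "session_started" and self.state == "starting":
            self.state = "active"
        elif event == "task_complete" and self.state == "stopping":
            self.state = "done"
            self.task_id = task_id
```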

Speaker Recognition Modes

Mode           Description                      Use Case
single         Single-speaker mode (default)    Solo talks, memos
multi_speaker  Multi-speaker conversation mode  Meetings, interviews, conversations (supports 31 languages)

In multi-speaker mode, the system automatically recognizes different speakers and labels them as Guest-1, Guest-2, and so on. You can manage speakers with the following operations:

  • Rename: Rename Guest-1 to the speaker's actual name
  • Reassign: Change which speaker a particular sentence belongs to
  • Merge: Merge two speakers into one

For details, see the Speaker Management Guide.
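
To make the three operations concrete, here is a sketch of their effect on a local copy of a transcript. The Sentence type and the function names are hypothetical. The actual operations are performed through the API described in the Speaker Management Guide, and this code models only the resulting data.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    """One transcript sentence with its speaker label (hypothetical model)."""
    text: str
    speaker: str


def rename(sentences, old, new):
    """Rename a speaker everywhere, e.g. Guest-1 to the speaker's real name."""
    for s in sentences:
        if s.speaker == old:
            s.speaker = new


def reassign(sentences, index, speaker):
    """Change which speaker one particular sentence belongs to."""
    sentences[index].speaker = speaker


def merge(sentences, source, target):
    """Merge two speakers by moving all of source's sentences to target."""
    rename(sentences, source, target)
```

In this local model, merge is a rename onto an existing speaker label.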


Data Flow

Microphone audio
    │
    ▼
  Speech recognition (STT) ──→ Original text (origin)
    │
    ▼
  Translation    ──→ Translation result (translation)
    │
    ├── Real-time translation: translate on every recognition update
    └── Non-real-time translation: translate only after the sentence is final (is_final: true)
    │
    ▼
  TTS speech synthesis (optional) ──→ Audio (tts_ready)
    │
    ▼
  Meeting summary (optional) ──→ Summary result

Real-Time vs. Non-Real-Time Translation

Mode                     realtime_translation  Behavior
Non-real-time (default)  false                 Translate after the sentence is confirmed (is_final: true); results are more accurate
Real-time                true                  Trigger translation on every recognition update; lower latency but may update multiple times
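
The rule in the table reduces to a single condition. The sketch below restates it as a function; should_translate is an illustrative name, and the server applies this logic itself, so the code only clarifies when a client should expect translation updates.

```python
def should_translate(is_final: bool, realtime_translation: bool = False) -> bool:
    """Decide whether a recognition update triggers translation.

    Real-time mode translates on every update; the default mode waits
    until the sentence is confirmed (is_final is true).
    """
    return realtime_translation or is_final


# Three recognition updates for one sentence; only the last is final.
updates = [False, False, True]
default_mode = [should_translate(u) for u in updates]
realtime_mode = [should_translate(u, realtime_translation=True) for u in updates]
```

In this example, default_mode triggers translation once (on the final update), while realtime_mode triggers it on all three updates.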

Interpretation Mode

Interpretation mode lets two people who speak different languages hold a real-time interpreted conversation over a single WebSocket connection. The system automatically detects the language of each sentence and chooses the translation direction accordingly, so neither speaker has to select a language or switch direction manually.

Person A speaks Chinese → auto-detect zh-TW → translate to en-US → TTS (en-US voice) → tts_ready
Person B speaks English → auto-detect en-US → translate to zh-TW → TTS (zh-TW voice) → tts_ready
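
The direction mapping shown above amounts to "translate into whichever of the two configured languages was not detected". A minimal sketch, with target_language as an illustrative helper name (the server performs this mapping itself):

```python
def target_language(detected: str, pair: tuple[str, str]) -> str:
    """Return the other language of the pair for a detected language."""
    first, second = pair
    if first == second:
        raise ValueError("interpretation mode needs exactly 2 distinct languages")
    if detected == first:
        return second
    if detected == second:
        return first
    raise ValueError(f"{detected!r} is not one of the session languages {pair}")
```

With the pair ("zh-TW", "en-US"), a sentence detected as zh-TW is translated to en-US, and one detected as en-US is translated to zh-TW.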

Interpretation Mode Characteristics

Item                 Description
Connection model     Single WebSocket connection, two people sharing one device
Number of languages  Exactly 2 (e.g., zh-TW + en-US)
Recognition mode     Automatic language detection
Language detection   Detected automatically per sentence; no manual switching needed
TTS output           Translation results are synthesized to TTS automatically (can be disabled, and toggled mid-session via set_tts)
Summary              Supports optional automatic summary (summary_template)

Interpretation Flow Overview

  1. WebSocket: Start with the conversation type and specify the two languages
  2. Send audio: The system automatically detects the language, translates to the other language, and returns TTS
  3. Automatic language detection: Each sentence is detected independently; origin.language reflects the detected language
  4. Stop: Stop recording; a Task is created automatically once processing completes

For details, see the interpretation mode section of the Real-Time Voice Translation Guide.


Broadcast Architecture

The broadcast feature delivers one presenter's speech to many viewers in real time.

Presenter (WebSocket)
    │
    ├── Send audio ──→ STT ──→ Translation ──→ TTS (optional)
    │
    ▼
  VAS server
    │
    ▼
Viewer 1 (SSE) ──→ Live subtitles + TTS audio
Viewer 2 (SSE) ──→ Live subtitles + TTS audio
Viewer N (SSE) ──→ Live subtitles + TTS audio

Broadcast Phases

Phase    Description
standby  The presenter can test equipment; viewers see a waiting message
live     Live subtitles are broadcast to all viewers

Broadcast Flow Overview

  1. REST API: Create a broadcast channel and obtain a broadcast_token
  2. WebSocket: The presenter starts recording with the broadcast type
  3. SSE: Viewers connect through the share link to receive subtitles
  4. WebSocket: The presenter stops recording, and the broadcast ends

For details, see the Broadcast Guide.
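
On the viewer side, subtitles arrive as Server-Sent Events. The sketch below parses the standard text/event-stream framing (data: fields, comment lines, and a blank line ending each event). It does not assume anything about the VAS event payload, which this page does not specify; the sample strings in the usage note are made up.

```python
def parse_sse(lines):
    """Yield the data payload of each event in a text/event-stream.

    Handles `data:` fields and comment lines; other fields (event, id,
    retry) are ignored in this sketch.
    """
    data = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield "\n".join(data)
                data = []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            data.append(value[1:] if value.startswith(" ") else value)
```

For example, list(parse_sse(["data: hello\n", "\n"])) yields one event whose payload is "hello". In practice the lines would come from a streaming HTTP response opened with the share link.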


Summary Templates

A summary template defines the output format for automatic summaries. In transcribe recordings, summary_template is a required parameter; in conversation and broadcast recordings, it is optional.

You can query the list of available templates via the Summary Templates API.
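
A client can check the summary_template rule before starting a recording. This is a sketch only: check_summary_template is an illustrative helper, the rejection for record recordings is inferred from the Type Differences table (which lists no summary support for that type), and the template name in the usage note is a placeholder rather than a real template.

```python
def check_summary_template(recording_type: str, summary_template=None) -> None:
    """Raise ValueError if summary_template does not fit the recording type."""
    if recording_type == "transcribe" and not summary_template:
        raise ValueError("summary_template is required for transcribe recordings")
    if recording_type == "record" and summary_template:
        raise ValueError("record recordings do not support summaries")
```

For example, check_summary_template("conversation") passes without a template, while check_summary_template("transcribe") raises until a template such as the placeholder "my_template" is supplied.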


Terminology and Text Processing

VAS provides three text-processing settings. Each can be configured before recording starts or updated dynamically during recording:

Setting                                    Description
Terminology (terminology)                  Improves recognition accuracy for specific terms
Fuzzy correction (fuzzy_correction)        Automatically corrects homophone and near-homophone errors (can be generated automatically by the system)
Translation dictionary (translation_dict)  Ensures consistent translation of proper nouns

For details, see the config action in the WebSocket voice-translation reference.
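
As a rough illustration of a dynamic update, the sketch below assembles a config message that carries only the settings being changed. Only the three setting names and the existence of a config action come from this page. The "action" envelope key, the value shapes (a list of terms, and wrong-to-right and source-to-target mappings), and the sample values are all assumptions; consult the WebSocket reference for the real schema.

```python
import json


def build_config_message(terminology=None, fuzzy_correction=None,
                         translation_dict=None) -> str:
    """Serialize a config update containing only the settings provided."""
    message = {"action": "config"}  # assumed envelope key, not confirmed
    if terminology is not None:
        message["terminology"] = list(terminology)            # assumed: list of terms
    if fuzzy_correction is not None:
        message["fuzzy_correction"] = dict(fuzzy_correction)  # assumed: wrong -> right
    if translation_dict is not None:
        message["translation_dict"] = dict(translation_dict)  # assumed: source -> target
    return json.dumps(message, ensure_ascii=False)
```

Omitting untouched settings from the message keeps an update from overwriting values configured earlier, assuming the server merges updates rather than replacing the whole configuration.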


Webhook Callbacks

VAS supports Webhook notifications that proactively push events to a URL you specify when recording processing completes or fails, eliminating the need for polling.

Supported Events

Event                Description
recording.completed  Recording processing complete
recording.failed     Recording processing failed
import.completed     Audio import complete
import.failed        Audio import failed

Configuration

  • Per request: Add the callback_url parameter to the API request (audio import, broadcast)
  • Per API Key: Specify webhook_url in the API Key settings (applies to all requests)

For details, see the Webhook Callback Guide.
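
A webhook receiver typically starts by classifying the incoming event. The sketch below does that for the four event names listed above. It assumes the callback body is JSON with the event name under an "event" key, which this page does not confirm; the Webhook Callback Guide defines the actual payload.

```python
import json

# The four event names listed under Supported Events.
EVENTS = {
    "recording.completed",
    "recording.failed",
    "import.completed",
    "import.failed",
}


def classify_event(body: bytes) -> tuple[str, bool]:
    """Return (source, succeeded) for a webhook body.

    Assumes a JSON object with the event name under "event" (hypothetical).
    """
    payload = json.loads(body)
    name = payload.get("event")
    if name not in EVENTS:
        raise ValueError(f"unsupported event: {name!r}")
    source, outcome = name.split(".")
    return source, outcome == "completed"
```

For example, a body of {"event": "import.failed"} is classified as ("import", False), which a handler could route to its retry or alerting path.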


Version: V1.5.7
Last Updated: 2026-05-20

Copyright © 2026