Concepts

Task

A Task is the top-level concept in VAS, representing one complete speech-processing job. Each Task has a unique task_id (UUID) and, once completed, can be queried and managed through the REST API.

A Task can be created in the following ways:

Live recording: Created automatically after a voice-translation session over WebSocket ends
Audio import: Created automatically after an uploaded audio file finishes processing

Recording

A Recording is the core data of a Task, containing the audio file and the transcript. Each Task corresponds to one Recording.

Recording Types

Type	Description	Use Case
`transcribe`	Speech-to-text	Meeting notes, interview records
`conversation`	Bilingual real-time interpretation	Two-person cross-language conversations, live interpretation
`record`	Plain recording	Voice memos, quick notes
`broadcast`	Broadcast / live streaming	Lectures, talks, live content

Type Differences

Feature	transcribe	conversation	record	broadcast
Speech recognition	Yes	Yes	Yes	Yes
Translation	Yes	Yes (two-way)	No	Yes
TTS audio output	Yes (manual trigger)	Yes (automatic)	No	Yes (for broadcast viewers)
Summary	Yes (template required)	Yes (optional)	No	Yes (optional)
Broadcast	No	No	No	Yes
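
The table above can be encoded as a small lookup so that client code can check a feature before requesting it. This is a minimal sketch: the `FEATURES` mapping, the feature keys, and the `supports` helper are illustrative names transcribed from the table, not part of the VAS API.

```python
# Feature matrix transcribed from the "Type Differences" table.
# The dictionary and feature keys are local, hypothetical names.
FEATURES = {
    "transcribe": {"speech_recognition", "translation", "tts", "summary"},
    "conversation": {"speech_recognition", "translation", "tts", "summary"},
    "record": {"speech_recognition"},
    "broadcast": {"speech_recognition", "translation", "tts", "summary", "broadcast"},
}


def supports(recording_type: str, feature: str) -> bool:
    """Return True if the table lists the feature for the recording type."""
    if recording_type not in FEATURES:
        raise ValueError(f"unknown recording type: {recording_type!r}")
    return feature in FEATURES[recording_type]
```

For example, `supports("record", "translation")` is false, while `supports("conversation", "tts")` is true.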

Processing Status

A Recording goes through several processing statuses from creation to completion. You can query active tasks with GET /api/v1/tasks?status=active.

Status	Description	Trigger
`recording`	Live recording in progress	WebSocket recording started, broadcast started
`importing`	Audio import being processed	After audio file is uploaded
`uploading`	Uploading to cloud storage	After recording stops, after import completes
`processing`	Speech recognition and translation in progress	After upload completes
`completed`	Processing complete	Final status
`failed`	Processing failed	Final status
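
The active-task query mentioned above can be sketched as follows. Only the path and the `status=active` parameter come from this page; the host `vas.example.com` is a placeholder, and authentication headers are deliberately left out because the scheme is described on the Authentication page, not here. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_active_tasks_request(base_url: str) -> Request:
    """Build (but do not send) the GET request that lists active tasks."""
    query = urlencode({"status": "active"})
    url = f"{base_url.rstrip('/')}/api/v1/tasks?{query}"
    # Authentication headers are omitted on purpose; add them per the
    # Authentication page before sending the request.
    return Request(url, method="GET")


# Placeholder host, not a real VAS endpoint.
req = build_active_tasks_request("https://vas.example.com/")
```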

Live recording / broadcast: recording → uploading → processing → completed / failed
Audio import:               importing → uploading → processing → completed / failed
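
The two flows above amount to a small state machine. The sketch below checks whether an observed sequence of statuses follows them. It assumes `failed` is reached only from `processing`, exactly as drawn; the service may well report `failed` from an earlier stage, so treat the transition table as illustrative.

```python
from typing import Sequence

# Allowed transitions, transcribed from the two flows above.
# Assumption: "failed" is only reachable from "processing", as drawn.
TRANSITIONS = {
    "recording": {"uploading"},
    "importing": {"uploading"},
    "uploading": {"processing"},
    "processing": {"completed", "failed"},
    "completed": set(),
    "failed": set(),
}
INITIAL = {"recording", "importing"}
FINAL = {"completed", "failed"}


def is_valid_path(statuses: Sequence[str]) -> bool:
    """Return True if the statuses start at an initial state and every
    consecutive pair is an allowed transition."""
    if not statuses or statuses[0] not in INITIAL:
        return False
    pairs = zip(statuses, statuses[1:])
    return all(nxt in TRANSITIONS.get(cur, set()) for cur, nxt in pairs)
```

A poller can stop as soon as the latest status is in `FINAL`.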

Session

A Session is the active working period within a single WebSocket connection, running from the start action to the stop action.

Session Lifecycle

Connect WebSocket
    │
    ▼
  start  ──→  session_started (obtain session_id + recording_id)
    │
    ▼
  Send audio (audio) ──→ Receive recognition results (result)
    │
    ├── pause / resume (optional)
    ├── config (optional, update terminology)
    ├── retranslate (optional, re-translate)
    ├── switch_language (optional, switch language)
    ├── set_tts (interpretation mode, toggle TTS on/off and settings)
    │
    ▼
  stop   ──→  status (stop confirmation)
    │
    ▼
  task_complete (audio and transcript upload complete, obtain task_id)
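
On the client side, the lifecycle above mostly means remembering the identifiers the server hands back. The sketch below tracks them from incoming messages. The event names (`session_started`, `result`, `status`, `task_complete`) and the ID names come from the diagram; the JSON envelope, and in particular the `type` field used to tell messages apart, is an assumption made for illustration. See the WebSocket reference for the real message format.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SessionTracker:
    """Collects identifiers and results from server messages."""
    session_id: Optional[str] = None
    recording_id: Optional[str] = None
    task_id: Optional[str] = None
    last_status: Optional[dict] = None
    results: List[dict] = field(default_factory=list)

    def handle(self, raw: str) -> None:
        msg = json.loads(raw)
        kind = msg.get("type")  # assumed discriminator field
        if kind == "session_started":
            self.session_id = msg.get("session_id")
            self.recording_id = msg.get("recording_id")
        elif kind == "result":
            self.results.append(msg)
        elif kind == "status":
            self.last_status = msg
        elif kind == "task_complete":
            self.task_id = msg.get("task_id")

    @property
    def finished(self) -> bool:
        # The Task exists only once task_complete has delivered a task_id.
        return self.task_id is not None
```

Keeping `task_id` separate from `recording_id` mirrors the diagram: the recording ID is available at start, while the task ID only arrives after the upload completes.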

Speaker Recognition Modes

Mode	Description	Use Case
`single`	Single-speaker mode (default)	Solo talks, memos
`multi_speaker`	Multi-speaker conversation mode	Meetings, interviews, conversations (supports 31 languages)

In multi-speaker mode, the system automatically recognizes different speakers and labels them as Guest-1, Guest-2, and so on. You can manage speakers with the following operations:

Rename: Rename Guest-1 to the speaker's actual name
Reassign: Change which speaker a particular sentence belongs to
Merge: Merge two speakers into one

For details, see the Speaker Management Guide.
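
To make the three operations concrete, here is a minimal in-memory model of what each one does to a transcript. It is purely illustrative: the `Transcript` and `Sentence` classes and the speaker IDs are hypothetical and do not reflect the actual VAS endpoints or payloads.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sentence:
    speaker_id: str
    text: str


@dataclass
class Transcript:
    speakers: Dict[str, str] = field(default_factory=dict)  # id -> display name
    sentences: List[Sentence] = field(default_factory=list)

    def rename(self, speaker_id: str, name: str) -> None:
        """Change a speaker's display name; sentences are untouched."""
        if speaker_id not in self.speakers:
            raise KeyError(speaker_id)
        self.speakers[speaker_id] = name

    def reassign(self, index: int, speaker_id: str) -> None:
        """Attribute one sentence to a different existing speaker."""
        if speaker_id not in self.speakers:
            raise KeyError(speaker_id)
        self.sentences[index].speaker_id = speaker_id

    def merge(self, source_id: str, target_id: str) -> None:
        """Move every sentence from source to target, then drop source."""
        if source_id == target_id:
            return
        if source_id not in self.speakers or target_id not in self.speakers:
            raise KeyError((source_id, target_id))
        for sentence in self.sentences:
            if sentence.speaker_id == source_id:
                sentence.speaker_id = target_id
        del self.speakers[source_id]
```

Rename affects only the label, reassign affects a single sentence, and merge affects every sentence of one speaker, which is why they are separate operations.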

Data Flow

Microphone audio
    │
    ▼
  Speech recognition (STT) ──→ Original text (origin)
    │
    ▼
  Translation    ──→ Translation result (translation)
    │
    ├── Real-time translation: translate immediately after each sentence is recognized
    └── Non-real-time translation: translate only after is_final
    │
    ▼
  TTS speech synthesis (optional) ──→ Audio (tts_ready)
    │
    ▼
  Meeting summary (optional) ──→ Summary result

Real-Time vs. Non-Real-Time Translation

Mode	`realtime_translation`	Behavior
Non-real-time (default)	`false`	Translate after the sentence is confirmed (`is_final: true`); results are more accurate
Real-time	`true`	Trigger translation on every recognition update; lower latency but may update multiple times
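
The difference between the two modes comes down to one condition. The sketch below models it on the client side to show why real-time mode produces more translation updates; the function names are illustrative, and the actual decision is made by the VAS server.

```python
from typing import Iterable


def should_translate(is_final: bool, realtime_translation: bool = False) -> bool:
    """Real-time mode translates every update; the default mode waits
    until the sentence is confirmed (is_final is true)."""
    return realtime_translation or is_final


def count_translations(updates: Iterable[bool], realtime_translation: bool = False) -> int:
    """Count how many recognition updates would trigger a translation.
    Each item in updates is the is_final flag of one update."""
    return sum(1 for is_final in updates if should_translate(is_final, realtime_translation))


# One sentence recognized in three steps: two partial updates, then final.
updates = [False, False, True]
non_realtime_count = count_translations(updates)
realtime_count = count_translations(updates, realtime_translation=True)
```

With these three updates the default mode translates once and real-time mode translates three times, so a client in real-time mode should expect to overwrite earlier translations of the same sentence.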

Interpretation Mode

Interpretation mode lets two people who speak different languages hold a real-time interpreted conversation over a single WebSocket connection. The system detects the language of each sentence automatically and translates it into the other configured language, so users never have to switch languages manually.

Person A speaks Chinese → auto-detect zh-TW → translate to en-US → TTS (en-US voice) → tts_ready
Person B speaks English → auto-detect en-US → translate to zh-TW → TTS (zh-TW voice) → tts_ready
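
The direction mapping shown above can be expressed as a small function: given the two configured languages and the language detected for a sentence, the target is simply the other one. This is a sketch of the rule, not the server's implementation; the function name and error handling are illustrative.

```python
from typing import Tuple


def target_language(detected: str, pair: Tuple[str, str]) -> str:
    """Return the language a sentence should be translated into."""
    first, second = pair
    if first == second:
        raise ValueError("interpretation mode needs two different languages")
    if detected == first:
        return second
    if detected == second:
        return first
    raise ValueError(f"{detected!r} is not one of the configured languages {pair}")


pair = ("zh-TW", "en-US")
a_to = target_language("zh-TW", pair)  # Person A speaks Chinese
b_to = target_language("en-US", pair)  # Person B speaks English
```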

Interpretation Mode Characteristics

Item	Description
Connection model	Single WebSocket connection, two people sharing one device
Number of languages	Exactly 2 (e.g., zh-TW + en-US)
Recognition mode	Automatic language detection
Language detection	Detected automatically per sentence; no manual switching needed
TTS output	Translation results are synthesized to TTS automatically (can be disabled, and toggled mid-session via `set_tts`)
Summary	Supports optional automatic summary (`summary_template`)

Interpretation Flow Overview

WebSocket: Start with the conversation type and specify the two languages
Send audio: The system automatically detects the language, translates to the other language, and returns TTS
Automatic language detection: Each sentence is detected independently; origin.language reflects the detected language
Stop: Stop recording; a Task is created automatically once processing completes

For details, see the interpretation mode section of the Real-Time Voice Translation Guide.

Broadcast Architecture

The broadcast feature delivers one presenter's speech to many viewers in real time.

Presenter (WebSocket)
    │
    ├── Send audio ──→ STT ──→ Translation ──→ TTS (optional)
    │
    ▼
  VAS server
    │
    ▼
Viewer 1 (SSE) ──→ Live subtitles + TTS audio
Viewer 2 (SSE) ──→ Live subtitles + TTS audio
Viewer N (SSE) ──→ Live subtitles + TTS audio

Broadcast Phases

Phase	Description
`standby`	The presenter can test equipment; viewers see a waiting message
`live`	Live subtitles are broadcast to all viewers

Broadcast Flow Overview

REST API: Create a broadcast channel and obtain a broadcast_token
WebSocket: The presenter starts recording with the broadcast type
SSE: Viewers connect through the share link to receive subtitles
WebSocket: The presenter stops recording, and the broadcast ends

For details, see the Broadcast Guide.
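
On the viewer side, receiving subtitles means reading a Server-Sent Events stream. The sketch below parses the standard SSE line format (`event:` and `data:` fields, blank line as delimiter, `:` for comments). The event name `subtitle` and the JSON payload in the sample are made up for illustration; the real event names and payloads are defined in the Broadcast Guide.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs from raw SSE lines."""
    event = "message"  # default event name in the SSE format
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            name, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]
            if name == "event":
                event = value
            elif name == "data":
                data.append(value)


# Hypothetical stream contents, for illustration only.
sample = [
    "event: subtitle\n",
    'data: {"text": "hello"}\n',
    "\n",
    ": keep-alive\n",
    "data: line one\n",
    "data: line two\n",
    "\n",
]
events = list(parse_sse(sample))
```

In a browser, `EventSource` performs this parsing automatically; a hand-written parser is only needed in environments without it.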

Summary Templates

A summary template defines the output format for automatic summaries. In transcribe recordings, summary_template is a required parameter; in conversation and broadcast recordings, it is optional.

You can query the list of available templates via the Summary Templates API.
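
The required/optional rule can be checked on the client before a session starts. The sketch below assumes, based on the Type Differences table, that `record` recordings do not support summaries at all; the rule names and the `check_summary_template` helper are illustrative.

```python
from typing import Optional

# "required" and "optional" come from the paragraph above;
# "unsupported" for record is inferred from the Type Differences table.
SUMMARY_RULE = {
    "transcribe": "required",
    "conversation": "optional",
    "broadcast": "optional",
    "record": "unsupported",
}


def check_summary_template(recording_type: str, summary_template: Optional[str]) -> None:
    """Raise ValueError if the template setting conflicts with the type."""
    rule = SUMMARY_RULE.get(recording_type)
    if rule is None:
        raise ValueError(f"unknown recording type: {recording_type!r}")
    if rule == "required" and not summary_template:
        raise ValueError("summary_template is required for transcribe recordings")
    if rule == "unsupported" and summary_template:
        raise ValueError(f"{recording_type} recordings do not produce summaries")
```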

Terminology and Text Processing

VAS provides three text-processing settings. Each can be configured before recording starts or updated dynamically while recording is in progress:

Setting	Description
Terminology (terminology)	Improves recognition accuracy for specific terms
Fuzzy correction (fuzzy_correction)	Automatically corrects homophone and near-homophone errors (can be generated automatically by the system)
Translation dictionary (translation_dict)	Ensures consistent translation of proper nouns

For details, see the config action in the WebSocket voice-translation reference.
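
As a conceptual illustration of fuzzy correction, the sketch below replaces misrecognized phrases with the intended term. VAS applies its settings server-side; the value shapes shown here (a list for `terminology`, plain mappings for `fuzzy_correction` and `translation_dict`) and the sample entries are assumptions for illustration, not the documented format.

```python
from typing import Dict

# Hypothetical settings; only the three key names come from the table above.
settings = {
    "terminology": ["Kubernetes"],
    "fuzzy_correction": {"cooper netties": "Kubernetes"},
    "translation_dict": {"Kubernetes": "Kubernetes"},
}


def apply_fuzzy_correction(text: str, corrections: Dict[str, str]) -> str:
    """Replace each misrecognized phrase with its corrected form."""
    for wrong, right in corrections.items():
        text = text.replace(wrong, right)
    return text


corrected = apply_fuzzy_correction(
    "deploy the service to cooper netties", settings["fuzzy_correction"]
)
```

The three settings act at different stages: terminology biases recognition, fuzzy correction repairs recognized text, and the translation dictionary pins how a term is rendered in the target language.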

Webhook Callbacks

VAS supports webhook notifications: when recording processing completes or fails, VAS pushes an event to a URL you specify, so you do not need to poll for the result.

Supported Events

Event	Description
`recording.completed`	Recording processing complete
`recording.failed`	Recording processing failed
`import.completed`	Audio import complete
`import.failed`	Audio import failed

Configuration

Per request: Add the callback_url parameter to the API request (audio import, broadcast)
Per API Key: Specify webhook_url in the API Key settings (applies to all requests)

For details, see the Webhook Callback Guide.
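
A receiving endpoint typically dispatches on the event name. The sketch below handles the four events listed above. The payload shape is an assumption: it presumes a JSON body with an `event` field, which this page does not specify, so check the Webhook Callback Guide for the actual fields.

```python
import json

SUPPORTED_EVENTS = {
    "recording.completed",
    "recording.failed",
    "import.completed",
    "import.failed",
}


def handle_webhook(body: bytes) -> str:
    """Classify an incoming webhook body and return a short action label."""
    payload = json.loads(body)
    event = payload.get("event")  # assumed field name
    if event not in SUPPORTED_EVENTS:
        return "ignored"
    resource, outcome = event.split(".")
    if outcome == "completed":
        return f"{resource}: fetch results"
    return f"{resource}: report failure"
```

Returning a label instead of acting directly keeps the handler easy to test; a real endpoint would map the label to its own follow-up work and ignore unknown events so that new event types do not break it.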

Version: V1.5.7 Last Updated: 2026-05-20
