Speaker Management
Table of Contents
- Overview
- Enabling Speaker Recognition
- Receiving Speaker Information
- Renaming a Speaker
- Reassigning a Speaker
- Merging Speakers
- Realtime Mode vs. Offline Mode
- Best Practices
- Related Reference Documents
Overview
VAS's Speaker Diarization feature automatically identifies different speakers in multi-party conversations and tags each sentence with the speaker's identity. The system supports speaker recognition in 31 languages.
Core Features
| Feature | Description | API Type |
|---|---|---|
| Speaker recognition | Automatically identifies and distinguishes different speakers | WebSocket |
| Rename speaker | Changes Guest-1 to a real name | WebSocket / REST |
| Reassign | Corrects the speaker identity of a single sentence | WebSocket / REST |
| Merge speakers | Merges speaker IDs that were mistakenly assigned to the same person | WebSocket / REST |
Use Cases
- Meeting minutes: Automatically distinguish between participants' remarks
- Interview transcription: Tag the host and the interviewee
- Conversation records: Identify two-party or multi-party conversations
Authentication
All speaker management REST APIs require API Key authentication. See Authentication for details.
Enabling Speaker Recognition
To use the speaker recognition feature, set the following parameters in the WebSocket start action:
{
"type": "voice-translation",
"data": {
"action": "start",
"transcription_languages": ["zh-TW"],
"translation_languages": ["en-US"],
"type": "conversation",
"recognition_mode": "multi_speaker",
"audio_format": "pcm"
}
}
Key Parameters
| Parameter | Value | Description |
|---|---|---|
| type | conversation | Use the conversation record type |
| recognition_mode | multi_speaker | Enable multi-party speaker recognition |
Note: type can also be set to transcribe or broadcast. Speaker recognition is enabled as long as recognition_mode is set to multi_speaker.
Restriction: In multi_speaker mode, transcription_languages must contain exactly 1 language. If you provide multiple languages, you will receive a diarization_multilang_conflict error and the session will be refused. You must switch to a single language or disable speaker diarization.
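The single-language restriction can be checked on the client before the start action is sent. The following is a minimal sketch, not part of the VAS SDK: buildStartMessage is a hypothetical helper name, and the field values mirror the start action shown above.

```javascript
// Hypothetical helper: builds the multi_speaker start action and rejects
// multi-language input locally, before the server can refuse the session.
function buildStartMessage({ transcriptionLanguages, translationLanguages = [], type = 'conversation' }) {
  if (!Array.isArray(transcriptionLanguages) || transcriptionLanguages.length !== 1) {
    // The server would answer this case with diarization_multilang_conflict.
    throw new Error('multi_speaker mode requires exactly 1 transcription language');
  }
  return {
    type: 'voice-translation',
    data: {
      action: 'start',
      transcription_languages: transcriptionLanguages,
      translation_languages: translationLanguages,
      type, // conversation, transcribe, or broadcast
      recognition_mode: 'multi_speaker',
      audio_format: 'pcm',
    },
  };
}
```

The returned object can be serialized with JSON.stringify and sent over the already-open WebSocket connection.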
Successful Response
After starting successfully, you will receive a session_started event confirming that the recognition mode is multi_speaker:
{
"type": "voice-translation",
"data": {
"action": "session_started",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"recording_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
"recording_type": "conversation",
"recognition_mode": "multi_speaker",
"message": "Speech recognition started"
}
}
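A client may want to confirm that diarization is actually active before relying on speaker_id in later results. The check below is a small sketch based only on the fields in the session_started event above; isMultiSpeakerSession is a hypothetical name.

```javascript
// Returns true only for a session_started message whose recognition mode
// is multi_speaker. The input is the already-parsed WebSocket message.
function isMultiSpeakerSession(message) {
  const data = message?.data;
  return data?.action === 'session_started' && data?.recognition_mode === 'multi_speaker';
}
```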
Receiving Speaker Information
Once multi-party speaker recognition is enabled, every recognition result (the result event) includes speaker information.
Recognition Result Format
{
"type": "voice-translation",
"data": {
"action": "result",
"origin": {
"sid": 1,
"language": "zh-TW",
"text": "Today's meeting mainly discusses the project progress",
"is_final": true,
"speaker_id": "Guest-1",
"detected_language": "zh-TW",
"start_time": "00:05"
}
}
}
Speaker-Related Fields
| Field | Type | Description |
|---|---|---|
| speaker_id | string | Speaker ID (automatically assigned by the system, e.g., Guest-1) |
| sid | int | Sentence number, unique per sentence |
| is_final | boolean | Whether this is the final result |
Speaker ID Naming Rules
- The system automatically assigns IDs in the format Guest-{N} (N starts at 1 and increments)
- The same speaker uses the same ID throughout the entire session
- After renaming, subsequent recognition results use the new name
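Because a speaker keeps the same ID for the whole session, final results can be grouped by speaker_id on the client. This is an illustrative sketch that assumes the messages have already been parsed from JSON; groupFinalSentencesBySpeaker is a hypothetical helper.

```javascript
// Collects final sentences per speaker_id from parsed result messages.
// Interim results (is_final === false) and other actions are skipped.
function groupFinalSentencesBySpeaker(messages) {
  const bySpeaker = new Map();
  for (const message of messages) {
    const origin = message?.data?.origin;
    if (message?.data?.action !== 'result' || !origin || !origin.is_final) continue;
    if (!bySpeaker.has(origin.speaker_id)) bySpeaker.set(origin.speaker_id, []);
    bySpeaker.get(origin.speaker_id).push({ sid: origin.sid, text: origin.text });
  }
  return bySpeaker;
}
```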
Renaming a Speaker
Change a system-assigned speaker ID (such as Guest-1) to a meaningful name (such as Manager Wang). Renaming is a global operation; all sentences that use that speaker ID are updated simultaneously.
Method 1: WebSocket (Realtime Mode)
For realtime renaming while a recording is in progress.
{
"type": "voice-translation",
"data": {
"action": "rename_speaker",
"speaker_id": "Guest-1",
"new_label": "Manager Wang"
}
}
Successful response:
{
"type": "voice-translation",
"data": {
"action": "speaker_renamed",
"speaker_id": "Guest-1",
"new_label": "Manager Wang",
"affected_sids": [1, 3, 5, 8]
}
}
affected_sids lists all affected sentence numbers, so the frontend can update the UI based on this information.
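One way a frontend might apply this response is sketched below. The sentences store (a Map from sid to an object with a label property) is a hypothetical client-side structure, not something the API defines.

```javascript
// Applies a parsed speaker_renamed WebSocket message to a local store.
// sentences: hypothetical Map of sid -> { speakerId, label }.
// Returns the sids that were actually updated locally.
function applySpeakerRenamed(sentences, message) {
  const { new_label: newLabel, affected_sids: affectedSids } = message.data;
  const updated = [];
  for (const sid of affectedSids) {
    const sentence = sentences.get(sid);
    if (!sentence) continue; // sentence is not held locally, nothing to redraw
    sentence.label = newLabel;
    updated.push(sid);
  }
  return updated;
}
```

Returning the updated sids lets the caller re-render only the rows that changed.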
Method 2: REST API (Offline Mode)
For offline editing after a recording has ended.
curl -X PATCH "https://vas-poc.vurbo.ai/api/v1/tasks/{taskId}/speakers/rename" \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"speaker_id": "Guest-1",
"new_label": "Manager Wang"
}'
Successful response (HTTP 200):
{
"data": {
"speaker_id": "Guest-1",
"new_label": "Manager Wang",
"affected_sids": [1, 3, 5, 8, 12]
}
}
Renaming Restrictions
- speaker_id must be either an original speaker ID currently present in the recording or that speaker's current display label; if it matches neither, speaker_not_found is returned
- new_label cannot be empty, has a maximum of 100 characters, and must not contain control characters (\x00-\x1F, \x7F) or newlines
- The new label cannot duplicate the label of another existing speaker (a speaker_name_duplicate error is returned)
- The REST API applies only to recordings in multi_speaker mode
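The label rules above can be pre-checked on the client to avoid a round trip. This is a sketch under stated assumptions: validateNewLabel is a hypothetical helper, label_too_long and label_invalid_characters are local codes invented for illustration (the document does not name the server's codes for those cases), and length is counted in Unicode code points, which may differ from how the server counts characters.

```javascript
const MAX_LABEL_LENGTH = 100;
// \n and \r fall inside \x00-\x1F, so newlines are covered by this range.
const CONTROL_CHARS = /[\x00-\x1F\x7F]/;

// Returns null when the label passes, otherwise a short error code.
function validateNewLabel(newLabel, existingLabels = []) {
  if (typeof newLabel !== 'string' || newLabel.length === 0) return 'speaker_name_empty';
  // Assumption: count code points rather than UTF-16 units.
  if (Array.from(newLabel).length > MAX_LABEL_LENGTH) return 'label_too_long'; // local code
  if (CONTROL_CHARS.test(newLabel)) return 'label_invalid_characters'; // local code
  if (existingLabels.includes(newLabel)) return 'speaker_name_duplicate';
  return null;
}
```

The server remains the authority; a local pass only means the request is worth sending.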
Reassigning a Speaker
Change the speaker identity of a single sentence, assigning the sentence to another existing speaker. This is useful for correcting speaker recognition errors.
Method 1: WebSocket (Realtime Mode)
{
"type": "voice-translation",
"data": {
"action": "reassign_speaker",
"sid": 5,
"target_speaker_id": "Guest-2"
}
}
Successful response:
{
"type": "voice-translation",
"data": {
"action": "speaker_reassigned",
"sid": 5,
"old_speaker_id": "Guest-1",
"new_speaker_id": "Guest-2",
"new_speaker_label": "Lisa Lee"
}
}
Method 2: REST API (Offline Mode)
curl -X PATCH "https://vas-poc.vurbo.ai/api/v1/tasks/{taskId}/speakers/reassign" \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"sid": 5,
"target_speaker_id": "Guest-2"
}'
Reassignment Restrictions
- target_speaker_id must be the original ID of an existing speaker (creating a new speaker is not supported, and display labels are not accepted)
- If that speaker has been renamed, new_speaker_label reflects the display label after applying speaker_aliases
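Since reassignment accepts only original IDs, a UI that shows display labels needs to translate a selected label back to its ID before sending. The sketch below assumes the client keeps its own Map of original ID to current display label (a hypothetical structure); buildReassignMessage is likewise a hypothetical helper.

```javascript
// aliases: hypothetical client-side Map of original ID -> current display label.
// target may be an original ID or a display label picked in the UI.
function buildReassignMessage(sid, target, aliases) {
  let targetId = null;
  if (aliases.has(target)) {
    targetId = target; // already an original ID
  } else {
    for (const [id, label] of aliases) {
      if (label === target) {
        targetId = id;
        break;
      }
    }
  }
  if (targetId === null) throw new Error(`unknown speaker: ${target}`);
  return {
    type: 'voice-translation',
    data: { action: 'reassign_speaker', sid, target_speaker_id: targetId },
  };
}
```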
Merging Speakers
Merge all sentences of one speaker into another speaker. This is useful when the system mistakenly recognizes the same person's voice as multiple speakers.
Use Case
The speech recognition engine sometimes recognizes the same person's voice at different times as different speakers (for example, Guest-1 and Guest-3 are actually the same person). After merging:
- All Guest-3 sentences are attributed to Guest-1
- In WebSocket mode: future recognition results identified as Guest-3 are also automatically converted to Guest-1 (continuous interception)
- In REST mode: historical recordings have no new sentences, so only the existing sentences are merged once
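The continuous interception described above can be pictured as a redirect table consulted for every incoming speaker ID. The class below is only a conceptual illustration of that behavior, not the server's implementation; SpeakerMergeMap and its methods are hypothetical names.

```javascript
// Conceptual model of continuous merging: each merged source ID redirects
// to a target, and lookups follow redirects until an unmerged ID is reached.
class SpeakerMergeMap {
  constructor() {
    this.redirects = new Map();
  }

  merge(sourceId, targetId) {
    if (sourceId === targetId) throw new Error('merge_speakers_same_id');
    const resolvedTarget = this.resolve(targetId);
    // Refuse merges that would point a speaker back at itself.
    if (resolvedTarget === sourceId) throw new Error('merge would create a cycle');
    this.redirects.set(sourceId, resolvedTarget);
  }

  resolve(speakerId) {
    let current = speakerId;
    while (this.redirects.has(current)) current = this.redirects.get(current);
    return current;
  }
}
```

In this model, calling merge('Guest-3', 'Guest-1') once means every later result tagged Guest-3 resolves to Guest-1, which matches the WebSocket behavior; the REST case corresponds to rewriting the existing sentences a single time.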
Method 1: WebSocket (Realtime Mode)
{
"type": "voice-translation",
"data": {
"action": "merge_speakers",
"source_speaker_id": "Guest-3",
"target_speaker_id": "Guest-1"
}
}
Successful response:
{
"type": "voice-translation",
"data": {
"action": "speakers_merged",
"source_speaker_id": "Guest-3",
"target_speaker_id": "Guest-1",
"affected_sids": [3, 5, 7]
}
}
Method 2: REST API (Offline Mode)
curl -X PATCH "https://vas-poc.vurbo.ai/api/v1/tasks/{taskId}/speakers/merge" \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"source_speaker_id": "Guest-3",
"target_speaker_id": "Guest-1"
}'
Successful response (HTTP 200):
{
"data": {
"source_speaker_id": "Guest-3",
"target_speaker_id": "Guest-1",
"target_speaker_label": "Manager Wang",
"affected_sids": [3, 5, 7]
}
}
Merge vs. Reassign Comparison
| Feature | Scope | Affects Future Recognition Results (WS) |
|---|---|---|
| reassign_speaker | A single sentence (1 SID) | No |
| merge_speakers | All sentences of the speaker | Yes (future occurrences of the source are automatically converted to the target) |
Merge Restrictions
- source_speaker_id and target_speaker_id cannot be the same (a merge_speakers_same_id error is returned)
- Both speaker IDs must exist in the recording
- REST mode applies only to recordings with recognition_mode: multi_speaker
Realtime Mode vs. Offline Mode
Speaker management offers two usage modes. The following is a complete comparison:
| Operation | Realtime Mode (WebSocket) | Offline Mode (REST API) |
|---|---|---|
| Rename speaker | rename_speaker action | PATCH /api/v1/tasks/{taskId}/speakers/rename |
| Reassign | reassign_speaker action | PATCH /api/v1/tasks/{taskId}/speakers/reassign |
| Merge speakers | merge_speakers action | PATCH /api/v1/tasks/{taskId}/speakers/merge |
| When to use | While recording is in progress | After recording has ended |
| Broadcast sync | Automatically pushed to SSE viewers | Not applicable |
REST vs. WebSocket merge difference: Both merge existing sentences; however, the WebSocket version additionally creates a continuous mapping that "automatically converts future source IDs to the target." This does not apply to historical recordings (which have no new sentences).
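The three offline endpoints differ only in the final path segment and the request body, so a client could build them with one helper. This sketch reuses the host, method, and headers from the curl examples; buildSpeakerRequest is a hypothetical helper, and the commented fetch call assumes a runtime that provides fetch.

```javascript
const BASE_URL = 'https://vas-poc.vurbo.ai/api/v1';
const OPERATIONS = ['rename', 'reassign', 'merge'];

// Builds the URL and request options for one offline speaker edit.
function buildSpeakerRequest(taskId, operation, body, apiKey) {
  if (!OPERATIONS.includes(operation)) {
    throw new Error(`unsupported operation: ${operation}`);
  }
  return {
    url: `${BASE_URL}/tasks/${encodeURIComponent(taskId)}/speakers/${operation}`,
    options: {
      method: 'PATCH',
      headers: { 'X-API-Key': apiKey, 'Content-Type': 'application/json' },
      body: JSON.stringify(body),
    },
  };
}

// Usage (not executed here):
// const { url, options } = buildSpeakerRequest(taskId, 'merge',
//   { source_speaker_id: 'Guest-3', target_speaker_id: 'Guest-1' }, apiKey);
// const response = await fetch(url, options);
```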
Speaker Management in Broadcast Mode
In broadcast mode, speaker management operations are automatically synced to SSE viewers:
| WebSocket Operation | SSE Event Received by Viewers |
|---|---|
| rename_speaker | speaker_renamed |
| reassign_speaker | speaker_reassigned |
| merge_speakers | speakers_merged |
Viewers can update their UI in real time based on these events:
eventSource.addEventListener('speaker_renamed', (e) => {
const data = JSON.parse(e.data);
// Update the display labels of all affected_sids
data.affected_sids.forEach(sid => {
updateSpeakerLabel(sid, data.new_label);
});
});
eventSource.addEventListener('speaker_reassigned', (e) => {
const data = JSON.parse(e.data);
// Update the speaker of a single sentence (new_speaker_id is the original ID, new_speaker_label is the display label)
updateSpeakerForSentence(data.sid, data.new_speaker_id, data.new_speaker_label);
});
eventSource.addEventListener('speakers_merged', (e) => {
const data = JSON.parse(e.data);
// Update the display labels of all affected sentences
data.affected_sids.forEach(sid => {
updateSpeakerLabel(sid, data.target_speaker_label);
});
});
Best Practices
1. Recognize First, Then Name
Let the system recognize the different speakers first (Guest-1, Guest-2, ...), and rename them only after confirming that recognition is stable.
2. Make Good Use of the Merge Feature
If the same person has been recognized as multiple speakers (for example, after leaving and coming back midway), merge_speakers is more efficient than correcting sentence by sentence with reassign_speaker. In WebSocket mode, it also converts future recognition results from the source ID to the target.
3. Offline Editing for Correction
After a recording ends, perform final corrections on the transcript through the REST API to ensure that the speaker tags of all sentences are correct.
4. Error Handling
| Error Code | Description | Suggested Action |
|---|---|---|
| speaker_not_found | The specified speaker was not found | Confirm that the speaker ID exists |
| speaker_name_empty | The name cannot be empty | Provide a valid name |
| speaker_name_duplicate | The name is already in use | Use a different name |
| speaker_sid_not_found | The specified sentence was not found | Confirm that the SID exists |
| speaker_diarization_required | Only diarization recordings are supported | Confirm that multi_speaker mode is used |
| merge_speakers_same_id | Source and target are the same | Use different speaker IDs |
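A client can turn the table above into a lookup so that each error code surfaces its suggested action. This is a minimal sketch; suggestAction is a hypothetical helper, and how the error code is extracted from a response is not specified in this document.

```javascript
// Error codes and suggested actions copied from the table above.
const SPEAKER_ERROR_ACTIONS = new Map([
  ['speaker_not_found', 'Confirm that the speaker ID exists'],
  ['speaker_name_empty', 'Provide a valid name'],
  ['speaker_name_duplicate', 'Use a different name'],
  ['speaker_sid_not_found', 'Confirm that the SID exists'],
  ['speaker_diarization_required', 'Confirm that multi_speaker mode is used'],
  ['merge_speakers_same_id', 'Use different speaker IDs'],
]);

// Returns the suggested action, or a generic fallback for unknown codes.
function suggestAction(errorCode) {
  return SPEAKER_ERROR_ACTIONS.get(errorCode) ?? 'Unrecognized error code; check the API reference';
}
```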
Related Reference Documents
- REST API - Recording Speaker Editing
- WebSocket - Voice Translation (rename_speaker / reassign_speaker / merge_speakers)
- SSE - Broadcast Viewer (speaker_renamed / speaker_reassigned / speakers_merged events)
Version: V1.5.7 Last Updated: 2026-05-20