Speaker Management

Overview

VAS's Speaker Diarization feature automatically identifies different speakers in multi-party conversations and tags each sentence with the speaker's identity. The system supports speaker recognition in 31 languages.

Core Features

Feature	Description	API Type
Speaker recognition	Automatically identifies and distinguishes different speakers	WebSocket
Rename speaker	Changes `Guest-1` to a real name	WebSocket / REST
Reassign	Corrects the speaker identity of a single sentence	WebSocket / REST
Merge speakers	Merges speaker IDs that were mistakenly assigned to the same person	WebSocket / REST

Use Cases

Meeting minutes: Automatically distinguish between participants' remarks
Interview transcription: Tag the host and the interviewee
Conversation records: Identify two-party or multi-party conversations

Authentication

All speaker management REST APIs require API Key authentication. See Authentication for details.

Enabling Speaker Recognition

To use the speaker recognition feature, set the following parameters in the WebSocket start action:

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "transcription_languages": ["zh-TW"],
    "translation_languages": ["en-US"],
    "type": "conversation",
    "recognition_mode": "multi_speaker",
    "audio_format": "pcm"
  }
}

Key Parameters

Parameter	Value	Description
`type`	`conversation`	Use the conversation record type
`recognition_mode`	`multi_speaker`	Enable multi-party speaker recognition

Note: type can also be set to transcribe or broadcast. Speaker recognition is enabled whenever recognition_mode is set to multi_speaker, regardless of type.
Restriction: In multi_speaker mode, transcription_languages must contain exactly 1 language. If you provide more than one, the server returns a diarization_multilang_conflict error and rejects the session. To proceed, either specify a single language or disable speaker diarization.
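
The single-language restriction can be checked on the client before the start action is sent. The following is a minimal sketch: the helper name buildStartMessage and its option names are hypothetical, and only the message fields come from this guide.

```javascript
// Hypothetical helper: builds the WebSocket start action for multi_speaker mode.
// Only the message fields come from this guide; the function itself is illustrative.
function buildStartMessage({ transcriptionLanguages, translationLanguages, type = 'conversation' }) {
  // multi_speaker mode requires exactly one transcription language; otherwise
  // the server answers with diarization_multilang_conflict.
  if (!Array.isArray(transcriptionLanguages) || transcriptionLanguages.length !== 1) {
    throw new Error('multi_speaker mode requires exactly 1 transcription language');
  }
  return {
    type: 'voice-translation',
    data: {
      action: 'start',
      transcription_languages: transcriptionLanguages,
      translation_languages: translationLanguages,
      type,
      recognition_mode: 'multi_speaker',
      audio_format: 'pcm',
    },
  };
}

const startMessage = buildStartMessage({
  transcriptionLanguages: ['zh-TW'],
  translationLanguages: ['en-US'],
});
// ws.send(JSON.stringify(startMessage)); // ws: an already-open WebSocket (not shown)
```

Failing fast on the client avoids a round trip that the server would reject anyway; the server-side check still applies.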

Successful Response

After starting successfully, you will receive a session_started event confirming that the recognition mode is multi_speaker:

{
  "type": "voice-translation",
  "data": {
    "action": "session_started",
    "session_id": "550e8400-e29b-41d4-a716-446655440000",
    "recording_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
    "recording_type": "conversation",
    "recognition_mode": "multi_speaker",
    "message": "Speech recognition started"
  }
}
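
A client may want to confirm the mode before it relies on speaker fields. The check below is a small sketch; the function name isMultiSpeakerSession is hypothetical, and it reads only the fields shown in the session_started example above.

```javascript
// Hypothetical check: true only when the server confirms multi_speaker mode.
function isMultiSpeakerSession(message) {
  const data = message && message.data;
  return Boolean(data) &&
    data.action === 'session_started' &&
    data.recognition_mode === 'multi_speaker';
}

const sessionStarted = {
  type: 'voice-translation',
  data: {
    action: 'session_started',
    recording_type: 'conversation',
    recognition_mode: 'multi_speaker',
  },
};
const diarizationActive = isMultiSpeakerSession(sessionStarted);
```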

Receiving Speaker Information

Once multi-party speaker recognition is enabled, every recognition result (the result event) includes speaker information.

Recognition Result Format

{
  "type": "voice-translation",
  "data": {
    "action": "result",
    "origin": {
      "sid": 1,
      "language": "zh-TW",
      "text": "Today's meeting mainly discusses the project progress",
      "is_final": true,
      "speaker_id": "Guest-1",
      "detected_language": "zh-TW",
      "start_time": "00:05"
    }
  }
}

Field	Type	Description
`speaker_id`	string	Speaker ID (automatically assigned by the system, e.g., `Guest-1`)
`sid`	int	Sentence number, unique per sentence
`is_final`	boolean	Whether this is the final result

Speaker ID Naming Rules

The system automatically assigns IDs in the format Guest-{N} (N starts at 1 and increments)
The same speaker uses the same ID throughout the entire session
After renaming, subsequent recognition results use the new name
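
One way to consume these events is to keep a client-side map keyed by sid. This is a sketch under stated assumptions: the store layout and the function names applyResult and sidsForSpeaker are hypothetical, and it assumes that interim (non-final) results reuse the sid of the sentence they belong to.

```javascript
// Hypothetical client-side store: sid -> { text, speakerId, isFinal }.
const sentences = new Map();

// Records one result event. A later result with the same sid overwrites the
// earlier one, so the map always holds the latest version of each sentence.
function applyResult(message) {
  const { action, origin } = message.data;
  if (action !== 'result' || !origin) return;
  sentences.set(origin.sid, {
    text: origin.text,
    speakerId: origin.speaker_id,
    isFinal: origin.is_final,
  });
}

// Lists the sids currently attributed to a speaker, in ascending order.
function sidsForSpeaker(speakerId) {
  return [...sentences.entries()]
    .filter(([, sentence]) => sentence.speakerId === speakerId)
    .map(([sid]) => sid)
    .sort((a, b) => a - b);
}

applyResult({
  type: 'voice-translation',
  data: {
    action: 'result',
    origin: { sid: 1, text: 'First sentence', is_final: true, speaker_id: 'Guest-1' },
  },
});
```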

Renaming a Speaker

Change a system-assigned speaker ID (such as Guest-1) to a meaningful name (such as Manager Wang). Renaming is a global operation; all sentences that use that speaker ID are updated simultaneously.

Method 1: WebSocket (Realtime Mode)

Use this method to rename a speaker while a recording is in progress.

{
  "type": "voice-translation",
  "data": {
    "action": "rename_speaker",
    "speaker_id": "Guest-1",
    "new_label": "Manager Wang"
  }
}

Successful response:

{
  "type": "voice-translation",
  "data": {
    "action": "speaker_renamed",
    "speaker_id": "Guest-1",
    "new_label": "Manager Wang",
    "affected_sids": [1, 3, 5, 8]
  }
}

affected_sids lists all affected sentence numbers, so the frontend can update the UI based on this information.
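
A frontend might apply the speaker_renamed response as sketched below. The state shape (a Map from sid to display label) and the function name applySpeakerRenamed are assumptions made for illustration.

```javascript
// Hypothetical UI state: sid -> display label shown for that sentence.
// Returns a new Map and leaves the input untouched.
function applySpeakerRenamed(labelsBySid, event) {
  const { new_label, affected_sids } = event.data;
  const updated = new Map(labelsBySid);
  for (const sid of affected_sids) {
    // Only sentences already known locally are relabeled.
    if (updated.has(sid)) updated.set(sid, new_label);
  }
  return updated;
}

const labels = new Map([[1, 'Guest-1'], [2, 'Guest-2'], [3, 'Guest-1']]);
const relabeled = applySpeakerRenamed(labels, {
  data: { action: 'speaker_renamed', speaker_id: 'Guest-1', new_label: 'Manager Wang', affected_sids: [1, 3, 5, 8] },
});
```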

Method 2: REST API (Offline Mode)

Use this method to edit speaker names after a recording has ended.

curl -X PATCH "https://vas-poc.vurbo.ai/api/v1/tasks/{taskId}/speakers/rename" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "speaker_id": "Guest-1",
    "new_label": "Manager Wang"
  }'

Successful response (HTTP 200):

{
  "data": {
    "speaker_id": "Guest-1",
    "new_label": "Manager Wang",
    "affected_sids": [1, 3, 5, 8, 12]
  }
}

Renaming Restrictions

speaker_id must be either an original speaker ID present in the recording or that speaker's current display label; if neither resolves to a speaker, speaker_not_found is returned
new_label cannot be empty, is limited to 100 characters, and must not contain control characters (\x00-\x1F, \x7F), which includes newlines
new_label cannot duplicate the label of another existing speaker (a speaker_name_duplicate error is returned)
The REST API applies only to recordings in multi_speaker mode
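
The label rules above can be mirrored in a client-side pre-check. This is a sketch, not the server's validation: the function name validateNewLabel and the two local codes label_too_long and label_has_control_chars are hypothetical, and counting the 100-character limit in Unicode code points is an assumption.

```javascript
// Hypothetical pre-flight check mirroring the documented new_label rules.
// The server remains the source of truth; this only avoids obvious round trips.
function validateNewLabel(newLabel, existingLabels = []) {
  if (typeof newLabel !== 'string' || newLabel.length === 0) {
    return 'speaker_name_empty';
  }
  // Assumption: the limit is counted in Unicode code points.
  if ([...newLabel].length > 100) {
    return 'label_too_long'; // hypothetical local code, not a documented error
  }
  // Control characters \x00-\x1F and \x7F (this range includes newlines).
  if (/[\x00-\x1F\x7F]/.test(newLabel)) {
    return 'label_has_control_chars'; // hypothetical local code
  }
  if (existingLabels.includes(newLabel)) {
    return 'speaker_name_duplicate';
  }
  return null; // no problem found
}

const problem = validateNewLabel('Manager Wang', ['Lisa Lee']);
```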

Reassigning a Speaker

Change the speaker identity of a single sentence, assigning the sentence to another existing speaker. This is useful for correcting speaker recognition errors.

Method 1: WebSocket (Realtime Mode)

{
  "type": "voice-translation",
  "data": {
    "action": "reassign_speaker",
    "sid": 5,
    "target_speaker_id": "Guest-2"
  }
}

Successful response:

{
  "type": "voice-translation",
  "data": {
    "action": "speaker_reassigned",
    "sid": 5,
    "old_speaker_id": "Guest-1",
    "new_speaker_id": "Guest-2",
    "new_speaker_label": "Lisa Lee"
  }
}

Method 2: REST API (Offline Mode)

curl -X PATCH "https://vas-poc.vurbo.ai/api/v1/tasks/{taskId}/speakers/reassign" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "sid": 5,
    "target_speaker_id": "Guest-2"
  }'

Reassignment Restrictions

target_speaker_id must be the original ID of an existing speaker (creating a new speaker is not supported, and display labels are not accepted)
If the target speaker has been renamed, new_speaker_label in the response carries that speaker's display label (that is, the label after speaker_aliases is applied)
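
A client could apply the speaker_reassigned response to its local state as follows. The state shape and the function name applySpeakerReassigned are hypothetical; falling back to the speaker ID when no label is present is an assumption, not documented behavior.

```javascript
// Hypothetical handler: updates one sentence in a sid -> { speakerId, label } map.
// Returns false when the sid is unknown locally, true otherwise.
function applySpeakerReassigned(sentencesBySid, event) {
  const { sid, new_speaker_id, new_speaker_label } = event.data;
  const current = sentencesBySid.get(sid);
  if (!current) return false;
  sentencesBySid.set(sid, {
    ...current,
    speakerId: new_speaker_id,
    // Assumption: show the original ID if the event carries no label.
    label: new_speaker_label ?? new_speaker_id,
  });
  return true;
}

const transcript = new Map([[5, { speakerId: 'Guest-1', label: 'Guest-1' }]]);
const applied = applySpeakerReassigned(transcript, {
  data: {
    action: 'speaker_reassigned',
    sid: 5,
    old_speaker_id: 'Guest-1',
    new_speaker_id: 'Guest-2',
    new_speaker_label: 'Lisa Lee',
  },
});
```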

Merging Speakers

Merge all sentences of one speaker into another speaker. This is useful when the system mistakenly recognizes the same person's voice as multiple speakers.

Use Case

The speech recognition engine sometimes recognizes the same person's voice at different times as different speakers (for example, Guest-1 and Guest-3 are actually the same person). After merging:

All Guest-3 sentences are attributed to Guest-1
In WebSocket mode: future recognition results identified as Guest-3 are also automatically remapped to Guest-1 (a continuous mapping that stays in effect for the rest of the session)
In REST mode: historical recordings have no new sentences, so only the existing sentences are merged once

Method 1: WebSocket (Realtime Mode)

{
  "type": "voice-translation",
  "data": {
    "action": "merge_speakers",
    "source_speaker_id": "Guest-3",
    "target_speaker_id": "Guest-1"
  }
}

Successful response:

{
  "type": "voice-translation",
  "data": {
    "action": "speakers_merged",
    "source_speaker_id": "Guest-3",
    "target_speaker_id": "Guest-1",
    "affected_sids": [3, 5, 7]
  }
}

Method 2: REST API (Offline Mode)

curl -X PATCH "https://vas-poc.vurbo.ai/api/v1/tasks/{taskId}/speakers/merge" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "source_speaker_id": "Guest-3",
    "target_speaker_id": "Guest-1"
  }'

Successful response (HTTP 200):

{
  "data": {
    "source_speaker_id": "Guest-3",
    "target_speaker_id": "Guest-1",
    "target_speaker_label": "Manager Wang",
    "affected_sids": [3, 5, 7]
  }
}

Merge vs. Reassign Comparison

Feature	Scope	Affects Future Recognition Results (WS)
`reassign_speaker`	A single sentence (1 SID)	No
`merge_speakers`	All sentences of the speaker	Yes (future occurrences of the source are automatically converted to the target)
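
The difference in scope can be illustrated with a small client-side model. This is only a sketch of the behavior described in the table; the server's implementation is not documented here, the class name SpeakerModel is hypothetical, and resolving chained merges (A into B, then B into C) is an assumption.

```javascript
// Illustrative model only, not the server's implementation.
class SpeakerModel {
  constructor() {
    this.speakerBySid = new Map(); // sid -> speaker ID
    this.mergedInto = new Map();   // source speaker ID -> target speaker ID
  }

  // Follows merge mappings; the seen set guards against cycles.
  resolve(speakerId) {
    const seen = new Set();
    let id = speakerId;
    while (this.mergedInto.has(id) && !seen.has(id)) {
      seen.add(id);
      id = this.mergedInto.get(id);
    }
    return id;
  }

  // A new recognition result: a merged source ID is converted to its target.
  addResult(sid, speakerId) {
    this.speakerBySid.set(sid, this.resolve(speakerId));
  }

  // reassign: changes exactly one sentence and leaves future results alone.
  reassign(sid, targetSpeakerId) {
    this.speakerBySid.set(sid, targetSpeakerId);
  }

  // merge: rewrites existing sentences and records a mapping for future results.
  merge(sourceId, targetId) {
    if (sourceId === targetId) throw new Error('merge_speakers_same_id');
    const affected = [];
    for (const [sid, speaker] of this.speakerBySid) {
      if (speaker === sourceId) {
        this.speakerBySid.set(sid, targetId);
        affected.push(sid);
      }
    }
    this.mergedInto.set(sourceId, targetId);
    return affected;
  }
}

const model = new SpeakerModel();
model.addResult(3, 'Guest-3');
model.addResult(4, 'Guest-1');
model.addResult(5, 'Guest-3');
const affectedSids = model.merge('Guest-3', 'Guest-1'); // sids 3 and 5 move to Guest-1
model.addResult(7, 'Guest-3'); // stored as Guest-1 because of the mapping
```

In the offline (REST) case only the rewrite of existing sentences applies, since a finished recording receives no new results.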

Merge Restrictions

source_speaker_id and target_speaker_id cannot be the same (a merge_speakers_same_id error is returned)
Both speaker IDs must exist in the recording
REST mode applies only to recordings with recognition_mode: multi_speaker

Realtime Mode vs. Offline Mode

Speaker management offers two usage modes. The following is a complete comparison:

Operation	Realtime Mode (WebSocket)	Offline Mode (REST API)
Rename speaker	`rename_speaker` action	`PATCH /api/v1/tasks/{taskId}/speakers/rename`
Reassign	`reassign_speaker` action	`PATCH /api/v1/tasks/{taskId}/speakers/reassign`
Merge speakers	`merge_speakers` action	`PATCH /api/v1/tasks/{taskId}/speakers/merge`
When to use	While recording is in progress	After recording has ended
Broadcast sync	Automatically pushed to SSE viewers	Not applicable

REST vs. WebSocket merge difference: Both merge existing sentences; however, the WebSocket version additionally creates a continuous mapping that "automatically converts future source IDs to the target." This does not apply to historical recordings (which have no new sentences).
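
Because the three offline endpoints share the same method, headers, and path pattern, a client can build them with one helper. The sketch below assumes the base URL from the curl examples; the function name buildSpeakerRequest is hypothetical, and TASK_ID and YOUR_API_KEY are placeholders.

```javascript
// Base URL as used in the curl examples of this guide.
const BASE_URL = 'https://vas-poc.vurbo.ai';

// Hypothetical helper: returns the URL and fetch options for one offline operation.
function buildSpeakerRequest(operation, taskId, body, apiKey) {
  const allowed = ['rename', 'reassign', 'merge'];
  if (!allowed.includes(operation)) {
    throw new Error(`Unsupported operation: ${operation}`);
  }
  return {
    url: `${BASE_URL}/api/v1/tasks/${encodeURIComponent(taskId)}/speakers/${operation}`,
    options: {
      method: 'PATCH',
      headers: { 'X-API-Key': apiKey, 'Content-Type': 'application/json' },
      body: JSON.stringify(body),
    },
  };
}

const request = buildSpeakerRequest(
  'merge',
  'TASK_ID', // placeholder
  { source_speaker_id: 'Guest-3', target_speaker_id: 'Guest-1' },
  'YOUR_API_KEY' // placeholder
);
// const response = await fetch(request.url, request.options);
```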

Speaker Management in Broadcast Mode

In broadcast mode, speaker management operations are automatically synced to SSE viewers:

WebSocket Operation	SSE Event Received by Viewers
`rename_speaker`	`speaker_renamed`
`reassign_speaker`	`speaker_reassigned`
`merge_speakers`	`speakers_merged`

Viewers can update their UI in real time based on these events:

eventSource.addEventListener('speaker_renamed', (e) => {
  const data = JSON.parse(e.data);
  // Update the display labels of all affected_sids
  data.affected_sids.forEach(sid => {
    updateSpeakerLabel(sid, data.new_label);
  });
});

eventSource.addEventListener('speaker_reassigned', (e) => {
  const data = JSON.parse(e.data);
  // Update the speaker of a single sentence (new_speaker_id is the original ID, new_speaker_label is the display label)
  updateSpeakerForSentence(data.sid, data.new_speaker_id, data.new_speaker_label);
});

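eventSource.addEventListener('speakers_merged', (e) => {
  const data = JSON.parse(e.data);
  // Fall back to the target ID if the event carries no target_speaker_label
  // (the WebSocket response example above does not include that field)
  const label = data.target_speaker_label ?? data.target_speaker_id;
  // Update the display labels of all affected sentences
  data.affected_sids.forEach(sid => {
    updateSpeakerLabel(sid, label);
  });
});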

Best Practices

1. Recognize First, Then Name

Let the system recognize the different speakers first (Guest-1, Guest-2, ...), and rename them only after confirming that recognition is stable.

2. Make Good Use of the Merge Feature

If the same person has been recognized as multiple speakers (for example, after leaving and returning mid-session), use merge_speakers. It is more efficient than correcting each sentence with reassign_speaker, and in realtime mode it also remaps future recognition results.

3. Offline Editing for Correction

After a recording ends, perform final corrections on the transcript through the REST API to ensure that the speaker tags of all sentences are correct.

4. Error Handling

Error Code	Description	Suggested Action
`speaker_not_found`	The specified speaker was not found	Confirm that the speaker ID exists
`speaker_name_empty`	The name cannot be empty	Provide a valid name
`speaker_name_duplicate`	The name is already in use	Use a different name
`speaker_sid_not_found`	The specified sentence was not found	Confirm that the SID exists
`speaker_diarization_required`	Only diarization recordings are supported	Confirm that `multi_speaker` mode is used
`merge_speakers_same_id`	Source and target are the same	Use different speaker IDs

Version: V1.5.7
Last Updated: 2026-05-20
