Speaker Management

Table of Contents

  1. Overview
  2. Enabling Speaker Recognition
  3. Receiving Speaker Information
  4. Renaming a Speaker
  5. Reassigning a Speaker
  6. Merging Speakers
  7. Realtime Mode vs. Offline Mode
  8. Best Practices
  9. Related Reference Documents

Overview

VAS's Speaker Diarization feature automatically identifies different speakers in multi-party conversations and tags each sentence with the speaker's identity. The system supports speaker recognition in 31 languages.

Core Features

Feature | Description | API Type
--- | --- | ---
Speaker recognition | Automatically identifies and distinguishes different speakers | WebSocket
Rename speaker | Changes Guest-1 to a real name | WebSocket / REST
Reassign | Corrects the speaker identity of a single sentence | WebSocket / REST
Merge speakers | Merges the same speaker that was mistakenly recognized as multiple people | WebSocket / REST

Use Cases

  • Meeting minutes: Automatically distinguish between participants' remarks
  • Interview transcription: Tag the host and the interviewee
  • Conversation records: Identify two-party or multi-party conversations

Authentication

All speaker management REST APIs require API Key authentication. See Authentication for details.


Enabling Speaker Recognition

To use the speaker recognition feature, set the following parameters in the WebSocket start action:

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "transcription_languages": ["zh-TW"],
    "translation_languages": ["en-US"],
    "type": "conversation",
    "recognition_mode": "multi_speaker",
    "audio_format": "pcm"
  }
}

Key Parameters

Parameter | Value | Description
--- | --- | ---
type | conversation | Use the conversation record type
recognition_mode | multi_speaker | Enable multi-party speaker recognition

Note: type can also be set to transcribe or broadcast. Speaker recognition is enabled as long as recognition_mode is set to multi_speaker.

Restriction: In multi_speaker mode, transcription_languages must contain exactly 1 language. If you provide multiple languages, you will receive a diarization_multilang_conflict error and the session will be refused. You must switch to a single language or disable speaker diarization.
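
The single-language restriction can also be checked on the client before the start action is sent. The following is a minimal sketch, not part of the VAS API: buildMultiSpeakerStart is a hypothetical helper name, and the field values are copied from the example above.

```javascript
// Hypothetical helper (not part of the VAS API): builds the start action for
// multi_speaker mode and rejects multi-language input before it is sent.
function buildMultiSpeakerStart(
  transcriptionLanguages,
  translationLanguages = [],
  recordingType = "conversation"
) {
  if (!Array.isArray(transcriptionLanguages) || transcriptionLanguages.length !== 1) {
    // Mirrors the documented restriction that the server reports as
    // diarization_multilang_conflict.
    throw new Error("multi_speaker mode requires exactly 1 transcription language");
  }
  return {
    type: "voice-translation",
    data: {
      action: "start",
      transcription_languages: transcriptionLanguages,
      translation_languages: translationLanguages,
      type: recordingType, // "conversation", "transcribe", or "broadcast"
      recognition_mode: "multi_speaker",
      audio_format: "pcm",
    },
  };
}

const startMessage = buildMultiSpeakerStart(["zh-TW"], ["en-US"]);
console.log(JSON.stringify(startMessage, null, 2));
```

Failing locally avoids a round trip that would end with the session being refused; the server-side check remains the authority.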

Successful Response

After starting successfully, you will receive a session_started event confirming that the recognition mode is multi_speaker:

{
  "type": "voice-translation",
  "data": {
    "action": "session_started",
    "session_id": "550e8400-e29b-41d4-a716-446655440000",
    "recording_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
    "recording_type": "conversation",
    "recognition_mode": "multi_speaker",
    "message": "Speech recognition started"
  }
}

Receiving Speaker Information

Once multi-party speaker recognition is enabled, every recognition result (the result event) includes speaker information.

Recognition Result Format

{
  "type": "voice-translation",
  "data": {
    "action": "result",
    "origin": {
      "sid": 1,
      "language": "zh-TW",
      "text": "Today's meeting mainly discusses the project progress",
      "is_final": true,
      "speaker_id": "Guest-1",
      "detected_language": "zh-TW",
      "start_time": "00:05"
    }
  }
}

Field | Type | Description
--- | --- | ---
speaker_id | string | Speaker ID (automatically assigned by the system, e.g., Guest-1)
sid | int | Sentence number, unique per sentence
is_final | boolean | Whether this is the final result

Speaker ID Naming Rules

  • The system automatically assigns IDs in the format Guest-{N} (N starts at 1 and increments)
  • The same speaker uses the same ID throughout the entire session
  • After renaming, subsequent recognition results use the new name
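
A client typically needs to keep the incoming sentences and the speakers seen so far. The sketch below shows one possible way to do that. The store shape and the function names createTranscript and applyResult are hypothetical. The examples do not show how a renamed speaker appears in the speaker_id field, so the sketch stores whatever value arrives.

```javascript
// Hypothetical client-side store: sentences keyed by sid, plus the set of
// speaker_id values seen so far. Field names follow the result event above.
function createTranscript() {
  return { sentences: new Map(), speakers: new Set() };
}

function applyResult(transcript, message) {
  const data = message && message.data;
  if (!data || data.action !== "result" || !data.origin) return false;
  const { sid, text, speaker_id: speakerId, is_final: isFinal } = data.origin;
  // Assumption: an interim result and its final version share the same sid,
  // so a later event for that sid replaces the earlier one.
  transcript.sentences.set(sid, { sid, text, speakerId, isFinal });
  transcript.speakers.add(speakerId);
  return true;
}

const transcript = createTranscript();
applyResult(transcript, {
  type: "voice-translation",
  data: {
    action: "result",
    origin: { sid: 1, text: "Hello", is_final: true, speaker_id: "Guest-1" },
  },
});
console.log([...transcript.speakers]);
```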

Renaming a Speaker

Change a system-assigned speaker ID (such as Guest-1) to a meaningful name (such as Manager Wang). Renaming is a global operation; all sentences that use that speaker ID are updated simultaneously.

Method 1: WebSocket (Realtime Mode)

Use this method to rename a speaker while a recording is in progress.

{
  "type": "voice-translation",
  "data": {
    "action": "rename_speaker",
    "speaker_id": "Guest-1",
    "new_label": "Manager Wang"
  }
}

Successful response:

{
  "type": "voice-translation",
  "data": {
    "action": "speaker_renamed",
    "speaker_id": "Guest-1",
    "new_label": "Manager Wang",
    "affected_sids": [1, 3, 5, 8]
  }
}

affected_sids lists all affected sentence numbers, so the frontend can update the UI based on this information.

Method 2: REST API (Offline Mode)

Use this method to edit a recording after it has ended.

curl -X PATCH "https://vas-poc.vurbo.ai/api/v1/tasks/{taskId}/speakers/rename" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "speaker_id": "Guest-1",
    "new_label": "Manager Wang"
  }'

Successful response (HTTP 200):

{
  "data": {
    "speaker_id": "Guest-1",
    "new_label": "Manager Wang",
    "affected_sids": [1, 3, 5, 8, 12]
  }
}

Renaming Restrictions

  • speaker_id must be either an original speaker ID present in the recording or that speaker's current display label; if neither matches, speaker_not_found is returned
  • new_label cannot be empty, has a maximum of 100 characters, and must not contain control characters (\x00-\x1F, \x7F) or newlines
  • The new label cannot duplicate the label of another existing speaker (a speaker_name_duplicate error is returned)
  • The REST API applies only to recordings in multi_speaker mode
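
The label rules above can be pre-checked in the client before a rename request is sent. The following is a sketch under stated assumptions: validateNewLabel is a hypothetical helper, the length is counted in Unicode code points, and duplicates are compared by exact match (the service may count or compare differently). Only speaker_name_empty and speaker_name_duplicate are documented error codes; the other two reason strings are local to this example.

```javascript
// Control characters \x00-\x1F and \x7F; this range already covers newlines.
const CONTROL_CHARS = /[\x00-\x1F\x7F]/;
const MAX_LABEL_LENGTH = 100;

// Hypothetical client-side pre-check for new_label.
// otherLabels: display labels of the other speakers in the recording.
function validateNewLabel(newLabel, otherLabels = []) {
  if (typeof newLabel !== "string" || newLabel.length === 0) {
    return { ok: false, reason: "speaker_name_empty" };
  }
  // Assumption: the limit is counted in code points, not UTF-16 units.
  if (Array.from(newLabel).length > MAX_LABEL_LENGTH) {
    return { ok: false, reason: "label_too_long" }; // local reason, not a service code
  }
  if (CONTROL_CHARS.test(newLabel)) {
    return { ok: false, reason: "label_control_character" }; // local reason
  }
  // Assumption: duplicates are detected by exact string match.
  if (otherLabels.includes(newLabel)) {
    return { ok: false, reason: "speaker_name_duplicate" };
  }
  return { ok: true, reason: null };
}

console.log(validateNewLabel("Manager Wang", ["Lisa Lee"]));
console.log(validateNewLabel("Lisa Lee", ["Lisa Lee"]));
```

The server remains the source of truth; a local check only spares the user an avoidable error response.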

Reassigning a Speaker

Change the speaker identity of a single sentence, assigning the sentence to another existing speaker. This is useful for correcting speaker recognition errors.

Method 1: WebSocket (Realtime Mode)

{
  "type": "voice-translation",
  "data": {
    "action": "reassign_speaker",
    "sid": 5,
    "target_speaker_id": "Guest-2"
  }
}

Successful response:

{
  "type": "voice-translation",
  "data": {
    "action": "speaker_reassigned",
    "sid": 5,
    "old_speaker_id": "Guest-1",
    "new_speaker_id": "Guest-2",
    "new_speaker_label": "Lisa Lee"
  }
}

Method 2: REST API (Offline Mode)

curl -X PATCH "https://vas-poc.vurbo.ai/api/v1/tasks/{taskId}/speakers/reassign" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "sid": 5,
    "target_speaker_id": "Guest-2"
  }'

Reassignment Restrictions

  • target_speaker_id must be the original ID of an existing speaker (creating a new speaker is not supported, and display labels are not accepted)
  • If the target speaker has been renamed, new_speaker_label in the response reflects the display label after applying speaker_aliases
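
Because target_speaker_id accepts only original IDs, a UI that shows display labels has to map the selected label back to its original ID before sending the action. The sketch below assumes the client keeps its own Map from original ID to current display label. That map and the names resolveOriginalId and buildReassignMessage are hypothetical.

```javascript
// aliases: Map of original speaker ID -> current display label, maintained by
// the client (hypothetical structure, not returned by the API in this form).
function resolveOriginalId(aliases, idOrLabel) {
  if (aliases.has(idOrLabel)) return idOrLabel; // already an original ID
  for (const [id, label] of aliases) {
    if (label === idOrLabel) return id; // map a display label back to its ID
  }
  return null;
}

function buildReassignMessage(aliases, sid, target) {
  const targetId = resolveOriginalId(aliases, target);
  if (targetId === null) {
    // Reassignment cannot create a new speaker, so fail before sending.
    throw new Error(`unknown speaker: ${target}`);
  }
  return {
    type: "voice-translation",
    data: { action: "reassign_speaker", sid, target_speaker_id: targetId },
  };
}

const aliases = new Map([
  ["Guest-1", "Manager Wang"],
  ["Guest-2", "Lisa Lee"],
]);
console.log(JSON.stringify(buildReassignMessage(aliases, 5, "Lisa Lee")));
```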

Merging Speakers

Merge all sentences of one speaker into another speaker. This is useful when the system mistakenly recognizes the same person's voice as multiple speakers.

Use Case

The speech recognition engine sometimes recognizes the same person's voice at different times as different speakers (for example, Guest-1 and Guest-3 are actually the same person). After merging:

  • All Guest-3 sentences are attributed to Guest-1
  • In WebSocket mode: future recognition results identified as Guest-3 are also automatically converted to Guest-1 (continuous interception)
  • In REST mode: historical recordings have no new sentences, so only the existing sentences are merged once
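
The continuous interception described above can be pictured as a mapping from source ID to target ID that is consulted for every new result. The sketch below is a conceptual model only, not the service's implementation. The names recordMerge and resolveSpeaker are hypothetical, and following chains of merges is an assumption that the documentation does not confirm.

```javascript
// Conceptual model of continuous interception: source ID -> target ID.
function recordMerge(mergeMap, sourceId, targetId) {
  if (sourceId === targetId) {
    throw new Error("merge_speakers_same_id");
  }
  mergeMap.set(sourceId, targetId);
}

// Returns the speaker a new result is attributed to after all recorded merges.
function resolveSpeaker(mergeMap, speakerId) {
  const seen = new Set(); // guards against accidental cycles
  let current = speakerId;
  while (mergeMap.has(current) && !seen.has(current)) {
    seen.add(current);
    current = mergeMap.get(current);
  }
  return current;
}

const mergeMap = new Map();
recordMerge(mergeMap, "Guest-3", "Guest-1");
console.log(resolveSpeaker(mergeMap, "Guest-3"));
console.log(resolveSpeaker(mergeMap, "Guest-2"));
```

In this model a later result tagged Guest-3 resolves to Guest-1, while speakers that were never merged pass through unchanged.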

Method 1: WebSocket (Realtime Mode)

{
  "type": "voice-translation",
  "data": {
    "action": "merge_speakers",
    "source_speaker_id": "Guest-3",
    "target_speaker_id": "Guest-1"
  }
}

Successful response:

{
  "type": "voice-translation",
  "data": {
    "action": "speakers_merged",
    "source_speaker_id": "Guest-3",
    "target_speaker_id": "Guest-1",
    "affected_sids": [3, 5, 7]
  }
}

Method 2: REST API (Offline Mode)

curl -X PATCH "https://vas-poc.vurbo.ai/api/v1/tasks/{taskId}/speakers/merge" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "source_speaker_id": "Guest-3",
    "target_speaker_id": "Guest-1"
  }'

Successful response (HTTP 200):

{
  "data": {
    "source_speaker_id": "Guest-3",
    "target_speaker_id": "Guest-1",
    "target_speaker_label": "Manager Wang",
    "affected_sids": [3, 5, 7]
  }
}

Merge vs. Reassign Comparison

Feature | Scope | Affects Future Recognition Results (WS)
--- | --- | ---
reassign_speaker | A single sentence (1 SID) | No
merge_speakers | All sentences of the speaker | Yes (future occurrences of the source are automatically converted to the target)

Merge Restrictions

  • source_speaker_id and target_speaker_id cannot be the same (a merge_speakers_same_id error is returned)
  • Both speaker IDs must exist in the recording
  • REST mode applies only to recordings with recognition_mode: multi_speaker

Realtime Mode vs. Offline Mode

Speaker management offers two usage modes. The following is a complete comparison:

Operation | Realtime Mode (WebSocket) | Offline Mode (REST API)
--- | --- | ---
Rename speaker | rename_speaker action | PATCH /api/v1/tasks/{taskId}/speakers/rename
Reassign | reassign_speaker action | PATCH /api/v1/tasks/{taskId}/speakers/reassign
Merge speakers | merge_speakers action | PATCH /api/v1/tasks/{taskId}/speakers/merge
When to use | While recording is in progress | After recording has ended
Broadcast sync | Automatically pushed to SSE viewers | Not applicable
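
The three offline endpoints share the same method, headers, and path pattern, so a client can build them with one helper. The following is a minimal sketch: buildSpeakerRequest is a hypothetical name, and the host is the one used in the curl examples in this guide.

```javascript
// Host taken from the curl examples in this guide.
const BASE_URL = "https://vas-poc.vurbo.ai";
const SPEAKER_OPERATIONS = ["rename", "reassign", "merge"];

// Hypothetical helper: returns a URL and fetch-compatible options for one of
// the three speaker management endpoints. It does not send the request.
function buildSpeakerRequest(taskId, operation, body, apiKey) {
  if (!SPEAKER_OPERATIONS.includes(operation)) {
    throw new Error(`unsupported operation: ${operation}`);
  }
  return {
    url: `${BASE_URL}/api/v1/tasks/${encodeURIComponent(taskId)}/speakers/${operation}`,
    options: {
      method: "PATCH",
      headers: {
        "X-API-Key": apiKey,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(body),
    },
  };
}

const request = buildSpeakerRequest(
  "TASK_ID",
  "merge",
  { source_speaker_id: "Guest-3", target_speaker_id: "Guest-1" },
  "YOUR_API_KEY"
);
console.log(request.url);
```

The returned pair can be passed to fetch(url, options) in a browser or in Node.js 18 and later.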

REST vs. WebSocket merge difference: Both merge existing sentences; however, the WebSocket version additionally creates a continuous mapping that "automatically converts future source IDs to the target." This does not apply to historical recordings (which have no new sentences).

Speaker Management in Broadcast Mode

In broadcast mode, speaker management operations are automatically synced to SSE viewers:

WebSocket Operation | SSE Event Received by Viewers
--- | ---
rename_speaker | speaker_renamed
reassign_speaker | speaker_reassigned
merge_speakers | speakers_merged

Viewers can update their UI in real time based on these events:

// eventSource is an EventSource that is already connected to the broadcast's SSE stream.
eventSource.addEventListener('speaker_renamed', (e) => {
  const data = JSON.parse(e.data);
  // Update the display labels of all affected_sids
  data.affected_sids.forEach(sid => {
    updateSpeakerLabel(sid, data.new_label);
  });
});

eventSource.addEventListener('speaker_reassigned', (e) => {
  const data = JSON.parse(e.data);
  // Update the speaker of a single sentence
  // (new_speaker_id is the original ID, new_speaker_label is the display label)
  updateSpeakerForSentence(data.sid, data.new_speaker_id, data.new_speaker_label);
});

eventSource.addEventListener('speakers_merged', (e) => {
  const data = JSON.parse(e.data);
  // Fall back to the target ID if the event carries no target_speaker_label
  // (the WebSocket response example in this guide omits that field)
  const label = data.target_speaker_label ?? data.target_speaker_id;
  // Update the display labels of all affected sentences
  data.affected_sids.forEach(sid => {
    updateSpeakerLabel(sid, label);
  });
});

Best Practices

1. Recognize First, Then Name

Let the system recognize the different speakers first (Guest-1, Guest-2, ...), and rename them only after confirming that recognition is stable.

2. Make Good Use of the Merge Feature

If the same person has been recognized as multiple speakers (for example, because they left and came back midway), merge_speakers is more efficient than correcting sentence by sentence with reassign_speaker. In realtime mode, the merge also applies to future recognition results.

3. Offline Editing for Correction

After a recording ends, perform final corrections on the transcript through the REST API to ensure that the speaker tags of all sentences are correct.

4. Error Handling

Error Code | Description | Suggested Action
--- | --- | ---
speaker_not_found | The specified speaker was not found | Confirm that the speaker ID exists
speaker_name_empty | The name cannot be empty | Provide a valid name
speaker_name_duplicate | The name is already in use | Use a different name
speaker_sid_not_found | The specified sentence was not found | Confirm that the SID exists
speaker_diarization_required | Only diarization recordings are supported | Confirm that multi_speaker mode is used
merge_speakers_same_id | Source and target are the same | Use different speaker IDs
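
In a client, the table above can be kept as a lookup so that each error code maps to a user-facing hint. This guide does not show the shape of an error message, so the sketch below takes only the error code string. The names SPEAKER_ERROR_ACTIONS and suggestedAction are hypothetical.

```javascript
// Error codes and suggested actions copied from the table above.
const SPEAKER_ERROR_ACTIONS = {
  speaker_not_found: "Confirm that the speaker ID exists",
  speaker_name_empty: "Provide a valid name",
  speaker_name_duplicate: "Use a different name",
  speaker_sid_not_found: "Confirm that the SID exists",
  speaker_diarization_required: "Confirm that multi_speaker mode is used",
  merge_speakers_same_id: "Use different speaker IDs",
};

// Returns the suggested action for a known code, or null for any other code.
function suggestedAction(code) {
  return Object.prototype.hasOwnProperty.call(SPEAKER_ERROR_ACTIONS, code)
    ? SPEAKER_ERROR_ACTIONS[code]
    : null;
}

console.log(suggestedAction("speaker_name_duplicate"));
console.log(suggestedAction("some_other_error"));
```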


Version: V1.5.7 Last Updated: 2026-05-20

Copyright © 2026