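User Guide

TTS

Overview

VAS provides TTS (Text-to-Speech) speech synthesis, which converts translated text into spoken audio. The system supports 154 languages with 325 voices in total (fully aligned with the speech service provider's Monolingual Neural Voice catalog).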

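Supported Languages

Language code	Language	Voices
zh-TW	Chinese (Traditional)	3
zh-CN	Chinese (Simplified)	4
en-US	English (United States)	6
en-GB	English (United Kingdom)	3
ja-JP	Japanese	4
ko-KR	Korean	4
fr-FR	French	3
de-DE	German	3
es-ES	Spanish	3
it-IT	Italian	3
pt-BR	Portuguese (Brazil)	3
th-TH	Thai	3
vi-VN	Vietnamese	2
id-ID	Indonesian	2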

The table above summarizes the popular locales (14 locales, 46 voices). For the full set of 154 locales and 325 voices, treat the results of GET /api/v1/tts/voices?language={code} as authoritative.

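Key Features

Multiple scenarios: real-time TTS (WebSocket), history TTS (SSE), and broadcast TTS
Word Boundary: each word carries a precise timestamp, enabling karaoke-style word-by-word highlighting
Sync/async modes: sync mode automatically plays the latest translation; async mode leaves playback under manual control
Multilingual broadcast TTS: in broadcast mode, a TTS voice can be configured separately for each translation language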

Authentication

All TTS-related REST APIs require API Key authentication. See the authentication documentation for details.

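Querying Available Voices

Before using TTS, query which voices are available for the target language.

Supported Languages

VAS currently supports TTS speech synthesis for 154 languages. For the complete list, see Appendix - Supported Languages.

Getting the Voice List for a Language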

GET https://vas-poc.vurbo.ai/api/v1/tts/voices?language={language}

Example: query English voices

curl -X GET "https://vas-poc.vurbo.ai/api/v1/tts/voices?language=en-US" \
  -H "X-API-Key: YOUR_API_KEY"

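Key response fields:

Field	Description
`voice_name`	Voice identifier, used in API calls
`display_name`	Display name of the voice
`gender`	Gender: `Female` / `Male`
`is_default`	Whether this is the default voice for the language
`sample_url`	URL of the sample audio

For the full parameters and response format, see the TTS REST API reference.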
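
As a rough sketch of how these fields might be used on the client, the helper below chooses a voice from a voice list: it narrows to a preferred gender when one is given, then takes the language default, and otherwise falls back to the first entry. It assumes the voice list is available as an array of objects carrying the fields in the table; the actual response envelope is defined in the TTS REST API reference. `pickVoice` is a hypothetical name and `sampleVoices` is illustrative data, not an actual API response.

```javascript
// Hypothetical helper: choose a voice from a list of voice objects.
// Assumes each object has the documented fields voice_name, gender, is_default.
function pickVoice(voices, preferredGender) {
  if (!Array.isArray(voices) || voices.length === 0) return null;
  const pool = preferredGender
    ? voices.filter((v) => v.gender === preferredGender)
    : voices;
  // If no voice matches the preferred gender, fall back to the full list.
  const candidates = pool.length > 0 ? pool : voices;
  // Prefer the language default; otherwise take the first candidate.
  const chosen = candidates.find((v) => v.is_default) ?? candidates[0];
  return chosen.voice_name;
}

// Illustrative data only; not an actual API response.
const sampleVoices = [
  { voice_name: 'en-US-JennyNeural', gender: 'Female', is_default: true },
  { voice_name: 'en-US-GuyNeural', gender: 'Male', is_default: false },
];
```

The returned `voice_name` is the value that would be passed as `tts_voice` or `voice` in the calls described later in this guide.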

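Voice Samples

After querying the voice list, use the sample URL to preview each voice.

GET https://vas-poc.vurbo.ai/api/v1/tts/voices/{voiceName}/sample

Key points:

The response is binary MP3 audio data (not JSON)
The first request synthesizes the sample on the fly and caches it; later requests are served from the cache
Not counted toward TTS charges
Rate limit: 30 requests per minute per user

Frontend preview example:

// Play directly with an Audio element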
const audio = new Audio(
  'https://vas-poc.vurbo.ai/api/v1/tts/voices/en-US-JennyNeural/sample'
);
audio.play();
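
Because the sample endpoint is limited to 30 requests per minute per user, a voice picker that previews on hover or click may want to throttle itself before the server does. The following is a minimal sliding-window sketch; `createRateLimiter` is a hypothetical helper, and how the server actually measures its window is not documented here, so the client-side window is only an approximation.

```javascript
// Sliding-window limiter: allows at most `limit` calls per `windowMs`.
// `now` is injectable so the logic can be exercised without real timers.
function createRateLimiter(limit = 30, windowMs = 60_000, now = () => Date.now()) {
  const stamps = [];
  return function tryAcquire() {
    const t = now();
    // Drop timestamps that have left the window.
    while (stamps.length > 0 && t - stamps[0] >= windowMs) stamps.shift();
    if (stamps.length >= limit) return false;
    stamps.push(t);
    return true;
  };
}
```

A caller would invoke the returned function before each sample request and skip or delay the request when it returns `false`.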

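Real-Time TTS (WebSocket)

Real-time TTS converts translation results to speech while speech recognition is in progress. Requests and audio are exchanged over WebSocket.

Enabling TTS

Add the TTS parameters to the start action: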

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "transcription_languages": ["zh-TW"],
    "translation_languages": ["en-US"],
    "realtime_translation": true,
    "type": "transcribe",
    "tts_enabled": true,
    "tts_language": "en-US",
    "tts_voice": "en-US-JennyNeural",
    "tts_mode": "sync"
  }
}

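Parameter	Description
`tts_enabled`	Set to `true` to enable TTS
`tts_language`	TTS output language (must be one of `translation_languages`)
`tts_voice`	TTS voice name (e.g. `en-US-JennyNeural`)
`tts_mode`	`sync` (synchronous, automatic playback) or `async` (asynchronous, manual control)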
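
The constraint that `tts_language` must appear in `translation_languages` is easy to violate when the two values come from different UI controls. The sketch below builds the start message and checks that constraint, along with the two allowed `tts_mode` values, before anything is sent. `buildStartMessage` and its option names are hypothetical; the message fields mirror the example above.

```javascript
// Hypothetical builder for the start action with TTS enabled.
function buildStartMessage({
  transcriptionLanguages,
  translationLanguages,
  ttsLanguage,
  ttsVoice,
  ttsMode = 'sync',
}) {
  if (!translationLanguages.includes(ttsLanguage)) {
    throw new Error(`tts_language ${ttsLanguage} is not in translation_languages`);
  }
  if (ttsMode !== 'sync' && ttsMode !== 'async') {
    throw new Error(`tts_mode must be "sync" or "async", got ${ttsMode}`);
  }
  return {
    type: 'voice-translation',
    data: {
      action: 'start',
      transcription_languages: transcriptionLanguages,
      translation_languages: translationLanguages,
      realtime_translation: true,
      type: 'transcribe',
      tts_enabled: true,
      tts_language: ttsLanguage,
      tts_voice: ttsVoice,
      tts_mode: ttsMode,
    },
  };
}
```

The result would be serialized with `JSON.stringify` and sent over the open WebSocket.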

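Sync Mode (sync)

The system automatically plays the latest translated sentence with is_final=true
If the previous sentence is still playing, later sentences wait in a queue
Suited to scenarios that need no manual control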
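
This section does not spell out whether the queueing happens on the server or in the client. If a client needs to serialize playback itself, a first-in-first-out queue is one way to do it. The sketch below is hypothetical: `createPlaybackQueue` is not part of the VAS API, and `play` is a caller-supplied function that returns a promise resolving when one item has finished playing.

```javascript
// FIFO playback queue. `play` is a caller-supplied async function that
// resolves when one item has finished playing.
function createPlaybackQueue(play) {
  const pending = [];
  let running = false;

  async function drain() {
    running = true;
    while (pending.length > 0) {
      const item = pending.shift();
      try {
        await play(item);
      } catch (err) {
        // A failed item is skipped so the rest of the queue still plays.
        console.error('playback failed, skipping item', err);
      }
    }
    running = false;
  }

  return {
    enqueue(item) {
      pending.push(item);
      if (!running) drain();
    },
    clear() {
      pending.length = 0; // drops queued items; the current one keeps playing
    },
    get size() {
      return pending.length;
    },
  };
}
```

Each incoming tts_ready payload would be passed to `enqueue`, and `clear` could be called when the user stops playback.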

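Async Mode (async)

The user can manually pick any translated sentence for TTS playback. Repeated requests for the same sid (replay) are supported.

Conversation mode (two-way translation) also supports tts_mode: "async". When it is set, tts_ready is not pushed automatically once translation completes; playback must be triggered manually with tts_play. In conversation mode, the language to synthesize is selected automatically from tts_config.

Play a specific sentence: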

{
  "type": "voice-translation",
  "data": {
    "action": "tts_play",
    "sid": 5
  }
}

Play multiple sentences (3 sentences starting from sid 5):

{
  "type": "voice-translation",
  "data": {
    "action": "tts_play",
    "sid": 5,
    "length": 3
  }
}

The maximum value of length is 20 (controlled by the backend setting TTS_SSE_MAX_LENGTH).
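
The single-sentence and multi-sentence forms of tts_play, together with the cap of 20, can be wrapped in one small builder. This is a sketch under assumptions: `buildTtsPlayMessage` is a hypothetical helper, and clamping an oversized `length` on the client is a design choice made here, not documented server behavior.

```javascript
const TTS_MAX_LENGTH = 20; // mirrors the documented maximum

// Hypothetical builder for the tts_play action.
function buildTtsPlayMessage(sid, length = 1) {
  if (!Number.isInteger(sid)) {
    throw new TypeError('sid must be an integer');
  }
  // Keep length within 1..TTS_MAX_LENGTH.
  const clamped = Math.min(Math.max(Math.trunc(length), 1), TTS_MAX_LENGTH);
  const data = { action: 'tts_play', sid };
  if (clamped > 1) data.length = clamped; // the single-sentence form omits length
  return { type: 'voice-translation', data };
}
```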

Stop playback:

{
  "type": "voice-translation",
  "data": {
    "action": "tts_stop"
  }
}

Receiving TTS Audio

When TTS synthesis completes, the server pushes a tts_ready event:

{
  "type": "voice-translation",
  "data": {
    "action": "tts_ready",
    "sid": 1,
    "language": "en-US",
    "transcript": "你好，很高興認識你",
    "text": "Hello, nice to meet you",
    "audio": "Base64EncodedMP3...",
    "format": "mp3",
    "duration_ms": 2500,
    "boundaries": [
      {"offset_ms": 0, "duration_ms": 350, "text_offset": 0, "word_length": 5, "text": "Hello"},
      {"offset_ms": 500, "duration_ms": 250, "text_offset": 7, "word_length": 4, "text": "nice"},
      {"offset_ms": 750, "duration_ms": 200, "text_offset": 12, "word_length": 2, "text": "to"},
      {"offset_ms": 950, "duration_ms": 350, "text_offset": 15, "word_length": 4, "text": "meet"},
      {"offset_ms": 1300, "duration_ms": 300, "text_offset": 20, "word_length": 3, "text": "you"}
    ]
  }
}

Frontend playback example:

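ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  if (msg.data?.action === 'tts_ready') {
    // boundaries and text are available for the karaoke effect described later
    const { audio, boundaries, text } = msg.data;

    // Convert the Base64 string to a Blob
    const byteChars = atob(audio);
    const byteArray = new Uint8Array(byteChars.length);
    for (let i = 0; i < byteChars.length; i++) {
      byteArray[i] = byteChars.charCodeAt(i);
    }
    const blob = new Blob([byteArray], { type: 'audio/mp3' });
    const objectUrl = URL.createObjectURL(blob);
    const audioEl = new Audio(objectUrl);
    // Release the object URL once playback has finished
    audioEl.addEventListener('ended', () => URL.revokeObjectURL(objectUrl), { once: true });
    audioEl.play();
  }
};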

Switching TTS Mode

The sync/async mode can be switched dynamically while recording is in progress:

{
  "type": "voice-translation",
  "data": {
    "action": "tts_mode",
    "tts_mode": "async"
  }
}

History TTS (SSE)

History TTS plays the translated speech of a completed recording, sending audio sentence by sentence over an SSE stream.

Request Format

GET https://vas-poc.vurbo.ai/api/v1/sse/tts/{taskId}?language={language}&sid={sid}&length={length}

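Parameter	Required	Description
`taskId`	Yes	Recording ID
`language`	Yes	TTS output language (e.g. `en-US`)
`voice`	No	Voice name to use (e.g. `en-US-JennyNeural`)
`sid`	No	Starting sentence ID (default 1)
`length`	No	Number of sentences to return (default 1, maximum 20)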

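Event Sequence

connected  ->  tts_audio (repeated N times)  ->  tts_done

connected: connection acknowledgment, including voice information
tts_audio: TTS audio sent sentence by sentence (with Word Boundary data)
tts_done: all sentences have been sent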
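
A client can guard against out-of-order or truncated streams by tracking this sequence as a tiny state machine. The sketch below is one way to do that; `createSequenceChecker` is a hypothetical helper, and any error events the server might emit on this stream are not modelled because they are not described here.

```javascript
// Tracks the documented order: connected -> tts_audio (N times) -> tts_done.
function createSequenceChecker() {
  let state = 'idle'; // idle -> streaming -> done
  return function accept(eventType) {
    if (state === 'idle' && eventType === 'connected') {
      state = 'streaming';
    } else if (state === 'streaming' && eventType === 'tts_audio') {
      // Any number of audio events may arrive while streaming.
    } else if (state === 'streaming' && eventType === 'tts_done') {
      state = 'done';
    } else {
      throw new Error(`unexpected event "${eventType}" in state "${state}"`);
    }
    return state;
  };
}
```

If the stream closes while the checker is still in the `streaming` state, the client can treat the playback as incomplete.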

Multi-Sentence Playback Example

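// parseSSE and base64ToBlob are helpers the application must provide;
// setupKaraoke is defined in the Word Boundary section of this guide.
async function playTTS(taskId, language, apiKey, startSid = 1, length = 3) {
  const url = new URL(`https://vas-poc.vurbo.ai/api/v1/sse/tts/${taskId}`);
  url.searchParams.set('language', language);
  url.searchParams.set('sid', startSid);
  url.searchParams.set('length', length);

  const response = await fetch(url, {
    headers: { 'X-API-Key': apiKey }
  });
  if (!response.ok) {
    throw new Error(`TTS request failed with HTTP ${response.status}`);
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // A network chunk can end mid-event, so buffer the text and parse only
    // up to the last blank line (assumes LF line endings between events)
    buffer += decoder.decode(value, { stream: true });
    const cut = buffer.lastIndexOf('\n\n');
    if (cut === -1) continue;
    const events = parseSSE(buffer.slice(0, cut));
    buffer = buffer.slice(cut + 2);

    for (const event of events) {
      if (event.type === 'tts_audio') {
        // Play the audio and set up the karaoke effect
        const blob = base64ToBlob(event.data.audio, 'audio/mp3');
        const audio = new Audio(URL.createObjectURL(blob));
        setupKaraoke(audio, event.data.boundaries, event.data.text);
        // Wait for this sentence to finish so sentences do not overlap
        await new Promise((resolve) => {
          audio.addEventListener('ended', resolve, { once: true });
          audio.addEventListener('error', resolve, { once: true });
          audio.play().catch(resolve);
        });
      }
    }
  }
}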

Note: the browser's native EventSource does not support custom headers, so use the fetch API with a ReadableStream instead.
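
The playback example calls `parseSSE` and `base64ToBlob` without defining them. The following is a minimal sketch of what such helpers could look like. It assumes the server names events through the SSE `event:` field and sends a JSON object in `data:`; neither detail is confirmed in this guide, and `base64ToBytes` is an extra hypothetical helper introduced only to keep the decoding step separate.

```javascript
// Parse a block of complete SSE events (separated by blank lines) into
// { type, data } objects. Assumes each event's data payload is JSON.
function parseSSE(text) {
  const events = [];
  for (const block of text.split(/\r?\n\r?\n/)) {
    let type = 'message'; // SSE default when no event: field is present
    const dataLines = [];
    for (const line of block.split(/\r?\n/)) {
      if (line.startsWith('event:')) type = line.slice(6).trim();
      else if (line.startsWith('data:')) dataLines.push(line.slice(5).replace(/^ /, ''));
      // Comment lines (starting with ":") and other fields are ignored.
    }
    if (dataLines.length === 0) continue;
    events.push({ type, data: JSON.parse(dataLines.join('\n')) });
  }
  return events;
}

// Decode a Base64 string into raw bytes.
function base64ToBytes(base64) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes;
}

// Wrap decoded bytes in a Blob so they can be played through an object URL.
function base64ToBlob(base64, mimeType) {
  return new Blob([base64ToBytes(base64)], { type: mimeType });
}
```

`parseSSE` expects whole events, so callers reading from a stream should buffer partial chunks until a blank line arrives before passing the text in.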

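Broadcast TTS

In broadcast mode, TTS lets the audience receive translated speech. TTS audio is pushed to the audience over SSE.

Speaker-Side Configuration

When creating a broadcast (REST API) or starting the WebSocket session, use tts_config to specify which languages have TTS enabled:

Configure when creating the broadcast (REST API):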

curl -X POST "https://vas-poc.vurbo.ai/api/v1/broadcasts" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "transcription_language": "zh-TW",
    "translation_languages": ["en-US", "ja-JP"],
    "tts_config": {
      "en-US": {"voice": "en-US-JennyNeural", "speaking_rate": 1.0},
      "ja-JP": {"voice": "ja-JP-NanamiNeural", "speaking_rate": 1.0}
    }
  }'

Configure at WebSocket start:

{
  "type": "voice-translation",
  "data": {
    "action": "start",
    "type": "broadcast",
    "broadcast_token": "YOUR_BROADCAST_TOKEN",
    "audio_format": "pcm",
    "tts_config": {
      "en-US": {
        "voice": "en-US-JennyNeural",
        "speaking_rate": 1.0
      },
      "ja-JP": {
        "voice": "ja-JP-NanamiNeural",
        "speaking_rate": 1.0
      }
    }
  }
}

tts_config Parameters

Field	Type	Description
`voice`	string	TTS voice name
`speaking_rate`	number	Speaking rate (0.5 to 2.0, default 1.0)
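
Checking a tts_config object against these two fields before sending it avoids a round trip for obviously invalid values. The sketch below fills in the documented default rate and rejects rates outside the documented range. `normalizeTtsConfig` is a hypothetical helper, and how the server itself reacts to out-of-range values is not covered in this guide.

```javascript
const MIN_RATE = 0.5;
const MAX_RATE = 2.0;
const DEFAULT_RATE = 1.0;

// Hypothetical validator: returns a copy of tts_config with defaults applied.
function normalizeTtsConfig(ttsConfig) {
  const result = {};
  for (const [language, entry] of Object.entries(ttsConfig)) {
    if (typeof entry.voice !== 'string' || entry.voice === '') {
      throw new Error(`tts_config["${language}"].voice must be a non-empty string`);
    }
    const rate = entry.speaking_rate ?? DEFAULT_RATE;
    if (!Number.isFinite(rate) || rate < MIN_RATE || rate > MAX_RATE) {
      throw new RangeError(
        `speaking_rate for ${language} must be between ${MIN_RATE} and ${MAX_RATE}`
      );
    }
    result[language] = { voice: entry.voice, speaking_rate: rate };
  }
  return result;
}
```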

Receiving TTS on the Audience Side

Audience clients add the tts=true parameter when connecting to SSE:

const eventSource = new EventSource(
  'https://vas-poc.vurbo.ai/broadcast/{token}/text?lang=en-US&tts=true'
);

eventSource.addEventListener('tts_ready', (e) => {
  const data = JSON.parse(e.data);
  // data.audio is the Base64-encoded MP3
  // data.boundaries is the Word Boundary array
  const blob = base64ToBlob(data.audio, 'audio/mp3');
  const audio = new Audio(URL.createObjectURL(blob));
  audio.play();
});

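Important Notes

The TTS language must be one of translation_languages; invalid languages are silently ignored
The speaker (WebSocket) does not receive TTS audio; only SSE audience clients receive tts_ready events
TTS is sent only during the live phase, not during the standby (preparation) phase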
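
Because languages outside translation_languages are ignored without an error, a misconfigured entry can go unnoticed until the audience hears nothing. A client-side check such as the sketch below can surface those entries before the request is sent. `splitTtsConfig` is a hypothetical helper that only mirrors the rule stated above; it does not query the server.

```javascript
// Hypothetical helper: separate tts_config entries the server would use
// from those it would ignore, based on translation_languages.
function splitTtsConfig(ttsConfig, translationLanguages) {
  const allowed = new Set(translationLanguages);
  const kept = {};
  const ignored = [];
  for (const [language, entry] of Object.entries(ttsConfig)) {
    if (allowed.has(language)) kept[language] = entry;
    else ignored.push(language);
  }
  return { kept, ignored };
}
```

A non-empty `ignored` list is a signal to warn the speaker or to correct the configuration.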

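Word Boundary Karaoke Effect

The TTS response contains a boundaries array that records the precise time position of each word in the audio. This information can be used to implement karaoke-style word-by-word highlighting.

Word Boundary Data Structure

Field	Type	Description
`offset_ms`	int	Start time of the word in the audio (milliseconds)
`duration_ms`	int	Duration of the word (milliseconds)
`text_offset`	int	Start position in the text string (character index)
`word_length`	int	Length of the word (in characters)
`text`	string	The word itself

Sample Data

Using "Hello, nice to meet you" as an example: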

[
  {"offset_ms": 0,    "duration_ms": 350, "text_offset": 0,  "word_length": 5, "text": "Hello"},
  {"offset_ms": 350,  "duration_ms": 100, "text_offset": 5,  "word_length": 1, "text": ","},
  {"offset_ms": 500,  "duration_ms": 250, "text_offset": 7,  "word_length": 4, "text": "nice"},
  {"offset_ms": 750,  "duration_ms": 200, "text_offset": 12, "word_length": 2, "text": "to"},
  {"offset_ms": 950,  "duration_ms": 350, "text_offset": 15, "word_length": 4, "text": "meet"},
  {"offset_ms": 1300, "duration_ms": 300, "text_offset": 20, "word_length": 3, "text": "you"}
]
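
The relationship between `text_offset`, `word_length`, and `text` can be checked mechanically: slicing the sentence at each offset should reproduce the word. The sketch below does that for the sample data. `findBoundaryMismatches` is a hypothetical helper, and it assumes the offsets count UTF-16 code units the way JavaScript string indexes do, which this guide does not state explicitly.

```javascript
// Returns the boundaries whose text does not match the slice of `text`
// at [text_offset, text_offset + word_length).
function findBoundaryMismatches(text, boundaries) {
  const mismatches = [];
  boundaries.forEach((b, index) => {
    const slice = text.substring(b.text_offset, b.text_offset + b.word_length);
    if (slice !== b.text) mismatches.push({ index, expected: b.text, actual: slice });
  });
  return mismatches;
}

const sampleText = 'Hello, nice to meet you';
const sampleBoundaries = [
  { offset_ms: 0, duration_ms: 350, text_offset: 0, word_length: 5, text: 'Hello' },
  { offset_ms: 350, duration_ms: 100, text_offset: 5, word_length: 1, text: ',' },
  { offset_ms: 500, duration_ms: 250, text_offset: 7, word_length: 4, text: 'nice' },
  { offset_ms: 750, duration_ms: 200, text_offset: 12, word_length: 2, text: 'to' },
  { offset_ms: 950, duration_ms: 350, text_offset: 15, word_length: 4, text: 'meet' },
  { offset_ms: 1300, duration_ms: 300, text_offset: 20, word_length: 3, text: 'you' },
];
```

A check like this is useful in tests or during development when highlighting looks shifted for a particular language.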

Implementing the Karaoke Effect

function setupKaraoke(audioElement, boundaries, text) {
  const updateHighlight = () => {
    const currentTimeMs = audioElement.currentTime * 1000;

    // Find the word currently being spoken: its start time has passed
    // and the next word has not started yet
    const currentWord = boundaries.find((b, i) => {
      const nextOffset = boundaries[i + 1]?.offset_ms ?? Infinity;
      return currentTimeMs >= b.offset_ms && currentTimeMs < nextOffset;
    });

    if (currentWord) {
      // Highlight the current word
      highlightWord(text, currentWord.text_offset, currentWord.word_length);
    }
  };

  // Refresh the highlight position every 50 ms
  const interval = setInterval(updateHighlight, 50);
  audioElement.addEventListener('ended', () => clearInterval(interval));
}

function highlightWord(text, offset, length) {
  const before = text.substring(0, offset);
  const word = text.substring(offset, offset + length);
  const after = text.substring(offset + length);

  // Update the DOM (adapt to your UI framework). Building nodes instead of
  // assigning innerHTML keeps any markup in the text from being parsed as HTML.
  const span = document.createElement('span');
  span.className = 'highlight';
  span.textContent = word;
  document.getElementById('tts-text').replaceChildren(before, span, after);
}
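
The lookup inside `updateHighlight` scans the array linearly on every tick. For long sentences, a binary search over `offset_ms` gives the same answer with fewer comparisons. The sketch below is one possible variant; `findActiveIndex` is a hypothetical helper, and it assumes the boundaries arrive sorted by `offset_ms`, as they are in the sample data.

```javascript
// Returns the index of the last boundary whose offset_ms <= timeMs,
// or -1 when playback has not reached the first word yet.
function findActiveIndex(boundaries, timeMs) {
  let lo = 0;
  let hi = boundaries.length - 1;
  let found = -1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (boundaries[mid].offset_ms <= timeMs) {
      found = mid; // candidate; keep looking to the right
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return found;
}
```

Remembering the last returned index also lets the caller skip the DOM update when the active word has not changed between ticks.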

CSS Reference Styles

.highlight {
  background-color: #FFD700;
  color: #000;
  padding: 2px 4px;
  border-radius: 3px;
  transition: background-color 0.1s ease;
}

TTS Settings Management

Switching TTS Mode

The sync/async mode can be switched at any time while recording is in progress:

{
  "type": "voice-translation",
  "data": {
    "action": "tts_mode",
    "tts_mode": "async"
  }
}

Success response:

{
  "type": "voice-translation",
  "data": {
    "action": "tts_mode_changed",
    "tts_mode": "async"
  }
}

Updating TTS Settings Dynamically in Broadcast Mode

While a broadcast is in progress, the TTS settings can be updated through the REST API:

curl -X PATCH "https://vas-poc.vurbo.ai/api/v1/broadcasts/{id}" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "tts_config": {
      "zh-TW": {"voice": "zh-TW-HsiaoChenNeural", "speaking_rate": 1.0},
      "ja-JP": {"voice": "ja-JP-NanamiNeural", "speaking_rate": 1.2}
    }
  }'

Clear the TTS settings (pass null):

curl -X PATCH "https://vas-poc.vurbo.ai/api/v1/broadcasts/{id}" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "tts_config": null
  }'

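TTS Error Handling

Error code	Description	Suggested handling
`tts_not_enabled`	TTS is not enabled	Enable TTS at start
`tts_segment_not_found`	The specified sentence was not found	Verify that the SID exists
`tts_translation_not_found`	No translation exists for that language	Verify that the translation exists
`translation_not_found`	Translation not found	Verify that translation has completed
`tts_synthesis_failed`	TTS synthesis failed	Retry later
`tts_quota_exceeded`	TTS usage limit reached	Retry later
`invalid_data`	Invalid mode	Use `sync` or `async`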
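
On the client, this table can be turned into a lookup that decides whether an error is worth retrying or points to a request that needs fixing. The sketch below takes the error code as a plain string; which field of the error message carries the code is not shown in this guide, so extracting it is left to the caller. `describeTtsError` is a hypothetical helper.

```javascript
// Suggested handling per error code, taken from the table above.
const TTS_ERROR_HINTS = new Map([
  ['tts_not_enabled', 'Enable TTS at start'],
  ['tts_segment_not_found', 'Verify that the SID exists'],
  ['tts_translation_not_found', 'Verify that the translation exists'],
  ['translation_not_found', 'Verify that translation has completed'],
  ['tts_synthesis_failed', 'Retry later'],
  ['tts_quota_exceeded', 'Retry later'],
  ['invalid_data', 'Use sync or async'],
]);

// Codes whose suggested handling is to retry later.
const RETRY_LATER = new Set(['tts_synthesis_failed', 'tts_quota_exceeded']);

function describeTtsError(code) {
  const known = TTS_ERROR_HINTS.has(code);
  return {
    code,
    known,
    retryLater: RETRY_LATER.has(code),
    hint: known ? TTS_ERROR_HINTS.get(code) : 'Unrecognized error code',
  };
}
```

A caller might schedule a delayed retry when `retryLater` is true and surface `hint` to the developer or user otherwise.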
