File Import
Table of Contents
- Overview
- Supported Audio Formats
- Budget Check
- Upload Audio
- Parameter Reference
- Querying Import Status
- After Import Completes
- Text Processing Parameters
- Webhook Notifications
- Complete Flow Diagram
- Related Documents
Overview
The audio import feature lets you upload pre-recorded audio files for the system to process in the background, including speech recognition, translation, and summarization. Unlike real-time speech translation (WebSocket), audio import uses the REST API and is well suited for offline batch processing scenarios.
End-to-End Flow
Budget check → Upload audio → Track processing progress → Retrieve results
| Step | API | Description |
|---|---|---|
| 1. Budget check | POST /api/v1/imports/check-quota | Confirm whether the remaining budget is sufficient |
| 2. Upload audio | POST /api/v1/imports | Upload as multipart/form-data |
| 3. Query status | GET /api/v1/imports/{importId} | Poll the processing progress |
| 3b. Real-time progress | GET /api/v1/sse/imports/{importId}/progress | SSE real-time progress push (an alternative to polling) |
| 4. View results | Tasks API / SSE API | Retrieve the transcript, translation, and summary |
Authentication
All audio import APIs are authenticated via the X-API-Key header. See Authentication for details.
Supported Audio Formats
| Format | MIME Type | Description |
|---|---|---|
| MP3 | audio/mpeg | The most common compressed format |
| WAV | audio/wav | Lossless format; larger file size |
| M4A | audio/mp4 | A format commonly used by Apple |
File limits:
| Item | Limit |
|---|---|
| Maximum file size | 500 MB |
| Maximum duration | 10 hours |
| Minimum duration | 1 second |
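The limits above can be checked on the client before any bytes are uploaded. The following is a minimal sketch, not part of the API: validateAudioFile and its input shape are hypothetical, the format is judged by file extension only, and 1 MB is assumed to mean 1024 × 1024 bytes, which the limits table does not specify.

```javascript
// Limits copied from the tables above.
const MAX_FILE_BYTES = 500 * 1024 * 1024; // 500 MB (assumption: 1 MB = 1024 * 1024 bytes)
const MIN_DURATION_MS = 1000; // 1 second
const MAX_DURATION_MS = 10 * 60 * 60 * 1000; // 10 hours
const MIME_BY_EXTENSION = new Map([
  ['mp3', 'audio/mpeg'],
  ['wav', 'audio/wav'],
  ['m4a', 'audio/mp4'],
]);

// Hypothetical helper: returns a list of problems (an empty list means the file looks uploadable).
function validateAudioFile({ name, sizeBytes, durationMs }) {
  const problems = [];
  const extension = name.includes('.') ? name.split('.').pop().toLowerCase() : '';
  if (!MIME_BY_EXTENSION.has(extension)) {
    problems.push(`Unsupported format "${extension}"; use mp3, wav, or m4a`);
  }
  if (sizeBytes > MAX_FILE_BYTES) {
    problems.push('File exceeds 500 MB');
  }
  // The duration is optional because it may not be known before decoding the file.
  if (durationMs !== undefined) {
    if (durationMs < MIN_DURATION_MS) problems.push('Audio is shorter than 1 second');
    if (durationMs > MAX_DURATION_MS) problems.push('Audio is longer than 10 hours');
  }
  return problems;
}
```

A check like this only saves a round trip; the server remains the authority and may still reject the file with one of the upload errors.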
Budget Check
Before uploading, we recommend checking whether your monthly budget is sufficient, so you do not discover that the budget is insufficient only after uploading a large file.
Request
curl -X POST "https://vas-poc.vurbo.ai/api/v1/imports/check-quota" \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"duration_ms": 1500000}'
| Parameter | Type | Required | Description |
|---|---|---|---|
| duration_ms | integer | Yes | Estimated audio duration (milliseconds); range 1,000 to 36,000,000 |
Response
{
  "data": {
    "allowed": true,
    "remaining_budget": 48.48,
    "is_unlimited": false,
    "duration_minutes": 25,
    "estimated_cost": 0.4236,
    "remaining_minutes": 2864
  }
}
| Field | Type | Description |
|---|---|---|
| allowed | boolean | true indicates the budget is sufficient and you may upload |
| remaining_budget | float or null | Remaining monthly budget (USD); null when there is no budget limit |
| is_unlimited | boolean | Whether there is no budget limit |
| duration_minutes | integer | Estimated audio duration (minutes, rounded up) |
| estimated_cost | float | Estimated STT processing cost (USD) |
| remaining_minutes | integer or null | Equivalent STT minutes available within the remaining budget; null when there is no budget limit |
If allowed is false, we recommend prompting the user to wait for next month's budget reset or to adjust the budget. You can use duration_minutes and remaining_minutes to display a prompt such as "X minutes will be deducted, Y minutes remaining."
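The budget check and the prompt described above can be sketched in JavaScript as follows. The function names checkQuota and describeQuota are hypothetical, as is the wording of the messages; only the endpoint, header, and response fields come from this guide.

```javascript
// Hypothetical helper: calls the check-quota endpoint and returns the `data` object.
async function checkQuota(apiKey, durationMs) {
  const response = await fetch('https://vas-poc.vurbo.ai/api/v1/imports/check-quota', {
    method: 'POST',
    headers: { 'X-API-Key': apiKey, 'Content-Type': 'application/json' },
    body: JSON.stringify({ duration_ms: durationMs }),
  });
  if (!response.ok) {
    throw new Error(`check-quota failed: HTTP ${response.status}`);
  }
  return (await response.json()).data;
}

// Hypothetical helper: turns the `data` object into a decision plus a message for the user.
function describeQuota(data) {
  if (data.is_unlimited) {
    // remaining_budget and remaining_minutes are null in this case, so do not print them.
    return { canUpload: data.allowed, message: 'No budget limit applies.' };
  }
  if (data.allowed) {
    return {
      canUpload: true,
      message: `${data.duration_minutes} minutes will be deducted, ${data.remaining_minutes} minutes remaining.`,
    };
  }
  return {
    canUpload: false,
    message:
      `This file needs ${data.duration_minutes} minutes but only ${data.remaining_minutes} remain. ` +
      "Wait for next month's budget reset or adjust the budget.",
  };
}
```

Keeping the message logic separate from the network call makes it easy to reuse the same prompt for results obtained elsewhere, for example from a cached check.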
Upload Audio
Upload the audio file and processing parameters using the multipart/form-data format.
Basic Request
curl -X POST "https://vas-poc.vurbo.ai/api/v1/imports" \
-H "X-API-Key: YOUR_API_KEY" \
-F "file=@meeting.mp3" \
-F 'transcription_languages=["zh-TW"]' \
-F 'translation_languages=["en-US"]' \
-F "recognition_mode=multi_speaker"
Request With Summary and Text Processing
curl -X POST "https://vas-poc.vurbo.ai/api/v1/imports" \
-H "X-API-Key: YOUR_API_KEY" \
-F "file=@meeting.mp3" \
-F 'transcription_languages=["zh-TW"]' \
-F 'translation_languages=["en-US"]' \
-F "recognition_mode=multi_speaker" \
-F "summary_template=meeting" \
-F 'terminology={"zh-TW": [{"term": "語者分離", "boost": 1.5}]}' \
-F 'translation_dict=[{"source": "語者分離", "translations": {"en-US": "Speaker Diarization"}}]'
Success Response (HTTP 202)
{
  "data": {
    "import_id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "pending",
    "stage": null,
    "progress": 0,
    "message": null,
    "original_filename": "meeting.mp3",
    "file_size": "15.2 MB",
    "task_id": null,
    "created_at": "2026-01-15T10:00:00.000Z"
  }
}
Note: The response code is 202 Accepted, which means the server has accepted the upload but processing is not yet complete. Save the import_id for querying progress later.
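The same upload can be made from JavaScript with FormData. This is a sketch under stated assumptions: buildImportForm, uploadAudio, and the camelCase option names are hypothetical, and the global fetch, FormData, and Blob of a browser or Node.js 18+ are assumed. Array parameters are serialized to JSON strings, matching the curl examples.

```javascript
// Hypothetical helper: builds the multipart form for POST /api/v1/imports.
function buildImportForm(fileBlob, filename, options) {
  const form = new FormData();
  form.append('file', fileBlob, filename);
  // Array parameters are sent as JSON strings, as in the curl examples.
  form.append('transcription_languages', JSON.stringify(options.transcriptionLanguages));
  form.append('recognition_mode', options.recognitionMode);
  if (options.translationLanguages) {
    form.append('translation_languages', JSON.stringify(options.translationLanguages));
  }
  if (options.summaryTemplate) {
    form.append('summary_template', options.summaryTemplate);
  }
  return form;
}

// Hypothetical helper: uploads the form and returns the import_id from the 202 response.
async function uploadAudio(apiKey, form) {
  const response = await fetch('https://vas-poc.vurbo.ai/api/v1/imports', {
    method: 'POST',
    // Do not set Content-Type here; fetch adds the multipart boundary itself.
    headers: { 'X-API-Key': apiKey },
    body: form,
  });
  if (response.status !== 202) {
    throw new Error(`Upload was not accepted: HTTP ${response.status}`);
  }
  const { data } = await response.json();
  return data.import_id;
}
```

In a browser, fileBlob would typically be the File object from an input element, for example `buildImportForm(input.files[0], input.files[0].name, { transcriptionLanguages: ['zh-TW'], recognitionMode: 'multi_speaker' })`.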
Common Errors
| Error Code | HTTP Status | Description | How to Handle |
|---|---|---|---|
| import_file_too_large | 413 | File exceeds 500 MB | Compress or split the file |
| import_invalid_format | 415 | Unsupported audio format | Use mp3/wav/m4a |
| auth_budget_exceeded | 402 | Monthly budget exceeded | Wait for next month's budget reset or adjust the budget |
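One way to surface these errors is a small lookup keyed by HTTP status. This section does not show the shape of the error response body, so the sketch below relies on the status code alone; adviseOnUploadError and the returned object shape are hypothetical.

```javascript
// Error codes and handling advice copied from the Common Errors table.
const UPLOAD_ERRORS = new Map([
  [413, { code: 'import_file_too_large', advice: 'Compress or split the file (limit: 500 MB).' }],
  [415, { code: 'import_invalid_format', advice: 'Use mp3, wav, or m4a.' }],
  [402, { code: 'auth_budget_exceeded', advice: "Wait for next month's budget reset or adjust the budget." }],
]);

// Hypothetical helper: maps the HTTP status of a failed upload to handling advice.
function adviseOnUploadError(httpStatus) {
  return (
    UPLOAD_ERRORS.get(httpStatus) ?? {
      code: 'unknown',
      advice: `Unexpected HTTP ${httpStatus}; see the Imports API Reference.`,
    }
  );
}
```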
Parameter Reference
Required Parameters
| Parameter | Type | Description |
|---|---|---|
| file | file | Audio file (multipart/form-data) |
| transcription_languages | string (JSON) | Transcription languages, as a JSON array (e.g., ["zh-TW"]) |
| recognition_mode | string | single (single speaker) or multi_speaker (multi-speaker diarization) |
Optional Parameters
| Parameter | Type | Description |
|---|---|---|
| translation_languages | string (JSON) | Target translation languages, as a JSON array (e.g., ["en-US", "ja-JP"]) |
| summary_template | string | Summary template identifier (e.g., meeting, interview, speech) |
| terminology | string (JSON) | Terminology list (improves recognition accuracy) |
| fuzzy_correction | string (JSON) | Fuzzy-correction rules (usually no manual configuration needed) |
| translation_dict | string (JSON) | Translation dictionary (ensures consistent translation of proper nouns) |
| callback_url | string | Webhook callback URL (notifies you when processing completes or fails) |
Recognition Modes
| Mode | Description | Use Case |
|---|---|---|
| single | Single-speaker recognition | Voice memos or speech recordings with a single speaker |
| multi_speaker | Multi-speaker diarization | Meeting recordings, interviews, multi-party conversations |
Summary Templates
The available summary templates can be queried via GET /api/v1/summary-templates:
| Template | Use Case |
|---|---|
| general | General summary |
| meeting | Meeting notes |
| meeting_minutes | Detailed meeting minutes |
| speech | Speech content |
| interview | Interview content |
| course | Course content |
Querying Import Status
After a successful upload, use the import_id to poll the processing progress.
Request
curl -X GET "https://vas-poc.vurbo.ai/api/v1/imports/{importId}" \
-H "X-API-Key: YOUR_API_KEY"
Response
{
  "data": {
    "import_id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "processing",
    "stage": "transcribing",
    "progress": 45,
    "message": "Recognizing speech...",
    "task_id": null,
    "error_code": null,
    "error_message": null
  }
}
Status Transitions
pending → processing → completed
                    └→ failed
| Status | Description |
|---|---|
| pending | Queued and waiting to be processed |
| processing | Currently processing |
| completed | Processing complete (task_id is populated) |
| failed | Processing failed (error_code and error_message are populated) |
Processing Stages
In the processing status, the stage field indicates the current processing stage:
| Stage | Description | Approximate Progress |
|---|---|---|
| converting | Audio format conversion | 0% to 10% |
| transcribing | Speech recognition in progress | 10% to 60% |
| translating | Translation in progress | 60% to 85% |
| summarizing | Generating summary | 85% to 100% |
Polling Recommendations
async function pollImportStatus(importId, apiKey, intervalMs = 5000) {
  // Poll sequentially so a slow response never overlaps the next request.
  while (true) {
    const response = await fetch(
      `https://vas-poc.vurbo.ai/api/v1/imports/${importId}`,
      { headers: { 'X-API-Key': apiKey } }
    );
    if (!response.ok) {
      throw new Error(`Status query failed: HTTP ${response.status}`);
    }
    const result = await response.json();
    const { status, stage, progress, task_id } = result.data;
    console.log(`Status: ${status}, Stage: ${stage}, Progress: ${progress}%`);
    if (status === 'completed') {
      console.log(`Processing complete! Task ID: ${task_id}`);
      return task_id; // Use task_id to load the results...
    }
    if (status === 'failed') {
      throw new Error(`Processing failed: ${result.data.error_message}`);
    }
    // Wait before the next query (every 5 seconds by default)
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
Recommendation: A polling interval of 3 to 5 seconds is sufficient. Polling too frequently will not speed up processing.
Behavior When Audio Cannot Be Recognized (v1.3.5)
If speech recognition returns an empty result (for example, because the audio is completely silent, the volume is too low, the recording is noise throughout, or the recognition language does not match the language actually spoken), the import still ends in the completed status (not failed), but the transcript is an empty array.
Behavior Definition
| Item | Value |
|---|---|
| Final status | completed (not failed) |
| Final SSE event | completed (task_id is populated) |
| Webhook events | recording.completed + import.completed |
| Transcript entries | [] (empty array) |
| segments_count | 0 |
| Budget deduction | Deducted based on the actual audio duration; not refunded |
Why Isn't It failed?
failed means the processing flow itself encountered an error (such as an invalid format, exceeded budget, or audio parsing failure). When the audio processing flow runs through to completion but simply does not recognize any content, that is a valid completed status. This lets the client handle the result through the same success branch and use entries.length === 0 to determine when to display a "no speech content" prompt.
Client Handling Recommendations
When loading the transcript (GET /api/v1/sse/history/transcribe/{taskId}), if the accumulated sentence count is 0, we recommend displaying an empty state:
const sentences = [];
// ... process SSE events and collect init_sentence
if (sentences.length === 0) {
  // Display empty state
  showEmptyState({
    title: 'No speech content was recognized in this audio file',
    hint: 'Possible reasons: volume too low, silence throughout, or a recognition language that does not match the audio. We recommend verifying the audio quality or adjusting the recognition language before uploading again.',
  });
} else {
  renderTranscript(sentences);
}
How to Prevent It
- Check the recognition language setting: Confirm that transcription_languages matches the actual language of the audio (for example, select en-US for English audio, not zh-TW)
- Check the audio quality: Confirm the audio has clear speech and sufficient volume (a peak of -12 dBFS or higher is recommended)
- Multilingual audio: If the audio contains multiple languages, we recommend splitting it and uploading each part separately
Tip: The budget is deducted based on the audio duration, and unrecognizable audio is no exception. We recommend listening to the audio before uploading to confirm.
After Import Completes
When the status changes to completed, the task_id in the response is the Task ID corresponding to that import task. With this ID, you can:
1. View the Task List
curl -X GET "https://vas-poc.vurbo.ai/api/v1/tasks" \
-H "X-API-Key: YOUR_API_KEY"
2. Load the Transcript (SSE Stream)
const response = await fetch(
  `https://vas-poc.vurbo.ai/api/v1/sse/history/transcribe/${taskId}`,
  { headers: { 'X-API-Key': apiKey } }
);
// Process SSE events: init_metadata → init_sentence × N → init_summary → init_done
3. Play the Audio
const response = await fetch(
  `https://vas-poc.vurbo.ai/api/v1/sse/audio/${taskId}`,
  { headers: { 'X-API-Key': apiKey } }
);
const blob = await response.blob();
const audio = new Audio(URL.createObjectURL(blob));
audio.play();
4. Retranslate Into Another Language
const response = await fetch(
  `https://vas-poc.vurbo.ai/api/v1/sse/history/transcribe/${taskId}`.replace('history/transcribe', 'retranslate') + '?targetLang=ja-JP',
  { headers: { 'X-API-Key': apiKey } }
);
// Process SSE events: translation × N → done
For full history record operations, see the History and Playback Guide.
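The transcript stream in step 2 is fetched with an API key header, which the browser EventSource API cannot send, so the events have to be parsed by hand. The sketch below parses standard SSE framing (event: and data: lines, with events separated by blank lines). It assumes that each init_sentence event carries a JSON payload and that the server closes the stream after init_done; neither detail is specified in this guide, and parseSseEvents, collectTranscript, and loadTranscript are hypothetical names.

```javascript
// Parses a complete SSE response body into { event, data } records.
function parseSseEvents(text) {
  const events = [];
  for (const block of text.split(/\r?\n\r?\n/)) {
    let event = 'message'; // SSE default when no "event:" line is present
    const dataLines = [];
    for (const line of block.split(/\r?\n/)) {
      if (line.startsWith('event:')) {
        event = line.slice('event:'.length).trim();
      } else if (line.startsWith('data:')) {
        // Per the SSE format, a single leading space after the colon is dropped.
        dataLines.push(line.slice('data:'.length).replace(/^ /, ''));
      }
    }
    if (dataLines.length > 0) {
      events.push({ event, data: dataLines.join('\n') });
    }
  }
  return events;
}

// Collects sentences from the parsed events (assumption: JSON payloads).
function collectTranscript(events) {
  const result = { sentences: [], done: false };
  for (const { event, data } of events) {
    if (event === 'init_sentence') result.sentences.push(JSON.parse(data));
    else if (event === 'init_done') result.done = true;
  }
  return result;
}

// Hypothetical helper: loads the whole history stream, then parses it.
async function loadTranscript(taskId, apiKey) {
  const response = await fetch(
    `https://vas-poc.vurbo.ai/api/v1/sse/history/transcribe/${taskId}`,
    { headers: { 'X-API-Key': apiKey } }
  );
  if (!response.ok) throw new Error(`History request failed: HTTP ${response.status}`);
  return collectTranscript(parseSseEvents(await response.text()));
}
```

Reading the body with response.text() only resolves once the stream ends. If the connection stays open, read response.body incrementally instead and feed completed blocks to the same parser.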
Text Processing Parameters
When uploading, you can include text processing parameters to improve recognition and translation quality.
Terminology (terminology)
Provided as a JSON object keyed by language code:
{
  "zh-TW": [
    { "term": "語者分離", "boost": 1.5 },
    { "term": "CVD製程", "boost": 2.0 }
  ]
}
| Field | Required | Description |
|---|---|---|
| term | Yes | Terminology text (maximum 100 characters) |
| boost | No | Recognition weight (0.5 to 5.0, default 1.0) |
Up to 500 terms per language. The system automatically generates fuzzy-correction rules from the terminology list.
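These constraints can be enforced before the terminology object is serialized into the form field. The sketch below is a hypothetical client-side check, not part of the API. It counts characters as Unicode code points, which is an assumption about how the 100-character limit is measured.

```javascript
const MAX_TERMS_PER_LANGUAGE = 500;
const MAX_TERM_LENGTH = 100;
const MIN_BOOST = 0.5;
const MAX_BOOST = 5.0;

// Hypothetical helper: validates a terminology object and fills in the default boost of 1.0.
function normalizeTerminology(terminology) {
  const normalized = {};
  for (const [language, terms] of Object.entries(terminology)) {
    if (terms.length > MAX_TERMS_PER_LANGUAGE) {
      throw new RangeError(`${language}: ${terms.length} terms exceeds the limit of ${MAX_TERMS_PER_LANGUAGE}`);
    }
    normalized[language] = terms.map(({ term, boost = 1.0 }) => {
      // Spread into an array so a surrogate pair counts as one character.
      if (typeof term !== 'string' || term.length === 0 || [...term].length > MAX_TERM_LENGTH) {
        throw new RangeError(`${language}: term must be 1 to ${MAX_TERM_LENGTH} characters`);
      }
      if (typeof boost !== 'number' || boost < MIN_BOOST || boost > MAX_BOOST) {
        throw new RangeError(`${language}: boost for "${term}" must be between ${MIN_BOOST} and ${MAX_BOOST}`);
      }
      return { term, boost };
    });
  }
  return normalized;
}
```

The result can then be passed to the upload as `JSON.stringify(normalizeTerminology(terminology))`.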
Fuzzy Correction (fuzzy_correction)
Usually no manual configuration is needed; the system generates these automatically from terminology. If you need to customize them:
{
  "zh-TW": [
    { "correct": "語者分離", "incorrect": ["語這分離", "語者分力"] }
  ]
}
Translation Dictionary (translation_dict)
Ensures consistent translation of proper nouns:
[
  {
    "source": "語者分離",
    "translations": {
      "en-US": "Speaker Diarization",
      "ja-JP": "話者分離"
    }
  }
]
We recommend no more than 50 entries.
Webhook Notifications
Once you set callback_url, VAS proactively sends an HTTP POST notification to your server when processing completes or fails, removing the need to poll.
How to Configure
Add the callback_url parameter when uploading:
curl -X POST "https://vas-poc.vurbo.ai/api/v1/imports" \
-H "X-API-Key: YOUR_API_KEY" \
-F "file=@meeting.mp3" \
-F 'transcription_languages=["zh-TW"]' \
-F "recognition_mode=multi_speaker" \
-F "callback_url=https://your-server.com/webhooks/vas"
Events You Receive
| Result | Event | Description |
|---|---|---|
| Success | recording.completed + import.completed | You receive two events |
| Failure | import.failed | The import stage failed |
For the complete Webhook format, signature verification, and sample code, see the Webhook Callback Guide.
Complete Flow Diagram
┌────────────────────┐
│ check-quota        │  Check whether the budget is sufficient
│ POST /imports/     │
│      check-quota   │
└─────────┬──────────┘
          │
    allowed: true?
      ╱         ╲
    Yes          No → Prompt that the budget is insufficient
     │
┌────▼───────────────┐
│ POST /imports      │  Upload the audio
│ multipart/form-data│  (transcription_languages,
│                    │   translation_languages,
│                    │   recognition_mode, ...)
└─────────┬──────────┘
          │
       HTTP 202
       import_id
          │
┌─────────▼──────────┐
│ GET /imports/{id}  │  Poll the processing status
│ every 3 to 5 s     │  (query every 3 to 5 seconds)
└─────────┬──────────┘
          │
    status check
    ╱     │          ╲
pending  processing  completed / failed
          │            │
       stage:       task_id  ←── Processing complete
       converting      │
       transcribing ┌──▼───────────────┐
       translating  │ Tasks API        │  View the task list
       summarizing  │ SSE History API  │  Load the transcript
                    │ SSE Audio API    │  Play the audio
                    │ SSE Retranslate  │  Retranslate
                    └──────────────────┘
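The flow in the diagram can be sketched as a single function. This is an illustration under stated assumptions rather than a reference implementation: runImport is a hypothetical name, and client is a hypothetical object whose checkQuota, upload, and getStatus methods wrap the three REST calls and each return the data object of the response. The sleep function is injectable so the flow can be exercised without real delays.

```javascript
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical orchestrator: budget check -> upload -> poll -> task_id.
async function runImport(client, file, durationMs, options = {}) {
  const { intervalMs = 5000, maxPolls = 1000, sleep = defaultSleep } = options;

  // Step 1: budget check
  const quota = await client.checkQuota(durationMs);
  if (!quota.allowed) {
    throw new Error('Insufficient budget for this upload');
  }

  // Step 2: upload (HTTP 202 returns an import_id)
  const { import_id: importId } = await client.upload(file);

  // Step 3: poll until a terminal status (completed or failed) is reached
  for (let attempt = 0; attempt < maxPolls; attempt += 1) {
    const state = await client.getStatus(importId);
    if (state.status === 'completed') {
      return state.task_id; // Step 4: use task_id with the Tasks and SSE APIs
    }
    if (state.status === 'failed') {
      throw new Error(`Import failed (${state.error_code}): ${state.error_message}`);
    }
    await sleep(intervalMs);
  }
  throw new Error(`Import ${importId} did not finish after ${maxPolls} polls`);
}
```

The maxPolls cap is a design choice of this sketch, not an API requirement. It prevents a stuck import from being polled forever.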
Related Documents
| Document | Description |
|---|---|
| Authentication | Detailed explanation of API Key authentication |
| Imports API Reference | Complete specification of the audio import API |
| Import Progress SSE | Specification of the real-time progress tracking SSE stream |
| Tasks API Reference | Complete specification of the task management API |
| Summary Templates Reference | Summary template queries |
| History and Playback | How to load and play back records after an import completes |
Version: V1.5.7 Last Updated: 2026-05-20