File Import

Overview

The audio import feature lets you upload pre-recorded audio files for the system to process in the background, including speech recognition, translation, and summarization. Unlike real-time speech translation (WebSocket), audio import uses the REST API and is well suited for offline batch processing scenarios.

End-to-End Flow

Budget check → Upload audio → Track processing progress → Retrieve results

Step	API	Description
1. Budget check	`POST /api/v1/imports/check-quota`	Confirm whether the remaining budget is sufficient
2. Upload audio	`POST /api/v1/imports`	Upload as multipart/form-data
3. Query status	`GET /api/v1/imports/{importId}`	Poll the processing progress
3b. Real-time progress	`GET /api/v1/sse/imports/{importId}/progress`	SSE real-time progress push (an alternative to polling)
4. View results	Tasks API / SSE API	Retrieve the transcript, translation, and summary

Authentication

All audio import APIs are authenticated via the X-API-Key header. See Authentication for details.

Supported Audio Formats

Format	MIME Type	Description
MP3	`audio/mpeg`	The most common compressed format
WAV	`audio/wav`	Lossless format; larger file size
M4A	`audio/mp4`	A format commonly used by Apple

File limits:

Item	Limit
Maximum file size	500 MB
Maximum duration	10 hours
Minimum duration	1 second
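
The limits above can be checked on the client before any bytes are sent. The following is a minimal sketch, not part of the API: `validateAudioFile` and its input shape are hypothetical, and it assumes that 1 MB means 1,048,576 bytes, which this guide does not specify.

```javascript
// Limits copied from the tables above.
const SUPPORTED_MIME_TYPES = ['audio/mpeg', 'audio/wav', 'audio/mp4'];
const MAX_FILE_SIZE_BYTES = 500 * 1024 * 1024; // assumption: 1 MB = 1,048,576 bytes
const MIN_DURATION_MS = 1000; // 1 second
const MAX_DURATION_MS = 10 * 60 * 60 * 1000; // 10 hours

// Hypothetical helper: returns a list of problems; an empty list means the file passes.
function validateAudioFile({ mimeType, sizeBytes, durationMs }) {
  const problems = [];
  if (!SUPPORTED_MIME_TYPES.includes(mimeType)) {
    problems.push(`Unsupported MIME type: ${mimeType}`);
  }
  if (sizeBytes > MAX_FILE_SIZE_BYTES) {
    problems.push('File is larger than 500 MB');
  }
  if (durationMs < MIN_DURATION_MS) {
    problems.push('Audio is shorter than 1 second');
  }
  if (durationMs > MAX_DURATION_MS) {
    problems.push('Audio is longer than 10 hours');
  }
  return problems;
}
```

Some browsers report variant MIME types for the same formats (for example `audio/x-wav`), so a production check may need a slightly wider allow-list. The server remains the authority on what it accepts.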

Budget Check

Before uploading, we recommend checking whether your monthly budget is sufficient. This avoids discovering a budget shortfall only after a large file has already been uploaded.

Request

curl -X POST "https://vas-poc.vurbo.ai/api/v1/imports/check-quota" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"duration_ms": 1500000}'

Parameter	Type	Required	Description
`duration_ms`	integer	Yes	Estimated audio duration (milliseconds); range 1,000 to 36,000,000

Response

{
  "data": {
    "allowed": true,
    "remaining_budget": 48.48,
    "is_unlimited": false,
    "duration_minutes": 25,
    "estimated_cost": 0.4236,
    "remaining_minutes": 2864
  }
}

Field	Type	Description
`allowed`	boolean	`true` indicates the budget is sufficient and you may upload
`remaining_budget`	float | null	Remaining monthly budget (USD); `null` when there is no budget limit
`is_unlimited`	boolean	Whether there is no budget limit
`duration_minutes`	integer	Estimated audio duration (minutes, rounded up)
`estimated_cost`	float	Estimated STT processing cost (USD)
`remaining_minutes`	integer | null	Equivalent STT minutes available within the remaining budget; `null` when there is no budget limit

If allowed is false, we recommend prompting the user to wait for next month's budget reset or to adjust the budget. You can use duration_minutes and remaining_minutes to display a prompt such as "X minutes will be deducted, Y minutes remaining."
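
A minimal JavaScript sketch of this check follows. The endpoint, header, and response fields are the ones documented above; the function names `checkQuota` and `formatQuotaMessage` and the exact message wording are illustrative assumptions.

```javascript
const API_BASE = 'https://vas-poc.vurbo.ai/api/v1';

// Calls the check-quota endpoint and returns the `data` object of the response.
async function checkQuota(durationMs, apiKey) {
  const response = await fetch(`${API_BASE}/imports/check-quota`, {
    method: 'POST',
    headers: { 'X-API-Key': apiKey, 'Content-Type': 'application/json' },
    body: JSON.stringify({ duration_ms: Math.round(durationMs) }),
  });
  if (!response.ok) {
    throw new Error(`check-quota failed: HTTP ${response.status}`);
  }
  return (await response.json()).data;
}

// Hypothetical helper: turns the response fields into a user-facing message.
function formatQuotaMessage(quota) {
  if (quota.is_unlimited) {
    // remaining_minutes is null when there is no budget limit.
    return `This upload uses about ${quota.duration_minutes} minutes; no budget limit applies.`;
  }
  if (!quota.allowed) {
    return `Insufficient budget: ${quota.duration_minutes} minutes needed, ` +
      `${quota.remaining_minutes} minutes remaining.`;
  }
  return `${quota.duration_minutes} minutes will be deducted, ` +
    `${quota.remaining_minutes} minutes remaining.`;
}
```

Keeping the message formatting separate from the network call makes it easy to reuse the same text in a confirmation dialog and in an error banner.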

Upload Audio

Upload the audio file and processing parameters using the multipart/form-data format.

Basic Request

curl -X POST "https://vas-poc.vurbo.ai/api/v1/imports" \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "file=@meeting.mp3" \
  -F 'transcription_languages=["zh-TW"]' \
  -F 'translation_languages=["en-US"]' \
  -F "recognition_mode=multi_speaker"

Request With Summary and Text Processing

curl -X POST "https://vas-poc.vurbo.ai/api/v1/imports" \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "file=@meeting.mp3" \
  -F 'transcription_languages=["zh-TW"]' \
  -F 'translation_languages=["en-US"]' \
  -F "recognition_mode=multi_speaker" \
  -F "summary_template=meeting" \
  -F 'terminology={"zh-TW": [{"term": "語者分離", "boost": 1.5}]}' \
  -F 'translation_dict=[{"source": "語者分離", "translations": {"en-US": "Speaker Diarization"}}]'

Success Response (HTTP 202)

{
  "data": {
    "import_id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "pending",
    "stage": null,
    "progress": 0,
    "message": null,
    "original_filename": "meeting.mp3",
    "file_size": "15.2 MB",
    "task_id": null,
    "created_at": "2026-01-15T10:00:00.000Z"
  }
}

Note: The response code is 202 Accepted, which means the server has accepted the upload but processing is not yet complete. Save the import_id for querying progress later.
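
The same upload can be sketched in JavaScript with `FormData` and `fetch`. This is an illustration under assumptions rather than an official client: `buildImportForm`, `uploadAudio`, and the camelCase option names are hypothetical, while the form field names match the curl examples above.

```javascript
// Hypothetical helper: builds the multipart body for POST /api/v1/imports.
// `file` is a Blob or File. Array values are JSON-encoded because the API
// expects them as JSON strings inside form fields.
function buildImportForm(file, filename, options) {
  const form = new FormData();
  form.append('file', file, filename);
  form.append('transcription_languages', JSON.stringify(options.transcriptionLanguages));
  form.append('recognition_mode', options.recognitionMode);
  if (options.translationLanguages) {
    form.append('translation_languages', JSON.stringify(options.translationLanguages));
  }
  if (options.summaryTemplate) {
    form.append('summary_template', options.summaryTemplate);
  }
  return form;
}

// Sends the form and resolves with the import_id from the 202 response.
async function uploadAudio(form, apiKey) {
  // Content-Type is not set manually; fetch adds the multipart boundary itself.
  const response = await fetch('https://vas-poc.vurbo.ai/api/v1/imports', {
    method: 'POST',
    headers: { 'X-API-Key': apiKey },
    body: form,
  });
  if (response.status !== 202) {
    throw new Error(`Upload failed: HTTP ${response.status}`);
  }
  return (await response.json()).data.import_id;
}
```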

Common Errors

Error Code	HTTP Status	Description	How to Handle
`import_file_too_large`	413	File exceeds 500 MB	Compress or split the file
`import_invalid_format`	415	Unsupported audio format	Use mp3/wav/m4a
`auth_budget_exceeded`	402	Monthly budget exceeded	Wait for next month's budget reset or adjust the budget
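
One way to surface these errors to users is a small lookup, sketched below. The error codes, HTTP statuses, and advice come from the table above; the helper name `adviceForUploadError` is hypothetical, and because this guide does not show the shape of the error response body, the sketch takes the error code as a plain argument and falls back to the HTTP status when the code is unavailable.

```javascript
// Error codes and suggested handling, copied from the table above.
const IMPORT_ERROR_ADVICE = {
  import_file_too_large: 'Compress or split the file (limit: 500 MB).',
  import_invalid_format: 'Use an mp3, wav, or m4a file.',
  auth_budget_exceeded: "Wait for next month's budget reset or adjust the budget.",
};

// Fallback keyed by HTTP status, for when the error code cannot be read.
const STATUS_TO_ERROR_CODE = {
  413: 'import_file_too_large',
  415: 'import_invalid_format',
  402: 'auth_budget_exceeded',
};

// Hypothetical helper: returns user-facing advice for a failed upload.
function adviceForUploadError(httpStatus, errorCode) {
  const code = errorCode ?? STATUS_TO_ERROR_CODE[httpStatus];
  return IMPORT_ERROR_ADVICE[code] ?? `Upload failed with HTTP ${httpStatus}.`;
}
```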

Parameter Reference

Required Parameters

Parameter	Type	Description
`file`	file	Audio file (multipart/form-data)
`transcription_languages`	string (JSON)	Transcription languages, as a JSON array (e.g., `["zh-TW"]`)
`recognition_mode`	string	`single` (single speaker) or `multi_speaker` (multi-speaker diarization)

Optional Parameters

Parameter	Type	Description
`translation_languages`	string (JSON)	Target translation languages, as a JSON array (e.g., `["en-US", "ja-JP"]`)
`summary_template`	string	Summary template identifier (e.g., `meeting`, `interview`, `speech`)
`terminology`	string (JSON)	Terminology list (improves recognition accuracy)
`fuzzy_correction`	string (JSON)	Fuzzy-correction rules (usually no manual configuration needed)
`translation_dict`	string (JSON)	Translation dictionary (ensures consistent translation of proper nouns)
`callback_url`	string	Webhook callback URL (notifies you when processing completes or fails)

Recognition Modes

Mode	Description	Use Case
`single`	Single-speaker recognition	Voice memos or speech recordings with a single speaker
`multi_speaker`	Multi-speaker diarization	Meeting recordings, interviews, multi-party conversations

Summary Templates

The available summary templates can be queried via GET /api/v1/summary-templates:

Template	Use Case
`general`	General summary
`meeting`	Meeting notes
`meeting_minutes`	Detailed meeting minutes
`speech`	Speech content
`interview`	Interview content
`course`	Course content

Querying Import Status

After a successful upload, use the import_id to poll the processing progress.

Request

curl -X GET "https://vas-poc.vurbo.ai/api/v1/imports/{importId}" \
  -H "X-API-Key: YOUR_API_KEY"

Response

{
  "data": {
    "import_id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "processing",
    "stage": "transcribing",
    "progress": 45,
    "message": "Recognizing speech...",
    "task_id": null,
    "error_code": null,
    "error_message": null
  }
}

Status Transitions

pending → processing → completed
                    └→ failed

Status	Description
`pending`	Queued and waiting to be processed
`processing`	Currently processing
`completed`	Processing complete (`task_id` is populated)
`failed`	Processing failed (`error_code` and `error_message` are populated)

Processing Stages

While status is processing, the stage field indicates the current processing stage:

Stage	Description	Approximate Progress
`converting`	Audio format conversion	0% to 10%
`transcribing`	Speech recognition in progress	10% to 60%
`translating`	Translation in progress	60% to 85%
`summarizing`	Generating summary	85% to 100%

Polling Recommendations

// Polls the import status until it reaches a terminal state.
// Resolves with the task_id on success; throws if the import or a request fails.
async function pollImportStatus(importId, apiKey, intervalMs = 5000) {
  while (true) {
    const response = await fetch(
      `https://vas-poc.vurbo.ai/api/v1/imports/${importId}`,
      { headers: { 'X-API-Key': apiKey } }
    );
    if (!response.ok) {
      throw new Error(`Status query failed: HTTP ${response.status}`);
    }
    const result = await response.json();
    const { status, stage, progress, task_id, error_message } = result.data;

    console.log(`Status: ${status}, Stage: ${stage}, Progress: ${progress}%`);

    if (status === 'completed') {
      console.log(`Processing complete! Task ID: ${task_id}`);
      return task_id; // Use task_id to load the results...
    }
    if (status === 'failed') {
      throw new Error(`Processing failed: ${error_message}`);
    }
    // Wait before the next query (every 5 seconds by default) so requests never overlap
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

Recommendation: A polling interval of 3 to 5 seconds is sufficient. Polling too frequently will not speed up processing.

Behavior When Audio Cannot Be Recognized (v1.3.5)

If speech recognition returns no result (for example, because the audio is completely silent, the volume is too low, the recording is noise throughout, or the recognition language does not match the language actually spoken), the import still ends in the completed status (not failed), and the transcript is an empty array.

Behavior Definition

Item	Value
Final `status`	`completed` (not `failed`)
Final SSE event	`completed` (`task_id` is populated)
Webhook events	`recording.completed` + `import.completed`
Transcript entries	`[]` (empty array)
`segments_count`	`0`
Budget deduction	Deducted based on the actual audio duration; not refunded

Why Isn't It `failed`?

failed means the processing flow itself encountered an error (such as an invalid format, exceeded budget, or audio parsing failure). When the audio processing flow runs through to completion but simply does not recognize any content, that is a valid completed status. This lets the client handle the result through the same success branch and use entries.length === 0 to determine when to display a "no speech content" prompt.

Client Handling Recommendations

When loading the transcript (GET /api/v1/sse/history/transcribe/{taskId}), if the accumulated sentence count is 0, we recommend displaying an empty state:

const sentences = [];
// ... process SSE events and collect init_sentence

if (sentences.length === 0) {
  // Display empty state
  showEmptyState({
    title: 'No speech content was recognized in this audio file',
    hint: 'Possible reasons: volume too low, silence throughout, or a recognition language that does not match the audio. We recommend verifying the audio quality or adjusting the recognition language before uploading again.',
  });
} else {
  renderTranscript(sentences);
}

How to Prevent It

Check the recognition language setting: Confirm that transcription_languages matches the actual language of the audio (for example, select en-US for English audio, not zh-TW).
Check the audio quality: Confirm the audio has clear speech and sufficient volume (a peak of -12 dBFS or higher is recommended).
Multilingual audio: If the audio contains multiple languages, we recommend splitting it and uploading each part separately.

Tip: The budget is deducted based on the audio duration, and unrecognizable audio is no exception. We recommend listening to the audio before uploading to confirm that it contains audible speech.

After Import Completes

When the status changes to completed, the task_id in the response is the Task ID corresponding to that import task. With this ID, you can:

1. View the Task List

curl -X GET "https://vas-poc.vurbo.ai/api/v1/tasks" \
  -H "X-API-Key: YOUR_API_KEY"

2. Load the Transcript (SSE Stream)

const response = await fetch(
  `https://vas-poc.vurbo.ai/api/v1/sse/history/transcribe/${taskId}`,
  { headers: { 'X-API-Key': apiKey } }
);
// Process SSE events: init_metadata → init_sentence × N → init_summary → init_done
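
The event handling hinted at in the comment above can be sketched as follows. The event names come from this guide; everything else is an assumption: `parseSseBuffer` and `loadTranscript` are hypothetical helpers, the parser handles only the `event:` and `data:` fields of the server-sent events format, and each `init_sentence` payload is assumed to be JSON (see the History and Playback Guide for the actual payload format).

```javascript
// Minimal SSE parser: splits a text buffer into complete events.
// Returns the parsed events plus the unparsed remainder (an incomplete event).
function parseSseBuffer(buffer) {
  const events = [];
  const blocks = buffer.replace(/\r\n/g, '\n').split('\n\n');
  const rest = blocks.pop(); // the last block may still be incomplete
  for (const block of blocks) {
    let event = 'message';
    const dataLines = [];
    for (const line of block.split('\n')) {
      if (line.startsWith('event:')) {
        event = line.slice(6).trim();
      } else if (line.startsWith('data:')) {
        dataLines.push(line.slice(5).replace(/^ /, ''));
      }
    }
    if (dataLines.length > 0) {
      events.push({ event, data: dataLines.join('\n') });
    }
  }
  return { events, rest };
}

// Reads the history stream until init_done and collects init_sentence payloads.
async function loadTranscript(taskId, apiKey) {
  const response = await fetch(
    `https://vas-poc.vurbo.ai/api/v1/sse/history/transcribe/${taskId}`,
    { headers: { 'X-API-Key': apiKey } }
  );
  if (!response.ok) {
    throw new Error(`Transcript request failed: HTTP ${response.status}`);
  }
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  const sentences = [];
  let buffer = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const parsed = parseSseBuffer(buffer);
    buffer = parsed.rest;
    for (const { event, data } of parsed.events) {
      if (event === 'init_sentence') sentences.push(JSON.parse(data)); // assumption: JSON payload
      if (event === 'init_done') return sentences;
    }
  }
  return sentences;
}
```

Reading the body with `fetch` rather than `EventSource` keeps the X-API-Key header available, since `EventSource` does not accept custom request headers.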

3. Play the Audio

const response = await fetch(
  `https://vas-poc.vurbo.ai/api/v1/sse/audio/${taskId}`,
  { headers: { 'X-API-Key': apiKey } }
);
const blob = await response.blob();
const audio = new Audio(URL.createObjectURL(blob));
audio.play();

4. Retranslate Into Another Language

const response = await fetch(
  `https://vas-poc.vurbo.ai/api/v1/sse/retranslate/${taskId}?targetLang=ja-JP`,
  { headers: { 'X-API-Key': apiKey } }
);
// Process SSE events: translation × N → done

For full history record operations, see the History and Playback Guide.

Text Processing Parameters

When uploading, you can include text processing parameters to improve recognition and translation quality.

Terminology (terminology)

Provided as a JSON object keyed by language code:

{
  "zh-TW": [
    { "term": "語者分離", "boost": 1.5 },
    { "term": "CVD製程", "boost": 2.0 }
  ]
}

Field	Required	Description
`term`	Yes	Terminology text (maximum 100 characters)
`boost`	No	Recognition weight (0.5 to 5.0, default 1.0)

Up to 500 terms per language. The system automatically generates fuzzy-correction rules from the terminology list.
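
These limits can be checked before upload with a small validator. The sketch below is illustrative: `validateTerminology` is a hypothetical helper, and it counts characters as Unicode code points, which is an assumption because this guide does not say how the server measures term length.

```javascript
// Limits copied from the table and note above.
const MAX_TERM_LENGTH = 100;
const MAX_TERMS_PER_LANGUAGE = 500;
const MIN_BOOST = 0.5;
const MAX_BOOST = 5.0;

// Hypothetical helper: returns a list of problems; an empty list means the object passes.
function validateTerminology(terminology) {
  const problems = [];
  for (const [language, terms] of Object.entries(terminology)) {
    if (terms.length > MAX_TERMS_PER_LANGUAGE) {
      problems.push(`${language}: ${terms.length} terms exceed the limit of ${MAX_TERMS_PER_LANGUAGE}`);
    }
    terms.forEach(({ term, boost = 1.0 }, index) => {
      if (typeof term !== 'string' || term.length === 0) {
        problems.push(`${language}[${index}]: "term" is required`);
      } else if ([...term].length > MAX_TERM_LENGTH) {
        // Assumption: length is counted in code points.
        problems.push(`${language}[${index}]: term is longer than ${MAX_TERM_LENGTH} characters`);
      }
      if (typeof boost !== 'number' || boost < MIN_BOOST || boost > MAX_BOOST) {
        problems.push(`${language}[${index}]: boost must be between ${MIN_BOOST} and ${MAX_BOOST}`);
      }
    });
  }
  return problems;
}
```

The validated object can then be passed to the upload as `JSON.stringify(terminology)` in the terminology form field.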

Fuzzy Correction (fuzzy_correction)

Usually no manual configuration is needed; the system generates these automatically from terminology. If you need to customize them:

{
  "zh-TW": [
    { "correct": "語者分離", "incorrect": ["語這分離", "語者分力"] }
  ]
}

Translation Dictionary (translation_dict)

Ensures consistent translation of proper nouns:

[
  {
    "source": "語者分離",
    "translations": {
      "en-US": "Speaker Diarization",
      "ja-JP": "話者分離"
    }
  }
]

We recommend no more than 50 entries.

Webhook Notifications

Once you set callback_url, VAS proactively sends an HTTP POST notification to your server when processing completes or fails, removing the need to poll.

How to Configure

Add the callback_url parameter when uploading:

curl -X POST "https://vas-poc.vurbo.ai/api/v1/imports" \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "file=@meeting.mp3" \
  -F 'transcription_languages=["zh-TW"]' \
  -F "recognition_mode=multi_speaker" \
  -F "callback_url=https://your-server.com/webhooks/vas"

Events You Receive

Result	Event	Description
Success	`recording.completed` + `import.completed`	You receive two events
Failure	`import.failed`	The import stage failed

For the complete Webhook format, signature verification, and sample code, see the Webhook Callback Guide.
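
Because a successful import produces two events, a receiver may want to act only once both have arrived. The sketch below shows one possible approach and is a design choice rather than a documented requirement: `createImportTracker` is a hypothetical helper, one tracker is assumed per import, and how the event name is read from the webhook payload is left to the Webhook Callback Guide.

```javascript
// Event names taken from the table above.
const SUCCESS_EVENTS = ['recording.completed', 'import.completed'];

// Hypothetical helper: create one tracker per import and feed it event names.
// Returns 'failed', 'ignored', 'waiting', or 'completed'.
function createImportTracker() {
  const seen = new Set();
  return function onEvent(eventName) {
    if (eventName === 'import.failed') return 'failed';
    if (!SUCCESS_EVENTS.includes(eventName)) return 'ignored';
    seen.add(eventName); // a Set makes duplicate deliveries harmless
    return seen.size === SUCCESS_EVENTS.length ? 'completed' : 'waiting';
  };
}
```

The tracker does not depend on the order in which the two success events arrive, and a repeated delivery of the same event does not change the result.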

Complete Flow Diagram

           ┌────────────────────┐
           │   check-quota      │  Check whether the budget is sufficient
           │  POST /imports/    │
           │  check-quota       │
           └────────┬───────────┘
                    │
              allowed: true?
               ╱          ╲
            Yes            No → Prompt that the budget is insufficient
              │
    ┌─────────▼───────────┐
    │    POST /imports    │  Upload the audio
    │  multipart/form-data│  (transcription_languages,
    │                     │   translation_languages,
    │                     │   recognition_mode, ...)
    └─────────┬───────────┘
              │
         HTTP 202
         import_id
              │
    ┌─────────▼───────────┐
    │  GET /imports/{id}  │  Poll the processing status
    │   every 3 to 5 s    │
    └─────────┬───────────┘
              │
        status check
       ╱      │      ╲
   pending  processing  completed / failed
              │              │
         stage:          task_id ←── Processing complete
         converting          │
         transcribing   ┌────▼──────────────┐
         translating    │  Tasks API        │  View the task list
         summarizing    │  SSE History API  │  Load the transcript
                        │  SSE Audio API    │  Play the audio
                        │  SSE Retranslate  │  Retranslate
                        └───────────────────┘

Related Documents

Document	Description
Authentication	Detailed explanation of API Key authentication
Imports API Reference	Complete specification of the audio import API
Import Progress SSE	Specification of the real-time progress tracking SSE stream
Tasks API Reference	Complete specification of the task management API
Summary Templates Reference	Summary template queries
History and Playback	How to load and play back records after an import completes

Version: V1.5.7 Last Updated: 2026-05-20
