File Import

Table of Contents

  1. Overview
  2. Supported Audio Formats
  3. Budget Check
  4. Upload Audio
  5. Parameter Reference
  6. Querying Import Status
  7. After Import Completes
  8. Text Processing Parameters
  9. Webhook Notifications
  10. Complete Flow Diagram
  11. Related Documents

Overview

The audio import feature lets you upload pre-recorded audio files, which the system processes in the background: speech recognition, translation, and summarization. Unlike real-time speech translation (WebSocket), audio import uses the REST API and is well suited to offline batch processing.

End-to-End Flow

Budget check → Upload audio → Track processing progress → Retrieve results

Step                    API                                          Description
1. Budget check         POST /api/v1/imports/check-quota             Confirm whether the remaining budget is sufficient
2. Upload audio         POST /api/v1/imports                         Upload as multipart/form-data
3. Query status         GET /api/v1/imports/{importId}               Poll the processing progress
3b. Real-time progress  GET /api/v1/sse/imports/{importId}/progress  SSE real-time progress push (an alternative to polling)
4. View results         Tasks API / SSE API                          Retrieve the transcript, translation, and summary

Authentication

All audio import APIs are authenticated via the X-API-Key header. See Authentication for details.


Supported Audio Formats

Format  MIME Type   Description
MP3     audio/mpeg  The most common compressed format
WAV     audio/wav   Lossless format; larger file size
M4A     audio/mp4   A format commonly used by Apple

File limits:

Item               Limit
Maximum file size  500 MB
Maximum duration   10 hours
Minimum duration   1 second
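
A client can check these limits before calling the API, which avoids a wasted upload. The sketch below is a hypothetical helper, not part of VAS: the function name, the input shape, the extension-based format check, and the choice of 1 MB = 1,048,576 bytes are all assumptions. Only the limits themselves come from the tables above, and the server remains the source of truth.

```javascript
// Limits taken from the tables above. Treating 1 MB as 1024 * 1024 bytes is an
// assumption; the guide does not say which definition the server uses.
const MAX_FILE_BYTES = 500 * 1024 * 1024;
const MIN_DURATION_MS = 1000;                 // 1 second
const MAX_DURATION_MS = 10 * 60 * 60 * 1000;  // 10 hours
const ALLOWED_EXTENSIONS = ['mp3', 'wav', 'm4a'];

// Hypothetical helper: returns a list of problems (empty when the file looks acceptable).
function validateAudioFile({ name, sizeBytes, durationMs }) {
  const problems = [];
  const ext = name.includes('.') ? name.split('.').pop().toLowerCase() : '';
  if (!ALLOWED_EXTENSIONS.includes(ext)) {
    problems.push(`Unsupported format "${ext}"; use mp3, wav, or m4a`);
  }
  if (sizeBytes > MAX_FILE_BYTES) problems.push('File exceeds 500 MB');
  if (durationMs < MIN_DURATION_MS) problems.push('Audio is shorter than 1 second');
  if (durationMs > MAX_DURATION_MS) problems.push('Audio is longer than 10 hours');
  return problems;
}
```

An empty array means the file passed every local check; any other result can be shown to the user before the upload starts.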

Budget Check

Before uploading, we recommend checking whether your monthly budget is sufficient. Otherwise you may learn about a shortfall only after uploading a large file.

Request

curl -X POST "https://vas-poc.vurbo.ai/api/v1/imports/check-quota" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"duration_ms": 1500000}'

Parameter    Type     Required  Description
duration_ms  integer  Yes       Estimated audio duration (milliseconds); range 1,000 to 36,000,000

Response

{
  "data": {
    "allowed": true,
    "remaining_budget": 48.48,
    "is_unlimited": false,
    "duration_minutes": 25,
    "estimated_cost": 0.4236,
    "remaining_minutes": 2864
  }
}

Field              Type            Description
allowed            boolean         true indicates the budget is sufficient and you may upload
remaining_budget   float | null    Remaining monthly budget (USD); null when there is no budget limit
is_unlimited       boolean         Whether there is no budget limit
duration_minutes   integer         Estimated audio duration (minutes, rounded up)
estimated_cost     float           Estimated STT processing cost (USD)
remaining_minutes  integer | null  Equivalent STT minutes available within the remaining budget; null when there is no budget limit

If allowed is false, we recommend prompting the user to wait for next month's budget reset or to adjust the budget. You can use duration_minutes and remaining_minutes to display a prompt such as "X minutes will be deducted, Y minutes remaining."
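
The prompt described above can be sketched as a small formatting function. This is an illustration under assumptions: the function name and the exact wording of the messages are hypothetical, and the input is the data object of a check-quota response with the fields listed in the table above.

```javascript
// Hypothetical helper: turns the `data` object of a check-quota response
// into a message for the user. Field names come from the response table above.
function formatQuotaPrompt(data) {
  if (data.is_unlimited) {
    // remaining_minutes is null when there is no budget limit.
    return `${data.duration_minutes} minutes will be processed (no budget limit).`;
  }
  if (!data.allowed) {
    return `Insufficient budget: this file needs ${data.duration_minutes} minutes, ` +
      `but only ${data.remaining_minutes} minutes remain.`;
  }
  return `${data.duration_minutes} minutes will be deducted, ` +
    `${data.remaining_minutes} minutes remaining.`;
}
```

Checking is_unlimited first keeps the null values of remaining_budget and remaining_minutes out of the displayed text.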


Upload Audio

Upload the audio file and processing parameters using the multipart/form-data format.

Basic Request

curl -X POST "https://vas-poc.vurbo.ai/api/v1/imports" \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "file=@meeting.mp3" \
  -F 'transcription_languages=["zh-TW"]' \
  -F 'translation_languages=["en-US"]' \
  -F "recognition_mode=multi_speaker"

Request With Summary and Text Processing

curl -X POST "https://vas-poc.vurbo.ai/api/v1/imports" \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "file=@meeting.mp3" \
  -F 'transcription_languages=["zh-TW"]' \
  -F 'translation_languages=["en-US"]' \
  -F "recognition_mode=multi_speaker" \
  -F "summary_template=meeting" \
  -F 'terminology={"zh-TW": [{"term": "語者分離", "boost": 1.5}]}' \
  -F 'translation_dict=[{"source": "語者分離", "translations": {"en-US": "Speaker Diarization"}}]'

Success Response (HTTP 202)

{
  "data": {
    "import_id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "pending",
    "stage": null,
    "progress": 0,
    "message": null,
    "original_filename": "meeting.mp3",
    "file_size": "15.2 MB",
    "task_id": null,
    "created_at": "2026-01-15T10:00:00.000Z"
  }
}

Note: The response code is 202 Accepted, which means the server has accepted the upload but processing is not yet complete. Save the import_id for querying progress later.

Common Errors

Error Code             HTTP Status  Description               How to Handle
import_file_too_large  413          File exceeds 500 MB       Compress or split the file
import_invalid_format  415          Unsupported audio format  Use mp3/wav/m4a
auth_budget_exceeded   402          Monthly budget exceeded   Wait for next month's budget reset or adjust the budget
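
One way to surface these errors in a client is a lookup from error code to handling advice. The sketch below is hypothetical: the names are illustrative, and because this guide does not show the shape of the error response body, the caller is assumed to have already extracted the error code.

```javascript
// Error codes and handling advice copied from the table above.
const UPLOAD_ERROR_ADVICE = new Map([
  ['import_file_too_large', 'Compress or split the file.'],
  ['import_invalid_format', 'Use mp3, wav, or m4a.'],
  ['auth_budget_exceeded', "Wait for next month's budget reset or adjust the budget."],
]);

// Hypothetical helper: how the error code is read from the response body is
// not covered in this guide, so the caller passes it in as a string.
function adviceForUploadError(errorCode) {
  return UPLOAD_ERROR_ADVICE.get(errorCode)
    ?? 'Unrecognized error code; see the Imports API Reference.';
}
```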

Parameter Reference

Required Parameters

Parameter                Type           Description
file                     file           Audio file (multipart/form-data)
transcription_languages  string (JSON)  Transcription languages, as a JSON array (e.g., ["zh-TW"])
recognition_mode         string         single (single speaker) or multi_speaker (multi-speaker diarization)

Optional Parameters

Parameter              Type           Description
translation_languages  string (JSON)  Target translation languages, as a JSON array (e.g., ["en-US", "ja-JP"])
summary_template       string         Summary template identifier (e.g., meeting, interview, speech)
terminology            string (JSON)  Terminology list (improves recognition accuracy)
fuzzy_correction       string (JSON)  Fuzzy-correction rules (usually no manual configuration needed)
translation_dict       string (JSON)  Translation dictionary (ensures consistent translation of proper nouns)
callback_url           string         Webhook callback URL (notifies you when processing completes or fails)
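
Parameters typed as string (JSON) are sent as JSON-encoded text inside the multipart body. The following sketch shows one way to assemble that body in JavaScript. The helper name and its camelCase option names are assumptions made for this example; only the form field names come from the tables above. It relies on the standard FormData and Blob globals available in browsers and in Node.js 18 or later.

```javascript
// Hypothetical helper: builds the multipart body for POST /api/v1/imports.
// `file` is a Blob or File; the camelCase option names are chosen for this sketch.
function buildImportForm(file, filename, options) {
  const form = new FormData();
  form.append('file', file, filename);
  // Array and object parameters are sent as JSON-encoded strings.
  form.append('transcription_languages', JSON.stringify(options.transcriptionLanguages));
  form.append('recognition_mode', options.recognitionMode);
  if (options.translationLanguages) {
    form.append('translation_languages', JSON.stringify(options.translationLanguages));
  }
  if (options.summaryTemplate) form.append('summary_template', options.summaryTemplate);
  if (options.terminology) form.append('terminology', JSON.stringify(options.terminology));
  if (options.translationDict) {
    form.append('translation_dict', JSON.stringify(options.translationDict));
  }
  if (options.callbackUrl) form.append('callback_url', options.callbackUrl);
  return form;
}
```

The returned form can be passed as the body of a fetch call with method POST and the X-API-Key header. Leave the Content-Type header unset so that the runtime adds the multipart boundary itself.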

Recognition Modes

Mode           Description                 Use Case
single         Single-speaker recognition  Voice memos or speech recordings with a single speaker
multi_speaker  Multi-speaker diarization   Meeting recordings, interviews, multi-party conversations

Summary Templates

The available summary templates can be queried via GET /api/v1/summary-templates:

Template         Use Case
general          General summary
meeting          Meeting notes
meeting_minutes  Detailed meeting minutes
speech           Speech content
interview        Interview content
course           Course content

Querying Import Status

After a successful upload, use the import_id to poll the processing progress.

Request

curl -X GET "https://vas-poc.vurbo.ai/api/v1/imports/{importId}" \
  -H "X-API-Key: YOUR_API_KEY"

Response

{
  "data": {
    "import_id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "processing",
    "stage": "transcribing",
    "progress": 45,
    "message": "Recognizing speech...",
    "task_id": null,
    "error_code": null,
    "error_message": null
  }
}

Status Transitions

pending → processing → completed
                    └→ failed

Status      Description
pending     Queued and waiting to be processed
processing  Currently processing
completed   Processing complete (task_id is populated)
failed      Processing failed (error_code and error_message are populated)

Processing Stages

While the status is processing, the stage field indicates the current processing stage:

Stage         Description                     Approximate Progress
converting    Audio format conversion         0% to 10%
transcribing  Speech recognition in progress  10% to 60%
translating   Translation in progress         60% to 85%
summarizing   Generating summary              85% to 100%

Polling Recommendations

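// Polls the import status until it reaches completed or failed.
// Resolves with the task_id on success; throws on failure or on an HTTP error.
// Each query starts only after the previous one finishes, so requests never overlap.
async function pollImportStatus(importId, apiKey, intervalMs = 5000) {
  const url = `https://vas-poc.vurbo.ai/api/v1/imports/${importId}`;

  while (true) {
    const response = await fetch(url, { headers: { 'X-API-Key': apiKey } });
    if (!response.ok) {
      throw new Error(`Status query failed: HTTP ${response.status}`);
    }
    const { data } = await response.json();
    const { status, stage, progress, task_id } = data;

    console.log(`Status: ${status}, Stage: ${stage}, Progress: ${progress}%`);

    if (status === 'completed') {
      console.log(`Processing complete! Task ID: ${task_id}`);
      return task_id; // Use task_id to load the results
    }
    if (status === 'failed') {
      throw new Error(`Processing failed: ${data.error_message}`);
    }

    // Wait before the next query (default: every 5 seconds)
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}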

Recommendation: A polling interval of 3 to 5 seconds is sufficient. Polling too frequently will not speed up processing.


Behavior When Audio Cannot Be Recognized (v1.3.5)

If speech recognition returns an empty result (for example, because the audio is completely silent, the volume is too low, the audio is noise throughout, or the recognition language does not match the language actually spoken), the system still ends in the completed status (not failed), and the transcript is an empty array.

Behavior Definition

Item                Value
Final status        completed (not failed)
Final SSE event     completed (task_id is populated)
Webhook events      recording.completed + import.completed
Transcript entries  [] (empty array)
segments_count      0
Budget deduction    Deducted based on the actual audio duration; not refunded

Why Isn't It failed?

failed means the processing flow itself encountered an error (such as an invalid format, an exceeded budget, or an audio parsing failure). When the flow runs to completion but recognizes no content, completed is the correct status. The client can then handle the result in the same success branch and check entries.length === 0 to decide when to display a "no speech content" prompt.

Client Handling Recommendations

When loading the transcript (GET /api/v1/sse/history/transcribe/{taskId}), if the accumulated sentence count is 0, we recommend displaying an empty state:

const sentences = [];
// ... process SSE events and collect init_sentence

if (sentences.length === 0) {
  // Display empty state
  showEmptyState({
    title: 'No speech content was recognized in this audio file',
    hint: 'Possible reasons: volume too low, silence throughout, or a recognition language that does not match the audio. We recommend verifying the audio quality or adjusting the recognition language before uploading again.',
  });
} else {
  renderTranscript(sentences);
}

How to Prevent It

  • Check the recognition language setting: Confirm that transcription_languages matches the actual language of the audio (for example, select en-US for English audio, not zh-TW)
  • Check the audio quality: Confirm the audio has clear speech and sufficient volume (a peak of -12 dBFS or higher is recommended)
  • Multilingual audio: If the audio contains multiple languages, we recommend splitting it and uploading each part separately

Tip: The budget is deducted based on the audio duration, even when nothing can be recognized. We recommend listening to the audio before uploading to confirm that it contains clear speech.


After Import Completes

When the status changes to completed, the task_id in the response identifies the task created by this import. With this ID, you can:

1. View the Task List

curl -X GET "https://vas-poc.vurbo.ai/api/v1/tasks" \
  -H "X-API-Key: YOUR_API_KEY"

2. Load the Transcript (SSE Stream)

const response = await fetch(
  `https://vas-poc.vurbo.ai/api/v1/sse/history/transcribe/${taskId}`,
  { headers: { 'X-API-Key': apiKey } }
);
// Process SSE events: init_metadata → init_sentence × N → init_summary → init_done
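
The fetch call above returns a stream, so the client has to split it into events itself. The sketch below shows one way to do that, assuming the endpoint follows the standard text/event-stream wire format (event: and data: lines, with a blank line between events). The function names are hypothetical, and the payload carried in each data field is not described in this guide, so it is passed through as raw text.

```javascript
// Minimal parser for text/event-stream data, assuming the standard SSE wire format.
// Returns the complete events plus any trailing partial event, which the caller
// prepends to the next chunk.
function parseSseBuffer(buffer) {
  const blocks = buffer.replace(/\r\n/g, '\n').split('\n\n');
  const rest = blocks.pop(); // incomplete event, or an empty string
  const events = [];
  for (const block of blocks) {
    let event = 'message'; // default event name in the SSE format
    const dataLines = [];
    for (const line of block.split('\n')) {
      if (line.startsWith('event:')) event = line.slice(6).trim();
      else if (line.startsWith('data:')) dataLines.push(line.slice(5).replace(/^ /, ''));
    }
    if (dataLines.length > 0) events.push({ event, data: dataLines.join('\n') });
  }
  return { events, rest };
}

// Reads a fetch Response body chunk by chunk and calls onEvent for every complete event.
async function readSse(response, onEvent) {
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const parsed = parseSseBuffer(buffer);
    buffer = parsed.rest;
    parsed.events.forEach(onEvent);
  }
}
```

A caller could pass the response from the request above to readSse, collect the init_sentence events into an array, and stop once init_done arrives.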

3. Play the Audio

const response = await fetch(
  `https://vas-poc.vurbo.ai/api/v1/sse/audio/${taskId}`,
  { headers: { 'X-API-Key': apiKey } }
);
const blob = await response.blob();
const audio = new Audio(URL.createObjectURL(blob));
audio.play();

4. Retranslate Into Another Language

const response = await fetch(
  `https://vas-poc.vurbo.ai/api/v1/sse/retranslate/${taskId}?targetLang=ja-JP`,
  { headers: { 'X-API-Key': apiKey } }
);
// Process SSE events: translation × N → done

For the full set of history operations, see the History and Playback Guide.


Text Processing Parameters

When uploading, you can include text processing parameters to improve recognition and translation quality.

Terminology (terminology)

Provided as a JSON object keyed by language code:

{
  "zh-TW": [
    { "term": "語者分離", "boost": 1.5 },
    { "term": "CVD製程", "boost": 2.0 }
  ]
}

Field  Required  Description
term   Yes       Terminology text (maximum 100 characters)
boost  No        Recognition weight (0.5 to 5.0, default 1.0)

Up to 500 terms per language. The system automatically generates fuzzy-correction rules from the terminology list.
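
These limits can be checked on the client before the upload. The following is a sketch under assumptions: the function name is hypothetical, and counting characters by Unicode code point is a guess, since the guide does not say how the server measures term length.

```javascript
// Limits from this section: term up to 100 characters, boost 0.5 to 5.0
// (default 1.0), and up to 500 terms per language.
// Hypothetical helper: returns a list of problems (empty when the input is valid).
function validateTerminology(terminology) {
  const problems = [];
  for (const [lang, terms] of Object.entries(terminology)) {
    if (terms.length > 500) problems.push(`${lang}: more than 500 terms`);
    terms.forEach(({ term, boost = 1.0 }, i) => {
      if (typeof term !== 'string' || term.length === 0) {
        problems.push(`${lang}[${i}]: term is required`);
      } else if ([...term].length > 100) {
        // Spreading the string counts code points, which is an assumption.
        problems.push(`${lang}[${i}]: term exceeds 100 characters`);
      }
      if (typeof boost !== 'number' || boost < 0.5 || boost > 5.0) {
        problems.push(`${lang}[${i}]: boost must be between 0.5 and 5.0`);
      }
    });
  }
  return problems;
}
```

Running this before JSON.stringify on the terminology object catches out-of-range values without spending an upload.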

Fuzzy Correction (fuzzy_correction)

Usually no manual configuration is needed; the system generates these automatically from terminology. If you need to customize them:

{
  "zh-TW": [
    { "correct": "語者分離", "incorrect": ["語這分離", "語者分力"] }
  ]
}

Translation Dictionary (translation_dict)

Ensures consistent translation of proper nouns:

[
  {
    "source": "語者分離",
    "translations": {
      "en-US": "Speaker Diarization",
      "ja-JP": "話者分離"
    }
  }
]

We recommend no more than 50 entries.


Webhook Notifications

Once you set callback_url, VAS proactively sends an HTTP POST notification to your server when processing completes or fails, removing the need to poll.

How to Configure

Add the callback_url parameter when uploading:

curl -X POST "https://vas-poc.vurbo.ai/api/v1/imports" \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "file=@meeting.mp3" \
  -F 'transcription_languages=["zh-TW"]' \
  -F "recognition_mode=multi_speaker" \
  -F "callback_url=https://your-server.com/webhooks/vas"

Events You Receive

Result   Event                                   Description
Success  recording.completed + import.completed  You receive two events
Failure  import.failed                           The import stage failed

For the complete Webhook format, signature verification, and sample code, see the Webhook Callback Guide.
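
Because a successful import produces two events, a receiver may want to wait for both before treating the import as finished. The sketch below illustrates that idea only. The names are hypothetical, and waiting for both events is a design assumption rather than a documented requirement. How the event name is carried in the webhook payload is not covered in this guide, so the function takes it as a plain string. Signature verification is left to the Webhook Callback Guide.

```javascript
// Event names come from the table above. Create one tracker per import.
function createImportTracker() {
  const seen = new Set();
  // Records an event name and returns the outcome known so far.
  return function onEvent(eventName) {
    seen.add(eventName);
    if (seen.has('import.failed')) return 'failed';
    if (seen.has('recording.completed') && seen.has('import.completed')) {
      return 'succeeded';
    }
    return 'pending';
  };
}
```

Using a set makes the result independent of the order in which the two success events arrive.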


Complete Flow Diagram

           ┌────────────────────┐
           │   check-quota      │  Check whether the budget is sufficient
           │  POST /imports/    │
           │  check-quota       │
           └────────┬───────────┘
                    │
              allowed: true?
               ╱          ╲
            Yes            No → Prompt that the budget is insufficient
              │
    ┌─────────▼──────────┐
    │    POST /imports    │  Upload the audio
    │  multipart/form-data│  (transcription_languages,
    │                     │   translation_languages,
    │                     │   recognition_mode, ...)
    └─────────┬──────────┘
              │
         HTTP 202
         import_id
              │
    ┌─────────▼──────────┐
    │  GET /imports/{id}  │  Poll the processing status
    │   every 3 to 5 s    │  (query every 3 to 5 seconds)
    └─────────┬──────────┘
              │
        status check
       ╱      │      ╲
   pending  processing  completed / failed
              │              │
         stage:          task_id ←── Processing complete
         converting          │
         transcribing   ┌────▼─────────────┐
         translating    │  Tasks API        │  View the task list
         summarizing    │  SSE History API  │  Load the transcript
                        │  SSE Audio API    │  Play the audio
                        │  SSE Retranslate  │  Retranslate
                        └──────────────────┘

Related Documents

Document                     Description
Authentication               Detailed explanation of API Key authentication
Imports API Reference        Complete specification of the audio import API
Import Progress SSE          Specification of the real-time progress tracking SSE stream
Tasks API Reference          Complete specification of the task management API
Summary Templates Reference  Summary template queries
History and Playback         How to load and play back records after an import completes

Version: V1.5.7
Last Updated: 2026-05-20

Copyright © 2026