Audio Engine

The Audio Engine is a Language Server module that runs an ordered pipeline of audio-processing components within a single synchronous HTTP request. You send one multipart request containing a JSON components array, the required user_email and project_name fields, and one or more audio files. The server then:

  • validates your API key and checks project access;
  • uploads the inputs to GCS or Azure Blob (same pattern as single-task transcription);
  • creates Dataset / FileRecord rows and a Task;
  • runs the pipeline on temp files and uploads any new audio outputs;
  • returns task_id, dataset_ids, and a result envelope.

URLs in the result envelope are signed read links to object storage (default 24-hour validity, same pattern as GET /v1/response/{output_dataset_id}), not local /tmp paths.

The route is available only when the server is built with Audio Engine enabled and the feature flag is on (see Enabling Audio Engine).


Request headers

Accept: application/json
X-API-Key: YOUR_API_KEY
Content-Type: multipart/form-data

Use the same API key authentication as other Language Server endpoints (verify_api_key on every request). The same key must be allowed for the project_name you send (validate_project_access before the task is created).


Endpoint

Method | Path | Description
POST | /v1/audio_engine/pipeline | Run an ordered pipeline; returns 200 with result on success.

Multipart form fields

Field | Required | Description
components | Yes | JSON string whose value is a non-empty array of objects (see Components JSON).
files | Yes | One or more audio file parts. The first file is the working input for the pipeline unless a component’s contract says otherwise.
user_email | Yes | Valid email for the user owning the task (stored on the created Task).
project_name | Yes | Project name; must match an existing project your X-API-Key can access.
provider | No | Reserved for future output object-store selection. Default GOOGLE.

Components JSON

Each element must be a JSON object with:

Key | Required | Type | Description
component_id | Yes | string | Canonical pipeline component id (see Component catalog).
params | No | object | Parameters for that component only. Omitted or null is treated as {}. Must be an object, not a string or array.

Order matters: the server runs components[0], then components[1], and so on. For audio-producing components, the output of index i becomes the working input for index i+1.

Example components value (stringified in the form field):

[
  {
    "component_id": "noise_reduce",
    "params": { "noise_estimation_duration_sec": 0.5 }
  }
]
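
A small client-side helper can catch shape errors before the request is sent. The sketch below is illustrative only: build_components_field is a hypothetical name, and its checks mirror the rules stated above rather than the server's actual validator.

```python
import json


def build_components_field(steps):
    """Serialize (component_id, params) pairs for the `components` form field.

    Hypothetical client-side helper; the server remains the source of truth.
    """
    if not steps:
        raise ValueError("components must be a non-empty array")
    entries = []
    for component_id, params in steps:
        if not isinstance(component_id, str) or not component_id:
            raise ValueError("component_id must be a non-empty string")
        if params is None:
            params = {}  # omitted or null params are treated as {}
        if not isinstance(params, dict):
            raise ValueError("params must be an object, not a string or array")
        entries.append({"component_id": component_id, "params": params})
    return json.dumps(entries)


field = build_components_field([
    ("noise_reduce", {"noise_estimation_duration_sec": 0.5}),
    ("end_silence_trimmer", None),
])
print(field)
```

The returned string can be passed as the components value in the multipart form.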

Files and cross-component rules

  • Limits (configurable via environment variables on the server):
      • Max files: AUDIO_ENGINE_MAX_FILE_COUNT (default 10).
      • Max size per file: AUDIO_ENGINE_MAX_FILE_SIZE_BYTES (default 100 MiB).
      • Max pipeline length: AUDIO_ENGINE_MAX_PIPELINE_STEPS (default 20).
  • speech_similarity: If this component_id appears anywhere in components, the request must include at least two uploaded files (validation rule before execution).
  • slice_audio: If a slice_audio entry is not the last in components and its params.ranges contains more than one range, the request is rejected with 400 (multiple slices must be the last pipeline entry).
  • split_audio_channel: This step must be the last entry in components (otherwise the request is rejected with 400).
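
The rules above can be approximated in a client-side preflight check, which avoids uploading large files only to receive a 400 or 413. This is a sketch under assumptions: preflight_errors and its messages are hypothetical, and the limits are the documented defaults, which a deployment may override.

```python
MIB = 1024 * 1024


def preflight_errors(components, file_sizes, max_files=10,
                     max_file_bytes=100 * MIB, max_steps=20):
    """Return a list of rule violations; an empty list means no rule was broken.

    Hypothetical client-side check. The defaults are the documented server
    defaults and may differ in a given deployment.
    """
    errors = []
    if not file_sizes:
        errors.append("at least one file is required")
    if len(file_sizes) > max_files:
        errors.append("too many files")
    if any(size > max_file_bytes for size in file_sizes):
        errors.append("file too large")
    if len(components) > max_steps:
        errors.append("pipeline too long")

    last = len(components) - 1
    for index, entry in enumerate(components):
        component_id = entry["component_id"]
        params = entry.get("params") or {}
        if component_id == "speech_similarity" and len(file_sizes) < 2:
            errors.append("speech_similarity needs at least two files")
        if component_id == "split_audio_channel" and index != last:
            errors.append("split_audio_channel must be last")
        if (component_id == "slice_audio" and index != last
                and len(params.get("ranges", [])) > 1):
            errors.append("slice_audio with multiple ranges must be last")
    return errors


pipeline = [
    {"component_id": "slice_audio", "params": {"ranges": [[0.0, 10.0], [20.0, 25.0]]}},
    {"component_id": "noise_reduce"},
]
print(preflight_errors(pipeline, file_sizes=[2 * MIB]))
```

A passing preflight does not guarantee acceptance; the server still applies its own validation.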

Successful response (200)

Top-level body:

Field | Type | Description
status | string | "COMPLETED" on success.
task_id | string | Database id of the Task created for this request (same id used for completion bookkeeping).
dataset_ids | string[] | Ordered ids: INPUT Dataset (uploaded files), then OUTPUT Dataset id when the pipeline produced at least one new audio file that was uploaded (same TaskDatasetLink pattern as other single tasks).
result | object | Pipeline result envelope (below); output_url / final_output_url are signed HTTPS GET URLs to object storage after upload (default 24h expiry).

result envelope

Field | Type | Description
component_ids | string[] | The ordered component_id values that ran.
components | object[] | One result object per pipeline entry, in order (see Per-component output).
final_output_url | string | Signed HTTPS URL of the working audio after the last component (matches the last step’s output in object storage; equals the signed input URL if no audio-producing step ran).

Per-component output

Each element of result.components always includes:

Field | Description
component_id | The id that ran for this index.

Additionally:

  • Audio-producing component (handler returns an output_path): the object includes output_url — after the request completes this is a signed object-store URL for that step’s output file (or the signed input URL if that path was only passed through).
  • Metrics-only component (handler returns metrics): the object includes a metrics field (shape depends on the component).
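
To make the response shape concrete, the following sketch flattens a 200 body into the fields a client typically reads. The function name summarize_response and every id and URL in the sample are placeholders invented for illustration, not real server output.

```python
def summarize_response(body):
    """Flatten a 200 response into the fields a client usually needs."""
    result = body["result"]
    dataset_ids = body["dataset_ids"]
    return {
        "task_id": body["task_id"],
        "input_dataset_id": dataset_ids[0],
        # An OUTPUT dataset id is present only when new audio was uploaded.
        "output_dataset_id": dataset_ids[1] if len(dataset_ids) > 1 else None,
        "final_output_url": result["final_output_url"],
        # Each step carries output_url (audio-producing) or metrics (metrics-only).
        "steps": [
            (entry["component_id"], entry.get("output_url"), entry.get("metrics"))
            for entry in result["components"]
        ],
    }


# Hypothetical response body; ids and URLs are placeholders.
sample = {
    "status": "COMPLETED",
    "task_id": "task-123",
    "dataset_ids": ["ds-in", "ds-out"],
    "result": {
        "component_ids": ["noise_reduce", "speech_duration_measurement"],
        "components": [
            {"component_id": "noise_reduce",
             "output_url": "https://storage.example/denoised.wav"},
            {"component_id": "speech_duration_measurement",
             "metrics": {"speech_duration_seconds": 12.5}},
        ],
        "final_output_url": "https://storage.example/denoised.wav",
    },
}
summary = summarize_response(sample)
print(summary["steps"])
```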

Errors

HTTP | When
400 | Invalid JSON in components, empty components, bad entry shape, unknown component_id, invalid user_email / empty project_name, slice_audio / split_audio_channel / speech_similarity rules violated, or non-object params.
403 | Missing/invalid API key (verify_api_key), key not allowed for project_name, or missing X-API-Key when resolving project access.
413 | Uploaded file exceeds the configured max size.
422 | Missing required multipart field (e.g. user_email or project_name not sent).
429 | Admission control: too many concurrent pipelines; retry with backoff.
500 | Unhandled error, component failure, or failure to upload input/output to object storage after the task was created.
501 | component_id is known to the API but no handler is registered for this deployment.
503 | Audio Engine resources not initialised (e.g. lifespan not started).
504 | Per-request pipeline timeout exceeded.
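
One way a client might act on these codes is to group them by the kind of response they call for. The grouping below is an assumption about sensible client behaviour, not something the server prescribes; only the 429 retry advice comes from the table itself, and classify_status is a hypothetical name.

```python
def classify_status(status_code):
    """Map a documented status code to a suggested client action.

    Hypothetical grouping for illustration; adjust to your own policy.
    """
    if status_code == 200:
        return "ok"
    if status_code == 429:
        return "retry_with_backoff"  # documented: admission control
    if status_code in (400, 413, 422):
        return "fix_request"         # payload, size, or missing-field problems
    if status_code == 403:
        return "check_credentials"   # API key or project access
    if status_code in (500, 501, 503, 504):
        return "server_side"         # inspect deployment, handlers, or timeouts
    return "unexpected"


for code in (200, 400, 403, 429, 501):
    print(code, classify_status(code))
```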

Enabling Audio Engine

  • Environment variable AUDIO_ENGINE_ENABLED: when set to false, the Audio Engine router is not mounted (the feature is off). Default is on in application code; confirm with your deployment.
  • Server tunables (examples): AUDIO_ENGINE_MAX_IN_FLIGHT, AUDIO_ENGINE_SEMAPHORE_WAIT_TIMEOUT, AUDIO_ENGINE_REQUEST_TIMEOUT, AUDIO_ENGINE_THREAD_POOL_WORKERS, AUDIO_ENGINE_PROCESS_POOL_WORKERS — see app/audio_engine/config.py for the full list.

How to call the API

curl

curl -sS -X POST "http://localhost:8000/v1/audio_engine/pipeline" \
  -H "X-API-Key: YOUR_API_KEY" \
  -F 'components=[{"component_id":"noise_reduce","params":{"noise_estimation_duration_sec":0.5}}]' \
  -F "user_email=user@example.com" \
  -F "project_name=my-project" \
  -F "files=@/path/to/input.wav"

Use your deployed base URL and a real WAV for noise_reduce (mono or stereo; stereo is downmixed to mono).

Speech duration (VAD metrics only):

curl -sS -X POST "http://localhost:8000/v1/audio_engine/pipeline" \
  -H "X-API-Key: YOUR_API_KEY" \
  -F 'components=[{"component_id":"speech_duration_measurement","params":{"threshold":0.5}}]' \
  -F "user_email=user@example.com" \
  -F "project_name=my-project" \
  -F "files=@/path/to/input.wav"

Python (httpx)

import httpx
import json

pipeline_components = [
    {"component_id": "noise_reduce", "params": {"noise_estimation_duration_sec": 0.5}},
]

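data = {
    "components": json.dumps(pipeline_components),
    "user_email": "user@example.com",
    "project_name": "my-project",
}

# Open the upload in a context manager so the handle is closed after the request.
with open("input.wav", "rb") as audio_file:
    r = httpx.post(
        "http://localhost:8000/v1/audio_engine/pipeline",
        headers={"X-API-Key": "YOUR_API_KEY"},
        data=data,
        files={"files": audio_file},
        timeout=300.0,  # generous client timeout; the call is synchronous
    )
r.raise_for_status()  # raises httpx.HTTPStatusError on 4xx/5xx
print(r.json())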

Chaining denoise then speech duration:

pipeline_components = [
    {"component_id": "noise_reduce", "params": {"noise_estimation_duration_sec": 0.5}},
    {
        "component_id": "speech_duration_measurement",
        "params": {
            "threshold": 0.5,
            "min_speech_duration_ms": 250,
            "min_silence_duration_ms": 100,
            "speech_pad_ms": 30,
        },
    },
]
# ... same multipart pattern as above; the speech duration is then at
# r.json()["result"]["components"][1]["metrics"]["speech_duration_seconds"]
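
The examples above upload a single file. For requests that need several files parts, httpx accepts a list of (field name, file tuple) pairs, so the same field name can be repeated. The sketch below only builds the arguments; build_multipart is a hypothetical helper, the in-memory streams stand in for real WAV files, and the "audio/wav" content type is an assumption.

```python
import io
import json


def build_multipart(components, user_email, project_name, named_files):
    """Return (data, files) arguments for httpx.post with several `files` parts.

    `named_files` is a list of (filename, binary file object) pairs.
    """
    data = {
        "components": json.dumps(components),
        "user_email": user_email,
        "project_name": project_name,
    }
    # Repeating the "files" field name sends every upload under the same key.
    files = [("files", (name, handle, "audio/wav")) for name, handle in named_files]
    return data, files


# In-memory placeholders stand in for files opened with open(path, "rb").
data, files = build_multipart(
    [{"component_id": "noise_reduce", "params": {}}],
    "user@example.com",
    "my-project",
    [("first.wav", io.BytesIO(b"...")), ("second.wav", io.BytesIO(b"..."))],
)
print([part[1][0] for part in files])
# Then: httpx.post(url, headers={"X-API-Key": "YOUR_API_KEY"}, data=data, files=files)
```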

Component catalog

The server accepts any component_id in the canonical id set below. Only components that are registered in the running server actually execute; others return 501 until implemented. See Example payloads by component for copy-paste components arrays.

component_id | Implemented | Parameters (params) | Output
noise_reduce | Yes | noise_estimation_duration_sec (float, default 0.5): length of the quietest window used for the noise profile, in seconds. | output_url (denoised mono WAV). Multi-channel input is averaged to mono before processing.
beg_silence_trimmer | Yes | manual_segments (list of [start, end] pairs): segments to adjust. | output_path (trimmed audio), metrics (adjusted segments).
end_silence_trimmer | Yes | None. | output_path (trimmed audio).
slice_audio | Yes | ranges: list of [start, end] pairs; if more than one range, this entry must be last in components. | output_urls (list of slice paths).
split_audio_channel | Yes | None. | output_urls (list of channel paths).
fix_manual_segments | Planned | (Product-specific) | (Product-specific)
speech_fluency_check | Planned | (Product-specific) | Metrics / validation fields.
check_segments | Planned | (Product-specific) | Metrics / validation fields.
audio_duration_check | Yes | min_duration_minutes (float, default 8.0). | metrics containing meets_duration_requirement.
speech_duration_measurement | Yes | threshold (0–1, default 0.5), min_speech_duration_ms, min_silence_duration_ms, speech_pad_ms (all ≥ 0; defaults 250 / 100 / 30). Unknown param keys are rejected (same as other validated components). See the speech_duration_measurement section. | metrics with speech_duration_seconds (float, sum of detected speech segments in seconds).
speech_similarity | Planned | (Product-specific) | Similarity / score fields; requires at least two uploads in the request.

Example payloads by component

Each JSON value below is a valid non-empty components array (stringify it for the multipart components form field). Combine steps in one array to chain the pipeline; respect Files and cross-component rules.

noise_reduce (mono or stereo working input, downmixed to mono; params optional — defaults match NoiseReduceParams):

[
  {
    "component_id": "noise_reduce",
    "params": { "noise_estimation_duration_sec": 0.5 }
  }
]
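
The downmix mentioned above averages the channels of each frame into one sample. The snippet below illustrates that arithmetic on a few hand-written frames; it is not the server's implementation, and downmix_to_mono is a hypothetical name.

```python
def downmix_to_mono(frames):
    """Average the channel samples of each frame into one mono sample."""
    return [sum(frame) / len(frame) for frame in frames]


# Three stereo frames as (left, right) sample pairs.
stereo = [(0.5, -0.5), (1.0, 0.0), (0.25, 0.75)]
mono = downmix_to_mono(stereo)
print(mono)  # [0.0, 0.5, 0.5]
```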

beg_silence_trimmer (manual_segments are optional [start_sec, end_sec] pairs in the original timeline; default []):

[
  {
    "component_id": "beg_silence_trimmer",
    "params": { "manual_segments": [[0.5, 2.0], [3.0, 5.0]] }
  }
]

end_silence_trimmer (no tunable params — only {} is accepted):

[
  {
    "component_id": "end_silence_trimmer",
    "params": {}
  }
]

slice_audio — empty ranges (passthrough as a single output_urls entry):

[
  {
    "component_id": "slice_audio",
    "params": { "ranges": [] }
  }
]

slice_audio — one range (may appear anywhere in the pipeline):

[
  {
    "component_id": "slice_audio",
    "params": { "ranges": [[0.0, 20.5]] }
  }
]

slice_audio — multiple ranges (each pair is [start_sec, end_sec]; must be the last components entry whenever there is more than one range):

[
  {
    "component_id": "slice_audio",
    "params": { "ranges": [[0.0, 10.0], [20.0, 25.0]] }
  }
]

split_audio_channel (stereo input; must be the last pipeline step; only {} params):

[
  {
    "component_id": "split_audio_channel",
    "params": {}
  }
]

audio_duration_check:

[
  {
    "component_id": "audio_duration_check",
    "params": { "min_duration_minutes": 8.0 }
  }
]

speech_duration_measurement — all params optional (defaults shown explicitly):

[
  {
    "component_id": "speech_duration_measurement",
    "params": {
      "threshold": 0.5,
      "min_speech_duration_ms": 250,
      "min_silence_duration_ms": 100,
      "speech_pad_ms": 30
    }
  }
]

speech_duration_measurement — defaults via empty params:

[
  {
    "component_id": "speech_duration_measurement",
    "params": {}
  }
]

Planned / not yet registered — these ids are in ALL_COMPONENT_IDS and accept params: {} today (extra="forbid"). A server without a handler returns 501 for that step. Examples:

fix_manual_segments:

[
  {
    "component_id": "fix_manual_segments",
    "params": {}
  }
]

speech_fluency_check:

[
  {
    "component_id": "speech_fluency_check",
    "params": {}
  }
]

check_segments:

[
  {
    "component_id": "check_segments",
    "params": {}
  }
]

speech_similarity (multipart request must include at least two files parts):

[
  {
    "component_id": "speech_similarity",
    "params": {}
  }
]

speech_duration_measurement

Implemented: runs Silero VAD on the working audio file, detects speech segments, and returns the total speech duration as a single scalar. The step is metrics-only: it does not write a new audio file or change the pipeline’s working buffer.

Topic | Detail
Sampling rate | The implementation reads audio at 16 kHz (Silero’s expected rate); use a format Silero can load (typically WAV).
metrics.speech_duration_seconds | Sum of (end − start) over all detected speech intervals, in seconds (can be 0.0 if no speech is detected).
Runtime deps | torch and silero-vad (pinned in the repo root pyproject.toml). The Silero model is loaded once per process and reused across requests.
Executor | Thread pool (pool_kind="thread"), consistent with other I/O- and native-heavy steps.
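
The speech_duration_seconds definition above reduces to a simple sum over intervals. As a worked illustration with made-up interval values (the (start, end) tuple layout is an assumption, not the handler's internal format):

```python
def total_speech_seconds(segments):
    """Sum (end - start) over (start, end) speech intervals given in seconds."""
    return sum((end - start for start, end in segments), 0.0)


# Two hypothetical speech intervals: 0.5-2.0 s and 3.0-5.5 s.
print(total_speech_seconds([(0.5, 2.0), (3.0, 5.5)]))  # 4.0
print(total_speech_seconds([]))                        # 0.0
```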

For parameter validation rules (ranges, extra="forbid"), see SpeechDurationMeasurementParams in app/audio_engine/payload_validator.py.

For the exact handler contracts and edge cases of implemented components, see the source under app/audio_engine/components/.


Operational notes

  • Synchronous: there is no separate poll endpoint; the full pipeline result is returned in the same HTTP response.
  • Concurrency: under load the server may respond with 429; clients should retry with backoff.
  • Timeouts: very long pipelines may hit 504; tune AUDIO_ENGINE_REQUEST_TIMEOUT on the server and client timeouts accordingly.
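
The backoff advice for 429 could be implemented along these lines. This is a sketch, not a prescribed client: call_with_retry, its delay values, and the jitter strategy are assumptions, and the send callable is simulated here so the example runs without a server.

```python
import random
import time


def call_with_retry(send, max_attempts=5, base_delay=1.0, max_delay=30.0,
                    sleep=time.sleep):
    """Call send() until it returns a status other than 429 or attempts run out.

    `send` is any zero-argument callable returning (status_code, body).
    """
    for attempt in range(max_attempts):
        status_code, body = send()
        if status_code != 429 or attempt == max_attempts - 1:
            return status_code, body
        # Full jitter: wait a random time up to an exponentially growing ceiling.
        ceiling = min(max_delay, base_delay * 2 ** attempt)
        sleep(random.uniform(0, ceiling))


# Simulated server: two 429 responses, then success.
responses = iter([(429, None), (429, None), (200, {"status": "COMPLETED"})])
waits = []
outcome = call_with_retry(lambda: next(responses), sleep=waits.append)
print(outcome)  # (200, {'status': 'COMPLETED'})
```

With real uploads, reopen or rewind the file handles inside send, because each attempt consumes the streams.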

Area | Path
HTTP router | app/audio_engine/router.py
Pipeline executor | app/audio_engine/pipeline.py
Component registry | app/audio_engine/registry.py
Multipart + per-step params validation | app/audio_engine/payload_validator.py
Canonical ids | app/audio_engine/constants.py
Configuration | app/audio_engine/config.py

Contributors adding a new component should follow Audio Engine: onboarding a new component.