Audio Engine¶
The Audio Engine is a Language Server module that runs an ordered pipeline of audio-processing components in a single synchronous HTTP request. You send one multipart request containing a JSON `components` array, the required `user_email` and `project_name` fields, and one or more audio files. The server then:
- validates your API key and checks project access;
- uploads the inputs to GCS or Azure Blob (same pattern as single-task transcription);
- creates Dataset / FileRecord rows and a Task;
- runs the pipeline on temp files and uploads any new audio outputs;
- returns `task_id`, `dataset_ids`, and a `result` envelope.

URLs in the `result` envelope are signed read links to object storage (default 24-hour validity, same pattern as `GET /v1/response/{output_dataset_id}`), not local `/tmp` paths.
The route is available only when the server is built with Audio Engine enabled and the feature flag is on (see Enabling Audio Engine).
Request headers¶
Authenticate with the `X-API-Key` header, as for other Language Server endpoints (`verify_api_key` runs on every request). The key must also be allowed for the `project_name` you send (`validate_project_access` runs before the task is created).
Endpoint¶
| Method | Path | Description |
|---|---|---|
| POST | `/v1/audio_engine/pipeline` | Run an ordered pipeline; returns 200 with `result` on success. |
Multipart form fields¶
| Field | Required | Description |
|---|---|---|
| `components` | Yes | JSON string whose value is a non-empty array of objects (see Components JSON). |
| `files` | Yes | One or more audio file parts. The first file is the working input for the pipeline unless a component’s contract says otherwise. |
| `user_email` | Yes | Valid email for the user owning the task (stored on the created Task). |
| `project_name` | Yes | Project name; must match an existing project your `X-API-Key` can access. |
| `provider` | No | Reserved for future output object-store selection. Default `GOOGLE`. |
Components JSON¶
Each element must be a JSON object with:
| Key | Required | Type | Description |
|---|---|---|---|
| `component_id` | Yes | string | Canonical pipeline component id (see Component catalog). |
| `params` | No | object | Parameters for that component only. Omitted or `null` is treated as `{}`. Must be an object, not a string or array. |
Order matters: the server runs components[0], then components[1], and so on. For audio-producing components, the output of index i becomes the working input for index i+1.
Example components value (stringified in the form field):
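As a minimal sketch, a two-step `components` value can be built and stringified in Python as shown here. The choice of steps and the parameter value are illustrative; the ids and parameter names come from the Component catalog.

```python
import json

# Illustrative two-step pipeline: denoise first, then measure speech duration.
pipeline_components = [
    {"component_id": "noise_reduce", "params": {"noise_estimation_duration_sec": 0.5}},
    # "params" is omitted here: the server treats a missing or null value as {}.
    {"component_id": "speech_duration_measurement"},
]

# The multipart form field carries the array as a JSON string.
components_field = json.dumps(pipeline_components)
print(components_field)
```

The resulting string is what goes into the `components` form field, alongside `user_email`, `project_name`, and the `files` parts.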
Files and cross-component rules¶
- Limits (configurable via environment variables on the server):
  - Max files: `AUDIO_ENGINE_MAX_FILE_COUNT` (default `10`).
  - Max size per file: `AUDIO_ENGINE_MAX_FILE_SIZE_BYTES` (default 100 MiB).
  - Max pipeline length: `AUDIO_ENGINE_MAX_PIPELINE_STEPS` (default `20`).
- `speech_similarity`: If this `component_id` appears anywhere in `components`, the request must include at least two uploaded files (validation rule before execution).
- `slice_audio`: If a `slice_audio` entry is not the last in `components` and its `params.ranges` contains more than one range, the request is rejected with `400` (multiple slices must be the last pipeline entry).
- `split_audio_channel`: This step must be the last entry in `components` (otherwise the request is rejected with `400`).
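The rules above can be checked on the client before any audio is uploaded. The following is a sketch only: `preflight_errors` is a hypothetical helper, not part of the server or any SDK, and the two constants are the documented defaults, which a deployment may override.

```python
from typing import Any

# Documented defaults; a deployment may override them via environment variables.
MAX_FILE_COUNT = 10
MAX_PIPELINE_STEPS = 20


def preflight_errors(components: list[dict[str, Any]], file_count: int) -> list[str]:
    """Return the rule violations found in a planned request (empty list if none)."""
    errors: list[str] = []
    if not components:
        errors.append("components must be a non-empty array")
    if len(components) > MAX_PIPELINE_STEPS:
        errors.append("too many pipeline steps")
    if not 1 <= file_count <= MAX_FILE_COUNT:
        errors.append("file count out of range")

    last = len(components) - 1
    for index, entry in enumerate(components):
        component_id = entry.get("component_id")
        params = entry.get("params") or {}
        if component_id == "speech_similarity" and file_count < 2:
            errors.append("speech_similarity needs at least two files")
        if component_id == "split_audio_channel" and index != last:
            errors.append("split_audio_channel must be the last entry")
        if (
            component_id == "slice_audio"
            and index != last
            and len(params.get("ranges", [])) > 1
        ):
            errors.append("slice_audio with multiple ranges must be the last entry")
    return errors
```

A check like this only saves a round trip; the server remains the authority on what is accepted.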
Successful response (200)¶
Top-level body:
| Field | Type | Description |
|---|---|---|
| `status` | string | `"COMPLETED"` on success. |
| `task_id` | string | Database id of the Task created for this request (same id used for completion bookkeeping). |
| `dataset_ids` | string[] | Ordered ids: INPUT Dataset (uploaded files), then OUTPUT Dataset id when the pipeline produced at least one new audio file that was uploaded (same TaskDatasetLink pattern as other single tasks). |
| `result` | object | Pipeline result envelope (below); `output_url` / `final_output_url` are signed HTTPS GET URLs to object storage after upload (default 24h expiry). |
result envelope¶
| Field | Type | Description |
|---|---|---|
| `component_ids` | string[] | The ordered `component_id` values that ran. |
| `components` | object[] | One result object per pipeline entry, in order (see Per-component output). |
| `final_output_url` | string | Signed HTTPS URL of the working audio after the last component (matches the last step’s output in object storage; equals the signed input URL if no audio-producing step ran). |
Per-component output¶
Each element of result.components always includes:
| Field | Description |
|---|---|
| `component_id` | The id that ran for this index. |
Additionally:
- Audio-producing component (handler returns an `output_path`): the object includes `output_url`. After the request completes, this is a signed object-store URL for that step’s output file (or the signed input URL if that path was only passed through).
- Metrics-only component (handler returns `metrics`): the object includes a `metrics` field (shape depends on the component).
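To illustrate how a client might read this envelope, the sketch below separates audio outputs from metrics by step index. The `sample_response` is hand-written to match the field tables above; its ids and URLs are placeholders, and `summarize` is a hypothetical helper.

```python
from typing import Any

# Hand-written sample shaped like the tables above; ids and URLs are placeholders.
sample_response: dict[str, Any] = {
    "status": "COMPLETED",
    "task_id": "task-123",
    "dataset_ids": ["input-ds", "output-ds"],
    "result": {
        "component_ids": ["noise_reduce", "speech_duration_measurement"],
        "components": [
            {"component_id": "noise_reduce",
             "output_url": "https://storage.example/denoised.wav"},
            {"component_id": "speech_duration_measurement",
             "metrics": {"speech_duration_seconds": 12.5}},
        ],
        "final_output_url": "https://storage.example/denoised.wav",
    },
}


def summarize(body: dict[str, Any]) -> dict[str, Any]:
    """Split per-component results into audio outputs and metrics, keyed by step index."""
    outputs: dict[int, str] = {}
    metrics: dict[int, Any] = {}
    for index, step in enumerate(body["result"]["components"]):
        if "output_url" in step:
            outputs[index] = step["output_url"]
        if "metrics" in step:
            metrics[index] = step["metrics"]
    return {
        "final": body["result"]["final_output_url"],
        "outputs": outputs,
        "metrics": metrics,
    }


summary = summarize(sample_response)
print(summary["metrics"][1]["speech_duration_seconds"])
```

Because the URLs are signed and expire (24 hours by default), a client that needs the audio later should download it rather than store the link.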
Errors¶
| HTTP | When |
|---|---|
| 400 | Invalid JSON in components, empty components, bad entry shape, unknown component_id, invalid user_email / empty project_name, slice_audio / split_audio_channel / speech_similarity rules violated, or non-object params. |
| 403 | Missing/invalid API key (verify_api_key), key not allowed for project_name, or missing X-API-Key when resolving project access. |
| 422 | Missing required multipart field (e.g. user_email or project_name not sent). |
| 413 | Uploaded file exceeds the configured max size. |
| 429 | Admission control: too many concurrent pipelines; retry with backoff. |
| 500 | Unhandled error, component failure, or failure to upload input/output to object storage after the task was created. |
| 501 | component_id is known to the API but no handler is registered for this deployment. |
| 503 | Audio Engine resources not initialised (e.g. lifespan not started). |
| 504 | Per-request pipeline timeout exceeded. |
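One possible way for a client to act on these codes is sketched below. The mapping is an assumption about reasonable client behaviour, not something the server prescribes; only the 429 retry advice comes from this page, and `client_action` is a hypothetical helper.

```python
def client_action(status_code: int) -> str:
    """Map an Audio Engine HTTP status to a coarse client-side action."""
    if status_code == 200:
        return "ok"
    if status_code == 429:
        # Admission control: the documented advice is to retry with backoff.
        return "retry_with_backoff"
    if status_code in (400, 413, 422):
        # The request itself is wrong or too large; retrying unchanged will not help.
        return "fix_request"
    if status_code == 403:
        return "check_api_key_or_project"
    if status_code == 501:
        return "component_not_registered"
    if status_code == 504:
        return "shorten_pipeline_or_raise_timeout"
    # 500, 503 and anything unexpected.
    return "server_error"


print(client_action(429))
```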
Enabling Audio Engine¶
- Environment variable `AUDIO_ENGINE_ENABLED`: when set to `false`, the Audio Engine router is not mounted (the feature is off). Default is on in application code; confirm with your deployment.
- Server tunables (examples): `AUDIO_ENGINE_MAX_IN_FLIGHT`, `AUDIO_ENGINE_SEMAPHORE_WAIT_TIMEOUT`, `AUDIO_ENGINE_REQUEST_TIMEOUT`, `AUDIO_ENGINE_THREAD_POOL_WORKERS`, `AUDIO_ENGINE_PROCESS_POOL_WORKERS`; see `app/audio_engine/config.py` for the full list.
How to call the API¶
curl¶
```bash
curl -sS -X POST "http://localhost:8000/v1/audio_engine/pipeline" \
  -H "X-API-Key: YOUR_API_KEY" \
  -F 'components=[{"component_id":"noise_reduce","params":{"noise_estimation_duration_sec":0.5}}]' \
  -F "user_email=user@example.com" \
  -F "project_name=my-project" \
  -F "files=@/path/to/input.wav"
```
Use your deployed base URL and a real WAV for noise_reduce (mono or stereo; stereo is downmixed to mono).
Speech duration (VAD metrics only):
```bash
curl -sS -X POST "http://localhost:8000/v1/audio_engine/pipeline" \
  -H "X-API-Key: YOUR_API_KEY" \
  -F 'components=[{"component_id":"speech_duration_measurement","params":{"threshold":0.5}}]' \
  -F "user_email=user@example.com" \
  -F "project_name=my-project" \
  -F "files=@/path/to/input.wav"
```
Python (httpx)¶
```python
import json

import httpx

pipeline_components = [
    {"component_id": "noise_reduce", "params": {"noise_estimation_duration_sec": 0.5}},
]
data = {
    # The components array travels as a JSON string in a form field.
    "components": json.dumps(pipeline_components),
    "user_email": "user@example.com",
    "project_name": "my-project",
}

# Open the file in a context manager so the handle is closed after the request.
with open("input.wav", "rb") as audio:
    r = httpx.post(
        "http://localhost:8000/v1/audio_engine/pipeline",
        headers={"X-API-Key": "YOUR_API_KEY"},
        data=data,
        files={"files": audio},
        timeout=300.0,
    )
r.raise_for_status()
print(r.json())
```
Chaining denoise then speech duration:
```python
pipeline_components = [
    {"component_id": "noise_reduce", "params": {"noise_estimation_duration_sec": 0.5}},
    {
        "component_id": "speech_duration_measurement",
        "params": {
            "threshold": 0.5,
            "min_speech_duration_ms": 250,
            "min_silence_duration_ms": 100,
            "speech_pad_ms": 30,
        },
    },
]
# ... same multipart pattern as above; response result.components[1]["metrics"]["speech_duration_seconds"]
```
Component catalog¶
The server accepts any component_id in the canonical id set below. Only components that are registered in the running server actually execute; others return 501 until implemented. See Example payloads by component for copy-paste components arrays.
| `component_id` | Implemented | Parameters (`params`) | Output |
|---|---|---|---|
| `noise_reduce` | Yes | `noise_estimation_duration_sec` (float, default 0.5): length of the quietest window used for the noise profile, in seconds. | `output_url` (denoised mono WAV). Multi-channel input is averaged to mono before processing. |
| `beg_silence_trimmer` | Yes | `manual_segments` (list of [start, end] pairs): segments to adjust. | `output_path` (trimmed audio), `metrics` (adjusted segments). |
| `end_silence_trimmer` | Yes | None. | `output_path` (trimmed audio). |
| `slice_audio` | Yes | `ranges`: list of [start, end] pairs; if more than one range, this entry must be last in `components`. | `output_urls` (list of slice paths). |
| `split_audio_channel` | Yes | None. | `output_urls` (list of channel paths). |
| `fix_manual_segments` | Planned | (Product-specific) | (Product-specific) |
| `speech_fluency_check` | Planned | (Product-specific) | Metrics / validation fields. |
| `check_segments` | Planned | (Product-specific) | Metrics / validation fields. |
| `audio_duration_check` | Yes | `min_duration_minutes` (float, default 8.0). | `metrics` containing `meets_duration_requirement`. |
| `speech_duration_measurement` | Yes | See speech_duration_measurement: `threshold` (0–1, default 0.5), `min_speech_duration_ms`, `min_silence_duration_ms`, `speech_pad_ms` (all ≥ 0; defaults 250 / 100 / 30). Unknown param keys are rejected (same as other validated components). | `metrics` with `speech_duration_seconds` (float, sum of detected speech segments in seconds). |
| `speech_similarity` | Planned | (Product-specific) | Similarity / score fields; requires two uploads in the request. |
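A client can use the catalog to anticipate which steps are likely to run. The sketch below copies the ids from the table above into two sets; `triage` is a hypothetical helper, and which handlers are actually registered can differ per deployment, so treat the grouping as a hint rather than a guarantee.

```python
from typing import Any

# Ids copied from the catalog table; registration can differ per deployment.
IMPLEMENTED = {
    "noise_reduce", "beg_silence_trimmer", "end_silence_trimmer", "slice_audio",
    "split_audio_channel", "audio_duration_check", "speech_duration_measurement",
}
PLANNED = {
    "fix_manual_segments", "speech_fluency_check", "check_segments", "speech_similarity",
}


def triage(components: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Group the requested ids by what the catalog says about them."""
    groups: dict[str, list[str]] = {"implemented": [], "planned": [], "unknown": []}
    for entry in components:
        component_id = entry.get("component_id", "")
        if component_id in IMPLEMENTED:
            groups["implemented"].append(component_id)
        elif component_id in PLANNED:
            groups["planned"].append(component_id)  # may return 501
        else:
            groups["unknown"].append(component_id)  # rejected with 400
    return groups


print(triage([{"component_id": "noise_reduce"}, {"component_id": "check_segments"}]))
```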
Example payloads by component¶
Each JSON value below is a valid non-empty components array (stringify it for the multipart components form field). Combine steps in one array to chain the pipeline; respect Files and cross-component rules.
noise_reduce (mono or stereo working input, downmixed to mono; params optional — defaults match NoiseReduceParams):
beg_silence_trimmer (manual_segments are optional [start_sec, end_sec] pairs in the original timeline; default []):
```json
[
  {
    "component_id": "beg_silence_trimmer",
    "params": { "manual_segments": [[0.5, 2.0], [3.0, 5.0]] }
  }
]
```
end_silence_trimmer (no tunable params — only {} is accepted):
slice_audio — empty ranges (passthrough as a single output_urls entry):
slice_audio — one range (may appear anywhere in the pipeline):
slice_audio — multiple ranges (each pair is [start_sec, end_sec]; must be the last components entry whenever there is more than one range):
split_audio_channel (stereo input; must be the last pipeline step; only {} params):
audio_duration_check:
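As a minimal sketch, the payloads for the entries listed above (`end_silence_trimmer` through `audio_duration_check`) can be written as Python structures and stringified for the form field. The range values are illustrative, and `8.0` is simply the documented default for `min_duration_minutes`.

```python
import json

# Range values below are illustrative; 8.0 is the documented default.
example_payloads = {
    "end_silence_trimmer": [
        {"component_id": "end_silence_trimmer", "params": {}},
    ],
    "slice_audio_passthrough": [
        {"component_id": "slice_audio", "params": {"ranges": []}},
    ],
    "slice_audio_one_range": [
        {"component_id": "slice_audio", "params": {"ranges": [[0.0, 5.0]]}},
    ],
    # More than one range: this entry must be the last one in the pipeline.
    "slice_audio_multiple_ranges": [
        {"component_id": "slice_audio", "params": {"ranges": [[0.0, 5.0], [10.0, 15.0]]}},
    ],
    # Must be the last pipeline step; takes no params.
    "split_audio_channel": [
        {"component_id": "split_audio_channel", "params": {}},
    ],
    "audio_duration_check": [
        {"component_id": "audio_duration_check", "params": {"min_duration_minutes": 8.0}},
    ],
}

for name, components in example_payloads.items():
    print(name, json.dumps(components))
```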
speech_duration_measurement — all params optional (defaults shown explicitly):
```json
[
  {
    "component_id": "speech_duration_measurement",
    "params": {
      "threshold": 0.5,
      "min_speech_duration_ms": 250,
      "min_silence_duration_ms": 100,
      "speech_pad_ms": 30
    }
  }
]
```
speech_duration_measurement — defaults via empty params:
Planned / not yet registered — these ids are in ALL_COMPONENT_IDS and accept params: {} today (extra="forbid"). A server without a handler returns 501 for that step. Examples:
fix_manual_segments:
speech_fluency_check:
check_segments:
speech_similarity (multipart request must include at least two files parts):
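Since the planned ids accept only `params: {}` today, their payloads all share one shape. The sketch below generates them in Python; the list of ids is taken from this section, and nothing about their future parameters is implied.

```python
import json

# Planned ids named in this section; each currently accepts only empty params.
PLANNED_IDS = [
    "fix_manual_segments",
    "speech_fluency_check",
    "check_segments",
    "speech_similarity",  # the request must also carry at least two files parts
]

# One stringified components value per planned id.
planned_payloads = {
    component_id: json.dumps([{"component_id": component_id, "params": {}}])
    for component_id in PLANNED_IDS
}

for component_id, payload in planned_payloads.items():
    print(component_id, payload)
```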
speech_duration_measurement¶
Implemented: runs Silero VAD on the working audio file, detects speech segments, and returns the total speech duration as a single scalar. The step is metrics-only: it does not write a new audio file or change the pipeline’s working buffer.
| Topic | Detail |
|---|---|
| Sampling rate | The implementation reads audio at 16 kHz (Silero’s expected rate); use a format Silero can load (typically WAV). |
| `metrics.speech_duration_seconds` | Sum of (end − start) over all detected speech intervals, in seconds (can be 0.0 if no speech is detected). |
| Runtime deps | `torch` and `silero-vad` (pinned in the repo root `pyproject.toml`). The Silero model is loaded once per process and reused across requests. |
| Executor | Thread pool (`pool_kind="thread"`), consistent with other I/O- and native-heavy steps. |
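The `speech_duration_seconds` definition in the table amounts to a simple sum over intervals. The sketch below shows that arithmetic only, assuming the intervals are already expressed in seconds; it does not call Silero VAD, and `total_speech_seconds` is a hypothetical name.

```python
def total_speech_seconds(intervals: list[tuple[float, float]]) -> float:
    """Sum (end - start) over speech intervals given in seconds."""
    return float(sum(end - start for start, end in intervals))


# Two detected segments: 1.5 s and 2.5 s of speech.
print(total_speech_seconds([(0.5, 2.0), (3.0, 5.5)]))
# No speech detected yields 0.0.
print(total_speech_seconds([]))
```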
For parameter validation rules (ranges, extra="forbid"), see SpeechDurationMeasurementParams in app/audio_engine/payload_validator.py.
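The documented rules (defaults, ranges, unknown keys rejected) can be mirrored on the client as a rough pre-check. This is a sketch under assumptions: `resolve_params` is a hypothetical helper, it is not the server's `SpeechDurationMeasurementParams` model, and treating the `threshold` bounds as inclusive is a guess.

```python
from typing import Any, Optional

# Documented defaults for speech_duration_measurement.
DEFAULTS = {
    "threshold": 0.5,
    "min_speech_duration_ms": 250,
    "min_silence_duration_ms": 100,
    "speech_pad_ms": 30,
}


def resolve_params(params: Optional[dict[str, Any]]) -> dict[str, Any]:
    """Apply defaults and reject values the documented rules would not allow."""
    params = params or {}  # omitted or null params behave like {}
    unknown = set(params) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown params: {sorted(unknown)}")
    resolved = {**DEFAULTS, **params}
    # Assumption: both ends of the 0-1 range are accepted.
    if not 0 <= resolved["threshold"] <= 1:
        raise ValueError("threshold must be between 0 and 1")
    for key in ("min_speech_duration_ms", "min_silence_duration_ms", "speech_pad_ms"):
        if resolved[key] < 0:
            raise ValueError(f"{key} must be >= 0")
    return resolved


print(resolve_params({"threshold": 0.6}))
```

Calling `resolve_params({})` or `resolve_params(None)` returns the defaults unchanged, which matches sending an empty `params` object.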
For the exact handler contracts and edge cases of implemented components, see the source under app/audio_engine/components/.
Operational notes¶
- Synchronous: there is no separate poll endpoint; the full pipeline result is returned in the same HTTP response.
- Concurrency: under load the server may respond with 429; clients should retry with backoff.
- Timeouts: very long pipelines may hit 504; tune `AUDIO_ENGINE_REQUEST_TIMEOUT` on the server and client timeouts accordingly.
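The retry-with-backoff advice for 429 can be sketched as a small wrapper. Everything here is illustrative: `call_with_backoff` is a hypothetical helper, the attempt count and delays are arbitrary, and `send` stands for any callable that performs the request and returns the HTTP status code.

```python
import time
from typing import Callable


def call_with_backoff(
    send: Callable[[], int],
    max_attempts: int = 4,
    base_delay_sec: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a status other than 429 or attempts run out."""
    status = send()
    for attempt in range(1, max_attempts):
        if status != 429:
            break
        # Exponential backoff: base, 2x base, 4x base, ...
        sleep(base_delay_sec * 2 ** (attempt - 1))
        status = send()
    return status


# Simulated server: busy twice, then successful. No real waiting in this demo.
statuses = iter([429, 429, 200])
delays: list[float] = []
final_status = call_with_backoff(lambda: next(statuses), sleep=delays.append)
print(final_status, delays)
```

With httpx, `send` would wrap the `httpx.post(...)` call shown earlier and return `r.status_code`; the audio file needs to be reopened (or rewound) on each attempt because the upload consumes the stream.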
Related code¶
| Area | Path |
|---|---|
| HTTP router | `app/audio_engine/router.py` |
| Pipeline executor | `app/audio_engine/pipeline.py` |
| Component registry | `app/audio_engine/registry.py` |
| Multipart + per-step `params` validation | `app/audio_engine/payload_validator.py` |
| Canonical ids | `app/audio_engine/constants.py` |
| Configuration | `app/audio_engine/config.py` |
Contributors adding a new component should follow Audio Engine: onboarding a new component.