Audio Engine¶

The Audio Engine is a Language Server module that runs an ordered pipeline of audio-processing components in a single synchronous HTTP request. You send one multipart request with a JSON components array, the required user_email and project_name fields, and one or more audio files. The server validates your API key, checks project access, uploads the inputs to GCS or Azure Blob (same pattern as single-task transcription), creates Dataset / FileRecord rows and a Task, runs the pipeline on temp files, and uploads any new audio outputs. The response contains task_id, dataset_ids, and a result envelope. URLs in that envelope are signed read links to object storage (default 24-hour validity, same pattern as GET /v1/response/{output_dataset_id}), not local /tmp paths.

The route is available only when the server is built with Audio Engine enabled and the feature flag is on (see Enabling Audio Engine).

Request headers¶

Accept: application/json
X-API-Key: YOUR_API_KEY
Content-Type: multipart/form-data

Use the same API key authentication as other Language Server endpoints (verify_api_key on every request). The same key must be allowed for the project_name you send (validate_project_access before the task is created).

Endpoint¶

Method	Path	Description
`POST`	`/v1/audio_engine/pipeline`	Run an ordered pipeline; returns `200` with `result` on success.

Multipart form fields¶

Field	Required	Description
`components`	Yes	JSON string whose value is a non-empty array of objects (see Components JSON).
`files`	Yes	One or more audio file parts. The first file is the working input for the pipeline unless a component’s contract says otherwise.
`user_email`	Yes	Valid email for the user owning the task (stored on the created `Task`).
`project_name`	Yes	Project name; must match an existing project your `X-API-Key` can access.
`provider`	No	Reserved for future output object-store selection. Default `GOOGLE`.

Components JSON¶

Each element must be a JSON object with:

Key	Required	Type	Description
`component_id`	Yes	string	Canonical pipeline component id (see Component catalog).
`params`	No	object	Parameters for that component only. Omitted or `null` is treated as `{}`. Must be an object, not a string or array.

Order matters: the server runs components[0], then components[1], and so on. For audio-producing components, the output of index i becomes the working input for index i+1.

Example components value (stringified in the form field):

[
  {
    "component_id": "noise_reduce",
    "params": { "noise_estimation_duration_sec": 0.5 }
  }
]
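As a minimal sketch of how a client might build the components form field, the helpers below assemble ordered entries and serialize them to a JSON string. The names make_step and encode_components are hypothetical and not part of the server or any SDK; the checks only mirror the rules stated above (non-empty array, params must be an object, omitted params treated as {}).

```python
import json
from typing import Any, Optional


def make_step(component_id: str, params: Optional[dict] = None) -> dict:
    """Return one pipeline entry; omitted params become {} as the server treats them."""
    if params is not None and not isinstance(params, dict):
        raise TypeError("params must be a JSON object (dict), not a string or array")
    return {"component_id": component_id, "params": params or {}}


def encode_components(steps: list) -> str:
    """Serialize an ordered, non-empty list of steps for the `components` form field."""
    if not steps:
        raise ValueError("components must be a non-empty array")
    return json.dumps(steps)


# Order matters: noise_reduce runs first, then speech_duration_measurement.
components_field = encode_components([
    make_step("noise_reduce", {"noise_estimation_duration_sec": 0.5}),
    make_step("speech_duration_measurement"),
])
print(components_field)
```

The resulting string is what goes into the components multipart field; the server still performs its own validation.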

Files and cross-component rules¶

Limits (configurable via environment variables on the server):
Max files: AUDIO_ENGINE_MAX_FILE_COUNT (default 10).
Max size per file: AUDIO_ENGINE_MAX_FILE_SIZE_BYTES (default 100 MiB).
Max pipeline length: AUDIO_ENGINE_MAX_PIPELINE_STEPS (default 20).

Cross-component rules (validated before execution):
speech_similarity: if this component_id appears anywhere in components, the request must include at least two uploaded files.
slice_audio: if a slice_audio entry is not the last in components and its params.ranges contains more than one range, the request is rejected with 400. A multi-range slice must be the last pipeline entry.
split_audio_channel: this step must be the last entry in components; otherwise the request is rejected with 400.
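The limits and ordering rules above can be checked on the client before any bytes are uploaded. The sketch below is an illustration only: preflight_errors is a hypothetical helper, and the constants copy the documented defaults, which a given deployment may override through its environment variables.

```python
# Documented defaults; a deployment may configure different values.
MAX_FILE_COUNT = 10                      # AUDIO_ENGINE_MAX_FILE_COUNT
MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024  # AUDIO_ENGINE_MAX_FILE_SIZE_BYTES (100 MiB)
MAX_PIPELINE_STEPS = 20                  # AUDIO_ENGINE_MAX_PIPELINE_STEPS


def preflight_errors(components: list, file_sizes: list) -> list:
    """Return human-readable problems; an empty list means these checks pass."""
    errors = []
    if not components:
        errors.append("components must be a non-empty array")
    if len(components) > MAX_PIPELINE_STEPS:
        errors.append(f"pipeline has more than {MAX_PIPELINE_STEPS} steps")
    if not file_sizes:
        errors.append("at least one file is required")
    if len(file_sizes) > MAX_FILE_COUNT:
        errors.append(f"more than {MAX_FILE_COUNT} files")
    for index, size in enumerate(file_sizes):
        if size > MAX_FILE_SIZE_BYTES:
            errors.append(f"file {index} exceeds {MAX_FILE_SIZE_BYTES} bytes")

    last = len(components) - 1
    for index, step in enumerate(components):
        component_id = step.get("component_id")
        params = step.get("params") or {}
        if component_id == "speech_similarity" and len(file_sizes) < 2:
            errors.append("speech_similarity needs at least two uploaded files")
        if component_id == "split_audio_channel" and index != last:
            errors.append("split_audio_channel must be the last entry")
        if (component_id == "slice_audio" and index != last
                and len(params.get("ranges", [])) > 1):
            errors.append("slice_audio with multiple ranges must be the last entry")
    return errors


multi_slice = {"component_id": "slice_audio", "params": {"ranges": [[0.0, 10.0], [20.0, 25.0]]}}
print(preflight_errors([multi_slice, {"component_id": "noise_reduce"}], [1024]))
```

A client-side check like this only saves a round trip; the server remains the source of truth and may reject requests for reasons not covered here.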

Successful response (`200`)¶

Top-level body:

Field	Type	Description
`status`	string	`"COMPLETED"` on success.
`task_id`	string	Database id of the `Task` created for this request (same id used for completion bookkeeping).
`dataset_ids`	string[]	Ordered ids: INPUT `Dataset` (uploaded files), then OUTPUT `Dataset` id when the pipeline produced at least one new audio file that was uploaded (same `TaskDatasetLink` pattern as other single tasks).
`result`	object	Pipeline result envelope (below); `output_url` / `final_output_url` are signed HTTPS GET URLs to object storage after upload (default 24h expiry).

`result` envelope¶

Field	Type	Description
`component_ids`	string[]	The ordered `component_id` values that ran.
`components`	object[]	One result object per pipeline entry, in order (see Per-component output).
`final_output_url`	string	Signed HTTPS URL of the working audio after the last component (matches the last step’s output in object storage; equals the signed input URL if no audio-producing step ran).

Per-component output¶

Each element of result.components always includes:

Field	Description
`component_id`	The id that ran for this index.

Additionally:

Audio-producing component (handler returns an output_path): the object includes output_url — after the request completes this is a signed object-store URL for that step’s output file (or the signed input URL if that path was only passed through).
Metrics-only component (handler returns metrics): the object includes a metrics field (shape depends on the component).
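To make the response shape concrete, here is a small sketch that walks a decoded 200 body and separates audio outputs from metrics. The helper summarize_result and every value in sample_body (ids, URLs, numbers) are made up for illustration; only the field names come from the tables above.

```python
def summarize_result(body: dict) -> dict:
    """Collect per-step output URLs and metrics from a decoded 200 response body."""
    result = body["result"]
    dataset_ids = body.get("dataset_ids", [])
    summary = {
        "task_id": body["task_id"],
        # First id is the INPUT dataset; an OUTPUT id is present only if new audio was uploaded.
        "input_dataset_id": dataset_ids[0] if dataset_ids else None,
        "output_dataset_id": dataset_ids[1] if len(dataset_ids) > 1 else None,
        "final_output_url": result.get("final_output_url"),
        "audio_outputs": {},
        "metrics": {},
    }
    for index, step in enumerate(result["components"]):
        key = f"{index}:{step['component_id']}"
        if "output_url" in step:   # audio-producing step
            summary["audio_outputs"][key] = step["output_url"]
        if "metrics" in step:      # metrics-only (or metrics-bearing) step
            summary["metrics"][key] = step["metrics"]
    return summary


# Hypothetical body with placeholder ids and URLs.
sample_body = {
    "status": "COMPLETED",
    "task_id": "task-123",
    "dataset_ids": ["ds-in", "ds-out"],
    "result": {
        "component_ids": ["noise_reduce", "speech_duration_measurement"],
        "components": [
            {"component_id": "noise_reduce", "output_url": "https://storage.example/denoised.wav"},
            {"component_id": "speech_duration_measurement", "metrics": {"speech_duration_seconds": 12.5}},
        ],
        "final_output_url": "https://storage.example/denoised.wav",
    },
}
summary = summarize_result(sample_body)
print(summary["audio_outputs"], summary["metrics"])
```

Because the URLs are signed and expire (24 hours by default), a client that needs the audio later should download it or store the dataset ids rather than the URLs.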

Errors¶

HTTP	When
400	Invalid JSON in `components`, empty `components`, bad entry shape, unknown `component_id`, invalid `user_email` / empty `project_name`, `slice_audio` / `split_audio_channel` / `speech_similarity` rules violated, or non-object `params`.
403	Missing/invalid API key (`verify_api_key`), key not allowed for `project_name`, or missing `X-API-Key` when resolving project access.
413	Uploaded file exceeds the configured max size.
422	Missing required multipart field (e.g. `user_email` or `project_name` not sent).
429	Admission control: too many concurrent pipelines; retry with backoff.
500	Unhandled error, component failure, or failure to upload input/output to object storage after the task was created.
501	`component_id` is known to the API but no handler is registered for this deployment.
503	Audio Engine resources not initialised (e.g. lifespan not started).
504	Per-request pipeline timeout exceeded.
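One way a client might act on the table above is to map each status code to a coarse follow-up action. The category names below are hypothetical, and the grouping is a suggestion rather than server behaviour; only the retry advice for 429 is stated in this document.

```python
def classify_status(status_code: int) -> str:
    """Map an Audio Engine HTTP status to a coarse client-side action."""
    if status_code == 200:
        return "ok"
    if status_code == 429:
        return "retry_with_backoff"      # admission control, documented as retryable
    if status_code in (400, 413, 422):
        return "fix_request"             # payload, file size, or missing form field
    if status_code == 403:
        return "check_api_key_and_project"
    if status_code == 504:
        return "raise_timeout_or_shorten_pipeline"
    if status_code in (500, 501, 503):
        return "server_side"             # component failure, no handler, or not initialised
    return "unknown"


for code in (200, 400, 403, 413, 422, 429, 500, 501, 503, 504):
    print(code, classify_status(code))
```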

Enabling Audio Engine¶

Environment variable AUDIO_ENGINE_ENABLED: when set to false, the Audio Engine router is not mounted (the feature is off). Default is on in application code; confirm with your deployment.
Server tunables (examples): AUDIO_ENGINE_MAX_IN_FLIGHT, AUDIO_ENGINE_SEMAPHORE_WAIT_TIMEOUT, AUDIO_ENGINE_REQUEST_TIMEOUT, AUDIO_ENGINE_THREAD_POOL_WORKERS, AUDIO_ENGINE_PROCESS_POOL_WORKERS — see app/audio_engine/config.py for the full list.
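The flag semantics described above (on by default, off when set to false) could be read roughly as follows. This is an assumption-laden sketch, not the code in app/audio_engine/config.py: the function name is hypothetical, and treating the value case-insensitively is a guess about how the real parser behaves.

```python
import os
from typing import Mapping, Optional


def audio_engine_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return False only when AUDIO_ENGINE_ENABLED is set to 'false' (any casing)."""
    env = os.environ if environ is None else environ
    raw = env.get("AUDIO_ENGINE_ENABLED")
    if raw is None:
        return True  # documented default: the feature is on
    return raw.strip().lower() != "false"


print(audio_engine_enabled({}))
print(audio_engine_enabled({"AUDIO_ENGINE_ENABLED": "false"}))
```

Since the actual parsing rules are not documented here, confirm how your deployment interprets other spellings such as 0 or no before relying on them.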

How to call the API¶

curl¶

curl -sS -X POST "http://localhost:8000/v1/audio_engine/pipeline" \
  -H "X-API-Key: YOUR_API_KEY" \
  -F 'components=[{"component_id":"noise_reduce","params":{"noise_estimation_duration_sec":0.5}}]' \
  -F "user_email=user@example.com" \
  -F "project_name=my-project" \
  -F "files=@/path/to/input.wav"

Use your deployed base URL and a real WAV for noise_reduce (mono or stereo; stereo is downmixed to mono).

Speech duration (VAD metrics only):

curl -sS -X POST "http://localhost:8000/v1/audio_engine/pipeline" \
  -H "X-API-Key: YOUR_API_KEY" \
  -F 'components=[{"component_id":"speech_duration_measurement","params":{"threshold":0.5}}]' \
  -F "user_email=user@example.com" \
  -F "project_name=my-project" \
  -F "files=@/path/to/input.wav"

Python (httpx)¶

import httpx
import json

pipeline_components = [
    {"component_id": "noise_reduce", "params": {"noise_estimation_duration_sec": 0.5}},
]

data = {
    "components": json.dumps(pipeline_components),
    "user_email": "user@example.com",
    "project_name": "my-project",
}

# Open the upload in a context manager so the file handle is closed after the request.
with open("input.wav", "rb") as audio_file:
    r = httpx.post(
        "http://localhost:8000/v1/audio_engine/pipeline",
        headers={"X-API-Key": "YOUR_API_KEY"},
        data=data,
        files={"files": audio_file},
        timeout=300.0,
    )
r.raise_for_status()
print(r.json())
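When a request carries several audio files, httpx accepts a list of (field name, file tuple) pairs, so the files field name can repeat once per upload. The sketch below shows one way to do that; build_file_parts and run_pipeline are hypothetical helper names, and the audio/wav content type is an assumption about the inputs.

```python
import json
from contextlib import ExitStack
from pathlib import Path


def build_file_parts(paths, stack: ExitStack) -> list:
    """Open each path and return httpx-style parts that all use the `files` field name."""
    parts = []
    for path in map(Path, paths):
        handle = stack.enter_context(path.open("rb"))  # closed when the stack exits
        parts.append(("files", (path.name, handle, "audio/wav")))
    return parts


def run_pipeline(base_url, api_key, components, paths, user_email, project_name):
    """POST one pipeline request with any number of files and return the decoded body."""
    import httpx  # imported lazily so build_file_parts works without httpx installed

    with ExitStack() as stack:
        response = httpx.post(
            f"{base_url}/v1/audio_engine/pipeline",
            headers={"X-API-Key": api_key},
            data={
                "components": json.dumps(components),
                "user_email": user_email,
                "project_name": project_name,
            },
            files=build_file_parts(paths, stack),
            timeout=300.0,
        )
    response.raise_for_status()
    return response.json()
```

The order of paths is preserved in the request, which matters because the first file is the working input for the pipeline.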

Chaining denoise then speech duration:

pipeline_components = [
    {"component_id": "noise_reduce", "params": {"noise_estimation_duration_sec": 0.5}},
    {
        "component_id": "speech_duration_measurement",
        "params": {
            "threshold": 0.5,
            "min_speech_duration_ms": 250,
            "min_silence_duration_ms": 100,
            "speech_pad_ms": 30,
        },
    },
]
# Send with the same multipart pattern as above. The speech total is then at
# r.json()["result"]["components"][1]["metrics"]["speech_duration_seconds"].

Component catalog¶

The server accepts any component_id in the canonical id set below. Only components that are registered in the running server actually execute; others return 501 until implemented. See Example payloads by component for copy-paste components arrays.

`component_id`	Implemented	Parameters (`params`)	Output
`noise_reduce`	Yes	`noise_estimation_duration_sec` (float, default `0.5`) — length of the quietest window used for the noise profile, in seconds.	`output_url` (denoised mono WAV). Multi-channel input is averaged to mono before processing.
`beg_silence_trimmer`	Yes	`manual_segments` (list of [start, end] pairs in seconds, default `[]`): segments to adjust.	`output_url` (trimmed audio), `metrics` (adjusted segments).
`end_silence_trimmer`	Yes	None.	`output_url` (trimmed audio).
`slice_audio`	Yes	`ranges` — list of [start, end] pairs; if more than one range, this entry must be last in `components`.	`output_urls` (list of slice paths).
`split_audio_channel`	Yes	None.	`output_urls` (list of channel paths).
`fix_manual_segments`	Planned	(Product-specific)	(Product-specific)
`speech_fluency_check`	Planned	(Product-specific)	Metrics / validation fields.
`check_segments`	Planned	(Product-specific)	Metrics / validation fields.
`audio_duration_check`	Yes	`min_duration_minutes` (float, default `8.0`).	`metrics` containing `meets_duration_requirement`.
`speech_duration_measurement`	Yes	See speech_duration_measurement — `threshold` (0–1, default `0.5`), `min_speech_duration_ms`, `min_silence_duration_ms`, `speech_pad_ms` (all ≥ 0; defaults `250` / `100` / `30`). Unknown param keys are rejected (same as other validated components).	`metrics` with `speech_duration_seconds` (float, sum of detected speech segments in seconds).
`speech_similarity`	Planned	(Product-specific)	Similarity / score fields; requires at least two uploads in the request.

Example payloads by component¶

Each JSON value below is a valid non-empty components array (stringify it for the multipart components form field). Combine steps in one array to chain the pipeline; respect Files and cross-component rules.

noise_reduce (mono or stereo working input, downmixed to mono; params optional — defaults match NoiseReduceParams):

[
  {
    "component_id": "noise_reduce",
    "params": { "noise_estimation_duration_sec": 0.5 }
  }
]

beg_silence_trimmer (manual_segments are optional [start_sec, end_sec] pairs in the original timeline; default []):

[
  {
    "component_id": "beg_silence_trimmer",
    "params": { "manual_segments": [[0.5, 2.0], [3.0, 5.0]] }
  }
]

end_silence_trimmer (no tunable params — only {} is accepted):

[
  {
    "component_id": "end_silence_trimmer",
    "params": {}
  }
]

slice_audio — empty ranges (passthrough as a single output_urls entry):

[
  {
    "component_id": "slice_audio",
    "params": { "ranges": [] }
  }
]

slice_audio — one range (may appear anywhere in the pipeline):

[
  {
    "component_id": "slice_audio",
    "params": { "ranges": [[0.0, 20.5]] }
  }
]

slice_audio — multiple ranges (each pair is [start_sec, end_sec]; must be the last components entry whenever there is more than one range):

[
  {
    "component_id": "slice_audio",
    "params": { "ranges": [[0.0, 10.0], [20.0, 25.0]] }
  }
]

split_audio_channel (stereo input; must be the last pipeline step; only {} params):

[
  {
    "component_id": "split_audio_channel",
    "params": {}
  }
]

audio_duration_check:

[
  {
    "component_id": "audio_duration_check",
    "params": { "min_duration_minutes": 8.0 }
  }
]

speech_duration_measurement — all params optional (defaults shown explicitly):

[
  {
    "component_id": "speech_duration_measurement",
    "params": {
      "threshold": 0.5,
      "min_speech_duration_ms": 250,
      "min_silence_duration_ms": 100,
      "speech_pad_ms": 30
    }
  }
]

speech_duration_measurement — defaults via empty params:

[
  {
    "component_id": "speech_duration_measurement",
    "params": {}
  }
]

Planned / not yet registered: these ids are in ALL_COMPONENT_IDS and currently accept only an empty params object (extra="forbid" rejects any key). A server without a handler returns 501 for that step. Examples:

fix_manual_segments:

[
  {
    "component_id": "fix_manual_segments",
    "params": {}
  }
]

speech_fluency_check:

[
  {
    "component_id": "speech_fluency_check",
    "params": {}
  }
]

check_segments:

[
  {
    "component_id": "check_segments",
    "params": {}
  }
]

speech_similarity (multipart request must include at least two files parts):

[
  {
    "component_id": "speech_similarity",
    "params": {}
  }
]

speech_duration_measurement¶

Implemented: runs Silero VAD on the working audio file, detects speech segments, and returns the total speech duration as a single scalar. The step is metrics-only: it does not write a new audio file or change the pipeline’s working buffer.

Topic	Detail
Sampling rate	The implementation reads audio at 16 kHz (Silero’s expected rate); use a format Silero can load (typically WAV).
`metrics.speech_duration_seconds`	Sum of `(end − start)` over all detected speech intervals, in seconds (can be `0.0` if no speech is detected).
Runtime deps	`torch` and `silero-vad` (pinned in the repo root `pyproject.toml`). The Silero model is loaded once per process and reused across requests.
Executor	Thread pool (`pool_kind="thread"`), consistent with other I/O- and native-heavy steps.
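The metric definition in the table reduces to a sum over detected intervals. The sketch below shows that arithmetic on plain (start, end) pairs in seconds; it does not call Silero, and the input shape and function name are assumptions made for illustration.

```python
def total_speech_seconds(intervals) -> float:
    """Sum (end - start) over speech intervals given as (start_sec, end_sec) pairs."""
    return float(sum(end - start for start, end in intervals))


# Two detected segments: 1.5 s and 1.25 s of speech.
print(total_speech_seconds([(0.5, 2.0), (3.0, 4.25)]))
# No speech detected yields 0.0.
print(total_speech_seconds([]))
```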

For parameter validation rules (ranges, extra="forbid"), see SpeechDurationMeasurementParams in app/audio_engine/payload_validator.py.
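For readers without access to the source, the documented rules (threshold between 0 and 1, the three millisecond values at least 0, unknown keys rejected, defaults 0.5 / 250 / 100 / 30) can be approximated on the client as below. This is a plain-Python sketch, not the SpeechDurationMeasurementParams model itself; the function name is hypothetical and the server's error messages will differ.

```python
SPEECH_DURATION_DEFAULTS = {
    "threshold": 0.5,
    "min_speech_duration_ms": 250,
    "min_silence_duration_ms": 100,
    "speech_pad_ms": 30,
}


def normalize_speech_duration_params(params=None) -> dict:
    """Apply defaults and mirror the documented checks for speech_duration_measurement."""
    params = {} if params is None else params
    if not isinstance(params, dict):
        raise TypeError("params must be an object")
    unknown = set(params) - set(SPEECH_DURATION_DEFAULTS)
    if unknown:
        # Mirrors extra="forbid": any key outside the known set is rejected.
        raise ValueError(f"unknown param keys: {sorted(unknown)}")
    merged = {**SPEECH_DURATION_DEFAULTS, **params}
    if not 0 <= merged["threshold"] <= 1:
        raise ValueError("threshold must be between 0 and 1")
    for key in ("min_speech_duration_ms", "min_silence_duration_ms", "speech_pad_ms"):
        if merged[key] < 0:
            raise ValueError(f"{key} must be >= 0")
    return merged


print(normalize_speech_duration_params({"threshold": 0.6}))
```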

For the exact handler contracts and edge cases of implemented components, see the source under app/audio_engine/components/.

Operational notes¶

Synchronous: there is no separate poll endpoint; the full pipeline result is returned in the same HTTP response.
Concurrency: under load the server may respond with 429; clients should retry with backoff.
Timeouts: very long pipelines may hit 504; tune AUDIO_ENGINE_REQUEST_TIMEOUT on the server and client timeouts accordingly.
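The retry advice for 429 can be sketched as capped exponential backoff. Everything here is illustrative: the helper names, the base and cap values, and the optional jitter are assumptions, and send stands in for whatever function issues the HTTP request and returns a (status, body) pair.

```python
import random
import time


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0,
                   jitter: float = 0.0, rng=None) -> list:
    """Return `attempts` delays that double each time, capped, with optional jitter."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        delay = min(cap, base * 2 ** attempt)
        delays.append(delay + rng.uniform(0, jitter * delay))
    return delays


def call_with_retry(send, delays, sleep=time.sleep):
    """Call send() and retry after each delay while the status is 429."""
    status, body = send()
    for delay in delays:
        if status != 429:
            break
        sleep(delay)
        status, body = send()
    return status, body


print(backoff_delays(6))
```

Keep the client timeout at least as long as the server's AUDIO_ENGINE_REQUEST_TIMEOUT, otherwise the client may give up on a pipeline that is still running.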

Relevant source files:

Area	Path
HTTP router	`app/audio_engine/router.py`
Pipeline executor	`app/audio_engine/pipeline.py`
Component registry	`app/audio_engine/registry.py`
Multipart + per-step `params` validation	`app/audio_engine/payload_validator.py`
Canonical ids	`app/audio_engine/constants.py`
Configuration	`app/audio_engine/config.py`

Contributors adding a new component should follow Audio Engine: onboarding a new component.