
Batch persistent worker

Run a long-lived HTTP transcription worker that accepts multiple jobs without restarting, reducing turnaround time and improving CPU/GPU utilisation.

Available from version 15.7.0

A batch persistent worker (also called an HTTP batch worker) is a long-running transcription service that loads the ASR models once at startup and then accepts jobs over an HTTP API for the lifetime of the container. Unlike standard batch containers — which start up, process a single job, and exit — a persistent worker stays alive indefinitely, serving jobs as they arrive.

This gives you:

No per-job cold start. The models are loaded into memory once. Every subsequent job skips the startup cost entirely.
Concurrent processing. The --parallel flag controls how many processing units (called engines in this document) the worker can run simultaneously. An individual job can also be assigned multiple engines to reduce its own turnaround time.

The worker exposes an HTTP API for submitting jobs, polling status, fetching transcripts, and checking availability.

Why use a persistent worker?

	Standard batch	Persistent worker
Startup cost	Per job	Once
Memory usage	One container per job	Multiple jobs share one container
CPU/GPU utilisation	Interrupted between jobs	Continuous
Best for	Large, infrequent files	High throughput or smaller files

Cold start overhead is significant for short audio. Loading the ASR models — especially onto GPU — takes several seconds. For a 5-minute file this cost is negligible. For a 10-second clip, startup can take longer than transcription itself. The persistent worker eliminates this by loading the models once.
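To make the overhead concrete, the sketch below computes what share of a job's total time goes to model loading. The startup time and real-time factor used here are hypothetical figures chosen for illustration, not measurements of the worker.

```python
def startup_share(audio_seconds: float, startup_seconds: float, rtf: float) -> float:
    """Return the fraction of total job time spent on startup.

    rtf is the real-time factor: processing time divided by audio duration.
    """
    processing_seconds = audio_seconds * rtf
    return startup_seconds / (startup_seconds + processing_seconds)


# Hypothetical figures for illustration only.
STARTUP_SECONDS = 5.0
RTF = 0.1

for label, duration in [("10-second clip", 10), ("5-minute file", 300)]:
    share = startup_share(duration, STARTUP_SECONDS, RTF)
    print(f"{label}: {share:.0%} of job time is startup")
```

Under these assumptions the startup share shrinks quickly as the audio gets longer, which is why short clips gain the most from a worker that loads its models only once.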

High-throughput workloads benefit from a single long-lived container. Routing many jobs to one worker is more efficient than launching a container per job. The --parallel setting lets you tune concurrency to your workload.

GPU utilisation is maximised. On GPU deployments, a standard batch container leaves the GPU idle between jobs. A persistent worker keeps the GPU warm and available, reducing wasted capacity across back-to-back requests.

For long audio jobs, the persistent batch worker offers a negligible improvement in real-time factor (RTF); the resulting RTF is similar to that of a standard batch job.

Deploying the worker

Docker

docker run -it -e LICENSE_TOKEN=$TOKEN_VALUE -p PORT:18000 batch-asr-transcriber-en:15.7.0 --run-mode http --parallel=4 --all-formats /output_dir_name

Parameters

Parameter	Description
`--parallel`	Number of parallel engines. On a GPU container, each engine maps to one GPU connection.
`--all-formats`	Directory where all job outputs and logs are saved. If omitted, defaults to `/tmp/jobs`. See Generating multiple transcript formats for details.
`PORT`	The local port forwarded to the container's internal port (`18000`).

Environment variables

Variable	Description
`SM_BATCH_WORKER_LISTEN_PORT`	Override the default internal port (`18000`).
`SM_BATCH_WORKER_MAX_JOB_HISTORY`	Maximum number of completed job records to retain in memory.
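
The parameters and environment variables above can be combined programmatically. The following is a minimal sketch, not an official launcher: `docker_run_command` is a hypothetical helper, and the image tag in the usage lines is an assumption based on the minimum version stated on this page. The point it illustrates is that overriding the internal listen port also changes the container side of the `-p` mapping.

```python
import shlex


def docker_run_command(image, host_port, parallel, output_dir,
                       listen_port=18000, max_job_history=None):
    """Build a `docker run` argument list for a persistent worker (hypothetical helper)."""
    # `-e LICENSE_TOKEN` with no value passes the variable through from the
    # host environment, which is standard Docker behaviour.
    args = ["docker", "run", "-it", "-e", "LICENSE_TOKEN"]
    if listen_port != 18000:
        args += ["-e", f"SM_BATCH_WORKER_LISTEN_PORT={listen_port}"]
    if max_job_history is not None:
        args += ["-e", f"SM_BATCH_WORKER_MAX_JOB_HISTORY={max_job_history}"]
    # Publish the host port to whichever port the worker listens on internally.
    args += ["-p", f"{host_port}:{listen_port}"]
    args += [image, "--run-mode", "http", f"--parallel={parallel}",
             "--all-formats", output_dir]
    return args


# Assumed image tag; substitute the image you actually deploy.
cmd = docker_run_command("batch-asr-transcriber-en:15.7.0", 8080, 4,
                         "/output_dir_name", listen_port=19000, max_job_history=100)
print(shlex.join(cmd))
```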

Submitting a job

Once the worker is running and available, submit jobs by sending a POST request to /v2/jobs with an audio file and a transcription config. The worker queues the job and returns a job_id immediately. Poll GET /v2/jobs/{job_id} for status, then fetch the transcript once the job reaches DONE.

curl -X POST address.of.container:PORT/v2/jobs \
  -H 'X-SM-Processing-Data: {"parallel_engines": 2, "user_id": "MY_USER_ID"}' \
  -F 'config={
    "type": "transcription",
    "transcription_config": {
      "language": "en",
      "diarization": "speaker",
      "operating_point": "enhanced"
    }
  }' \
  -F "data_file=@$HOME/audio_file.mp3"

import asyncio
import os
from dotenv import load_dotenv
from speechmatics.batch import AsyncClient

load_dotenv()

async def main():
    client = AsyncClient(
        api_key=os.getenv("SPEECHMATICS_API_KEY"),
        url="http://address.of.container:PORT/v2"
    )
    result = await client.transcribe(
        "audio.wav",
        parallel_engines=2,
        user_id="MY_USER_ID"
    )
    print(result.transcript_text)
    await client.close()

asyncio.run(main())

Response codes

Code	Meaning
`201`	Job accepted. Returns `{"job_id": "abcdefgh01"}`
`400`	Invalid request
`503`	Server busy — not enough free engines
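
The polling step described at the start of this section can be sketched with the standard library alone. This is a minimal illustration, not the SDK's implementation: `wait_until_done` is a hypothetical helper, and treating any status other than `RUNNING` or `DONE` as a failure is an assumption, since only those two statuses appear on this page.

```python
import json
import time
import urllib.request


def get_json(url):
    """GET a URL and decode its JSON body."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def wait_until_done(base_url, job_id, fetch=get_json, interval=2.0,
                    timeout=600.0, sleep=time.sleep):
    """Poll GET /v2/jobs/{job_id} until the job reports DONE, then return the job record."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch(f"{base_url}/v2/jobs/{job_id}")["job"]
        if job["status"] == "DONE":
            return job
        if job["status"] != "RUNNING":
            # Assumption: any status other than RUNNING or DONE is terminal.
            raise RuntimeError(f"job {job_id} ended with status {job['status']}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still running after {timeout} seconds")
        sleep(interval)
```

The `fetch` and `sleep` parameters exist so the loop can be exercised without a running worker; in normal use the defaults are enough, for example `wait_until_done("http://address.of.container:PORT", job_id)`.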

Managing capacity

The worker processes multiple jobs concurrently, up to the --parallel limit you set at startup.

Each job can request multiple engines using the parallel_engines value in the X-SM-Processing-Data header. More engines per job means faster turnaround for that job, at the cost of reduced concurrency for others.

To check available capacity before submitting, query the /jobs health endpoint. The unused_engines field tells you how many engines are free.

If a job requests more engines than are currently available, it will be rejected:

HTTP 503: {"detail": "Server busy: 8 engines not available (2 engines in use, 5 parallel allowed)"}
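
Because a 503 only means that engines are busy at that moment, a client can reasonably wait and retry. The sketch below shows one way to do that with exponential backoff. `submit_with_retry` is a hypothetical helper, and `submit` stands for any callable of your own that performs the POST and returns the status code and decoded JSON body.

```python
import time


def submit_with_retry(submit, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call submit() until the worker accepts the job, backing off on HTTP 503.

    submit must return a (status_code, body) tuple, where body is the decoded
    JSON response.
    """
    for attempt in range(max_attempts):
        status, body = submit()
        if status == 201:
            return body["job_id"]
        if status != 503:
            # 400 and other errors will not succeed on retry.
            raise ValueError(f"submission failed with HTTP {status}: {body}")
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1 s, 2 s, 4 s, ...
    raise RuntimeError(f"worker still busy after {max_attempts} attempts")
```

Retrying presumably only helps when the request fits within the --parallel limit. A job that asks for more engines than the worker allows in total will be rejected on every attempt.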

Requesting parallel engines

curl -X POST address.of.container:PORT/v2/jobs \
  -H 'X-SM-Processing-Data: {"parallel_engines": 2}' \
  -F 'config={"type": "transcription", "transcription_config": {"language": "en"}}' \
  -F "data_file=@$HOME/audio_file.mp3"

Speaker identification

Speaker identification is enabled in the same way as for the standard (one-shot) batch container. To enable per-customer encrypted identifiers (as used in our SaaS offering), pass a user_id in the X-SM-Processing-Data header.

curl -X POST address.of.container:PORT/v2/jobs \
  -H 'X-SM-Processing-Data: {"user_id": "MY_USER_ID"}' \
  -F 'config={
    "type": "transcription",
    "transcription_config": {
      "language": "en",
      "diarization": "speaker",
      "operating_point": "enhanced"
    }
  }' \
  -F "data_file=@$HOME/audio_file.mp3"

For details on secrets management, refer to the Speaker identification documentation.
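
When submitting from code rather than curl, the X-SM-Processing-Data header is a JSON object serialised into a single header value. The sketch below builds it from the two fields used on this page. `processing_data_header` is a hypothetical helper, and rejecting values below 1 is an assumption about what the worker accepts.

```python
import json


def processing_data_header(parallel_engines=None, user_id=None):
    """Build the X-SM-Processing-Data header as a dict suitable for an HTTP client."""
    data = {}
    if parallel_engines is not None:
        if parallel_engines < 1:
            # Assumption: a job needs at least one engine.
            raise ValueError("parallel_engines must be at least 1")
        data["parallel_engines"] = parallel_engines
    if user_id is not None:
        data["user_id"] = user_id
    return {"X-SM-Processing-Data": json.dumps(data)}


headers = processing_data_header(parallel_engines=2, user_id="MY_USER_ID")
print(headers["X-SM-Processing-Data"])
```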

Job API reference

`GET /v2/jobs`

Returns a list of jobs.

Query parameters:

Parameter	Description
`created_before`	ISO 8601 datetime. Only return jobs created before this time.
`limit`	Max number of jobs to return (1–100).

Example response:

{
  "jobs": [
    {
      "id": "191f47e4a4204fa4ac2b",
      "created_at": "2026-03-18T19:27:42.436Z",
      "data_name": "5_min",
      "text_name": null,
      "duration": 300,
      "status": "RUNNING",
      "config": {
        "type": "transcription",
        "transcription_config": {
          "language": "en",
          "diarization": "speaker",
          "operating_point": "enhanced"
        }
      }
    },
    {
      "id": "6dcb02e0dc5943e2b643",
      "created_at": "2026-03-18T19:27:47.550Z",
      "data_name": "5_min",
      "text_name": null,
      "duration": 300,
      "status": "RUNNING",
      "config": {
        "type": "transcription",
        "transcription_config": {
          "language": "en",
          "diarization": "speaker",
          "operating_point": "enhanced"
        }
      }
    }
  ]
}
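
The query parameters can be assembled as shown in this sketch. `list_jobs_url` is a hypothetical helper. It assumes the worker accepts the same millisecond UTC timestamp format that appears in the `created_at` field of the response.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def list_jobs_url(base_url, created_before=None, limit=None):
    """Build the GET /v2/jobs URL with optional created_before and limit filters."""
    params = {}
    if created_before is not None:
        if created_before.tzinfo is None:
            raise ValueError("created_before must be timezone-aware")
        stamp = created_before.astimezone(timezone.utc).isoformat(timespec="milliseconds")
        # Assumption: use the same "Z" suffix style as created_at in responses.
        params["created_before"] = stamp.replace("+00:00", "Z")
    if limit is not None:
        if not 1 <= limit <= 100:
            raise ValueError("limit must be between 1 and 100")
        params["limit"] = limit
    query = urlencode(params)
    return f"{base_url}/v2/jobs" + (f"?{query}" if query else "")


cutoff = datetime(2026, 3, 18, 19, 27, 45, tzinfo=timezone.utc)
print(list_jobs_url("http://address.of.container:PORT", created_before=cutoff, limit=10))
```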

`GET /v2/jobs/{job_id}`

Returns the status of a specific job.

Example response:

{
  "job": {
    "id": "191f47e4a4204fa4ac2b",
    "created_at": "2026-03-18T19:27:42.436Z",
    "data_name": "5_min",
    "duration": 300,
    "status": "DONE",
    "config": {
      "type": "transcription",
      "transcription_config": {
        "language": "en",
        "diarization": "speaker",
        "operating_point": "enhanced"
      }
    },
    "request_id": "191f47e4a4204fa4ac2b"
  }
}

`GET /v2/jobs/{job_id}/transcript`

Returns the transcript for a completed job.

Query parameters:

Parameter	Options
`format`	`json`, `txt`, `srt`

Error responses:

Code	Reason
`404`	Job not found, job not yet complete (includes current status), or unsupported format
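
Since a 404 covers several different situations, a client may want to validate the format locally and handle the 404 case explicitly. The following is a minimal sketch using only the standard library. `transcript_url` and `fetch_transcript` are hypothetical helpers, and returning `None` on 404 is a design choice of this example rather than behaviour of the worker.

```python
import urllib.error
import urllib.request

SUPPORTED_FORMATS = ("json", "txt", "srt")


def transcript_url(base_url, job_id, fmt="json"):
    """Build the transcript URL, rejecting formats the endpoint does not list."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format {fmt!r}; choose from {SUPPORTED_FORMATS}")
    return f"{base_url}/v2/jobs/{job_id}/transcript?format={fmt}"


def fetch_transcript(base_url, job_id, fmt="json"):
    """Return the transcript body as text, or None if the worker answers 404."""
    try:
        with urllib.request.urlopen(transcript_url(base_url, job_id, fmt)) as resp:
            return resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            # Job unknown or not yet complete; the caller can poll and retry.
            return None
        raise
```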

`GET /v2/jobs/{job_id}/log`

Returns the processing logs for a specific job.

Health endpoints

The worker exposes three health endpoints on the same port as job submission.

These endpoints are designed to work as liveness and readiness probes in a Kubernetes cluster.

`GET /jobs`

Returns current engine usage and a list of active jobs. Use unused_engines to determine how many engines you can request for the next job.

Example response:

{
  "active_jobs": [
    { "job_id": "f8a564954b334eecb823", "parallel_engines": 1 },
    { "job_id": "29351ae8cf2c4e8694f0", "parallel_engines": 1 }
  ],
  "max_engines": 8,
  "unused_engines": 6
}
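
A client can use this response to decide whether a request is worth sending. The sketch below works on the decoded JSON shown above. `engines_in_use` and `can_accept` are hypothetical helper names.

```python
def engines_in_use(health):
    """Sum the engines held by all active jobs in a GET /jobs response."""
    return sum(job["parallel_engines"] for job in health["active_jobs"])


def can_accept(health, requested_engines):
    """Return True if the worker currently has enough free engines for a new job."""
    return requested_engines <= health["unused_engines"]


# The example response from this section, as a Python dict.
health = {
    "active_jobs": [
        {"job_id": "f8a564954b334eecb823", "parallel_engines": 1},
        {"job_id": "29351ae8cf2c4e8694f0", "parallel_engines": 1},
    ],
    "max_engines": 8,
    "unused_engines": 6,
}

print(engines_in_use(health), can_accept(health, 4))
```

The check is advisory only: another client may claim engines between the health query and the submission, so the 503 response still needs handling.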

`GET /live`

Liveness probe. Returns 200 when all container services are running and healthy.

{ "live": true }

`GET /ready`

Readiness probe. Returns 200 when at least one engine slot is free, 503 when all engines are occupied.

{
  "ready": true,
  "engines_used": 2
}
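
Outside Kubernetes, the same endpoint can gate job submission from a script. This is a minimal sketch, assuming that connection errors while the container is still starting should be treated as "not ready yet". `http_status` and `wait_until_ready` are hypothetical helpers.

```python
import time
import urllib.error
import urllib.request


def http_status(url):
    """Return the HTTP status code of a GET request, including error statuses."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def wait_until_ready(base_url, status_of=http_status, interval=1.0,
                     timeout=120.0, sleep=time.sleep):
    """Poll GET /ready until it returns 200; return False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if status_of(f"{base_url}/ready") == 200:
                return True
        except OSError:
            # Connection refused or similar: the worker may still be starting.
            pass
        sleep(interval)
    return False
```

A 503 from /ready means that every engine is occupied, not that the worker is unhealthy, so a script that waits on it should be prepared for long waits under load.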
