This tutorial transcribes an hour-long audio file using Distil-Whisper — a distilled version of Whisper that is 60% faster while maintaining accuracy within 1% of the original model. The endpoint accepts either a base64-encoded string of the audio file or a URL from which to download it. You can view the final implementation here

Basic Setup

Developing models with Cerebrium is similar to developing on a virtual machine or in Google Colab. First, install the Cerebrium package and log in; see the installation docs for details.
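A typical setup looks like this (a sketch assuming a standard pip-based install; the installation docs are authoritative):
pip install --upgrade cerebrium
cerebrium login
Then create the project: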
cerebrium init 1-whisper-transcription
Add the following packages to the [cerebrium.dependencies.pip] section of your cerebrium.toml file:
[cerebrium.dependencies.pip]
accelerate = "latest"
transformers = ">=4.35.0"
openai-whisper = "latest"
pydantic = "latest"
Create a util.py file for utility functions — downloading a file from a URL or converting a base64 string to a file:
import base64
import uuid

import requests

DOWNLOAD_ROOT = "/tmp"


def download_file_from_url(url: str, filename: str) -> str:
    """Download a file from a public URL and save it to `filename`."""
    response = requests.get(url, timeout=60)
    if response.status_code == 200:
        with open(filename, "wb") as f:
            f.write(response.content)
        return filename
    else:
        raise Exception(f"Download failed with status code {response.status_code}")


def save_base64_string_to_file(audio: str) -> str:
    """Decode a base64-encoded audio payload and write it to a uniquely named file."""
    decoded_data = base64.b64decode(audio)
    filename = f"{DOWNLOAD_ROOT}/{uuid.uuid4()}"
    with open(filename, "wb") as file:
        file.write(decoded_data)
    return filename
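
You can sanity-check these helpers locally by round-tripping a small file through the base64 path (a quick sketch; test.mp3 is a placeholder for any local audio file):
import base64

from util import save_base64_string_to_file

# Encode a local audio file the same way a client would before sending it.
with open("test.mp3", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

saved_path = save_base64_string_to_file(encoded)
print(saved_path)  # e.g. /tmp/<uuid>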

With the utility functions complete, update main.py with the main application code. The endpoint accepts either a base64-encoded string or a public URL of the audio file, passes it to the model, and returns the output. Define the request object:
from typing import Optional
from pydantic import BaseModel, HttpUrl

class Item(BaseModel):
    audio: Optional[str] = None
    file_url: Optional[HttpUrl] = None
    webhook_endpoint: Optional[HttpUrl] = None
Pydantic handles data validation. While audio and file_url are optional parameters, at least one must be provided. The webhook_endpoint parameter, automatically included by Cerebrium in every request, is useful for long-running requests. Note: Cerebrium has a 3-minute timeout for each inference request. For long audio files (2+ hours) that take several minutes to process, use a webhook_endpoint — a URL where Cerebrium sends a POST request with the function’s results.
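If you prefer to fail fast at validation time, Pydantic can enforce the at-least-one rule directly on the model. A minimal sketch using a Pydantic v2 model_validator (optional — the predict function below also guards against this case):
from typing import Optional

from pydantic import BaseModel, HttpUrl, model_validator

class Item(BaseModel):
    audio: Optional[str] = None
    file_url: Optional[HttpUrl] = None
    webhook_endpoint: Optional[HttpUrl] = None

    @model_validator(mode="after")
    def check_audio_or_file_url(self):
        # Reject requests that supply neither an audio payload nor a URL.
        if self.audio is None and self.file_url is None:
            raise ValueError("Either audio or file_url must be provided")
        return self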

Model Setup and Inference

Import the required packages and load the Whisper model. The model downloads during initial deployment and is automatically cached in persistent storage for subsequent use. Loading the model outside the predict function ensures this code only runs on cold start (startup). For warm containers, only the predict function executes for inference.
from huggingface_hub import hf_hub_download
from whisper import load_model, transcribe
from util import download_file_from_url, save_base64_string_to_file

# Download the Distil-Whisper checkpoint once; Cerebrium caches it in persistent storage.
distil_large_v3 = hf_hub_download(repo_id="distil-whisper/distil-large-v3", filename="original-model.bin")
model = load_model(distil_large_v3)

def predict(run_id, audio=None, file_url=None, webhook_endpoint=None):
    item = Item(audio=audio, file_url=file_url, webhook_endpoint=webhook_endpoint)

    if item.audio is None and item.file_url is None:
        raise ValueError("Either audio or file_url must be provided")

    if item.audio is not None:
        file = save_base64_string_to_file(item.audio)
    else:
        # HttpUrl is a Pydantic type, so convert it to a plain string before downloading.
        file = download_file_from_url(str(item.file_url), f"{run_id}.mp3")

    print("Transcribing file...")
    result = transcribe(model, audio=file)
    return result
The predict function, which runs only on inference requests, creates an audio file from either the download URL or base64 string, transcribes it, and returns the output.
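Before deploying, you can exercise predict locally (a sketch; the URL is a placeholder and the run_id is arbitrary — note the model still downloads and loads on import):
if __name__ == "__main__":
    output = predict(
        run_id="local-test",
        file_url="https://your-public-url.com/test.mp3",  # placeholder URL
    )
    print(output["text"])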

Deploy

Configure your compute and environment settings in cerebrium.toml:
[cerebrium.deployment]
name = "1-whisper-transcription"
python_version = "3.11"
include = ["./*", "main.py", "cerebrium.toml"]
exclude = ["./example_exclude"]
docker_base_image_url = "nvidia/cuda:12.1.1-runtime-ubuntu22.04"

[cerebrium.hardware]
region = "us-east-1"
provider = "aws"
compute = "AMPERE_A10"
cpu = 3
memory = 12.0
gpu_count = 1

[cerebrium.scaling]
min_replicas = 0
max_replicas = 5
cooldown = 60

[cerebrium.dependencies.pip]
accelerate = "latest"
transformers = ">=4.35.0"
openai-whisper = "latest"
pydantic = "latest"

[cerebrium.dependencies.conda]

[cerebrium.dependencies.apt]
"ffmpeg" = "latest"

Deploy the app using this command:
cerebrium deploy
After deployment, make this request:
curl --location 'https://api.aws.us-east-1.cerebrium.ai/v4/p-<YOUR PROJECT ID>/1-whisper-transcription/predict' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <YOUR TOKEN HERE>' \
--data '{"file_url": "https://your-public-url.com/test.mp3"}''
If you supplied a webhook_endpoint, the response returns immediately with a 202 status code and a run_id — a unique identifier that correlates the eventual result with the initial request. For synchronous requests, the endpoint returns results in this format:
{
  "run_id": "2R5PnHprwNqiS5tcFMor-4c6rSrxuzrVtBU1JfjT5iWFG6s4pHo1Ug==",
  "message": "Finished inference request with run_id: `2R5PnHprwNqiS5tcFMor-4c6rSrxuzrVtBU1JfjT5iWFG6s4pHo1Ug==`",
  "result": {
    "text": " Testing, one, two, three, testing.",
    "segments": [
      {
        "id": 0,
        "seek": 0,
        "start": 0,
        "end": 4,
        "text": " Testing, one, two, three, testing.",
        "tokens": [
          50364, 45517, 11, 472, 11, 220, 20534, 11, 220, 27583, 11, 220, 83,
          8714, 13, 50564
        ],
        "temperature": 0,
        "avg_logprob": -0.3824356023003073,
        "compression_ratio": 1,
        "no_speech_prob": 0.019467202946543694
      }
    ],
    "language": "en"
  },
  "status_code": 200,
  "run_time_ms": 2053.8525581359863
}
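
When you use a webhook_endpoint, Cerebrium sends a POST request with the function's results to that URL once the run completes. A minimal receiver sketch using FastAPI (the route name is hypothetical; the payload is assumed to match the response format shown above):
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/whisper-result")
async def whisper_result(request: Request):
    payload = await request.json()
    # Use run_id to correlate the delivered result with the original request.
    print(payload.get("run_id"), payload.get("result", {}).get("text"))
    return {"status": "received"}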