This tutorial covers deploying a high-throughput, low-latency REST API for serving text-embedding, reranking, CLIP, CLAP, and ColPali models using the open-source Infinity framework. Infinity supports multiple GPUs/CPUs and frameworks: the inference server is built on PyTorch, Optimum (ONNX/TensorRT), and CTranslate2, uses FlashAttention on NVIDIA CUDA, and also runs on AMD ROCm, CPU, AWS Inferentia2, and Apple MPS accelerators. It uses dynamic batching and dedicated tokenization worker threads. Find the final working version here on GitHub.
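Dynamic batching is the main reason a server like this sustains high throughput: requests that arrive within a short window are grouped and run through the model as one batch. The sketch below is a conceptual illustration of that idea in plain asyncio (not Infinity's actual implementation; names like MicroBatcher are made up for this example):

```python
# Conceptual sketch of dynamic batching: requests arriving within a short
# window are grouped and processed together. Illustration only, not
# Infinity's real scheduler.
import asyncio

class MicroBatcher:
    def __init__(self, max_batch=32, window_s=0.005):
        self.queue: asyncio.Queue = asyncio.Queue()
        self.max_batch = max_batch
        self.window_s = window_s

    async def submit(self, item):
        """Enqueue one request and wait for its result."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self, batch_fn):
        """Worker loop: collect items for up to window_s, then run one batch."""
        while True:
            batch = [await self.queue.get()]
            deadline = asyncio.get_running_loop().time() + self.window_s
            while len(batch) < self.max_batch:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            results = batch_fn([item for item, _ in batch])
            for (_, fut), result in zip(batch, results):
                fut.set_result(result)

async def main():
    batcher = MicroBatcher()
    worker = asyncio.create_task(batcher.run(lambda xs: [x * 2 for x in xs]))
    # Eight "concurrent clients" whose requests get batched together.
    results = await asyncio.gather(*(batcher.submit(i) for i in range(8)))
    worker.cancel()
    return results

print(asyncio.run(main()))  # [0, 2, 4, 6, 8, 10, 12, 14]
```

The GPU sees a handful of large batches instead of hundreds of single-item calls, which is what makes the high `replica_concurrency` configured later in this tutorial viable.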

Project Setup

Complete the quickstart to install the CLI and create an account.
  1. Run the command: cerebrium init infinity-throughput
This creates two files:
  • main.py: The entrypoint code
  • cerebrium.toml: Container image and auto-scaling parameters
Start by defining the container environment. Infinity has a public Docker image on Docker Hub. Cerebrium requires Docker Hub authentication to pull images (even public ones). Sign in with the following command:
docker login -u your-dockerhub-username
# Enter your password or access token when prompted
Add the following to cerebrium.toml:
[cerebrium.deployment]
name = "infinity-throughput"
python_version = "3.11"
docker_base_image_url = "michaelf34/infinity:0.0.77"
disable_auth = true
include = ['./*', 'main.py', 'cerebrium.toml']
exclude = ['.*']
Autoscaling criteria vary by hardware type and model selection. Define them in the following cerebrium.toml sections:
[cerebrium.hardware]
cpu = 6.0
memory = 12.0
compute = "AMPERE_A10"
region = "us-east-1"

[cerebrium.scaling]
min_replicas = 0
max_replicas = 2
cooldown = 30
replica_concurrency = 500
scaling_metric = "concurrency_utilization"

[cerebrium.dependencies.pip]
numpy = "latest"
"infinity-emb[all]" = "0.0.77"
optimum = ">=1.24.0,<2.0.0"
transformers = "<4.49"
click = "==8.1.8"
fastapi = "latest"
uvicorn = "latest"
pandas = "latest"
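As a rough mental model of the scaling section above (not Cerebrium's exact autoscaling algorithm), with scaling_metric = "concurrency_utilization" the platform adds replicas as in-flight requests approach each replica's replica_concurrency, clamped between min_replicas and max_replicas:

```python
# Rough mental model of concurrency-based autoscaling. The real platform
# also applies the cooldown and utilization thresholds; this is only the
# core arithmetic.
import math

def desired_replicas(in_flight, replica_concurrency=500,
                     min_replicas=0, max_replicas=2):
    if in_flight == 0:
        return min_replicas  # scale to zero once traffic stops (after cooldown)
    needed = math.ceil(in_flight / replica_concurrency)
    return max(min_replicas, min(max_replicas, needed))

print(desired_replicas(0))     # 0 -> idle, no replicas
print(desired_replicas(120))   # 1 -> one replica absorbs up to 500 requests
print(desired_replicas(900))   # 2 -> spill over into a second replica
print(desired_replicas(5000))  # 2 -> capped at max_replicas
```

With these values, the deployment can absorb up to roughly 1,000 concurrent requests (2 replicas × 500) before requests queue.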
The models run on an Ampere A10, which handles up to 500 concurrent inputs per replica. In main.py, create a class that handles embedding model functionality using the Infinity framework. This example uses multiple models to demonstrate the range of supported functionality.
from infinity_emb import AsyncEngineArray, EngineArgs

class InfinityModel:
    def __init__(self):
        self.model_ids = [
            "jinaai/jina-clip-v1",
            "michaelfeil/bge-small-en-v1.5",
            "mixedbread-ai/mxbai-rerank-xsmall-v1",
            "philschmid/tiny-bert-sst2-distilled"
        ]
        self.engine_array = None

    def _get_array(self):
        return AsyncEngineArray.from_args([
            EngineArgs(model_name_or_path=model, model_warmup=False)
            for model in self.model_ids
        ])

    async def setup(self):
        print(f"Setting up models: {self.model_ids}")
        self.engine_array = self._get_array()
        await self.engine_array.astart()
        print("All models loaded successfully!")


model = InfinityModel()
Model loading can take time, so use FastAPI for finer control over startup and readiness. Cerebrium supports custom ASGI servers. Add the following to main.py:
from fastapi import FastAPI, Body

app = FastAPI(title="High-Throughput Embedding Service")

@app.on_event("startup")  # FastAPI's newer lifespan API also works here
async def startup_event():
    """Initialize models on container startup"""
    await model.setup()


@app.get("/health")
async def health():
    return {"status": "healthy"}

@app.get("/ready")
async def ready():
    """Readiness endpoint to report model initialization state."""
    is_ready = model.engine_array is not None
    return {"ready": is_ready}
Infinity supports text embeddings, image embeddings, reranking, and classification. Create separate endpoints for each:
def to_json(obj):
    """Recursively convert numpy arrays (and nested lists of them) to plain
    Python types so FastAPI can serialize them to JSON. Ints, floats, and
    dicts (e.g. usage counts and classification labels) pass through as-is."""
    if hasattr(obj, "tolist"):
        return obj.tolist()
    if isinstance(obj, (list, tuple)):
        return [to_json(o) for o in obj]
    return obj

@app.post("/embed")
async def embed(sentences: list[str] = Body(...), model_index: int = Body(1)):
    """Generate embeddings using the specified model."""
    engine = model.engine_array[model_index]
    embeddings, usage = await engine.embed(sentences=sentences)

    return {
        "embeddings": to_json(embeddings),
        "usage": to_json(usage),
        "model": model.model_ids[model_index]
    }


@app.post("/image_embed")
async def image_embed(image_urls: list[str] = Body(...), model_index: int = Body(0)):
    """Generate embeddings for images using CLIP model."""
    engine = model.engine_array[model_index]
    embeddings, usage = await engine.image_embed(images=image_urls)

    return {
        "embeddings": to_json(embeddings),
        "usage": to_json(usage),
        "model": model.model_ids[model_index]
    }


@app.post("/rerank")
async def rerank(query: str = Body(...), docs: list[str] = Body(...), model_index: int = Body(2)):
    """Rerank documents based on query relevance."""
    engine = model.engine_array[model_index]
    rankings, usage = await engine.rerank(query=query, docs=docs)

    return {
        "rankings": to_json(rankings),
        "usage": to_json(usage),
        "model": model.model_ids[model_index]
    }


@app.post("/classify")
async def classify(sentences: list[str] = Body(...), model_index: int = Body(3)):
    """Classify text sentiment."""
    engine = model.engine_array[model_index]
    classes, usage = await engine.classify(sentences=sentences)

    return {
        "classifications": to_json(classes),
        "usage": to_json(usage),
        "model": model.model_ids[model_index]
    }

This creates a multi-purpose embedding server. Update cerebrium.toml to point to the FastAPI server by adding the following section:
[cerebrium.runtime.custom]
port = 5000
entrypoint = ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "5000"]
healthcheck_endpoint = "/health"
readycheck_endpoint = "/ready"
Deploy with cerebrium deploy. After deployment, run inference with a command like:
curl --location 'https://api.aws.us-east-1.cerebrium.ai/v4/p-xxxx/infinity-throughput/image_embed' \
--header 'Content-Type: application/json' \
--data '{"image_urls": ["https://www.borrowmydoggy.com/_next/image?url=https%3A%2F%2Fcdn.sanity.io%2Fimages%2F4ij0poqn%2Fproduction%2Fe24bfbd855cda99e303975f2bd2a1bf43079b320-800x600.jpg&w=1080&q=80"]}'
The response looks like:
{
    "embeddings": [
        [
            -0.05284368246793747,
            0.0011637501884251833,
            -0.029046623036265373,
            ...
        ]
    ]
}
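Client-side, the returned vectors can be consumed directly. For example, two embeddings from an /embed response can be compared with cosine similarity in pure Python (no extra dependencies; the sample vectors below are placeholders for real response data):

```python
# Compare two embedding vectors from the /embed response with cosine
# similarity: 1.0 means identical direction, 0.0 means orthogonal.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# In practice: embeddings = response.json()["embeddings"]
v1 = [0.1, 0.2, 0.3]  # placeholder vectors standing in for real embeddings
v2 = [0.2, 0.4, 0.6]
print(cosine_similarity(v1, v2))  # vectors in the same direction -> 1.0
```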
The result is a scalable, multi-purpose embedding/reranking server.