
Introduction

This guide covers migrating from Hugging Face inference endpoints to Cerebrium’s serverless infrastructure platform, including key differences, migration benefits, and step-by-step instructions for deploying a Llama 3.1 8B model on Cerebrium.

Comparing Hugging Face and Cerebrium

Key features and performance metrics compared:
Feature | Hugging Face | Cerebrium
Pricing | $0.000278 per second | $0.0004676 per second
Minimum cooldown period | 15m | 1s
First build time | 9m25s | 49s
Subsequent build times | 1m50s - 2m15s | 58s - 1m5s
Response time (from cold) | 1m45s - 1m48s | 8s - 17s
Response time (from warm) | 6s | 2s
Co-locating your models | Requires a separate repository for each inference endpoint and model | Co-locate multiple models from various sources in a single app
Response handling (from cold) | Throws an error | Waits for infrastructure to become available and returns a response

Benefits of Migrating to Cerebrium

  1. Faster build times: Cerebrium reduces build times by up to 95%, with subsequent builds seeing a further 56% reduction. This improves iteration speed and reduces the cost of running experiments with complex ML apps.
  2. Flexible cooldown period: With a minimum cooldown period of just 1 second (compared to Hugging Face’s 15 minutes), Cerebrium allows for more efficient resource utilization and cost management.
  3. Improved cold start handling: When encountering a cold start, Cerebrium waits for the infrastructure to become available instead of throwing an error. This results in a better user experience and fewer failed requests.
  4. Model colocation flexibility: Cerebrium doesn’t require a separate repository for each inference endpoint, simplifying the management of models. Each function in your app becomes an endpoint automatically, which means that you can run multiple models from the same app to save costs.
  5. Pay-per-use model: Cerebrium’s pricing model ensures you pay only for the compute resources you actually use. This can lead to cost savings, especially for sporadic or low-volume inference needs (see the illustrative cost sketch after this list).
  6. Competitive performance: Cerebrium only adds up to 50ms of latency to your inference requests. This results in competitive response times from a warm start. Caching mechanisms and highly optimized orchestration pipelines help apps start from a cold state in an average of 2-5 seconds.
  7. Customizable infrastructure: Cerebrium allows for fine-grained control over the infrastructure specifications, enabling you to optimize for your specific use case.
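
To make the cooldown and pay-per-use points concrete, here is a rough, illustrative cost sketch using the per-second rates and minimum cooldown periods from the comparison table above. The assumption that you are billed for the full cooldown window after each isolated burst of traffic is a simplification for illustration only; check each provider's current pricing before drawing conclusions.

# Illustrative cost comparison based on the table above (not authoritative pricing).
# Assumption: you pay for active compute time plus the cooldown window that
# follows each isolated burst of requests.

HF_RATE = 0.000278          # USD per second (from the table)
CEREBRIUM_RATE = 0.0004676  # USD per second (from the table)

HF_COOLDOWN = 15 * 60       # 15 minutes, in seconds
CEREBRIUM_COOLDOWN = 1      # 1 second (the configurable minimum)

def burst_cost(rate: float, cooldown: float, compute_seconds: float) -> float:
    """Cost of one isolated burst of traffic: compute time plus cooldown."""
    return rate * (compute_seconds + cooldown)

# Example: 20 isolated bursts per day, each needing 10 seconds of compute.
bursts_per_day, seconds_per_burst = 20, 10
hf_daily = bursts_per_day * burst_cost(HF_RATE, HF_COOLDOWN, seconds_per_burst)
cerebrium_daily = bursts_per_day * burst_cost(CEREBRIUM_RATE, CEREBRIUM_COOLDOWN, seconds_per_burst)

print(f"Hugging Face: ~${hf_daily:.2f}/day")    # ~ $5.06/day
print(f"Cerebrium:    ~${cerebrium_daily:.2f}/day")  # ~ $0.10/day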

Migration process

The following walks through migrating a Llama 3.1 8B model from Hugging Face to Cerebrium, from configuration setup to deployment.

1. Cerebrium setup and configuration

Set up the required files and configure the environment.

1.1 Install Cerebrium CLI

First, install the Cerebrium CLI:
pip install cerebrium --upgrade
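If you have not used the CLI on this machine before, you will also need to authenticate it against your Cerebrium account (typically with cerebrium login); check the Cerebrium documentation for the current login flow.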

1.2 Configure your cerebrium.toml file

Scaffold your application by running cerebrium init [PROJECT_NAME]. During the initialization, a cerebrium.toml is created. This file configures the deployment, hardware, scaling, and dependencies for your Cerebrium project. Update your cerebrium.toml file to reflect the following:
[cerebrium.deployment]
name = "llama-8b-vllm"
python_version = "3.11"
docker_base_image_url = "debian:bookworm-slim"
include = ["./*", "main.py", "cerebrium.toml"]
exclude = [".*"]

[cerebrium.hardware]
cpu = 2
memory = 12.0
compute = "AMPERE_A10"

[cerebrium.scaling]
min_replicas = 0
max_replicas = 5
cooldown = 30

[cerebrium.dependencies.pip]
sentencepiece = "latest"
torch = "latest"
transformers = "latest"
accelerate = "latest"
xformers = "latest"
pydantic = "latest"
bitsandbytes = "latest"

Configuration breakdown:
  • cerebrium.deployment: Specifies the project name, Python version, base Docker image, and which files to include/exclude as project files.
  • cerebrium.hardware: Defines the CPU, memory, and GPU requirements for your deployment.
  • cerebrium.scaling: Configures auto-scaling behavior, including minimum and maximum replicas, and cooldown period.
  • cerebrium.dependencies.pip: Lists the Python packages required for your project. An optional sketch for sanity-checking this file locally follows this list.
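
Before deploying, you can optionally sanity-check the configuration locally. This is not part of Cerebrium's tooling, just a small standard-library sketch (Python 3.11+) that parses cerebrium.toml and prints the hardware and scaling settings so typos are caught early.

# Optional local sanity check for cerebrium.toml (not part of the Cerebrium CLI).
import tomllib  # standard library in Python 3.11+

with open("cerebrium.toml", "rb") as f:
    config = tomllib.load(f)

hardware = config["cerebrium"]["hardware"]
scaling = config["cerebrium"]["scaling"]

print(f"Compute: {hardware['compute']} ({hardware['cpu']} CPU, {hardware['memory']} GB memory)")
print(f"Replicas: {scaling['min_replicas']}-{scaling['max_replicas']}, cooldown {scaling['cooldown']}s")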

1.3 Update your code

Next, update main.py with the model loading and inference logic.
import torch
import os
from huggingface_hub import login
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

# Log into Hugging Face Hub
login(token=os.environ.get("HF_AUTH_TOKEN"))

model_path = "meta-llama/Meta-Llama-3.1-8B-Instruct"
cache_directory = "/persistent-storage"

# Set up tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_path, cache_dir=cache_directory)
tokenizer.pad_token_id = 0
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
    cache_dir=cache_directory,
)

class Item(BaseModel):
    prompt: str
    temperature: float
    top_p: float
    top_k: int
    max_tokens: int
    frequency_penalty: float

def run(
        prompt, temperature=0.6, top_p=0.9, top_k=0, max_tokens=512, frequency_penalty=1
):
    # Validate the incoming parameters with Pydantic.
    # Note: frequency_penalty is accepted for API compatibility but is not
    # passed to the transformers generation config below.
    item = Item(
        prompt=prompt,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        max_tokens=max_tokens,
        frequency_penalty=frequency_penalty,
    )

    # Tokenize the prompt
    inputs = tokenizer(
        item.prompt, return_tensors="pt", max_length=512, truncation=True, padding=True
    )
    input_ids = inputs["input_ids"].to("cuda")

    # Set up generation config (transformers expects max_new_tokens, not max_tokens)
    generation_config = GenerationConfig(
        do_sample=True,  # enable sampling so temperature/top_p/top_k take effect
        temperature=item.temperature,
        top_p=item.top_p,
        top_k=item.top_k,
        max_new_tokens=item.max_tokens,
    )
    with torch.no_grad():
        outputs = model.generate(
            input_ids=input_ids,
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
        )
    result = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)

    return {"result": result}
This script:
  1. Authenticates with Hugging Face using a secret token. Add this secret (HF_AUTH_TOKEN) in the Cerebrium dashboard.
  2. Loads the Llama 3.1 8B Instruct model with the Hugging Face transformers library, using 8-bit quantization (bitsandbytes) to reduce its memory footprint.
  3. Defines an Item class to structure and validate (using Pydantic) the input parameters.
  4. Implements a run function that generates text based on the provided prompt and parameters. A minimal local smoke-test sketch follows.
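
If you want to exercise the function before deploying, a minimal local smoke test might look like the following. It assumes you are on a machine with a CUDA GPU, have HF_AUTH_TOKEN set in your environment, and have installed the dependencies listed in cerebrium.toml; it is not required for deployment.

# Hypothetical local smoke test (requires a CUDA GPU and HF_AUTH_TOKEN in your environment).
# Importing main.py downloads and loads the model, which can take several minutes.
from main import run

output = run("Tell me about yourself", temperature=0.7, max_tokens=64)
print(output["result"])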

2. Deployment

To deploy your app to Cerebrium, use the following CLI command in your project directory:
cerebrium deploy
This command uses the configuration in cerebrium.toml to set up and deploy your model.

3. Using the Deployed Model

Once deployed, you can use your model as follows:
import requests
import json

url = "https://api.aws.us-east-1.cerebrium.ai/v4/[PROJECT_NAME]/llama-8b-vllm/run"

payload = json.dumps({"prompt": "tell me about yourself"})

headers = {
  'Authorization': 'Bearer [CEREBRIUM_API_KEY]',
  'Content-Type': 'application/json'
}

response = requests.request("POST", url, headers=headers, data=payload)

print(response.text)
Make sure to replace [PROJECT_NAME] with your Cerebrium project name and [CEREBRIUM_API_KEY] with your Inference API key, which can be found in your dashboard under API keys. This code sends a POST request to your deployed model’s endpoint with a prompt and prints the model’s response.
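
In practice you will usually want to check the status code and parse the JSON body rather than printing raw text. The sketch below assumes the response body is JSON containing the dictionary returned by run under a result field; inspect an actual response from your own deployment to confirm the exact envelope Cerebrium wraps around it.

import requests

url = "https://api.aws.us-east-1.cerebrium.ai/v4/[PROJECT_NAME]/llama-8b-vllm/run"
headers = {
    "Authorization": "Bearer [CEREBRIUM_API_KEY]",
    "Content-Type": "application/json",
}

# The json= argument handles serialization, so json.dumps is not needed here.
response = requests.post(url, headers=headers, json={"prompt": "tell me about yourself"}, timeout=120)
response.raise_for_status()  # fail loudly on 4xx/5xx errors

body = response.json()
# The exact response envelope may differ; inspect body to confirm where your
# run() return value ({"result": ...}) ends up.
print(body.get("result", body))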

Additional Considerations

When migrating, keep the following points in mind:
  1. API structure: The Cerebrium implementation uses a different API structure compared to Hugging Face.
  2. Authentication: Ensure you have set up the HF_AUTH_TOKEN secret in Cerebrium for authenticating with Hugging Face.
  3. Model permissions: The example uses the Llama 3.1 8B Instruct model. Ensure you have the necessary permissions to use this model.
  4. Hardware optimization: The cerebrium.toml file specifies the hardware requirements. Adjust these based on your specific model and performance needs.
  5. Dependency management: Regularly review and update the dependencies listed in cerebrium.toml to ensure you’re using the latest compatible versions.
  6. Scaling configuration: The example sets up auto-scaling with 0 to 5 replicas and a 30-second cooldown. Monitor your usage patterns and adjust these parameters as needed.
  7. Cold starts: While Cerebrium handles cold starts more gracefully than Hugging Face, be aware that the first request after a period of inactivity may still take longer to process.
  8. Monitoring and logging: Familiarize yourself with Cerebrium’s monitoring and logging capabilities to track your model’s performance and usage.
  9. Cost management: Although Cerebrium’s pay-per-use model can be more cost-effective, set up proper monitoring and alerts to avoid unexpected costs.
  10. Testing: Thoroughly test your migrated models to ensure they perform as expected on the new platform; a simple latency-check sketch follows this list.
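
As a starting point for that testing, the hypothetical sketch below sends two requests to the deployed endpoint and reports the round-trip time, which makes the cold-versus-warm difference from consideration 7 easy to observe. The URL and key placeholders are taken from the earlier example; this is an illustration, not a prescribed load-testing setup.

import time
import requests

url = "https://api.aws.us-east-1.cerebrium.ai/v4/[PROJECT_NAME]/llama-8b-vllm/run"
headers = {"Authorization": "Bearer [CEREBRIUM_API_KEY]", "Content-Type": "application/json"}

# The first request likely hits a cold start; the second should be served warm.
for label in ("cold (probably)", "warm (probably)"):
    start = time.perf_counter()
    response = requests.post(url, headers=headers, json={"prompt": "ping", "max_tokens": 8}, timeout=300)
    elapsed = time.perf_counter() - start
    print(f"{label}: HTTP {response.status_code} in {elapsed:.1f}s")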

Conclusion

Migrating from Hugging Face inference endpoints to Cerebrium offers faster build times, more flexible resource management, and lower costs. The migration requires some setup and code changes, but the resulting deployment provides improved performance and scalability. Continue to monitor and optimize the deployment in production. For questions during the migration, reach out to support or join the Slack and Discord communities, and see the Cerebrium documentation for more detail on the platform’s functionality.