Introduction
This guide covers migrating from Hugging Face Inference Endpoints to Cerebrium’s serverless infrastructure platform, including key differences, migration benefits, and step-by-step instructions for deploying a Llama 3.1 8B model on Cerebrium.
Comparing Hugging Face and Cerebrium
Key features and performance metrics compared:

| Feature | Hugging Face | Cerebrium |
|---|---|---|
| Pricing | $0.000278 per second | $0.0004676 per second |
| Minimum cooldown period | 15m | 1s |
| First build time | 9m25s | 49s |
| Subsequent build times | 1m50s - 2m15s | 58s - 1m5s |
| Response time (From cold) | 1m45s - 1m48s | 8s - 17s |
| Response time (From warm) | 6s | 2s |
| Co-locating your models | Requires a separate repository for each inference endpoint and model | Co-locate multiple models from various sources in a single app |
| Response handling (From cold) | Throws an error | Waits for infrastructure to become available and returns a response |
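
The per-second price alone does not determine what a request costs, because an endpoint keeps running until its cooldown period ends. The sketch below works through that arithmetic with the figures from the table. It assumes, as a simplification the table does not state, that each platform bills for the warm response time plus the minimum cooldown and that the request arrives in isolation. The helper isolated_request_cost is hypothetical and not part of either platform's tooling.

```python
# Figures copied from the comparison table above.
PRICE_PER_SECOND = {"Hugging Face": 0.000278, "Cerebrium": 0.0004676}
MIN_COOLDOWN_SECONDS = {"Hugging Face": 15 * 60, "Cerebrium": 1}
WARM_RESPONSE_SECONDS = {"Hugging Face": 6, "Cerebrium": 2}


def isolated_request_cost(platform: str) -> float:
    """Cost of one warm request followed by the minimum cooldown.

    Assumption: billing covers the response time plus the cooldown period.
    """
    billed_seconds = WARM_RESPONSE_SECONDS[platform] + MIN_COOLDOWN_SECONDS[platform]
    return PRICE_PER_SECOND[platform] * billed_seconds


for platform in PRICE_PER_SECOND:
    print(f"{platform}: ${isolated_request_cost(platform):.6f} per isolated request")
```

Under these assumptions an isolated request comes to roughly $0.25 on Hugging Face and roughly $0.0014 on Cerebrium, despite Cerebrium's higher per-second price. Under sustained traffic the cooldown matters less and the per-second price dominates.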
Benefits of Migrating to Cerebrium
- Faster build times: Cerebrium reduces build times by up to 95%, with subsequent builds seeing an additional 56% reduction. This can greatly improve iteration speed and lower the cost of running experiments with complex ML apps.
- Flexible cooldown period: With a minimum cooldown period of just 1 second (compared to Hugging Face’s 15 minutes), Cerebrium allows for more efficient resource utilization and cost management.
- Improved cold start handling: When encountering a cold start, Cerebrium waits for the infrastructure to become available instead of throwing an error. This results in a better user experience and fewer failed requests.
- Model colocation flexibility: Cerebrium doesn’t require a separate repository for each inference endpoint, simplifying the management of models. Each function in your app becomes an endpoint automatically, which means that you can run multiple models from the same app to save costs.
- Pay-per-use model: Cerebrium’s pricing model ensures you pay only for the compute resources you actually use. This can lead to cost savings, especially for sporadic or low-volume inference needs.
- Competitive performance: Cerebrium only adds up to 50ms of latency to your inference requests. This results in competitive response times from a warm start. Caching mechanisms and highly optimized orchestration pipelines help apps start from a cold state in an average of 2-5 seconds.
- Customizable infrastructure: Cerebrium allows for fine-grained control over the infrastructure specifications, enabling you to optimize for your specific use case.
Migration process
The following walks through migrating a Llama 3.1 8B model from Hugging Face to Cerebrium, from configuration setup to deployment.
1. Cerebrium setup and configuration
Set up the required files and configure the environment.
1.1 Install Cerebrium CLI
First, install the Cerebrium CLI, which is published on PyPI as the cerebrium package (for example, with pip install cerebrium).
1.2 Update your requirements file
Scaffold your application by running cerebrium init [PROJECT_NAME]. During the initialization, a cerebrium.toml is created. This file configures the deployment, hardware, scaling, and dependencies for your Cerebrium project. Update your cerebrium.toml file to reflect the following:
- cerebrium.deployment: Specifies the project name, Python version, base Docker image, and which files to include/exclude as project files.
- cerebrium.hardware: Defines the CPU, memory, and GPU requirements for your deployment.
- cerebrium.scaling: Configures auto-scaling behavior, including minimum and maximum replicas, and cooldown period.
- cerebrium.dependencies.pip: Lists the Python packages required for your project.
1.3 Update your code
Next, update main.py with the model loading and inference logic. The code does the following:
- Authenticates with Hugging Face using a secret token. Add this secret in the Cerebrium dashboard.
- Initializes the Llama 3.1 8B model using vLLM for efficient inference.
- Defines an Item class to structure and validate (using Pydantic) the input parameters.
- Implements a run function that generates text based on the provided prompt and parameters.
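
The shape of main.py described above can be sketched without a GPU. This is not the guide's actual code: to stay runnable anywhere, it swaps Pydantic for a standard-library dataclass with hand-written checks, and it replaces the vLLM call with a stub. The parameter names beyond prompt, and all default values, are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Request parameters. Stands in for the Pydantic model described above."""
    prompt: str
    temperature: float = 0.8
    top_p: float = 0.95
    max_tokens: int = 256

    def __post_init__(self) -> None:
        # Hand-written checks in place of Pydantic's validation.
        if not isinstance(self.prompt, str) or not self.prompt.strip():
            raise ValueError("prompt must be a non-empty string")
        if self.temperature < 0:
            raise ValueError("temperature must be >= 0")
        if not 0 < self.top_p <= 1:
            raise ValueError("top_p must be in (0, 1]")
        if self.max_tokens < 1:
            raise ValueError("max_tokens must be >= 1")


def generate_text(item: Item) -> str:
    """Stub for the model call so the sketch runs without a GPU.

    In a real main.py this is where the vLLM engine, created once at import
    time after logging in to Hugging Face, would generate the completion.
    """
    return f"[completion for: {item.prompt}]"


def run(prompt: str, temperature: float = 0.8, top_p: float = 0.95,
        max_tokens: int = 256) -> dict:
    """Entry point: validate the inputs, generate text, return a JSON-friendly dict."""
    item = Item(prompt=prompt, temperature=temperature, top_p=top_p,
                max_tokens=max_tokens)
    return {"result": generate_text(item)}


print(run("Tell me about serverless GPUs"))
```

Loading the model at module level rather than inside run means the expensive initialization happens once per container start, not once per request.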
2. Deployment
To deploy your app to Cerebrium, run cerebrium deploy in your project directory. This command uses your cerebrium.toml to set up and deploy your model.
3. Using the Deployed Model
Once deployed, you can use your model by sending a POST request containing a prompt to your deployed model’s endpoint; the response contains the model’s output. In that request, replace [CEREBRIUM_API_KEY] with your Inference API key, which can be found in your dashboard under API keys.
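
A minimal sketch of such a request, using only the Python standard library, might look like the following. The endpoint URL is a hypothetical placeholder (copy the real one from your dashboard), and sending the key as a bearer token and the prompt under a "prompt" field are assumptions about the request format rather than details confirmed here.

```python
import json
import urllib.request

# Hypothetical placeholder; copy the real URL from your Cerebrium dashboard.
ENDPOINT_URL = "https://example.invalid/your-project/your-app/run"
API_KEY = "[CEREBRIUM_API_KEY]"


def build_request(url: str, api_key: str, prompt: str) -> urllib.request.Request:
    """Build a JSON POST request carrying the prompt.

    Assumption: the API key is sent as a bearer token.
    """
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def call_model(prompt: str, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON response."""
    request = build_request(ENDPOINT_URL, API_KEY, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Once ENDPOINT_URL and API_KEY are filled in:
# print(call_model("Write a haiku about GPUs"))
```

The generous default timeout is deliberate: a request that hits a cold start waits for infrastructure to become available, so a short client-side timeout could abandon a request that would otherwise succeed.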
Additional Considerations
When migrating, keep the following points in mind:
- API structure: The Cerebrium implementation uses a different API structure compared to Hugging Face
- Authentication: Ensure you have set up the HF_AUTH_TOKEN secret in Cerebrium for authenticating with Hugging Face
- Model permissions: The example uses the Llama 3.1 8B Instruct model. Ensure you have the necessary permissions to use this model
- Hardware optimization: The cerebrium.toml file specifies the hardware requirements. Adjust these based on your specific model and performance needs
- Dependency management: Regularly review and update the dependencies listed in cerebrium.toml to ensure you’re using the latest compatible versions
- Scaling configuration: The example sets up auto-scaling with 0 to 5 replicas and a 30-second cooldown. Monitor your usage patterns and adjust these parameters as needed
- Cold starts: While Cerebrium handles cold starts more gracefully than Hugging Face, be aware that the first request after a period of inactivity may still take longer to process
- Monitoring and logging: Familiarize yourself with Cerebrium’s monitoring and logging capabilities to track your model’s performance and usage
- Cost management: Although Cerebrium’s pay-per-use model can be more cost-effective, set up proper monitoring and alerts to avoid unexpected costs
- Testing: Thoroughly test your migrated models to ensure they perform as expected on the new platform
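
For the cold-start and testing points above, one simple check is to time several consecutive calls and compare the first with the rest. The sketch below does this with a stand-in function; measure_latencies and fake_endpoint are hypothetical names, and in practice the callable would wrap a real request to your deployed endpoint.

```python
import time
from typing import Callable, List


def measure_latencies(call: Callable[[], object], repeats: int = 3) -> List[float]:
    """Time `repeats` consecutive invocations of `call`, in seconds."""
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        call()
        latencies.append(time.perf_counter() - start)
    return latencies


def fake_endpoint() -> None:
    """Stand-in for a request to the deployed model."""
    time.sleep(0.01)


latencies = measure_latencies(fake_endpoint)
print(f"first call: {latencies[0]:.3f}s, "
      f"fastest later call: {min(latencies[1:]):.3f}s")
```

Running this after the app has been idle for longer than its cooldown gives a rough cold-versus-warm comparison to set client timeouts against.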