Supercharge your auto scaling for generative AI inference – Introducing Container Caching in SageMaker Inference

Today at AWS re:Invent 2024, we are excited to announce the new Container Caching capability in Amazon SageMaker, which significantly reduces the time required to scale generative AI models for inference. This innovation allows you to scale your models faster, with up to a 56% reduction in end-to-end scaling latency when adding a model copy on an available instance and up to 30% when adding a model copy on a new instance. These improvements are available across a wide range of SageMaker’s Deep Learning Containers (DLCs), including Large Model Inference (LMI, powered by vLLM and multiple other frameworks), Hugging Face Text Generation Inference (TGI), PyTorch (powered by TorchServe), and NVIDIA Triton. Fast container startup times are critical to scale generative AI models effectively, making sure end-users aren’t negatively impacted as inference demand increases.

As generative AI models and their hosting containers grow in size and complexity, scaling these models efficiently for inference becomes increasingly challenging. Until now, each time SageMaker scaled up an inference endpoint by adding new instances, it needed to pull the container image (often several tens of gigabytes in size) from Amazon Elastic Container Registry (Amazon ECR), a process that could take minutes. For generative AI models requiring multiple instances to handle high-throughput inference requests, this added significant overhead to the total scaling time, potentially impacting application performance during traffic spikes.

Container Caching addresses this scaling challenge by pre-caching the container image, eliminating the need to download it when scaling up. This new feature brings several key benefits for generative AI inference workloads: dramatically faster scaling to handle traffic spikes, improved resource utilization on GPU instances, and potential cost savings through more efficient scaling and reduced idle time during scale-up events. These benefits are particularly impactful for popular frameworks and tools like vLLM-powered LMI, Hugging Face TGI, PyTorch with TorchServe, and NVIDIA Triton, which are widely used in deploying and serving generative AI models on SageMaker inference.

In our tests, we’ve seen substantial improvements in scaling times for generative AI model endpoints across various frameworks. The implementation of Container Caching for running Llama3.1 70B model showed significant and consistent improvements in end-to-end (E2E) scaling times. We ran 5+ scaling simulations and observed consistent performance with low variations across trials. When scaling the model on an available instance, the E2E scaling time was reduced from 379 seconds (6.32 minutes) to 166 seconds (2.77 minutes), resulting in an absolute improvement of 213 seconds (3.55 minutes), or a 56% reduction in scaling time. This enhancement allows customers running high-throughput production workloads to handle sudden traffic spikes more efficiently, providing more predictable scaling behavior and minimal impact on end-user latency across their ML infrastructure, regardless of the chosen inference framework.

In this post, we explore the new Container Caching feature for SageMaker inference, addressing the challenges of deploying and scaling large language models (LLMs). We discuss how this innovation significantly reduces container download and load times during scaling events, a major bottleneck in LLM and generative AI inference. You’ll learn about the key benefits of Container Caching, including faster scaling, improved resource utilization, and potential cost savings. We showcase its real-world impact on various applications, from chatbots to content moderation systems. We then guide you through getting started with Container Caching, explaining its automatic enablement for SageMaker-provided DLCs and how to reference cached versions. Finally, we delve into the supported frameworks, with a focus on LMI, PyTorch, Hugging Face TGI, and NVIDIA Triton, and conclude by discussing how this feature fits into our broader efforts to enhance machine learning (ML) workloads on AWS.

This feature is only supported when using inference components. For more information on inference components, see Reduce model deployment costs by 50% on average using the latest features of Amazon SageMaker.

The challenge of deploying LLMs for inference

As LLMs and their respective hosting containers continue to grow in size and complexity, AI and ML engineers face increasing challenges in deploying and scaling these models efficiently for inference. The rapid evolution of LLMs, with some models now using hundreds of billions of parameters, has led to a significant increase in the computational resources and sophisticated infrastructure required to run them effectively.

One of the primary bottlenecks in the deployment process is the time required to download and load containers when scaling up endpoints or launching new instances. This challenge is particularly acute in dynamic environments where rapid scaling is crucial to maintain service quality. The sheer size of these containers, often ranging from several gigabytes to tens of gigabytes, can lead to substantial delays in the scaling process.

When a scale-up event occurs, several actions take place, each contributing to the total time between triggering a scale-up event and serving traffic from the newly added instances. These actions typically include:

Provisioning new compute resources
Downloading the container image
Loading the container image
Loading the model weights into memory
Initializing the inference runtime
Shifting traffic to serve new requests

The cumulative time for these steps can range from several minutes to tens of minutes, depending on the model size, runtime used by the model, and infrastructure capabilities. This delay can lead to suboptimal user experiences and potential service degradation during traffic spikes, making it a critical area for optimization in the field of AI inference infrastructure.

The introduction of Container Caching for SageMaker DLCs brings several key benefits for inference workloads:

Faster scaling – By having the latest DLCs pre-cached, the time required to scale inference endpoints in response to traffic spikes is substantially reduced. This provides a more consistent and responsive experience for inference hosting, allowing systems to adapt quickly to changing demand patterns. ML engineers can now design more aggressive auto scaling policies, knowing that new instances can be brought online in a fraction of the time previously required.
Quick endpoint startup – Using pre-cached containers significantly decreases the startup time for new model deployments. This acceleration in the deployment pipeline enables more frequent model updates and iterations, fostering a more agile development cycle. AI and ML engineers can now move from model training to production deployment with unprecedented speed, reducing time-to-market for new AI features and improvements.
Improved resource utilization – Container Caching minimizes idle time on expensive GPU instances during the initialization phase. Instead of waiting for container downloads, these high-performance resources can immediately focus on inference tasks. This optimization provides more efficient use of computational resources, potentially allowing for higher throughput and better cost-effectiveness.
Cost savings – The cumulative effect of faster deployments and more efficient scaling can lead to significant reductions in overall inference costs. By minimizing idle time and improving resource utilization, organizations can potentially serve the same workload with fewer instances or handle increased demand without proportional increases in infrastructure costs. Additionally, the improved responsiveness can lead to better user experiences, potentially driving higher engagement and revenue in customer-facing applications.
Enhanced compatibility – By focusing on the latest SageMaker DLCs, this caching mechanism makes sure users always have quick access to the most recent and optimized environments for their models. This can be particularly beneficial for teams working with cutting-edge AI technologies that require frequent updates to the underlying frameworks and libraries.

Container Caching represents a significant advancement in AI inference infrastructure. It addresses a critical bottleneck in the deployment process, empowering organizations to build more responsive, cost-effective, and scalable AI systems.

Getting started with Container Caching for inference

Container Caching is automatically enabled for popular SageMaker DLCs like LMI, Hugging Face TGI, NVIDIA Triton, and PyTorch used for inference. To use cached containers, you only need to make sure you’re using a supported SageMaker container. No additional configuration or steps are required.

The following table lists the supported DLCs.

SageMaker DLC	Starting Version	Starting Container
LMI	0.29.0	763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124
LMI-TRT	0.29.0	763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.29.0-tensorrtllm0.11.0-cu124
LMI-Neuron	0.29.0	763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.29.0-neuronx-sdk2.19.1
TGI-GPU	2.4.0	763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.4.0-gpu-py311-cu124-ubuntu22.04-v2.0
TGI-Neuron	2.1.2	763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.25-neuronx-py310-ubuntu22.04-v1.0
Pytorch-GPU	2.5.1	763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.5.1-gpu-py311-cu124-ubuntu22.04-sagemaker
Pytorch-CPU	2.5.1	763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.5.1-cpu-py311-ubuntu22.04-sagemaker
Triton	24.09	763104351884.dkr.ecr.us-west-2.amazonaws.com/sagemaker-tritonserver:24.09-py3
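As a rough way to check an image against this table before deploying, the following sketch parses the repository name and tag out of an image URI and compares the leading version in the tag with the starting version listed above. The names `STARTING_VERSIONS`, `parse_image_uri`, `leading_version`, and `meets_starting_version` are hypothetical helpers, not part of any SageMaker SDK. Treating every tag at or above the starting version as cached is an assumption drawn from the table; SageMaker makes the actual caching decision.

```python
import re

# Starting versions copied from the table above, keyed by
# (ECR repository name, substring that identifies the variant in the tag).
# This lookup is illustrative only; it is not a SageMaker API.
STARTING_VERSIONS = {
    ("djl-inference", "lmi"): (0, 29, 0),
    ("djl-inference", "tensorrtllm"): (0, 29, 0),
    ("djl-inference", "neuronx"): (0, 29, 0),
    ("huggingface-pytorch-tgi-inference", "gpu"): (2, 4, 0),
    ("huggingface-pytorch-tgi-inference", "neuronx"): (2, 1, 2),
    ("pytorch-inference", "gpu"): (2, 5, 1),
    ("pytorch-inference", "cpu"): (2, 5, 1),
    ("sagemaker-tritonserver", "py3"): (24, 9),
}


def parse_image_uri(image_uri):
    """Split '<registry>/<repository>:<tag>' into (repository, tag)."""
    path, _, tag = image_uri.rpartition(":")
    repository = path.rsplit("/", 1)[-1]
    return repository, tag


def leading_version(tag):
    """Return the dotted version at the start of a tag as a tuple of ints."""
    match = re.match(r"(\d+(?:\.\d+)*)", tag)
    if match is None:
        raise ValueError(f"no leading version in tag: {tag!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_starting_version(image_uri):
    """True if the image matches a table entry at or above its starting version."""
    repository, tag = parse_image_uri(image_uri)
    for (repo, variant), minimum in STARTING_VERSIONS.items():
        if repo == repository and variant in tag:
            return leading_version(tag) >= minimum
    return False


print(meets_starting_version(
    "763104351884.dkr.ecr.us-west-2.amazonaws.com/sagemaker-tritonserver:24.09-py3"
))
```

Images that do not match any row of the table, such as custom containers, simply return `False` in this sketch.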

In the following sections, we discuss how to get started with several popular SageMaker DLCs.

Hugging Face TGI

Developed by Hugging Face, TGI is an inference framework for deploying and serving LLMs, offering a purpose-built solution that combines security, performance, and ease of management. TGI is specifically designed to deliver high-performance text generation through advanced features like tensor parallelism and continuous batching. It supports a wide range of popular open source LLMs, making it a common choice for diverse AI applications. What sets TGI apart is its optimization for both NVIDIA GPUs and AWS accelerators such as AWS Inferentia and AWS Trainium, providing optimal performance across different hardware configurations.

With the introduction of Container Caching, customers using the latest release of TGI containers on SageMaker will experience improved scaling performance. The caching mechanism works automatically, requiring no additional configuration or code changes. This seamless integration means that organizations can immediately benefit from faster scaling without any operational overhead.

Philipp Schmid, Technical Lead at Hugging Face, shares his perspective on this enhancement: “Hugging Face TGI containers are widely used by SageMaker inference customers, offering a powerful solution optimized for running popular models from the Hugging Face. We are excited to see Container Caching speed up auto scaling for users, expanding the reach and adoption of open models from Hugging Face.”

You can use Container Caching with Hugging Face TGI using the following code:

# Using Container Caching for Hugging Face TGI
# Create an inference component (IC) with the Hugging Face TGI image
create_inference_component(
    image="763104351884.dkr.ecr.<region>.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.4.0-gpu-py311-cu124-ubuntu22.04-v2.0",
    model_url="s3://path/to/your/model/artifacts"
)

Note: We cache the latest version of currently maintained images. For the list of images, see https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only
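The snippet above is simplified pseudocode. As a hedged sketch of what an equivalent request might look like with the AWS SDK for Python (Boto3), the following builds the arguments for the SageMaker `create_inference_component` API without sending them. The helper name `build_inference_component_request`, the component, endpoint, and variant names, and the resource numbers are placeholders chosen for illustration and do not come from this post. The endpoint is assumed to already exist and to be configured for inference components.

```python
def build_inference_component_request(
    component_name,
    endpoint_name,
    image_uri,
    artifact_url,
    accelerators=1,
    min_memory_mb=1024,
    copy_count=1,
    variant_name="AllTraffic",
):
    """Assemble keyword arguments for SageMaker's CreateInferenceComponent API.

    All default values are illustrative assumptions; size them for your model.
    """
    return {
        "InferenceComponentName": component_name,
        "EndpointName": endpoint_name,
        "VariantName": variant_name,
        "Specification": {
            "Container": {
                "Image": image_uri,           # a supported SageMaker DLC
                "ArtifactUrl": artifact_url,  # S3 location of the model artifacts
            },
            "ComputeResourceRequirements": {
                "NumberOfAcceleratorDevicesRequired": accelerators,
                "MinMemoryRequiredInMb": min_memory_mb,
            },
        },
        "RuntimeConfig": {"CopyCount": copy_count},
    }


request = build_inference_component_request(
    component_name="tgi-llm-ic",        # hypothetical name
    endpoint_name="my-genai-endpoint",  # hypothetical; must already exist
    image_uri=(
        "763104351884.dkr.ecr.us-west-2.amazonaws.com/"
        "huggingface-pytorch-tgi-inference:"
        "2.4.0-tgi2.4.0-gpu-py311-cu124-ubuntu22.04-v2.0"
    ),
    artifact_url="s3://path/to/your/model/artifacts",
)

# Sending the request needs AWS credentials and an existing endpoint:
# import boto3
# boto3.client("sagemaker").create_inference_component(**request)
```

Because caching is tied to the container image, the same request shape should apply to the other DLCs in this post by swapping the `image_uri` value.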

NVIDIA Triton

NVIDIA Triton Inference Server is a model server from NVIDIA that supports multiple deep learning frameworks and model formats. On SageMaker, Triton offers a comprehensive serving stack with support for various backends, including TensorRT, PyTorch, Python, and more. Triton is particularly powerful because of its ability to optimize inference across different hardware configurations while providing features like dynamic batching, concurrent model execution, and ensemble models. The Triton architecture enables efficient model serving through features like multi-framework support, optimized GPU utilization, and flexible model management.

With Container Caching, Triton deployments on SageMaker become even more efficient, especially when scaling large-scale inference workloads. This is particularly beneficial when deploying multiple models using Triton’s Python backend or when running model ensembles that require complex preprocessing and postprocessing pipelines. This improves the deployment and scaling experience for Triton workloads by eliminating the need to repeatedly download container images during scaling events.

Eliuth Triana, Global Lead Amazon Developer Relations at NVIDIA, comments on this enhancement:

“The integration of Container Caching with NVIDIA Triton Inference Server on SageMaker represents a significant advancement in serving machine learning models at scale. This feature perfectly complements Triton’s advanced serving capabilities by reducing deployment latency and optimizing resource utilization during scaling events. For customers running production workloads with Triton’s multi-framework support and dynamic batching, Container Caching provides faster response to demand spikes while maintaining Triton’s performance optimizations.”

To use Container Caching with NVIDIA Triton, use the following code:

# Using Container Caching for Triton
create_inference_component(
    image="763104351884.dkr.ecr.<region>.amazonaws.com/sagemaker-tritonserver:24.09-py3",
    model_url="s3://path/to/your/model/artifacts"
)

PyTorch and TorchServe (now with vLLM engine integration)

The SageMaker Deep Learning Container for PyTorch is powered by TorchServe. It offers a comprehensive solution for deploying and serving PyTorch models, including large language models (LLMs), in production environments. TorchServe provides robust model serving capabilities through HTTP REST APIs, with flexible configuration options and performance optimization features such as server-side batching, multi-model serving, and dynamic model loading. The container supports a wide range of models and advanced features, including quantization and parameter-efficient methods like LoRA.

The latest version of the PyTorch container also uses TorchServe integrated with the vLLM engine, which brings advanced features such as PagedAttention and continuous batching. It supports single-node, multi-GPU distributed inference, enabling tensor parallel sharding for larger models. The integration of Container Caching significantly reduces scaling times, which is particularly beneficial for large models during auto scaling events. TorchServe’s handler system allows for easy customization of pre- and post-processing logic, making it adaptable to various use cases. With its growing feature set, TorchServe is a popular choice for deploying and scaling machine learning models among inference customers.

You can use Container Caching with PyTorch using the following code:

# Using Container Caching for PyTorch
create_inference_component(
    image="763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-inference:2.5.1-gpu-py311-cu124-ubuntu22.04-sagemaker",
    model_url="s3://path/to/your/model/artifacts"
)

LMI container

The Large Model Inference (LMI) container is a high-performance serving solution that can be used through a no-code interface with smart defaults that can be extended to fit your unique needs. LMI delivers performance differentiation through advanced optimizations, outpacing open source backends like vLLM, TensorRT-LLM, and Transformers NeuronX while offering a unified UI.

Popular features such as continuous batching, token streaming, and speculative decoding are available out of the box to provide superior throughput, latency, and scalability. LMI supports a wide array of use cases like multi-node inference and model personalization through LoRA adapters, and performance optimizations like quantization and compilation.

With Container Caching, LMI containers deliver even faster scaling capabilities, particularly beneficial for large-scale LLM deployments where container startup times can significantly impact auto scaling responsiveness. This enhancement works seamlessly across all supported backends while maintaining the container’s advanced features and optimization capabilities.

Contributors of LMI containers comment on this enhancement:

“The addition of Container Caching to LMI containers represents a significant step forward in making LLM deployment more efficient and responsive. This feature complements our efforts to speed up model loading through pre-sharding, weight streaming, and compiler caching, enabling customers to achieve both high-performance inference and rapid scaling capabilities, which is crucial for production LLM workloads.”

To use Container Caching with LMI, use the following code:

# Using Container Caching for LMI
create_inference_component(
    image="763104351884.dkr.ecr.<region>.amazonaws.com/djl-inference:0.30.0-lmi12.0.0-cu124",
    model_url="s3://path/to/your/model/artifacts"
)

Performance evaluation

The implementation of Container Caching for running Llama3.1 70B model showed significant and consistent improvements in end-to-end (E2E) scaling times. We ran 5+ scaling simulations and observed consistent performance with low variations across trials. When scaling the model on an available instance, the E2E scaling time was reduced from 379 seconds (6.32 minutes) to 166 seconds (2.77 minutes), resulting in an absolute improvement of 213 seconds (3.55 minutes), or a 56% reduction in scaling time. For the scenario of scaling the model by adding a new instance, the E2E scaling time decreased from 580 seconds (9.67 minutes) to 407 seconds (6.78 minutes), yielding an improvement of 172 seconds (2.87 minutes), which translates to a 30% reduction in scaling time. These results demonstrate that Container Caching substantially and reliably enhances the efficiency of model scaling operations, particularly for large language models like Llama3.1 70B, with more pronounced benefits observed when scaling on existing instances.

To run this benchmark, we use sub-minute metrics to detect the need for scaling. For more details, see Amazon SageMaker inference launches faster auto scaling for generative AI models.
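The exact scaling policy used in the benchmark is not given in this post. As a minimal sketch of how a target tracking policy on a sub-minute metric might be set up for an inference component, the following builds the request arguments for the Application Auto Scaling `register_scalable_target` and `put_scaling_policy` calls without sending them. The helper name `build_autoscaling_requests`, the inference component name, the capacity limits, and the target value are all assumptions for illustration. The predefined metric type should be confirmed against the current Application Auto Scaling documentation.

```python
def build_autoscaling_requests(
    inference_component_name,
    min_copies=1,
    max_copies=4,
    target_concurrent_requests_per_copy=5.0,
):
    """Build Application Auto Scaling requests for an inference component.

    Capacity limits and the target value are illustrative assumptions.
    """
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"inference-component/{inference_component_name}",
        "ScalableDimension": "sagemaker:inference-component:DesiredCopyCount",
    }
    scalable_target = {
        **common,
        "MinCapacity": min_copies,
        "MaxCapacity": max_copies,
    }
    scaling_policy = {
        **common,
        "PolicyName": f"{inference_component_name}-concurrency-target",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "PredefinedMetricSpecification": {
                # High-resolution (sub-minute) concurrency metric; confirm the
                # name against the Application Auto Scaling documentation.
                "PredefinedMetricType": (
                    "SageMakerInferenceComponentConcurrentRequestsPerCopyHighResolution"
                ),
            },
            "TargetValue": target_concurrent_requests_per_copy,
        },
    }
    return scalable_target, scaling_policy


# "tgi-llm-ic" is a hypothetical inference component name.
scalable_target, scaling_policy = build_autoscaling_requests("tgi-llm-ic")

# Applying the policy needs AWS credentials and an existing inference component:
# import boto3
# autoscaling = boto3.client("application-autoscaling")
# autoscaling.register_scalable_target(**scalable_target)
# autoscaling.put_scaling_policy(**scaling_policy)
```

Scaling on copy count rather than instance count matches the two scenarios measured below: a new copy lands on an available instance when one has capacity, and otherwise a new instance has to be added first.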

The following table summarizes our setup.

Region	us-east-2 (CMH)
Instance Type	p4d.24xlarge
Container	LMI V13.31
Container Image	763104351884.dkr.ecr.us-east-2.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124
Model	Llama 3.1 70B

Scaling the model by adding a new instance

For this scenario, we explore scaling the model by adding a new instance.

The following table summarizes the results when containers are not cached.

Meta Llama 3.1 70B
Trial	Time to Detect Need for Scaling (s)	Time to Spin Up an Instance (s)	Time to Instantiate a New Model Copy (s)	End-to-End Scaling Latency (s)
1	40	223	339	602
2	40	203	339	582
3	40	175	339	554
4	40	210	339	589
5	40	191	339	570
Average		200	339	580

The following table summarizes the results after containers are cached.

Meta Llama 3.1 70B
Trial	Time to Detect Need for Scaling (s)	Time to Spin Up an Instance (s)	Time to Instantiate a New Model Copy (s)	End-to-End Scaling Latency (s)
1	40	185	173	398
2	40	175	188	403
3	40	164	208	412
4	40	185	187	412
5	40	185	187	412
Average		178.8	188.6	407.4

Scaling the model on an available instance

In this scenario, we explore scaling the model on an available instance.

The following table summarizes the results when containers are not cached.

Meta Llama 3.1 70B
Trial	Time to Detect Need for Scaling (s)	Time to Instantiate a New Model Copy (s)	End-to-End Scaling Latency (s)
1	40	339	379
2	40	339	379
3	40	339	379
4	40	339	379
5	40	339	379
Average		339	379

The following table summarizes the results after containers are cached.

Meta Llama 3.1 70B
Trial	Time to Detect Need for Scaling (s)	Time to Instantiate a New Model Copy (s)	End-to-End Scaling Latency (s)
1	40	150	190
2	40	122	162
3	40	121	161
4	40	119	159
5	40	119	159
Average		126.2	166.2

Summary of findings

The following table summarizes our results in both scenarios.

Scenario	End-to-End Scaling Time Before (s)	End-to-End Scaling Time After (s)	Improvement (s)	Improvement (%)
Scaling the model on an available instance	379	166	213	56
Scaling the model by adding a new instance	580	407	172	30
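The summary figures can be reproduced from the per-trial tables. The following sketch recomputes the average end-to-end latency for each scenario and the resulting improvement; the trial values are copied from the tables above, while the `E2E_SECONDS` and `summarize` names are only illustrative.

```python
from statistics import mean

# End-to-end scaling latency per trial, in seconds, copied from the tables above.
E2E_SECONDS = {
    "available instance": {
        "not cached": [379, 379, 379, 379, 379],
        "cached": [190, 162, 161, 159, 159],
    },
    "new instance": {
        "not cached": [602, 582, 554, 589, 570],
        "cached": [398, 403, 412, 412, 412],
    },
}


def summarize(trials):
    """Return (average before, average after, seconds saved, percent saved)."""
    before = mean(trials["not cached"])
    after = mean(trials["cached"])
    saved = before - after
    return before, after, saved, 100 * saved / before


for scenario, trials in E2E_SECONDS.items():
    before, after, saved, percent = summarize(trials)
    print(
        f"{scenario}: {before:.1f}s -> {after:.1f}s, "
        f"saved {saved:.1f}s ({percent:.1f}%)"
    )
```

The unrounded averages for the new-instance scenario are 579.4 and 407.4 seconds, which is why the table reports an improvement of 172 seconds even though the rounded values 580 and 407 differ by 173.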

Customers using On-Demand Capacity Reservations (ODCRs) for GPUs may experience a shorter time to spin up new instances than with On-Demand Instances, depending on instance type.

Conclusion

Container Caching for inference is just one of the many ways SageMaker can improve the efficiency and performance of ML workloads on AWS. We encourage you to try out this new feature for your inference workloads and share your experiences with us. Your feedback is invaluable as we continue to innovate and improve our ML platform.

To learn more about Container Caching and other SageMaker features for inference, refer to Amazon SageMaker Documentation or check out our GitHub repositories for examples and tutorials on deploying models for inference.

About the Authors

Wenzhao Sun, PhD, is a Sr. Software Dev Engineer with the SageMaker Inference team. He possesses a strong passion for pushing the boundaries of technical solutions, striving to maximize their theoretical potential. His primary focus is on delivering secure, high-performance, and user-friendly machine learning features for AWS customers. Outside of work, he enjoys traveling and video games.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Aakash Deep is a Software Development Engineering Manager with the Amazon SageMaker Inference team. He enjoys working on machine learning and distributed systems. His mission is to deliver secure, highly performant, highly scalable and user friendly machine learning features for AWS customers. Outside of work, he enjoys hiking and traveling.

Anisha Kolla is a Software Development Engineer with the SageMaker Inference team with over 10 years of industry experience. She is passionate about building scalable and efficient solutions that empower customers to deploy and manage machine learning applications seamlessly. Anisha thrives on tackling complex technical challenges and contributing to innovative AI capabilities. Outside of work, she enjoys exploring fusion cuisines, traveling, and spending time with family and friends.