Speed up your AI inference workloads with new NVIDIA-powered capabilities in Amazon SageMaker

This post is co-written with Abhishek Sawarkar, Eliuth Triana, Jiahong Liu and Kshitiz Gupta from NVIDIA.

At re:Invent 2024, we are excited to announce new capabilities to speed up your AI inference workloads with NVIDIA accelerated computing and software offerings on Amazon SageMaker. These advancements build upon our collaboration with NVIDIA, which includes adding support for inference-optimized GPU instances and integration with NVIDIA technologies. They represent our continued commitment to delivering scalable, cost-effective, and flexible GPU-accelerated AI inference capabilities to our customers.

Today, we are introducing three key advancements that further expand our AI inference capabilities:

NVIDIA NIM microservices are now available in AWS Marketplace for SageMaker Inference deployments, providing customers with easy access to state-of-the-art generative AI models.
NVIDIA Nemotron-4 is now available on Amazon SageMaker JumpStart, significantly expanding the range of high-quality, pre-trained models available to our customers. This integration provides a powerful multilingual model that excels in reasoning benchmarks.
Inference-optimized P5e and G6e instances are now generally available on Amazon SageMaker, giving customers access to NVIDIA H200 Tensor Core and L40S GPUs for AI inference workloads.

In this post, we will explore how you can use these new capabilities to enhance your AI inference on Amazon SageMaker. We’ll walk through the process of deploying NVIDIA NIM microservices from AWS Marketplace for SageMaker Inference. We’ll then dive into NVIDIA’s model offerings on SageMaker JumpStart, showcasing how to access and deploy the Nemotron-4 model directly in the JumpStart interface. This will include step-by-step instructions on how to find the Nemotron-4 model in the JumpStart catalog, select it for your use case, and deploy it with a few clicks. We’ll also demonstrate how to fine-tune and optimize this model for your specific requirements. Additionally, we’ll introduce you to the new inference-optimized P5e and G6e instances powered by NVIDIA H200 and L40S GPUs, showcasing how they can significantly boost your AI inference performance. By the end of this post, you’ll have a practical understanding of how to implement these advancements in your own AI projects, enabling you to accelerate your inference workloads and drive innovation in your organization.

Announcing NVIDIA NIM in AWS Marketplace for SageMaker Inference

NVIDIA NIM, part of the NVIDIA AI Enterprise software platform, offers a set of high-performance microservices designed to help organizations rapidly deploy and scale generative AI applications on NVIDIA-accelerated infrastructure. SageMaker Inference is a fully managed capability for customers to run generative AI and machine learning models at scale, providing purpose-built features and a broad array of inference-optimized instances. AWS Marketplace serves as a curated digital catalog where customers can find, buy, deploy, and manage third-party software, data, and services needed to build solutions and run businesses. We’re excited to announce that AWS customers can now access NVIDIA NIM microservices for SageMaker Inference deployments through the AWS Marketplace, simplifying the deployment of generative AI models and helping partners and enterprises to scale their AI capabilities. The initial availability includes a portfolio of models packaged as NIM microservices, expanding the options for AI inference on Amazon SageMaker, including:

NVIDIA Nemotron-4: a cutting-edge large language model (LLM) designed to generate diverse synthetic data that closely mimics real-world data, enhancing the performance and robustness of custom LLMs across various domains.
Llama 3.1 8B-Instruct: an 8-billion-parameter multilingual LLM that is a pre-trained and instruction-tuned generative model optimized for language understanding, reasoning, and text generation use cases.
Llama 3.1 70B-Instruct: a 70-billion-parameter pre-trained, instruction-tuned model optimized for multilingual dialogue.
Mixtral 8x7B Instruct v0.1: a high-quality sparse mixture of experts model (SMoE) with open weights that can follow instructions, complete requests, and generate creative text formats.

Key benefits of deploying NIM on AWS

Ease of deployment: AWS Marketplace integration makes it straightforward to select and deploy models directly, eliminating complex setup processes. Select your preferred model from the marketplace, configure your infrastructure options, and deploy within minutes.
Seamless integration with AWS services: AWS offers robust infrastructure options, including GPU-optimized instances for inference, managed AI services such as SageMaker, and Kubernetes support with EKS, helping your deployments scale effectively.
Security and control: Maintain full control over your infrastructure settings on AWS, allowing you to optimize your runtime environments to match specific use cases.

How to get started with NVIDIA NIM on AWS

To deploy NVIDIA NIM microservices from the AWS Marketplace, follow these steps:

Visit the NVIDIA NIM page on the AWS Marketplace and select your desired model, such as Llama 3.1 or Mixtral.
Choose the AWS Regions to deploy to, GPU instance types, and resource allocations to fit your needs.
Use the notebook examples to start your deployment: with SageMaker, you create the model, configure the endpoint, and deploy the model. AWS handles the orchestration of resources, networking, and scaling as needed.
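
The three SageMaker steps above (create the model, configure the endpoint, deploy) can be sketched with the boto3 SageMaker client. This is a minimal sketch under stated assumptions, not the code from the example notebooks: the helper names `build_requests` and `deploy_nim` are hypothetical, and the model package ARN, IAM role ARN, resource name, and instance type are placeholders you would take from your own AWS Marketplace subscription and account.

```python
from typing import Any, Dict


def build_requests(name: str, model_package_arn: str, role_arn: str,
                   instance_type: str, instance_count: int = 1) -> Dict[str, Dict[str, Any]]:
    """Build the three request payloads without calling AWS."""
    return {
        "create_model": {
            "ModelName": name,
            "ExecutionRoleArn": role_arn,
            # A Marketplace subscription is referenced by its model package ARN.
            "PrimaryContainer": {"ModelPackageName": model_package_arn},
        },
        "create_endpoint_config": {
            "EndpointConfigName": name,
            "ProductionVariants": [{
                "VariantName": "AllTraffic",
                "ModelName": name,
                "InstanceType": instance_type,
                "InitialInstanceCount": instance_count,
            }],
        },
        "create_endpoint": {"EndpointName": name, "EndpointConfigName": name},
    }


def deploy_nim(sm_client: Any, **kwargs: Any) -> str:
    """Create the model, endpoint config, and endpoint, then wait for InService."""
    requests = build_requests(**kwargs)
    sm_client.create_model(**requests["create_model"])
    sm_client.create_endpoint_config(**requests["create_endpoint_config"])
    sm_client.create_endpoint(**requests["create_endpoint"])
    # Blocks until the endpoint reports InService (or the waiter times out).
    sm_client.get_waiter("endpoint_in_service").wait(EndpointName=kwargs["name"])
    return kwargs["name"]
```

With credentials configured, `sm_client` would be `boto3.client("sagemaker")`. Keeping the payload construction separate from the API calls makes it possible to review or log the requests before anything is created.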

NVIDIA NIM microservices in the AWS Marketplace facilitate seamless deployment in SageMaker so that organizations across various industries can develop, deploy, and scale their generative AI applications more quickly and effectively.

SageMaker JumpStart now includes NVIDIA models: Introducing NVIDIA NIM microservices for Nemotron models

SageMaker JumpStart is a model hub and no-code solution within SageMaker that makes advanced AI inference capabilities more accessible to AWS customers by providing a streamlined path to access and deploy popular models from different providers. It offers an intuitive interface where organizations can easily deploy popular AI models with a few clicks, eliminating the complexity typically associated with model deployment and infrastructure management. The integration offers enterprise-grade features including model evaluation metrics, fine-tuning and customization capabilities, and collaboration tools, all while giving customers full control of their deployment.

We are excited to announce that NVIDIA models are now available in SageMaker JumpStart, marking a significant milestone in our ongoing collaboration. This integration brings NVIDIA’s cutting-edge AI models directly to SageMaker Inference customers, starting with the powerful Nemotron-4 model. With JumpStart, customers can access NVIDIA’s state-of-the-art models within the SageMaker ecosystem and combine them with the scalable, price-performant inference of SageMaker.

Support for Nemotron-4 – A multilingual and fine-grained reasoning model

We are also excited to announce that NVIDIA Nemotron-4 is now available in the JumpStart model hub. Nemotron-4 is a cutting-edge LLM designed to generate diverse synthetic data that closely mimics real-world data, enhancing the performance and robustness of custom LLMs across various domains. Compact yet powerful, it has been fine-tuned on carefully curated datasets that emphasize high-quality sources and underrepresented domains. This refined approach enables strong results in commonsense reasoning, mathematical problem-solving, and programming tasks. Moreover, Nemotron-4 exhibits outstanding multilingual capabilities compared to similarly sized models, and even outperforms models over four times larger and models explicitly specialized for multilingual tasks.

Nemotron-4 – performance and optimization benefits

Nemotron-4 demonstrates great performance in common sense reasoning tasks like SIQA, ARC, PIQA, and Hellaswag with an average score of 73.4, outperforming similarly sized models and demonstrating similar performance against larger ones such as Llama-2 34B. Its exceptional multilingual capabilities also surpass specialized models like mGPT 13B and XGLM 7.5B on benchmarks like XCOPA and TyDiQA, highlighting its versatility and efficiency. When deployed through NVIDIA NIM microservices on SageMaker, these models deliver optimized inference performance, allowing businesses to generate and validate synthetic data with unprecedented speed and accuracy.

Through SageMaker JumpStart, customers can access pre-optimized models from NVIDIA that significantly simplify deployment and management. These containers are specifically tuned for NVIDIA GPUs on AWS, providing optimal performance out of the box. NIM microservices deliver efficient deployment and scaling, allowing organizations to focus on their use cases rather than infrastructure management.

Quick start guide

From the SageMaker Studio console, select JumpStart and choose the NVIDIA model family.
Select the NVIDIA Nemotron-4 NIM microservice.
On the model details page, choose Deploy, and a pop-up window will remind you that you need an AWS Marketplace subscription. If you haven’t subscribed to this model, you can choose Subscribe, which will direct you to the AWS Marketplace to complete the subscription. Otherwise, you can choose Deploy to proceed with model deployment.
On the model deployment page, you can configure the endpoint name and select the endpoint instance type and instance count, in addition to other advanced settings, such as the IAM role and VPC settings.

After you finish setting up the endpoint and choose Deploy at the bottom right corner, the NVIDIA Nemotron-4 model will be deployed to a SageMaker endpoint. After the endpoint’s status is InService, you can start testing the model by invoking the endpoint using the following code. Take a look at the example notebook if you want to deploy the model programmatically.

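```
import json

import boto3

# Runtime client used to invoke deployed SageMaker endpoints
client = boto3.client("sagemaker-runtime")

endpoint_name = "<your-endpoint-name>"  # placeholder: the name set on the deployment page
payload_model = "<model-name>"  # placeholder: the model identifier the NIM container expects

messages = [
    {"role": "user", "content": "Hello! How are you?"},
    {"role": "assistant", "content": "Hi! I am quite well, how can I help you today?"},
    {"role": "user", "content": "Write a short limerick about the wonders of GPU Computing."},
]
payload = {
    "model": payload_model,
    "messages": messages,
    "max_tokens": 100,
    "stream": True,
}
# Stream the generated tokens back as they are produced
response = client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    Body=json.dumps(payload),
    ContentType="application/json",
    Accept="application/jsonlines",
)
```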
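
The call above returns an event stream rather than a finished string, so the generated text still has to be assembled. The following is a minimal sketch of one way to do that. It assumes the container emits OpenAI-style chat completion chunks (a `choices` list with a `delta.content` field), optionally prefixed with `data:`; that schema is an assumption here, and the helper name `iter_stream_text` is hypothetical, so adjust the field access to match what your endpoint returns.

```python
import codecs
import json
from typing import Any, Dict, Iterable, Iterator


def iter_stream_text(events: Iterable[Dict[str, Any]]) -> Iterator[str]:
    """Yield generated text fragments from a SageMaker response stream."""
    json_decoder = json.JSONDecoder()
    # Incremental decoding tolerates multi-byte characters split across events.
    utf8 = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    for event in events:
        # Each event carries raw bytes; one JSON object may span several events.
        buffer += utf8.decode(event.get("PayloadPart", {}).get("Bytes", b""))
        while True:
            start = buffer.find("{")
            if start == -1:
                break  # no JSON object starts in the buffered text
            try:
                chunk, end = json_decoder.raw_decode(buffer[start:])
            except json.JSONDecodeError:
                break  # incomplete object: wait for more bytes
            buffer = buffer[start + end:]
            for choice in chunk.get("choices", []):
                text = (choice.get("delta") or {}).get("content")
                if text:
                    yield text
```

With a live endpoint, the response from the previous snippet could be printed as it arrives with `for text in iter_stream_text(response["Body"]): print(text, end="")`.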

To clean up the endpoint, you can delete the endpoint from the SageMaker Studio console or call the delete endpoint API.
```
import boto3
boto3.client("sagemaker").delete_endpoint(EndpointName=endpoint_name)
```
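
Deleting the endpoint alone leaves the endpoint configuration and model resources in place. If you want to remove those as well, the cleanup could be sketched as follows. The helper name `delete_endpoint_resources` is hypothetical, and `sm_client` is assumed to be a boto3 SageMaker client created with `boto3.client("sagemaker")`.

```python
from typing import Any, List


def delete_endpoint_resources(sm_client: Any, endpoint_name: str) -> List[str]:
    """Delete an endpoint plus the endpoint config and models behind it."""
    # Look up the dependent resources before deleting anything.
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [
        variant["ModelName"]
        for variant in config["ProductionVariants"]
        if "ModelName" in variant
    ]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)
    # Return the names of everything that was deleted.
    return [endpoint_name, config_name, *model_names]
```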

SageMaker JumpStart provides an additional streamlined path to access and deploy NVIDIA NIM microservices, making advanced AI capabilities even more accessible to AWS customers. Through JumpStart’s intuitive interface, organizations can deploy Nemotron models with a few clicks, eliminating the complexity typically associated with model deployment and infrastructure management. The integration offers enterprise-grade features including model evaluation metrics, customization capabilities, and collaboration tools, all while maintaining data privacy within the customer’s VPC. This comprehensive integration enables organizations to accelerate their AI initiatives while using the combined strengths of the scalable infrastructure provided by AWS and NVIDIA’s optimized models.

P5e and G6e instances powered by NVIDIA H200 Tensor Core and L40S GPUs are now available on SageMaker Inference

SageMaker now supports new P5e and G6e instances, powered by NVIDIA GPUs for AI inference.

P5e instances use NVIDIA H200 Tensor Core GPUs for AI and machine learning. These instances offer 1.7 times larger GPU memory and 1.4 times higher memory bandwidth than previous generations. With eight powerful H200 GPUs per instance connected using NVIDIA NVLink for seamless GPU-to-GPU communication and blazing-fast 3,200 Gbps multi-node networking through EFA technology, P5e instances are purpose-built for deploying and training even the most demanding ML models. These instances deliver performance, reliability, and scalability for your cutting-edge inference applications.

G6e instances, powered by NVIDIA L40S GPUs, are one of the most cost-efficient GPU instances for deploying generative AI models and the highest-performance universal GPU instances for spatial computing, AI, and graphics workloads. They offer 2 times higher GPU memory (48 GB) and 2.9 times faster GPU memory bandwidth compared to G6 instances. G6e instances deliver up to 2.5 times better performance compared to G5 instances. Customers can use G6e instances to deploy LLMs and diffusion models for generating images, video, and audio. G6e instances feature up to eight NVIDIA L40S GPUs with 384 GB of total GPU memory (48 GB of memory per GPU) and third-generation AMD EPYC processors. They also support up to 192 vCPUs, up to 400 Gbps of network bandwidth, up to 1.536 TB of system memory, and up to 7.6 TB of local NVMe SSD storage.
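
To relate these memory figures to model size, a rough back-of-the-envelope estimate can help when picking an instance size. The sketch below uses the 48 GB per-GPU and eight-GPU figures quoted above and the 8B and 70B parameter counts of the Llama 3.1 models mentioned earlier. The 2 bytes per parameter (FP16/BF16 weights) and the `usable_fraction` allowance for KV cache and runtime overhead are illustrative assumptions, not measured values, and the function names are hypothetical.

```python
import math

L40S_MEMORY_GB = 48    # per-GPU memory quoted for G6e
MAX_GPUS_PER_G6E = 8   # largest G6e configuration quoted above


def weights_gb(params_billion: float, bytes_per_param: float = 2) -> float:
    """Approximate weight footprint in GB (2 bytes per parameter for FP16/BF16)."""
    return params_billion * bytes_per_param


def min_gpus(params_billion: float, bytes_per_param: float = 2,
             gpu_memory_gb: float = L40S_MEMORY_GB,
             usable_fraction: float = 0.7) -> int:
    """Smallest GPU count whose usable memory holds the weights.

    usable_fraction is an assumed share of GPU memory left for weights after
    reserving room for KV cache and runtime overhead.
    """
    usable_gb = gpu_memory_gb * usable_fraction
    return math.ceil(weights_gb(params_billion, bytes_per_param) / usable_gb)


for size in (8, 70):
    print(f"{size}B parameters: ~{weights_gb(size):.0f} GB of weights, "
          f"at least {min_gpus(size)} L40S GPU(s)")
```

Under these assumptions, an 8B model (about 16 GB of weights) fits on a single L40S GPU, while a 70B model (about 140 GB) needs at least five, which is within the 384 GB available on the largest G6e size. Real requirements depend on quantization, context length, and batch size, so treat the result as a starting point for testing.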

Both instance families are now available on SageMaker Inference. Check out AWS Region availability and pricing on our pricing page.

Conclusion

These new capabilities let you deploy NVIDIA NIM microservices on SageMaker through the AWS Marketplace, use new NVIDIA Nemotron models, and tap the latest GPU instance types to power your ML workloads. We encourage you to give these offerings a look and use them to accelerate your AI workloads on SageMaker Inference.

About the authors

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Eliuth Triana is a Developer Relations Manager at NVIDIA empowering Amazon’s AI MLOps, DevOps, Scientists and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing Generative AI Foundation models spanning from data curation, GPU training, model inference and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, tennis and poker player.

Abhishek Sawarkar is a product manager in the NVIDIA AI Enterprise team working on integrating NVIDIA AI Software in Cloud MLOps platforms. He focuses on integrating the NVIDIA AI end-to-end stack within Cloud platforms & enhancing user experience on accelerated computing.

Jiahong Liu is a Solutions Architect on the Cloud Service Provider team at NVIDIA. He assists clients in adopting machine learning and AI solutions that leverage NVIDIA-accelerated computing to address their training and inference challenges. In his leisure time, he enjoys origami, DIY projects, and playing basketball.

Kshitiz Gupta is a Solutions Architect at NVIDIA. He enjoys educating cloud customers about the GPU AI technologies NVIDIA has to offer and assisting them with accelerating their machine learning and deep learning applications. Outside of work, he enjoys running, hiking, and wildlife watching.