Track, allocate, and manage your generative AI cost and usage with Amazon Bedrock | Amazon Web Services

As enterprises increasingly embrace generative AI, they face challenges in managing the associated costs. With demand for generative AI applications surging across projects and multiple lines of business, accurately allocating and tracking spend becomes more complex. Organizations need to prioritize their generative AI spending based on business impact and criticality while maintaining cost transparency across customer and user segments. This visibility is essential for setting accurate pricing for generative AI offerings, implementing chargebacks, and establishing usage-based billing models.

Without a scalable approach to controlling costs, organizations risk unbudgeted usage and cost overruns. Manual spend monitoring and periodic usage limit adjustments are inefficient and prone to human error, leading to potential overspending. Although tagging is supported on a variety of Amazon Bedrock resources—including provisioned models, custom models, agents and agent aliases, model evaluations, prompts, prompt flows, knowledge bases, batch inference jobs, custom model jobs, and model duplication jobs—there was previously no capability for tagging on-demand foundation models. This limitation has added complexity to cost management for generative AI initiatives.

To address these challenges, Amazon Bedrock has launched a capability that organizations can use to tag on-demand models and monitor associated costs. Organizations can now label all Amazon Bedrock models with AWS cost allocation tags, aligning usage to specific organizational taxonomies such as cost centers, business units, and applications. To manage their generative AI spend judiciously, organizations can use services like AWS Budgets to set tag-based budgets and alarms to monitor usage, and receive alerts for anomalies or predefined thresholds. This scalable, programmatic approach eliminates inefficient manual processes, reduces the risk of excess spending, and ensures that critical applications receive priority. Enhanced visibility and control over AI-related expenses enables organizations to maximize their generative AI investments and foster innovation.

Introducing Amazon Bedrock application inference profiles

Amazon Bedrock recently introduced cross-region inference, enabling automatic routing of inference requests across AWS Regions. This feature uses system-defined inference profiles (predefined by Amazon Bedrock), which configure different model Amazon Resource Names (ARNs) from various Regions and unify them under a single model identifier (both model ID and ARN). While this enhances flexibility in model usage, it doesn’t support attaching custom tags for tracking, managing, and controlling costs across workloads and tenants.

To bridge this gap, Amazon Bedrock now introduces application inference profiles, a new capability that allows organizations to apply custom cost allocation tags to track, manage, and control their Amazon Bedrock on-demand model costs and usage. This capability enables organizations to create custom inference profiles for Bedrock base foundation models, adding metadata specific to tenants, thereby streamlining resource allocation and cost monitoring across varied AI applications.

Creating application inference profiles

Application inference profiles allow users to define customized settings for inference requests and resource management. These profiles can be created in two ways:

  1. Single model ARN configuration: Directly create an application inference profile using a single on-demand base model ARN, allowing quick setup with a chosen model.
  2. Copy from system-defined inference profile: Copy an existing system-defined inference profile to create an application inference profile, which will inherit configurations such as cross-Region inference capabilities for enhanced scalability and resilience.

The application inference profile ARN has the following format, where the inference profile ID component is a unique 12-character alphanumeric string generated by Amazon Bedrock upon profile creation.

arn:aws:bedrock:<region>:<account_id>:application-inference-profile/<inference_profile_id>
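
For illustration, the following is a minimal sketch of the second creation path: copying a system-defined cross-Region inference profile into a tagged application inference profile with the CreateInferenceProfile API (shown here via boto3). The profile name, tags, and the copyFrom ARN are placeholder assumptions; substitute the system-defined profile available in your account.

import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Copy a system-defined cross-Region inference profile into an application
# inference profile, attaching cost allocation tags at creation time.
# The copyFrom ARN below is illustrative; list the system-defined profiles
# in your account with ListInferenceProfiles to find the right one.
response = bedrock.create_inference_profile(
    inferenceProfileName="claims_dept_cross_region_profile",
    description="Claims department cross-Region profile",
    modelSource={
        "copyFrom": "arn:aws:bedrock:us-east-1:<account_id>:inference-profile/us-1.anthropic.claude-3-sonnet-20240229-v1:0"
    },
    tags=[
        {"key": "dept", "value": "claims"},
        {"key": "app", "value": "claims_chatbot"},
    ],
)
print(response["inferenceProfileArn"])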

System-defined compared to application inference profiles

The primary distinction between system-defined and application inference profiles lies in their type attribute and resource specifications within the ARN namespace:

  • System-defined inference profiles: These have a type attribute of SYSTEM_DEFINED and utilize the inference-profile resource type. They’re designed to support cross-Region and multi-model capabilities but are managed centrally by AWS.
    {
     …
    "inferenceProfileArn": "arn:aws:bedrock:us-east-1:<Account ID>:inference-profile/us-1.anthropic.claude-3-sonnet-20240229-v1:0",
    "inferenceProfileId": "us-1.anthropic.claude-3-sonnet-20240229-v1:0",
    "inferenceProfileName": "US-1 Anthropic Claude 3 Sonnet",
    "status": "ACTIVE",
    "type": "SYSTEM_DEFINED",
    …
    }

  • Application inference profiles: These profiles have a type attribute of APPLICATION and use the application-inference-profile resource type. They’re user-defined, providing granular control and flexibility over model configurations and allowing organizations to tailor policies with attribute-based access control (ABAC) using AWS Identity and Access Management (IAM). This enables more precise IAM policy authoring to manage Amazon Bedrock access more securely and efficiently.
    {
    …
    "inferenceProfileArn": "arn:aws:bedrock:us-east-1:<Account ID>:application-inference-profile/<Auto generated ID>",
    "inferenceProfileId": <Auto generated ID>,
    "inferenceProfileName": <User defined name>,
    "status": "ACTIVE",
    "type": "APPLICATION"
    …
    }

These differences are important when integrating with Amazon API Gateway or other API clients to help ensure correct model invocation, resource allocation, and workload prioritization. Organizations can apply customized policies based on profile type, enhancing control and security for distributed AI workloads. Both profile types are shown in the following figure.

Establishing application inference profiles for cost management

Imagine an insurance provider embarking on a journey to enhance customer experience through generative AI. The company identifies opportunities to automate claims processing, provide personalized policy recommendations, and improve risk assessment for clients across various regions. However, to realize this vision, the organization must adopt a robust framework for effectively managing their generative AI workloads.

The journey begins with the insurance provider creating application inference profiles that are tailored to its diverse business units. By assigning AWS cost allocation tags, the organization can effectively monitor and track its Amazon Bedrock spending patterns. For example, the claims processing team establishes an application inference profile with tags such as dept:claims, team:automation, and app:claims_chatbot. This tagging structure categorizes costs and allows usage to be assessed against budgets.

Users can manage and invoke application inference profiles using the Amazon Bedrock APIs or the AWS SDK for Python (Boto3):

  • CreateInferenceProfile: Initiates a new inference profile, allowing users to configure the parameters for the profile.
  • GetInferenceProfile: Retrieves the details of a specific inference profile, including its configuration and current status.
  • ListInferenceProfiles: Lists all available inference profiles within the user’s account, providing an overview of the profiles that have been created.
  • TagResource: Allows users to attach tags to specific Bedrock resources, including application inference profiles, for better organization and cost tracking.
  • ListTagsForResource: Fetches the tags associated with a specific Bedrock resource, helping users understand how their resources are categorized.
  • UntagResource: Removes specified tags from a resource, allowing for management of resource organization.
  • Invoke models with application inference profiles:
    • Converse API: Invokes the model using a specified inference profile for conversational interactions.
    • ConverseStream API: Similar to the Converse API but supports streaming responses for real-time interactions.
    • InvokeModel API: Invokes the model with a specified inference profile for general use cases.
    • InvokeModelWithResponseStream API: Invokes the model and streams the response, useful for handling large data outputs or long-running processes.

Note that application inference profile APIs cannot be accessed through the AWS Management Console.
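
The following is a minimal sketch of the management APIs listed above, using the boto3 bedrock client (client setup and names are illustrative): listing application inference profiles, retrieving a single profile, and reading its tags.

import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# List only application inference profiles (the type filter excludes
# system-defined profiles), then inspect the first one and its tags.
profiles = bedrock.list_inference_profiles(typeEquals="APPLICATION")
for summary in profiles["inferenceProfileSummaries"]:
    print(summary["inferenceProfileName"], summary["inferenceProfileArn"])

if profiles["inferenceProfileSummaries"]:
    profile_arn = profiles["inferenceProfileSummaries"][0]["inferenceProfileArn"]
    details = bedrock.get_inference_profile(inferenceProfileIdentifier=profile_arn)
    tags = bedrock.list_tags_for_resource(resourceARN=profile_arn)
    print(details["status"], tags["tags"])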

Invoke model with application inference profile using Converse API

The following example demonstrates how to create an application inference profile and then invoke the Converse API to engage in a conversation using that profile:

import boto3

# Amazon Bedrock control-plane client (inference profile management) and
# runtime client (model invocation)
bedrock = boto3.client("bedrock", region_name="us-east-1")
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def create_inference_profile(profile_name, model_arn, tags):
    """Create an application inference profile from a base model ARN"""
    response = bedrock.create_inference_profile(
        inferenceProfileName=profile_name,
        description="test",
        modelSource={'copyFrom': model_arn},
        tags=tags
    )
    print("CreateInferenceProfile Response:", response['ResponseMetadata']['HTTPStatusCode'])
    print(f"{response}\n")
    return response

# Create Inference Profile
print("Testing CreateInferenceProfile...")
tags = [{'key': 'dept', 'value': 'claims'}]
base_model_arn = "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0"
claims_dept_claude_3_sonnet_profile = create_inference_profile("claims_dept_claude_3_sonnet_profile", base_model_arn, tags)

# Extract the application inference profile ARN from the response
claims_dept_claude_3_sonnet_profile_arn = claims_dept_claude_3_sonnet_profile['inferenceProfileArn']

def parse_converse_response(response):
    """Extract the assistant's text from a Converse API response"""
    content = response.get('output', {}).get('message', {}).get('content', [])
    return "".join(block.get('text', '') for block in content)

def converse(model_id, messages):
    """Use the Converse API to engage in a conversation with the specified model"""
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=messages,
        inferenceConfig={
            'maxTokens': 300,  # Specify max tokens if needed
        }
    )

    status_code = response.get('ResponseMetadata', {}).get('HTTPStatusCode')
    print("Converse Response:", status_code)
    parsed_response = parse_converse_response(response)
    print(parsed_response)
    return response

# Example of Converse API with Application Inference Profile
print("\nTesting Converse...")
prompt = "\n\nHuman: Tell me about Amazon Bedrock.\n\nAssistant:"
messages = [{"role": "user", "content": [{"text": prompt}]}]
response = converse(claims_dept_claude_3_sonnet_profile_arn, messages)
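
The same application inference profile ARN works with the streaming APIs. The following is a minimal sketch of the ConverseStream variant, reusing the bedrock_runtime client, messages, and profile ARN from the example above:

def converse_stream(model_id, messages):
    """Stream the model's reply incrementally with the ConverseStream API"""
    response = bedrock_runtime.converse_stream(
        modelId=model_id,
        messages=messages,
        inferenceConfig={'maxTokens': 300}
    )
    # The response carries an event stream; text arrives in contentBlockDelta events
    for event in response['stream']:
        if 'contentBlockDelta' in event:
            print(event['contentBlockDelta']['delta'].get('text', ''), end='')
    print()

# Example of ConverseStream API with the same application inference profile
converse_stream(claims_dept_claude_3_sonnet_profile_arn, messages)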

Tagging, resource management, and cost management with application inference profiles

Tagging within application inference profiles allows organizations to allocate costs to specific generative AI initiatives, ensuring precise expense tracking. Application inference profiles enable organizations to apply cost allocation tags at creation and support additional tagging through the existing TagResource and UntagResource APIs, which allow metadata association with various AWS resources. Custom tags such as project_id, cost_center, model_version, and environment help categorize resources, improving cost transparency and allowing teams to monitor spend and usage against budgets.
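
As a minimal sketch (reusing the bedrock client and profile ARN from the earlier example; the tag keys and values are illustrative), tags can be added to, listed on, and removed from an existing application inference profile as follows:

# Attach additional cost allocation tags to an existing application inference profile
bedrock.tag_resource(
    resourceARN=claims_dept_claude_3_sonnet_profile_arn,
    tags=[
        {"key": "project_id", "value": "claims-modernization"},
        {"key": "environment", "value": "dev"},
    ],
)

# Inspect the tags currently attached to the profile
current_tags = bedrock.list_tags_for_resource(
    resourceARN=claims_dept_claude_3_sonnet_profile_arn
)
print(current_tags["tags"])

# Remove a tag that is no longer needed
bedrock.untag_resource(
    resourceARN=claims_dept_claude_3_sonnet_profile_arn,
    tagKeys=["environment"],
)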

Visualize cost and usage with application inference profiles and cost allocation tags

Leveraging cost allocation tags with tools like AWS Budgets, AWS Cost Anomaly Detection, AWS Cost Explorer, AWS Cost and Usage Reports (CUR), and Amazon CloudWatch provides organizations with insights into spending trends, helping them detect and address cost spikes early to stay within budget.

With AWS Budgets, organizations can set tag-based thresholds and receive alerts as spending approaches budget limits, offering a proactive approach to maintaining control over AI resource costs and quickly addressing any unexpected surges. For example, a $10,000 per month budget could be applied to a specific chatbot application for the Support Team in the Sales Department by applying the following tags to the application inference profile: dept:sales, team:support, and app:chat_app. AWS Cost Anomaly Detection can also monitor tagged resources for unusual spending patterns, making it easier to operationalize cost allocation tags by automatically identifying and flagging irregular costs.
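
A minimal sketch of such a tag-based budget with the boto3 budgets client follows. Note that cost allocation tags must first be activated in the Billing console before they appear in AWS Budgets and Cost Explorer; the CostFilters tag syntax ("user:<key>$<value>"), the threshold, and the email address are assumptions to verify against the AWS Budgets documentation.

import boto3

budgets = boto3.client("budgets")
account_id = boto3.client("sts").get_caller_identity()["Account"]

# Monthly $10,000 cost budget scoped to the chat_app tag, alerting at 80% of
# actual spend. The "user:" prefix denotes a user-defined cost allocation tag.
budgets.create_budget(
    AccountId=account_id,
    Budget={
        "BudgetName": "sales-support-chat-app-monthly",
        "BudgetType": "COST",
        "TimeUnit": "MONTHLY",
        "BudgetLimit": {"Amount": "10000", "Unit": "USD"},
        "CostFilters": {"TagKeyValue": ["user:app$chat_app"]},
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "finops-team@example.com"}
            ],
        }
    ],
)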

The following AWS Budgets console screenshot illustrates an exceeded budget threshold:

For deeper analysis, AWS Cost Explorer and CUR enable organizations to analyze tagged resources daily, weekly, and monthly, supporting informed decisions on resource allocation and cost optimization. By visualizing cost and usage based on metadata attributes, such as tag key/value and ARN, organizations gain an actionable, granular view of their spending.

The following AWS Cost Explorer console screenshot illustrates a cost and usage graph filtered by tag key and value:

The following AWS Cost Explorer console screenshot illustrates a cost and usage graph filtered by Bedrock application inference profile ARN:

Organizations can also use Amazon CloudWatch to monitor runtime metrics for Bedrock applications, providing additional insights into performance and cost management. Metrics can be graphed by application inference profile, and teams can set alarms based on thresholds for tagged resources. Notifications and automated responses triggered by these alarms enable real-time management of cost and resource usage, preventing budget overruns and maintaining financial stability for generative AI workloads.
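
As a minimal sketch, an invocation alarm scoped to one application inference profile might look like the following. The AWS/Bedrock namespace, Invocations metric, and ModelId dimension follow the Bedrock runtime metrics; the profile ARN placeholder, threshold, and SNS topic are assumptions to replace with your own values.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# The dimension value is the application inference profile ARN passed as
# modelId at invocation time (placeholder shown here).
profile_arn = "arn:aws:bedrock:us-east-1:<account_id>:application-inference-profile/<inference_profile_id>"

# Alarm when this profile exceeds 1,000 invocations in an hour
cloudwatch.put_metric_alarm(
    AlarmName="claims-chatbot-invocation-limit",
    Namespace="AWS/Bedrock",
    MetricName="Invocations",
    Dimensions=[{"Name": "ModelId", "Value": profile_arn}],
    Statistic="Sum",
    Period=3600,
    EvaluationPeriods=1,
    Threshold=1000,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:<account_id>:bedrock-cost-alerts"],
)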

The following Amazon CloudWatch console screenshot highlights Bedrock runtime metrics filtered by Bedrock application inference profile ARN:

The following Amazon CloudWatch console screenshot highlights an invocation limit alarm filtered by Bedrock application inference profile ARN:

Through the combined use of tagging, budgeting, anomaly detection, and detailed cost analysis, organizations can effectively manage their AI investments. By leveraging these AWS tools, teams can maintain a clear view of spending patterns, enabling more informed decision-making and maximizing the value of their generative AI initiatives while ensuring critical applications remain within budget.

Retrieving application inference profile ARNs based on tags for model invocation

Organizations often use a generative AI gateway or large language model proxy when calling Amazon Bedrock APIs, including model inference calls. With the introduction of application inference profiles, organizations need to retrieve the inference profile ARN to invoke model inference for on-demand foundation models. There are two primary approaches to obtain the appropriate inference profile ARN.

  • Static configuration approach: This method involves maintaining a static configuration file in the AWS Systems Manager Parameter Store or AWS Secrets Manager that maps tenant/workload keys to their corresponding application inference profile ARNs. While this approach offers simplicity in implementation, it has significant limitations. As the number of inference profiles scales from tens to hundreds or even thousands, managing and updating this configuration file becomes increasingly cumbersome. The static nature of this method requires manual updates whenever changes occur, which can lead to inconsistencies and increased maintenance overhead, especially in large-scale deployments where organizations need to dynamically retrieve the correct inference profile based on tags.
  • Dynamic retrieval using the Resource Groups API: The second, more robust approach leverages the AWS Resource Groups GetResources API to dynamically retrieve application inference profile ARNs based on resource and tag filters. This method allows for flexible querying using various tag keys such as tenant ID, project ID, department ID, workload ID, model ID, and region. The primary advantage of this approach is its scalability and dynamic nature, enabling real-time retrieval of application inference profile ARNs based on current tag configurations.

However, there are considerations to keep in mind. The GetResources API has throttling limits, necessitating the implementation of a caching mechanism. Organizations should maintain a cache with a Time-To-Live (TTL) based on the API’s output to optimize performance and reduce API calls. Additionally, implementing thread safety is crucial to help ensure that organizations always read the most up-to-date inference profile ARNs when the cache is being refreshed based on the TTL.
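
A minimal sketch of this pattern follows, using the Resource Groups Tagging API GetResources operation via boto3 with a simple thread-safe TTL cache. The resource type filter string, TTL value, and tag keys are assumptions to adjust for your environment.

import time
import threading
import boto3

tagging = boto3.client("resourcegroupstaggingapi", region_name="us-east-1")

_cache = {}          # maps a tag-filter key to (list of ARNs, expiry timestamp)
_cache_lock = threading.Lock()
CACHE_TTL_SECONDS = 300

def get_profile_arns_by_tags(tag_filters):
    """Resolve application inference profile ARNs for the given tag filters,
    caching results to stay under GetResources throttling limits."""
    key = tuple(sorted((f["Key"], tuple(f["Values"])) for f in tag_filters))
    with _cache_lock:
        cached = _cache.get(key)
        if cached and cached[1] > time.time():
            return cached[0]
    response = tagging.get_resources(
        ResourceTypeFilters=["bedrock:application-inference-profile"],  # assumed filter string
        TagFilters=tag_filters,
    )
    arns = [m["ResourceARN"] for m in response["ResourceTagMappingList"]]
    with _cache_lock:
        _cache[key] = (arns, time.time() + CACHE_TTL_SECONDS)
    return arns

# Example: resolve the profile for the claims chatbot, then pass the ARN as
# modelId to the Converse or InvokeModel API
arns = get_profile_arns_by_tags([
    {"Key": "dept", "Values": ["claims"]},
    {"Key": "app", "Values": ["claims_chatbot"]},
])
print(arns)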

As illustrated in the following diagram, this dynamic approach involves a client making a request to the Resource Groups service with specific resource type and tag filters. The service returns the corresponding application inference profile ARN, which is then cached for a set period. The client can then use this ARN to invoke the Bedrock model through the InvokeModel or Converse API.

By adopting this dynamic retrieval method, organizations can create a more flexible and scalable system for managing application inference profiles, allowing for more straightforward adaptation to changing requirements and growth in the number of profiles.

The architecture in the preceding figure illustrates two methods for dynamically retrieving inference profile ARNs based on tags. Let’s describe both approaches with their pros and cons:

  1. Bedrock client maintaining the cache with TTL: This method involves the client directly querying the AWS Resource Groups service using the GetResources API based on resource type and tag filters. The client caches the retrieved ARNs in a client-maintained cache with a TTL and is responsible for refreshing the cache by calling the GetResources API in a thread-safe way.
  2. Lambda-based method: This approach uses AWS Lambda as an intermediary between the calling client and the Resource Groups API. It employs a Lambda extension with an in-memory cache, potentially reducing the number of API calls to Resource Groups, and also interacts with Parameter Store, which can be used for configuration management or for storing cached data persistently.

Both methods use similar filtering criteria (resource-type-filter and tag-filters) to query the Resource Groups API, allowing for precise retrieval of inference profile ARNs based on attributes such as tenant, model, and Region. The choice between these methods depends on factors such as the expected request volume, desired latency, cost considerations, and the need for additional processing or security measures. The Lambda-based approach offers more flexibility and optimization potential, while the direct API method is simpler to implement and maintain.

Overview of Amazon Bedrock resources tagging capabilities

The tagging capabilities of Amazon Bedrock have evolved significantly, providing a comprehensive framework for resource management across multi-account AWS Control Tower setups. This evolution enables organizations to manage resources across development, staging, and production environments, helping organizations track, manage, and allocate costs for their AI/ML workloads.

At its core, the Amazon Bedrock resource tagging system spans multiple operational components. Organizations can effectively tag their batch inference jobs, agents, custom model jobs, knowledge bases, prompts, and prompt flows. This foundational level of tagging supports granular control over operational resources, enabling precise tracking and management of different workload components. The model management aspect of Amazon Bedrock introduces another layer of tagging capabilities, encompassing both custom and base models, and distinguishes between provisioned and on-demand models, each with its own tagging requirements and capabilities.

With the introduction of application inference profiles, organizations can now manage and track their on-demand Bedrock base foundation models. Because teams can create application inference profiles derived from system-defined inference profiles, they can configure more precise resource tracking and cost allocation at the application level. This capability is particularly valuable for organizations that are running multiple AI applications across different environments, because it provides clear visibility into resource usage and costs at a granular level.

The following diagram visualizes the multi-account structure and demonstrates how these tagging capabilities can be implemented across different AWS accounts.

Conclusion

In this post, we introduced the latest feature from Amazon Bedrock, application inference profiles. We explored how it operates and discussed key considerations. The code sample for this feature is available in this GitHub repository. This new capability enables organizations to tag, allocate, and track on-demand model inference workloads and spending across their operations. Organizations can label all Amazon Bedrock models using tags and monitor usage according to their specific organizational taxonomy, such as tenants, workloads, cost centers, business units, teams, and applications. This feature is now generally available in all AWS Regions where Amazon Bedrock is offered.


About the authors

Kyle T. Blocksom is a Sr. Solutions Architect with AWS based in Southern California. Kyle's passion is to bring people together and leverage technology to deliver solutions that customers love. Outside of work, he enjoys surfing, eating, wrestling with his dog, and spoiling his niece and nephew.

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including the NLP and computer vision domains. He helps customers achieve high-performance model inference on SageMaker.


