Automating AWS Cost and Usage Report with CloudFormation

In this blog post, we’ll explore how to set up AWS Cost and Usage Report (CUR) automatically using AWS CloudFormation. This includes creating an S3 bucket for storing your reports, configuring the CUR to export data in Parquet format, and setting up Athena and Glue for querying the data. By the end of this post, you’ll have a comprehensive, automated solution for managing and analyzing your AWS cost and usage data.

Prerequisites

Before diving into the CloudFormation template, you’ll need:

An AWS account
AWS CLI installed and configured
Appropriate permissions to create AWS resources (S3, IAM, Athena, Glue, CUR)

Project Structure

To achieve our goal, we’ll run a CloudFormation stack locally using a single YAML file. This stack will perform the following tasks:

Configure the Cost and Usage Report export
Create an Athena/Glue database

Once the table is created, we’ll see an example of an Athena query for cost analysis.

By the end of this process, we will have a simplified folder structure:

cost_and_usage_report_stack
│
├── cloudformation
│ ├── MasterTemplate.yaml # Combined CloudFormation template
│
├── athena
│ ├── YourCURReport-create-table.sql # The query created by AWS
│ ├── example_query.sql # Example SQL query for Athena
│
├── README.md

Detailed Steps

1. Create the S3 Bucket

Given that the bucket name must be unique, I added my initials at the end. You will need to change this. I usually work in us-east-1 so I use this region (if you already have a bucket, skip this and update the rest) :

aws s3 mb s3://your-cur-reports-bucket-fg --region us-east-1

2. Write your Cloudformation Template

In this session, we will create with cloudformation the Cost and Usage CUR report and the Athena database to integrate it into a SQL queryable env. If you want to customize or integrate with other services like Redshift, you can adapt this YAML file.

If you have an existing database, you can skip the second part of the yaml related to Glue/Athena but remember to update any references correctly.

`cloudformation/MasterTemplate.yaml`


AWSTemplateFormatVersion: '2010-09-09'
Resources:
  CostAndUsageReportDefinition:
    Type: 'AWS::CUR::ReportDefinition'
    Properties:
      ReportName: 'YourCURReport'
      TimeUnit: 'HOURLY'
      Format: 'Parquet'
      Compression: 'Parquet'
      S3Bucket: 'your-cur-reports-bucket-fg'
      S3Prefix: 'cur-reports/'
      S3Region: 'us-east-1'
      AdditionalSchemaElements:
        - RESOURCES
      AdditionalArtifacts:
        - ATHENA
      RefreshClosedReports: true
      ReportVersioning: 'OVERWRITE_REPORT'

  GlueDatabase:
    Type: 'AWS::Glue::Database'
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: 'yourcurreport'

3. Running the CloudFormation Stack

Use AWS CLI to create the stack using MasterTemplate.yaml:

aws cloudformation create-stack --stack-name your-cur-stack --template-body file://cloudformation/MasterTemplate.yaml --capabilities CAPABILITY_NAMED_IAM

AWS processes cost data daily. In this step, you will need to wait up to 24 hours for your data to be available. If you want to create the table with my query, you can, but it’ll be empty.

4. Create an Athena Table

The next day, log into your account and open your S3 bucket. The files from yesterday will be there! (So fun Huh?!)

For instance, you might find it in s3://your-cur-reports-bucket-fg/cur-reports/YourCURReport
Inside this folder, you will also find a SQL file to create the table in Athena: /YYYYMMDD-YYYYMMDD/YourCURReport-create-table.sql

In my case, that’s what I found (It has many columns!):

CREATE EXTERNAL TABLE YourCURReport.your_c_u_r_report(
    identity_line_item_id STRING,
    identity_time_interval STRING,
    bill_invoice_id STRING,
    bill_invoicing_entity STRING,
    bill_billing_entity STRING,
    bill_bill_type STRING,
    bill_payer_account_id STRING,
    bill_billing_period_start_date TIMESTAMP,
    bill_billing_period_end_date TIMESTAMP,
    line_item_usage_account_id STRING,
    line_item_line_item_type STRING,
    line_item_usage_start_date TIMESTAMP,
    line_item_usage_end_date TIMESTAMP,
    line_item_product_code STRING,
    line_item_usage_type STRING,
    line_item_operation STRING,
    line_item_availability_zone STRING,
    line_item_resource_id STRING,
    line_item_usage_amount DOUBLE,
    line_item_normalization_factor DOUBLE,
    line_item_normalized_usage_amount DOUBLE,
    line_item_currency_code STRING,
    line_item_unblended_rate STRING,
    line_item_unblended_cost DOUBLE,
    line_item_blended_rate STRING,
    line_item_blended_cost DOUBLE,
    line_item_line_item_description STRING,
    line_item_tax_type STRING,
    line_item_legal_entity STRING,
    product_product_name STRING,
    product_availability_zone STRING,
    product_capacitystatus STRING,
    product_classicnetworkingsupport STRING,
    product_clock_speed STRING,
    product_current_generation STRING,
    product_dedicated_ebs_throughput STRING,
    product_ecu STRING,
    product_enhanced_networking_supported STRING,
    product_from_location STRING,
    product_from_location_type STRING,
    product_from_region_code STRING,
    product_gpu_memory STRING,
    product_group STRING,
    product_group_description STRING,
    product_instance_family STRING,
    product_instance_type STRING,
    product_instance_type_family STRING,
    product_intel_avx2_available STRING,
    product_intel_avx_available STRING,
    product_intel_turbo_available STRING,
    product_license_model STRING,
    product_location STRING,
    product_location_type STRING,
    product_marketoption STRING,
    product_max_iopsvolume STRING,
    product_max_throughputvolume STRING,
    product_max_volume_size STRING,
    product_memory STRING,
    product_network_performance STRING,
    product_normalization_size_factor STRING,
    product_operating_system STRING,
    product_operation STRING,
    product_physical_processor STRING,
    product_pre_installed_sw STRING,
    product_processor_architecture STRING,
    product_processor_features STRING,
    product_product_family STRING,
    product_region STRING,
    product_region_code STRING,
    product_servicecode STRING,
    product_servicename STRING,
    product_sku STRING,
    product_storage STRING,
    product_storage_media STRING,
    product_storage_type STRING,
    product_tenancy STRING,
    product_tiertype STRING,
    product_to_location STRING,
    product_to_location_type STRING,
    product_to_region_code STRING,
    product_transfer_type STRING,
    product_usagetype STRING,
    product_vcpu STRING,
    product_volume_api_name STRING,
    product_volume_type STRING,
    product_vpcnetworkingsupport STRING,
    pricing_rate_code STRING,
    pricing_rate_id STRING,
    pricing_currency STRING,
    pricing_public_on_demand_cost DOUBLE,
    pricing_public_on_demand_rate STRING,
    pricing_term STRING,
    pricing_unit STRING,
    reservation_amortized_upfront_cost_for_usage DOUBLE,
    reservation_amortized_upfront_fee_for_billing_period DOUBLE,
    reservation_effective_cost DOUBLE,
    reservation_end_time STRING,
    reservation_modification_status STRING,
    reservation_normalized_units_per_reservation STRING,
    reservation_number_of_reservations STRING,
    reservation_recurring_fee_for_usage DOUBLE,
    reservation_start_time STRING,
    reservation_subscription_id STRING,
    reservation_total_reserved_normalized_units STRING,
    reservation_total_reserved_units STRING,
    reservation_units_per_reservation STRING,
    reservation_unused_amortized_upfront_fee_for_billing_period DOUBLE,
    reservation_unused_normalized_unit_quantity DOUBLE,
    reservation_unused_quantity DOUBLE,
    reservation_unused_recurring_fee DOUBLE,
    reservation_upfront_value DOUBLE,
    savings_plan_total_commitment_to_date DOUBLE,
    savings_plan_savings_plan_a_r_n STRING,
    savings_plan_savings_plan_rate DOUBLE,
    savings_plan_used_commitment DOUBLE,
    savings_plan_savings_plan_effective_cost DOUBLE,
    savings_plan_amortized_upfront_commitment_for_billing_period DOUBLE,
    savings_plan_recurring_commitment_for_billing_period DOUBLE
)

PARTITIONED BY (
    year STRING,
    month STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH  SERDEPROPERTIES (
 'serialization.format' = '1'
) LOCATION 's3://your-cur-reports-bucket-fg/cur-reports/YourCURReport/YourCURReport/'

This file contains all the columns of the report you generated (so if you customize before, here your SQL file will be different). Open your Athena console, select the database, and run the query from the file. (You can also look more into this folder; it contains more JSONs and metadata of the process.)

Initially, if you try to directly select the table, it might return as missing. (Relax, This is expected behavior.) You need to run a command to ask Glue to search the newer partitions¹:

msck repair table yourcurreport.YourCURReport

¹ You can configure a crawler in Glue to keep updating its partitions, but it will generate costs

5. Retrieve your Data

Now you can run your query and retrieve real data about your usage!

`athena/example_query.sql`

SELECT
  line_item_usage_account_id,
  product_servicecode,
  product_product_name,
  SUM(line_item_blended_cost) as cost
FROM yourcurreport.your_cur_table
GROUP BY line_item_usage_account_id, product_servicecode, product_product_name
ORDER BY cost DESC;

This query will group costs by account and service. You can’t do this in the console because it allows grouping by only one variable (and filtering by the others). In a large environment, this grouping capability can clearly show where your biggest sources of cost are.

I suggest looking into a few variables: service, usage type, region, and unblended cost. There are some query examples from AWS Well-Architected Cost Optimization.

Conclusion

In this blog post, we’ve automated the setup of the AWS Cost and Usage Report using a single CloudFormation template executed locally. By following these steps, you can efficiently track and manage your AWS costs. Feel free to expand this template to suit your specific requirements. If you encounter any issues with the tutorial, consider using the repo directly.

This case is interesting because, in most situations, business data is the primary focus. However, without control over cloud costs, your IT department may face significant challenges. Applying a data-driven approach to your IT challenges can be a powerful tool.

Be mindful of the costs associated with this process (storage and requests in S3, bytes read in Athena), so it’s a good idea to create transformation processes that provide views aligned with your business needs, rather than looking directly into raw data.

In future posts, I will discuss good practices and frameworks for cost reduction in your environment!

GitHub Repo: https://github.com/felipe-de-godoy/cost_and_usage_report_stack

Source link
lol