Amazon DataZone is a data management service that makes it quick and convenient to catalog, discover, share, and govern data stored in AWS, on-premises, and third-party sources. Amazon DataZone allows you to create and manage data zones, which are virtual data lakes that store and process your data, without the need for extensive coding or infrastructure management. Amazon DataZone makes it straightforward for engineers, data scientists, product managers, analysts, and business users to access data throughout an organization so they can discover, use, and collaborate to derive data-driven insights.
Amazon SageMaker Canvas is a no-code machine learning (ML) service that empowers business analysts and domain experts to build, train, and deploy ML models without writing a single line of code. SageMaker Canvas streamlines data ingestion from popular sources like Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon Athena, Snowflake, Salesforce, and Databricks, offering robust data preparation with Amazon SageMaker Data Wrangler, automated model building through Amazon SageMaker Autopilot, and a playground for using pre-built ML models, including foundation models (FMs) from Amazon Bedrock and Amazon SageMaker JumpStart.
Enterprises can use no-code ML solutions to streamline their operations and optimize their decision-making without extensive administrative overhead. For example, when financial institutions use ML models to perform fraud detection analysis, they can use low-code and no-code solutions to enable rapid iteration of fraud detection models to improve efficiency and accuracy. However, ML governance plays a key role in making sure the data used in these models is accurate, secure, and reliable. With the integration of Amazon DataZone and Amazon SageMaker, users can set up infrastructure with security controls, collaborate on ML projects, and govern access to data and ML assets. You can use SageMaker Canvas as part of this integration to build ML models from approved and reliable datasets.
In this post, we show how the Amazon DataZone integration with SageMaker Canvas lets users publish their data assets so that other builders in the same organization can search for and discover the published datasets, subscribe to them, and consume the data. After you’re subscribed to a data asset, you can consume it from SageMaker Canvas, perform feature engineering, build an ML model, and then publish the model back to the Amazon DataZone project. This new governance capability makes it straightforward to govern access to your infrastructure, data, and ML resources for the business problem being addressed.
Solution overview
In this section, we provide an overview of three personas: the data administrator, data publisher, and data scientist. The data administrator is responsible for provisioning the necessary Amazon DataZone resources to enable the integration with SageMaker according to the Amazon DataZone concepts. The data administrator defines the required security controls for ML infrastructure and deploys the SageMaker environment with Amazon DataZone. The data publisher is responsible for publishing the organization's data in the Amazon DataZone business data catalog and governing access to it. The data scientist discovers and subscribes to data and ML resources, accesses the data from SageMaker Canvas, prepares the data, performs feature engineering, builds an ML model, and exports the model back to the Amazon DataZone catalog. In this post, we use a banking dataset that has data related to direct marketing campaigns for a banking institution. This dataset contains continuous, integer, and categorical variables that are used to predict whether the client will subscribe to a term deposit. The following diagram illustrates the workflow.
Prerequisites
Before you can start using the SageMaker and Amazon DataZone integration, you must have the following:
- An AWS account with appropriate permissions to create and manage resources in SageMaker and Amazon DataZone.
- An Amazon DataZone domain and an associated Amazon DataZone project configured in your AWS account.
- Familiarity with SageMaker and its components, such as Amazon SageMaker Studio, SageMaker Canvas, and SageMaker notebooks.
- The sample dataset
- Upload the dataset to Amazon S3 and crawl the data to create an AWS Glue database and tables. For instructions to catalog the data, refer to Populating the AWS Glue Data Catalog.
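The upload-and-crawl prerequisite above can also be scripted. The following is a minimal sketch using boto3, not the procedure from this post. The crawler name, IAM role ARN, bucket, and key shown in the usage note are placeholders (assumptions); the database name bankmarketing matches the database used later in the Athena import step.

```python
"""Sketch: upload a CSV file to Amazon S3 and catalog it with an AWS Glue crawler."""


def crawler_request(name, role_arn, database, bucket, prefix):
    """Build the keyword arguments for the Glue create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,  # IAM role the crawler assumes (placeholder in this sketch)
        "DatabaseName": database,  # Glue database that receives the crawled tables
        "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/{prefix}"}]},
    }


def upload_and_crawl(s3, glue, local_path, bucket, key, request):
    """Upload the file, then create and start the crawler.

    s3 and glue are boto3 clients, passed in so the function can be
    exercised with stand-in objects when no AWS access is available.
    """
    s3.upload_file(local_path, bucket, key)
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])
```

To run this against a real account, you could pass `boto3.client("s3")` and `boto3.client("glue")`, a local file such as `bank.csv`, and a request built with, for example, `crawler_request("bank-marketing-crawler", "<crawler-role-arn>", "bankmarketing", "<your-bucket>", "bank/")`. The crawler runs asynchronously, so the tables appear in the AWS Glue Data Catalog only after the crawl finishes.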
Data admin steps on Amazon DataZone
As a data administrator, you need to set up the necessary Amazon DataZone resources to enable the integration with SageMaker. Follow the steps outlined in Amazon DataZone quickstart with AWS Glue data or refer to the following video to set up an Amazon DataZone domain, enable SageMaker and data lake blueprints, create Amazon DataZone projects (one for publishing data assets and one for subscribing to data assets from the data catalog), and provision default SageMaker and default data lake environments in the respective projects. The data lake environment is required to configure an AWS Glue database table, which is used to publish an asset in the Amazon DataZone catalog. The following video demonstrates how to configure the data source (from an AWS Glue database) and publish the dataset in the Amazon DataZone catalog.
Before starting the data scientist workflow, make sure the following are in place for the Amazon DataZone project:
- An Amazon DataZone project named Banking-Consumer-ML, which is used in the data scientist workflow.
- A SageMaker environment profile with the default SageMaker blueprint.
- A SageMaker environment based on the SageMaker environment profile, which allows the data scientist to launch SageMaker Studio from the Amazon DataZone project console.
- A data asset named Bank that contains the customer data from a banking institution that captures the demographic, financial, and marketing campaign data for the bank’s customers. The data asset is already published in the Amazon DataZone data catalog and can be searched from any project created under the Amazon DataZone domain.
Data scientist workflow
In this section, we demonstrate how a data scientist subscribes to an existing data asset from the SageMaker Studio asset catalog, imports the dataset to SageMaker Canvas, builds an ML model, and publishes the model back to the Amazon DataZone data catalog, where it can be reused across projects in the domain. As the data scientist, complete the following steps:
- In the Environments section of the Banking-Consumer-ML project, choose SageMaker Studio.
- Choose Assets in the navigation pane.
- On the Asset catalog tab, search for and choose the data asset Bank.
You can view the metadata and schema of the banking dataset to understand the data attributes and columns.
- To raise a request to subscribe to the dataset, choose Subscribe.
- Enter a reason for the request and choose Submit.
After the data scientist submits the request, a subscription request is created and a notification is sent to the asset publishing project for approval.
The data publisher for the asset publishing project views the subscription request by navigating to the data owning project console and choosing Incoming requests under Published data in the navigation pane. The data publisher chooses View request to view the request and, based on the organization’s data access policy, approves the incoming subscription request.
The data publisher can view the subscription status for the asset and is also able to revoke and remove subscription access anytime from the data publishing project console.
The data publisher can also view and approve the request under Manage asset requests on the SageMaker Studio Assets page.
On the Assets page, the Bank dataset that the data scientist subscribed to is now visible.
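The subscribe-and-approve exchange above can also be driven programmatically. The following is a hedged sketch that assumes the boto3 `datazone` client operations `create_subscription_request` and `accept_subscription_request` with the parameter names shown; check them against the current API reference before relying on them. The domain, listing, and project identifiers are hypothetical inputs that you would look up in your own domain.

```python
"""Sketch: raise and approve an Amazon DataZone subscription request."""


def subscription_request_params(domain_id, listing_id, project_id, reason):
    """Build the arguments for create_subscription_request.

    listing_id identifies the published asset (for example, Bank) and
    project_id identifies the subscribing project (for example,
    Banking-Consumer-ML). Both are assumed to be known already.
    """
    return {
        "domainIdentifier": domain_id,
        "requestReason": reason,
        "subscribedListings": [{"identifier": listing_id}],
        "subscribedPrincipals": [{"project": {"identifier": project_id}}],
    }


def request_subscription(datazone, params):
    """Data scientist side: submit the request and return its identifier."""
    response = datazone.create_subscription_request(**params)
    return response["id"]


def approve_subscription(datazone, domain_id, request_id, comment):
    """Data publisher side: accept the pending request."""
    datazone.accept_subscription_request(
        domainIdentifier=domain_id,
        identifier=request_id,
        decisionComment=comment,
    )
```

In practice the two halves run under different identities: the data scientist's credentials call `request_subscription`, and the data publisher's credentials call `approve_subscription` after reviewing the request against the organization's data access policy.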
- Under Applications in the navigation pane, choose Canvas, then choose Open Canvas to launch SageMaker Canvas from SageMaker Studio.
- Choose Data Wrangler in the navigation pane.
- On the Import and prepare dropdown menu, choose Tabular.
SageMaker Data Wrangler simplifies the process of data preparation and feature engineering, and enables the completion of each step of the data preparation workflow (including data selection, cleansing, exploration, visualization, and processing at scale) from a single visual interface.
- For Select a data source, choose Athena.
Athena is a serverless, interactive analytics service that provides a simplified and flexible way to analyze petabytes of data where it lives. Because the data source for the banking dataset is a database created in the AWS Glue Data Catalog using an AWS Glue crawler, the data is queried using Athena in SageMaker Data Wrangler. With this step, the data scientist can import the data into the Data Wrangler tool to perform feature engineering and prepare the data for ML modeling.
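To make the Athena step concrete, the following sketch previews the same table outside of SageMaker Canvas using the boto3 `athena` client. It is an illustration under assumptions, not what Data Wrangler runs internally: the query result location is a placeholder you must supply, and the database and table names (bankmarketing and bank) are the ones used in this walkthrough.

```python
"""Sketch: preview a Glue-cataloged table with Amazon Athena."""
import time


def preview_query(table, limit=10):
    """Return a small SELECT statement for a simple table name."""
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    return f'SELECT * FROM "{table}" LIMIT {int(limit)}'


def run_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Start the query, poll until it finishes, and return the result rows."""
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} ended in state {state}")
    # Only the first page of results is returned, which is enough for a preview.
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```

For example, `run_query(boto3.client("athena"), preview_query("bank"), "bankmarketing", "s3://<your-results-bucket>/athena/")` would return the header row followed by up to 10 data rows.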
- Expand bankmarketing and drag and drop the bank dataset into the canvas.
SageMaker Canvas loads the selected dataset in the Import preview section. The banking dataset contains information about bank clients such as age, job, marital status, education, credit default status, and details about the marketing campaign contacts like communication type, duration, number of contacts, and outcome of the previous campaign.
- Choose Import to import the dataset into SageMaker Data Wrangler.
A new data flow is created on the Data Wrangler console.
- Choose Get data insights to identify potential data quality issues and get recommendations.
- In the Create analysis pane, provide the following information:
- For Analysis type, choose Data Quality And Insights Report.
- For Analysis name, enter a name.
- For Problem type, select Classification.
- For Target column, enter y.
- For Data size, select Sampled dataset (20k).
- Choose Create.
You can review the generated Data Quality and Insights Report to gain a deeper understanding of the data, including statistics, duplicates, anomalies, missing values, outliers, target leakage, data imbalance, and more. If you’re satisfied with the data based on the generated report, you can continue with the data scientist workflow. Refer to Accelerate data preparation for ML in Amazon SageMaker Canvas for a deeper understanding of the process to prepare data for end-to-end model building.
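To give a feel for a few of the checks the report covers, the following plain-Python sketch counts duplicate rows, missing values, and the class balance of the target column. It is a simplified illustration, not the logic SageMaker Data Wrangler uses, and the rows in the usage note are invented for demonstration rather than taken from the banking dataset.

```python
"""Sketch: three basic data quality checks on a list of row dictionaries."""
from collections import Counter


def quality_summary(rows, target):
    """Summarize duplicates, missing values, and target class balance."""
    # Two rows are duplicates when every column holds the same value.
    distinct = {tuple(sorted(row.items())) for row in rows}
    duplicates = len(rows) - len(distinct)

    # Treat None and the empty string as missing.
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value == "":
                missing[column] += 1

    # A heavily skewed count here indicates class imbalance.
    class_balance = Counter(row[target] for row in rows)

    return {
        "rows": len(rows),
        "duplicates": duplicates,
        "missing": dict(missing),
        "class_balance": dict(class_balance),
    }
```

For instance, calling `quality_summary(rows, "y")` on three invented rows, two of which are identical with `y` set to `"no"` and one with an empty `job` value and `y` set to `"yes"`, reports one duplicate, one missing `job` value, and a 2-to-1 class balance.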
- On the options menu (three dots), choose Create model, which first exports the prepared data as a SageMaker Canvas dataset.
- Enter a name for the dataset (for example, Banking-Customer-DataSet), then choose Export.
After the dataset is exported, a confirmation message is displayed on the console.
- Choose Create model to continue.
The exported dataset is also visible on the Datasets page on the SageMaker Canvas console. Here, you can alternatively select the dataset and choose Create a model to continue.
- In the Create new model section, provide the following information:
- For Model name, enter a name for the model (for example, Banking-Customer-Prediction-Model).
- For Problem type, select Predictive analysis.
- Choose Create.
The objective of the model is to predict whether a customer is likely to subscribe to the bank’s term deposit (variable y).
- On the Build tab, for Target column, choose the column that the model intends to predict.
- Choose Preview model.
The Preview model option runs a quick build of the binary classification model on a subset of the data, which takes 10–15 minutes, so you can preview the outcome before running the full build, which typically takes around 4 hours or longer. Optionally, you can choose the Configure model option to customize the ML model.
With the Configure model option, you can customize the model type, objective metric, training method, and training/testing data split, and set limits on model creation job runtime.
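As an illustration of what a training/testing data split means, the following sketch divides rows into two sets while keeping the proportion of each target class the same in both (a stratified split). This is a generic technique shown under the assumption of an 80/20 split; it is not a description of how SageMaker Canvas implements its split.

```python
"""Sketch: a seeded, stratified train/test split in plain Python."""
import random
from collections import defaultdict


def stratified_split(rows, target, train_fraction=0.8, seed=0):
    """Split rows so each target class keeps roughly the same share in both sets."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible

    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target]].append(row)

    train, test = [], []
    for label in sorted(by_class):
        group = list(by_class[label])
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test
```

Stratifying matters for imbalanced targets such as the term deposit column y: a purely random split of a small dataset can leave the test set with too few positive examples to evaluate the model meaningfully.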
SageMaker Canvas runs the preview model and displays the outcome that shows the estimated accuracy (%) and a list of dataset features in descending order of importance. You can observe that columns duration, pdays, month, and housing are the dominant features that impact the model’s prediction.
Optionally, you can choose the View all option on the Build tab to get a full list of options to perform feature transformation and data wrangling, such as dropping unimportant columns, dropping duplicate data, replacing missing values, changing data types, and combining columns to create new columns. This allows you to perform feature engineering before building the model.
- Choose Standard build to start the model building process.
You can monitor the progress of model creation.
When the model is complete, the model status is shown along with Overview, Scoring, and Advanced metrics options.
You can review the model status and test the model on the Predict tab. With the prediction option, you can perform either a batch or single prediction and test the model.
- On the options menu (three dots), choose Add to Model Registry to register the model using Amazon SageMaker Model Registry.
- Enter a group name (for this post, canvas-Banking-Customer-Prediction-Model) and choose Add.
Subsequent builds of the ML model are versioned and are stored under the same group name in the SageMaker Studio model registry.
- On the SageMaker Studio console, choose Models in the navigation pane to view the model you just added to the model registry.
- On the Model Groups tab, select the published model version and on the options menu (three dots), choose Update model status.
- For Status, choose Approved, then choose Save and update.
- Select the approved model and on the options menu (three dots), choose Publish to asset catalog.
- After the status is updated, choose View asset to view the published asset.
Alternatively, choose Assets in the navigation pane and on the Asset catalog tab, view the published model by searching the catalog or filtering by the asset type.
The published ML model is also accessible from the Amazon DataZone data portal. Navigate to the Banking-Consumer-ML project and choose Published data to view the details of the ML model published from SageMaker Canvas.
The published model can also be subscribed to from other projects from the Amazon DataZone domain.
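The model registry steps above (finding the newest version in the model group and setting its status to Approved) can also be performed with the boto3 `sagemaker` client. The following is a minimal sketch under that assumption; the group name is the one used in this post, and publishing to the asset catalog is left to the console steps described earlier.

```python
"""Sketch: approve the latest model version in a SageMaker model group."""


def latest_model_package(sagemaker, group_name):
    """Return the summary of the most recently created version in the group."""
    response = sagemaker.list_model_packages(
        ModelPackageGroupName=group_name,
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    summaries = response["ModelPackageSummaryList"]
    if not summaries:
        raise LookupError(f"no model versions found in group {group_name!r}")
    return summaries[0]


def approve_model_package(sagemaker, model_package_arn):
    """Set the approval status of one model version to Approved."""
    sagemaker.update_model_package(
        ModelPackageArn=model_package_arn,
        ModelApprovalStatus="Approved",
    )
```

For example, with `sm = boto3.client("sagemaker")`, you could call `latest_model_package(sm, "canvas-Banking-Customer-Prediction-Model")` and pass the returned `ModelPackageArn` to `approve_model_package`.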
Clean up
We recommend deleting any potentially unused resources to avoid incurring unexpected costs. For example, you can delete the Amazon DataZone domain and log out of SageMaker Canvas to automatically delete the workspace instance.
Conclusion
In this post, we covered an end-to-end integration of SageMaker Canvas and Amazon DataZone, including infrastructure controls, sharing and consuming data assets, and creating and publishing ML models. This integration provides a powerful solution for data governance, collaboration, and reusability across ML projects. With Amazon DataZone, data publishers can publish and govern access to data assets, and data scientists can discover, subscribe to, and consume those datasets within SageMaker Canvas. This streamlined workflow enables efficient collaboration between data providers and consumers. Moreover, the ability to publish trained ML models back to the Amazon DataZone catalog promotes reusability, allowing models to be discovered and subscribed to by other teams or projects within the organization. This approach reduces duplication of effort and fosters knowledge sharing across the ML lifecycle.
You can extend this solution to generative artificial intelligence (AI) use cases as well. For example, large language models (LLMs) or other FMs trained on curated datasets can be published and shared through Amazon DataZone, enabling different teams to fine-tune or adapt these models for their specific applications while adhering to robust governance policies. This empowers organizations to unlock the full potential of ML and generative AI while maintaining control and oversight over their data assets.
Try out the new Amazon DataZone integration with SageMaker Canvas today to search and discover the published datasets from an Amazon DataZone project, subscribe to and consume data from SageMaker Canvas, perform feature engineering, build an ML model, and then publish the model back to the Amazon DataZone project.
About the authors
Aparajithan Vaidyanathan is a Principal Enterprise Solutions Architect at AWS. He helps enterprise customers migrate and modernize their workloads on the AWS Cloud. He is a Cloud Architect with 24+ years of experience designing and developing enterprise, large-scale, and distributed software systems. He specializes in machine learning and data analytics, with a focus on data and feature engineering. He is an aspiring marathon runner, and his hobbies include hiking, bike riding, and spending time with his wife and two boys.
Ajjay Govindaram is a Senior Solutions Architect at AWS. He works with strategic customers who are using AI/ML to solve complex business problems. His experience lies in providing technical direction as well as design assistance for modest to large-scale AI/ML application deployments. His knowledge ranges from application architecture to big data, analytics, and machine learning. He enjoys listening to music while resting, experiencing the outdoors, and spending time with his loved ones.
Siamak Nariman is a Senior Product Manager at AWS. He is focused on AI/ML technology, ML model management, and ML governance to improve overall organizational efficiency and productivity. He has extensive experience automating processes and deploying various technologies.
Huong Nguyen is a Sr. Product Manager at AWS. She is leading the ML data preparation for SageMaker Canvas and SageMaker Data Wrangler, with 15 years of experience building customer-centric and data-driven products.