Generating fashion product descriptions by fine-tuning a vision-language model with SageMaker and Amazon Bedrock

In the world of online retail, creating high-quality product descriptions for millions of products is a crucial, but time-consuming task. Using machine learning (ML) and natural language processing (NLP) to automate product description generation has the potential to save manual effort and transform the way ecommerce platforms operate. One of the main advantages of high-quality product descriptions is the improvement in searchability. Customers can more easily locate products that have correct descriptions, because it allows the search engine to identify products that match not just the general category but also the specific attributes mentioned in the product description. For example, a product that has a description that includes words such as “long sleeve” and “cotton neck” will be returned if a consumer is looking for a “long sleeve cotton shirt.” Furthermore, having factoid product descriptions can increase customer satisfaction by enabling a more personalized buying experience and improving the algorithms for recommending more relevant products to users, which raise the probability that users will make a purchase.

With the advancement of Generative AI, we can use vision-language models (VLMs) to predict product attributes directly from images. Pre-trained image captioning or visual question answering (VQA) models perform well on describing every-day images but can’t to capture the domain-specific nuances of ecommerce products needed to achieve satisfactory performance in all product categories. To solve this problem, this post shows you how to predict domain-specific product attributes from product images by fine-tuning a VLM on a fashion dataset using Amazon SageMaker, and then using Amazon Bedrock to generate product descriptions using the predicted attributes as input. So you can follow along, we’re sharing the code in a GitHub repository.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

You can use a managed service, such as Amazon Rekognition, to predict product attributes as explained in Automating product description generation with Amazon Bedrock. However, if you’re trying to extract specifics and detailed characteristics of your product or your domain (industry), fine-tuning a VLM on Amazon SageMaker is necessary.

Vision-language models

Since 2021, there has been a rise in interest in vision-language models (VLMs), which led to the release of solutions such as Contrastive Language-Image Pre-training (CLIP) and Bootstrapping Language-Image Pre-training (BLIP). When it comes to tasks such as image captioning, text-guided image generation, and visual question-answering, VLMs have demonstrated state-of-the art performance.

In this post, we use BLIP-2, which was introduced in BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, as our VLM. BLIP-2 consists of three models: a CLIP-like image encoder, a Querying Transformer (Q-Former) and a large language model (LLM). We use a version of BLIP-2, that contains Flan-T5-XL as the LLM.

The following diagram illustrates the overview of BLIP-2:

Figure 1: BLIP-2 overview

The pre-trained version of the BLIP-2 model has been demonstrated in Build an image-to-text generative AI application using multimodality models on Amazon SageMaker and Build a generative AI-based content moderation solution on Amazon SageMaker JumpStart. In this post, we demonstrate how to fine-tune BLIP-2 for a domain-specific use case.

Solution overview

The following diagram illustrates the solution architecture.

Figure 2: High-level solution architecture

The high-level overview of the solution is:

An ML scientist uses Sagemaker notebooks to process and split the data into training and validation data.
The datasets are uploaded to Amazon Simple Storage Service (Amazon S3) using the S3 client (a wrapper around an HTTP call).
Then the Sagemaker client is used to launch a Sagemaker Training job, again a wrapper for an HTTP call.
The training job manages the copying of the datasets from S3 to the training container, the training of the model, and the saving of its artifacts to S3.
Then, through another call of the Sagemaker client, an endpoint is generated, copying the model artifacts into the endpoint hosting container.
The inference workflow is then invoked through an AWS Lambda request, which first makes an HTTP request to the Sagemaker endpoint, and then uses that to make another request to Amazon Bedrock.

In the following sections, we demonstrate how to:

Set up the development environment
Load and prepare the dataset
Fine-tune the BLIP-2 model to learn product attributes using SageMaker
Deploy the fine-tuned BLIP-2 model and predict product attributes using SageMaker
Generate product descriptions from predicted product attributes using Amazon Bedrock

Set up the development environment

An AWS account is needed with an AWS Identity and Access Management (IAM) role that has permissions to manage resources created as part of the solution. For details, see Creating an AWS account.

We use Amazon SageMaker Studio with the ml.t3.medium instance and the Data Science 3.0 image. However, you can also use an Amazon SageMaker notebook instance or any integrated development environment (IDE) of your choice.

Note: Be sure to set up your AWS Command Line Interface (AWS CLI) credentials correctly. For more information, see Configure the AWS CLI.

An ml.g5.2xlarge instance is used for SageMaker Training jobs, and an ml.g5.2xlarge instance is used for SageMaker endpoints. Ensure sufficient capacity for this instance in your AWS account by requesting a quota increase if required. Also check the pricing of the on-demand instances.

You need to clone this GitHub repository for replicating the solution demonstrated in this post. First, launch the notebook main.ipynb in SageMaker Studio by selecting the Image as Data Science and Kernel as Python 3. Install all the required libraries mentioned in the requirements.txt.

Load and prepare the dataset

For this post, we use the Kaggle Fashion Images Dataset, which contain 44,000 products with multiple category labels, descriptions, and high resolution images. In this post we want to demonstrate how to fine-tune a model to learn attributes such as fabric, fit, collar, pattern, and sleeve length of a shirt using the image and a question as inputs.

Each product is identified by an ID such as 38642, and there is a map to all the products in styles.csv. From here, we can fetch the image for this product from images/38642.jpg and the complete metadata from styles/38642.json. To fine-tune our model, we need to convert our structured examples into a collection of question and answer pairs. Our final dataset has the following format after processing for each attribute:

Id | Question | Answer
38642 | What is the fabric of the clothing in this picture? | Fabric: Cotton

After we process the dataset, we split it into training and validation sets, create CSV files, and upload the dataset to Amazon S3.

Fine-tune the BLIP-2 model to learn product attributes using SageMaker

To launch a SageMaker Training job, we need the HuggingFace Estimator. SageMaker starts and manages all of the necessary Amazon Elastic Compute Cloud (Amazon EC2) instances for us, supplies the appropriate Hugging Face container, uploads the specified scripts, and downloads data from our S3 bucket to the container to /opt/ml/input/data.

We fine-tune BLIP-2 using the Low-Rank Adaptation (LoRA) technique, which adds trainable rank decomposition matrices to every Transformer structure layer while keeping the pre-trained model weights in a static state. This technique can increase training throughput and reduce the amount of GPU RAM required by 3 times and the number of trainable parameters by 10,000 times. Despite using fewer trainable parameters, LoRA has been demonstrated to perform as well as or better than the full fine-tuning technique.

We prepared entrypoint_vqa_finetuning.py which implements fine-tuning of BLIP-2 with the LoRA technique using Hugging Face Transformers, Accelerate, and Parameter-Efficient Fine-Tuning (PEFT). The script also merges the LoRA weights into the model weights after training. As a result, you can deploy the model as a normal model without any additional code.

from peft import LoraConfig, get_peft_model
from transformers import Blip2ForConditionalGeneration
 
model = Blip2ForConditionalGeneration.from_pretrained(
        "Salesforce/blip2-flan-t5-xl",
        device_map="auto",
        cache_dir="/tmp",
        load_in_8bit=True,
    )

config = LoraConfig(
    r=8, # Lora attention dimension.
    lora_alpha=32, # the alpha parameter for Lora scaling.
    lora_dropout=0.05, # the dropout probability for Lora layers.
    bias="none", # the bias type for Lora.
    target_modules=["q", "v"],
)

model = get_peft_model(model, config)

We reference entrypoint_vqa_finetuning.py as the entry_point in the Hugging Face Estimator.

from sagemaker.huggingface import HuggingFace

hyperparameters = {
    'epochs': 10,
    'file-name': "vqa_train.csv",
}

estimator = HuggingFace(
    entry_point="entrypoint_vqa_finetuning.py",
    source_dir="../src",
    role=role,
    instance_count=1,
    instance_type="ml.g5.2xlarge", 
    transformers_version='4.26',
    pytorch_version='1.13',
    py_version='py39',
    hyperparameters = hyperparameters,
    base_job_name="VQA",
    sagemaker_session=sagemaker_session,
    output_path=f"{output_path}/models",
    code_location=f"{output_path}/code",
    volume_size=60,
    metric_definitions=[
        {'Name': 'batch_loss', 'Regex': 'Loss: ([0-9\.]+)'},
        {'Name': 'epoch_loss', 'Regex': 'Epoch Loss: ([0-9\.]+)'}
    ],
)

We can start our training job by running with the .fit() method and passing our Amazon S3 path for images and our input file.

estimator.fit({"images": images_input, "input_file": input_file})

Deploy the fine-tuned BLIP-2 model and predict product attributes using SageMaker

We deploy the fine-tuned BLIP-2 model to the SageMaker real time endpoint using the HuggingFace Inference Container. You can also use the large model inference (LMI) container, which is described in more detail in Build a generative AI-based content moderation solution on Amazon SageMaker JumpStart, which deploys a pre-trained BLIP-2 model. Here, we reference our fine-tuned model in Amazon S3 instead of the pre-trained model available in the Hugging Face hub. We first create the model and deploy the endpoint.

from sagemaker.huggingface import HuggingFaceModel

model = HuggingFaceModel(
   model_data=estimator.model_data,
   role=role,
   transformers_version="4.28",
   pytorch_version="2.0",
   py_version="py310",
   model_server_workers=1,
   sagemaker_session=sagemaker_session
)

endpoint_name = "endpoint-finetuned-blip2"
model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge", endpoint_name=endpoint_name )

When the endpoint status becomes in service, we can invoke the endpoint for the instructed vision-to-language generation task with an input image and a question as a prompt:

inputs = {
    "prompt": "What is the sleeve length of the shirt in this picture?",
    "image": image # image encoded in Base64
}

The output response looks like the following:

{"Sleeve Length": "Long Sleeves"}

Generate product descriptions from predicted product attributes using Amazon Bedrock

To get started with Amazon Bedrock, request access to the foundational models (they are not enabled by default). You can follow the steps in the documentation to enable model access. In this post, we use Anthropic’s Claude in Amazon Bedrock to generate product descriptions. Specifically, we use the model anthropic.claude-3-sonnet-20240229-v1 because it provides good performance and speed.

After creating the boto3 client for Amazon Bedrock, we create a prompt string that specifies that we want to generate product descriptions using the product attributes.

You are an expert in writing product descriptions for shirts. Use the data below to create product description for a website. The product description should contain all given attributes.
Provide some inspirational sentences, for example, how the fabric moves. Think about what a potential customer wants to know about the shirts. Here are the facts you need to create the product descriptions:
[Here we insert the predicted attributes by the BLIP-2 model]

The prompt and model parameters, including maximum number of tokens used in the response and the temperature, are passed to the body. The JSON response must be parsed before the resulting text is printed in the final line.

bedrock = boto3.client(service_name="bedrock-runtime", region_name="us-west-2")

model_id = "anthropic.claude-3-sonnet-20240229-v1"

body = json.dumps(
    {"system": prompt, "messages": attributes_content, "max_tokens": 400, "temperature": 0.1, "anthropic_version": "bedrock-2023-05-31"}
)

response = bedrock.invoke_model(
    body=body,
    modelId=model_id,
    accept="application/json",
    contentType="application/json"
)

The generated product description response looks like the following:

"Classic Striped Shirt Relax into comfortable casual style with this classic collared striped shirt. With a regular fit that is neither too slim nor too loose, this versatile top layers perfectly under sweaters or jackets."

Conclusion

We’ve shown you how the combination of VLMs on SageMaker and LLMs on Amazon Bedrock present a powerful solution for automating fashion product description generation. By fine-tuning the BLIP-2 model on a fashion dataset using Amazon SageMaker, you can predict domain-specific and nuanced product attributes directly from images. Then, using the capabilities of Amazon Bedrock, you can generate product descriptions from the predicted product attributes, enhancing the searchability and personalization of ecommerce platforms. As we continue to explore the potential of generative AI, LLMs and VLMs emerge as a promising avenue for revolutionizing content generation in the ever-evolving landscape of online retail. As a next step, you can try fine-tuning this model on your own dataset using the code provided in the GitHub repository to test and benchmark the results for your use cases.

About the Authors

Antonia Wiebeler is a Data Scientist at the AWS Generative AI Innovation Center, where she enjoys building proofs of concept for customers. Her passion is exploring how generative AI can solve real-world problems and create value for customers. While she is not coding, she enjoys running and competing in triathlons.

Daniel Zagyva is a Data Scientist at AWS Professional Services. He specializes in developing scalable, production-grade machine learning solutions for AWS customers. His experience extends across different areas, including natural language processing, generative AI, and machine learning operations.

Lun Yeh is a Machine Learning Engineer at AWS Professional Services. She specializes in NLP, forecasting, MLOps, and generative AI and helps customers adopt machine learning in their businesses. She graduated from TU Delft with a degree in Data Science & Technology.

Fotinos Kyriakides is an AI/ML Consultant at AWS Professional Services specializing in developing production-ready ML solutions and platforms for AWS customers. In his free time Fotinos enjoys running and exploring.