Scraping Data with DevTools and HAR Files



Data scraping is a game-changer for anyone looking to extract meaningful information from websites. With tools like Chrome DevTools and HAR files, you can uncover hidden APIs and capture valuable data streams effortlessly. In this post, I’ll share how I used these tools to scrape product data from Blinkit, a grocery delivery platform, and show you how you can do it too.



Why I Chose Data Scraping for My Grocery App

While building a grocery delivery app, I faced a major challenge: a lack of real data. Creating my own dataset from scratch would have been extremely time-consuming and offered no real advantage to the project. I needed a quicker, more practical solution, which led me to the idea of scraping data. By extracting product details from Blinkit, I could get accurate, real-world data to test and refine my app without wasting resources.



Common Methods to Scrape Data on the Web

  1. Manual Copy-Pasting

    • Simple but tedious. Suitable for extracting small amounts of data.
  2. Web Scraping Tools

    • Tools like Scrapy, BeautifulSoup, or Puppeteer automate the process of extracting data from websites.
    • Best for structured data extraction on a larger scale.
  3. API Integration

    • Some websites offer public APIs for accessing their data directly and legally.
    • Requires knowledge of API endpoints and authentication processes.
  4. Browser DevTools

    • Inspect network requests, capture HAR files, or analyze page elements directly in the browser.
    • Great for identifying hidden APIs or JSON data.
  5. Headless Browsers

    • Use headless browser libraries like Puppeteer or Selenium to automate navigation and scraping.
    • Ideal for sites requiring JavaScript rendering or interaction.
  6. Parsing HAR Files

    • HAR files capture all network activity for a webpage. They can be parsed to extract APIs, JSON responses, or other data.
    • Useful for sites with dynamic content or hidden data.
  7. HTML Parsing

    • Extract data by parsing HTML content using libraries like BeautifulSoup (Python) or Cheerio (Node.js).
    • Effective for simple, static websites (a minimal sketch follows this list).
  8. Data Extraction from PDFs or Images

    • Tools like PyPDF2, Tesseract (OCR), or Adobe APIs help extract text from files when data isn’t available online.
  9. Automated Scripts

    • Custom scripts written in Python, Node.js, or similar languages to scrape, parse, and store data.
    • Offers complete control over the scraping process.
  10. Third-Party APIs

    • Use services like DataMiner, Octoparse, or Scrapy Cloud to handle scraping tasks for you.
    • Saves time but may have limitations based on service plans.
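As a point of comparison before diving into the HAR approach, here is what the simplest programmatic method, HTML parsing (method 7), might look like. This is only a minimal sketch; the URL and CSS selector are placeholders, not Blinkit's markup:

    import requests
    from bs4 import BeautifulSoup

    # Placeholder URL and selector; inspect the target site's markup first.
    html = requests.get("https://example.com/products", timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.select(".product-name"):
        print(tag.get_text(strip=True))

This works only for static pages; sites that render content with JavaScript, like Blinkit, need one of the other approaches.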



I Chose HAR File Parsing



What is a HAR File?

A HAR (HTTP Archive) file is a JSON-formatted archive file that records the network activity of a web page. It contains detailed information about every HTTP request and response, including headers, query parameters, payloads, and timings. HAR files are often used for debugging, performance analysis, and, in this case, data scraping.



Structure of a HAR File

A HAR file consists of several sections; the primary ones are described below, followed by a short Python sketch for inspecting them:

[Diagram: structure of a HAR file]

  1. Log

    • The root object of a HAR file, containing metadata about the recorded session and the captured entries.
  2. Entries

    • An array of objects where each entry represents an individual HTTP request and its corresponding response.

Key properties of each entry include:

  • request: Details about the request, such as URL, headers, method, and query parameters.
  • response: Information about the response, including status code, headers, and content.
  • timings: The breakdown of the time spent during the request-response cycle (e.g., DNS, connect, wait, receive).
  3. Pages

    • Contains data about the web pages loaded during the session, such as the page title, load time, and the timestamp of when the page was opened.
  4. Creator

    • Metadata about the tool or browser used to generate the HAR file, including its name and version.
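To make that structure concrete, here is a minimal sketch of loading a HAR file and poking at those sections in Python (the filename is illustrative):

    import json

    # Load a HAR file exported from Chrome DevTools.
    with open("blinkit.har", encoding="utf-8") as f:
        har = json.load(f)

    log = har["log"]                  # the root object
    print(log["creator"]["name"])     # tool that produced the file
    print(len(log["entries"]))        # number of captured request/response pairs

    # Each entry pairs one HTTP request with its response and timings.
    entry = log["entries"][0]
    print(entry["request"]["url"])
    print(entry["response"]["status"])
    print(entry["timings"])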



Why I Chose HAR File Parsing

HAR files provide a comprehensive snapshot of all network activity on a webpage. This makes them perfect for identifying hidden APIs, capturing JSON payloads, and extracting the exact data required for scraping. The structured JSON format also simplifies the parsing process using tools like Python or JavaScript libraries.



The Plan: Scraping Data Using HAR File Parsing


To extract product data from Blinkit efficiently, I followed a structured plan:

  1. Browsing and Capturing Network Activity

    • Opened Blinkit’s site and launched Chrome DevTools.
    • Browsed various product pages to capture all necessary API calls in the Network tab.

[Screenshot: Blinkit open with Chrome DevTools]

  2. Exporting the HAR File

    • Saved the recorded network activity as a HAR file for offline analysis.
  3. Parsing the HAR File

    • Used Python to parse the HAR file and extract relevant data.
    • Created three key functions to streamline the process:
  • Function 1: Filter Relevant Responses

    • Extracted all responses matching the endpoint /listing?catId=* to get product-related data.

[Screenshot: the filter function]
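The original function appears only as a screenshot, so here is a minimal sketch of what a filter like this can look like, reusing the har object loaded earlier. The endpoint pattern is the one named above; the function name and details are my own:

    import base64
    import json

    def filter_listing_responses(har):
        """Collect JSON bodies of responses whose URL matches /listing?catId=*."""
        matches = []
        for entry in har["log"]["entries"]:
            url = entry["request"]["url"]
            if "/listing" not in url or "catId=" not in url:
                continue
            content = entry["response"].get("content", {})
            text = content.get("text")
            if text is None:
                continue
            # DevTools sometimes base64-encodes response bodies in HAR files.
            if content.get("encoding") == "base64":
                text = base64.b64decode(text).decode("utf-8")
            matches.append(json.loads(text))
        return matches

    responses = filter_listing_responses(har)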

  • Function 2: Clean and Extract Data

    • Processed the filtered responses to extract key fields like id, name, category, and more.

[Screenshot: the clean-and-extract function]
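Again as a sketch: the exact response schema isn't shown in the post, so the field paths below are assumptions made to illustrate the cleaning step; adjust them to whatever the real payloads contain:

    def extract_products(responses):
        """Flatten the filtered responses into simple product records."""
        products = []
        for payload in responses:
            # Hypothetical layout: product dicts under a "products" key.
            for item in payload.get("products", []):
                products.append({
                    "id": item.get("id"),
                    "name": item.get("name"),
                    "category": item.get("category"),
                    "image_url": item.get("image_url"),
                })
        return products

    products = extract_products(responses)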

  • Function 3: Save Images Locally

    • Identified all product image URLs in the data and downloaded them to local files for reference.

[Screenshot: the image download function]
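A sketch of the download step, using the third-party requests library; the output directory and filename scheme are my own choices:

    import os
    import requests

    def download_images(products, out_dir="images"):
        """Download each product's image to a local file named after its id."""
        os.makedirs(out_dir, exist_ok=True)
        for product in products:
            url = product.get("image_url")
            if not url:
                continue
            resp = requests.get(url, timeout=10)
            if resp.ok:
                path = os.path.join(out_dir, f"{product['id']}.jpg")
                with open(path, "wb") as f:
                    f.write(resp.content)

    download_images(products)

Chained together, the three functions take you from a raw HAR file to a local folder of product images.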

  4. Execution and Results

    • The entire process, including some trial and error, took around 30–40 minutes.
    • Successfully scraped data for approximately 600 products, including names, categories, and images.

[Screenshot: the extracted data]

This approach allowed me to gather the necessary data for my grocery delivery app quickly and efficiently.



Conclusion

Data scraping, when done efficiently, can save a lot of time and effort, especially when you need real-world data to test or build an application. By leveraging Chrome DevTools and HAR files, I was able to quickly extract valuable product data from Blinkit without manually creating a dataset. The process, while requiring some trial and error, was straightforward and offered a practical solution to a common problem developers face. With this method, I gathered around 600 product records in under an hour, which let me move forward with my grocery delivery app project.

Data scraping, however, should always be approached ethically and responsibly. Always ensure you comply with a website’s terms of service and legal guidelines before scraping. If done right, scraping can be a powerful tool for collecting data and improving your projects.


