Step-by-Step Guide: Write Your First Web Crawler with Selenium in C#

Introduction

In this exercise, we’ll fetch the daily trending topics from Google Trends. We’ll walk through the WebCrawlingHelper class, which simplifies navigating web pages, finding elements, and extracting data.



Prerequisites

Before we begin, ensure you have the following:

  1. Visual Studio or any C# IDE.
  2. .NET SDK installed.
  3. Selenium WebDriver and ChromeDriver packages installed. You can add them via NuGet Package Manager:
Install-Package Selenium.WebDriver
Install-Package Selenium.WebDriver.ChromeDriver



Step 1: Setting Up the WebCrawlingHelper Class

Create the Helper Class

Create a new C# class file named WebCrawlingHelper.cs and copy the following code into it:

using System.Collections.Generic;
using System.Linq;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

namespace PRA.Util
{
    public class WebCrawlingHelper
    {
        private IWebDriver driver;
        private List<WebElementModel> elements;

        public WebCrawlingHelper()
        {
            var options = new ChromeOptions()
            {
                BinaryLocation = @"C:\Program Files\Google\Chrome\Application\chrome.exe"
            };

            options.AddArguments(new List<string>() {
                "--headless",
                "--disable-gpu"
            });

            driver = new ChromeDriver(options);
        }

        // Navigate to a URL
        public void GoToUrl(string _url)
        {
            this.driver.Navigate().GoToUrl(_url);
            this.elements = new List<WebElementModel>();
        }

        // Add a single web element
        public void AddElement(By _by, string _elementName)
        {
            WebElementModel tempElement = new WebElementModel();
            tempElement.ElementName = _elementName;
            tempElement.Element = this.driver.FindElement(_by);
            this.elements.Add(tempElement);
        }

        // Add multiple web elements
        public void AddElements(By _by, string _elementName)
        {
            var tempElements = this.driver.FindElements(_by);
            int i = 0;

            foreach (IWebElement element in tempElements)
            {
                WebElementModel tempElement = new WebElementModel();
                tempElement.ElementName = _elementName + "_" + i.ToString();
                tempElement.Element = element;
                this.elements.Add(tempElement);
                i++;
            }
        }

        // Count elements matching a locator
        public int CountNoOfElements(By _by)
        {
            return this.driver.FindElements(_by).Count;
        }

        // Send keys to an input element
        public void ElementSendKeys(string _key, string _elementName)
        {
            IWebElement input = this.elements.FirstOrDefault(x => x.ElementName == _elementName)?.Element;
            input?.SendKeys(_key);
        }

        // Get all stored elements
        public List<WebElementModel> GetElements()
        {
            return this.elements;
        }

        // Click on a web element
        public void Click(string _elementName)
        {
            IWebElement button = this.elements.FirstOrDefault(x => x.ElementName == _elementName)?.Element;
            button?.Click();
            System.Threading.Thread.Sleep(5000); // Wait for page load
        }

        // Quit the driver
        public void Quit()
        {
            this.driver.Quit();
        }
    }

    public class WebElementModel
    {
        public string ElementName { get; set; }
        public IWebElement Element { get; set; }
    }
}

Explanation of the Code

  • Driver Initialization: The ChromeDriver is initialized with options to run in headless mode (without a UI).
  • Element Management: Methods are provided to add single/multiple elements, count elements, send keys, and click elements.
  • Data Storage: Elements are stored in a list for easy access later.
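One weak spot worth noting: the Click method pauses for a fixed five seconds after every click, which is either too long or too short depending on the page. An explicit wait is more robust. The sketch below is a suggested variation, not part of the helper class above; it assumes the Selenium.Support NuGet package, which provides WebDriverWait:

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

public static class WaitExample
{
    // Clicks a button, then waits (up to 10 seconds) until at least one
    // element matching nextBy appears, instead of sleeping a fixed 5 seconds.
    public static void ClickAndWait(IWebDriver driver, IWebElement button, By nextBy)
    {
        button.Click();

        var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
        wait.Until(d => d.FindElements(nextBy).Count > 0);
    }
}
```

You could fold this into Click by passing a locator for an element you expect on the next page.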



Step 2: Using the WebCrawlingHelper

Create a Main Method

Now, let’s create a method to demonstrate how to use the WebCrawlingHelper. Add the following method to your program:

public static void WebCrawling_POC()
{
    WebCrawlingHelper crawler = new WebCrawlingHelper();

    // Navigate to Google Trends
    crawler.GoToUrl("https://trends.google.com/trending?geo=US&hl=zh-HK&hours=24");

    // Count the number of rows in the trends table
    int noOfRows = crawler.CountNoOfElements(By.XPath("//*[@id='trend-table']/div[1]/table/tbody[2]/tr"));

    // Store all related elements
    for (int i = 0; i < noOfRows; i++)
    {
        string iplus1 = (i + 1).ToString();
        string istring = i.ToString();

        crawler.AddElements(By.XPath($"//*[@id='trend-table']/div[1]/table/tbody[2]/tr[{iplus1}]/td[2]/div[1]"), "topic_" + istring);
        crawler.AddElements(By.XPath($"//*[@id='trend-table']/div[1]/table/tbody[2]/tr[{iplus1}]/td[3]/div/div[1]"), "rate_" + istring);
    }

    // Print all related elements
    for (int j = 0; j < noOfRows; j++)
    {
        // AddElements appends an index suffix, so the first match is stored as "topic_{j}_0"
        var topic = crawler.GetElements().FirstOrDefault(x => x.ElementName == $"topic_{j}_0");
        var rate = crawler.GetElements().FirstOrDefault(x => x.ElementName == $"rate_{j}_0");

        if (topic != null && rate != null)
        {
            Console.WriteLine($"Topic: {topic.Element.GetAttribute("innerText")}, Rate: {rate.Element.GetAttribute("innerText")}");
        }
    }

    // Clean up
    crawler.Quit();
}

Explanation of the Usage Code

  • Navigate to URL: The GoToUrl method is called to navigate to Google Trends.
  • Count Elements: The number of rows in the trends table is counted.
  • Store Elements: Topics and rates are stored using the AddElements method.
  • Print Data: The results are printed to the console.
  • Quit Driver: Finally, the Quit method is called to close the browser.



Step 3: Running Your Code

To run your code:

  1. Ensure all necessary packages are installed.
  2. Call the WebCrawling_POC method from your Main method.
  3. Compile and run your application.
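A minimal entry point tying the steps together might look like the sketch below. It assumes WebCrawling_POC lives in the same Program class; the error handling is a suggestion, not part of the original code:

```csharp
using System;
using OpenQA.Selenium;

class Program
{
    static void Main(string[] args)
    {
        try
        {
            WebCrawling_POC();
        }
        catch (WebDriverException ex)
        {
            // Surface Selenium errors (e.g. ChromeDriver/Chrome version mismatch)
            // instead of letting the process crash silently.
            Console.WriteLine($"Crawling failed: {ex.Message}");
        }
    }
}
```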

Result: the console prints each trending topic together with its rate, one per line (screenshot omitted).



Conclusion

You have successfully created a web crawling utility using Selenium in C#. You can enhance this utility further by adding more features such as error handling, logging, and scraping additional data.

Feel free to experiment with different websites and expand the functionality as needed! Happy coding!



Remarks

You could also get elements by:

  1. By.Id(_id)
  2. By.TagName(_tagName)
  3. By.CssSelector(_css)

And you could use any browser inspect tools to find the XPath (Copy XPath) you want!
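The same helper methods accept any of those locator strategies. The IDs, selectors, and element names below are hypothetical and only illustrate the call shape:

```csharp
// Illustrative only: these locators and names are not taken from Google Trends.
crawler.AddElement(By.Id("search-box"), "searchInput");
crawler.AddElements(By.TagName("tr"), "tableRow");
crawler.AddElement(By.CssSelector("div.trend-title > span"), "trendTitle");

// Type a query into the element stored under the name "searchInput"
crawler.ElementSendKeys("selenium c#", "searchInput");
```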




By stp2y
