Detecting and Analyzing Comment Quality Using Vector Search

There is a good chance you encounter vector search regularly, even if you are not building applications with it. Discovering content recommendations based on previously liked content is a common use case for vector embeddings, and one that many of us rely on as consumers of media. Yet you may not realize that vector search can do a lot more than tell us what new movie to watch on a Saturday evening. Before you scroll down to the comments section on your favorite blog post, find out how vector search can help you decide if it’s worth your time.



Everyday Applications of Vector Search

First, let’s take just a moment to remind ourselves what we are talking about when we talk about vector search. A vector embedding is a list of numbers that captures the semantic and contextual meaning of a given piece of data, whether that is text, video, images, or audio. This is possible thanks to embedding models trained on human language and vast amounts of information. When we talk about vector search, we are talking about searching across those vector embeddings to surface the most semantically relevant results.
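
To make that a little more concrete, here is a minimal sketch of the idea, assuming the official openai npm package and an OPENAI_API_KEY in your environment. The embedding model named below is my own choice for illustration, not something specified in this post. It embeds two pieces of text and measures how close their meanings are using cosine similarity, the comparison at the heart of most vector search:

import OpenAI from "openai";

// Reads OPENAI_API_KEY from the environment.
const openai = new OpenAI();

// Cosine similarity compares the direction of two vectors: values close to 1
// mean the underlying texts are semantically similar, values near 0 mean
// they are unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function compare(textA: string, textB: string): Promise<number> {
  // One API call can embed both texts; each comes back as a long list of numbers.
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small", // assumed model choice for this sketch
    input: [textA, textB],
  });
  const [embeddingA, embeddingB] = response.data.map((item) => item.embedding);
  return cosineSimilarity(embeddingA, embeddingB);
}

compare(
  "How do I fix this npm install error?",
  "Try deleting node_modules and running the install again."
).then((score) => console.log(`Similarity: ${score.toFixed(3)}`));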

Great, now that we have a shared understanding of what we are discussing, let’s dive into the actual topic at hand, because vector embeddings can do so much more for us than just show us the next great baking show we’re going to want to watch.

Vector search can give us a surprisingly reliable glimpse into the quality of content before we even begin reading it.



Practical Example: Analyzing Dev.to Blog Comments

We have all experienced comment sections on articles that have run amok: the comments have little to do with each other, are not connected to the article they are supposedly responding to, and are filled with spam about get-rich-quick schemes and similar scams. Sometimes, though, comment sections can be helpful. Comments can carry a conversation forward past the original article, with readers adding their own knowledge and viewpoints. There are many technical blog posts I have read where commenters offered solutions more up to date than the original article, saving me valuable time when researching how to fix a bug or work with a given library.

How do we know when a comment section falls into the first category and should be avoided, or when it falls into the second and is worth checking out?

Converting the comments section of a blog post into vector embeddings and then scoring the contextual and semantic similarity of the comments against each other can give us a lot of insight into answering that question.
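
To sketch what that scoring might look like, imagine each comment has already been turned into an embedding. Averaging the cosine similarity of every pair of comments gives a single percentage. This is a simplified illustration of the idea, not the extension’s actual code:

// Simplified sketch of scoring a comment section. `embeddings` holds one
// vector per comment, produced by an embedding model beforehand. This is an
// illustration of the idea, not the extension's actual implementation.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function commentQualityScore(embeddings: number[][]): number {
  if (embeddings.length < 2) return 100; // nothing to compare against

  let total = 0;
  let pairs = 0;
  for (let i = 0; i < embeddings.length; i++) {
    for (let j = i + 1; j < embeddings.length; j++) {
      total += cosineSimilarity(embeddings[i], embeddings[j]);
      pairs++;
    }
  }

  // Average pairwise similarity, expressed as a percentage.
  return Math.round((total / pairs) * 100);
}

A tightly clustered, on-topic comment section scores high, while spam and off-topic noise pull the average down.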

Wondering, though, how you might be able to give that a try? I built a Chrome extension just for you! This extension will give you the opportunity to experience the usefulness of vector search way beyond content recommendations and into quality control.


While the extension works for blog comments, this idea goes even further than quality control of blog posts. How about fraud detection? Revolut, one of Europe’s biggest digital banks, is doing exactly that for its credit card customers every single day.



Building the Chrome Extension

Want to give it a spin? Here’s a step-by-step guide to running this Chrome extension so you can see how vector search can determine the overall quality of blog post comments before you ever scroll down the page.

As a disclaimer, this extension requires a little technical know-how to use, as it is not built for mass distribution. In particular, some familiarity with GitHub and the command line will be very useful. You do not need to write any code; it is all written for you.

The extension comes in two parts: the extension itself and a backend web server that processes the data. You need the server running to get results, so let’s get working on both.



Setting up the Backend Server

Navigate to this GitHub repository in your web browser and clone the repository to your computer. If you have the GitHub CLI, you can run the following from your command line:

gh repo clone hummusonrails/comments-spam-analyzer-backend

Once you have the contents, go to the directory in your terminal and rename the sample environment variables file from .env.sample to .env. This file will hold your confidential credentials for both OpenAI and Couchbase, so make sure not to share it on any public website like GitHub. A sketch of what the finished file might look like follows the steps below.

  1. Go get your OpenAI API key from the OpenAI portal and add it to the environment file.
  2. Create a new cluster and bucket in Couchbase Capella, the fully managed database as a service platform. Capella offers a free forever account option, which is perfect to use for this extension. Add the cluster and bucket names to your environment file.
  3. Grab your Couchbase Capella database access credentials, or create new credentials from the UI if you have not done so yet. Add the connection credentials to your environment file.
  4. Fetch your Couchbase Capella connection string. Add the connection string to your environment file.
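
As mentioned above, here is roughly what the finished environment file could look like. The variable names and placeholder values are assumptions for illustration; use the exact names from the repository’s .env.sample file:

# Placeholder values only; copy the real variable names from .env.sample
OPENAI_API_KEY=sk-your-openai-api-key
COUCHBASE_CONNECTION_STRING=couchbases://your-connection-string.cloud.couchbase.com
COUCHBASE_USERNAME=your-database-access-username
COUCHBASE_PASSWORD=your-database-access-password
COUCHBASE_CLUSTER=your-cluster-name
COUCHBASE_BUCKET=your-bucket-name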

With your environment file defined, install the server’s dependencies by running npm install from the command line, and then start the server by running npm start.

Your backend server is now up and running and ready to begin processing comment data: converting the comments into vector embeddings and providing you with a quality percentage score.



Installing and Using the Browser Extension

Last, but certainly not least, let’s get the browser extension up and running.

As in the previous step, first navigate to this GitHub repo and clone the repository. If you have the GitHub CLI, you can run the following command from your terminal:

gh repo clone hummusonrails/comments-spam-analyzer

From the project directory, install the dependencies by running npm install and then npm run build to build the extension. Your extension is now ready to be added to your web browser.

Inside Chrome, navigate to chrome://extensions and toggle on Developer Mode. Click the Load Unpacked button, which appears once you have enabled Developer Mode, and select the extension’s directory in your file system.

The extension has been built to work with any blog post on this site, https://dev.to/. All you need to do is open a blog post from the site and then open the extension by clicking on the Extensions tab in your browser menu and choosing the Comment Quality Analyzer extension you just loaded.

The first time you run the extension, it will ask you for the URL of your backend server. Since you are running it locally, enter http://localhost:3000/ and press submit. Then click Analyze and wait a few seconds for the results to be processed. Once processed, you will see a percentage score for the quality of the comments on that blog post, as determined by their semantic and contextual similarity. The idea is that the more contextually and semantically similar the comments are to each other, the more likely they are to stay on topic.
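
Under the hood, the round trip looks something like the sketch below: gather the comment text from the page and send it to the locally running backend, which answers with the score. The CSS selector, endpoint path, and response shape here are assumptions for illustration rather than the extension’s exact implementation.

// Rough sketch of what the extension does when you click Analyze.
// The selector, endpoint, and payload shape are assumptions for illustration.
async function analyzeCurrentPage(backendUrl: string): Promise<number> {
  // Collect the text of each comment on the page (selector is hypothetical).
  const comments = Array.from(
    document.querySelectorAll<HTMLElement>(".comment__body")
  ).map((el) => el.innerText.trim());

  // Send the comments to the backend for embedding and scoring.
  const response = await fetch(new URL("/analyze", backendUrl), {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ comments }),
  });

  const { score } = await response.json();
  return score; // e.g. 87, meaning a fairly on-topic comment section
}

analyzeCurrentPage("http://localhost:3000/").then((score) =>
  console.log(`Comment quality: ${score}%`)
);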



Real-world Applications and Beyond

This Chrome extension is just the tip of the iceberg of what you can accomplish and build with vector search! Want to learn more and get inspired to build your own innovative use cases? There is plenty of further reading on vector search out there to keep exploring.


