In today’s data-driven world, the ability to perform efficient and powerful searches is crucial for any application. MyScale, a fast, open-source columnar database management system built on ClickHouse, offers a compelling solution by combining the power of SQL with vector search capabilities. This combination enables semantic search, a method that improves search accuracy by understanding the meaning of a query rather than just matching keywords. Here’s how you can leverage MyScale for semantic search, along with its advantages and potential use cases.
For comparison, I will demonstrate both approaches: one using Python and another showcasing how to build an enterprise application with Spring and Java.
By the end of this article, we will set up CookGPT, my personal open-source Indian chef, by loading thousands of recipes as vectors and performing semantic searches on them.
As always, I highly encourage you to read the complete article. However, if you want to jump straight to the code, you can find it here. A demo version deployed on Hugging Face is available here.
Why Combine SQL with Vector Search?
Combining SQL with vector search allows you to harness the strengths of both paradigms:
SQL’s Power and Flexibility: SQL is a powerful language for querying structured data, offering robust aggregation, filtering, and joining capabilities.
Vector Search for Semantic Understanding: Vector search enables finding similarities based on the meaning of the data rather than simple keyword matching. This is especially useful for applications involving natural language processing (NLP), image retrieval, and recommendation systems.
By integrating these two approaches, MyScale allows you to perform complex queries that understand and process the semantic context of your data.
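For example, a single statement can combine a relational filter with a vector ranking. Here is a sketch of what such a hybrid query can look like (the table, columns, and `distance()` call mirror the recipe schema used later in this article; the helper function and its parameters are my own illustration):

```python
def hybrid_search_sql(emb_query, category, top_k=3):
    # A plain SQL WHERE clause narrows the candidate rows, then
    # distance() ranks the survivors by semantic closeness to the
    # query embedding -- SQL filtering and vector search in one pass.
    return f"""
        SELECT Recipe, distance(method_feature, {emb_query}) AS dist
        FROM default.myscale_cookgpt
        WHERE Category = '{category}'
        ORDER BY dist
        LIMIT {top_k}
    """

print(hybrid_search_sql([0.1, 0.2, 0.3], 'Dessert'))
```

In a real application the embedding literal would be the full 384-element vector produced by the retriever model.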
Setting Up MyScale for Semantic Search
To demonstrate how to use MyScale for semantic search, let’s walk through an example project setup using Spring Boot. Our goal is to create a table, process and insert records, and create a vector index for efficient querying.
Getting access to MyScale
The free version of MyScale supports up to 5 million 768-dimensional vectors. You can sign in with your GitHub account; setup is straightforward, and the community support is good. Once set up, you can access your dashboard with this link.
MyScale also features a web-based SQL Workspace, eliminating the need for an additional SQL client.
Python Example
The code for this example is here.
import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# load the retriever model from the Hugging Face model hub
retriever = SentenceTransformer('all-minilm-l6-v2', device=device)
This code snippet initializes a natural language processing model for semantic text similarity using the SentenceTransformer library with the ‘all-minilm-l6-v2’ model from Hugging Face. It first checks for GPU availability using torch, setting the device accordingly (‘cuda’ for GPU or ‘cpu’ for CPU). This setup ensures optimal performance by leveraging GPU acceleration when possible, enhancing the efficiency of semantic similarity computations for text data.
import pandas as pd
from datasets import load_dataset

dataset = load_dataset("VishalMysore/newIndianCuisine")
data_raw = pd.DataFrame(dataset['train'])
# Display information about the cleaned DataFrame
print(data_raw.info())
This code snippet uses the datasets library to load the ‘newIndianCuisine’ dataset from Hugging Face’s repository. It converts the ‘train’ split of this dataset into a Pandas DataFrame named data_raw. The print(data_raw.info()) statement then displays concise information about the DataFrame, including its structure, column names, data types, and memory usage, making it easy to explore and analyze the dataset.
from tqdm import tqdm

summary_raw = data_raw['Method'].values.tolist()
method_feature = []
# encode one summary at a time (batch size 1)
for i in tqdm(range(0, len(summary_raw), 1)):
    i_end = min(i + 1, len(summary_raw))
    # generate embeddings for summary
    emb = retriever.encode(summary_raw[i:i_end]).tolist()[0]
    method_feature.append(emb)
data_raw['method_feature'] = method_feature
This code snippet focuses on generating embeddings for the ‘Method’ column of the DataFrame data_raw. Initially, it extracts the textual summaries from the ‘Method’ column and converts them into a list (summary_raw). Using a loop that iterates through the list, it employs a retriever model (retriever) to encode each summary into embeddings (emb). These embeddings are then appended to the method_feature list. Finally, the generated embeddings are added back to the DataFrame as a new column named ‘method_feature’. This process allows for the creation of numerical representations (embeddings) of textual data, facilitating tasks such as semantic analysis or similarity searches based on the content of the ‘Method’ descriptions.
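The snippets that follow assume a `client` object connected to your MyScale cluster. A minimal connection sketch using the clickhouse-connect driver (the host, username, and password below are placeholders for the values shown in your MyScale console):

```python
import clickhouse_connect

# Placeholder credentials: copy the real values from your MyScale console.
client = clickhouse_connect.get_client(
    host='your-cluster.myscale.com',
    port=8443,
    username='your-username',
    password='your-password',
)
```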
client.command("""
CREATE TABLE default.myscale_cookgpt
(
id UInt64,
Recipe String,
"Total Time" String,
Method String,
Category String,
Ingredients String,
method_feature Array(Float32),
CONSTRAINT vector_len CHECK length(method_feature) = 384
)
ORDER BY id
""")
This code snippet executes a SQL command using client.command() to create a table named myscale_cookgpt in the default database. The schema includes columns such as id (UInt64), Recipe, Total Time, Method, Category, and Ingredients (all String), plus method_feature as an Array(Float32). A constraint, vector_len, ensures that the method_feature array always has exactly 384 elements, matching the output dimension of the all-minilm-l6-v2 model. The table is ordered by the id column. This schema stores recipe details alongside a vector representation (method_feature) for semantic analysis and similarity searches.
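With the table in place, the DataFrame rows can be written into it. A sketch of the insert step (the surrogate `id` column and the helper function are my own assumptions, not part of the original code; the commented `client.insert` call uses the clickhouse-connect driver):

```python
import pandas as pd

def build_rows(df):
    # Add a surrogate id and order the columns to match the table schema.
    df = df.copy()
    df['id'] = range(len(df))
    columns = ['id', 'Recipe', 'Total Time', 'Method', 'Category',
               'Ingredients', 'method_feature']
    return columns, df[columns].values.tolist()

# columns, rows = build_rows(data_raw)
# client.insert('default.myscale_cookgpt', rows, column_names=columns)
```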
client.command("""
ALTER TABLE default.myscale_cookgpt
ADD VECTOR INDEX method_feature_index method_feature
TYPE MSTG
('metric_type=Cosine')
""")
This command adds a vector index named method_feature_index to the method_feature column. The index is configured with the Cosine metric type, meaning it is optimized for cosine-similarity searches. Adding this vector index lets the database efficiently perform similarity searches over the embeddings stored in method_feature, optimizing queries that involve semantic analysis or similarity computations on large datasets.
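To make the Cosine metric concrete: under this metric, distance() behaves like a cosine distance, where 0 means identical direction and larger values mean the vectors diverge. A small standalone sketch of the same computation (my illustration, not MyScale’s internal code):

```python
import numpy as np

def cosine_distance(a, b):
    # Cosine distance = 1 - cosine similarity; smaller means more similar,
    # which is why the search queries order by dist ascending.
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # identical vectors -> 0.0
```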
question = 'what recipe is made with Paneer?'
emb_query = retriever.encode(question).tolist()
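The Python example stops at encoding the question; to complete the round trip, the embedding is dropped into a similarity query. A sketch (the query text follows the same pattern as the Java example in the next section; the helper function is my own, and the commented `client.query` call assumes the clickhouse-connect driver):

```python
def build_search_sql(emb_query, top_k=2):
    # Rank recipes by cosine distance between the stored method
    # embeddings and the question embedding; smaller dist = closer match.
    return f"""
        SELECT Recipe, Method, distance(method_feature, {emb_query}) AS dist
        FROM default.myscale_cookgpt
        ORDER BY dist
        LIMIT {top_k}
    """

# results = client.query(build_search_sql(emb_query))
# for recipe, method, dist in results.result_rows:
#     print(recipe, dist)
```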
Java Example
Now let’s see the same example in Java and Spring. Connecting to MyScale from Java and Spring is straightforward, leveraging the HikariCP DataSource for simplicity and efficiency.
HikariConfig config = new HikariConfig();
config.setJdbcUrl(jdbcUrl);
config.setUsername(username);
config.setPassword(password);
// Set the maximum lifetime of a connection in the pool in milliseconds (e.g., 30 minutes)
config.setMaxLifetime(1800000);
// Set the maximum amount of time a connection is allowed to sit idle in the pool (e.g., 10 minutes)
config.setIdleTimeout(600000);
// Set the minimum number of idle connections that HikariCP tries to maintain in the pool
config.setMinimumIdle(2);
// Set the maximum size that the pool is allowed to reach, including both idle and in-use connections
config.setMaximumPoolSize(10);
dataSource = new HikariDataSource(config);
This code snippet configures a HikariCP data source with settings optimized for JDBC connection management: it sets a maximum connection lifetime of 30 minutes, an idle timeout of 10 minutes, maintains a minimum of 2 idle connections, and allows a maximum pool size of 10 connections. HikariConfig initializes these parameters, while HikariDataSource encapsulates the configuration for efficient database connection pooling, enhancing application performance by ensuring reliable and responsive database access.
public static float[] embed(String str) {
EmbeddingModel embeddingModel = new AllMiniLmL6V2EmbeddingModel();
TextSegment segment1 = TextSegment.from(str);
Embedding embedding1 = embeddingModel.embed(segment1).content();
return embedding1.vector();
}
public static Float[] embedAsObject(String str) {
float[] embQuery = embed(str);
Float[] embQueryObj = new Float[embQuery.length];
for (int i = 0; i < embQuery.length; i++) {
embQueryObj[i] = embQuery[i];
}
return embQueryObj;
}
This Java code defines two methods for generating embeddings using the AllMiniLmL6V2EmbeddingModel. The embed method takes a string input, creates a TextSegment from it, and generates an embedding vector as a float[]. The embedAsObject method converts this float[] into a Float[] to facilitate its use in contexts where an array of object type is required.
String insertSQL = """
    INSERT INTO default.myscale_cookgpt
    (id, Recipe, "Total Time", Method, Category, Ingredients, method_feature)
    VALUES (?, ?, ?, ?, ?, ?, ?)
    """;
pstmt.setArray(7, connection.createArrayOf("Float32", convertToFloatObjectArray(methodFeature)));
This snippet prepares an SQL INSERT statement to add records to the myscale_cookgpt table. The statement inserts values for the columns id, Recipe, Total Time, Method, Category, Ingredients, and method_feature. The method embedding is converted to a Float[] array using a helper method (convertToFloatObjectArray), ensuring compatibility with the JDBC createArrayOf call. The PreparedStatement then sets the array for the method_feature column so that each feature vector is inserted correctly.
Connection connection = myScaleConnection.getConnection();
String query = "SELECT Recipe, Method, distance(method_feature, ?) as dist " +
"FROM default.myscale_cookgpt " +
"ORDER BY dist LIMIT ?";
int topK = 2;
try (PreparedStatement pstmt = connection.prepareStatement(query)) {
Float[] embQuery = CreateEmbedding.embedAsObject(queryStr);
Array array = connection.createArrayOf("Float32", embQuery);
pstmt.setArray(1, array);
pstmt.setInt(2, topK);
try (ResultSet rs = pstmt.executeQuery()) {
List<String[]> summaries = new ArrayList<>();
while (rs.next()) {
String recipe = rs.getString("Recipe");
String method = rs.getString("Method");
summaries.add(new String[]{recipe, method});
}
// Print the summaries
for (String[] summary : summaries) {
log.info("Recipe: " + summary[0] + ", Method: " + summary[1]);
}
return summaries;
}
} catch (SQLException e) {
log.severe(e.getMessage());
throw new RuntimeException(e);
}
This code connects to the MyScale database and performs a semantic search using cosine distance. It retrieves the top-k results based on the distance between the query’s embedding vector and the stored method_feature vectors in the myscale_cookgpt table. The PreparedStatement sets the query embedding as a Float32 array and specifies the number of results to return. The SQL query computes the cosine distance (distance(method_feature, ?) as dist) and orders the results by it, smallest first. The results are then retrieved and logged, showing the most semantically similar recipes and their methods.
Finally, query the data and see the result.
Please visit here for a live demo.
Conclusion
Advantages of Using SQL + Vector for Semantic Search
Efficient Query Execution: Columnar storage is optimized for reading large datasets, making it ideal for analytics and search applications.
Scalability: SQL databases are designed to handle petabytes of data and high query loads, ensuring your search capabilities can scale with your needs.
Rich SQL Functionality: By combining SQL with vector search, you can perform complex queries that involve filtering, joining, and aggregating data based on semantic context.
Versatility: This setup can be enhanced for various use cases, such as recommendation systems, NLP applications, and image retrieval, where understanding the meaning of the data is crucial.
Enhancing the Solution for Other Use Cases
Recommendation Systems: By storing user preferences and item features as vectors, you can recommend items that are semantically similar to what the user likes.
Natural Language Processing: Store and search text embeddings to find documents or responses that are contextually relevant to a given query.
Image Retrieval: Use image embeddings to find similar images based on visual features rather than metadata alone.
By leveraging the combined power of SQL and vector search, this solution provides a robust and efficient platform for building advanced search and recommendation systems. This approach not only enhances search accuracy but also opens up new possibilities for data-driven applications.