Researchers at MIT have developed a novel programming system called GenSQL that extends SQL with probabilistic AI modeling on tabular data, giving users a new way to bring predictive analytics and other AI capabilities to their complex tabular data.
SQL is widely used and loved due to its algebraic completeness and its ability to deliver correct answers from queries running against structured data. However, SQL’s deterministic approach doesn’t mesh with the world of AI, where algorithms generate probabilistic answers based on trained models. This impedance mismatch forces data scientists who work with Bayesian methods and predictive models to switch between SQL and probabilistic tools and techniques.
Researchers with the Probabilistic Computing Project in the MIT Department of Brain and Cognitive Sciences created GenSQL in part to bridge this impedance mismatch and tooling gap and to bring SQL-like capabilities to the world of generative AI, thereby expanding SQL’s usage and effectiveness. In addition to enabling users to ask probabilistic questions about their tabular data sets in a SQL-like dialect, GenSQL lets users perform other probabilistic tasks on their tabular data, such as generating synthetic data, imputing missing values, finding anomalies, and fixing errors.
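To give a rough feel for what imputing missing values and finding anomalies from a model of a table can look like, here is a minimal Python sketch. It is not GenSQL and does not use GenSQL syntax: the helper names (`fit_conditional`, `impute`, `is_anomaly`), the threshold, and the tiny table are all hypothetical, and a simple frequency table stands in for a real probabilistic model.

```python
from collections import Counter, defaultdict

def fit_conditional(rows, target, given):
    """Count how often each target value occurs for each value of `given`."""
    table = defaultdict(Counter)
    for row in rows:
        if row[target] is not None:  # skip rows where the target is missing
            table[row[given]][row[target]] += 1
    return table

def impute(table, given_value):
    """Fill a missing target with its most frequent value for this group."""
    return table[given_value].most_common(1)[0][0]

def is_anomaly(table, given_value, target_value, threshold=0.2):
    """Flag a value whose estimated conditional probability is below threshold."""
    counts = table[given_value]
    total = sum(counts.values())
    return total > 0 and counts[target_value] / total < threshold

# Hypothetical toy table: role and pay band, with one missing pay value.
rows = (
    [{"role": "intern", "pay": "low"}] * 5
    + [{"role": "intern", "pay": "high"}]
    + [{"role": "manager", "pay": "high"}] * 3
    + [{"role": "intern", "pay": None}]
)

table = fit_conditional(rows, target="pay", given="role")
guess = impute(table, "intern")                  # most likely pay for an intern
suspicious = is_anomaly(table, "intern", "high") # rare combination in this table
```

A system like GenSQL aims to let users express this kind of question declaratively against a richer model, rather than hand-coding the statistics as done here.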
“GenSQL introduces a novel interface and soundness guarantees that decouple user-level specification of high-level queries against probabilistic models from low-level details of probabilistic programming, such as probabilistic modelling, inference algorithm design, and high-performance machine implementations,” write the MIT researchers in a paper introducing GenSQL, titled “GenSQL: A Probabilistic Programming System for Querying Generative Models of Database Tables.”
According to the paper, the core of GenSQL includes a series of typed extensions to SQL, including SQL scalar expressions and tables, as well as rowModels (probabilistic models of tables) and events (a set of constructs that allow users to issue probabilistic queries that leverage Bayesian conditioning). These elements make probabilistic models first-class constructs within SQL, thereby allowing users to mix and match queries of models and queries of data.
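The idea of querying a model with an event under Bayesian conditioning can be illustrated with a small Python sketch. This is an analogy only, not GenSQL's language or implementation: the `ToyRowModel` class, its `prob` method, and the example rows are hypothetical, and the "model" is simply the empirical distribution of the rows.

```python
from collections import Counter
from fractions import Fraction

class ToyRowModel:
    """Empirical joint distribution over table rows (illustrative only)."""

    def __init__(self, rows):
        # Represent each row as a hashable, order-independent tuple of items.
        self.counts = Counter(tuple(sorted(r.items())) for r in rows)

    def prob(self, event, given=None):
        """Return P(event | given); both arguments are predicates on a row dict."""
        if given is None:
            given = lambda row: True
        denom = 0
        numer = 0
        for key, count in self.counts.items():
            row = dict(key)
            if given(row):
                denom += count
                if event(row):
                    numer += count
        if denom == 0:
            raise ValueError("conditioning event has probability zero")
        return Fraction(numer, denom)

# Hypothetical data: four patients.
rows = [
    {"smoker": True, "condition": "asthma"},
    {"smoker": True, "condition": "none"},
    {"smoker": False, "condition": "none"},
    {"smoker": False, "condition": "none"},
]

model = ToyRowModel(rows)
# Probability of asthma, conditioned on the event that the patient smokes.
p = model.prob(lambda r: r["condition"] == "asthma", given=lambda r: r["smoker"])
print(p)
```

Here the model object and the raw rows are separate things, which loosely mirrors the article's point that queries of models and queries of data can be mixed.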
The MIT implementation also includes a query planner that translates queries into plans that execute against a new model interface, dubbed the Abstract Model Interface (AMI), which serves as the integration layer ensuring that probabilistic models are compatible with GenSQL. The project also incorporates “exact” and “approximate” soundness theorems. The exact soundness theorem shows that all deterministic queries are exact, while the approximate theorem proves that all probabilistic queries return consistent results.
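One way to picture an integration layer of this kind is as a small interface that any model must implement so that query code never depends on model internals. The Python sketch below is an assumption-laden illustration, not the actual AMI: the class `AbstractModel`, the method names `simulate` and `logpdf`, the `IndependentCategorical` model, and the `generate` helper are all hypothetical.

```python
import math
import random
from abc import ABC, abstractmethod

class AbstractModel(ABC):
    """Hypothetical stand-in for a model interface (not the real AMI)."""

    @abstractmethod
    def simulate(self, rng):
        """Draw one synthetic row as a dict."""

    @abstractmethod
    def logpdf(self, row):
        """Return the log probability of the given (possibly partial) row."""

class IndependentCategorical(AbstractModel):
    """Toy model: every column is an independent categorical variable."""

    def __init__(self, columns):
        # columns maps column name -> {value: probability}
        self.columns = columns

    def simulate(self, rng):
        return {
            name: rng.choices(list(dist), weights=list(dist.values()))[0]
            for name, dist in self.columns.items()
        }

    def logpdf(self, row):
        # Independence means unmentioned columns simply marginalize out.
        return sum(math.log(self.columns[name][value]) for name, value in row.items())

def generate(model, n, seed=0):
    """Query-side code: it relies only on the interface, not the model class."""
    rng = random.Random(seed)
    return [model.simulate(rng) for _ in range(n)]

model = IndependentCategorical({
    "color": {"red": 0.25, "blue": 0.75},
    "size": {"S": 0.5, "L": 0.5},
})
synthetic = generate(model, 3)
p_red_small = math.exp(model.logpdf({"color": "red", "size": "S"}))
```

Because `generate` touches only the interface methods, a different model class could be swapped in without changing the query-side code, which is the kind of decoupling the article describes.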
The first step in using GenSQL is to create a probabilistic model of the user’s tabular data, using a “probabilistic program synthesis tool,” such as CrossCat. Once a user’s data has been turned into a model, the model is simply uploaded into GenSQL, which automatically integrates it, the authors of the paper write. “The user can then issue queries for a variety of tasks,” they wrote.
The MIT researchers benchmarked GenSQL using a set of standard queries, and the results show that all the queries return within milliseconds against tables with up to 10,000 rows. They also evaluated GenSQL’s usefulness in two real-world tests, one generating synthetic data for a virtual wet lab and another detecting anomalies in clinical trials. The tests show that GenSQL was not only faster than AI-based approaches for data analysis but also produced more explainable results.
Minimizing the complexity that comes from trying to use SQL for predictive analysis is a big reason why the researchers embarked on the GenSQL project, according to MIT research scientist Mathieu Huot, who was the lead author on the paper.
“Looking at the data and trying to find some meaningful patterns by just using some simple statistical rules might miss important interactions,” Huot told MIT News. “You really want to capture the correlations and the dependencies of the variables, which can be quite complicated, in a model. With GenSQL, we want to enable a large set of users to query their data and their model without having to know all the details.”
The researchers see two potential ways that GenSQL could impact database applications and design. First, it could be integrated as a query language within a database management system, thereby enabling users to query generative models of tabular data directly from the database.
Second, GenSQL could be used for modularized development of queries and models. By taking advantage of the abstractions that GenSQL creates for isolating query developers and query users from model developers, it could broaden the development of generative models, which could be beneficial for society, the researchers note.
The paper was published in the Proceedings of the ACM on Programming Languages.