How we store and serve data are critical factors in what we can do with data, and today we want to do oh-so much. That big data necessity is the mother of all invention, and over the past 20 years, it has spurred an immense amount of database creativity, from MapReduce and array databases to NoSQL and vector DBs. It all seems so promising…and then Mike Stonebraker enters the room.
For half a century, Stonebraker has been churning out database designs at a furious pace. The Turing Award winner made his early mark with Ingres and Postgres. However, apparently not content with having created what would become the world’s most popular database (PostgreSQL), he also created Vertica, Tamr, and VoltDB, among others. His latest endeavor: inverting the entire computing paradigm with the Database-Oriented Operating System (DBOS).
Stonebraker also is famous for his frank assessments of databases and the data processing industry. He’s been known to pop some bubbles and slay a sacred cow or two. When Hadoop was at the peak of its popularity in 2014, Stonebraker took clear joy in pointing out that Google (the source of the tech) had already moved away from MapReduce to something else: BigTable.
That’s not to say Stonebraker is a big supporter of NoSQL tech. In fact, he’s been a relentless champion for the power of the relational data model and SQL, the two core tenets of relational database management systems, for many years.
Back in 2005, Stonebraker and two of his students, Peter Bailis and Joe Hellerstein (members of the 2021 Datanami People to Watch class), analyzed the previous 40 years of database design and shared their findings in a paper called “Readings in Database Systems.” In it, they concluded that the relational model and SQL emerged as the best choice for a database management system, having out-battled other ideas, including hierarchical file systems, object-oriented databases, and XML databases, among others.
In his new paper, “What Goes Around Comes Around…And Around…,” which was published in the June 2024 edition of SIGMOD Record, the legendary MIT computer scientist and his writing partner, Carnegie Mellon University’s Andrew Pavlo, analyze the past 20 years of database design. As they note, “A lot has happened in the world of databases since our 2005 survey.”
While some of the database tech that has been invented since 2005 is good and helpful and will last for some time, according to Stonebraker and Pavlo, much of the new stuff is not helpful, is not good, and will only exist in niche markets.
20 Years of Database Dev
Here’s what the duo wrote about new database inventions of the past 20 years:
MapReduce: MapReduce systems, of which Hadoop was the most visible and (for a time) most successful implementation, are dead. “They died years ago and are, at best, a legacy technology at present.”
Key-value stores: These systems (Redis, RocksDB) have either “matured into RM [relational model] systems or are only used for specific problems.”
Document stores: NoSQL databases that store data as JSON documents, such as MongoDB and Couchbase, benefited from developer excitement over denormalized data structures, a lower-level API, and horizontal scalability at the cost of ACID transactions. However, document stores “are on a collision course with RDBMSs,” the authors write, as they have adopted SQL and relational databases have added horizontal scalability and JSON support.
Columnar database: This family of NoSQL databases (BigTable, Cassandra, HBase) is similar to document stores but with just one level of nesting, instead of an arbitrary number. However, the column store family already is obsolete, according to the authors. “Without Google, this paper would not be talking about this category,” they write.
Text search engines: Search engines have been around for 70 years, and today’s search engines (such as Elasticsearch and Solr) continue to be popular. They will likely remain separate from relational databases because conducting search operations in SQL “is often clunky and differs between DBMSs,” the authors write.
Array databases: Databases such as Rasdaman, kdb+, and SciDB (a Stonebraker creation) that store data as two-dimensional matrices or as tensors (three or more dimensions) are popular in the scientific community, and likely will remain that way “because RDBMSs cannot efficiently store and analyze arrays despite new SQL/MDA enhancements,” the authors write.
Vector databases: Dedicated vector databases such as Pinecone, Milvus, and Weaviate (among others) are “essentially document-oriented DBMSs with specialized ANN [approximate nearest neighbor] indexes,” the authors write. One advantage is they integrate with AI tools, such as LangChain, better than relational databases. However, the long-term viability for vector DBs isn’t good, as RDBMSs will likely adopt all of their features, “render[ing] such specialized databases unnecessary.”
Graph database: Property graph databases (Neo4j, TigerGraph) have carved themselves a comfortable niche thanks to their efficiency with certain types of OLTP and OLAP workloads on connected data, where executing joins in a relational database would lead to an inefficient use of compute resources. “But their potential market success comes down to whether there are enough ‘long chain’ scenarios that merit forgoing a RDBMS,” the authors write.
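To make the “long chain” point in the graph database entry concrete, here is a minimal sketch of a multi-hop traversal expressed in plain SQL with a recursive common table expression, run against Python’s built-in sqlite3 module. The table name follows, the function name reachable, and the sample edges are all hypothetical, chosen for illustration; nothing here comes from the paper. Each additional hop is another self-join on the edge table, which is the kind of work a graph engine aims to avoid.

```python
import sqlite3


def reachable(edges, start, max_hops):
    """Return {node: fewest hops} for nodes reachable from start within max_hops.

    Hypothetical helper: edges is a list of (src, dst) pairs loaded into an
    in-memory table named follows (an illustrative name, not from the paper).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE follows (src TEXT, dst TEXT)")
    conn.executemany("INSERT INTO follows VALUES (?, ?)", edges)
    rows = conn.execute(
        """
        WITH RECURSIVE chain(node, hops) AS (
            SELECT ?, 0                      -- seed: the start node at 0 hops
            UNION                            -- UNION drops duplicate (node, hops) rows
            SELECT f.dst, c.hops + 1         -- each step is one more self-join
            FROM chain AS c
            JOIN follows AS f ON f.src = c.node
            WHERE c.hops < ?                 -- the hop limit bounds cycles
        )
        SELECT node, MIN(hops) FROM chain GROUP BY node
        """,
        (start, max_hops),
    ).fetchall()
    conn.close()
    return dict(rows)


if __name__ == "__main__":
    sample_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
    print(reachable(sample_edges, "a", 2))
```

The query is correct, but the join work grows with the hop limit, so whether a dedicated graph engine is worth it depends on how often such deep traversals actually occur, which is the question the authors raise.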
Trends in Database Architecture
Beyond the “relational or non-relational” argument, Stonebraker and Pavlo offered their thoughts on the latest trends in database architecture.
Column stores: Relational databases that store data in columns (as opposed to rows), such as Google Cloud BigQuery, AWS’ Redshift, and Snowflake, have grown to dominate the data warehouse/OLAP market, “because of their superior performance.”
Cloud databases: The biggest revolution in database design over the past 20 years has occurred in the cloud, the authors write. Thanks to the big jump in networking bandwidth relative to disk bandwidth, storing data in object stores via network attached storage (NAS) has grown very attractive. That in turn pushed the separation of compute and storage, and the rise of serverless computing. The push to the cloud created a “once-in-a-lifetime opportunity for enterprises to refactor codebases and remove bad historical technology decisions,” they write. “Except for embedded DBMSs, any product not starting with a cloud offering will likely fail.”
Data Lakes / Lakehouses: Building on the rise of cloud object stores (see above), these systems “are the successor to the ‘Big Data’ movement from the early 2010s,” the authors write. Table formats like Apache Iceberg, Apache Hudi, and Databricks Delta Lake have smoothed over what “seems like a terrible idea,” that is, letting any application write any arbitrary data into a centralized store, the authors write. The capability to support non-SQL workloads, such as data scientists crunching data in a notebook via a Pandas DataFrame API, is another advantage of the lakehouse architecture. This will “be the OLAP DBMS archetype for the next ten years,” they write.
NewSQL systems: The rise of new relational (or SQL) databases that scaled horizontally like NoSQL databases without giving up ACID guarantees may have seemed like a good idea. But this class of databases, such as SingleStore, NuoDB (now owned by Dassault Systèmes), and VoltDB (a Stonebraker creation), never caught on, largely because existing databases were “good enough” and didn’t warrant taking the risk of migrating to a new database.
Hardware accelerators: The last 20 years have seen a smattering of hardware accelerators for OLAP workloads, using both FPGAs (Netezza, Swarm64) and GPUs (Kinetica, Sqream, Brytlyt, and HeavyDB [formerly OmniSci]). Few companies outside the cloud giants can justify the expense of building custom hardware for databases these days, the authors write. But hope springs eternal in data. “In spite of the long odds, we predict that there will be many attempts in this space over the next two decades,” they write.
Blockchain Databases: Once promoted as the future data store for a trustless society, blockchain databases are now “a waning database technology fad,” the authors write. It’s not that the technology doesn’t work, but there just aren’t any applications outside of the Dark Web. “Legitimate businesses are unwilling to pay the performance price (about five orders of magnitude) to use a blockchain DBMS,” they write. “An inefficient technology looking for an application. History has shown this is the wrong way to approach systems development.”
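As a rough illustration of the column-store entry earlier in this list, the sketch below contrasts a row layout with a column layout for the same small table. It is a toy model in plain Python, not how BigQuery, Redshift, or Snowflake are implemented; the function names and the sample orders are hypothetical. The idea it shows is that an aggregate over one field only needs to touch that field’s values when the data is stored by column, whereas a row layout walks every full record.

```python
def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists.

    Assumes every row has the same keys (an assumption of this sketch).
    """
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_from_rows(rows, field):
    # Row layout: every record is visited in full to read a single field.
    return sum(row[field] for row in rows)


def total_from_columns(columns, field):
    # Column layout: only the one list holding this field is scanned.
    return sum(columns[field])


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    orders = [
        {"order_id": 1, "region": "east", "amount": 10.0},
        {"order_id": 2, "region": "west", "amount": 20.5},
        {"order_id": 3, "region": "east", "amount": 4.5},
    ]
    by_column = to_columns(orders)
    print(total_from_rows(orders, "amount"))
    print(total_from_columns(by_column, "amount"))
```

Both functions return the same total; the difference is how much data has to be read to get there, which matters most for analytic queries that aggregate a few fields across many records.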
Looking Forward: It’s All Relative
At the end of the paper, the reader is left with the indelible impression that “what goes around” is the relational model and SQL. The combination of these two will be tough to beat, but developers will try anyway, Stonebraker and Pavlo write.
“Another wave of developers will claim that SQL and the RM are insufficient for emerging application domains,” they write. “People will then propose new query languages and data models to overcome these problems. There is tremendous value in exploring new ideas and concepts for DBMSs (it is where we get new features for SQL). The database research community and marketplace are more robust because of it. However, we do not expect these new data models to supplant the RM.”
So, what will the future of database development hold? The pair encourage the database community to “foster the development of open-source reusable components and services. There are some efforts towards this goal, including for file formats [Iceberg, Hudi, Delta], query optimization (e.g., Calcite, Orca), and execution engines (e.g., DataFusion, Velox). We contend that the database community should strive for a POSIX-like standard of DBMS internals to accelerate interoperability.”
“We caution developers to learn from history,” they conclude. “In other words, stand on the shoulders of those who came before and not on their toes. One of us will likely still be alive and out on bail in two decades, and thus fully expects to write a follow-up to this paper in 2044.”
You can access the Stonebraker/Pavlo paper here.