Why Data Lakehouses Are Poised for Major Growth in 2025


The humble data lakehouse emerged about eight years ago as organizations sought a middle ground between the anything-goes messiness of data lakes and the locked-down fussiness of data warehouses. The architectural pattern attracted some followers, but the growth wasn’t spectacular. However, as we kick off 2025, the data lakehouse is poised to grow quite robustly, thanks to a confluence of factors.

As the big data era dawned back in 2010, Hadoop was the hottest technology around: it provided a way to build large clusters of inexpensive, industry-standard x86 servers that could store and process petabytes of data far more cheaply than the pricey data warehouses and appliances, built on specialized hardware, that came before.

By allowing customers to dump large amounts of semi-structured and unstructured data into a distributed file system, Hadoop clusters garnered the nickname “data lakes.” Customers could process and transform the data for their particular analytical needs on demand, in what’s commonly called a “schema on read” approach.

This was quite different from the “schema on write” approach used by the typical data warehouse of the day. Before Hadoop, customers would take the time to transform and clean their transactional data before loading it into the warehouse. That was more time-consuming and more expensive, but it was necessary to make the most of pricey storage and compute resources.
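To make the distinction concrete, here is a minimal PySpark sketch of the two approaches. The bucket paths and column names are invented for illustration and are not taken from any real deployment.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-approaches").getOrCreate()

# Schema on read (data-lake style): land the raw JSON as-is and impose
# structure only at query time, when an analyst actually needs it.
raw = spark.read.json("s3a://example-lake/raw/events/")  # schema inferred on read
raw.selectExpr("user_id", "cast(amount as double) AS amount").show()

# Schema on write (warehouse style): enforce a schema and clean the data
# *before* it is persisted for analytics.
schema = StructType([
    StructField("user_id", StringType(), nullable=False),
    StructField("amount", DoubleType(), nullable=False),
])
clean = spark.read.schema(schema).json("s3a://example-lake/raw/events/").dropna()
clean.write.mode("overwrite").parquet("s3a://example-warehouse/curated/events/")
```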

As the Hadoop experiment progressed, many customers discovered that their data lakes had turned into data swamps. Dumping raw data into HDFS or S3 radically increased the amount of data they could retain, but it came at the cost of lower-quality data. Specifically, Hadoop lacked the controls that would have allowed customers to effectively manage their data, which eroded trust in the analytics built on top of it.

By the mid-2010s, several independent teams were working on a solution. The first was led by Vinoth Chandar, an engineer at Uber, who needed a way to efficiently apply updates and deletes to fast-changing data in the ride-sharing company’s data lake. Chandar led the development of a table format that would allow Hadoop to process data more like a traditional database. He called it Hudi, which stood for Hadoop upserts, deletes, and incrementals. Uber deployed Hudi in 2016.

A year later, two other teams launched similar solutions for HDFS and S3 data lakes. Netflix engineers Ryan Blue and Daniel Weeks created a table format called Iceberg that sought to bring ACID-like transactions and safe rollbacks to the Apache Hive tables sitting in data lakes. The same year, Databricks launched Delta Lake, which melded the data structure capabilities of data warehouses with its cloud data lake to bring a layered, “good, better, best” approach to data management and data quality.

These three table formats largely drove the growth of data lakehouses, as they allowed traditional database data management techniques to be applied as a layer on top of Hadoop and S3-style data lakes. This gave customers the best of both worlds: The scalability and affordability of data lakes and the data quality and reliability of data warehouses.
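As a rough illustration of what a table format adds, the PySpark sketch below assumes a Spark session already configured with an Iceberg catalog named lake; the table, columns, and values are made up, and Hudi and Delta Lake offer the same kinds of operations with slightly different syntax.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("table-format-demo").getOrCreate()

# A database-style table whose data files live in cheap object storage.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.db.trips (
        trip_id BIGINT, rider_id BIGINT, fare DOUBLE
    ) USING iceberg
""")

# Stage some corrected records as a temporary view.
updates = spark.createDataFrame([(1, 42, 18.5)], ["trip_id", "rider_id", "fare"])
updates.createOrReplaceTempView("updates")

# A transactional upsert -- the kind of operation Hudi, Iceberg, and Delta Lake
# all brought to raw data lakes.
spark.sql("""
    MERGE INTO lake.db.trips t
    USING updates u
    ON t.trip_id = u.trip_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Time travel: query the table as of an earlier point for audits or rollbacks
# (requires a recent Spark and Iceberg version).
spark.sql(
    "SELECT * FROM lake.db.trips FOR TIMESTAMP AS OF '2024-06-01 00:00:00'"
).show()
```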

Other data platforms began adopting one of the table formats, including AWS, Google Cloud, and Snowflake. Iceberg, which became a top-level Apache project in 2020, garnered much of its traction from the open source Hadoop ecosystem. Delta Lake, which Databricks initially kept under tight control before gradually opening it up, also became popular as the San Francisco-based company rapidly added customers. Hudi, which became a top-level Apache project in 2019, was the third most popular format.

The battle between Apache Iceberg and Delta Lake for table format dominance was at a stalemate. Then in June of 2024, Snowflake bolstered its support for Iceberg by launching a metadata catalog for Iceberg called Polaris (now Apache Polaris). A day later, Databricks responded by announcing the acquisition of Tabular, the Iceberg company founded by Blue, Weeks, and former Netflix engineer Jason Reid, for between $1 billion and $2 billion.

Databricks executives announced that Iceberg and Delta Lake formats would be brought together over time. “We are going to lead the way with data compatibility so that you are no longer limited by which lakehouse format your data is in,” the executives, led by CEO Ali Ghodsi, said.

Tabular CEO Ryan Blue (right) and Databricks CEO Ali Ghodsi on stage at Data + AI Summit in June 2024

The impact of the Polaris launch and the Tabular acquisition was huge, particularly for the community of vendors developing independent query engines, and it immediately drove an uptick in momentum behind Apache Iceberg. “If you’re in the Iceberg community, this is go time in terms of entering the next era,” Read Maloney, Dremio’s chief marketing officer, told this publication last June.
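Part of the reason a catalog such as Polaris matters to independent query engines is that any engine speaking the Iceberg REST catalog protocol can attach to the same tables. The sketch below shows roughly what that looks like from Spark; the endpoint, warehouse name, and credential are placeholders, and the exact authentication settings depend on the deployment.

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg Spark runtime package is available on the classpath.
spark = (
    SparkSession.builder.appName("rest-catalog-demo")
    .config("spark.sql.catalog.polaris", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.polaris.type", "rest")
    .config("spark.sql.catalog.polaris.uri", "https://polaris.example.com/api/catalog")
    .config("spark.sql.catalog.polaris.warehouse", "analytics")
    .config("spark.sql.catalog.polaris.credential", "client-id:client-secret")
    .getOrCreate()
)

# Any engine pointed at the same catalog sees the same Iceberg tables.
spark.sql("SHOW TABLES IN polaris.db").show()
```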

Seven months later, that momentum is still going strong. Last week, Dremio published a new report, titled “State of the Data Lakehouse in the AI Era,” which found growing support for data lakehouses (which are now assumed to be Iceberg-based by default).

“Our analysis reveals that data lakehouses have reached a critical adoption threshold, with 55% of organizations running the majority of their analytics on these platforms,” Dremio said in its report, which is based on a fourth-quarter survey of 563 data decision-makers by McKnight Consulting Group. “This figure is projected to reach 67% within the next three years according to respondents, indicating a clear shift in enterprise data strategy.”

Dremio says that cost efficiency remains the primary driver behind data lakehouse growth, cited by 19% of respondents, followed by unified data access and enhanced ease of use (17% each) and self-service analytics (13%). Dremio found that 41% of lakehouse users have migrated from cloud data warehouses and 23% have transitioned from standard data lakes.

Better, more open data analytics is high on the list of reasons to move to a data lakehouse, but Dremio found a surprising number of customers using their data lakehouse to back another use case: AI development.

The company found that an astounding 85% of lakehouse users are already using their lakehouses to develop AI models, with another 11% saying in the survey that they plan to. That leaves just 4% of lakehouse customers with no plans to support AI development; in other words, nearly everyone is on board.

While AI aspirations are nearly universal at this point, there are still big hurdles to overcome before organizations can actually achieve the AI dream. In its survey, Dremio found organizations reporting serious challenges in preparing data for AI. Specifically, 36% of respondents say governance and security for AI use cases is their top challenge, followed by high cost and complexity (33%) and the lack of a unified, AI-ready infrastructure (20%).

The lakehouse architecture is a key ingredient for creating data products that are well-governed and widely accessible, which are critical for enabling organizations to more easily develop AI apps, said James Rowland-Jones (JRJ), Dremio’s vice president of product management.

“It’s how they share [the data] and what comes with it,” JRJ told BigDATAwire at the re:Invent conference last month. “How is that enriched? How do you understand it and reason over it as an end user? Do you get a statistical sample of the data? Can you get a feel for what that data is? Has it been documented? Is it governed? Is there a glossary? Is the glossary reusable across views so people aren’t duplicating all of that effort?”

Dremio is perhaps best known for developing an open query engine, available under an Apache 2 license, that can run against a variety of backends, including databases, HDFS, S3, and other file systems and object stores. But the company has lately been putting more effort into building a full lakehouse platform that can run anywhere, including on the major clouds, on-prem, and in hybrid deployments. The company was an early backer of Iceberg with Project Nessie, its metadata catalog. In 2025, the company plans to focus on bolstering data governance and security and on building data products, company executives said at re:Invent.

The biggest beneficiaries of the rise of open, Iceberg-based lakehouse platforms are enterprises, which are no longer beholden to monolithic cloud platform vendors that want to lock customers’ data in so they can extract more money from them. A side effect of the rise of lakehouses is that vendors like Dremio can now sell their wares to customers, who are free to pick and choose a query engine that meets their specific needs.

“The data architecture landscape is at a pivotal point where the demands of AI and advanced analytics are transforming traditional approaches to data management,” Maloney said in a press release. “This report underscores how and why businesses are leveraging data lakehouses to drive innovation while addressing critical challenges like cost efficiency, governance, and AI readiness.”

Related Items:

How Apache Iceberg Won the Open Table Wars

It’s Go Time for Open Data Lakehouses

Databricks Nabs Iceberg-Maker Tabular to Spawn Table Uniformity


