While GenAI is the focus today, most enterprises have been working for a decade or longer to make data intelligence a reality within their operations.
Unified data environments, faster processing speeds and more robust governance: every improvement was a step forward in helping companies do more with their own information. Now, users of all technical backgrounds can interact with their private data – whether that’s a business team querying data in natural language or a data scientist quickly and efficiently customizing an open source LLM.
But the capabilities of data intelligence continue to evolve, and the foundation that businesses establish today will be pivotal to success over the next 10 years. Let’s take a look at how data warehousing transformed into data intelligence – and what the next step forward is.
The early days of data
Before the digital revolution, companies gathered information at a slower, more consistent pace. Most of it was ingested as curated tables in Oracle, Teradata or Netezza warehouses. And compute was coupled with storage, limiting organizations’ ability to do anything more than routine analytics.
Then, the Internet arrived. Suddenly, data was coming in faster, at significantly larger volumes. And a new era, one where data is considered the “new oil,” would soon begin.
The onset of big data
It all started in Silicon Valley. In the early 2010s, companies like Uber, Airbnb, Facebook and Twitter (now X) were doing very innovative work with data. Databricks was also built during this golden age – out of the desire to make it possible for every company to do the same with their private information.
It was perfect timing. The next several years were defined by two words: big data. There was an explosion in digital applications. Companies were gathering more data than ever before, and increasingly trying to translate those raw assets into information that could help with decision-making and other operations.
But companies faced many challenges in this transformation to a data-driven operating model, including eliminating data silos, keeping sensitive assets secure, and enabling more users to build on the information. And ultimately, they lacked the ability to process the data efficiently.
This led to the creation of the Lakehouse, a way for companies to unify their data warehouses and data lakes into one open foundation. The architecture enabled organizations to govern their entire data estate from one location, and to query all of the organization’s data sources for any workload – whether that’s business intelligence, ML or AI.
Along with the Lakehouse, pioneering technology like Apache Spark™ and Delta Lake helped businesses turn raw assets into actionable insights that enhanced productivity, drove efficiency, or helped grow revenue. And they did so without locking companies into another proprietary tool. We are immensely proud to continue building on this open source legacy today.
Related: Apache Spark and Delta Lake Under the Hood
The age of data intelligence is here
The world is on the cusp of the next technology revolution. GenAI is upending how companies interact with data. But the game-changing capabilities of LLMs weren’t created overnight. Instead, continual innovations in data analytics and management helped lead to this point.
In many ways, the journey from data warehousing to data intelligence mirrors Databricks’ own evolution. Understanding that journey is critical to avoiding the mistakes of the past.
Big data: Laying the groundwork for innovation
For many of us in the field of data and AI, Hadoop was a milestone and helped to ignite much of the progress that led to the innovations of today.
When the world went digital, the amount of information companies were collecting grew exponentially. Quickly, the scale overwhelmed traditional analytic processing and increasingly, the information wasn’t stored in organized tables. There was a lot more unstructured and semi-structured data, including audio and video files, social posts and emails.
Companies needed a different, more efficient way to store, manage and use this huge influx of information. Hadoop was the answer. It essentially took a “divide and conquer” approach to analytics: files were segmented, analyzed in parallel across many different compute instances, and then grouped back with the rest of the information. That significantly sped up how quickly enterprises processed large amounts of information. Data was also replicated, improving access and protecting against failures, in what was essentially a complex distributed processing system.
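The pattern is easiest to see in miniature. Here’s a plain-Python word count – not actual Hadoop code; the data and function names are purely illustrative – that mirrors the same divide-and-conquer idea: split the work, process the pieces in parallel, then merge the partial results.

```python
# Conceptual sketch of Hadoop's "divide and conquer" model in plain Python.
# Real Hadoop runs this across a cluster with HDFS replication; this only
# illustrates the map-in-parallel, then-reduce pattern.
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def count_words(chunk_of_lines):
    """The 'map' step: each worker counts words in its own slice of the data."""
    counts = Counter()
    for line in chunk_of_lines:
        counts.update(line.split())
    return counts

def word_count(lines, workers=4):
    # Divide: split the input into roughly equal chunks, one per worker.
    chunks = [lines[i::workers] for i in range(workers)]
    # Conquer: process every chunk in parallel.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(count_words, chunks)
    # Reduce: merge the partial counts back into a single result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    sample = ["big data big compute", "data lakes and data warehouses"]
    print(word_count(sample, workers=2))
```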
The huge data sets that businesses began to build up during this era are now critical in the move to data intelligence and AI. But the IT world was poised for a major transformation, one that would render Hadoop much less useful. Instead, fresh challenges in data management and analytics arose that required innovative new ways of storing and processing information.
Apache Spark: Igniting a new generation of analytics
Despite its prominence, Hadoop had some big drawbacks. It was only accessible to technical users, couldn’t handle real-time data streams, was still too slow for many organizations, and didn’t support machine learning applications. In other words, it wasn’t “enterprise ready.”
That led to the birth of Apache Spark™, which was much faster and could handle the vast amount of data being collected. As more workloads moved to the cloud, Spark quickly overtook Hadoop, which was designed to work best on a company’s own hardware.
This desire to use Spark in the cloud is actually what led to the creation of Databricks. Spark was open sourced in 2010, Spark 1.0 was released in 2014, and the rest is history. It continues to play an important role in our Data Intelligence Platform.
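To give a sense of how concise Spark made distributed analytics, here’s a minimal PySpark sketch. The file path and column name are hypothetical, and it assumes a working Spark environment.

```python
# A minimal PySpark sketch: read raw event data and compute a simple
# aggregate in parallel across the cluster.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-demo").getOrCreate()

events = spark.read.json("/data/raw/events")  # hypothetical path
daily_counts = (
    events
    .groupBy(F.to_date("timestamp").alias("day"))  # hypothetical column
    .count()
    .orderBy("day")
)
daily_counts.show()
```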
Delta Lake: The power of the open file format
During this “big data” era, one of the early challenges that companies faced was how to structure and organize their assets so they could be processed efficiently. Hadoop and early Spark relied on write-once file formats that did not support editing and had only rudimentary catalog capabilities. Increasingly, enterprises built huge data lakes, with new information constantly being poured in. The inability to update data, combined with the limited capabilities of the Hive Metastore, turned many data lakes into data swamps. Companies needed an easier and quicker way to find, label and process data.
This need to reliably update and maintain data led to the creation of Delta Lake. The open format provided a much-needed leap forward in capability, performance and reliability. Schemas were enforced but could also be quickly changed. Companies could now actually update their data. Delta Lake enabled ACID-compliant transactions on data lakes, unified batch and streaming, and helped companies optimize their analytics spending.
Delta Lake also includes a transaction log, the “DeltaLog,” that serves as the source of truth for every change made to the data. Queries reference it behind the scenes so users get a stable view of the data, even while changes are in progress.
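Here’s a rough sketch of what those capabilities look like in practice with the delta-spark Python API. The table location and columns are illustrative, and it assumes a Spark session already configured for Delta Lake.

```python
# A hedged sketch of the Delta Lake behaviors described above.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake is configured

path = "/data/delta/customers"  # hypothetical location

# Writes are ACID transactions recorded in the Delta log (_delta_log).
new_rows = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
new_rows.write.format("delta").mode("append").save(path)

# Unlike write-once data lake files, existing records can be updated in place.
customers = DeltaTable.forPath(spark, path)
customers.update(
    condition=F.col("id") == 2,
    set={"name": F.lit("Robert")},
)

# The transaction log also enables auditing and time travel.
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```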
Delta Lake injected consistency into enterprise data management. Companies could be sure they were using high-quality, auditable and reliable data sets. That ultimately empowered companies to undertake more advanced analytics and machine learning workloads – and scale those initiatives much faster.
In 2022, Databricks donated Delta Lake to the Linux Foundation, and it continues to be improved by Databricks along with significant contributions from the open source community. Delta also helped inspire other OSS formats, including Hudi and Iceberg. This year, Databricks acquired Tabular, a data management company founded by the creators of Iceberg.
MLflow: The rise of data science and machine learning
As the decade of big data progressed, companies naturally wanted to start doing more with all the data they had been diligently capturing. That led to a huge surge in analytic workloads within most businesses. But while enterprises had long been able to query the past, they now wanted to analyze data to draw new insights about the future.
At the time, however, predictive analytics techniques only worked well for small data sets, which limited the use cases. As companies moved systems to the cloud and distributed computing became more common, they needed a way to analyze much larger sets of assets. This led to the rise of data science and machine learning.
Spark became a natural home for ML workloads. But tracking all the work that went into building ML models became a problem. Data scientists largely kept manual records in Excel; there was no unified tracker. Meanwhile, governments around the world were growing increasingly concerned about the expanding use of these algorithms. Businesses needed a way to ensure the ML models in use were fair, unbiased, explainable and reproducible.
MLflow became that source of truth. Before, development was an ill-defined, unstructured and inconsistent process. MLflow provided the tools data scientists needed to do their jobs. It eliminated steps, like stitching together different tools or tracking progress in Excel, that kept innovation from reaching users quickly and made it harder for businesses to track value. And ultimately, MLflow put in place a sustainable and scalable process for building and maintaining ML models.
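As a hedged illustration, here’s roughly what that tracking looks like with MLflow’s Python API. The model, dataset and parameters are arbitrary stand-ins.

```python
# A minimal MLflow tracking sketch. Each run records the parameters, metrics
# and model artifacts that teams previously tracked by hand in spreadsheets.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="rf-baseline"):
    params = {"n_estimators": 100, "max_depth": 6}
    mlflow.log_params(params)

    model = RandomForestRegressor(**params, random_state=42).fit(X_train, y_train)
    mlflow.log_metric("r2", r2_score(y_test, model.predict(X_test)))

    # The serialized model is stored alongside the run for reproducibility.
    mlflow.sklearn.log_model(model, "model")
```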
In 2020, Databricks donated MLflow to the Linux Foundation. The tool continues to grow in popularity—both inside and outside of Databricks—and the pace of innovation has only been increasing with the rise of GenAI.
Data lakehouse: Breaking down the data barriers
By the mid-2010s, companies were gathering data at breakneck speeds. And increasingly, it was a wider array of data types, including video and audio files. Volumes of unstructured and semi-structured data skyrocketed. That quickly split enterprise data environments into two camps: data warehouses and data lakes. And there were major drawbacks with each option.
With data lakes, companies could store vast quantities of information in many different formats cheaply. But that flexibility quickly became a drawback. Data swamps grew more common. Duplicate data ended up everywhere. Information was inaccurate or incomplete. There was no governance. And most environments weren’t optimized to handle complex analytical queries.
Meanwhile, data warehouses provided great query performance and were optimized for quality and governance; that’s why SQL continues to be such a dominant language. But this came at a premium cost. There was no support for unstructured or semi-structured data. And because of the time it took to move, cleanse and organize the information, it was often outdated by the time it reached the end user. The process was far too slow to support applications that require instant access to fresh data, like AI and ML workloads.
At the time, it was very difficult for companies to traverse that boundary. Instead, most operated each ecosystem separately, with different governance, different specialists and different data tied to each architecture. The structure made it very challenging to scale data-related initiatives. It was wildly inefficient.
Operating multiple, often overlapping solutions at the same time drove up costs, data duplication, reconciliation work and data quality issues. Companies had to rely heavily on multiple overlapping teams of data engineers, scientists and analysts, and each of these audiences suffered from delays in data arrival and challenges handling streaming workloads.
The data lakehouse emerged as the best of both worlds: a place for both structured and unstructured data to be stored, managed and governed centrally. Companies got the performance and structure of a warehouse with the low cost and flexibility of a data lake. They had a home for the huge amounts of data coming in from cloud environments, operational applications, social media feeds and more.
Notably, there was a built-in management and governance layer – what we call Unity Catalog. This provided customers with a massive uplift in metadata management and data governance. (Databricks open sourced Unity Catalog in June 2024.) As a result, companies could greatly expand access to data. Now, business and technical users could run traditional analytic workloads and build ML models from one central repository. Meanwhile, when the Lakehouse launched, companies were just starting to use AI to augment human decision-making and produce new insights, among other early applications.
The data lakehouse quickly became critical to those efforts. Data could be consumed quickly, but still with the proper governance and compliance policies. And ultimately, the data lakehouse was the catalyst that enabled businesses to gather more data, give more users access to it, and power more use cases.
GenAI / MosaicAI
By the end of the last decade, businesses were taking on more advanced analytic workloads. They were starting to build more ML models. And they were beginning to explore early AI use cases.
Then GenAI arrived. The technology’s jaw-dropping pace of progress changed the IT landscape. Nearly overnight, every business started trying to figure out how to take advantage. However, over the past year, as pilot projects have started to scale, many companies have run into a similar set of issues.
Data estates are still fragmented, creating governance challenges that stifle innovation. Companies won’t deploy AI into the real world until they can ensure the supporting data is used properly and in accordance with local regulations. This is why Unity Catalog is so popular. Companies are able to set common access and usage policies across the workforce, as well as at the user level, to protect the whole data estate.
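As a rough sketch of what those policies can look like, here are a few Unity Catalog-style SQL grants issued from Python. The catalog, schema, table and group names are hypothetical.

```python
# A hedged sketch of access policies expressed as SQL grants in Unity Catalog.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Analysts may browse the catalog and query the curated sales table...
spark.sql("GRANT USE CATALOG ON CATALOG main TO `analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.sales TO `analysts`")
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `analysts`")

# ...while only the data engineering group can modify it.
spark.sql("GRANT MODIFY ON TABLE main.sales.orders TO `data-engineers`")
```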
Companies are also realizing the limitations of general-purpose GenAI models. There’s a growing appetite to take these foundation models and customize them to the organization’s unique needs. In June 2023, Databricks acquired MosaicML, which has helped us provide customers with the suite of tools they need to build or tailor GenAI systems.
From information to intelligence
GenAI has completely changed expectations of what’s possible with data. With just a natural language prompt, users want instant access to insights and predictive analytics that are hyper-relevant to the business.
But while large, general-purpose LLMs helped ignite the GenAI craze, companies increasingly care less about how many parameters a model has or how it scores on benchmarks. Instead, they want AI systems that truly understand their business and can turn their data assets into outputs that give them a competitive advantage.
That’s why we launched the Data Intelligence Platform. In many ways, it’s the pinnacle of everything Databricks has been working toward for the last decade. With GenAI capabilities at the core, users of all expertise can draw insights from a company’s private corpus of data – all with a privacy framework that aligns with the organization’s overall risk profile and compliance mandates.
And the capabilities are only growing. We released Databricks Assistant, a tool designed to help practitioners create, fix and optimize code using natural language. Our in-product search is also now powered by natural language, and we added AI-generated comments in Unity Catalog.
Meanwhile, Databricks AI/BI Genie and Dashboards, our new business intelligence tools, give users of both technical and non-technical backgrounds the ability to use natural language prompts to generate and visualize insights from private data sets. They democratize analytics across the organization, helping businesses integrate data deeper into operations.
And a new suite of MosaicAI tools is helping organizations build compound AI systems on their own private data, taking LLMs from general-purpose engines to specialized systems designed to provide tailored insights that reflect each enterprise’s unique culture and operations. We make it easy for businesses to use the plethora of LLMs available on the market today as the basis for these compound AI systems, including RAG models and AI agents. We also give them the tools to further fine-tune LLMs for even more dynamic results. And importantly, there are features to continually track and retrain models once in production to ensure sustained performance.
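To make the compound AI idea concrete, here’s a deliberately tiny, self-contained sketch of the retrieval-augmented generation (RAG) pattern. The documents, the bag-of-words “embedding” and the prompt-building step are toy stand-ins for the real embedding models, vector indexes and LLM endpoints an organization would plug in.

```python
# A toy, self-contained sketch of the RAG pattern behind many compound AI
# systems: retrieve the most relevant private data, then ground the model's
# answer in that retrieved context.
from collections import Counter
import math

PRIVATE_DOCS = [
    "Refund requests are approved automatically under 50 dollars.",
    "Enterprise contracts renew annually on the signature date.",
    "Support tickets are triaged within four business hours.",
]

def embed(text: str) -> Counter:
    # Stand-in embedding: a simple bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, k: int = 2) -> list:
    # Rank the private documents by similarity to the question.
    q = embed(question)
    ranked = sorted(PRIVATE_DOCS, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def answer(question: str) -> str:
    # Stand-in for an LLM call: build the grounded prompt a real model would complete.
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(answer("How fast are support tickets handled?"))
```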
Most organizations’ journey to becoming a data and AI company is far from over. In fact, it never really ends. Continual advancements are helping organizations pursue increasingly advanced use cases. At Databricks, we’re always introducing new products and features that help clients tackle these opportunities.
For example, for too long, competing file formats have kept data environments separate. With UniForm, Databricks users can bridge the gap between Delta Lake and Iceberg, two of the most common formats. Now, with our acquisition of Tabular, we are working toward longer-term interoperability. This will ensure that customers no longer have to worry about file formats; they can focus on picking the most performant AI and analytics engines.
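For illustration, enabling UniForm on an existing Delta table is a matter of setting table properties. The table name below is hypothetical, and the property names reflect the UniForm documentation at the time of writing, so treat this as a sketch rather than a definitive recipe.

```python
# A hedged sketch of enabling UniForm on a Delta table so it can also be
# read as an Iceberg table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    ALTER TABLE main.sales.orders SET TBLPROPERTIES (
      'delta.enableIcebergCompatV2' = 'true',
      'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
```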
As companies begin to use data and AI more ubiquitously across operations, it will fundamentally change how businesses run – and unlock even more new opportunities for deeper investment. It’s why companies are no longer just selecting a data platform; they’re picking the future nerve center of the whole business. And they need one that can keep up with the pace of change underway.
To learn more about the shift from general knowledge to data intelligence, read the guide GenAI: The Shift to Data Intelligence.