Multimodal data pipeline startup Datavolo Inc. today revealed its ambitious plans to transform the way data is fed into artificial intelligence systems, after closing on more than $21 million in funding.
The round was led by General Catalyst and saw participation from Citi Ventures, Human Capital, MVP Ventures and Rob Bearden, the former chief executive officer of Cloudera Inc. It brings the startup's total amount raised to more than $25 million.
Datavolo is led by its co-founders, CEO Joe Witt (pictured, left) and Chief Operating Officer Luke Roquet (right). It has built what they say is a revolutionary new data pipeline system based on the open-source Apache NiFi project that they originally designed while working for the U.S. National Security Agency. Apache NiFi was built to automate the flow of data between software systems, and Datavolo is repurposing the software to handle multimodal data for generative AI workloads.
The startup said it wants to help companies make use of all of their data – not just the traditional structured data housed in databases, but also the unstructured data that accounts for the vast majority of information locked within their computer systems. According to a 2023 report by International Data Corp., about 90% of the information generated by organizations falls into the latter category, but existing data pipeline software is ill-suited to handling this kind of data.
Until organizations have an easier way to tap this unstructured information, the startup says, they’ll never be able to realize the full potential of generative AI.
The Apache NiFi project is currently used by thousands of organizations around the world and is especially popular in highly regulated industries such as government, healthcare, finance and telecommunications. However, most of those companies use the software primarily to handle their structured data needs, even though Apache NiFi can be just as useful for unstructured data.
The inefficiencies of existing data pipelines
Datavolo wants to transform Apache NiFi by leveraging it as the basis of a multimodal data pipeline for generative AI. In an interview with SiliconANGLE, Witt said that the primary advantage of Datavolo’s software is that it can replace the single-use, point-to-point code that’s currently used to deliver unstructured data to AI systems with fast, flexible and reusable pipelines that can be applied to any kind of data source. In this way, Witt said, the company is uniquely able to help companies leverage all of their data, from every source, to build more powerful and capable AI models.
Asked to elaborate, Witt explained that the industry is being held back by the lack of decent data pipeline solutions for unstructured data, which forces people to write custom code for each application. He said existing data pipelines are based on row-oriented abstractions built for data with established structures and schemas.
“In the multimodal data world, datasets tend to be quite large and they’re not structured as rows,” he explained. “In addition, traditional data platforms use point-to-point ELT architectures that don’t work well for the target systems relevant to LLM applications.”
Existing data pipelines also come with significant limitations, Witt said. For instance, once chunks of text are transformed into embeddings and stored in a vector database or search index, it’s impossible to transform or enrich such information further, unlike what can be done with traditional structured data in data warehouses.
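The one-way nature of that flow can be illustrated with a minimal, hypothetical chunk-and-embed sketch. The embedding function below is a toy stand-in (a letter-frequency vector), not any particular model's API; it only shows why, once text is reduced to vectors in a store, the original content is no longer available for further transformation or enrichment.

```python
# Minimal sketch of a typical chunk -> embed -> store flow.
# The "embedding" here is a toy stand-in (letter-frequency vector),
# not a real model API; it only illustrates the one-way transformation.

def chunk(text: str, size: int = 40) -> list[str]:
    """Split text into fixed-size chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(chunk_text: str) -> list[float]:
    """Toy embedding: normalized 26-dimension letter-frequency vector."""
    vec = [0.0] * 26
    for ch in chunk_text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    total = sum(vec) or 1.0
    return [v / total for v in vec]

# In-memory stand-in for a vector database: it holds ids and vectors only.
vector_store: dict[int, list[float]] = {}

document = "Unstructured data accounts for most enterprise information."
for i, c in enumerate(chunk(document)):
    vector_store[i] = embed(c)

# At this point the store holds only vectors; the original chunks are
# gone, so further transformation or enrichment of the source text is
# no longer possible from the store alone -- unlike a data warehouse,
# where structured rows remain queryable and transformable.
print(len(vector_store), "vectors stored")
```

Real vector databases typically store the chunk text alongside the vector, but the embeddings themselves remain opaque: they cannot be re-joined, re-aggregated or enriched the way warehouse tables can.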
“What’s more, the custom code companies are forced to write can be challenging to maintain, secure and operate,” he said. “And enterprise users would strongly prefer to adopt an established platform to which they can transfer these important risks.”
Unlocking access to unstructured data
What’s different about Datavolo’s data pipeline model is that it leverages out-of-the-box processors to extract, clean, transform, enrich and publish both structured and unstructured data, Witt said. Most importantly, it’s designed for continuous, event-driven ingest that can scale up on demand to cope with bursts of high-volume data.
“Our platform can handle a variety of data, including audio and video image streams, a raw signal captured by a sensor, a deeply nested hierarchical structured JSON or XML, text-based log entries, and a highly structured database of rows and records,” Witt added. “We know that flexibility will be a critical component for data engineers as the stack continues to evolve and open questions are answered. That’s why Datavolo’s data pipelines and orchestration capabilities are purpose-built to provide flexibility to easily swap APIs, sources, targets and models.”
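As a rough illustration of the processor-style composition described above, a pipeline built from swappable extract/clean/publish stages might look like the following. The processor names and interfaces here are invented for the sketch; they are not NiFi or Datavolo APIs.

```python
# Hypothetical sketch of processor-style pipeline composition, loosely
# modeled on the extract/clean/transform/enrich/publish stages described
# in the article. Names and interfaces are invented for illustration.

from typing import Callable

Record = dict            # one unit of flowing data, regardless of modality
Processor = Callable[[Record], Record]

def pipeline(*steps: Processor) -> Processor:
    """Compose processors left to right into one reusable pipeline."""
    def run(record: Record) -> Record:
        for step in steps:
            record = step(record)
        return record
    return run

# Swappable stages: any stage can be replaced without touching the others.
def extract_log_line(rec: Record) -> Record:
    """Split a raw log line into timestamp, level and message fields."""
    rec["fields"] = rec["raw"].split(" ", 2)
    return rec

def clean(rec: Record) -> Record:
    """Strip stray whitespace from each extracted field."""
    rec["fields"] = [f.strip() for f in rec["fields"]]
    return rec

def publish_stdout(rec: Record) -> Record:
    """Publish stage: here just prints; could target any sink."""
    print(rec["fields"])
    return rec

ingest = pipeline(extract_log_line, clean, publish_stdout)
result = ingest({"raw": "2024-04-02 INFO  service started"})
```

Because each stage is an independent function with a shared record interface, swapping a source, target or model means replacing one stage rather than rewriting point-to-point code.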
According to Witt, unstructured data is going to be essential for enterprises to get the most value out of their integrations with foundational language models in the future. As a rule, LLMs are trained on proprietary or publicly available datasets, but companies can significantly enhance their usefulness by fine-tuning them on their own business data, which is mostly unstructured.
“We strongly believe that the most successful AI applications will be built on AI systems rather than directly on top of AI models,” Witt said. “The most useful AI systems must include the ability to retrieve contextual data from enterprises’ data systems to supplement the generative capabilities of LLMs and drive business value.”
Datavolo’s new offering finally gives enterprises an opportunity to extract the maximum value from their data and unlock unprecedented innovation for companies embracing AI, Witt added. “Datavolo is a tool for data engineers supporting AI teams,” he continued. “It bridges the organizational gap between data and AI teams, providing a framework, feature set and catalog of repeatable patterns to build multimodal data pipelines that are secure, simple and scalable.”
Doug Henschen, an analyst with Constellation Research Inc., said the most interesting aspect of Datavolo will be how it evolves the Apache NiFi platform, which also underpins Cloudera’s DataFlow. He explained that Witt formerly worked at Cloudera as the vice president of its Flow and Streaming products, and may well have been inspired to launch Datavolo after his former employer’s decision to prioritize its core Cloudera Data Platform ahead of those offerings.
“Systems built for structured and semi-structured data, like Cloudera Data Platform, continue to dominate the big data market, even though NiFi has been around for quite a long time,” Henschen said. “It will be interesting to see how Datavolo applies NiFi to generative AI use cases involving multimodal data. If Datavolo turns out to be on the cutting edge, tapping into a great new growth market, I’m sure Cloudera and plenty of other vendors will go after the same market opportunity. In short, it’s another example of the longstanding tech pattern of seeing nimble upstart companies attempting to innovate and disrupt incumbents.”
A new data model for the generative AI era
General Catalyst Managing Director Quentin Clark shed more light on Datavolo’s plans in a blog post announcing today’s round, saying that as AI systems are evolving to become the backbone of daily business operations, there’s an urgent need to rethink how data architectures are structured.
“Joe and Luke are not just building another data platform; they’re setting the stage for a future where data isn’t merely handled but intelligently harnessed to fulfill the evolving requirements driven by AI,” Clark said. “We believe Datavolo has one of the best open-source teams out there, and has the product and partners in place to make this vision a reality.”
According to Clark, the existing relational data model that serves as the foundation of the modern economy will be joined by an entirely new data model that’s designed to meet the specific needs of AI. In the future, AI systems won’t just aid business operations; they will evolve to run entire parts of the business. As a result, AI applications are going to need access to the right data at the right time.
“Extrapolating data patterns and extracting what is needed is not how databases have been built. They were historically oriented to transactional events and batch processing,” Clark explained. “We have the opportunity with AI to build systems that are working with the business — assisting a sales agent, supply chain manager, field technician or any number of the countless jobs people are doing every day.”
Datavolo said it will use the funds from today’s round to focus on transforming Apache NiFi into a cloud-native managed service with specific capabilities that will enable the rapid development of new, multimodal data pipelines for AI. The startup has already made significant progress in that respect, and is now launching a private beta program for customers that want to leverage its data architecture for retrieval augmented generation applications, which are generative AI apps that can tap into unstructured data sets to enhance their existing capabilities.
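For readers unfamiliar with the pattern, retrieval-augmented generation can be sketched in a few lines. The word-overlap scoring below is a simplified stand-in for real embedding similarity, and no actual LLM is called; the sketch only shows how retrieved chunks of enterprise data get folded into the prompt.

```python
# Toy sketch of retrieval-augmented generation (RAG): retrieve the most
# relevant stored chunks for a query, then prepend them to the prompt so
# an LLM can ground its answer in enterprise data. Word-overlap scoring
# stands in for real embedding similarity; no model is actually called.

def score(query: str, chunk: str) -> int:
    """Crude relevance measure: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the top-k chunks ranked by relevance to the query."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the augmented prompt that would be sent to an LLM."""
    ctx = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{ctx}\n\nQuestion: {query}"

knowledge = [
    "Refunds are processed within 5 business days.",
    "Our headquarters are in Phoenix.",
    "Support hours are 9am to 5pm Eastern.",
]
query = "How long do refunds take?"
prompt = build_prompt(query, retrieve(query, knowledge))
print(prompt)
```

In a production system the knowledge list would be a vector database fed by the kind of continuous, multimodal ingest pipelines Datavolo is building, and the assembled prompt would go to an LLM API.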
The startup said its ideal customers are organizations that are looking to automate the continuous capture, transformation and loading of unstructured data from hundreds of different sources, out of the box.
Image: Datavolo