As data continues to be key to business success, enterprises are racing to drive maximum value from the information in hand. But the volume of enterprise data is growing so quickly (doubling every two years) that the computing power needed to process it in a timely, cost-efficient manner is hitting a ceiling.
California-based DataPelago aims to solve this with a “universal data processing engine” that allows enterprises to supercharge the performance of existing data query engines (including open-source ones) using accelerated computing elements such as GPUs and FPGAs (field-programmable gate arrays). This enables the engines to process exponentially growing volumes of complex data across varied formats.
The startup has just emerged from stealth but is already claiming to deliver a five-fold reduction in query/job latency while providing significant cost benefits. It has also raised $47 million in funding with the backing of multiple venture capital firms, including Eclipse, Taiwania Capital, Qualcomm Ventures, Alter Venture Partners, Nautilus Venture Partners and Silicon Valley Bank.
Addressing the data challenge
More than a decade ago, structured and semi-structured data analysis was the go-to option for data-driven growth, providing enterprises with a snapshot of how their business was performing and what needed to be fixed.
The approach worked well, but the evolution of technology also led to the rise of unstructured data (images, PDFs, audio and video files) within enterprise systems. Initially, the volume of this data was small, but today it accounts for 90% of all information created (far more than structured and semi-structured data) and is critical for advanced enterprise applications like large language models.
Now, as enterprises look to mobilize all their data assets, including large volumes of unstructured data, for these use cases, they are running into performance bottlenecks and struggling to process that data in a timely, cost-effective manner.
The reason, according to DataPelago CEO Rajan Goyal, is the computing limitations of legacy platforms, which were originally designed for structured data and general-purpose computing (CPUs).
“Today, companies have two choices for accelerated data processing…Open-source systems offered as a managed service by cloud service providers have smaller licensing fees but require users to pay more for cloud infrastructure compute costs to reach an acceptable level of performance. On the other hand, proprietary services (built with open-source frameworks or otherwise) can be inherently more performant, but they have much higher licensing fees. Both choices result in higher total cost of ownership (TCO) for customers,” he explained.
To address this performance and cost gap for next-gen data workloads, Goyal started building DataPelago, a unified platform that dynamically accelerates query engines with accelerated computing hardware like GPUs and FPGAs, enabling them to handle advanced processing needs for all types of data without a massive increase in TCO.
“Our engine accelerates open-source query engines like Apache Spark or Trino with the power of GPUs resulting in a 10:1 reduction in the server count, which results in lower infrastructure cost and lower licensing cost in the same proportion. Customers see disruptive price/performance advantages, making it viable to leverage all the data they have at their disposal,” Goyal said.
At its core, DataPelago’s offering uses three main components: DataApp, DataVM and DataOS. DataApp is a pluggable layer that integrates DataPelago with open data processing frameworks like Apache Spark or Trino, extending them at the planner and executor node level.
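For illustration, here is a minimal sketch of how a pluggable layer like this is typically wired into Apache Spark. Spark’s standard `spark.sql.extensions` injection point and off-heap memory configs are real; the vendor extension class name is a placeholder, since DataPelago has not published its actual integration details.

```python
from pyspark.sql import SparkSession

# Spark exposes "spark.sql.extensions" as its standard hook for injecting
# planner/executor extensions. A layer like DataApp would register its
# extension class here; "com.example.DataAppExtensions" is a placeholder
# that will not resolve, so it is left commented out in this sketch.
builder = (
    SparkSession.builder
    .appName("accelerated-analytics")
    # .config("spark.sql.extensions", "com.example.DataAppExtensions")
    .config("spark.memory.offHeap.enabled", "true")  # columnar/offload engines
    .config("spark.memory.offHeap.size", "2g")       # commonly use off-heap memory
)
spark = builder.getOrCreate()

# The user-facing query is unchanged; any acceleration happens behind the planner.
(
    spark.range(10)
    .selectExpr("id % 3 AS k", "id AS v")
    .groupBy("k")
    .sum("v")
    .show()
)
```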
Once the framework is deployed, a user’s query or data pipeline runs unmodified, with no change required in the user-facing application. On the backend, the framework’s planner converts it into a plan, which is then taken by DataPelago. The engine uses an open-source library like Apache Gluten to convert the plan into an open-standard intermediate representation (IR) called Substrait. The IR is sent to the executor node, where DataOS converts it into an executable data flow graph (DFG).
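To make that handoff concrete, here is a conceptual sketch of the lowering step: a framework’s physical plan is translated into an engine-neutral IR, which is the role Apache Gluten plays when it emits Substrait. All classes and functions below are illustrative only, not the real Gluten or Substrait APIs.

```python
from dataclasses import dataclass, field

@dataclass
class PlanNode:                  # a node in the framework's physical plan
    op: str                      # e.g. "scan", "filter", "aggregate"
    children: list = field(default_factory=list)

@dataclass
class IRNode:                    # engine-neutral intermediate representation
    op: str
    inputs: list = field(default_factory=list)

def lower_to_ir(node: PlanNode) -> IRNode:
    """Recursively translate a framework plan into the neutral IR,
    analogous to a Gluten-style plan-to-Substrait conversion."""
    return IRNode(op=node.op, inputs=[lower_to_ir(c) for c in node.children])

plan = PlanNode("aggregate", [PlanNode("filter", [PlanNode("scan")])])
print(lower_to_ir(plan))  # IRNode(op='aggregate', inputs=[IRNode(op='filter', ...)])
```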
Finally, DataVM evaluates the nodes of the DFG and dynamically maps them to the right computing element (CPU, FPGA, Nvidia GPU or AMD GPU) based on availability or cost/performance characteristics. This way, the system routes each workload to the most suitable hardware available from hyperscalers or GPU cloud providers to maximize performance and cost benefits.
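That mapping step can be pictured as a cost-based placement decision per DFG node, as in the toy sketch below. The devices, operators and throughput-per-dollar numbers are invented for illustration; DataPelago’s actual scheduling logic is not public.

```python
# Relative throughput-per-dollar for each operator on each device (made up).
PROFILE = {
    "scan":      {"cpu": 1.0, "gpu": 0.8, "fpga": 1.5},
    "filter":    {"cpu": 1.0, "gpu": 2.5, "fpga": 2.0},
    "aggregate": {"cpu": 1.0, "gpu": 3.0, "fpga": 1.2},
}

def place(op: str, available: set[str]) -> str:
    """Map one DFG node to the most cost-effective device currently available."""
    candidates = {dev: score for dev, score in PROFILE[op].items() if dev in available}
    return max(candidates, key=candidates.get)

# E.g. in a cloud pool where FPGAs are unavailable:
for op in ("scan", "filter", "aggregate"):
    print(op, "->", place(op, available={"cpu", "gpu"}))
# scan -> cpu, filter -> gpu, aggregate -> gpu
```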
Significant savings for early DataPelago adopters
While the approach of dynamically offloading query engines to accelerated computing hardware is new, the company is already claiming it can deliver a five-fold reduction in query/job latency with a two-fold reduction in TCO compared to existing data processing engines.
“One company we’re working with was spending $140M on one workload, with 90% of this cost going to compute. We are able to decrease their total spend to less than $50M,” Goyal said.
He did not share the total number of companies working with DataPelago, but he did point out that the company is seeing significant traction from enterprises across verticals such as security, manufacturing, finance, telecommunications, SaaS and retail. The existing customer base includes notable names such as Samsung SDS, McAfee and insurance technology provider Akad Seguros, he added.
“DataPelago’s engine allows us to unify our GenAI and data analytics pipelines by processing structured, semi-structured, and unstructured data on the same pipeline while reducing our costs by more than 50%,” André Fichel, CTO at Akad Seguros, said in a statement.
As the next step, Goyal plans to build on this work and take the company’s solution to more enterprises looking to accelerate their data workloads cost-efficiently.
“The next phase of growth for DataPelago is building out our go-to-market team to help us manage the high number of customer conversations we’re already engaging in, as well as continue to grow into a global service,” he said.