Data is dynamic. This is self-evident.
Strangely, though, nearly all of today’s data systems are not designed around the concept of changing data. They treat
every frame of the movie as a new picture: most data engines operate on and transact in complete data sets, not changes
in data sets. Efficient handling of deltas is a missing principle. Their engineers do not talk about insertions or shifts.
But the paradigm is changing.
Two modern query engines position deltas as the cornerstone architectural object. Both provide an
incremental-update model: the ability to recompute state by processing only inherited deltas.
This benefits modern engineers. In many cases, an incremental-update model can offer meaningful performance improvements,
a simpler data development pipeline, and a flexible processor for streams and CDC. By also providing real-time pub-sub
of derived views, these two frameworks are changing the game for AI/ML, data-driven apps, ETL, and analytics of all sorts.
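To make the incremental-update idea concrete, here is a minimal Python sketch of a derived view (a per-key running total) that is maintained by processing only deltas and that publishes its own changes to subscribers. It is an illustration only: the names `Delta`, `IncrementalSumByKey`, `apply`, and `subscribe` are hypothetical and do not correspond to the Deephaven or Materialize APIs, and real engines handle many more cases (modifications, shifts, ordering, consistency).

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Row = Tuple[str, float]  # (key, amount); a hypothetical source-table row


@dataclass
class Delta:
    """One change set against the source table: rows added and rows removed."""
    added: List[Row] = field(default_factory=list)
    removed: List[Row] = field(default_factory=list)


class IncrementalSumByKey:
    """Derived view holding a total per key, updated from deltas only.

    The full source table is never rescanned; each call to apply() does
    work proportional to the size of the delta, not the size of the data.
    """

    def __init__(self) -> None:
        self.totals: Dict[str, float] = defaultdict(float)
        self.counts: Dict[str, int] = defaultdict(int)
        self._listeners: List[Callable[[Dict[str, Optional[float]]], None]] = []

    def subscribe(self, listener: Callable[[Dict[str, Optional[float]]], None]) -> None:
        """Register a callback that receives the view's own downstream delta."""
        self._listeners.append(listener)

    def apply(self, delta: Delta) -> Dict[str, Optional[float]]:
        """Fold one upstream delta into the view and publish what changed."""
        changed = set()
        for key, amount in delta.added:
            self.totals[key] += amount
            self.counts[key] += 1
            changed.add(key)
        for key, amount in delta.removed:
            self.totals[key] -= amount
            self.counts[key] -= 1
            changed.add(key)

        # Build the downstream delta: new total per changed key,
        # or None when the key no longer has any rows.
        out: Dict[str, Optional[float]] = {}
        for key in changed:
            if self.counts[key] == 0:
                del self.totals[key]
                del self.counts[key]
                out[key] = None
            else:
                out[key] = self.totals[key]

        for listener in self._listeners:
            listener(out)
        return out
```

A consumer would call `subscribe` once and then receive only the keys that changed on each `apply`, which is the pub-sub-of-derived-views pattern in miniature. The sketch assumes removals always match a previously added row.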
Deephaven, the engine my colleagues and I developed, was until recently a closed, commercial product. After over a
decade tackling real-time and big data problems for quant hedge funds and other Wall Street players,
Deephaven is today an open, community platform, bringing the power once
reserved for capital markets to the rest of the technology community.
Materialize is another recent incremental-update engine. Its Postgres-compatibility and Timely-Dataflow guts combine with a team grounded in west-coast hyperscaler research and OLTP shops: an understandably attractive package. For SQL engineers seeking incremental compute of CDC sources and streamlined registration of views, Materialize has become an option to consider.
As a software company CEO, I instinctively expect the question, “Aren’t you like such-and-such?”
in any dialog about Deephaven. Pandas? Grafana? ksqlDB? Yes — but only in some ways. I always sympathize with the
person posing the question. Like them, I, too, try to understand the new from my framework of the familiar, and
invariably the Venn diagram of data system capabilities has many dimensions and much overlap. Differentiators are
often difficult to parse, because phrases like “real-time”, “self-service”, “democratized access”, or
“high-performance” are laughably ubiquitous in marketing, and deep technical exploration is an
unwise opening salvo.
“So Pete… isn’t Deephaven like Materialize?”
In some ways, yes. We both have an empowering incremental-compute capability at the core of our query engine.
We both transact in deltas. We both position derived streams as a very important construct. That two systems developed
independently and contemporaneously have found common north stars reinforces the strength of these decisions for the
use cases we target.
In our capital markets enterprise business, customers do not run Deephaven against Materialize, so we had not seen
a detailed comparison to Materialize or resourced one ourselves. However, the recent release of our Community
product and its focus on interoperability has piqued our interest in more meaningfully understanding this like-minded
query engine.
For years, Deephaven’s users have led us to evolve the framework that makes an incremental-update model accessible and
valuable. As our users have found that static-data tools don’t work well with dynamic tables, Deephaven has developed
APIs, libraries, and UI experiences that expand their ability to work with real-time data. Materialize and
its users may find these valuable as well. Open source offers a compelling town square for discovery, discussion,
and development of adjacent tools.
To both teach ourselves about Materialize and provide a framework for community discussion, we thought it most
effective to explore a use case of Materialize’s choosing. It seemed right to play ball on Materialize’s home court.
Accordingly, this piece and the series described below focus on
an e-commerce demo on the Materialize GitHub.
With a flow that combines MySQL, Debezium, Kafka sources, joins and decorations within the engine, and integration with
a visualization agent, the workload seemed just right. The demo articulates Materialize’s value, looked like a workflow
Deephaven could address well, and represents a relevant pipeline to implement in Deephaven.
My colleague Cristian Ferretti, a veteran developer and straight shooter, took on the
tasks of (i) spinning up the Materialize demo, as made available in their GitHub, and (ii) implementing a
precisely-engineered lookalike that swaps in Deephaven as the query engine.
Below are links to write-ups detailing this process and further articulating the differences between Deephaven and Materialize:
- The exact process of coding and running the two identical demos, found in the deephaven-examples GitHub project.
- A study of relative resource consumption and performance.
- A very high-level design comparison between Deephaven and Materialize.
- A description of parts of the framework beyond the engine and why working on the stack is necessary in a streaming world.
Though we encourage you to review the unabridged series, particularly while contemplating the potential for
collaborative development leveraging the complete Deephaven stack, below is my summary of Cristian’s work:
- Deephaven and Materialize both provide a query engine founded on an incremental-update model. For many of your workloads, this is an empowering and forward-thinking way to process data and provide it to downstream consumers.
- In Cristian’s comparison, based on Materialize’s functional demo, Deephaven outperforms Materialize: it consumes fewer resources while processing joins and computing projections with greater throughput. These differences are not marginal, but order-of-magnitude differences.
- Where Kafka-like principles for consistency satisfy your requirements, you will often find Deephaven is a more empowering match. Materialize may have an advantage for OLTP and Postgres engineers in the way it handles transactions. If a global, fully-consistent model is vital, you will be pleased with the care Materialize provides and willing to pay its cost. This may also be true for queries highly reliant on the SQL optimizer.
- Deephaven is a more natural fit for Pandas scientists, Python developers, JavaScript developers building real-time interactive experiences, and those who need to bring custom Java or Python code/libraries to data.
- We missed the Deephaven UI when working on the Materialize side of the demo.
Cristian and I invite you to repeat the experiment on your own. You will find the necessary code and guidance at the deephaven-examples repo. Like you, we see this exercise as far from exhaustive in its consideration. There are countless other use case dimensions to explore. We look forward to conversations about any weaknesses in our approach or challenges to our conclusions. We welcome you to join our Slack community to get in touch with our team.
Today’s big data is dynamic. Embrace it.