In response to Rockset-OpenAI: a brief real-time analytics manifesto

OpenAI’s acquisition of Rockset sent waves through the data infrastructure industry. On the heels of Databricks’ billion-dollar reach for Tabular, the industry is both in play and in flux.

Companies like ClickHouse, StarTree, and Imply have published pieces that pontificate about the impact of OpenAI’s acquisition on the landscape and pitch their gear to Rockset customers looking for a Plan B.

Even my loved ones aren’t interested in my armchair quarterbacking of OpenAI’s strategy, so I’ll let the All-In, Ben & Mark, and Acquired podcasts weigh in. However, as the CEO of Deephaven Data Labs, a company developing software for today’s most challenging “real-time analytics” workloads (sorry, Palantir), it seemed wise to use OpenAI’s spotlight to start a conversation—or perhaps pick a fight.

Transactionality should be supported only where and when it matters.

Combining Online Transaction Processing (OLTP) and its analytics counterpart (OLAP) in real-time systems, as Rockset, Flink, Materialize, and SingleStore do, may seem appealing, but it trivializes the very real costs of casually imposing full consistency. Real-time analytics should focus solely on OLAP, because blending in transactionality adds complexity and overhead:

  • Performance Overhead: Transactional integrity requires locking and logging, slowing the system.
  • High Resource Utilization: More CPU, memory, and storage are needed, raising costs.
  • Scalability Issues: Maintaining ACID properties in a distributed setup reduces scalability and increases latency.
  • Fault Tolerance and Development Complexity: If all your queries and client code have to be transaction-aware, failure handling gets harder and your development costs will explode.

Most real-time analytics and business applications involve data streams, metrics, messages, logs, and other events. These rarely need transactional guarantees. Rather, the eventual consistency model the world has embraced with Kafka is sufficient for such data and their downstream real-time analytics and apps. Simply put, processing systems that force transactional support on top of this data bear too heavy a burden.

Too often, we see the cloud’s elasticity as the solution for every problem. Yes, one can indeed hire 100 people to parallelize the task of carrying dead weight up the mountain, but we think it’s better to lighten your load and run up the slope yourself.

In mainstreaming the relational database, IBM and Oracle delivered a superpower: composability. Outputs from one query could always be used as inputs to other queries, a guarantee that yields big benefits.

Composability allows you to break big problems into smaller ones and construct solutions with building blocks that snap together perfectly: “analytics as Legos.” Composability also facilitates asynchronous development. An ex-employee built an application a year ago, and you want to add to it today? With composable systems, just add your logic to the pertinent node in the graph. Voilà.
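
That guarantee is easy to take for granted in batch land: every step returns a table the next step can consume. A minimal, purely illustrative sketch in pandas (the data and column names are invented) shows the “Legos” idea, with each block’s output snapping directly into the next:

```python
import pandas as pd

# Toy event data standing in for a business feed (illustrative only).
trades = pd.DataFrame({
    "sym":   ["AAPL", "MSFT", "AAPL", "GOOG", "MSFT"],
    "size":  [100, 250, 300, 50, 125],
    "price": [190.1, 410.5, 189.9, 175.2, 411.0],
})
limits = pd.DataFrame({"sym": ["AAPL", "MSFT", "GOOG"], "limit": [500, 200, 100]})

# Block 1: filter down to the rows we care about.
big_trades = trades[trades["size"] >= 100]

# Block 2: aggregate the output of block 1.
notional = (big_trades.assign(notional=big_trades["size"] * big_trades["price"])
            .groupby("sym", as_index=False)["notional"].sum())

# Block 3: join the output of block 2 against reference data.
report = notional.merge(limits, on="sym")
print(report)
```

The ask of real-time systems is simple: preserve exactly this property when the source data never stops changing.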

The world has shrugged its shoulders and given up on composability for real-time analytics. “Reevaluating” systems like Rockset or Tinybird/ClickHouse require you to recalculate the whole world at some artificial frequency. This increases costs and limits the complexity of the solutions you can stack together. Index-aligned aggregations, these solutions’ sweet spot, are pretty meager compared to what it takes to actually run a sophisticated business in real time; ask any internet-scale company or any player in the capital markets, industrial telemetry, energy, IoT, or payments space.

Stream processors, the other popular real-time analytics model, also lack composability. While their outputs—event streams and key-value pairs—can sometimes serve as natural inputs for subsequent analytics, applications, or nodes (such as when a stream is processed into a filtered version of itself), this is too often not the case. For instance, a table of ‘ranked risks that meet some threshold’ is best represented as an ordered table that updates, not as a stream. A system with a real-time version of that analytic as a middle node in the graph needs an engineer to do custom work to support downstream calculations and consumers. As a result, such systems often feel like hacks: brittle, slow to evolve, and costly to maintain as requirements change.

Deephaven is built on ease of use and composability, utilizing dataframes—highly structured, ordered tables. Deephaven updates dataframes in real time by tracking and evaluating changes (additions, removals, modifications, and shifts). Connecting these live dataframes in an acyclic graph presents a generalization of stream processing. In other words, Deephaven offers stream processing and more: it emphasizes the structure of dataframes and handles changes far beyond simple appends and logs of key-value pair changes. Moreover, a column-oriented dataframe model eliminates the compromises imposed by row-oriented stream processing when combining batch data in modern formats (e.g. Apache Parquet and Apache Iceberg) with disparate streaming sources (e.g. Apache Kafka and Apache Arrow Flight). It also enables vectorized, high-throughput analytics.
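
As a rough sketch of what that graph looks like in practice, here is a small example using the deephaven Python package. It assumes a running Deephaven server; the symbols and formulas are invented, and time_table stands in for any real streaming source (a Kafka topic, for instance):

```python
from deephaven import time_table, agg

# A ticking source table that adds one row per second; in production this could
# be a Kafka topic, or batch Parquet/Iceberg data joined with a stream.
source = time_table("PT1S").update([
    "Sym = ii % 2 == 0 ? `AAPL` : `MSFT`",   # invented symbols
    "Price = 100.0 + (ii % 7)",              # invented prices
    "Size = 100 * (1 + (ii % 9))",
])

# Each node below is a live dataframe that re-evaluates only the rows that
# changed in its parent, not the whole table.
filtered = source.where("Size >= 300")
notional = filtered.update("Notional = Price * Size")
vwap = notional.agg_by([agg.sum_("Notional"), agg.sum_("Size")], by=["Sym"]) \
               .update("Vwap = Notional / Size")
```

Every table in the chain (filtered, notional, vwap) is itself a live dataframe that downstream queries, applications, or UIs can consume, which is the composability point above.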

Real-time DB companies (i.e., “reevaluating engines”) love to fight with one another about who’s best at indexing data. Rockset even called their May conference “Index 2024.”

In real-time analytics, however, fast indexing misses the point. To support high-throughput or low-latency use cases, or simply to use compute efficiently, the optimal approach is to update the previous state of a calculation by operating only on changes. An incrementally updating engine provides a superior architecture for real-time analytics because it performs calculations—whether simple table operations or more complex processes—on far less data than real-time databases like Rockset, Imply, StarTree, or Tinybird. In big-data scenarios, the difference in scale between handling incremental changes and reprocessing the entire dataset is monumental.
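
A toy, engine-agnostic sketch makes the distinction concrete (the symbols and numbers are invented). A running group-sum maintained from batches of adds and removes touches O(changes) rows per refresh; a full recompute touches every row ever ingested. Real incremental engines track richer change types (modifies and shifts as well), but the asymmetry is the point:

```python
from collections import defaultdict

# Full recompute: every refresh rescans every row ever seen -- O(total rows).
def recompute(all_rows):
    totals = defaultdict(float)
    for sym, notional in all_rows:
        totals[sym] += notional
    return totals

# Incremental update: each refresh touches only the changed rows -- O(changes).
def apply_changes(totals, added, removed):
    for sym, notional in added:
        totals[sym] += notional
    for sym, notional in removed:
        totals[sym] -= notional
    return totals

totals = defaultdict(float)
totals = apply_changes(totals, added=[("AAPL", 19010.0), ("MSFT", 102625.0)], removed=[])
totals = apply_changes(totals, added=[("AAPL", 56970.0)], removed=[("MSFT", 102625.0)])
print(dict(totals))   # {'AAPL': 75980.0, 'MSFT': 0.0}
```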

The combination of dataframes’ composability and the efficiency of incremental updates is key to Deephaven’s extreme performance and versatility in the real-time space. The easy mental model dataframes afford is the icing on the cake.

Most real-time analytics providers seem happy to deliver a microbatching aggregation and then leave the rest to the customer. Connect your Python/C++/Java/C#/Rust/Go/JavaScript applications to the real-time engine? That’s your problem. Interoperate with batch? That’s your problem. Support ticking UI/UX or widgets? That’s your problem.

They might give you an engine that can do some things, but you’re trying to deliver a full soup-to-nuts business use case, not just create a simple analytic. You need a car that can pick up kids and take a road trip, not an engine that whirs in the garage.

It may be tempting to see real-time engine companies leaving “the car stuff” to others as simply a little unhelpful or uninspiring, but the truth is that their architectural approach fundamentally blocks elegance in other parts of the stack. It’s a mess, so they make it your mess.

Deephaven takes a different approach. It provides technologies to serve your business beyond just the query engine layer. For example, enterprises often must connect their applications to the real-time data framework. Deephaven solves that problem for you by providing a gRPC-Arrow-Flight-based API, with all of the goodness of moving “dataframe changes” on the wire. With it, you can control query services remotely, send code to the compute cluster, and bidirectionally stream raw and derived data between your client application and Deephaven servers in seven different programming languages.
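
For illustration, here is a minimal sketch of that client workflow using the pydeephaven Python client. It assumes a Deephaven server at localhost:10000; the table name and formulas are invented, and the method names should be treated as illustrative of the client API rather than as a reference:

```python
from pydeephaven import Session

# Connect to a Deephaven server over gRPC / Arrow Flight (host and port assumed).
session = Session(host="localhost", port=10000)

# Push query logic to the compute cluster...
session.run_script("""
from deephaven import time_table
prices = time_table("PT1S").update([
    "Sym = ii % 2 == 0 ? `AAPL` : `MSFT`",
    "Price = 100.0 + (ii % 7)",
])
""")

# ...then pull the live result back to the client as Arrow data.
prices = session.open_table("prices")
print(prices.to_arrow().to_pandas().tail())

session.close()
```

The same pattern applies, per the source, across the other supported client languages; the wire format carries dataframe changes rather than full snapshots.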

One of those API languages is JavaScript. Supporting IDE interactivity, self-serve dashboarding, Jupyter widgets, and UI development with real-time data is important. So, Deephaven makes available an open-source WebUI framework with a compelling grid UX, embeddable widgets, programmatic dashboards, and callback support, all designed to work with real-time data at scale. Though its data engine starts the Deephaven story, the differentiators higher up the stack often matter. Plotly charts that tick, rollups and pivot tables that update, and interactive UI dropdowns driven by live tables are just a few examples. In aggregate, this delivers experiences you might call “Real-time Streamlit” or “Tableau meets live updating data.”

Silicon Valley often complains that “streaming data has been on the cusp of changing the analytics paradigm for a decade, but it never arrives.” In this case, the industry lacks imagination, focusing on “mid” tech and looking to the wrong coast. Real-time applications and analytics, including those heavy in ML and AI, have been table stakes on Wall Street for twenty years. Deephaven is the solution of choice for bulge bracket capital markets firms serving their most demanding use cases.

As open, modular software built on the principles of open formats, interoperability, and seamless service of stream and batch data, Deephaven provides solutions for community users and enterprises to get work done. Real-time data should speed you up. Let’s roll.


