The next release will include an alpha feature that allows users to address Deephaven’s real-time Streaming Tables with SQL, automatically inheriting updates and changes via familiar declarative patterns. While the team lays that groundwork, version 0.24 delivers some exciting new table methods, improvements to the Python client, and performance-related enhancements.
Full release notes are found on GitHub.
Many users have asked for range_join table operations (sometimes preferring the term window_join). Though versatile in many ways, range_join is often used to join records from a right table that fall within a particular range of time associated with an event in a left table: “market trade events between my order send time and my order fill time” for a stock trader, or “website activity during the time range user-X was on the site” for a marketing-tech analyst.
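To make the semantics concrete, here is a minimal plain-Python sketch of the idea (not Deephaven code and not how the engine implements it). The helper name `range_join_group`, the tuple layouts, and the choice to represent "no matching rows" as `None` are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right

def range_join_group(left_rows, right_rows):
    """For each (start, end) in left_rows, collect the measures of right rows
    whose time t satisfies start < t < end (hypothetical helper)."""
    # Sort the right side by time so each interval maps to a contiguous slice.
    right_sorted = sorted(right_rows, key=lambda r: r[0])
    times = [t for t, _ in right_sorted]
    result = []
    for start, end in left_rows:
        lo = bisect_right(times, start)  # first index with time > start
        hi = bisect_left(times, end)     # first index with time >= end
        group = [m for _, m in right_sorted[lo:hi]]
        # Assumption for this sketch: an empty range is reported as None.
        result.append(group if group else None)
    return result

left = [(0, 0), (0, 5), (0, 10)]              # (start, end)
right = [(1, 10), (4, 20), (5, 30), (9, 40)]  # (event_time, measure)
groups = range_join_group(left, right)
```

Both bounds are exclusive here, matching a strict `start < t < end` comparison, so the event at time 5 is left out of the second interval.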
A full description of range_join is found in the PyDoc. The script below provides an artificial, demonstrative example.
from deephaven import empty_table
from deephaven.agg import group
import random

# Left table: 100 intervals that start at roughly the current time and end progressively later.
left_table = empty_table(100).update([
    "Row_Num = ii",
    "Start_Time = now()",
    "End_Time = Start_Time + 'PT00:00:00.500' * Row_Num",
])

# Right table: 1000 events spaced 100 ms apart, each with a random measure.
right_table = empty_table(1000).update([
    "Row_Num = ii",
    "Event_Time = now() + 'PT00:00:00.100' * Row_Num",
    "Event_Measure = (int)random.randint(0,100)",
])

# Collect the Event_Measure values that fall inside each left-row interval.
aggs = [group(cols=["Grouped_Events=Event_Measure"])]
rj_example = left_table.range_join(table=right_table, on="Start_Time < Event_Time < End_Time", aggs=aggs)

# Drop left rows with no matching events, then summarize each group.
rj_with_aggs = rj_example.where("!isNull(Grouped_Events)").update([
    "Joined_Row_Count = len(Grouped_Events)",
    "Last_Joined_Event = last(Grouped_Events)",
    "Joined_Sum = sum(Grouped_Events)",
])
As highlighted in recent release blogs, Deephaven introduced update_by() as an operation on which to deliver cumulative, rolling, and window-based operations. In 0.24.0, the team has added the following operators to the update_by universe:
The PyDocs detail dozens of available update_by operators.
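For readers new to the distinction between cumulative and rolling operations, the following plain-Python sketch shows what each kind of operator computes per row. It is not Deephaven code; the function names and the tick-based window (the current row plus the preceding rows) are assumptions chosen for illustration.

```python
from collections import deque
from itertools import accumulate

def cumulative_sum(values):
    """Running total over every row seen so far."""
    return list(accumulate(values))

def rolling_sum(values, window):
    """Sum over the current row and up to window - 1 preceding rows."""
    out = []
    buf = deque()
    total = 0
    for v in values:
        buf.append(v)
        total += v
        if len(buf) > window:
            # Evict the oldest row once the window is full.
            total -= buf.popleft()
        out.append(total)
    return out

prices = [1, 2, 3, 4, 5]
cum = cumulative_sum(prices)   # [1, 3, 6, 10, 15]
roll = rolling_sum(prices, 3)  # [1, 3, 6, 9, 12]
```

The rolling version keeps a running total and evicts old rows rather than re-summing the window on every row, which is one common way to keep per-row cost constant.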
Pandas 2.0
Deephaven has upgraded its pandas integration to support pandas 2.0. Among the upgrades in the 2.0 library, our users will appreciate that pandas can now return PyArrow-backed DataFrames. Formerly, NumPy arrays were the only option. Given the nice integrations between Arrow and Deephaven, this is an empowering inherited upgrade.
Vector iteration for better performance
When user-defined functions (UDFs) embedded in queries access vectors, they benefit from the engine’s use of vector iteration. Historically, the implementation relied on direct access.
Parallelized where
Users now automatically inherit multi-threading in their filtering. Even when using Python, the core engine will parallelize the application of the where operation across the table. This also applies to real-time tables inheriting updates. The execution of user scripts and applications inherits this multi-threading automatically.
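The general shape of a parallelized filter can be sketched in plain Python as follows. This is a conceptual illustration only, not the engine's implementation: the function name `parallel_where`, the chunking strategy, and the default sizes are assumptions, and pure-Python threads will not show the speedup a native engine gets.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_where(rows, predicate, chunk_size=1000, max_workers=4):
    """Filter rows by splitting them into chunks and filtering chunks concurrently."""
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]

    def filter_chunk(chunk):
        return [r for r in chunk if predicate(r)]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in chunk order, so the input row order is preserved.
        filtered = pool.map(filter_chunk, chunks)
        return [r for chunk in filtered for r in chunk]

rows = list(range(10_000))
evens = parallel_where(rows, lambda x: x % 2 == 0)
```

The caller writes an ordinary predicate and gets back the same rows a sequential filter would produce; the splitting and recombining stay hidden, which is the property the release describes for where.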
Contributors are working on a slew of enhancements for the next release. Alpha SQL features and a beta-version R client lead a pack of exciting developments. Stay tuned.
We look forward to interacting with you via Deephaven’s Slack or GitHub Discussions.