Deephaven Community Core version 0.33.0 is now out. We’re excited about it and hope you will be too after reading the release notes! Let’s take a closer look at what it includes.
Read Parquet from AWS S3
Deephaven can now read single Parquet files from AWS S3. The code block below fetches data from a public S3 bucket. This experimental feature is under active development, so stay tuned as we expand on it.
from deephaven import parquet
from deephaven.experimental import s3
from datetime import timedelta
drivestats = parquet.read(
    "s3://drivestats-parquet/drivestats/year=2023/month=02/2023-02-1.parquet",
    special_instructions=s3.S3Instructions(
        "us-west-004",
        endpoint_override="https://s3.us-west-004.backblazeb2.com",
        anonymous_access=True,
        read_ahead_count=8,
        fragment_size=65536,
        read_timeout=timedelta(seconds=10),
    ),
)
Rollup table weighted average
Weighted average aggregations can now be calculated from a rollup table, like in the code block below:
from deephaven import read_csv, agg
insurance = read_csv("https://media.githubusercontent.com/media/deephaven/examples/main/Insurance/csv/insurance.csv")
agg_list = [agg.weighted_avg(wcol="age", cols=["bmi", "expenses"])]
by_list = ["region", "age"]
insurance_rollup = insurance.rollup(aggs=agg_list, by=by_list, include_constituents=True)
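For reference, the weighted average that agg.weighted_avg computes for each group is sum(w * x) / sum(w), where w is the weight column (age here) and x is each aggregated column. A minimal plain-Python sketch of that formula, using made-up values rather than the insurance dataset:

```python
def weighted_avg(weights, values):
    # Weighted average: sum(w * x) / sum(w)
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Hypothetical ages (weights) and BMI values for one group
ages = [19, 27, 33]
bmis = [27.9, 33.0, 22.7]
print(round(weighted_avg(ages, bmis), 2))
```

Rows with larger weights pull the result toward their values, which is why an equal-weight call reduces to the ordinary mean.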
Custom formulas in rolling operations
The update_by table operation now supports custom user-defined formulas. Like other update_by operations, these formulas can be cumulative, windowed by ticks (rows), or windowed by time. Custom formulas used in update_by operations follow the same rules as custom formulas in aggregations.
The following code block uses the new rolling formula operations in update_by to calculate a rolling sum of squares of prices by ticker.
from deephaven.updateby import rolling_formula_tick, rolling_formula_time
from deephaven import empty_table
prices = empty_table(20).update(["Timestamp = '2024-02-23T09:30:00 ET' + ii * SECOND", "Ticker = (i % 2 == 0) ? `NVDA` : `GOOG`", "Price = randomDouble(100.0, 500.0)"])
formula_tick = rolling_formula_tick(formula="sum(x * x)", formula_param="x", cols="SumPriceSquared_Tick = Price", rev_ticks=5)
formula_time = rolling_formula_time(ts_col="Timestamp", formula="sum(x * x)", formula_param="x", cols="SumPriceSquared_Time = Price", rev_time="PT10s")
result = prices.update_by(ops=[formula_tick, formula_time], by="Ticker")
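To make the tick windowing explicit: with rev_ticks=5, each output row applies the formula to that row's value and up to four preceding values within the same group. A hedged plain-Python sketch of that semantics (illustrative only, not Deephaven's implementation):

```python
def rolling_formula_tick_sketch(values, rev_ticks, formula):
    # For each row, apply `formula` to a window holding the current
    # value and up to rev_ticks - 1 values before it.
    out = []
    for i in range(len(values)):
        window = values[max(0, i - rev_ticks + 1) : i + 1]
        out.append(formula(window))
    return out

prices_one_ticker = [2.0, 3.0, 4.0]
sums = rolling_formula_tick_sketch(
    prices_one_ticker, 5, lambda xs: sum(x * x for x in xs)
)
print(sums)  # [4.0, 13.0, 29.0]
```

Early rows have shorter windows because fewer preceding rows exist, matching how windowed update_by operations behave at the start of a table.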
Support for 1D arrays in Numba decorators
Version 0.33.0 adds support for using Numba's guvectorize decorator in table operations. It currently supports 1-dimensional arrays, with support for multi-dimensional arrays being eyed for a future release.
The following code block applies the decorator to the function g, which is then used in a table operation. g takes a 1-dimensional array and a scalar value as input, and returns another 1-dimensional array.
from numba import guvectorize, int64
from deephaven import empty_table
from numpy import typing as npt
import numpy as np
def array_from_val(val) -> npt.NDArray[np.int64]:
    return np.array([val, val + 1, val + 2], dtype=np.int64)

@guvectorize([(int64[:], int64, int64[:])], '(n),()->(n)')
def g(x, y, res) -> npt.NDArray[np.int64]:
    for i in range(x.shape[0]):
        res[i] = x[i] + y
source = empty_table(5).update(["X = i", "Y = array_from_val(X)"])
result = source.update(["Z = g(Y, X)"])
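Numba aside, the element-wise logic of g is simply adding the scalar to every element of the array. A plain-Python sketch of the per-row result (assumed values, not output captured from Deephaven):

```python
def array_from_val(val):
    # Same shape of data as the table's Y column: [val, val+1, val+2]
    return [val, val + 1, val + 2]

def g_sketch(x, y):
    # Equivalent of g: add the scalar y to every element of x
    return [xi + y for xi in x]

# Row i of `source` has X = i and Y = array_from_val(i),
# so Z for row 3 is [3, 4, 5] + 3 element-wise.
print(g_sketch(array_from_val(3), 3))  # [6, 7, 8]
```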
Partitioned table viewer
Partitioned tables are tables with a column whose values are other tables (constituent tables, or subtables) that share the same schema. They can provide a significant boost to query performance when used properly. The biggest drawback of partitioned tables has always been the inability to visualize the data they contain. That is no longer the case: we've added a partitioned table viewer to the Deephaven UI. Now, when you create a partitioned table, you can see its data by default.
The following code block creates a partitioned table from the same table used in the previous section using a single partitioning column.
from deephaven import empty_table
prices = empty_table(20).update(["Timestamp = '2024-02-23T09:30:00 ET' + ii * SECOND", "Ticker = (i % 2 == 0) ? `NVDA` : `GOOG`", "Price = randomDouble(100.0, 500.0)"])
prices_by_ticker = prices.partition_by(by="Ticker")
In Deephaven Community Core 0.32.1 and earlier, visualizing prices_by_ticker could only be done with one or more table operations that return a normal table. Now, in 0.33.0, the viewer lets you view any of its constituents directly from the UI.
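Conceptually, partition_by splits the rows into one constituent table per distinct key value. A hedged plain-Python sketch of that grouping, with hypothetical rows standing in for the prices table:

```python
from collections import defaultdict

def partition_by_sketch(rows, key):
    # Group rows into one "constituent" per distinct key value
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return dict(parts)

rows = [
    {"Ticker": "NVDA", "Price": 101.5},
    {"Ticker": "GOOG", "Price": 140.2},
    {"Ticker": "NVDA", "Price": 102.0},
]
parts = partition_by_sketch(rows, "Ticker")
print(sorted(parts))  # ['GOOG', 'NVDA']
```

Each value in parts plays the role of a constituent table, which is what the new viewer lets you browse in the UI.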
Filter by multiple selections in the UI
You can now filter by multiple rows easily via the UI. Right-clicking inside a selection of multiple rows allows filtering by all distinct values in that selection.
Blink tables, select, and update
Prior to version 0.33.0, calling select or update on a blink table did not propagate the attribute that causes aggregations to remember data history. This has been fixed, so aggregations on blink tables now work as you'd expect. Blink tables still provide all of the same memory and performance benefits as they always have.
Python performance
The development team found some areas of our Python API whose performance could be improved. One of those improvements is included in the 0.33 release, so your Python queries could benefit from upgrading to this latest version. More Python performance improvements are coming in future releases, so stay tuned for future announcements.
Our Slack community continues to grow! Join us there for updates and help with your queries.