The wait is over, and Deephaven Community Core version 0.34.0 is out. This is a big release with significant enhancements and new features. Are you ready to explore the latest updates? Let’s dive in and discover what’s new!
Command line interface for pip-installed Deephaven
Do you run Deephaven from Python without Docker? If so, chances are it’s because:
- You don’t like Docker.
- You want to keep everything in Python.
- You like the Jupyter experience.
Well, we have good news. It just got even easier to start Deephaven from Python with the introduction of a new command line interface.
If you install Deephaven 0.34.0 or later via pip install deephaven-server, you can start a Deephaven server with a single deephaven command:
# Start a server on port 8080 with a random PSK
deephaven server
# Start a server on port 10000 with a random PSK
deephaven server --port 10000
# Start a server on port 9999 with 12GB of heap
deephaven server --port 9999 --jvm-args="-Xmx12g"
# Start a server with `/tmp/deephaven` as the data directory
deephaven server --jvm-args="-Ddeephaven.dataDir=/tmp/deephaven"
# Start a server with the console disabled
deephaven server --jvm-args="-Ddeephaven.console.disable=true"
# Get help with the Deephaven command
deephaven --help
# Get help with the Deephaven server command
deephaven server --help
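If you launch the server from a script rather than a shell, the same flags can be assembled programmatically. The sketch below is a minimal, hypothetical helper (build_server_command is not part of Deephaven) that uses only the --port and --jvm-args options shown above. It assumes the deephaven executable is on your PATH and only builds the argument list; starting the process is left to the caller.

```python
import shlex


def build_server_command(port=None, jvm_args=None):
    """Return the argv list for `deephaven server` (hypothetical helper)."""
    cmd = ["deephaven", "server"]
    if port is not None:
        cmd += ["--port", str(port)]
    if jvm_args:
        # Passed as a single --jvm-args=... token, mirroring the examples above.
        cmd.append(f"--jvm-args={jvm_args}")
    return cmd


# Same settings as the "port 9999 with 12GB of heap" example.
cmd = build_server_command(port=9999, jvm_args="-Xmx12g")
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.Popen(cmd) if you want the script to own the server process.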
For more information about installing Deephaven with pip, see the pip install guide.
Use the Python client to ingest tables from remote servers
Have you ever wanted to use the Deephaven Python client from within a server? Now you can! Client tables can be made available for use in server-side queries. By running a Python client on a server, you can create and ingest tables from remote servers and use them in your own queries. This is exciting because:
- Distributing workloads across multiple Deephaven servers just got a lot easier.
- It leverages gRPC to support large and complex queries.
To demonstrate this new feature, consider the following configuration:
- You have a Deephaven server up and running with Python locally on port 10000. It has pydeephaven installed.
- You have another server running locally on port 9999 with anonymous authentication. You want to create a table on this other instance, and subscribe to it on the Deephaven server at port 10000.
From the Deephaven server running on port 10000, run:
from deephaven.barrage import barrage_session
from pydeephaven.session import SharedTicket
from pydeephaven import Session
client_session = Session(port=9999)
client_table = client_session.time_table("PT1s").update(["X = 0.1 * i", "Y = sin(X)"])
client_ticket = SharedTicket.random_ticket()
client_session.publish_table(client_ticket, client_table)
my_barrage_session = barrage_session(port=9999)
local_table = my_barrage_session.subscribe(client_ticket.bytes)
new_local_table = local_table.last_by()
Parquet
Read partitioned Parquet datasets from AWS S3
In the 0.33 release, we added support to read single Parquet files from AWS S3. That has now been expanded to read partitioned datasets from S3. The best part? It’s just as easy to do! Check out this code, which reads a folder of publicly available Parquet data from an S3 bucket directly into Deephaven as a table:
from deephaven import parquet
from deephaven.experimental import s3
from datetime import timedelta
ookla_performance = parquet.read(
    "s3://ookla-open-data/parquet/performance/type=mobile/year=2023",
    special_instructions=s3.S3Instructions(
        region_name="us-east-1",
        anonymous_access=True,
        read_ahead_count=8,
        fragment_size=65536,
        read_timeout=timedelta(seconds=10),
    ),
).coalesce()
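The path in the example above encodes partition values as key=value folder names (type=mobile, year=2023). As a rough illustration of that naming convention, and not of how Deephaven discovers partitions internally, the sketch below pulls those pairs out of a path in plain Python. The helper name partition_values is hypothetical.

```python
def partition_values(path):
    """Collect key=value path segments into a dict (illustrative helper)."""
    values = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        # Keep only segments that look like key=value.
        if sep and key and value:
            values[key] = value
    return values


path = "s3://ookla-open-data/parquet/performance/type=mobile/year=2023"
parts = partition_values(path)
print(parts)
```

Values come back as strings, so a consumer that wants year as a number would still need to convert it.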
Write partitioned Parquet files and metadata files
You can now not only read these files into Deephaven tables, but also write them from Deephaven tables. The following code block writes a partitioned table to a partitioned Parquet dataset.
from deephaven.parquet import write_partitioned
from deephaven import empty_table
import os
t = empty_table(10).update(["X = (i % 2 == 0) ? `A` : `B`", "Y = i"])
pt = t.partition_by("X")
write_partitioned(pt, "/data/PartitionedParquet")
print(os.listdir("/data/PartitionedParquet"))
Data indexing for tables
A DataIndex allows users to improve the speed of data access operations in a table. It applies to one or more indexed key columns. It’s now available in the Python API through the deephaven.experimental.data_index module. Sort, join, aggregation, and filter operations all benefit from this new feature.
Keep an eye out for additions to our documentation on this topic soon!
Built-in query library functions for array columns
Two new built-in query library functions, diff and rank, have been added that can be used on array columns. They:
- Compute differences between values in an array.
- Rank values in an array.
The code block below uses these two on a table with an array column.
from deephaven import empty_table
t = (
    empty_table(10)
    .update(["X = randomInt(0, 10)"])
    .group_by()
    .update(["DiffX = diff(1, X)", "RankX = rank(X)"])
)
Batch formula compilation
A core part of the Deephaven engine has been reworked to be significantly more performant when batching formulas together. For instance, the following code runs over 10x faster in 0.34.0 than it does in 0.33.3.
from deephaven import empty_table
formulas = [""] * 1000
values = [0] * 1000
for idx in range(1000):
    values[idx] = idx * 1024
    formulas[idx] = f"C{idx} = (long)values[{idx}]"
t = empty_table(1).update(formulas)
If you create tables with a lot of columns created from formulas, you’ll see a noticeable difference.
Improved error messages
Error messages in Deephaven now contain the query string that caused them, which makes them more searchable and easier to understand.
Blink input tables in the Python client
Server-side APIs have been able to create blink input tables for some time, so it’s about time the Python client caught up.
The C++ client now works on Windows
The Deephaven C++ client previously only worked on Linux distributions, but that is no longer the case! It can now be built on both Windows 10 and Windows 11. For full instructions on building on Windows, see here.
Null to NaN conversions for NumPy arrays
Users whose Deephaven queries leverage NumPy have likely converted a table with null values to a NumPy array, and found the null values to be frustrating to deal with. Deephaven now offers a helper function to convert those to NumPy NaN values, which are much easier to handle from Python.
Before version 0.34.0, this conversion was done automatically for user-defined functions. That is no longer the case. For more information on this breaking change, see engine handling of type hints.
from deephaven.jcompat import dh_null_to_nan
from deephaven.numpy import to_numpy
from deephaven import empty_table
t = empty_table(10).update("X = (i % 2 == 0) ? 0.1 * i : NULL_DOUBLE")
np_t = dh_null_to_nan(to_numpy(t).squeeze())
print(np_t)
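To make the conversion concrete, here is a plain-Python sketch of the idea. It is not Deephaven's implementation. It assumes that a null double is stored as the sentinel value -Double.MAX_VALUE (the NULL_DOUBLE constant); the names NULL_DOUBLE_SENTINEL and null_to_nan are illustrative only.

```python
import math
import sys

# Assumption: Deephaven's NULL_DOUBLE is the most negative finite double.
NULL_DOUBLE_SENTINEL = -sys.float_info.max


def null_to_nan(values):
    """Replace the null sentinel with NaN, leaving other values untouched."""
    return [math.nan if v == NULL_DOUBLE_SENTINEL else v for v in values]


raw = [0.0, NULL_DOUBLE_SENTINEL, 0.2, NULL_DOUBLE_SENTINEL]
cleaned = null_to_nan(raw)
print(cleaned)
```

Without a step like this, the sentinel would be treated as an ordinary, very large negative number and would silently skew sums and averages.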
Simple date formatting in Python
A helper function, simple_date_format, has been added to the Python API. It makes date parsing easier in Python if your date-time data isn’t in an ISO-8601 format:
from deephaven import new_table
from deephaven.column import string_col
from deephaven.time import simple_date_format
source = new_table(
    [
        string_col(
            "Timestamp",
            ["20230101-12:30:01 CST", "20230101-12:30:02 CST", "20230101-12:30:03 CST"],
        )
    ]
)
input_format = simple_date_format("yyyyMMdd-HH:mm:ss z")
result = source.update("NewTimestamp = input_format.parse(Timestamp).toInstant()")
source_meta = source.meta_table
result_meta = result.meta_table
Time of day methods properly handle daylight savings
New time of day methods have been added to the Deephaven Query Language. These new methods take an additional boolean argument that selects the desired behavior with respect to daylight saving time. See secondOfDay as an example.
The older ___ofDay methods are considered deprecated moving forward and will be removed within the next few releases.
Time binning methods now accept durations
Popular time binning methods upperBin and lowerBin now accept Java Durations as bin size and offset. These methods work the same way as the ones that take an integer number of nanoseconds:
from deephaven import empty_table
t = empty_table(10).update(
    [
        "Timestamp = '2024-05-01T09:00:00 ET' + i * MINUTE",
        "LowerBin2Min = lowerBin(Timestamp, 2 * MINUTE)",
        "UpperBin3Min = upperBin(Timestamp, 'PT3m')",
    ]
)
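For readers new to binning, the plain-Python sketch below illustrates the arithmetic with datetime and timedelta. It is a simplified model, not the engine's code: it assumes bins are aligned to the Unix epoch with no offset, and that a timestamp already on a bin boundary maps to itself in upper_bin. The function names lower_bin and upper_bin are illustrative stand-ins for lowerBin and upperBin.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def lower_bin(ts, size):
    """Round ts down to the start of its epoch-aligned bin."""
    return ts - (ts - EPOCH) % size


def upper_bin(ts, size):
    """Round ts up to the end of its bin (boundaries assumed to map to themselves)."""
    remainder = (ts - EPOCH) % size
    return ts if not remainder else ts + (size - remainder)


ts = datetime(2024, 5, 1, 13, 5, tzinfo=timezone.utc)
low = lower_bin(ts, timedelta(minutes=2))   # 13:04 UTC
high = upper_bin(ts, timedelta(minutes=3))  # 13:06 UTC
print(low.isoformat(), high.isoformat())
```

Passing a timedelta here plays the same role as passing a Duration such as 'PT3m' in the query string: the bin size is expressed as a length of time rather than a raw nanosecond count.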
Ticking Python client on PyPI
pydeephaven-ticking, the Python client API that works with ticking data, is now available on PyPI!
Engine handling of type hints
The NumPy null to NaN conversion is no longer done automatically for user-defined functions that use NumPy arrays in their type hints. Users now must perform this conversion if their data contains null values where NaN is correct, for instance when the data type in the array is np.double.
Additionally, the data types specified in type hints are checked against those used in the corresponding query string to ensure compatibility. If they are not, an error is thrown with information about any incompatibility. This ensures safe usage of type hints in functions called in query strings.
These breaking changes result in user-defined functions being significantly more performant when called in query strings.
Other breaking changes
- Parquet read/write Java APIs that used File objects have been replaced by ones that use strings instead.
- The internal public class io.deephaven.engine.table.impl.locations.local.KeyValuePartitionLayout has been renamed to io.deephaven.engine.table.impl.locations.local.FileKeyValuePartitionLayout.
- Blink tables previously could not use the special variable row indices i, ii, and k. This has been fixed, and these can all now be used in blink tables.
- In certain rare cases, the ungroup operation produced a null pointer exception where it shouldn’t have. This has been fixed.
- FilterComparison now works with string literals:
  FilterComparison.geq(ColumnName.of("ColumnName"), Literal.of("A"))
- A bug has been fixed that could cause searching for a value with the UI to not work as expected on sorted columns.
- Arrays of Instant strings are now properly handled by deephaven.dtypes.instant_array.
- Fixed a bug where duplicate headers could be passed between the ticking Python client and server, resulting in an error.
- move_columns could previously erroneously remove columns from tables. This has been fixed, and the method now inserts new columns at the specified locations without losing existing data.
- deephaven.pandas.to_pandas now supports the numpy_nullable option when the Pandas version is > 1.5.0.
- Time zone conversions in Python now handle more cases. Previously, it was possible for time zone conversions from Python types to Java ZoneId types to throw errors when the conversion should have worked. This has been fixed.
- A bug in deephaven.learn was found that could cause null pointer exceptions; it has been fixed.
Required Java version
If you run the embedded Deephaven server, it will raise an error on startup if your Java version is below the required minimum. The error will notify you that the outdated version is the cause.
Docker Compose v2
Deephaven officially supports Docker Compose v2 by default. All of the pre-built Docker Compose files we publish have been updated to use v2.
Our Slack community continues to grow! Join us there for updates and help with your queries.