Deephaven Community Core version 0.36.0 is available now, with several new features, improvements, bug fixes, and more. We’ve rounded up the highlights below.
Native table iteration in Python
Four new table operations are now available that allow you to iterate over table data in Python efficiently. They are iter_dict, iter_tuple, iter_chunk_dict, and iter_chunk_tuple.
The first two iterate over the table one row at a time, while the latter two iterate over chunks of rows. All four methods use efficient chunked operations on the backend and return generators to minimize data copies and memory usage, making them ideal for large tables. Take a look at how they're used below:
from deephaven import empty_table

source = empty_table(4096).update(["I=i", "D=(double)i", "S=String.valueOf(i)"])

n_rows_dict = 0
n_rows_tuple = 0
n_chunks_dict = 0
n_chunks_tuple = 0

for row_dict in source.iter_dict():
    n_rows_dict += 1

for row_tuple in source.iter_tuple("D"):
    n_rows_tuple += 1

for chunk_dict in source.iter_chunk_dict():
    n_chunks_dict += 1

for chunk_tuple in source.iter_chunk_tuple(chunk_size=1024):
    n_chunks_tuple += 1

print(f"Rows: {n_rows_dict}, {n_rows_tuple}, Chunks: {n_chunks_dict}, {n_chunks_tuple}")
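The row-oriented and chunk-oriented methods trade convenience for throughput. As a rough illustration of why chunking helps (this is generic Python over plain lists, not Deephaven's implementation), a chunked generator yields column slices and so amortizes per-row overhead across many rows:

```python
# Generic illustration of row vs. chunked iteration over columnar data.
# This is NOT Deephaven's implementation -- just the general pattern the
# new iter_* methods follow: generators yielding views of column data.

def iter_rows(columns: dict):
    """Yield one dict per row (convenient, but per-row overhead)."""
    n = len(next(iter(columns.values())))
    for i in range(n):
        yield {name: col[i] for name, col in columns.items()}

def iter_chunks(columns: dict, chunk_size: int = 1024):
    """Yield dicts of column slices (amortizes overhead across rows)."""
    n = len(next(iter(columns.values())))
    for start in range(0, n, chunk_size):
        yield {name: col[start:start + chunk_size]
               for name, col in columns.items()}

data = {"I": list(range(4096)), "D": [float(i) for i in range(4096)]}
n_rows = sum(1 for _ in iter_rows(data))
n_chunks = sum(1 for _ in iter_chunks(data, chunk_size=1024))
print(n_rows, n_chunks)  # 4096 4
```

The same trade-off applies to the real methods: prefer the chunked variants when a table is large and the work per row is small.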
Multi-table merged listeners
Prior to 0.36.0, a table listener could only listen to a single table at a time. If you wanted to listen to multiple tables, you had two options: use multiple listeners or combine the tables. Merged listeners now allow you to listen to an arbitrary number of tables, giving you the added, modified, and removed rows from each of them on every update cycle. Here's how you can listen to multiple tables at once:
from deephaven.table_listener import merged_listen
from deephaven import time_table

t1 = time_table("PT2s").update("RowNum = i")
t2 = time_table("PT3s").update("X = randomDouble(0, 10)")
t3 = time_table("PT5s").update("Y = randomBool()")

def listener_function(updates, is_replay):
    if tu1 := updates[t1]:
        added = tu1.added()
        row = added["RowNum"].item()
        print(f"t1: {row}")
    if tu2 := updates[t2]:
        added = tu2.added()
        x = added["X"].item()
        print(f"t2: {x}")
    if tu3 := updates[t3]:
        added = tu3.added()
        y = added["Y"].item()
        print(f"t3: {y}")

handle = merged_listen([t1, t2, t3], listener_function)
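Conceptually, a merged listener is a fan-in: several sources contribute to one callback, which receives a per-source view of each update cycle. The sketch below shows that general pattern in plain Python; the class and method names are illustrative, not Deephaven internals:

```python
# Generic fan-in pattern behind a merged listener (illustrative only,
# not Deephaven's implementation). Sources report rows during a cycle;
# at the end of the cycle, one callback sees all of them keyed by source.

class MergedListener:
    def __init__(self, sources, callback):
        self.sources = list(sources)
        self.callback = callback
        self.pending = {s: [] for s in self.sources}

    def on_update(self, source, rows):
        """A source reports its added rows for the current cycle."""
        self.pending[source].extend(rows)

    def end_cycle(self):
        """Deliver one combined update, then reset for the next cycle."""
        updates = {s: rows for s, rows in self.pending.items()}
        self.callback(updates)
        self.pending = {s: [] for s in self.sources}

seen = []
listener = MergedListener(["t1", "t2"], seen.append)
listener.on_update("t1", [1, 2])
listener.on_update("t2", [3])
listener.end_cycle()
print(seen)  # [{'t1': [1, 2], 't2': [3]}]
```

This is why the callback above indexes updates by table: every table in the merged group gets an entry each cycle, which may be empty if that table didn't tick.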
Table definitions in Python
Want to export a table definition from Python? Tables now have a definition
attribute that returns the table's definition as a TableDefinition:
from deephaven.table import TableDefinition
from deephaven import empty_table
source = empty_table(10).update(["X = i", "Y = randomDouble(5, 10)"])
print(source.definition)
Compare tables more easily
A new table_diff function makes comparing tables easier. Use it to find differences between two tables, such as mismatched columns and sizes, with a cap on how many differences are reported. Here's how it's used:
from deephaven.table import table_diff
from deephaven import empty_table
t1 = empty_table(10).update(["X = i", "Y = randomDouble(0, 10)"])
t2 = empty_table(3).update(["Z = randomBool()", "M = `This is a string!`"])
print(table_diff(t1, t2, max_diffs=1))
print(table_diff(t1, t2, max_diffs=5))
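To make the max_diffs idea concrete, here is a toy diff over table summaries in plain Python. The schema_diff function and the dict shape it takes are hypothetical, meant only to illustrate how a capped diff stops early rather than enumerating everything:

```python
# Illustrative sketch of a capped table diff (NOT Deephaven's table_diff):
# compare sizes and column schemas, reporting at most max_diffs differences.

def schema_diff(t1: dict, t2: dict, max_diffs: int = 10) -> list:
    """t1/t2 are {'columns': {name: type}, 'size': int} summaries."""
    diffs = []
    if t1["size"] != t2["size"]:
        diffs.append(f"size: {t1['size']} vs {t2['size']}")
    for name in t1["columns"].keys() | t2["columns"].keys():
        if len(diffs) >= max_diffs:
            break  # stop early once the cap is reached
        a, b = t1["columns"].get(name), t2["columns"].get(name)
        if a != b:
            diffs.append(f"column {name}: {a} vs {b}")
    return diffs[:max_diffs]

t1 = {"columns": {"X": "int", "Y": "double"}, "size": 10}
t2 = {"columns": {"Z": "bool", "M": "str"}, "size": 3}
print(schema_diff(t1, t2, max_diffs=1))  # ['size: 10 vs 3']
print(len(schema_diff(t1, t2, max_diffs=5)))  # 5
```

As with the real function, raising the cap surfaces more differences; a low cap is a cheap way to answer "are these tables the same?" without a full report.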
Parquet and S3
Two new features have been added to Deephaven’s Parquet integration:
- It now supports reading Parquet files from S3 that include metadata files.
- It now supports writing Parquet files to S3.
pip-installed Deephaven CLI
In release 0.34, a command-line interface was added for pip-installed Deephaven that always opened a browser window automatically. The boolean config flags --no-browser and --browser have now been added to control this behavior. The default behavior is unchanged.
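Paired --browser/--no-browser spellings are a standard CLI convention. As a generic sketch of how such paired flags work (this is not Deephaven's CLI source; the program name is made up), Python's argparse generates both from a single argument:

```python
# Generic sketch of paired --flag / --no-flag options (not Deephaven's
# CLI code). argparse.BooleanOptionalAction (Python 3.9+) derives both
# spellings from one argument; the default preserves prior behavior.
import argparse

parser = argparse.ArgumentParser(prog="example-server")
parser.add_argument(
    "--browser",
    action=argparse.BooleanOptionalAction,
    default=True,  # default unchanged: open a browser window
    help="open a browser window on startup",
)

print(parser.parse_args([]).browser)                # True (default)
print(parser.parse_args(["--no-browser"]).browser)  # False
```

The key design point is that the default carries the old behavior, so existing scripts keep working while the new flags opt out of it.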
Iceberg
- Deephaven can now get a table definition for an Iceberg table without having to read the table first.
- Iceberg columns whose names are invalid in Deephaven are now automatically renamed to follow Deephaven conventions when consumed into tables.
- Iceberg snapshot tables now produce Timestamp columns of Instant data type.
Performance
- Improved performance and memory use of naturalJoin in incremental cases where there are no responsive rows in either table.
- Increased parallelism in partition-aware source tables, as well as an option to assume partitions are non-empty.
- Parallel table snapshots, which can improve performance particularly in cases when reading tables with many columns from S3.
Dependencies
- Upgraded to jedi autocomplete 0.19.1. See the jedi changelog for details.
Client APIs
- The Java client now has a gRPC user agent, which includes relevant version information by default.
Server-side APIs: Python
- Liveness scopes can now manage table listeners in Python.
- Errors raised by table listeners in Python now properly notify any applications used by the server.
Server-side APIs: General
- Sorting dictionary-encoded string columns with null values will now work as expected.
- URI path conversions now work correctly on Windows.
- Floating point comparisons are now consistent with floating point hash code standards.
- Java and Python wheel artifacts now have the same dependencies.
- Reading from Parquet with a millis- or micros-since-epoch timestamp column no longer fails with a null pointer exception.
Client APIs
- A bug in the Go and JS client authentication that could erroneously require entering login information twice has been fixed.
Parquet
- Parquet files with missing dictionary page offsets are now read correctly.
- Deephaven’s Parquet reader now correctly handles dictionary-encoded strings in Parquet files.
Kafka
- Deephaven’s Kafka JSON specification now correctly propagates null values for integer fields.
Our Slack community continues to grow! Join us there for updates and help with your queries.