Deephaven is the data backbone that powers hedge funds, banks, exchanges, and more. How do Deephaven experts leverage its power? I’ll show you through 24 examples of simple operations used regularly by some of the world’s top data experts. Don’t be fooled by their simplicity – they are the building blocks for the innovative analyses that large firms rely on time and time again.
The only prerequisites for this blog are some version of Deephaven Community Core and Python installed on your machine. You have three options for Deephaven:
- Run Deephaven from Docker
- pip-install Deephaven as a Python package
- Build and run Deephaven from source code
Any of the three options will get you started with the examples in this blog. In general, your choice of Deephaven deployment will depend on what you intend to do with it. Most users will find that the first or second option above are more than sufficient for their needs.
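If you go the pip route, for example, the server starts from within Python itself. Here’s a minimal sketch, assuming the deephaven-server package is installed and the default port is free; the port and JVM heap size are illustrative choices:
# assumes `pip install deephaven-server` has already been run
from deephaven_server import Server

server = Server(port=10000, jvm_args=["-Xmx4g"])
server.start()

# once the server is running, the deephaven package can be imported as usual
from deephaven import empty_table
t = empty_table(10).update(["X = i"])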
In most examples below, we will use the metric century bike ride dataset found in our examples repository. There are many more datasets there for you to play with, so check them out!
1. Read and write CSV files
CSV is one of the most common formats for storing tabular data. You can read CSV files that are local to your machine or on the web, and you can write any table to a local CSV file as well.
In Deephaven, you can read a CSV file into a table using read_csv. The CSV file can be local to your machine (make sure it’s visible to Deephaven) or remote on the web.
from deephaven import read_csv, write_csv
metric_century_local = read_csv("/data/examples/MetricCentury/csv/metriccentury.csv")
metric_century_remote = read_csv("https://media.githubusercontent.com/media/deephaven/examples/main/MetricCentury/csv/metriccentury.csv")
write_csv(metric_century_remote, "/data/metric_century.csv")
- metric_century_local
- metric_century_remote
2. Delete tables
There’s no point in keeping two copies of the metric century table. You can delete a table with del or by setting it to None.
metric_century = metric_century_remote
metric_century_remote = None
del metric_century_local
3. Get the first or last N rows of a table
You don’t always need all of the data. Sometimes you just want the first or last N rows of it.
n = 10
metric_century_head = metric_century.head(n)
metric_century_tail = metric_century.tail(n)
- metric_century_head
- metric_century_tail
4. Get the first or last N percent of a table
What if you want a certain percent of the data? This is very common in machine learning.
metric_century_head_pct = metric_century.head_pct(0.25)
metric_century_tail_pct = metric_century.tail_pct(0.10)
- metric_century_head_pct
- metric_century_tail_pct
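For instance, a common machine learning workflow splits a table into training and test sets. A sketch, where the 75/25 ratio is just an arbitrary choice:
# hypothetical 75/25 train/test split; the ratio is illustrative
train_data = metric_century.head_pct(0.75)
test_data = metric_century.tail_pct(0.25)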
5. Get table metadata
Deephaven tables can store a wide variety of data types, which gives you complete control of your data. Deephaven uses Java data types on the backend, and the metadata table shows you exactly what you’ve got.
metric_century_meta = metric_century.meta_table
6. Drop columns
Just as you can omit rows from a table, you can omit columns.
cols_to_drop = ["Watts", "LatitudeDegrees", "LongitudeDegrees"]
metric_century_cols_dropped = metric_century.drop_columns(cols=cols_to_drop)
- metric_century_cols_dropped
7. Get the number of rows in a table
Tables have a size attribute.
print(metric_century.size)
8. Filter data based on criteria
You can filter data out of tables on a per-column basis. Multiple conditions can be present in the same operation.
metric_century_where_moving = metric_century.where(["SpeedKPH > 0"])
metric_century_where_working = metric_century.where(["Cadence > 0", "Watts > 0"])
- metric_century_where_moving
- metric_century_where_working
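The filters above are conjunctive: every condition must be true for a row to pass. If you want OR logic instead, where_one_of keeps rows that match any of the given filters. A sketch, with arbitrary threshold values:
# keep rows where EITHER condition holds; the thresholds are illustrative
metric_century_where_either = metric_century.where_one_of(["SpeedKPH > 30", "Watts > 200"])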
9. Select a subset of table data
As previously mentioned, you don’t always need all of the data in a table. But you also don’t always want to explicitly drop columns. Sometimes you only want to select a subset. select and view do the trick. For an explanation of the difference between the two and which to use for your queries, see Choose the right selection method for your query.
metric_century_select = metric_century.select(["Time", "DistanceMeters", "SpeedKPH", "AltitudeMeters"])
metric_century_view = metric_century.view(["Time", "Cadence", "Watts", "HeartRate"])
- metric_century_select
- metric_century_view
10. Update a table
Deephaven has multiple ways to update a table: update, update_view, and lazy_update. For an explanation of the difference between the three and when to use each one, see Choose the right selection method for your query.
metric_century_update = metric_century.update(["SpeedMPS = SpeedKPH * 1000 / 3600"])
metric_century_updateview = metric_century_update.update_view(["AccelMPS2 = SpeedMPS_[i] - SpeedMPS_[i - 1]"])
metric_century_lazyupdate = metric_century_updateview.lazy_update(["IsAccelerating = AccelMPS2 > 0 ? `Yes` : `No`"])
- metric_century_update
- metric_century_updateview
- metric_century_lazyupdate
11. Select unique values in columns
Want to see how many unique values exist in a column? Or how many unique combinations of values exist across multiple columns? select_distinct makes it simple.
unique_heartrates = metric_century.select_distinct(["HeartRate"])
unique_heartrate_cadences = metric_century.select_distinct(["HeartRate", "Cadence"])
- unique_heartrates
- unique_heartrate_cadences
12. Reverse table order
Table ordering can easily be reversed. This is particularly useful in real-time tables, where reversing a table causes the newest data to appear at its top.
metric_century_reversed = metric_century.reverse()
13. Rename columns
Column names aren’t always indicative of a column’s true meaning. In such a case, it’s appropriate to rename them accordingly.
metric_century_renamed = metric_century.rename_columns(["RevsPerMinute = Cadence", "HeartRateBPM = HeartRate"])
14. Sort columns
Columns in tables can be sorted in ascending or descending order. Sort direction can even vary from column to column in the same table.
from deephaven import SortDirection
sorted_ascending = metric_century.sort(["SpeedKPH", "Watts"])
sorted_descending = metric_century.sort_descending(["AltitudeMeters"])
sorted_both = metric_century.sort(["SpeedKPH", "Watts"], [SortDirection.ASCENDING, SortDirection.DESCENDING])
- sorted_ascending
- sorted_descending
- sorted_both
15. Time bin
It’s often useful to bin observations into buckets of time, such as when downsampling. Those buckets can be any number of seconds, minutes, hours, etc.
binned_by_time = metric_century.update_view(["OneMinLowerBin = lowerBin(Time, MINUTE)", "FiveMinUpperBin = upperBin(Time, 5 * MINUTE)"])
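Once the data is binned, downsampling is just an aggregation keyed on the bin column. A sketch using avg_by; the columns averaged here are arbitrary choices:
# downsample to one row per minute by averaging within each one-minute bin
downsampled = binned_by_time.view(["OneMinLowerBin", "SpeedKPH", "HeartRate"]).avg_by(["OneMinLowerBin"])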
16. Color format columns and rows
It’s useful, especially in real-time applications, to color columns and/or rows. Conditional coloring can make values of interest pop off the screen and be immediately obvious. Columns, rows, combinations, and even heat maps can be colored in either the background or foreground. A full list of colors can be found here.
metric_century_colorful = metric_century.format_columns(["Cadence = BLUE", "Watts = VIVID_YELLOW", "HeartRate = heatmap(HeartRate, 125, 145, LAWNGREEN, CRIMSON)"])
column_colors_conditional = metric_century.format_column_where("Cadence", "SpeedKPH = 0", "DARKORANGE")
row_colors_conditional = metric_century.format_row_where("SpeedKPH > 10", "fg(GREEN)")
- metric_century_colorful
- column_colors_conditional
- row_colors_conditional
17. Replay static data in real time
Real-time applications often need proofs-of-concept before they can be used in live systems. A great way to test whether your query will work in real time is to replay a static table. The static table must have timestamps in Deephaven’s DateTime format.
from deephaven.replay import TableReplayer
from deephaven.time import to_datetime
start_time = to_datetime("2019-08-25T15:34:55Z")
end_time = to_datetime("2019-08-25T17:10:22Z")
replayer = TableReplayer(start_time, end_time)
metric_century_replayed = replayer.add_table(metric_century, "Time")
replayer.start()
18. Aggregate table data
An aggregation breaks data into subgroups and performs calculations on the grouped data. It has applications far and wide. Aggregations can be dedicated (a single calculation on subgroups) or combined (multiple calculations on subgroups).
from deephaven import agg
count_by_heartrate = metric_century.count_by("Count", by=["HeartRate"])
multiple_aggs = metric_century.agg_by([agg.avg(cols=["AvgSpeed = SpeedKPH", "AvgCadence = Cadence"])], by=["HeartRate"])
- count_by_heartrate
- multiple_aggs
19. Read and write Parquet files
At the beginning of this blog, I showed how to read and write CSV files in Deephaven. Parquet stores the same kind of tabular data, but in a compressed columnar format, so it’s far more space-efficient.
Want to store data efficiently? Read CSV files from the web and save them locally in Parquet format. You can store a ton more data by using Parquet over CSV.
from deephaven.parquet import read, write
from deephaven import read_csv
crypto_day1 = read_csv("https://media.githubusercontent.com/media/deephaven/examples/main/CryptoCurrencyHistory/CSV/crypto_sept7.csv")
crypto_day2 = read_csv("https://media.githubusercontent.com/media/deephaven/examples/main/CryptoCurrencyHistory/CSV/crypto_sept8.csv")
write(crypto_day1, "/data/crypto_sept7.parquet", compression_codec_name="GZIP")
write(crypto_day2, "/data/crypto_sept8.parquet", compression_codec_name="GZIP")
crypto_day1, crypto_day2 = None, None
crypto_sept7 = read("/data/crypto_sept7.parquet")
crypto_sept8 = read("/data/crypto_sept8.parquet")
20. Merge tables
Merging two or more tables vertically stacks them on top of one another. In the case of the two Parquet files we just read into memory, they contain the same set of crypto data for two different days. Thus, they can be easily merged together.
from deephaven import merge
crypto_merged = merge([crypto_sept7, crypto_sept8])
21. Create tables from scratch
Deephaven has three ways to create tables from scratch:
- empty_table - Creates an empty table with a fixed number of rows and no columns.
- new_table - Creates a new table with column definitions based on Java primitive data types such as int, double, and String.
- time_table - Creates a real-time table with a single column of Deephaven timestamps. A new row is added once every N seconds, where N is specified.
from deephaven import empty_table, new_table, time_table
from deephaven.column import int_col, double_col, string_col
source1 = new_table([
int_col("Index", [1, 2, 3, 4, 5]),
double_col("Value", [1/2, 2/3, 3/4, 4/5, 5/6]),
string_col("Name", ["Art", "Cade", "Eric", "Sam", "Will"])
])
source2 = empty_table(5).update_view(["Index = i", "Sine = sin(0.1 * Index)"])
tt1 = time_table("PT00:00:01").lazy_update(["X = i % 4"])
tt2 = time_table("PT00:00:02").lazy_update(["Y = i % 3"])
22. Join tables
Joining tables is different from merging tables. In the example below, I use natural_join. There are several joins available; for more information about which to use for your queries, see our guide.
sources_joined_on2 = source1.natural_join(table=source2, on="Index")
sources_joined_on1 = source2.natural_join(table=source1, on="Index")
- sources_joined_on2
- sources_joined_on1
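Note that natural_join keeps every row of the left table and uses nulls where there’s no match. If you only want rows that match in both tables, join performs an inner-style join instead. A sketch:
# inner-style join: keeps only rows whose Index appears in both tables
sources_joined_inner = source1.join(table=source2, on="Index")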
23. Join time series data
Real-time tables typically have timestamps, as seen in tt1 and tt2 in the Create tables from scratch section. In this case, an as-of join or a reverse as-of join is typically used.
tt2.aj(table=tt1, on=["Timestamp"])
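The reverse as-of join is the mirror image: each row is matched with the closest row at or after its timestamp. A sketch using raj:
# reverse as-of join: match each tt2 row with the nearest tt1 row
# whose Timestamp is greater than or equal to it
tt2.raj(table=tt1, on=["Timestamp"])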
24. Create a real-time table
Why not create your own real-time data? It’s easy with DynamicTableWriter. My query below is only meant to illustrate how to use it – the data is simple. This method of creating real-time data has many applications, including real-time crypto feeds, classifying Slack messages, and much more.
from deephaven import DynamicTableWriter
from deephaven import dtypes as dht
import threading, time, numpy as np
col_defs = {"X": dht.double, "Y": dht.double}
table_writer = DynamicTableWriter(col_defs)
real_time_table = table_writer.table
def write_to_table():
    for i in range(1000):
        val = 0.1 * i
        table_writer.write_row(val, np.sin(val))
        time.sleep(0.1)
thread = threading.Thread(target=write_to_table)
thread.start()
Our documentation has loads of information, examples, and explanations to help you quickly become a Deephaven ninja. Reach out to us on Slack with feedback, questions, or more!