24 powerful Deephaven tools for data analysis | Deephaven

24 powerful Deephaven tools for data analysis | Deephaven


Deephaven is the data backbone that powers hedge funds, banks, exchanges, and more. How do Deephaven experts leverage its power? I’ll show how through 23 examples of simple operations used regularly by some of the world’s top data experts. Don’t be fooled by their simplicity – they are the building blocks for innovative analysis that large firms rely on time and time again.

The only prerequisites for this blog are some version of Deephaven Community Core and Python installed on your machine. You have three options for Deephaven:

Any of the three options will get you started with the examples in this blog. In general, your choice of Deephaven deployment will depend on what you intend to do with it. Most users will find that the first or second option above are more than sufficient for their needs.

In most examples below, we will use the metric century bike ride dataset found in our examples repository. There are many more datasets there for you to play with, so check them out!

1. Read and write CSV files

Deephaven tables and CSV files are both column oriented. You can read CSV files that are local to your machine or on the web. You can write any table to a local CSV file as well.

In Deephaven, you can read a CSV file into a table using read_csv. The CSV file can be local to your machine (make sure it’s visible to Deephaven) or remote on the web.

from deephaven import read_csv, write_csv

metric_century_local = read_csv("/data/examples/MetricCentury/csv/metriccentury.csv")
metric_century_remote = read_csv("https://media.githubusercontent.com/media/deephaven/examples/main/MetricCentury/csv/metriccentury.csv")

write_csv(metric_century_remote, "/data/metric_century.csv")

2. Delete tables

There is no point to having two copies of the metric century table. You can delete them with del or = None.

metric_century = metric_century_remote

metric_century_remote = None
del(metric_century_local)

3. Get the first or last N rows of a table

You don’t always need all of the data. Sometimes you just want the first or last N rows of it.

n = 10 

metric_century_head = metric_century.head(n)
metric_century_tail = metric_century.tail(n)

4. Get the first or last N percent of a table

What if you want a certain percent of the data? This is very common in machine learning.

metric_century_head_pct = metric_century.head_pct(0.25)
metric_century_tail_pct = metric_century.tail_pct(0.10)

5. Get table metadata

Deephaven tables can store a wide variety of data types, which gives you complete control of your data. Deephaven uses Java data types on the backend, and the metadata table shows you exactly what you’ve got.

metric_century_meta = metric_century.meta_table

6. Drop columns

You can omit columns from tables similar to how you omit rows.

cols_to_drop = ["Watts", "LatitudeDegrees", "LongitudeDegrees"]
metric_century_cols_dropped = metric_century.drop_columns(cols=cols_to_drop)

7. Get the number of rows in a table

Tables have a size attribute.

print(metric_century.size)

8. Filter data based on criteria

You can filter data out of tables on a per-column basis. Multiple conditions can be present in the same operation.

metric_century_where_moving = metric_century.where(["SpeedKPH > 0"])
metric_century_where_working = metric_century.where(["Cadence > 0", "Watts > 0"])

9. Select a subset of table data

As previously mentioned, you don’t always need all of the data in a table. But you also don’t always want to explicitly drop columns. Sometimes you only want to select a subset. select and view do the trick. For an explanation on the difference between the two and which to use for your queries, see Choose the right selection method for your query.

metric_century_select = metric_century.select(["Time", "DistanceMeters", "SpeedKPH", "AltitudeMeters"])
metric_century_view = metric_century.view(["Time", "Cadence", "Watts", "HeartRate"])

10. Update a table

Deephaven has multiple ways to update a table: update, update_view, and lazy_update. For an explanation of the difference between the three and when to use each one, see Choose the right selection method for your query.

metric_century_update = metric_century.update(["SpeedMPS = SpeedKPH * 1000 / 3600"])
metric_century_updateview = metric_century_update.update_view(["AccelMPS2 = SpeedMPS_[i] - SpeedMPS_[i - 1]"])
metric_century_lazyupdate = metric_century_updateview.lazy_update(["IsAccelerating = AccelMPS2 > 0 ? `Yes` : `No`"])

11. Select unique values in columns

Want to see how many unique values exist in a column? Or how many unique combinations of values exist in more than one column? select_distinct makes it simple.

unique_heartrates = metric_century.select_distinct(["HeartRate"])
unique_heartrate_cadences = metric_century.select_distinct(["HeartRate", "Cadence"])

12. Reverse table order

Table ordering can easily be reversed. This is particularly useful in real-time tables, where reversing a table causes the newest data to appear at its top.

metric_century_reversed = metric_century.reverse()

13. Rename columns

Column names aren’t always indicative of a column’s true meaning. In such a case, it’s appropriate to rename them accordingly.

metric_century_renamed = metric_century.rename_columns(["RevsPerMinute = Cadence", "HeartRateBPM = HeartRate"])

14. Sort columns

Columns in tables can be sorted in ascending or descending order. Sort direction can even vary from column to column in the same table.

from deephaven import SortDirection

sorted_ascending = metric_century.sort(["SpeedKPH", "Watts"])
sorted_descending = metric_century.sort_descending(["AltitudeMeters"])
sorted_both = metric_century.sort(["SpeedKPH", "Watts"], [SortDirection.ASCENDING, SortDirection.DESCENDING])

15. Time bin

It’s often useful to bin observations into buckets of time, such as when downsampling. Those buckets can be any amount of seconds, minutes, hours, etc.

binned_by_time = metric_century.update_view(["OneMinLowerBin = lowerBin(Time, MINUTE)", "FiveMinUpperBin = upperBin(Time, 5 * MINUTE)"])

16. Color format columns and rows

It’s useful, especially in real-time applications, to color columns and/or rows. Conditional coloring can make values of interest pop off the screen and be immediately obvious. Columns, rows, combinations, and even heat maps can be colored in either the background or foreground. A full list of colors can be found here.

metric_century_colorful = metric_century.format_columns(["Cadence = BLUE", "Watts = VIVID_YELLOW", "HeartRate = heatmap(HeartRate, 125, 145, LAWNGREEN, CRIMSON)"])
column_colors_conditional = metric_century.format_column_where("Cadence", "SpeedKPH = 0", "DARKORANGE")
row_colors_conditional = metric_century.format_row_where("SpeedKPH > 10", "fg(GREEN)")

17. Replay static data in real time

Real-time applications often need proofs-of-concept before they can be used in live systems. A great way to test if your query will work in real time or not is to replay a static table. The static table must have time stamps in Deephaven’s DateTime format.

from deephaven.replay import TableReplayer
from deephaven.time import to_datetime

start_time = to_datetime("2019-08-25T15:34:55Z")
end_time = to_datetime("2019-08-25T17:10:22Z")

replayer = TableReplayer(start_time, end_time)
metric_century_replayed = replayer.add_table(metric_century, "Time")
replayer.start()

18. Aggregate table data

An aggregation breaks data into subgroups and performs calculations on the grouped data. It has applications far and wide. Aggregations can be dedicated (a single calculation on subgroups) or combined (multiple calculations on subgroups).

from deephaven import agg as agg

count_by_heartrate = metric_century.count_by("Count", by=["HeartRate"])
multiple_aggs = metric_century.agg_by([agg.avg(cols=["AvgSpeed = SpeedKPH", "AvgCadence = Cadence"])], by=["HeartRate"])

19. Read and write Parquet files

At the beginning of this blog, I showed how to read and write CSV files in Deephaven. Parquet is just like CSV, but it’s compressed, so it’s more efficient.

Want to efficiently store data? Read CSV files from the web and save them locally in Parquet format. You can store a ton more data by using Parquet over CSV.

from deephaven.parquet import read, write
from deephaven import read_csv

crypto_day1 = read_csv("https://media.githubusercontent.com/media/deephaven/examples/main/CryptoCurrencyHistory/CSV/crypto_sept7.csv")
crypto_day2 = read_csv("https://media.githubusercontent.com/media/deephaven/examples/main/CryptoCurrencyHistory/CSV/crypto_sept8.csv")

write(crypto_day1, "/data/crypto_sept7.parquet", compression_codec_name="GZIP")
write(crypto_day2, "/data/crypto_sept8.parquet", compression_codec_name="GZIP")

crypto_day1, crypto_day2 = None, None

crypto_sept7 = read("/data/crypto_sept7.parquet")
crypto_sept8 = read("/data/crypto_sept8.parquet")

20. Merge tables

Merging two or more tables vertically stacks them on top of one another. In the case of the two Parquet files we just read into memory, they contain the same set of crypto data for two different days. Thus, they can be easily merged together.

from deephaven import merge

crypto_merged = merge([crypto_sept7, crypto_sept8])

21. Create tables from scratch

Deephaven has three ways to create tables from scratch:

  • empty_table
  • new_table
    • Creates a new table with column definitions based on Java primitive data types such as int, double, and String.
  • time_table
    • Creates a real-time table with a single column of Deephaven timestamps. A new row is added once every N seconds, where N is specified.
from deephaven import empty_table, new_table, time_table
from deephaven.column import int_col, double_col, string_col

source1 = new_table([
int_col("Index", [1, 2, 3, 4, 5]),
double_col("Value", [1/2, 2/3, 3/4, 4/5, 5/6]),
string_col("Name", ["Art", "Cade", "Eric", "Sam", "Will"])
])
source2 = empty_table(5).update_view(["Index = i", "Sine = sin(0.1 * Index)"])
tt1 = time_table("PT00:00:01").lazy_update(["X = i % 4"])
tt2 = time_table("PT00:00:02").lazy_update(["Y = i % 3"])

22. Join tables

Joining tables is different from merging tables. In the example below, I use natural_join. There are a several joins available; for more information about which to use for your queries, see our guide.

sources_joined_on2 = source1.natural_join(table=source2, on="Index")
sources_joined_on1 = source2.natural_join(table=source1, on="Index")

23. Join time series data

Real-time tables typically have timestamps, ass seen in tt1 and tt2 in the Create tables from scratch section. In this case, an as-of join or a reverse as-of join is typically used.

tt2.aj(table=tt1, on=["Timestamp"])

24. Create a real-time table

Why not create your own real-time data? It’s easy with DynamicTableWriter. My query below is only meant to illustrate how to use it – the data is simple. This method of creating real-time data, however, has many applications including real-time crypto feeds, classifying slack messages, and much more.

from deephaven import DynamicTableWriter
from deephaven import dtypes as dht

import threading, time, numpy as np

col_defs = {"X": dht.double, "Y": dht.double}

table_writer = DynamicTableWriter(col_defs)
real_time_table = table_writer.table

def write_to_table():
for i in range(1000):
val = 0.1 * i
table_writer.write_row(val, np.sin(val))
time.sleep(0.1)

thread = threading.Thread(target=write_to_table)
thread.start()

Our documentation has loads of information, examples, and explanations to help you quickly become a Deephaven ninja. Reach out to us on Slack with feedback, questions, or more!



Source link
lol

By stp2y

Leave a Reply

Your email address will not be published. Required fields are marked *

No widgets found. Go to Widget page and add the widget in Offcanvas Sidebar Widget Area.