Perform fully customizable rolling aggregations | Deephaven

Perform fully customizable rolling aggregations | Deephaven


Deephaven’s version 0.33 release introduced several new features and improvements. Two methods that have the potential to save users a lot of time are the new Rolling Formula update_by operations: rolling_formula_tick and rolling_formula_time. These operations support user-defined formulas, which can be used to calculate rolling aggregations over a specified window of rows or time.

Support for user-defined formulas opens up a whole new world of versatility for update_by.

This is exciting for a few reasons.

First of all, support for user-defined formulas opens up a whole new world of versatility for update_by. Rather than being limited to a single update_by operation, such as a simple Rolling Sum, users can define a custom operation that is as complex as their use cases require.

Second, Rolling Formula operations perform significantly better than calling a Rolling Group operation followed by update, which is how a user would have created a custom rolling aggregation in Deephaven version 0.32 and earlier.

Let’s put the new rolling_formula_tick operation to the test and see just how much faster it is than the old method. In the following example, we will create a table that will run a test a given number of times, time the runs, throw out the fastest and slowest result, and print the average for each test. We will then run the test with various table sizes and unique values.


from deephaven.updateby import rolling_group_tick, rolling_formula_tick, rolling_sum_tick
from deephaven import empty_table
from time import time

n_rows = 1_000_000
n_unique_vals = 10_000
n_tests = 5
times_group_update = [0] * n_tests
times_formula = [0] * n_tests
times_sum = [0] * n_tests

for test_num in range(n_tests):
test_table = empty_table(n_rows).update(f"IntCol = randomInt(0, {n_unique_vals})")


start = time()
t_grouped_updated = test_table.update_by(ops=rolling_group_tick(cols="X = IntCol", rev_ticks=1000), by=None).update("X = sum(X)")
end = time()
times_group_update[test_num] = end - start


start = time()
t_formula = test_table.update_by(ops=rolling_formula_tick(formula="sum(each)", formula_param="each", cols="X = IntCol", rev_ticks=1000), by=None)
end = time()
times_formula[test_num] = end - start


times_group_update.sort()
times_group_update = times_group_update[1:-1]
times_formula.sort()
times_formula = times_formula[1:-1]

print("Rolling Group + Update")
print(f"tMin: {min(times_group_update):.4f} seconds.")
print(f"tMax: {max(times_group_update):.4f} seconds.")
print(f"tAvg: {(sum(times_group_update) / len(times_group_update)):.4f} seconds.")

print("Rolling Formula")
print(f"tMin: {min(times_formula):.4f} seconds.")
print(f"tMax: {max(times_formula):.4f} seconds.")
print(f"tAvg: {(sum(times_formula) / len(times_formula)):.4f} seconds.")

rolling_formula_tick was consistently much faster than rolling_group_tick followed by update – nearly four times faster.

This is because with rolling_formula_tick, data in the window is loaded and available for quick access by the provided formula. However, with both rolling_group_tick and rolling_formula_tick, the formula is applied to every value in the window for every ouput row, which means the performance is 0(m*n), where m is the average size of the window and n is the number of output rows.

Rolling Formula operations are a significant improvement over Rolling Group operations followed by update. However, Rolling Formula operations still aren’t the fastest way to perform the operation in the previous example. When a dedicated aggregation exists, it’s fastest to use the dedicated aggregation – in this case, rolling_sum_tick. The code block is the same as shown above, except we replace our previous test lines with a test of rolling_sum_tick.

To view the entire example in one code block, click here.

from deephaven.updateby import rolling_group_tick, rolling_formula_tick, rolling_sum_tick
from deephaven import empty_table
from time import time

n_rows = 1_000_000
n_unique_vals = 10_000
n_tests = 5
times_group_update = [0] * n_tests
times_formula = [0] * n_tests
times_sum = [0] * n_tests

for test_num in range(n_tests):
test_table = empty_table(n_rows).update(f"IntCol = randomInt(0, {n_unique_vals})")


start = time()
t_sum = test_table.update_by(ops=rolling_sum_tick(cols="X = IntCol", rev_ticks=1000), by=None)
end = time()
times_sum[test_num] = end - start


times_sum.sort()
times_sum = times_sum[1:-1]

print("Rolling Sum")
print(f"tMin: {min(times_sum):.4f} seconds.")
print(f"tMax: {max(times_sum):.4f} seconds.")
print(f"tAvg: {(sum(times_sum) / len(times_sum)):.4f} seconds.")

img

That’s 25x faster than rolling_formula_tick!

Dedicated aggregations are optimized to perform the given operation as efficiently as possible and should be used whenever possible. If there is no dedicated aggregation that meets your needs, Rolling Formula operations are the next best option.

Our Community documentation has all of the resources you need to become a Deephaven power user. Our Slack community continues to grow, and we’d love to have you join us! If you have any questions, comments, or suggestions, please reach out to us there.



Source link
lol

By stp2y

Leave a Reply

Your email address will not be published. Required fields are marked *

No widgets found. Go to Widget page and add the widget in Offcanvas Sidebar Widget Area.