Store historical crypto data without file bloat | Deephaven


Crypto data is everywhere. As of March 2022, there were over 18,000 cryptocurrencies in existence. Like many other data sources, crypto data is typically stored in CSV format. Unfortunately, this format is outdated and wastes disk space. Enter Parquet, the lesser-known data storage format that eliminates the need for bloated CSV files. Use Parquet over CSV and have more space for more data, which means more crypto at your fingertips.

Everywhere I look online, whether it’s Kaggle, Nasdaq Data Link, or another publicly available source of historical crypto data, everything is presented in CSV format. That’s nice for previewing the data, but I don’t care much about that. Previews show a few rows of data and column names, and that’s usually it. Why do these places not give me the option to download the data in Parquet?

I don’t have the answer, but it feels like an oversight: Parquet files are much smaller, so they take less time to download. For now, I’ll just make do with what’s available. With Deephaven, I can turn those bloated CSV files into Parquet with ease.

I want data, and I want to lose the CSV bloat. I’ll grab a CSV file from the internet and load the data into a Deephaven table.

from deephaven import read_csv

coin_data = read_csv("https://media.githubusercontent.com/media/deephaven/examples/main/CryptoCurrencyHistory/CSV/CryptoTrades_20210922.csv")

Let’s start by writing the data to a CSV file to see how much disk space that requires.

from deephaven import write_csv

write_csv(coin_data, "/data/coin_data.csv")

The CSV file comes to 70 megabytes, which seems like a bit much for a single file. Let’s see how much space I can save by writing this data to a Parquet file instead.

from deephaven.parquet import write

write(coin_data, "/data/CryptoTrades_20210922.parquet", compression_codec_name="GZIP")

This file now takes up 10 MB of space.
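The file sizes quoted here can also be checked from Python instead of a file browser. Below is a minimal sketch that uses only the standard library. The helper names `size_mb` and `size_ratio` are my own and are not part of Deephaven, and a megabyte is assumed to mean 10^6 bytes.

```python
import os


def size_mb(path):
    """Return the size of the file at `path` in megabytes (10**6 bytes)."""
    return os.path.getsize(path) / 1_000_000


def size_ratio(csv_path, parquet_path):
    """Return how many times larger the CSV file is than the Parquet file."""
    return os.path.getsize(csv_path) / os.path.getsize(parquet_path)
```

Assuming both files were written to `/data`, calling `size_ratio("/data/coin_data.csv", "/data/CryptoTrades_20210922.parquet")` should return roughly 7 if the files match the 70 MB and 10 MB figures reported in this post.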

I could store 7 identical copies of this Parquet file and take up the same amount of disk space as the single CSV file. This leaves me with one last thing to do…

import os

# Delete the CSV copy now that the data lives in the Parquet file
os.remove("/data/coin_data.csv")

Get that bloated CSV file out of here! Speaking of which, I think I’m overdue to convert all of my locally stored CSV files to Parquet.
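Converting a whole folder could look something like the sketch below. It is not from the Deephaven docs: the function name `convert_csv_dir` and its `convert` and `delete_csv` parameters are hypothetical, and the actual CSV-to-Parquet step is passed in by the caller so the directory logic stays independent of any one library. A CSV file is deleted only after its Parquet output exists and is non-empty.

```python
from pathlib import Path


def convert_csv_dir(csv_dir, convert, delete_csv=False):
    """Convert every .csv file directly inside `csv_dir`.

    `convert(csv_path, parquet_path)` is supplied by the caller and is
    expected to write a Parquet file at `parquet_path`.
    Returns the list of Parquet paths that were written.
    """
    written = []
    for csv_path in sorted(Path(csv_dir).glob("*.csv")):
        parquet_path = csv_path.with_suffix(".parquet")
        convert(str(csv_path), str(parquet_path))
        # Only record (and optionally delete) when the output really exists.
        if parquet_path.is_file() and parquet_path.stat().st_size > 0:
            written.append(parquet_path)
            if delete_csv:
                csv_path.unlink()
    return written
```

Inside a Deephaven session, `convert` could be a small wrapper around the two calls used in this post, for example `lambda src, dst: write(read_csv(src), dst, compression_codec_name="GZIP")`, with `read_csv` and `write` imported as shown earlier.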

That’s really all there is to it. Do you have big, ugly CSV files you use to store your data? Turn them into Parquet in four lines of code. After that, remove the CSV and never look back. Use all of that extra space to store more crypto data than you thought possible.




By stp2y
