RSS feeds bring together a large amount of frequently updating data from multiple sources. People often use them to scan headlines and article snippets to figure out what’s actually worth their time. Podcasts typically publish their metadata, like episode titles, in RSS streams. With over 850,000 podcasts active in 2021, this provides us with a massive source of real-time data – the kind that Deephaven excels at handling.
Clearly, sifting through that metadata manually would be a Sisyphean chore. In this blog post, we demonstrate how to build a system that aggregates podcast episode titles into a single source. Placing the data into Deephaven tables makes working with information on this scale manageable. Last month, we gave you a DIY program to ingest Reddit posts and perform simple sentiment analysis. You can modify this program in similar ways, such as finding episodes that feature your favorite athlete or influencer. In fact, you could hook up any RSS feed with data that interests you.
It turns out podcasts often use RSS feeds to publish information. The Podcast Index claims to have records of over 4.3 million podcasts, and nearly all of these have an associated RSS feed. This has the potential to be an incredibly fruitful resource. Let’s pull in some of this data and see what’s out there.
We’ll use our code from analyzing Reddit RSS feeds as a starting point. Given the sheer number of RSS feeds out there, we need to adapt our program to avoid performance problems. Our first step is to scale out effectively.
At the end of the day, RSS feeds are just URLs that follow a standard format. This means that RSS readers actually perform HTTP requests on the backend. Anyone who’s worked with HTTP requests knows how slow they can be, and how much performance can be gained by threading them. The same trick can speed up our RSS reader!
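To see why this matters, here's a quick standalone sketch (independent of Deephaven) that fetches a few of the feeds used later in this post, first sequentially and then with a thread pool. Exact timings depend on your network, but the threaded pass should finish in roughly the time of the slowest single request:

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# A few of the podcast feeds used later in this post
urls = [
    "http://feeds.soundcloud.com/users/soundcloud:users:151205561/sounds.rss",
    "https://nocturniarecords.podomatic.com/rss2.xml",
    "http://feeds.soundcloud.com/users/soundcloud:users:142613909/sounds.rss",
]

def fetch(url):
    # An HTTP request spends most of its time waiting on the network,
    # so Python threads can overlap requests despite the GIL
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()

# Sequential: total time is roughly the sum of all requests
start = time.time()
results = [fetch(url) for url in urls]
print(f"Sequential: {time.time() - start:.2f} seconds")

# Threaded: total time is roughly the slowest single request
start = time.time()
with ThreadPoolExecutor(max_workers=len(urls)) as executor:
    results = list(executor.map(fetch, urls))
print(f"Threaded: {time.time() - start:.2f} seconds")
```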
Deephaven’s tables work very well in multi-threaded environments. Specifically, for our situation, we want a table that can be written to from multiple threads. One way to accomplish this is to create a table using Deephaven’s DynamicTableWriter for each thread, and then use the merge method in the main thread to combine the tables. The resulting table will continue to update as the tables in the threads update. This query shows a simple example:
```python
from deephaven import DynamicTableWriter, merge
import deephaven.dtypes as dht

import threading
import time

NUMBER_OF_TABLES = 3

def write_to_table(writer):
    # Write a few rows with pauses in between so the ticking updates are visible
    writer.write_row("A", 1)
    time.sleep(3)
    writer.write_row("B", 2)
    time.sleep(3)
    writer.write_row("C", 3)

column_definitions = {
    "Letter": dht.string,
    "Number": dht.int32,
}

tables = []
threads = []

# Give each thread its own DynamicTableWriter so that no two threads
# ever write to the same table
for i in range(NUMBER_OF_TABLES):
    writer = DynamicTableWriter(column_definitions)
    thread = threading.Thread(target=write_to_table, args=[writer])
    thread.start()
    tables.append(writer.table)
    threads.append(thread)

# Merge the per-thread tables; the result ticks as its sources tick
result = merge(tables)

# Expose the individual per-thread tables as well
for i in range(len(tables)):
    globals()[f"table_{i}"] = tables[i]

# Keep the main thread alive until every writer thread has finished
while True:
    thread_is_alive = False
    for thread in threads:
        if thread.is_alive():
            thread_is_alive = True
    if thread_is_alive:
        time.sleep(1)
    else:
        break
```
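A note on the final loop: it simply keeps the main script alive until every writer thread finishes, and calling thread.join() on each thread would work just as well. In the meantime, both the merged result table and the individual table_0 through table_2 tables continue to tick as rows arrive.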
We can apply this to our RSS reader to pull podcast information. Given a list of RSS feeds where each feed points to a podcast of our choice, we can read from them in a threaded environment and write their data to Deephaven.
For this example, we read from four arbitrary podcast RSS feeds using two threads and write the title of each podcast episode to our table:
```python
import os
os.system("pip install feedparser")

from deephaven import DynamicTableWriter, merge
import deephaven.dtypes as dht

import threading
import time
import feedparser

NUMBER_OF_RSS_TABLES = 2

def read_rss_feeds(feed_urls, table_writer):
    # Parse each feed and write one row per episode
    for url in feed_urls:
        feed = feedparser.parse(url)
        for entry in feed.entries:
            title = entry["title"]
            table_writer.write_row(title)

# Two lists of feeds, one per thread
rss_feed_urls = [
    [
        "http://feeds.soundcloud.com/users/soundcloud:users:151205561/sounds.rss",
        "https://nocturniarecords.podomatic.com/rss2.xml",
    ],
    [
        "http://feeds.soundcloud.com/users/soundcloud:users:142613909/sounds.rss",
        "http://feeds.soundcloud.com/users/soundcloud:users:155565658/sounds.rss",
    ],
]

column_definitions = {"EpisodeTitle": dht.string}

rss_tables = []
rss_threads = []

for i in range(NUMBER_OF_RSS_TABLES):
    writer = DynamicTableWriter(column_definitions)
    thread = threading.Thread(target=read_rss_feeds, args=[rss_feed_urls[i], writer])
    thread.start()
    rss_tables.append(writer.table)
    rss_threads.append(thread)

# Merge the per-thread tables into a single ticking table
rss_feeds = merge(rss_tables)

for i in range(len(rss_tables)):
    globals()[f"rss_table_{i}"] = rss_tables[i]

# Keep the main thread alive until every reader thread has finished
while True:
    thread_is_alive = False
    for thread in rss_threads:
        if thread.is_alive():
            thread_is_alive = True
    if thread_is_alive:
        time.sleep(1)
    else:
        break
```
Now we have a single table containing information from our various podcasts. This example pulls only the episode title, but there are many other attributes in the RSS feed that can be used as well.
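For example, feedparser exposes each entry's publication date when the feed provides one. Here's a minimal sketch along those lines; the two-column writer, the fallback to an empty string for feeds that omit the standard published field, and the `mix` keyword in the Deephaven where filter at the end are all our own illustrative choices:

```python
# Capture an extra attribute per episode. "published" is standard in RSS
# but not guaranteed, so fall back to an empty string.
def read_rss_feeds_with_dates(feed_urls, table_writer):
    for url in feed_urls:
        feed = feedparser.parse(url)
        for entry in feed.entries:
            table_writer.write_row(entry.get("title", ""), entry.get("published", ""))

# A writer with a second column for the publication date
dated_writer = DynamicTableWriter({"EpisodeTitle": dht.string, "Published": dht.string})

# Once rows land in a table, Deephaven's query language can filter them,
# e.g. keeping only episodes whose titles mention a keyword
mentions = rss_feeds.where("EpisodeTitle.contains(`mix`)")
```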
As we said, threads are a great way to improve performance in applications where work can run in parallel and asynchronously. In these cases, Deephaven is a powerful tool.
The Deephaven Podcast Aggregation sample app shows an extreme example of using Deephaven in a threaded environment. Not only does this application scale out to millions of podcast RSS feeds, but it also contains polling logic that continually re-reads those feeds, so updates arrive in real time. The project pulls all of the metadata from each podcast it reads. You can use this data to answer questions like which podcasts published most recently, which episodes contain certain keywords, and which podcasts produce the most episodes in a given period of time.
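While the sample app's ingestion logic is more robust than anything we'd fit in a blog post, a simplified version of that polling pattern might look like the sketch below; the interval_seconds parameter and the use of each entry's id (falling back to its title) as a dedup key are illustrative choices, not the sample app's actual implementation:

```python
import time
import feedparser

def poll_rss_feeds(feed_urls, table_writer, interval_seconds=300):
    # Re-read each feed on a fixed interval, writing only unseen entries
    seen = set()
    while True:
        for url in feed_urls:
            feed = feedparser.parse(url)
            for entry in feed.entries:
                # Prefer the entry's unique id; fall back to the title
                key = entry.get("id") or entry.get("title")
                if key is not None and key not in seen:
                    seen.add(key)
                    table_writer.write_row(entry.get("title", ""))
        time.sleep(interval_seconds)
```

Let us know what you come up with on Slack or in our GitHub Discussions.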