How S&P Global is making markets more transparent with NLP, spaCy and Prodigy · Explosion

We’ve talked to Christopher Ewen, Senior Product Manager at S&P Global Commodity Insights, about how their small team built and shipped impressively efficient information extraction pipelines for real-time commodities trading insights in a high-security environment, and how they were able to achieve a 10× speed-up of their data collection and annotation workflows to build a new dataset.

S&P Global are one of the leading providers of data and insights for global energy and commodities, covering raw materials like metals, agricultural products and chemicals as well as the energy transition. Publishing benchmark prices in these markets lets producers and consumers lock in prices and hedge against uncertainty. Transparency is key to allowing these markets to operate efficiently and provides clarity and guidance in navigating the volatile commodity landscape, enabling fair trade, effective risk management, and informed decision-making.

“Heards” are trading activities market reporters receive daily via phone, email and instant messenger – essentially, information that was heard about commodities trades across many different markets like agriculture, coal, electric power, natural gas or oil, happening and published in real time. Relevant information collected that meets the methodological standards is published through real-time information services as “heards” in order to test information with the market. The information includes up to 32 different attributes, like price, participants or location, that are published as both structured and unstructured data. Customers include banks, financial institutions and trading houses at more than 15,000 public and private organizations in over 150 countries, who access the data via S&P’s Platts Connect platform, the underlying API or third-party vendors.

Structured data in the platform — View of structured heards data in Platts Connect

A key goal of Chris’ team is to make the commodities markets as transparent as possible and publish information and benchmark prices “immediately as heard”. Extracting the information from these heards automatically provides even more transparency while maintaining the real-time nature of the information.

In addition to the live feed, having structured historical data is also incredibly valuable, both internally and for customers. S&P Global process around 8,000 new heards per day, with an archive of over 13 million data points since 2017. While it would previously take an analyst hours to find answers in the unstructured data and wrangling Excel spreadsheets, the structured feed now lets them find this information in seconds.

Heards are collected as very concise notes using an extremely specific structure with extremely specific terminology. This presents several challenges, but also opportunities for a custom NLP pipeline: some attributes can be extracted reliably via rules, whereas others require a statistical language model.

Example of an incoming heard annotated with different structured attributes

To build automated production pipelines for processing heards in real time, Chris and his team at S&P Global use custom spaCy pipelines, fine-tuned for each market, and Prodigy for efficient annotation, data collection, quality control and evaluation.

For their production stack, high inference speed and latency is crucial: the incoming data entries need to be processed and validated in real time and meet the 15ms SLA per heard to provide maximum market transparency to customers. Using spaCy’s component implementations, their fine-tuned language models run at around 15,000 words per second at accuracies of up to 99% and model artifacts of 6 MB, making it easy to develop and deploy the pipelines in-house.

Having a small model makes it much easier to achieve our strict inference SLAs. The system is much less operationally complex because the model is so efficient. Less complexity means less that can go wrong.

— Christopher Ewen, Senior Product Manager

Keeping the data and models entirely private and in-house is also critical: heards contain information that can significantly affect and move markets and any pre-publication information is highly segregated, even within the office. Customers trust the commodities team to publish the data as soon as it comes in and before it’s seen by anybody else.

Naturally, the structured data needs to be highly accurate, which requires domain experts in the loop at all times. The team realized quickly that teaching people to annotate training and evaluation data would take too long and not pay off, so initially, Product Manager Chris was the only expert available to create data. Despite this challenge, the team was able to take advantage of Prodigy’s efficient design and interfaces to create a workflow requiring only 30 minutes of work per attribute per market, or 15 hours per market in total, and successfully ship their first pipelines to production. The reduced time needed by the market specialists allows them to focus on their job of communicating with market participants, assessing prices and publishing news.

Given the very specific structure and terminology used in heards, the project requires tooling that’s highly customizable: the pipeline needs to be able to define its own rules for tokenization to handle the unusual punctuation and combine the predictive named entity recognition model with rules to improve accuracy. The data development process needs to include the model and rules in the loop, and automate annotation wherever possible to allow experts to move through the data quickly, without requiring too much fine-grained clicking.

Prodigy lets us automate as much as possible and focus on valuable decisions and less clicking. I can stream in the model’s predictions and rule-based matches and make corrections in a single click.

— Christopher Ewen, Senior Product Manager

Flowchart showing the end-to-end workflow — Two annotation workflows in comparison: annotating all labels vs. focusing on a single label

Prodigy workflows in comparison — Two annotation workflows in comparison: annotating all labels vs. focusing on a single label

	Global Carbon Credits	Americas Crude Oil	Asia Steel Rebar
Accuracy (F-score)	0.95	0.96	0.99
Speed (words/second)	15,730	13,908	16,015
Model Size	6 MB	6 MB	6 MB
Training Examples	1,598	1,695	1,368
Evaluation Examples	211	200	345
Data Development Time	~15h	~15h	~15h

Source link
lol

How S&P Global is making markets more transparent with NLP, spaCy and Prodigy · Explosion

Human-in-the-loop distillation with LLMs

config.cfg (excerpt)

Human-in-the-loop distillation

Project management and reproducible experiments

Results and evaluation

In-context learning vs. supervised learning for predictive tasks

Resources

By stp2y

Leave a Reply Cancel reply