Semantic Matching of Text Identifiers Using LASER Embeddings in Python

When using OCR to digitize financial reports, you may encounter various approaches for detecting specific categories within those reports. For example, traditional methods like the Levenshtein algorithm can be used for string matching based on edit distance, making it effective for handling near matches, such as correcting typos or small variations in text.
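
For illustration, here is a minimal pure-Python sketch of that edit-distance idea (the levenshtein helper below is our own, not taken from any particular library):

def levenshtein(a, b):
    # Classic dynamic-programming edit distance:
    # prev[j] holds the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# A one-character OCR misread is still a near match:
print(levenshtein("operating profit", "operatlng profit"))  # 1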

However, the challenge becomes more complex when you need to detect multiple categories in a single line of a report, especially when those categories may not appear exactly as expected or could overlap semantically.

In this post, we walk through a semantic matching approach that uses Facebook’s LASER (Language-Agnostic SEntence Representations) embeddings and show how it handles this task.



Problem

The objective is to identify specific financial terms (categories) in a given text line. Let’s assume we have a fixed set of predefined categories that represent all possible terms of interest, such as:

["revenues", "operating expenses", "operating profit", "depreciation", "interest", "net profit", "tax", "profit after tax", "metric 1"]

Given an input line like:

"operating profit, net profit and profit after tax"

We aim to detect which identifiers appear in this line.



Semantic Matching with LASER

Instead of relying on exact or fuzzy text matches, we use semantic similarity. This approach leverages LASER embeddings to capture the semantic meaning of text and compares it using cosine similarity.
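
To make the comparison concrete before introducing LASER, here is the cosine-similarity computation on toy vectors (real LASER embeddings are 1024-dimensional; these three-dimensional vectors are just for illustration):

import numpy as np

def cosine(u, v):
    # Dot product divided by the product of the norms;
    # 1.0 means the vectors point in the same direction.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([0.9, 0.1, 0.0])
v = np.array([0.8, 0.2, 0.1])
print(round(float(cosine(u, v)), 2))  # 0.98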



Implementation



Preprocessing the Text

Before embedding, the text is preprocessed by converting it to lowercase and removing extra spaces. This ensures uniformity.

def preprocess(text):
    # Lowercase and collapse runs of whitespace to a single space.
    return " ".join(text.lower().split())
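For example:

print(preprocess("  Operating   Profit "))  # "operating profit"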



Embedding Identifiers and Input Line

The LASER encoder generates normalized embeddings for both the list of identifiers and the input/OCR line.

identifier_embeddings = encoder.encode_sentences(identifiers, normalize_embeddings=True)
ocr_line_embedding = encoder.encode_sentences([ocr_line], normalize_embeddings=True)[0]

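As a quick sanity check, LASER sentence embeddings are 1024-dimensional, so with the nine identifiers above the shapes should be:

print(identifier_embeddings.shape)  # (9, 1024)
print(ocr_line_embedding.shape)     # (1024,)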



Ranking Identifiers by Specificity

Longer identifiers are prioritized by sorting them on word count. This helps handle nested matches, where a longer identifier can subsume a shorter one that appears inside it (e.g., “profit after tax” subsumes “tax”).

ranked_identifiers = sorted(identifiers, key=lambda x: len(x.split()), reverse=True)
ranked_embeddings = encoder.encode_sentences(ranked_identifiers, normalize_embeddings=True)

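With the identifier list from above, the ranking looks like this (Python’s sort is stable, so ties keep their original order):

print(ranked_identifiers)
# ['profit after tax',
#  'operating expenses', 'operating profit', 'net profit', 'metric 1',
#  'revenues', 'depreciation', 'interest', 'tax']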



Calculating Similarity

Using cosine similarity, we measure how semantically similar each identifier is to the input line. Identifiers with similarity above a specified threshold are considered matches.

matches = []
threshold = 0.6

for idx, identifier_embedding in enumerate(ranked_embeddings):
    similarity = cosine_similarity([identifier_embedding], [ocr_line_embedding])[0][0]
    if similarity >= threshold:
        matches.append((ranked_identifiers[idx], similarity))

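Since the embeddings were encoded with normalize_embeddings=True, cosine similarity reduces to a plain dot product, so an equivalent vectorized sketch of the loop above is:

# ranked_embeddings and ocr_line_embedding are NumPy arrays,
# so one matrix-vector product scores every identifier at once.
similarities = ranked_embeddings @ ocr_line_embedding
matches = [(ident, float(score))
           for ident, score in zip(ranked_identifiers, similarities)
           if score >= threshold]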



Resolving Nested Matches

To handle overlapping identifiers, longer matches are prioritized, ensuring shorter matches within them are excluded.

resolved_matches = []
# Sort by word count first (then score) so longer identifiers are kept
# before their shorter substrings are considered.
for identifier, score in sorted(matches, key=lambda x: (len(x[0].split()), x[1]), reverse=True):
    if not any(identifier in longer_id and len(identifier) < len(longer_id)
               for longer_id, _ in resolved_matches):
        resolved_matches.append((identifier, score))

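For example, if both “profit after tax” and the shorter “tax” cleared the threshold (the scores below are made up for illustration), the substring check drops the shorter one:

matches = [("profit after tax", 0.71), ("tax", 0.65)]  # illustrative scores

resolved_matches = []
for identifier, score in sorted(matches, key=lambda x: (len(x[0].split()), x[1]), reverse=True):
    if not any(identifier in longer_id and len(identifier) < len(longer_id)
               for longer_id, _ in resolved_matches):
        resolved_matches.append((identifier, score))

print(resolved_matches)  # [('profit after tax', 0.71)]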



Results

When the code is executed, the output provides a list of detected matches along with their similarity scores. For the example input:

Detected Matches:
profit after tax: 0.71
operating profit: 0.69
net profit: 0.64



Considerations for Longer and Complex Inputs

This method works well in structured financial reports with multiple categories on a single line, provided there aren’t too many categories or much unrelated text. However, accuracy can degrade with longer, complex inputs or unstructured user-generated text, as the embeddings may struggle to focus on relevant categories. It is less reliable for noisy or unpredictable inputs.



Conclusion

This post demonstrates how LASER embeddings can be a useful tool for detecting multiple categories in text. Is it the best option? Maybe not, but it is certainly one of the options worth considering, especially when dealing with complex scenarios where traditional matching techniques might fall short.



Full code

from laser_encoders import LaserEncoderPipeline
from sklearn.metrics.pairwise import cosine_similarity

# Initialize encoder
encoder = LaserEncoderPipeline(lang="eng_Latn")

# Example identifiers and OCR line
identifiers = ["revenues", "operating expenses", "operating profit", "depreciation", "interest",
               "net profit", "tax", "profit after tax", "metric 1"]

ocr_line = "operating profit, net profit and profit after tax"

# Preprocessing
def preprocess(text):
    # Lowercase and collapse runs of whitespace to a single space
    return " ".join(text.lower().split())

identifiers = [preprocess(identifier) for identifier in identifiers]
ocr_line = preprocess(ocr_line)

# Embed identifiers and OCR line
identifier_embeddings = encoder.encode_sentences(identifiers, normalize_embeddings=True)
ocr_line_embedding = encoder.encode_sentences([ocr_line], normalize_embeddings=True)[0]

# Rank identifiers by specificity (word count)
ranked_identifiers = sorted(identifiers, key=lambda x: len(x.split()), reverse=True)
ranked_embeddings = encoder.encode_sentences(ranked_identifiers, normalize_embeddings=True)

# Initialize variables
matches = []
threshold = 0.6  # Minimum cosine similarity to count as a match; tune per dataset

# Match by specificity
for idx, identifier_embedding in enumerate(ranked_embeddings):
    similarity = cosine_similarity([identifier_embedding], [ocr_line_embedding])[0][0]
    if similarity >= threshold:
        matches.append((ranked_identifiers[idx], similarity))

# Resolve nested matches (by preferring longer matches)
resolved_matches = []
for identifier, score in sorted(matches, key=lambda x: (len(x[0].split()), x[1]), reverse=True):
    if not any(identifier in longer_id and len(identifier) < len(longer_id)
               for longer_id, _ in resolved_matches):
        resolved_matches.append((identifier, score))

# Output results
print("Detected Matches:")
for identifier, score in resolved_matches:
    print(f"{identifier}: {score:.2f}")


