Much of the world’s data doesn’t conform to a particular structure. Day-to-day communication between humans is a good example: the posts we publish on X have no defined structure, and neither do the pictures on Instagram or the videos on YouTube and TikTok. By common estimates, unstructured data makes up around 80% of the world’s data, and that share is growing.
Managing unstructured data is essential for the success of machine learning (ML) projects. Without structure, data is difficult to analyze and extracting meaningful insights and patterns is challenging.
This article will discuss managing unstructured data for AI and ML projects. You will learn the following:
- Why unstructured data management is necessary for AI and ML projects.
- How to properly manage unstructured data.
- The different tools used in unstructured data management.
- How to leverage Generative AI to manage unstructured data.
- Benefits of applying proper unstructured data management processes to your AI/ML project.
What is Unstructured Data?
One thing is clear: “unstructured” doesn’t mean the data lacks information; anything carrying no information wouldn’t be data at all. What distinguishes unstructured data from structured data is how it is stored and organized.
Structured data has a predefined structure that the data curator provides. Unstructured data, on the other hand, lacks a predefined structure. Examples of this type of data include text, audio, video, images, and PDFs.
For instance, imagine you wanted to know how many users visited a site over a week. You could record that information as free text: “4 users were available on Monday, 6 were present on Tuesday, while eleven were present on Wednesday, then four on Thursday, and finally, on Friday, we had our highest number of users, which was 25.”
This text carries all the information, but it is not structured. Here’s the structured equivalent of the same data in tabular form:

| Day       | Users |
|-----------|-------|
| Monday    | 4     |
| Tuesday   | 6     |
| Wednesday | 11    |
| Thursday  | 4     |
| Friday    | 25    |
With structured data, you can use query languages like SQL to extract and interpret information. In contrast, such traditional query languages struggle to interpret unstructured data.
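As a minimal illustration, here is how the tabular version of that data could be queried with Python’s built-in sqlite3 module (the table and column names are invented for the example):

import sqlite3

# A minimal sketch: the weekly visit data, stored as a structured table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (day TEXT, users INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("Monday", 4), ("Tuesday", 6), ("Wednesday", 11), ("Thursday", 4), ("Friday", 25)],
)

# A simple SQL query answers questions the free-text version makes hard.
total, busiest = conn.execute("SELECT SUM(users), MAX(users) FROM visits").fetchone()
print(total, busiest)  # 50 25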
However, with the rise of Generative AI, making sense of unstructured data has become much easier. For example, a large language model (LLM) like GPT-4 can process text and extract structured information.
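As a hedged sketch of what that can look like with the OpenAI Python client (the model name and prompt are illustrative, and an OPENAI_API_KEY environment variable is assumed):

from openai import OpenAI

# A sketch, not a recommendation: ask an LLM to turn free text into structured JSON.
client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{
        "role": "user",
        "content": "Extract each day's visitor count from this text as a JSON list: "
                   "'4 users were available on Monday, 6 were present on Tuesday, ...'",
    }],
)
print(response.choices[0].message.content)  # e.g. [{"day": "Monday", "users": 4}, ...]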
For unstructured data like images, convolutional neural network (CNN) models can categorize and analyze the data, identifying relevant features and categories.
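For instance, a minimal sketch using a pre-trained ResNet from torchvision (the image path is hypothetical):

import torch
from torchvision import models, transforms
from PIL import Image

# Standard ImageNet preprocessing for a pre-trained classifier.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
image = Image.open("sample.jpg").convert("RGB")  # hypothetical input image

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.eval()
with torch.no_grad():
    logits = model(preprocess(image).unsqueeze(0))
print(logits.argmax(dim=1).item())  # index of the predicted ImageNet class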
The Need for Unstructured Data Management in AI Projects
With so much unstructured data in the world, it is important to manage it correctly. This is where unstructured data management comes into play. It involves the processes and techniques to organize, store, and handle unstructured data, enabling easy retrieval, analysis, and seamless integration within a project.
An effectively managed unstructured data system can significantly benefit any AI/ML project. Here are a few ways proper unstructured data management can enhance an AI/ML project:
- Data retrieval: If unstructured data is properly managed, it can be easier to retrieve when needed.
- Extract valuable insights: Extracting meaningful information from a collection of well-managed, unstructured data is easier.
- Detect data duplication: Multiple copies of the same data can lead to unnecessary storage consumption. With proper unstructured data management, you can write validation checks to detect multiple entries of the same data (see the hashing sketch after this list).
- Continuous learning: In a properly managed unstructured data pipeline, you can use new entries to train a production ML model, keeping the model up-to-date.
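As an example of the duplicate-detection point above, here is a minimal sketch that hashes file contents to flag exact copies (the raw_data directory is hypothetical):

import hashlib
from pathlib import Path

# Flag files whose bytes are identical; reads whole files, so chunk for very large data.
seen = {}
for path in Path("raw_data").rglob("*"):
    if path.is_file():
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            print(f"duplicate: {path} == {seen[digest]}")
        else:
            seen[digest] = path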
Next, let’s learn about some core challenges of managing unstructured datasets.
Managing Unstructured Data: Challenges
While managing unstructured data is crucial in any AI/ML project, it comes with some challenges. Here are some challenges you might face while managing unstructured data:
- Storage consumption: Unstructured data can consume a large volume of storage. For instance, if you are working with several high-definition videos, storing them would take a lot of storage space, which could be costly. So, when working with unstructured data in an AI/ML project, you must consider storage space.
- Data variety: Unstructured data comes in different modalities, including text, images, videos, and audio. Since there’s no single modality, managing the data can be challenging because a technique that works for one modality might not work for another.
- Further processing is usually required: Unstructured data, by nature, lacks the organization necessary for direct analysis, making further processing a critical challenge. Before using the data effectively in AI/ML models, you need to run transforms that convert text into tokenized formats, images into vector representations, or audio into spectral data (see the tokenization sketch after this list).
- Data streaming: Because unstructured data files can be large, streaming them from a data source to their destination can prove difficult.
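To make the “further processing” challenge concrete, here is a small sketch of the text-to-tokens transform, assuming the Hugging Face transformers library (with PyTorch installed for the tensor output):

from transformers import AutoTokenizer

# Text must become token IDs before a language model can consume it;
# images and audio need analogous transforms into arrays or spectrograms.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer("Unstructured data needs preprocessing.", return_tensors="pt")
print(tokens["input_ids"].shape)  # a tensor of token IDs, ready for a model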
You would likely encounter these challenges in one or more of your AI/ML projects that use unstructured datasets, so how do you overcome such challenges? Let’s see how to manage such datasets in the next section.
How Do You Manage Unstructured Data?
Now that you know why it is important to manage unstructured data correctly and what problems it can cause, let’s examine a typical project workflow for managing unstructured data.
There are 5 stages in unstructured data management:
- Data collection
- Data integration
- Data cleaning
- Data annotation and labeling
- Data preprocessing
Data Collection
The first stage in the unstructured data management workflow is data collection. Data can come from many different sources, such as databases, direct user uploads, or platforms like GitHub, Notion, and Amazon S3 buckets.
The collected data files can be in various formats, including JPEGs, PNGs, PDFs, plain text, Markdown, video (.mp4, .webm, etc.), and audio (.wav, .mp3, .aac, etc.). Depending on the project’s goals, you may work with a single data type or multiple formats.
It’s also common to collect data from various sources throughout a project.
Data Integration
Once you collect the unstructured data from multiple storage locations, you store it in a central location for processing. A common approach is to integrate the different data producers into a data lake that serves as a central repository.
A central repository for unstructured data is beneficial for tasks like analytics and data virtualization.
Data Cleaning
The next step is to clean the data after ingesting it into the data lake. This involves removing duplicates, correcting errors, handling missing or incomplete information, and standardizing formats.
Ensure the data is accurate and consistent to prepare it for subsequent stages, such as annotation and preprocessing.
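As a minimal sketch of such cleaning over a toy list of text records:

# Normalize, drop empty entries, and remove exact duplicates from hypothetical records.
records = ["  Hello World ", "hello world", "", None, "Second entry"]
cleaned, seen = [], set()
for record in records:
    if not record or not record.strip():
        continue  # handle missing or empty entries
    text = " ".join(record.split()).lower()  # standardize whitespace and casing
    if text not in seen:  # remove exact duplicates
        seen.add(text)
        cleaned.append(text)
print(cleaned)  # ['hello world', 'second entry']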
Data Annotation and Labeling
In this stage, you perform labeling tasks that add extra information to the collected unstructured data, including metadata, tags (annotations), and other data description properties.
These annotations depend heavily on the type of unstructured data collected and the project’s goal. For image data, a human annotator can perform tasks like classification or segmentation, or a model such as U-Net can automate them. For text, you can run tasks like sentiment analysis or topic modeling to add extra information to the data.
This stage adds descriptions and labels to the unstructured dataset, making it easier to categorize and prepare for other downstream tasks (e.g., data cleaning) since similar data will have similar annotations.
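As one hedged example of model-assisted annotation, a sentiment tag can be attached to each text with the Hugging Face transformers pipeline (which downloads a default sentiment model on first use):

from transformers import pipeline

# Attach a sentiment label as annotation metadata to each hypothetical text record.
classifier = pipeline("sentiment-analysis")
texts = ["The product works great.", "Support never answered my ticket."]
annotated = [{"text": t, "label": classifier(t)[0]["label"]} for t in texts]
print(annotated)  # each record now carries a POSITIVE/NEGATIVE annotation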
Data Preprocessing
Here, you can process the unstructured data into a format that can be used for the other downstream tasks. For instance, if the collected data was a text document in the form of a PDF, the data preprocessing—or preparation stage—can extract tables from this document.
The pipeline in this stage can convert the document into CSV files, and you can then analyze it using a tool like Pandas.
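A sketch of that PDF-to-CSV step, assuming the pdfplumber library and a hypothetical report.pdf with a table on its first page:

import pdfplumber
import pandas as pd

# Extract the first table found on page 1; extract_table returns rows as lists
# (or None if no table is detected).
with pdfplumber.open("report.pdf") as pdf:
    table = pdf.pages[0].extract_table()

df = pd.DataFrame(table[1:], columns=table[0])  # treat the first row as the header
df.to_csv("report_table.csv", index=False)
print(df.head())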
Tools and Techniques for Managing Unstructured Data
Several tools are required to properly manage unstructured data, from storage to analytics, and you also need the right techniques. This section gives an overview of the tools and techniques for managing unstructured data in your project.
General Purpose Tools
These tools help manage the unstructured data pipeline to varying degrees, with some encompassing data collection, storage, processing, analysis, and visualization.
DagsHub’s Data Engine
DagsHub’s Data Engine is a centralized platform for teams to manage and use their datasets effectively. It streamlines the management of unstructured datasets, offering features like data collection, streaming, auto-labeling, curation, visualization, and lineage tracking.
DagsHub Data Engine is a central platform for collecting, curating, and managing unstructured data at scale.
Data Engine integrates with popular data, ML, and software tools so teams can collaborate effectively and maintain dataset versioning.
ReportMiner by Astera
ReportMiner automates the extraction and transformation of unstructured data from documents and reports. It converts this data into structured formats, making it easier for organizations to analyze and integrate insights into their business processes.
Storage Tools
To work with unstructured data, you need to store it. Storage tools help with this. These tools can be the source or destination of your data. Due to the uniqueness of unstructured data, different storage techniques can be used to store it.
Vector Databases
Vector databases help store unstructured data by storing the actual data and its vector representation. This allows for efficient retrieval by comparing the vector similarity between a query and the stored data.
Examples of vector databases include Weaviate, ChromaDB, and Qdrant.
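A minimal sketch with ChromaDB, using its in-memory client and default embedding function (the documents and IDs are invented):

import chromadb

# Store two short documents and retrieve the one most similar to a query.
client = chromadb.Client()
collection = client.create_collection("notes")
collection.add(
    ids=["1", "2"],
    documents=["Invoice for cloud storage", "Meeting notes on model training"],
)
results = collection.query(query_texts=["billing"], n_results=1)
print(results["documents"])  # the stored text closest in meaning to "billing"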
NoSQL Databases
NoSQL databases do not follow the traditional relational database structure, which makes them ideal for storing unstructured data. They allow flexible data models such as document, key-value, and wide-column formats, which are well-suited for large-scale data management.
Examples of NoSQL databases include MongoDB’s Atlas, Cassandra, and Couchbase.
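For example, a sketch with pymongo, assuming a MongoDB instance running locally; note that the two documents share a collection despite having different shapes:

from pymongo import MongoClient

# NoSQL documents need no predefined schema; names here are hypothetical.
client = MongoClient("mongodb://localhost:27017")
posts = client["demo_db"]["posts"]
posts.insert_one({"type": "tweet", "text": "Shipping a new model today!", "tags": ["ml"]})
posts.insert_one({"type": "image", "url": "s3://bucket/cat.png", "width": 1024})
print(posts.count_documents({}))  # 2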
Data Lakes
Data lakes are centralized repositories designed to store vast amounts of raw, unstructured, and structured data in their native format. They enable flexible data storage and retrieval for diverse use cases, making them highly scalable for big data applications.
Popular data lake solutions include Amazon S3, Azure Data Lake, and Hadoop.
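As a small sketch, landing a raw file in an S3-based data lake with boto3 (the bucket name and paths are hypothetical, and AWS credentials are assumed to be configured):

import boto3

# Upload a raw audio file into the lake's "raw" zone in its native format.
s3 = boto3.client("s3")
s3.upload_file("interview.mp3", "my-data-lake", "raw/audio/interview.mp3")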
Data Processing Tools
These tools are essential for handling large volumes of unstructured data. They assist in efficiently managing and processing data from multiple sources, ensuring smooth integration and analysis across diverse formats.
Apache Kafka
Apache Kafka is a distributed event streaming platform for real-time data pipelines and stream processing. It allows unstructured data to be moved and processed easily between systems. Kafka is highly scalable and ideal for high-throughput and low-latency data pipeline applications.
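A minimal producer sketch using the kafka-python client, assuming a broker at localhost:9092 and an illustrative topic name:

import json
from kafka import KafkaProducer

# Serialize each event as JSON bytes before publishing it to the topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("raw-documents", {"source": "web", "text": "New support ticket ..."})
producer.flush()  # make sure the event is actually delivered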
Apache Hadoop
Apache Hadoop is an open-source framework that supports the distributed processing of large datasets across clusters of computers. It uses a map-reduce paradigm, making it suitable for batch processing unstructured data on a massive scale. Hadoop’s ecosystem includes storage (HDFS) and processing (MapReduce) components.
Apache Spark
Apache Spark is a fast, in-memory data processing engine that excels at large-scale data analytics. It supports real-time and batch processing and is highly efficient for unstructured data tasks such as machine learning, graph computation, and stream processing. Spark’s versatility makes it a preferred choice for data engineers and scientists.
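A sketch of a simple PySpark job over a hypothetical directory of raw text files, counting word frequencies across the corpus:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("unstructured-demo").getOrCreate()
lines = spark.read.text("raw_text/")  # one row per line, in a column named "value"
words = lines.select(F.explode(F.split(F.lower("value"), r"\s+")).alias("word"))
words.groupBy("word").count().orderBy(F.desc("count")).show(10)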
Tools Dedicated to Unstructured Data Management
As the generative AI revolution progresses, new tools improve how we manage unstructured data, offering more efficient and creative solutions. Some of those tools include:
- Unstructured.io: Unstructured.io enables users to extract and transform valuable information from unstructured formats such as HTML, PDF, and PNG. It also integrates with LLMs and vector databases, enhancing its ability to extract meaningful insights from unstructured data. Unstructured.io offers a serverless API that can be accessed programmatically and a separate no-code platform.
- LlamaParse: The team behind LlamaIndex developed LlamaParse, a document parsing tool. It lets users extract data from documents and configure workflows that pass the data downstream to LLMs for further processing. The tool offers a web UI as well as Python and TypeScript SDKs for developers.
Deep Learning Techniques Used to Manage Unstructured Data
Now that you have seen some of the tools used in unstructured data management, let’s explore the deep learning techniques you can use to process and understand unstructured data.
Embedding Models
Embedding models transform unstructured data, such as text, images, and audio, into vector representations. These vectors capture the data’s semantic meaning, making it easier to analyze, search, and retrieve relevant information across large datasets.
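For instance, a sketch using the sentence-transformers library and a small pre-trained model (downloaded on first use):

from sentence_transformers import SentenceTransformer, util

# Embed a query and two documents, then compare them by cosine similarity.
model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["How do I reset my password?", "Quarterly revenue grew 12%."]
query = model.encode("login problems")
scores = util.cos_sim(query, model.encode(docs))
print(scores)  # the password question scores higher than the revenue line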
Large Language Models
LLMs like Gemini and GPT-4 are engineered to process and understand unstructured text data. They can generate human-like text, summarize documents, and answer questions, making them essential for natural language processing and text analytics tasks.
Newer models can process images and videos in addition to text, giving them multi-modal capabilities. This expands their versatility, allowing them to work with a wider range of unstructured data formats.
Table Extraction
Deep learning models can extract structured information from unstructured sources, such as PDFs and images, into tabular formats. This technique helps transform messy data into organized tables for further analysis.
Text Recognition
Text recognition, often powered by Optical Character Recognition (OCR) models, converts text from images, scanned documents, or handwritten notes into plain text formats. This makes unstructured text data searchable and usable in downstream tasks like fine-tuning an LLM.
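A minimal OCR sketch, assuming the Tesseract engine is installed locally and using the pytesseract wrapper (scan.png is hypothetical):

import pytesseract
from PIL import Image

# Convert the pixels of a scanned page into searchable plain text.
text = pytesseract.image_to_string(Image.open("scan.png"))
print(text)  # recognized text, ready for indexing or fine-tuning data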
Named Entity Recognition (NER)
NER identifies and classifies entities like names, dates, organizations, and locations within the unstructured text. It helps extract meaningful elements from raw data for tasks like data labeling, information retrieval, and analysis.
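As a sketch with spaCy, assuming its small English model has been downloaded (python -m spacy download en_core_web_sm):

import spacy

# Run the pipeline and print each detected entity with its label.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple hired Jane Doe in London on March 3, 2024.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple ORG, London GPE, March 3, 2024 DATE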
Document Layout Analysis
Document layout analysis involves detecting and understanding the structural elements of documents, such as headers, footers, tables, and figures. This technique is vital for extracting and preserving the contextual layout of documents during data processing.
Walkthrough: Managing Unstructured Data with Unstructured.io
In the previous section, you briefly learned about Unstructured.io as one of the tools dedicated to managing unstructured data. In this section, you will walk through a code example of using it.
Before that, let’s explore the ingestion pipeline of Unstructured.io.
Unstructured.io Ingestion Pipeline
The ingestion pipeline of Unstructured.io is similar to the traditional Extract, Transform, Load (ETL) process. It operates in three stages:
- Extract unstructured data from a source.
- Transform the unstructured data into a more structured format.
- Ingest the transformed data into a designated destination.
Unstructured.io supports loading data from various sources, including GitHub repositories, local storage, Kafka streams, Amazon S3 buckets, and MongoDB databases. The transformation can be done using rule-based techniques like regular expressions and custom parsing rules or through model-based approaches, such as transformer-based models like LayoutLM. Additionally, OCR can extract text from images or scanned documents.
The transformed data can then be ingested into a destination such as local storage, a vector database, or a traditional database. An LLM can later use this data through Retrieval-Augmented Generation (RAG) to augment the model’s responses, or the data can be used to fine-tune the model for improved performance on specific tasks.
Unstructured.io Code Walkthrough
Let’s walk through an example using unstructured.io. In this code, you will upload an unstructured data format (a PDF) from a directory in local storage, perform transformations to extract relevant information, and then store the transformed data back to another directory.
To run the code, install the unstructured.io ingestion client:
pip install "unstructured-ingest[remote]"
With the ingestion client, we are using the Unstructured.io serverless API. To access it, you’ll need API keys, which you can obtain from the Unstructured website. The service offers a 14-day free trial.
To run the example, you’ll need `UNSTRUCTURED_API_URL` and `UNSTRUCTURED_API_KEY`. Store these values in your environment variables before running the example.
Next, create a directory called unstructured_pdfs and place the PDFs you wish to transform in it. This example uses the “Attention Is All You Need” paper as the sample PDF.
Here’s the source code:
import os

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.connectors.local import (
    LocalIndexerConfig,
    LocalDownloaderConfig,
    LocalConnectionConfig,
    LocalUploaderConfig,
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig

if __name__ == "__main__":
    Pipeline.from_configs(
        context=ProcessorConfig(),
        indexer_config=LocalIndexerConfig(input_path="unstructured_pdfs"),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        partitioner_config=PartitionerConfig(
            partition_by_api=True,
            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
            strategy="hi_res",
            additional_partition_args={
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 15,
            },
        ),
        uploader_config=LocalUploaderConfig(output_dir="unstructured_outputs"),
    ).run()
Let’s break down the code. First, you import the Pipeline class, which creates an ingestion pipeline. The ProcessorConfig class configures the ingestion process. LocalIndexerConfig, LocalDownloaderConfig, LocalConnectionConfig, and LocalUploaderConfig configure downloading the unstructured data from local storage and uploading the transformed output back to local storage.
The PartitionerConfig is used to configure how we wish to transform our unstructured data. The strategy hi_res specifies that we want to use a model-based approach to extract the information from the PDF. We can also specify how to partition the PDF with the partition config.
Once all the configurations have been set, we can run the pipeline. The pipeline creates JSON files that are stored in the unstructured_outputs folder specified in the uploader config.
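As a quick sketch of inspecting those outputs: each JSON file contains a list of element dictionaries with type and text fields (as shown in the next section), so you can filter them with the standard library:

import json
from pathlib import Path

# List the titles found in each output file produced by the pipeline.
for path in Path("unstructured_outputs").glob("*.json"):
    elements = json.loads(path.read_text())
    titles = [el["text"] for el in elements if el["type"] == "Title"]
    print(path.name, titles[:3])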
Reviewing Unstructured.io Outputs
When Unstructured.io transforms a document such as a PDF, its output is a list of Elements representing the information extracted from the document. Using the output for the paper, let’s examine a few element types.
Title Element
Unstructured.io can extract all the titles in a document and represent them as Title Elements. It identified “Attention Is All You Need” as the paper’s title.
{
  "type": "Title",
  "element_id": "a794d475b47689c5c29fac8fabfcb327",
  "text": "Attention Is All You Need",
  "metadata": {
    "filetype": "application/pdf",
    "languages": [
      "eng"
    ],
    "page_number": 1,
    "filename": "1706.03762v7.pdf",
    "data_source": {
      "url": null,
      "version": null,
      "record_locator": {
        "path": "/home/eteims/unstructured_pdfs/1706.03762v7.pdf"
      },
      "date_created": null,
      "date_modified": "1725636615.260449",
      "date_processed": "1725637072.9688466",
      "permissions_data": [
        {
          "mode": 33204
        }
      ],
      "filesize_bytes": 2215244
    }
  }
}
NarrativeText Element
Each chunk of body text in a document is represented as a NarrativeText Element. The paper’s abstract was extracted as one:
{
  "type": "NarrativeText",
  "element_id": "69e993117fd281d65dbbe15dc90cbbc9",
  "text": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English- to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.",
  "metadata": {
    "filetype": "application/pdf",
    "languages": [
      "eng"
    ],
    "page_number": 1,
    "parent_id": "e3c4bbd301642ada979eeea4e497ae68",
    "filename": "1706.03762v7.pdf",
    "data_source": {
      "url": null,
      "version": null,
      "record_locator": {
        "path": "/home/eteims/unstructured_pdfs/1706.03762v7.pdf"
      },
      "date_created": null,
      "date_modified": "1725636615.260449",
      "date_processed": "1725637072.9688466",
      "permissions_data": [
        {
          "mode": 33204
        }
      ],
      "filesize_bytes": 2215244
    }
  }
}
Formula Element
Unstructured.io can also extract and represent equations from a document as a Formula Element.
The element below represents the Scaled Dot-Product Attention equation, Attention(Q, K, V) = softmax(QKᵀ/√dₖ)V; note that the plain-text rendering in its text field is only approximate.
{
  "type": "Formula",
  "element_id": "bdfb010e7c4f7c67c20140d7ceb64be8",
  "text": "Attention(Q, K, V ) = softmax( QK T \u221a dk )V",
  "metadata": {
    "filetype": "application/pdf",
    "languages": [
      "eng"
    ],
    "page_number": 4,
    "filename": "1706.03762v7.pdf",
    "data_source": {
      "url": null,
      "version": null,
      "record_locator": {
        "path": "/home/eteims/unstructured_pdfs/1706.03762v7.pdf"
      },
      "date_created": null,
      "date_modified": "1725636615.260449",
      "date_processed": "1725637072.9688466",
      "permissions_data": [
        {
          "mode": 33204
        }
      ],
      "filesize_bytes": 2215244
    }
  }
}
Table Element
Tables in a document can be extracted as Table Elements. Unstructured.io extracted all the tables from the paper. Here’s a sample:
The extracted tables can also be represented as HTML tables. This enables the extracted table to be further processed with a tool like Pandas.
{
  "type": "Table",
  "element_id": "7214284cca158e57bbc4fdfb2ee2f410",
  "text": "Layer Type Self-Attention Recurrent Convolutional Self-Attention (restricted) Complexity per Layer O(n2 \u00b7 d) O(n \u00b7 d2) O(k \u00b7 n \u00b7 d2) O(r \u00b7 n \u00b7 d) Sequential Maximum Path Length Operations O(1) O(n) O(1) O(1) O(1) O(n) O(logk(n)) O(n/r)",
  "metadata": {
    "text_as_html": "<table><thead><tr><th>Layer Type</th><th>Complexity per Layer</th><th>Sequential Operations</th><th>Maximum Path Length</th></tr></thead><tbody><tr><td>Self-Attention</td><td>O(n?-d)</td><td>(1)</td><td>(1)</td></tr><tr><td>Recurrent</td><td>O(n-d?)</td><td></td><td>O(n)</td></tr><tr><td>Convolutional</td><td>O(k-n-d?)</td><td>) %</td><td>O(logk(n))</td></tr><tr><td>Self-Attention (restricted)</td><td>O(r-n-d</td><td>1)</td><td>O(n/r)</td></tr></tbody></table>",
    "filetype": "application/pdf",
    "languages": [
      "eng"
    ],
    "page_number": 6,
    "filename": "1706.03762v7.pdf",
    "data_source": {
      "url": null,
      "version": null,
      "record_locator": {
        "path": "/home/eteims/unstructured_pdfs/1706.03762v7.pdf"
      },
      "date_created": null,
      "date_modified": "1725636615.260449",
      "date_processed": "1725637072.9688466",
      "permissions_data": [
        {
          "mode": 33204
        }
      ],
      "filesize_bytes": 2215244
    }
  }
}
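As a hedged sketch of that pandas step, loading an output file and parsing the first Table element’s HTML (the output filename is hypothetical, and pandas needs an HTML parser such as lxml installed):

import io
import json
import pandas as pd

# Load one pipeline output file and turn its first Table element into a DataFrame.
elements = json.loads(open("unstructured_outputs/1706.03762v7.pdf.json").read())
table = next(el for el in elements if el["type"] == "Table")
df = pd.read_html(io.StringIO(table["metadata"]["text_as_html"]))[0]
print(df.columns.tolist())  # e.g. ['Layer Type', 'Complexity per Layer', ...]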
In this example, you have explored a subset of Unstructured.io’s capabilities, demonstrating its effective use for managing unstructured data. Consider experimenting with more complex use cases and integrating them into your data workflows.
Best Practices for Unstructured Data Management
When working with unstructured data for an AI/ML project, there are best practices that can greatly improve data management and processing efficiency. Let’s explore these best practices, their benefits, implementation tips, and recommended tooling.
1. Focus on Metadata Management First
Implementing robust metadata management is crucial for making unstructured data more manageable and accessible.
- Benefits: Enhances data searchability and discoverability, improves data integration, and enables more effective data analysis. It also provides the foundation for downstream machine learning or AI applications.
- Implementation tip: Define a clear metadata schema tailored to your data needs. Use automated tagging tools and natural language processing (NLP) models to extract metadata from text-based data (a minimal sketch follows this list).
- Tooling: Apache Tika, ElasticSearch, Databricks, and AWS Glue for metadata extraction and management.
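As a minimal sketch of the tip above, here is a hand-rolled metadata record following a simple invented schema; dedicated tools like Apache Tika extract far richer attributes:

from pathlib import Path
from datetime import datetime, timezone

def describe(path: Path) -> dict:
    # Build a basic metadata record from filesystem attributes.
    stat = path.stat()
    return {
        "name": path.name,
        "format": path.suffix.lstrip("."),
        "size_bytes": stat.st_size,
        "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
        "tags": [],  # filled in later by automated tagging or NLP models
    }

print(describe(Path("unstructured_pdfs/1706.03762v7.pdf")))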
2. Implement a Data Provenance System
Tracking the origin and transformations of unstructured data helps maintain trust and transparency.
- Benefits: Facilitates better data governance, ensures data traceability for audits, and builds confidence in the data used for AI or analytics. It also aids in identifying the source of any data quality issues.
- Implementation tip: Integrate version control tools and data lineage tracking into your data ingestion workflow. Ensure that every data transformation is logged with timestamps and user information.
- Tooling: Apache Atlas, Great Expectations, and Delta Lake for data lineage and provenance tracking.
3. Use Vector Databases for Better Searchability
Vector databases are ideal for managing complex, unstructured data like images, audio, and text.
- Benefits: Improves data retrieval speed for similar data types, strengthens semantic search capabilities, and enables more accurate recommendations based on content similarity.
- Implementation tip: Train or fine-tune NLP models for text and use pre-trained models for other modalities like images. Store these embeddings in a vector database for quick access.
- Tooling: Pinecone, Weaviate, FAISS (Facebook AI Similarity Search), and Milvus for storing and searching vector embeddings.
4. Integrate Data Quality Monitoring in the Pipeline
Ensuring the quality of unstructured data requires real-time monitoring throughout the data lifecycle.
- Benefits: Detects data drift and quality degradation across the entire lifecycle.
- Implementation tip: Set up automated checks and validations for data formats, completeness, and anomalies. Use dashboards to monitor these metrics and automate alerts for unusual patterns.
- Tooling: Evidently, Monte Carlo, WhyLabs, and Apache Airflow for data pipeline monitoring and alerting.
5. Utilize Hierarchical Storage Management (HSM) for Cost Efficiency
Managing storage tiers helps balance performance and cost for large volumes of unstructured data.
- Benefits: Reduces overall storage costs, ensures critical data remains easily accessible, and optimizes storage performance for data retrieval.
- Implementation tip: Analyze data access patterns to determine what data can be moved to colder storage. Set up automated policies for moving data between hot, warm, and cold storage based on access frequency (see the sketch after this list).
- Tooling: AWS S3 with lifecycle management, Google Cloud Storage with coldline options, Azure Blob Storage, and NetApp StorageGRID for implementing HSM.
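As a sketch of such a tier-by-age policy with boto3 (the bucket name, prefix, and day thresholds are illustrative):

import boto3

# Move objects under raw/ to cheaper tiers as they age.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "cool-down-raw-media",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # warm tier
                {"Days": 90, "StorageClass": "GLACIER"},      # cold tier
            ],
        }]
    },
)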
These practices can help organizations better manage the challenges of unstructured data, ultimately making it more accessible, reliable, and cost-effective.
Conclusion
Managing unstructured data in AI and ML projects has always been challenging, as most datasets, algorithms, and technologies have traditionally focused on structured data. However, as the volume of unstructured data grows, it’s essential to develop effective strategies to manage it and extract valuable insights.
Many problems that used to arise when managing unstructured data are now being solved thanks to progress in generative AI and tools like DagsHub and Unstructured.io, making it easier to use this data to its fullest potential.