It was challenging to stop myself from starting this article with some variation of the popular phrase “garbage in, garbage out.” Well, I did it anyway. But jokes aside, we can easily imagine a situation in which we have built and deployed a machine learning model (possibly a black box) that accepts some input and returns some predictions. So far, so good.
However, with tons of complexity happening before the model (data preprocessing and manipulation), within the model itself, and in any post-processing of the outputs, many things can go wrong. And in some mission-critical fields (finance, healthcare, or security), there is no margin for error, as crucial decisions are made based on the insights generated by ML models. When something unexpected does happen, having validated the pipeline and being confident that the data handling and processing steps are not the failure points is reassuring and makes it much easier to narrow down the problem area.
That is why this article will discuss the importance of data validation. We will start by describing data validation in more detail. Then, we will cover the five most popular (Python) tools that we can use to validate our input/output data. We picked these tools due to their widespread adoption by the biggest companies (FAANG, Fortune 500, etc.), their robust and versatile features, and their active community support. Let’s dive right into it!
A quick primer on data validation
Data validation refers to the process of ensuring that data is accurate, reliable, and consistent before it is used for analysis, decision-making, or feeding into ML models. By validating data, we aim to mitigate errors or inconsistencies and prevent the introduction of any biases that could skew the outcomes of any analysis.
The objective of data validation is to eliminate errors present within datasets. And to tackle the problem appropriately, it’s essential to track down the root causes of these errors. While there are many possible sources of errors in data, the following are some of the most common:
- Human errors: Whenever we deal with data inputted by humans, we assume that things can go wrong. For example, imagine a person filling in a survey to determine whether they are eligible for a mortgage. But by mistake, they add some extra zeros to the end of their salary, or perhaps they drop a digit from their age. Errors can inevitably occur, but it is important to identify and address data input discrepancies before the data is used in any decision-making models.
- Incomplete data: In many scenarios, we might encounter incomplete data. In some cases, this might be a result of optional survey fields. In other scenarios, this might be a mistake, or some data was simply lost.
- Duplicated data: Duplicate records can skew summary statistics and distort our analyses.
- Inconsistent formatting: Inconsistencies in formats (e.g. dates) or units (e.g. currencies) can become substantial pain points when analyzing data and comparing/merging datasets.
- Validating the output of ML models: In scenarios where one model’s output feeds into another, for example, a model estimating an individual’s salary that then informs a subsequent model assessing their credit score, it’s critical to validate the correctness of the first model’s predictions. In some extreme cases, especially when dealing with linear models, the predicted salary could escalate disproportionately or even be negative. We should safeguard against such issues before they propagate into the downstream models.
Notably, there are many ways in which things can go wrong, and as a result, there is no one-size-fits-all solution. This only emphasizes how crucial it is to have dedicated processes in place to catch as many potential issues as possible; it is far easier to act on a problem before any decision has been made based on inaccurate data. When validating our data, there are several vital checks to implement to ensure integrity and suitability for analysis (a plain-pandas sketch of these checks follows the list):
- Checking types – that is, making sure that the data types align with our expectations. For example, age should most likely be an integer.
- Checking formatting – ensuring that the formats (for example, dates or currencies) in the input data are consistent with our expectations and compatible with the operations we might later want to perform on those data points.
- Verifying correctness/constraints – for example, ensuring that a numeric value representing a month does not exceed 12.
- Checking uniqueness – ensuring that our data is free of duplicates. If supposedly unique identifiers (customer ID, bank account number, etc.) are not in fact unique, we can easily run into severe issues while further processing our data.
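To make this more concrete, here is a minimal sketch of what such checks could look like with plain pandas, before reaching for any dedicated library (the dataframe and column names are made up for illustration):

import pandas as pd

# Hypothetical customer data; the column names are illustrative only.
df = pd.DataFrame(
    {"customer_id": [1, 2, 3], "age": [34, 51, 27], "signup_month": [3, 11, 7]}
)

# Type check: age should be stored as integers.
assert pd.api.types.is_integer_dtype(df["age"]), "age must be an integer column"

# Constraint check: a month must fall between 1 and 12.
assert df["signup_month"].between(1, 12).all(), "signup_month outside the 1-12 range"

# Uniqueness check: identifiers must not contain duplicates.
assert df["customer_id"].is_unique, "duplicate customer_id values found"

With clean data, the assertions pass silently; a violation raises an AssertionError and stops the pipeline before the bad data travels any further.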
From the perspective of data scientists, having these procedural checks in place also allows them to focus on debugging and improving the machine learning models instead of tediously inspecting every single step of a highly complex ETL pipeline. Being able to trust the data is a significant advantage, as it allows data professionals to focus on where their expertise lies. It is still essential to continuously explore the data through exploratory data analysis (EDA), as this can reveal discrepancies or new issues that were not caught during the initial validation process.
At this point, I hope you do not need any further convincing that data validation is crucial and should not be neglected. Something data professionals often struggle with is the duality of the problem: on the one hand, they know how vital validation is; on the other hand, data validation is not the most rewarding task. Thankfully, quite a few tools and libraries make this job much easier and more enjoyable.
This section will cover some of the most popular Python tools for validating our data. We will briefly describe each one and provide a simple example of how to use it. One thing to remember is that the examples will cover basic usage, as each tool offers enough complexity to deserve an entire article (or a few articles) dedicated only to that tool. For more examples, please refer to the respective documentation or GitHub repository.
1. Pydantic
GitHub ⭐: 18.7k
Pydantic is the most widely used data validation library for Python. Some of the reasons why it is so popular:
- Pydantic uses type annotations for schema validation. Thanks to that, it reduces the learning curve and the amount of code the users have to write.
- The tool seamlessly integrates with popular IDEs and static analysis tools.
- As its validation logic is written in Rust, Pydantic is one of the fastest data validation libraries.
- Pydantic supports many validators out of the box (e.g., email addresses, phone numbers, country codes) and additionally offers the possibility to build custom validators (a short sketch of one follows the basic example below). This means users are virtually unrestricted and can implement custom solutions for their use cases.
- Pydantic is battle-tested and extensively used by all FAANG companies.
Below, you can see an example of Pydantic in action. We first define a data model. In this case, we are considering a job candidate and want to store their ID, name, surname, and age. We restrict the age to the range of 18 to 45. Then, we create a dictionary containing the required values.
from typing_extensions import Annotated
from pydantic import BaseModel, Field


class JobCandidate(BaseModel):
    id: int
    name: str
    surname: str
    age: Annotated[int, Field(strict=True, ge=18, le=45)]


input = {
    "id": 1,
    "name": "Alan",
    "surname": "Poe",
    "age": 46,
}

JobCandidate(**input)
As you can see, the age does not align with the requirement. That is why when we validate the input against the data model, we will receive the following error:
ValidationError: 1 validation error for JobCandidate
age
Input should be less than or equal to 45 [type=less_than_equal, input_value=46, input_type=int]
For further information visit https://errors.pydantic.dev/2.7/v/less_than_equal
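The custom validators mentioned earlier deserve a quick illustration as well. Below is a minimal sketch (the model name and the salary rule are hypothetical) that rejects non-positive salary predictions, the kind of safeguard discussed in the point about validating model outputs:

from pydantic import BaseModel, field_validator


class SalaryPrediction(BaseModel):
    candidate_id: int
    predicted_salary: float

    # Hypothetical business rule: a predicted salary must be strictly positive.
    @field_validator("predicted_salary")
    @classmethod
    def salary_must_be_positive(cls, value: float) -> float:
        if value <= 0:
            raise ValueError("predicted salary must be positive")
        return value


SalaryPrediction(candidate_id=1, predicted_salary=-1200.0)  # raises ValidationError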
A potential drawback of Pydantic in the ML context is that it was not built with dataframes in mind. As such, we would have to write a lot of boilerplate code to map the dataframe rows into data classes and then validate each one with Pydantic.
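As a rough sketch of what that boilerplate might look like (reusing the JobCandidate model defined above), we could iterate over the rows and collect validation errors ourselves:

import pandas as pd
from pydantic import ValidationError

candidates_df = pd.DataFrame(
    {"id": [1, 2], "name": ["Alan", "John"], "surname": ["Poe", "Doe"], "age": [46, 19]}
)

errors = []
for row in candidates_df.to_dict(orient="records"):
    try:
        # Depending on the pandas version, numpy scalars may need converting
        # to native Python types before strict validation.
        JobCandidate(**row)
    except ValidationError as exc:
        errors.append((row["id"], exc.errors()))

print(errors)  # the first candidate fails the age constraint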
2. Marshmallow
GitHub ⭐: 6.9k
Marshmallow is a library that easily converts complex data types to and from native Python datatypes. It also allows us to validate data by defining and enforcing schemas, ensuring input data follows the specified rules.
Below, we demonstrate a Marshmallow equivalent of the example we just saw with Pydantic.
from marshmallow import Schema, fields, validate


class JobCandidate(Schema):
    id = fields.Int()
    name = fields.Str()
    surname = fields.Str()
    age = fields.Int(validate=validate.Range(min=18, max=45))


input_data = {
    "id": 1,
    "name": "Alan",
    "surname": "Poe",
    "age": 46,
}

job_candidate_schema = JobCandidate()
job_candidate_schema.load(input_data)
Running the code snippet results in the following error:
ValidationError: {'age': ['Must be greater than or equal to 18 and less than or equal to 45.']}
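Validation is only half of Marshmallow’s job, though. The “to and from native Python datatypes” part mentioned above boils down to dump() and load(); here is a minimal sketch with a hypothetical schema containing a datetime field:

import datetime as dt
from marshmallow import Schema, fields


class ApplicationSchema(Schema):
    candidate_id = fields.Int()
    submitted_at = fields.DateTime()


application = {"candidate_id": 1, "submitted_at": dt.datetime(2024, 5, 1, 12, 30)}

# dump() serializes Python objects into JSON-friendly primitives,
# e.g. the datetime becomes an ISO 8601 string.
print(ApplicationSchema().dump(application))
# {'candidate_id': 1, 'submitted_at': '2024-05-01T12:30:00'}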
As Marshmallow is quite similar to Pydantic, it also isn’t explicitly designed with dataframe support in mind.
3. jsonschema
GitHub ⭐: 4.4k
jsonschema is a Python implementation of the JSON Schema specification, which is a vocabulary that enables consistency and validity of JSON data at scale. Again, we will demonstrate the very same example using the jsonschema library. As you can see, the biggest difference is that we use a dictionary to create the schema of a job candidate.
from jsonschema import validate

job_candidate_schema = {
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
        "surname": {"type": "string"},
        "age": {"type": "integer", "minimum": 18, "maximum": 45},
    }
}

input_data = {
    "id": 1,
    "name": "Alan",
    "surname": "Poe",
    "age": 46,
}

validate(instance=input_data, schema=job_candidate_schema)
As before, running the snippet generates the expected error:
ValidationError: 46 is greater than the maximum of 45
Failed validating 'maximum' in schema['properties']['age']:
{'maximum': 45, 'minimum': 18, 'type': 'integer'}
On instance['age']:
46
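One thing to keep in mind: the schema above only constrains properties that are present. If we also want to reject records with missing fields, JSON Schema’s "required" keyword (plus an explicit "type": "object") does the job; a small sketch:

from jsonschema import ValidationError, validate

strict_schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
        "surname": {"type": "string"},
        "age": {"type": "integer", "minimum": 18, "maximum": 45},
    },
    # Without "required", absent fields would pass validation silently.
    "required": ["id", "name", "surname", "age"],
}

try:
    validate(instance={"id": 2, "name": "John"}, schema=strict_schema)
except ValidationError as exc:
    print(exc.message)  # e.g. 'surname' is a required property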
4. Pandera
GitHub ⭐: 3k
Pandera is a data validation library designed specifically for dataframes. With Pandera, we can validate data within dataframes at runtime, which can be particularly useful for critical data pipelines running in production. A handy feature of Pandera is its ability to reuse defined schemas and apply them to various types of dataframes, including pandas, dask, modin, and pyspark.pandas. Additionally, the library supports checking column types in a pandas dataframe and performing statistical analysis such as hypothesis testing.
Let’s once again adjust the job candidate example. Since Pandera works with pandas dataframes, we will create a dataframe containing two candidates and test it against the same conditions.
import pandas as pd
import pandera as pa

job_candidate_schema = pa.DataFrameSchema(
    {
        "id": pa.Column(pa.Int),
        "name": pa.Column(pa.String),
        "surname": pa.Column(pa.String),
        "age": pa.Column(pa.Int, checks=[pa.Check.ge(18), pa.Check.le(45)]),
    }
)

test_df = pd.DataFrame(
    {"id": [1, 2], "name": ["Alan", "John"], "surname": ["Poe", "Doe"], "age": [46, 19]}
)

validated_df = job_candidate_schema(test_df)
After running the code snippet, we get the following error:
SchemaError: Column 'age' failed element-wise validator number 1: less_than_or_equal_to(45) failure cases: 46
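By default, Pandera stops at the first failing check. When debugging a pipeline, it is often more useful to collect every failure at once, which lazy validation does; a minimal sketch reusing the schema and dataframe from above:

import pandera as pa

try:
    # lazy=True gathers all failing checks instead of raising on the first one.
    job_candidate_schema.validate(test_df, lazy=True)
except pa.errors.SchemaErrors as err:
    print(err.failure_cases)  # one row per failing value/check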
5. Great Expectations
GitHub ⭐: 9.5k
Great Expectations is the most popular and comprehensive tool for validating data in data science and data engineering workflows. Some of its key characteristics include:
- Integration with various data environments, such as Pandas, Spark, and SQL databases.
- The ability to define expectation suites containing sets of checks for our data.
- Automatic documentation, including detailed reports explaining why checks failed, and data profiling that suggests which quality checks could be applied to our dataframe.
- Thanks to its extensibility, users can create custom expectations, data connectors, and plugins to tailor the validation process to their specific needs.
Once again, let’s adjust our toy example to be validated with Great Expectations:
import pandas as pd
import great_expectations as ge

test_df = pd.DataFrame(
    {"id": [1, 2], "name": ["Alan", "John"], "surname": ["Poe", "Doe"], "age": [46, 19]}
)

# Wrap the dataframe in a PandasDataset so expectations can be attached to it
expectation_suite = ge.dataset.PandasDataset(test_df)

# Define expectations for each column
expectation_suite.expect_column_to_exist("id")
expectation_suite.expect_column_to_exist("name")
expectation_suite.expect_column_to_exist("surname")
expectation_suite.expect_column_values_to_be_between(
    column="age",
    min_value=18,
    max_value=45,
)

# Validate the DataFrame against the defined expectations
validation_result = expectation_suite.validate()
After running the code snippet, we notice the most significant difference: the output. I won’t paste the entire report here, as it would span a few pages. Instead, we’ll focus on two of its elements:
- `success` – a boolean flag indicating whether the validation outcome of the entire suite was positive.
- `results` – the validation result of each individual check (a short sketch of how to act on these fields follows this list).
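Both fields are convenient to act on programmatically, for example, to halt a pipeline; a minimal sketch using the validation_result object returned by the PandasDataset API above:

# Stop the pipeline (or alert someone) whenever the suite does not pass.
if not validation_result.success:
    for result in validation_result.results:
        if not result.success:
            print("Failed:", result.expectation_config.expectation_type)
    raise ValueError("Data validation failed - see the report above")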
Overall, the expectation suite failed, as at least one expectation was unmet. Let’s examine two examples from the `results` section. First, a successful check verifying the presence of the `id` column:
{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_to_exist",
    "kwargs": {
      "column": "id",
      "result_format": "BASIC"
    },
    "meta": {}
  },
  "result": {},
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_message": null,
    "exception_traceback": null
  }
}
This one has succeeded, as we have included that column in the dataframe. The second snippet will display the results of a test that failed:
{
  "success": false,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_between",
    "kwargs": {
      "column": "age",
      "min_value": 18,
      "max_value": 45,
      "result_format": "BASIC"
    },
    "meta": {}
  },
  "result": {
    "element_count": 2,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 1,
    "unexpected_percent": 50.0,
    "unexpected_percent_total": 50.0,
    "unexpected_percent_nonmissing": 50.0,
    "partial_unexpected_list": [
      46
    ]
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_message": null,
    "exception_traceback": null
  }
}
In this report, we can see that one of the two values failed the validation check, and the unexpected value was 46.
Overall, thanks to the thoroughness of the reports generated by Great Expectations, we can decide precisely how to react to failures. For example, we can allow some errors to occur if they do not happen too often, or halt the entire pipeline if even a single observation fails a check.
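For instance, column-map expectations accept a mostly argument that tolerates a configurable fraction of failing rows; a quick sketch building on the suite defined earlier:

# Pass the check as long as at least 95% of the rows fall within the range.
expectation_suite.expect_column_values_to_be_between(
    column="age",
    min_value=18,
    max_value=45,
    mostly=0.95,
)

With only two rows in our toy dataframe, one failure is already 50%, so this particular check would still fail; on a realistically sized dataset, mostly lets occasional outliers through without failing the whole suite.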
This article has highlighted the vital role of data validation and provided a roadmap to some of the most efficient tools available in 2024. By employing these tools, organizations can proactively identify and correct data issues, enhancing the reliability of model outputs. The question remains: which one to choose?
Use Pydantic if you are validating single inputs/outputs of a model. For example, if your model runs in production and accepts a JSON string with feature values, you can use Pydantic to validate that input and ensure everything is in order. The same goes for the output of an ML model before it is served to the stakeholders.
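A minimal sketch of that pattern, reusing the JobCandidate model from earlier (the JSON payload is made up):

from pydantic import ValidationError

raw_request = '{"id": 7, "name": "Jane", "surname": "Roe", "age": 31}'

try:
    candidate = JobCandidate.model_validate_json(raw_request)
except ValidationError as exc:
    # Reject the request before it ever reaches the model.
    print(exc.errors())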
Choose Great Expectations if you are evaluating more than a single observation. Thanks to the comprehensive report generated by the library and its customizability, you can create virtually any check you need in your workflow.
After accounting for data validation, explore other crucial steps in a data pipeline, such as data preparation for ML models and data versioning.