Welcome to an exploratory journey into data validation with Pandera, a lesser-known yet powerful tool in the data scientist’s toolkit. This tutorial aims to illuminate the path for those seeking to fortify their data processing pipelines with robust validation techniques.
Pandera is a Python library that provides flexible and expressive data validation for pandas data structures. It’s designed to bring more rigor and reliability to your data processing steps, ensuring that your data conforms to specified formats, types, and other constraints before you proceed with analysis or modeling.
In the intricate tapestry of data science, where data is the fundamental thread, ensuring its quality and consistency is paramount. Pandera promotes the integrity and quality of data through rigorous validation. It’s not just about checking data types or formats; Pandera extends its vigilance to more sophisticated statistical validations, making it an indispensable ally in your data science endeavours. Specifically, Pandera stands out by offering:
- Schema enforcement: Guarantees that your DataFrame adheres to a predefined schema.
- Customisable validation: Enables creation of complex, custom validation rules.
- Integration with Pandas: Seamlessly works with existing pandas workflows.
Let’s start by installing Pandera. This can be done using pip:
pip install pandera
A schema in Pandera defines the expected structure, data types, and constraints of your DataFrame. We’ll begin by importing the essential libraries and defining a simple schema.
import pandas as pd
from pandas import Timestamp
import pandera as pa
from pandera import Column, DataFrameSchema, Check, Index

schema = DataFrameSchema({
    "name": Column(str),
    "age": Column(int, checks=pa.Check.ge(0)),  # age must be non-negative
    "email": Column(str, checks=pa.Check.str_matches(r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$'))  # email format
})
This schema specifies that our DataFrame must have three columns: name (string), age (integer, non-negative), and email (string, matching a regular expression for email addresses). Now, with our schema in place, let’s validate a DataFrame.
# Sample DataFrame
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, -5, 30],
    "email": ["alice@example.com", "bob@example", "charlie@example.com"]
})

# Validate
validated_df = schema(df)
In this instance, Pandera will raise a SchemaError
because Bob’s age is negative, which violates our schema.
SchemaError: failed element-wise validator 0:
failure cases:
   index  failure_case
0      1            -5
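By default, validation stops at the first failing column. If you would rather collect every failure in a single pass, Pandera also supports lazy validation; here is a minimal sketch using the schema defined above:

# Collect all failures at once instead of stopping at the first one
try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    print(err.failure_cases)  # a DataFrame summarising every failed check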
One of Pandera’s strengths is how seamlessly validation can be built into your own data processing functions.
@pa.check_input(schema)
def process_data(df: pd.DataFrame) -> pd.DataFrame:
    # Some code to process the DataFrame
    return df

processed_df = process_data(df)
The @pa.check_input
decorator ensures that the input DataFrame adheres to the schema before the function processes it.
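There is also a sibling decorator, pa.check_output, for validating what a function returns rather than what it receives. A brief sketch, using a hypothetical clean_data function of my own:

@pa.check_output(schema)
def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    # Drop the rows that would violate the schema (e.g. negative ages)
    return df[df["age"] >= 0]

cleaned_df = clean_data(df)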
Now, let’s explore more complex validations that Pandera offers. Building upon the existing schema, we can add additional columns with various data types and more sophisticated checks. We’ll introduce columns for categorical data and datetime data, and implement more advanced checks such as ensuring unique values or referencing other columns.
# Define the enhanced schema
enhanced_schema = DataFrameSchema(
    columns={
        "name": Column(str),
        "age": Column(int, checks=[Check.ge(0), Check.lt(100)]),
        "email": Column(str, checks=[Check.str_matches(r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$')]),
        "salary": Column(float, checks=Check.in_range(30000, 150000)),
        "department": Column(str, checks=Check.isin(["HR", "Tech", "Marketing", "Sales"])),
        "start_date": Column(pd.Timestamp, checks=Check(lambda x: x < pd.Timestamp("today"))),
        "performance_score": Column(float, nullable=True)
    },
    index=Index(int, name="employee_id")
)

# Custom check function
def salary_age_relation_check(df: pd.DataFrame) -> pd.DataFrame:
    if not all(df["salary"] / df["age"] < 3000):
        raise ValueError("Salary to age ratio check failed")
    return df
# Function to process and validate data
def process_data(df: pd.DataFrame) -> pd.DataFrame:
    # Apply custom check
    df = salary_age_relation_check(df)

    # Validate DataFrame with Pandera schema
    return enhanced_schema.validate(df)
In this enhanced schema, we’ve added:
- Categorical data: the department column validates against a specific set of categories.
- Datetime data: the start_date column ensures dates are in the past.
- Nullable column: the performance_score column can have missing values.
- Index validation: an index employee_id of type integer is defined.
- Complex check: a custom function salary_age_relation_check ensures a plausible relationship between salary and age.
- Integration of the custom check into the data processing function: we apply salary_age_relation_check directly inside process_data.
- Use of Pandera’s validate method: instead of using the @pa.check_input decorator, we manually validate the DataFrame with the validate method provided by Pandera.
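The uniqueness checks mentioned earlier aren’t shown in the schema above, but they are easy to express too. A small sketch, assuming we also wanted employee email addresses to be unique (recent Pandera versions accept a unique argument on Column):

# Hypothetical variant of the email column that also rejects duplicates
unique_email_column = Column(
    str,
    checks=Check.str_matches(r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$'),
    unique=True,  # duplicate email addresses will fail validation
)

Dropping this in place of the existing email column would make duplicate addresses a validation error as well.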
Now, let’s create an example DataFrame df_example
that matches the structure and constraints of our enhanced schema and validate it.
df_example = pd.DataFrame({
    "employee_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 35, 45],
    "email": ["alice@example.com", "bob@example.com", "charlie@example.com"],
    "salary": [50000, 80000, 120000],
    "department": ["HR", "Tech", "Sales"],
    "start_date": [Timestamp("2022-01-01"), Timestamp("2021-06-15"), Timestamp("2020-12-20")],
    "performance_score": [4.5, 3.8, 4.2]
})

# Make sure the employee_id column is the index
df_example.set_index("employee_id", inplace=True)

# Process and validate data
processed_df = process_data(df_example)
Here, Pandera will raise a SchemaError due to a mismatch between the expected data type of the salary column in enhanced_schema (float, which corresponds to float64 in pandas/NumPy terms) and the actual data type present in df_example (int, or int64 in pandas/NumPy terms).
SchemaError: expected series 'salary' to have type float64, got int64
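If you would rather cast compatible values than reject them, Pandera schemas and columns accept a coerce option. A minimal sketch, assuming (as the error message suggests) the dtype is the only problem with df_example, and using update_column to get a copy of the schema with the tweaked column:

# Ask Pandera to cast the salary column to float64 before running its checks
coercing_schema = enhanced_schema.update_column("salary", coerce=True)

processed_df = coercing_schema.validate(df_example)
print(processed_df["salary"].dtype)  # float64 after coercion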
Pandera can perform statistical hypothesis tests as a part of the validation process. This feature is especially useful for validating assumptions about your data distributions or relationships between variables.
Suppose you want to ensure that the average salary in your dataset is around a certain value, say £75,000. We can define a custom check function that performs a one-sample t-test to evaluate whether the mean of a sample (here, the mean of the salaries in the dataset) differs significantly from a known value (in our case, £75,000).
from scipy.stats import ttest_1samp

# Define the custom check for the salary column
def mean_salary_check(series: pd.Series, expected_mean: float = 75000, alpha: float = 0.05) -> bool:
    stat, p_value = ttest_1samp(series.dropna(), expected_mean)
    return p_value > alpha

salary_check = Check(mean_salary_check, element_wise=False, error="Mean salary check failed")

# Update the checks for the salary column, specifying the column name
enhanced_schema.columns["salary"] = Column(float, checks=[Check.in_range(30000, 150000), salary_check], name="salary")
In the code above, we have:
- Defined the custom check function mean_salary_check, which takes a pandas Series (the salary column in our DataFrame) and performs a t-test against the expected mean. The function returns True if the p-value from the t-test is greater than the significance level (alpha = 0.05), indicating that the mean salary is not significantly different from £75,000.
- We then wrapped this function in a Pandera Check, specifying element_wise=False to indicate that the check is applied to the entire series rather than to each element individually.
- Finally, we updated the salary column in our Pandera schema to include this new check alongside the existing range check.
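Pandera also ships built-in hypothesis checks via its Hypothesis class, which wrap the same scipy machinery. The sketch below shows what the equivalent one-sample t-test might look like with it; treat the exact argument names as assumptions to verify against the Pandera docs:

from pandera import Hypothesis

# Built-in one-sample t-test against a population mean of £75,000
builtin_salary_check = Hypothesis.one_sample_ttest(
    popmean=75000,
    relationship="equal",  # assumed relationship name; check the docs
    alpha=0.05,
)

Such a check could then be slotted into the salary column’s checks list exactly like salary_check above.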
With these steps, our Pandera schema now includes a statistical test on the salary column. We deliberately increase the average salary in df_example to violate the schema’s expectation, so that Pandera will raise a SchemaError.
# Change the salaries to exceed the expected mean of £75,000
df_example["salary"] = [100000.0, 105000.0, 110000.0]

validated_df = enhanced_schema(df_example)
SchemaError: failed series or dataframe validator 1:
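In a real pipeline you will usually want to catch this rather than let it crash the job. A minimal sketch of handling the failure and inspecting the offending values:

try:
    validated_df = enhanced_schema(df_example)
except pa.errors.SchemaError as err:
    print("Validation failed:", err)
    print(err.failure_cases)  # the values that triggered the failure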
Pandera elevates data validation from a mundane checkpoint to a dynamic process that encompasses even complex statistical validations. By integrating Pandera into your data processing pipeline, you can catch inconsistencies and errors early, saving time, preventing headaches down the road, and paving the way for more reliable and insightful data analysis.
For those wishing to deepen their understanding of Pandera and its capabilities, the following resources serve as excellent starting points:
- Pandera Documentation: A comprehensive guide to all features and functionalities of Pandera (Pandera Docs).
- Pandas Documentation: As Pandera extends pandas, familiarity with pandas is crucial (Pandas Docs).
I’m not affiliated with Pandera in any capacity; I’m just very enthusiastic about it 🙂