Building a Simple Data Quality DSL in Python


# Introduction

Data validation code in Python is often a pain to maintain. Business rules are buried in nested if statements, validation logic gets mixed up with error handling, and adding new checks often means digging through long functions to find the right place to insert code. Yes, there are data validation frameworks you can use, but we’ll focus on building something very simple and useful with plain Python.

Let’s create one by writing a simple Domain Specific Language (DSL): a small vocabulary designed especially for data validation. Instead of writing generic Python code, you create specialized functions and classes that express validation rules in terms that reflect your problem.

For data validation, this means that business requirements read like: “Customer ages must be between 18 and 120” or “Email addresses must contain the @ symbol and a valid domain.” You want the DSL to handle the mechanics of checking data and reporting violations, while you focus on expressing what valid data looks like. The result is validation logic that is readable, easy to maintain and test, and easy to extend. So, let’s start coding!

🔗 Link to the code on GitHub

# Why build a DSL?

Consider validating customer data with Python:

def validate_customers(df):
    errors = []
    if df['customer_id'].duplicated().any():
        errors.append("Duplicate IDs")
    if (df['age'] < 0).any():
        errors.append("Negative ages")
    if not df['email'].str.contains('@').all():
        errors.append("Invalid emails")
    return errors

This approach hard-codes the validation logic, mixes business rules with error handling, and becomes unwieldy as the rules multiply. Instead, we want to write a DSL that separates these concerns and creates reusable validation components.

Instead of writing procedural validation functions, a DSL lets you express rules that read like business requirements:

# Traditional approach
if df['age'].min() < 0 or df['age'].max() > 120:
    raise ValueError("Invalid ages found")

# DSL approach  
validator.add_rule(Rule("Valid ages", between('age', 0, 120), "Ages must be 0-120"))

The DSL approach separates what you’re validating (business rules) from how violations are handled (error reporting). This makes the validation logic testable, reusable, and readable by non-programmers.

# Creating a sample dataset

Start by creating a sample, realistic e-commerce customer dataset with common quality issues:

import pandas as pd

customers = pd.DataFrame({
    'customer_id': [101, 102, 103, 103, 105],
    'email': ['john@gmail.com', 'invalid-email', '', 'sarah@yahoo.com', 'mike@domain.co'],
    'age': [25, -5, 35, 200, 28],
    'total_spent': [250.50, 1200.00, 0.00, -50.00, 899.99],
    'join_date': ['2023-01-15', '2023-13-45', '2023-02-20', '2023-02-20', '']
})  # Note: 2023-13-45 is an intentionally malformed date.

This dataset has duplicate customer IDs, invalid email formats, impossible ages, negative spending amounts, and bad dates. That should work well for exercising our validation rules.

# Writing validation logic

// Creating a rule class

Let’s start by writing a simple `Rule` class that wraps the validation logic:

class Rule:
    def __init__(self, name, condition, error_msg):
        self.name = name
        self.condition = condition
        self.error_msg = error_msg
    
    def check(self, df):
        # The condition function returns True for VALID rows.
        # We use ~ (bitwise NOT) to select the rows that VIOLATE the condition.
        violations = df[~self.condition(df)]
        if not violations.empty:
            return {
                'rule': self.name,
                'message': self.error_msg,
                'violations': len(violations),
                'sample_rows': violations.head(3).index.tolist()
            }
        return None

The `condition` parameter accepts any function that takes a DataFrame and returns a boolean Series identifying the valid rows. The tilde operator (`~`) inverts this boolean Series to identify the violations. When violations exist, the `check` method returns information about them, including the rule name, error message, violation count, and sample row indexes for debugging.

This design separates validation logic from error reporting. The `condition` function focuses entirely on the business rule, while the `Rule` class handles the error details.
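
For instance, here is a minimal sketch of a single rule defined with an inline lambda and checked against the `customers` DataFrame above (the rule name and message are just illustrative):

non_negative_age = Rule(
    "Non-negative age",
    lambda df: df['age'] >= 0,   # True for valid rows
    "Ages cannot be negative"
)
print(non_negative_age.check(customers))
# Reports 1 violation (the row with age -5) with its index in sample_rows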

// Adding multiple rules

Next, let’s code up a `DataValidator` class that manages a collection of rules:

class DataValidator:
    def __init__(self):
        self.rules = []
    
    def add_rule(self, rule):
        self.rules.append(rule)
        return self  # Enables method chaining
    
    def validate(self, df):
        results = []
        for rule in self.rules:
            violation = rule.check(df)
            if violation:
                results.append(violation)
        return results

The `add_rule` method returns `self` to enable method chaining. The `validate` method executes each rule independently and collects the violation reports. This approach ensures that one failing rule does not prevent the others from running.
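
Because `add_rule` returns the validator itself, rules can be stacked in a single fluent expression. A small sketch (the name `demo_validator` and the two rules are just placeholders):

demo_validator = (
    DataValidator()
    .add_rule(Rule("Non-negative age", lambda df: df['age'] >= 0, "Ages cannot be negative"))
    .add_rule(Rule("Email present", lambda df: df['email'] != '', "Email is required"))
)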

// Constructing readable conditions

Remember that when instantiating a `Rule`, we also need a `condition` function. This can be any function that takes a DataFrame and returns a boolean Series. Although simple lambda functions work, they are not very easy to read. So let’s write helper functions to create readable validation conditions:

def not_null(column):
    return lambda df: df[column].notna()

def unique_values(column):
    return lambda df: ~df.duplicated(subset=[column], keep=False)

def between(column, min_val, max_val):
    return lambda df: df[column].between(min_val, max_val)

Each helper function returns a lambda that works with pandas boolean operations, as the short sketch after this list shows.

  • The `not_null` helper uses pandas’ `notna()` method to identify non-null values.
  • The `unique_values` helper uses `duplicated(..., keep=False)` with the `subset` parameter to flag all duplicate occurrences, ensuring a more accurate violation count.
  • The `between` helper uses pandas’ `between()` method, which handles the range check automatically.
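
Here is a quick sketch of how one of these helpers behaves on the sample `customers` DataFrame: calling the helper returns a function, and calling that function on the DataFrame returns a boolean Series.

age_check = between('age', 0, 120)
print(age_check(customers))
# True for the valid ages, False for the rows with -5 and 200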

For pattern matching, regular expressions come in handy:

import re

def matches_pattern(column, pattern):
    return lambda df: df[column].str.match(pattern, na=False)

The `na=False` parameter ensures that missing values are treated as validation failures rather than matches, which is usually the desired behavior for required fields.
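
To see why this matters, here is a small standalone illustration (a throwaway Series, not part of the dataset) of how `str.match` treats a missing value with and without `na=False`:

s = pd.Series(['user@test.com', None])
print(s.str.match(r'^[^@\s]+@[^@\s]+$'))            # missing value propagates as NaN
print(s.str.match(r'^[^@\s]+@[^@\s]+$', na=False))  # missing value counts as a failure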

# Creating a data validator for a sample dataset

Let’s now create a validator for the customer dataset to see how this DSL works:

validator = DataValidator()

validator.add_rule(Rule(
    "Unique customer IDs",
    unique_values('customer_id'),
    "Customer IDs must be unique across all records"
))

validator.add_rule(Rule(
    "Valid email format",
    matches_pattern('email', r'^[^@\s]+@[^@\s]+\.[^@\s]+$'),
    "Email addresses must contain @ symbol and domain"
))

validator.add_rule(Rule(
    "Reasonable customer age",
    between('age', 13, 120),
    "Customer age must be between 13 and 120 years"
))

validator.add_rule(Rule(
    "Non-negative spending",
    lambda df: df['total_spent'] >= 0,
    "Total spending amount cannot be negative"
))

Each rule follows the same pattern: a descriptive name, a validation condition, and an error message.

  • The first rule uses the `unique_values` helper function to check for duplicate customer IDs.
  • The second rule applies a regular expression pattern to validate email formats. This pattern requires at least one character before and after the @ symbol, plus a domain extension.
  • The third rule uses the `between` helper for range validation, setting an appropriate age range for users.
  • The final rule uses an inline lambda function to check that `total_spent` values are non-negative.

Notice how each rule reads like a business requirement. The validator collects these rules and can execute them against any DataFrame with the expected column names.

issues = validator.validate(customers)

print("Validation Results:")
for issue in issues:
    print(f"❌ Rule: {issue['rule']}")
    print(f"   Problem: {issue['message']}")
    print(f"   Violations: {issue['violations']}")
    print(f"   Affected rows: {issue['sample_rows']}")
    print()

The output clearly identifies specific problems and their locations in the dataset, making debugging straightforward. For the sample data, you will get the following output:

Validation Results:
❌ Rule: Unique customer IDs
   Problem: Customer IDs must be unique across all records
   Violations: 2
   Affected rows: [2, 3]

❌ Rule: Valid email format
   Problem: Email addresses must contain @ symbol and domain
   Violations: 3
   Affected rows: [1, 2, 4]

❌ Rule: Reasonable customer age
   Problem: Customer age must be between 13 and 120 years
   Violations: 2
   Affected rows: [1, 3]

❌ Rule: Non-negative spending
   Problem: Total spending amount cannot be negative
   Violations: 1
   Affected rows: [3]

# Adding cross-column validation

Real business rules often involve relationships between columns. Custom condition functions handle this kind of complex validation logic:

def high_spender_email_required(df):
    high_spenders = df['total_spent'] > 500
    has_valid_email = df['email'].str.contains('@', na=False)
    # Passes if: (Not a high spender) OR (Has a valid email)
    return ~high_spenders | has_valid_email

validator.add_rule(Rule(
    "High Spenders Need Valid Email",
    high_spender_email_required,
    "Customers spending over $500 must have valid email addresses"
))

This rule uses boolean logic where high-spending customers must have valid emails, but low-spending customers may be missing contact information. The expression `~high_spenders | has_valid_email` translates to “not a high spender, or has a valid email,” which allows low spenders to pass validation regardless of email status.
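
This is the standard way to encode an implication (“if A then B” is equivalent to “not A or B”) with boolean Series. Another cross-column rule in the same style might look like this (a purely hypothetical rule, assuming the sample columns):

def spenders_need_join_date(df):
    has_spent = df['total_spent'] > 0
    has_join_date = df['join_date'] != ''
    # Implication: anyone who has spent money must have a join date on record
    return ~has_spent | has_join_date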

# Dealing with date validation

Date validation requires careful handling because date parsing may fail:

def valid_date_format(column, date_format="%Y-%m-%d"):
    def check_dates(df):
        # pd.to_datetime with errors="coerce" turns invalid dates into NaT (Not a Time)
        parsed_dates = pd.to_datetime(df[column], format=date_format, errors="coerce")
        # A row is valid if the original value is not null AND the parsed date is not NaT
        return df[column].notna() & parsed_dates.notna()
    return check_dates

validator.add_rule(Rule(
    "Valid Join Dates",
    valid_date_format('join_date'),
    "Join dates must follow YYYY-MM-DD format"
))

Validation passes only if the original value is not null and the parsed date is valid (i.e., not NaT). We avoid an unnecessary try-except block by relying on `errors="coerce"` in `pd.to_datetime`, which converts malformed strings to NaT gracefully; those rows are then caught by `parsed_dates.notna()`.
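
Here is a quick standalone sketch of what `errors="coerce"` does with the malformed date from the sample data:

dates = pd.Series(['2023-01-15', '2023-13-45', ''])
print(pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce'))
# 2023-01-15 parses normally; '2023-13-45' and '' both become NaT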

# Writing decorator integration samples

For production pipelines, you can write decorator patterns that provide clean integration.

def validate_dataframe(validator):
    def decorator(func):
        def wrapper(df, *args, **kwargs):
            issues = validator.validate(df)
            if issues:
                error_details = [f"{issue['rule']}: {issue['violations']} violations" for issue in issues]
                raise ValueError(f"Data validation failed: {'; '.join(error_details)}")
            return func(df, *args, **kwargs)
        return wrapper
    return decorator

# Note: 'customer_validator' needs to be defined globally or passed in a real implementation
# Assuming 'customer_validator' is the instance we built earlier
# @validate_dataframe(customer_validator)
def process_customer_data(df):
    return df.groupby('age').agg({'total_spent': 'sum'})

This decorator ensures that data undergoes validation before processing begins, preventing bad data from propagating through the pipeline. The decorator raises a descriptive error that includes the specific validation failures. As the comment in the snippet notes, `customer_validator` needs to be defined and accessible wherever the decorator is applied.
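
For example, wiring in the validator we assembled earlier might look like this (a sketch; `customer_validator` is just an alias for that instance):

customer_validator = validator  # the DataValidator built earlier

@validate_dataframe(customer_validator)
def process_customer_data(df):
    return df.groupby('age').agg({'total_spent': 'sum'})

try:
    process_customer_data(customers)
except ValueError as e:
    print(e)  # lists each failed rule and its violation count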

# Expanding the pattern

You can extend the DSL to include other validation rules as needed:

# Statistical outlier detection
def within_standard_deviations(column, std_devs=3):
    # Valid if absolute difference from mean is within N standard deviations
    return lambda df: abs(df[column] - df[column].mean()) <= std_devs * df[column].std()

# Referential integrity across datasets
def foreign_key_exists(column, reference_df, reference_column):
    # Valid if value in column is present in the reference_column of the reference_df
    return lambda df: df[column].isin(reference_df[reference_column])

# Custom business logic
def profit_margin_reasonable(df):
    # Ensures 0 <= margin <= 1
    margin = (df['revenue'] - df['cost']) / df['revenue']
    return (margin >= 0) & (margin <= 1)

This way, you can structure the validation logic as composable functions that return a boolean Series.

Here’s an end-to-end example of how you might use the data quality DSL we built, assuming the classes and helper functions live in a module called data_quality_dsl:

import pandas as pd
from data_quality_dsl import DataValidator, Rule, unique_values, between, matches_pattern

# Sample data
df = pd.DataFrame({
    'user_id': [1, 2, 2, 3],
    'email': ['user@test.com', 'invalid', 'user@real.com', ''],
    'age': [25, -5, 30, 150]
})

# Build validator
validator = DataValidator()
validator.add_rule(Rule("Unique users", unique_values('user_id'), "User IDs must be unique"))
validator.add_rule(Rule("Valid emails", matches_pattern('email', r'^[^@]+@[^@]+\.[^@]+$'), "Invalid email format"))
validator.add_rule(Rule("Reasonable ages", between('age', 0, 120), "Age must be 0-120"))

# Run validation
issues = validator.validate(df)
for issue in issues:
    print(f"❌ {issue['rule']}: {issue['violations']} violations")

# The result

This DSL, although simple, works because it is consistent with how data professionals think about validation. Rules express business logic as easy-to-understand requirements while relying on pandas for both performance and flexibility.

The separation of concerns makes validation logic testable and maintainable. This approach requires no external dependencies beyond pandas and introduces no learning curve for those already familiar with pandas operations.
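
Because each condition is just a function from a DataFrame to a boolean Series, individual rules can be unit tested in isolation. A minimal sketch using a plain assert (a hypothetical test, not part of the code above):

def test_between_flags_out_of_range_ages():
    test_df = pd.DataFrame({'age': [25, -5, 130]})
    result = between('age', 0, 120)(test_df)
    assert result.tolist() == [True, False, False]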

This is something I worked on over a couple of evening coding sprints and several cups of coffee (of course!). But you can use this version as a starting point and make something much cooler. Happy coding!

Bala Priya C is a developer and technical writer from India. She loves working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.
