5 Useful Python Scripts for Automated Data Quality Checks

by SkillAiNest


# Introduction

Data quality issues are everywhere. Missing values where there shouldn't be any. Dates in the wrong format. Duplicate records that slip through. Outliers that skew your analysis. Text fields with inconsistent capitalization and spelling variations. These issues can break your analytics pipelines and often lead to bad business decisions.

Manual data validation is tedious and error-prone. You need to check for the same problems over and over across multiple datasets, and it's easy to miss subtle issues. This article covers five practical Python scripts that handle the most common data quality problems.

Link to the code on GitHub.

# 1. Analyzing missing data

// Pain point

You receive a dataset that you expect to contain complete records, but null values, empty strings, and placeholder text such as "N/A" or "Unknown" are scattered throughout. Some columns are mostly empty; others have only a few gaps. You need to understand the extent of the problem before you can fix it.

// What does the script do?

Comprehensively scans a dataset for missing data in all its forms. Identifies patterns in missingness (random vs. systematic), calculates a completeness score for each column, and flags columns with excessive missing data. It also generates visual reports that show where your data gaps are.

// How it works

The script reads data from CSV, Excel, or JSON files, detecting various representations of missing values such as None, NaN, empty strings, and common placeholders. It then calculates the missing-data percentage by column and row, and identifies correlations between missing values across columns. Finally, it produces both summary statistics and detailed reports that include recommendations for addressing each type of missingness.
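The core idea can be sketched in a few lines with pandas. This is a minimal illustration, not the full script: the placeholder list and column names here are assumptions for the example.

```python
import pandas as pd

# Placeholder strings commonly used to represent missing data (assumed list)
PLACEHOLDERS = ["N/A", "n/a", "NA", "Unknown", "unknown", "", "null"]

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing counts and percentages,
    treating placeholder strings as missing too."""
    cleaned = df.replace(PLACEHOLDERS, pd.NA)
    missing = cleaned.isna().sum()
    report = pd.DataFrame({
        "missing_count": missing,
        "missing_pct": (missing / len(df) * 100).round(1),
    })
    return report.sort_values("missing_pct", ascending=False)

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "email": ["a@x.com", "N/A", None, "b@x.com"],
    "age": [34, None, 28, 41],
})
print(missing_report(df))
```

Here the `email` column reports two missing values because the "N/A" placeholder is counted alongside the true null.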

Get the missing data analyzer script.

# 2. Validating data types

// Pain point

Your dataset claims numeric IDs, but some are text. Date fields contain dates, times, or sometimes just random strings. The email column holds entries that are not valid email addresses. Such inconsistencies cause scripts to crash or produce incorrect calculations.

// What does the script do?

Verifies that each column has the expected data type. Checks numeric columns for non-numeric values, date columns for invalid dates, email and URL columns for proper formatting, and categorical columns for unexpected values. The script also provides detailed reports on type violations, with row numbers and examples.

// How it works

The script accepts a schema definition specifying the expected types for each column, uses regex patterns and validation libraries to check format compliance, identifies and reports rows that violate type expectations, calculates violation rates per column, and recommends appropriate data type conversion or cleanup actions.
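A stripped-down version of this schema-driven approach might look like the following. The schema format and email regex here are illustrative assumptions, not the script's actual interface:

```python
import pandas as pd

# Simplistic email pattern, for illustration only
EMAIL_RE = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"

# Map each expected type to a check returning a boolean mask of valid values
CHECKS = {
    "numeric": lambda s: pd.to_numeric(s, errors="coerce").notna(),
    "date": lambda s: pd.to_datetime(s, errors="coerce").notna(),
    "email": lambda s: s.astype(str).str.match(EMAIL_RE),
}

def validate_types(df: pd.DataFrame, schema: dict) -> dict:
    """Return {column: [row indices violating the expected type]}."""
    violations = {}
    for col, expected in schema.items():
        valid = CHECKS[expected](df[col])
        # Only flag non-null values of the wrong type; nulls are a
        # separate concern (see the missing data analyzer)
        bad = df.index[~valid & df[col].notna()].tolist()
        if bad:
            violations[col] = bad
    return violations

df = pd.DataFrame({
    "user_id": ["101", "102", "abc"],
    "email": ["a@x.com", "not-an-email", "b@y.org"],
})
print(validate_types(df, {"user_id": "numeric", "email": "email"}))
# {'user_id': [2], 'email': [1]}
```

In practice you would load the schema from a JSON or YAML file rather than hard-coding it.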

Get the data type validation script.

# 3. Detecting duplicate records

// Pain point

Your database should contain unique records, but duplicate entries keep appearing. Sometimes they are exact duplicates, sometimes only a few fields match. It might be the same customer whose name is spelled slightly differently, or transactions that have been mistakenly submitted twice. Finding them manually is extremely difficult.

// What does the script do?

Identifies duplicate and near-duplicate records using multiple detection strategies. Finds exact matches, fuzzy matches, and duplicates within specified column combinations based on match thresholds. Groups similar records together and calculates a confidence score for possible matches.

// How it works

The script uses hash-based exact matching for perfect duplicates and applies a fuzzy string matching algorithm based on Levenshtein distance for near duplicates. It lets you define key columns for partial matching, generates duplicate clusters with similarity scores, and exports detailed reports showing all probable duplicates with recommendations for deduplication.
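The fuzzy-matching step can be approximated with the standard library alone. This sketch uses `difflib.SequenceMatcher` as a stand-in for Levenshtein distance (the full script uses a dedicated library); the column names and threshold are assumptions:

```python
from difflib import SequenceMatcher

import pandas as pd

def find_near_duplicates(df: pd.DataFrame, key_cols: list, threshold: float = 0.85):
    """Compare every pair of rows on their concatenated key columns and
    return (row_i, row_j, similarity) for pairs above the threshold.
    O(n^2), so fine for samples but slow on large tables."""
    keys = df[key_cols].astype(str).agg(" ".join, axis=1).str.lower()
    pairs = []
    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            score = SequenceMatcher(None, keys.iloc[i], keys.iloc[j]).ratio()
            if score >= threshold:
                pairs.append((df.index[i], df.index[j], round(score, 2)))
    return pairs

df = pd.DataFrame({
    "name": ["Jon Smith", "John Smith", "Alice Lee"],
    "city": ["Boston", "Boston", "Denver"],
})
print(find_near_duplicates(df, ["name", "city"]))
```

"Jon Smith" and "John Smith" in the same city score well above the threshold, while "Alice Lee" matches nothing. Production-scale deduplication typically adds blocking (grouping on a cheap key first) to avoid comparing every pair.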

Get the duplicate record detector script.

# 4. Detecting outliers

// Pain point

Your analysis results look wrong. You dig in and find that someone entered 999 for an age, a transaction amount is negative when it should be positive, or a measurement is three orders of magnitude larger than the rest. Outliers skew statistics, break models, and are often difficult to spot in large datasets.

// What does the script do?

Automatically detects statistical outliers using several methods. Applies Z-score analysis, the interquartile range (IQR) method, and domain-specific rules. Flags extreme values, improbable values, and values that fall outside the expected range. Provides context for each outlier and suggests whether it is likely an error or a legitimate extreme value.

// How it works

The script analyzes numeric columns using configurable statistical thresholds, applies domain-specific validation rules, visualizes distributions with highlighted outliers, calculates outlier scores and confidence levels, and generates priority reports that first flag potential data corruption.
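The two standard statistical tests mentioned above are short enough to show in full. This is a minimal sketch with assumed default thresholds (1.5×IQR and 3 standard deviations):

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

def zscore_outliers(s: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Flag values more than `threshold` standard deviations from the mean."""
    z = (s - s.mean()) / s.std()
    return z.abs() > threshold

ages = pd.Series([34, 29, 41, 37, 33, 999])
print(ages[iqr_outliers(ages)])  # flags the 999
```

Note that on a sample this small the Z-score test misses the 999: the outlier itself inflates the standard deviation enough to pull its own score under 3. This is why the script combines several methods rather than relying on one.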

Get the outlier detection script.

# 5. Checking cross-field consistency

// Pain point

The individual fields look fine, but the relationships between them are broken. End dates that precede start dates. Shipping addresses in countries different from the billing address's country code. Child records without corresponding parent records. Order totals that do not match their line items. These logical inconsistencies are harder to find but just as damaging.

// What does the script do?

Validates logical relationships between fields based on business rules. Checks for temporal consistency, referential integrity, mathematical relationships, and custom business logic. Flags violations with specific details about what is inconsistent.

// How it works

The script accepts a rules definition file that defines the relationships to validate, evaluates conditional logic and cross-field comparisons, performs lookups to verify referential integrity, calculates derived values and compares them to stored values, and generates a detailed violation report with row references and specific rule failures.
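The rule-evaluation step can be sketched by representing each rule as a function that returns a boolean mask of valid rows. The rule names and column names below are illustrative assumptions; the real script loads rules from a definition file:

```python
import pandas as pd

# Each rule maps a name to a predicate returning True for valid rows
RULES = {
    "end_after_start": lambda df: df["end_date"] >= df["start_date"],
    "total_matches_items": lambda df: df["total"] == df["quantity"] * df["unit_price"],
}

def check_rules(df: pd.DataFrame, rules: dict) -> dict:
    """Return {rule_name: [violating row indices]} for every failed rule."""
    report = {}
    for name, rule in rules.items():
        valid = rule(df)
        bad = df.index[~valid].tolist()
        if bad:
            report[name] = bad
    return report

df = pd.DataFrame({
    "start_date": pd.to_datetime(["2024-01-01", "2024-03-10"]),
    "end_date": pd.to_datetime(["2024-02-01", "2024-03-01"]),
    "quantity": [2, 3],
    "unit_price": [10.0, 5.0],
    "total": [20.0, 14.0],
})
print(check_rules(df, RULES))
# row 1 ends before it starts AND its total (14.0) != 3 * 5.0
```

Expressing rules as plain functions keeps them testable in isolation and makes adding a new business rule a one-line change.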

Get the cross-field consistency checker script.

# Wrapping up

These five scripts help you catch data quality problems early, before they break your analysis or your systems. Data validation should be automated, comprehensive, and fast, and these scripts help with exactly that.

So how do you get started? Download the script that addresses your biggest data quality pain point and install the required dependencies. Next, set up validation rules for your specific data and run the script on a sample dataset to verify the setup. Then integrate it into your data pipeline to catch issues automatically.

Clean data is the foundation of everything. Start validating systematically, and you'll spend less time fixing problems. Happy validating!

Bala Priya C is a developer and technical writer from India. She loves working at the intersection of mathematics, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's learning and sharing her knowledge with the developer community by writing tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.
