

Image by Editor
# Introduction
As a data professional, you know that machine learning models, analytics dashboards, and business reports all depend on data that is accurate, consistent, and properly formatted. But here’s the painful truth: data cleaning consumes a huge chunk of project time. Data scientists and analysts spend much of their time cleaning and preparing data instead of actually analyzing it.
The raw data you receive is dirty. It’s full of missing values, duplicate records, inconsistent formats, outliers that skew your models, and text fields riddled with typos and inconsistencies. Cleaning this data manually is tedious, error-prone, and doesn’t scale.
This article covers five Python scripts specifically designed to automate common and time-consuming data cleaning tasks that you’ll often encounter in real-world projects.
# 1. Missing value handler
The pain point: Your dataset has missing values everywhere: some columns are 90% complete, others have sparse data. You need to decide what to do with each one: drop rows, fill with means, forward-fill time series, or apply a more sophisticated imputation method. Doing this manually for each column is tedious and inconsistent.
What the script does: Automatically analyzes missing value patterns across your dataset, recommends appropriate handling strategies based on data type and missing patterns, and applies the selected imputation methods. Produces a detailed report showing what was missing and how it was handled.
How it works: The script scans all columns to calculate missing percentages and patterns, determines data types (numeric, categorical, datetime), and applies an appropriate strategy:
- Mean/median imputation for numeric data
- Mode imputation for categorical data
- Interpolation for time series
It can detect and handle missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) patterns, and logs all changes for reproducibility.
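Here is a minimal sketch of what the core of such a handler might look like, assuming pandas; the function name `handle_missing` and the per-dtype strategy rules are illustrative, not a published API.

```python
import pandas as pd

def handle_missing(df: pd.DataFrame, report: bool = True) -> pd.DataFrame:
    """Illustrative sketch: pick an imputation strategy per column by dtype."""
    df = df.copy()
    log = []
    for col in df.columns:
        pct_missing = df[col].isna().mean() * 100
        if pct_missing == 0:
            continue
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
            strategy = "median"
        elif pd.api.types.is_datetime64_any_dtype(df[col]):
            df[col] = df[col].ffill()  # forward fill for time series
            strategy = "forward fill"
        else:
            mode = df[col].mode()
            if not mode.empty:
                df[col] = df[col].fillna(mode.iloc[0])
            strategy = "mode"
        log.append(f"{col}: {pct_missing:.1f}% missing -> {strategy}")
    if report:
        print("\n".join(log))  # simple stand-in for the detailed report
    return df
```

A real handler would also inspect the missingness mechanism (MCAR/MAR/MNAR) before choosing a strategy; the dtype-based rules here are just the simplest reasonable default.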
⏩ Get the missing value handler script
# 2. Duplicate record detector and resolver
The pain point: There are duplicates in your data, but they are not always exact matches. Sometimes it’s the same customer with a slightly different name spelling, or the same transaction recorded twice with minor variations. Finding these fuzzy duplicates and deciding which records to keep would otherwise require manual inspection of thousands of rows.
What the script does: Identifies both exact and fuzzy duplicate records using configurable matching rules. Groups similar records together, scores their similarity, and either flags them for review or automatically merges them based on survivorship rules you define, such as keep newest, keep most complete, and more.
How it works: The script first finds exact duplicates using a hash-based comparison for speed. Then it applies fuzzy matching with Levenshtein distance and Jaro-Winkler similarity to find near duplicates on key fields. Records are clustered into duplicate groups, and survivorship rules determine which values to retain when merging. A detailed report shows all duplicate groups found and the actions taken.
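Below is a minimal sketch of the two-stage idea, assuming pandas and the rapidfuzz library for similarity scoring; the function name, the single key field, and the 90% threshold are illustrative.

```python
import pandas as pd
from rapidfuzz import fuzz  # pip install rapidfuzz

def find_duplicates(df: pd.DataFrame, key: str, threshold: int = 90):
    """Illustrative sketch: exact duplicates first, then fuzzy near-duplicates."""
    exact = df[df.duplicated(keep=False)]  # fast hash-based exact matching
    values = df[key].astype(str).tolist()
    near = []
    # Naive O(n^2) pairwise comparison; real scripts use blocking or
    # sorted-neighborhood methods to keep this tractable on large data.
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            score = fuzz.ratio(values[i], values[j])  # 0-100 similarity
            if score >= threshold and values[i] != values[j]:
                near.append((df.index[i], df.index[j], score))
    return exact, near

df = pd.DataFrame({"name": ["Jon Smith", "John Smith", "Ann Lee", "Ann Lee"]})
exact, near = find_duplicates(df, key="name")
print(near)  # [(0, 1, ~95)]
```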
⏩ Get the duplicate detector script
# 3. Data type fixer and standardizer
The pain point: Your CSV import converted everything to strings. Dates are in five different formats. Numbers have currency symbols and thousands separators. Boolean values are represented in a single column as “Yes/No”, “Y/N”, “1/0”, and “True/False”. Achieving consistent data types requires writing custom parsing logic for each dirty column.
What the script does: Automatically detects the intended data type for each column, standardizes the formats, and converts everything to the appropriate types. Handles dates in multiple formats, cleans numeric strings, normalizes boolean representations, and validates the results. Provides a conversion report showing what was changed.
How it works: The script samples values from each column to infer the intended type using pattern matching and heuristics. It then applies the appropriate parsing: dateutil for flexible date parsing, regex for numeric extraction, and mapping dictionaries for boolean normalization. Failed conversions are logged with the problematic values for manual review.
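A minimal sketch of those three parsers, assuming the python-dateutil package; the helper names and the boolean lookup table are illustrative.

```python
import re
from dateutil import parser  # pip install python-dateutil

# Illustrative lookup for the boolean variants mentioned above
BOOL_MAP = {"yes": True, "y": True, "1": True, "true": True,
            "no": False, "n": False, "0": False, "false": False}

def clean_numeric(value: str) -> float:
    """Strip currency symbols and thousands separators before converting."""
    return float(re.sub(r"[^\d.\-]", "", value))

def parse_date(value: str):
    """dateutil infers the format, so mixed date styles parse uniformly."""
    return parser.parse(value)

def normalize_bool(value: str):
    return BOOL_MAP.get(str(value).strip().lower())

print(clean_numeric("$1,234.56"))    # 1234.56
print(parse_date("March 5, 2024"))   # 2024-03-05 00:00:00
print(normalize_bool("Y"))           # True
```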
⏩ Get the datatype fixer script
# 4. Outlier detector
The pain point: There are outliers in your numerical data that will ruin your analysis. Some are data entry errors, some are legitimate extreme values you want to keep, and some are ambiguous. You need to identify them, understand their impact, and decide how to handle each case: winsorize, cap, remove, or flag for review.
What the script does: Detects outliers using multiple statistical methods, such as IQR, Z-score, and Isolation Forest; visualizes their distribution and impact; and applies the treatment strategy you choose. It distinguishes between univariate and multivariate outliers and reports outlier counts, their values, and how they were handled.
How it works: The script calculates outlier limits using the method(s) you choose, flags values that exceed those thresholds, and applies the chosen treatment: removal, capping at percentiles, winsorization, or replacement with the boundary values. For multivariate outliers, it uses Isolation Forest or Mahalanobis distance. All outliers are logged with their original values for audit purposes.
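A minimal sketch of the IQR variant with capping, assuming pandas; the function name and the 1.5 multiplier default are illustrative (for the multivariate case, scikit-learn's IsolationForest is the usual swap-in).

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5):
    """Illustrative sketch: flag values outside the IQR fences and cap them."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (s < lower) | (s > upper)            # flagged outliers for the audit log
    treated = s.clip(lower=lower, upper=upper)  # cap at the fence values
    return mask, treated

s = pd.Series([10, 12, 11, 13, 12, 95])
mask, treated = iqr_outliers(s)
print(s[mask].tolist())    # [95]
print(treated.tolist())    # 95 capped to the upper fence
```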
⏩ Get the outlier detector script
# 5. Text data cleaner and normalizer
The pain point: Your text fields are a mess. Names have inconsistent capitalization, addresses use different abbreviations (St vs. Street vs. ST), product descriptions have HTML tags and special characters, and free-text fields have leading/trailing whitespace everywhere. Standardizing text data requires dozens of regex patterns and string operations, applied consistently.
What the script does: Automatically cleans and normalizes text data: normalizes case, removes unwanted characters, expands or standardizes abbreviations, strips HTML, normalizes whitespace, and handles Unicode issues. Configurable cleanup pipelines let you apply different rules to different column types (name, address, description, and so on).
How it works: The script provides a pipeline of text transformations that can be configured per column type. It covers case normalization, whitespace cleanup, special character removal, abbreviation standardization using a lookup dictionary, and Unicode normalization. Every change is logged, and before/after samples are provided for validation.
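A minimal sketch of one such pipeline, using only the standard library; the `ABBREVIATIONS` table and the title-case rule are illustrative choices for name/address columns.

```python
import re
import unicodedata

ABBREVIATIONS = {r"\bSt\b": "Street", r"\bAve\b": "Avenue"}  # illustrative lookup

def clean_text(value: str) -> str:
    """Illustrative pipeline: Unicode, HTML, whitespace, case, abbreviations."""
    value = unicodedata.normalize("NFKC", value)   # normalize Unicode forms
    value = re.sub(r"<[^>]+>", "", value)          # strip HTML tags
    value = re.sub(r"\s+", " ", value).strip()     # collapse whitespace
    value = value.title()                          # case rule for names/addresses
    for pattern, replacement in ABBREVIATIONS.items():
        value = re.sub(pattern, replacement, value, flags=re.IGNORECASE)
    return value

print(clean_text("  123  main <b>st</b>  "))  # 123 Main Street
```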
# The result
These five scripts highlight some of the most time-consuming data cleaning challenges you’ll encounter in real-world projects. Here’s a quick recap:
- The missing value handler analyzes missing data patterns and applies appropriate imputation strategies
- The duplicate detector finds and resolves exact and ambiguous duplicates
- The data type fixer standardizes formats and converts columns to the appropriate types
- The outlier detector identifies and treats data anomalies
- The text cleaner consistently normalizes dirty string data
Each script is designed to be modular, so you can use them individually or combine them into a complete data cleaning pipeline, as in the sketch below. Start with the script that addresses your biggest pain point, test it on a sample of your data, customize the parameters for your specific use case, and gradually build up your automated cleaning workflow.
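For instance, here is a minimal sketch of such a pipeline, chaining the illustrative helpers from the earlier sections (these names are hypothetical, not a published API):

```python
import pandas as pd

def clean_pipeline(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative end-to-end pipeline built from the earlier sketches."""
    df = handle_missing(df)                    # 1. impute missing values
    df = df.drop_duplicates()                  # 2. drop exact duplicates
    # (3. type fixing would apply clean_numeric/parse_date/normalize_bool per column)
    for col in df.select_dtypes("number"):
        _, df[col] = iqr_outliers(df[col])     # 4. cap outliers
    for col in df.select_dtypes("object"):
        df[col] = df[col].astype(str).map(clean_text)  # 5. normalize text fields
    return df
```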
Happy data cleaning!
Bala Priya C is a developer and technical writer from India. She loves working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she is working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.