From Messy to Clean: 8 Python Recipes for Effortless Data Preprocessing

by SkillAiNest


# Introduction

While data preprocessing is a central step in data science and machine learning workflows, it is often done poorly, largely because it is perceived as overly complex, time-consuming, or requiring extensive custom code. As a result, practitioners may delay essential tasks such as data cleansing, rely on brittle ad-hoc solutions that are unsustainable in the long term, or over-engineer solutions to problems that are simpler at their core.

This article presents 8 Python tricks for converting raw, messy data into clean, preprocessed data with minimal effort.

Before looking at specific tricks and accompanying code examples, the following introductory code sets up the necessary libraries and defines a toy data set to illustrate each trick:

import pandas as pd
import numpy as np
# A tiny, intentionally messy dataset
df = pd.DataFrame({
    " User Name ": [" Alice ", "bob", "Bob", "alice", None],
    "Age": ["25", "30", "?", "120", "28"],
    "Income$": ["50000", "60000", None, "1000000", "55000"],
    "Join Date": ["2023-01-01", "01/15/2023", "not a date", None, "2023-02-01"],
    "City": ["New York", "new york ", "NYC", "New York", "nyc"],
})

# 1. Instant normalization of column names

This is a very useful one-liner: in a single line of code, it normalizes the names of all the columns in the dataset. The details depend on how you want to standardize your attribute names, but the following example replaces whitespace with underscores and lowercases everything, ensuring a consistent, standard naming convention. This is important to prevent annoying bugs in downstream operations and to fix potential typos. No need to go column by column!

df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
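
As a quick sanity check on the toy dataset above, printing the columns confirms the new names:

print(df.columns.tolist())
# ['user_name', 'age', 'income$', 'join_date', 'city']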

# 2. Removing stray whitespace from strings

Sometimes you just want to make sure that certain garbage invisible to the human eye, such as spaces at the beginning or end of string (object) values, is systematically removed from the entire dataset. This trick does that neatly for all columns containing strings, leaving other columns, such as numeric ones, unchanged.

df = df.apply(lambda s: s.str.strip() if s.dtype == "object" else s)
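
A quick look at the user_name column (renamed by trick 1) shows the edge whitespace gone, while the missing entry stays missing:

print(df["user_name"].tolist())
# ['Alice', 'bob', 'Bob', 'alice', nan] -- edges trimmed, None surfaces as NaN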

# 3. Converting numeric columns safely

If we’re not 100% sure that all the values in a numeric column follow the same format, it’s usually a good idea to explicitly convert those values to a numeric type, which turns messy strings that look like numbers into real numbers. In a single line, we can do what would otherwise require try/except blocks and a more manual cleanup procedure.

df("age") = pd.to_numeric(df("age"), errors="coerce")
df("income$") = pd.to_numeric(df("income$"), errors="coerce")

Note here that other classical approaches, e.g. df["column"].astype(float), can crash if invalid raw values are found that cannot be trivially converted to numbers.
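
On the toy dataset, the unparseable "?" is coerced to NaN while the rest convert cleanly:

print(df["age"].tolist())
# [25.0, 30.0, nan, 120.0, 28.0] -- "?" became NaN instead of crashing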

# 4. Parsing dates with errors="coerce"

This is a similar validation mechanism for a different data type. The trick converts datetime values that are valid, discarding those that are not. The key is to use errors="coerce" to tell pandas that, if invalid, unconvertible values are found, they must be turned into NaT (not a time) instead of raising an error and crashing the program in the process.

df("join_date") = pd.to_datetime(df("join_date"), errors="coerce")

# 5. Filling missing values with smart defaults

For those unfamiliar with strategies for handling missing values: besides dropping entire rows that contain them, you can impute those values (fill in the gaps) with statistically driven defaults such as the median or the mode. It is an efficient, one-liner strategy that can be adjusted with different default aggregates. The [0] index on mode() is used to obtain a single value in the case of a tie between two or more “most frequent values”.

df("age") = df("age").fillna(df("age").median())
df("city") = df("city").fillna(df("city").mode()(0))

# 6. Standardizing categories with maps

In categorical columns with free-form values, such as cities, it is also important to standardize names and eliminate possible inconsistencies, so that downstream aggregations like groupby() are reliable and effective. Using a dictionary, this example applies a many-to-one mapping to string values related to New York City, ensuring that they are all uniformly represented by “NYC”.

city_map = {"new york": "NYC", "nyc": "NYC"}
df("city") = df("city").str.lower().map(city_map).fillna(df("city"))

# 7. Intelligent and flexible removal of duplicates

The key to this highly customizable duplicate-removal strategy is the subset=["user_name"] argument. In this example, it tells pandas to treat a row as a duplicate by looking only at the "user_name" column and checking whether its value matches that of another row. It is a great way to ensure that each unique user is represented only once in the dataset, preventing double counting, all in a single instruction.

df = df.drop_duplicates(subset=["user_name"])
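
Note that in the toy dataset, "Bob" and "bob" are different strings, so this line alone would drop nothing. A minimal sketch of normalizing case first (an extra step not in the recipe above), together with the keep parameter for choosing which record survives:

df["user_name"] = df["user_name"].str.lower()  # make "Bob" and "bob" collide
df = df.drop_duplicates(subset=["user_name"], keep="first")  # keep="last" keeps the latest row instead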

# 8. Trimming quantiles to remove outliers

The final trick automatically caps extreme values, or outliers, instead of removing them entirely. This is particularly useful when outliers are suspected to be errors introduced into the data. Clipping caps values that fall below (or above) two percentiles (for example, the 1st and 99th) at those percentile values, leaving the values between the two specified percentiles unchanged. In simple terms, it is equivalent to pulling excessively large or small values back into range.

q_low, q_high = df["income$"].quantile([0.01, 0.99])
df["income$"] = df["income$"].clip(q_low, q_high)

# Wrapping up

This article presented eight useful tricks, tips, and strategies to boost your data preprocessing pipelines in Python, making them more efficient, effective, and robust, all at the same time.

Iván Palomares Carrascosa is a leader, author, speaker, and consultant in AI, machine learning, deep learning, and LLMs. He trains and guides others in using AI in the real world.
