Data Detox: Training Yourself for the Messy, Noisy, Real World

by SkillAiNest

Photo by author

# Introduction

We’ve all spent hours debugging a model, only to discover that the problem wasn’t the algorithm but a rogue null value hiding in row 47,832. Kaggle competitions give the impression that data arrives as a clean, well-labeled CSV with no class imbalance issues, but in reality, that is rarely the case.

In this article, we’ll use a real data science project to explore four practical ways to prepare for dealing with messy, real-world datasets.

# The NoBroker Data Project: A hands-on test of real-world chaos

NoBroker is an Indian property technology (proptech) company that connects property owners and renters directly in a broker-free marketplace.


This data project was used during the recruitment process for data science positions at NoBroker.

In this data project, NoBroker asks you to build a model that predicts how many interactions a property will receive in a given time frame. We won’t complete the entire project here, but it will help us explore ways to train on messy, real-world data.

It has three datasets:

  • property_data_set.csv
    • Contains property details such as type, location, amenities, size, rent and other accommodation features.
  • property_photos.tsv
    • Contains photo URLs for each property, stored as JSON strings.
  • property_interactions.csv
    • Contains a timestamp for each interaction with a property.

# Comparing real production data versus clean interview data: A reality check

Interview datasets are polished, balanced and boring. Actual production data? It’s a dumpster fire of missing values, duplicate rows, inconsistent formats, and silent errors that wait until Friday at 5pm to break your pipeline.

Take the NoBroker property dataset, a real-world mess with 28,888 properties spread across three tables. At first glance, it seems fine. But dig deeper, and you’ll find 11,022 missing photo uniform resource locators (URLs), corrupted JSON strings, and more.

That’s the line between clean and chaotic. Clean data trains you to model; production data trains you to struggle and survive.

We will explore four ways to train yourself.


# Practice #1: Handling Missing Data

Missing data isn’t just annoying; it’s a decision point. Drop the row? Impute it with the mean? Flag it as unknown? The answer depends on why the data is missing and how much you can afford to lose.

There were three types of missing data in the NoBroker dataset. The photo_urls column was missing values in 11,022 of the 28,888 rows, about 38% of the dataset. Here is the code.
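The project’s exact snippet isn’t reproduced here, so the following is a minimal sketch of the check, assuming the photos table is loaded from property_photos.tsv into a DataFrame called pics (the name used in the snippets that follow).

import pandas as pd

# Load the photos table; the file is tab-separated
pics = pd.read_csv('property_photos.tsv', sep='\t')

# Count missing values per column and the share of missing photo URLs
print(pics.isnull().sum())
print(f"Missing photo_urls: {pics['photo_urls'].isnull().mean():.1%}")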

The null-count output confirmed it: 11,022 missing photo_urls values, far more than any other column.

Deleting those rows would throw away valuable property records. Instead, the solution was to treat the missing photos as zero photos and move on.

import json
import pandas as pd

def correction(x):
    # Missing photo list -> 0 photos
    if pd.isna(x) or x == 'NaN':
        return 0
    # Repair the corrupted JSON string, then count the photo entries
    return len(json.loads(x.replace('\\', '').replace('{title', '{"title')))

pics['photo_count'] = pics['photo_urls'].apply(correction)

For the numeric column total_floor (23 missing values) and the categorical column building_type (38 missing values), the strategy was standard imputation: fill the numeric blanks with the mean and the categorical blanks with the mode.

# Impute numeric columns with the mean
for col in x_remain_withNull.columns:
    x_remain[col] = x_remain_withNull[col].fillna(x_remain_withNull[col].mean())

# Impute categorical columns with the mode
for col in x_cat_withNull.columns:
    x_cat[col] = x_cat_withNull[col].fillna(x_cat_withNull[col].mode()[0])

The first decision: do not delete rows without questioning why they are missing!

Understand the pattern first; in this case, the missing photo URLs were not random.
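One way to probe that pattern, as a rough sketch: flag the rows with missing photo lists and check whether missingness varies with a property attribute. The property_id join key and the choice of building_type are illustrative assumptions here, not necessarily what the project used.

import pandas as pd

# Load the property table and flag rows whose photo list is missing
properties = pd.read_csv('property_data_set.csv')
pics['photo_missing'] = pics['photo_urls'].isnull()

# Join on an assumed property_id key and compare missing rates across building types
merged = pics.merge(properties[['property_id', 'building_type']], on='property_id', how='left')
print(merged.groupby('building_type')['photo_missing'].mean().sort_values(ascending=False))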

# Practice #2: Detecting Outliers

An outlier is not always a mistake, but it is always suspicious.

Can you imagine a property with 21 bathrooms, an age of 800 years, or 40,000 square feet of space? Either you’ve found your dream property or someone made a data entry error.

The NoBroker dataset was full of these red flags. Box plots revealed extreme values in several columns: property ages greater than 100 years, sizes greater than 10,000 square feet (sq ft), and deposits greater than 3.5 million. A few were legitimate luxury listings; most were data entry errors.

import matplotlib.pyplot as plt

# One box plot per numeric column to spot extreme values
df_num.plot(kind='box', subplots=True, figsize=(22, 10))
plt.show()

Here is the output.


The solution was to remove outliers based on the interquartile range (IQR), a simple statistical method that flags values falling more than 2 times the IQR below the first quartile or above the third quartile.

To handle this, we first write a function that removes these outliers.

def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1
    # A multiplier of 2 (instead of the conventional 1.5) keeps more of the data
    fence_low = q1 - 2 * iqr
    fence_high = q3 + 2 * iqr
    df_out = df_in.loc[(df_in[col_name] <= fence_high) & (df_in[col_name] >= fence_low)]
    return df_out

Then we run this function on the numeric columns.

df = dataset.copy()

for col in df_num.columns:
    if col in ('gym', 'lift', 'swimming_pool', 'request_day_within_3d', 'request_day_within_7d'):
        continue  # Skip binary and target columns
    df = remove_outlier(df, col)

print(f"Before: {dataset.shape[0]} rows")
print(f"After: {df.shape[0]} rows")
removed = dataset.shape[0] - df.shape[0]
print(f"Removed: {removed} rows ({removed / dataset.shape[0] * 100:.1f}% reduction)")

Here is the output.


After removing outliers, the dataset shrank from 17,386 rows to 15,170. Losing 12.7% of the data kept the model sane, and the trade-off was worth it.

For the target variable request_day_within_3d, capping was used instead of deletion. Values above 10 were capped at 10 so that extreme outliers wouldn’t skew the predictions. The following code also compares the counts before and after capping.

def capping_for_3days(x):
    num = 10
    return num if x > num else x  # Cap values above 10 at 10

df['request_day_within_3d_capping'] = df['request_day_within_3d'].apply(capping_for_3days)

before_count = (df['request_day_within_3d'] > 10).sum()
after_count = (df['request_day_within_3d_capping'] > 10).sum()
total_rows = len(df)
change_count = before_count - after_count
percent_change = (change_count / total_rows) * 100

print(f"Before capping (>10): {before_count}")
print(f"After capping (>10): {after_count}")
print(f"Reduced by: {change_count} ({percent_change:.2f}% of total rows affected)")

The result?


A cleaner distribution, better model performance, and fewer debugging sessions.

# Practice #3: Dealing with Duplicates and Inconsistencies

Duplicates are easy; inconsistencies are hard. A duplicate row is a quick df.drop_duplicates(). An inconsistent format, such as a JSON string that has been mangled by three different systems, requires detective work.
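As a quick sketch of the easy half, assuming df is the working property DataFrame from the previous step:

# Count exact duplicate rows, then drop them, keeping the first occurrence
dup_count = df.duplicated().sum()
print(f"Exact duplicate rows: {dup_count}")

df = df.drop_duplicates(keep='first').reset_index(drop=True)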

The NoBroker dataset had one of the worst JSON inconsistencies I’ve seen. The photo_urls column was supposed to contain valid JSON arrays, but instead it was filled with malformed strings, missing values, backslash escapes, and random trailing characters.

# Inspect the first raw photo_urls string before any cleanup
text_before = pics['photo_urls'].iloc[0]
print('Before Correction: \n\n', text_before)

Here is the raw string before correction.


The fix required multiple string replacements to repair the formatting before parsing. Here is the code.

import json

# Remove escape backslashes, restore the missing quote before "title",
# and clean up stray quote and comma artifacts, then parse the repaired string
text_after = text_before.replace('\\', '').replace('{title', '{"title').replace(')"', ')').replace('),"', ')","')
parsed_json = json.loads(text_after)

Here is the output.


Once repaired, the JSON was valid and parsable. Manipulating strings like this isn’t the cleanest approach, but it works.

You see inconsistent formats everywhere: dates stored as strings, typos in categorical values, and numeric IDs stored as floats.

The solution is standardization, just as we did with the JSON formatting.
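A minimal sketch of what that standardization pass might look like; activation_date and building_type appear in this dataset, while listing_id is a hypothetical ID column used only for illustration.

import pandas as pd

# Parse string dates into real datetimes (unparseable values become NaT)
df['activation_date'] = pd.to_datetime(df['activation_date'], dayfirst=True, errors='coerce')

# Normalize category labels so 'Apartment', ' apartment ' and 'APARTMENT' collapse into one value
df['building_type'] = df['building_type'].str.strip().str.upper()

# Cast float IDs to a nullable integer type (listing_id is a hypothetical column)
df['listing_id'] = df['listing_id'].astype('Int64')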

# Practice #4: Data Type Validation and Schema Checks

Validation starts the moment you load your data. Discovering later that your dates are strings or your numbers are objects is a waste of time.

In the NoBroker project, types were validated while reading the CSV itself, enforcing the correct data types through pandas read_csv parameters. Here is the code.

import pandas as pd

# Without parse_dates, activation_date is loaded as a plain object (string) column
data = pd.read_csv('property_data_set.csv')
print(data['activation_date'].dtype)

# Re-read with explicit date parsing so the column becomes a datetime
data = pd.read_csv('property_data_set.csv',
                   parse_dates=['activation_date'],
                   infer_datetime_format=True,
                   dayfirst=True)
print(data['activation_date'].dtype)

Here is the output.


The same validation was applied to the interaction dataset.

interaction = pd.read_csv('property_interactions.csv',
                          parse_dates=['request_date'],
                          infer_datetime_format=True,
                          dayfirst=True)

This wasn’t just good practice; it was necessary for everything downstream. The project requires calculating the time difference between the activation date and each request date.

The following code would raise an error if the dates were still strings.

import numpy as np

# Time between activation and request, expressed in days
num_req['request_day'] = (num_req['request_date'] - num_req['activation_date']) / np.timedelta64(1, 'D')

A schema check ensures that the structure doesn’t change, but in reality data also drifts as its distribution shifts over time. You can perturb the input distribution slightly and check whether your model or its validation detects and responds to that drift.
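A lightweight sketch of both ideas: an explicit schema assertion plus a crude mean-shift drift check. The expected dtypes below are illustrative guesses for this dataset, not the project’s actual schema definition.

import pandas as pd

EXPECTED_SCHEMA = {
    'activation_date': 'datetime64[ns]',  # parsed date column
    'total_floor': 'float64',             # numeric column with missing values
    'building_type': 'object',            # categorical stored as strings
}

def check_schema(df: pd.DataFrame, expected: dict) -> None:
    # Fail fast if a column disappears or silently changes type
    for col, dtype in expected.items():
        assert col in df.columns, f"Missing column: {col}"
        assert str(df[col].dtype) == dtype, f"{col}: expected {dtype}, got {df[col].dtype}"

def has_drifted(reference: pd.Series, current: pd.Series, tolerance: float = 0.2) -> bool:
    # Flag a column whose mean has shifted by more than `tolerance` relative to the reference
    shift = abs(current.mean() - reference.mean()) / (abs(reference.mean()) + 1e-9)
    return shift > tolerance

check_schema(data, EXPECTED_SCHEMA)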

# Documenting your cleaning steps

In three months, you won’t remember why you capped request_day_within_3d at 10. Six months from now, a teammate will break the pipeline by removing your outlier filter. In a year, the model will hit production, and no one will understand why it suddenly fails.

Documentation is not optional. It is the difference between a reproducible pipeline and a voodoo script that works until it doesn’t.

The NoBroker project documented each change in code comments and structured notebook sections, with explanations and a table of contents:

  • Assignment
  • Read and Explore All Datasets
  • Data Engineering
    • Handling Pics Data
    • Number of Interactions Within 3 Days
    • Number of Interactions Within 7 Days
    • Merge Data
  • Exploratory Data Analysis and Processing
  • Feature Engineering
    • Remove Outliers
    • One-Hot Encoding
    • MinMaxScaler
    • Classical Machine Learning
    • Predicting Interactions Within 3 Days
    • Deep Learning

Individual cleaning steps were annotated with comments such as:

# Try to correct the first Json
# Try to replace corrupted values then convert to json
# Function to correct corrupted json and get count of photos

Version control matters, too. Track changes to your cleaning logic and save intermediate datasets. Your future self will thank you for the effort.
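One simple way to handle the "save intermediate datasets" part, assuming df is the cleaned DataFrame at that stage:

from datetime import date

# Write a date-stamped snapshot of the cleaned data so each run can be reproduced and compared
snapshot_path = f"cleaned_properties_{date.today():%Y%m%d}.csv"
df.to_csv(snapshot_path, index=False)
print(f"Saved snapshot to {snapshot_path}")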

The goal is not perfection; the goal is clarity. If you can’t explain why you made a decision, you can’t defend the model when it fails.

# Final thoughts

Clean data is a myth. The best data scientists aren’t the ones who run away from messy datasets; they are the ones who know how to tame them. They detect missing values before training.

They identify outliers before they distort predictions. They check schemas before joining tables. And they write everything down so the next person doesn’t have to start from scratch.

Real impact doesn’t come from perfect data. It comes from the ability to work with imperfect data and still build something functional.

So don’t panic when you open a dataset and see null values, broken strings, and outliers. What you’re looking at isn’t a problem; it’s an opportunity to prove your skills on a real-world dataset.

Nate Rosidi is a data scientist and product strategist. He is also an adjunct professor teaching analytics, and the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview tips, shares data science projects, and covers everything SQL.
