How I built a data cleaning pipeline using a messy DoorDash dataset

by SkillAiNest


# Introduction

According to a CrowdFlower survey, data scientists spend around 60% of their time organizing and cleaning data.

In this article, we’ll walk through building a data cleaning pipeline using a real-life dataset from DoorDash. It contains nearly 200,000 food delivery records, each with dozens of attributes such as delivery time, total items, and store category (e.g., Mexican, Thai, or American food).

# Predicting food delivery times with DoorDash data

DoorDash aims to accurately estimate the time it takes to deliver food, from the moment a customer places an order to the time it arrives at their door. In this data project, we are tasked with developing a model that predicts the total delivery duration based on historical delivery data.

However, we will not do the entire project. Instead, we will use the dataset provided in the project and create a data cleaning pipeline.

Our workflow consists of two major steps.


# Exploring the data


Let’s start by loading and viewing the first few rows of the dataset.

// Load and preview the dataset

import pandas as pd

# Load the dataset and preview the first five rows
df = pd.read_csv("historical_data.csv")
df.head()

Here is the output.


This dataset contains datetime columns that capture the order creation time and the actual delivery time, which can be used to calculate the delivery duration. It also includes other features such as store category, total item count, subtotal, and minimum item price, making it suitable for a variety of analyses. We can already see some null values, which we will explore more closely in the next step.

// Exploring the columns with info()

Let’s inspect all the columns with the info() method. We will use this method throughout the article to observe changes in non-null value counts, which is a good indicator of missing data and overall data health.
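The call itself is a one-liner on the DataFrame we loaded above.

# Inspect column names, dtypes, and non-null counts
df.info()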

Here is the output.


As you can see, we have 15 columns, but the number of non-null values in them varies. This means that some columns have missing values, which can affect our analysis if not handled properly. One last thing: created_at and actual_delivery_time are stored as objects; they should be datetime.

# Building a data cleaning pipeline

In this step, we develop a structured data cleaning pipeline to prepare the dataset for modeling. Each step addresses a common problem, such as incorrect timestamp formats, missing values, or irrelevant attributes.

// Fixing date and time column data types

Before analyzing the data, we need to fix the columns that represent time. Otherwise, the calculation we mentioned (actual_delivery_time − created_at) will fail.

What we are fixing:

  • created_at: when the order was placed
  • actual_delivery_time: when the food arrived

Both of these columns are stored as objects, so to perform calculations on them, we need to convert them to datetime format. To do this, we can use the to_datetime() function from pandas. Here is the code.

import pandas as pd

df = pd.read_csv("historical_data.csv")

# Convert timestamp strings to datetime objects
df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")
df["actual_delivery_time"] = pd.to_datetime(df["actual_delivery_time"], errors="coerce")
df.info()

Here is the output.


As you can see from the output above, created_at and actual_delivery_time are now datetime objects.


Among the key columns, store_primary_category has the fewest non-null values (192,668), which means it has the most missing data. That’s why we will focus on cleaning it first.
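If you would rather see the missing-value counts directly instead of reading them off info(), a quick one-liner sorts the columns by how many NaNs they contain (a small addition, using the same df as above).

# Count missing values per column, most affected first
df.isna().sum().sort_values(ascending=False)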

// Data imputation with mode()

The dirtiest column in the dataset, as indicated by its large number of missing values, is store_primary_category. It tells us what kind of food a store serves, such as Mexican, American, or Thai. However, many rows are missing this information, which is a problem: for example, it may limit how we can group or analyze the data. So how do we fix it?

We will fill in these rows instead of dropping them. To do this, we will use a smart imputation strategy.

We build a mapping from each store_id to its most frequent category, and then use this mapping to fill in the missing values. Let’s look at the dataset before we do that.


Here is the code.

import numpy as np

# Global most-frequent category as a fallback
global_mode = df["store_primary_category"].mode().iloc[0]

# Build store-level mapping to the most frequent category (fast and robust)
store_mode = (
    df.groupby("store_id")["store_primary_category"]
      .agg(lambda s: s.mode().iloc[0] if not s.mode().empty else np.nan)
)

# Fill missing categories using the store-level mode, then fall back to the global mode
df["store_primary_category"] = (
    df["store_primary_category"]
      .fillna(df["store_id"].map(store_mode))
      .fillna(global_mode)
)

df.info()

Here is the output.


As you can see from the output above, the store_primary_category column now has a higher non-null count. But let’s double-check with this code.

df("store_primary_category").isna().sum()

Here is the output showing the number of null values. It is zero; we got rid of them all.


And let’s look at the dataset after the imputation.


// Dropping the remaining NaNs

In the previous step, we fixed store_primary_category, but did you notice anything? The non-null counts across columns still don’t match!

This is a clear sign that we are still dealing with missing values in some parts of the dataset. Now, when it comes to data cleaning, we have two options:

  • Fill in the missing values
  • Drop the rows that contain them

Given that this dataset has about 200,000 rows, we can afford to lose some. With smaller datasets, you’d need to be more careful. In that case, it is advisable to analyze each column, establish a criterion (decide how the missing values will be filled, that is, with the mean, the median, the most frequent value, or a domain-specific default), and then fill them.
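For illustration only, here is a minimal sketch of what that fill-based approach could look like. The column names total_items and subtotal come from the preview earlier; the median is just one example of a reasonable fill policy. Since our dataset is large enough, though, we will simply drop the incomplete rows instead.

# Illustrative alternative: fill numeric columns with the median instead of dropping rows
df["total_items"] = df["total_items"].fillna(df["total_items"].median())
df["subtotal"] = df["subtotal"].fillna(df["subtotal"].median())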

To remove the NaNs, we will use the dropna() method from the pandas library. We set inplace=True to apply the changes directly to the DataFrame without needing to reassign it. Let’s look at the dataset at this point.


Here is the code.

# Drop all rows that still contain missing values
df.dropna(inplace=True)
df.info()

Here is the output.


As you can see from the screenshot above, each column now has the same number of non-null values.
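As an extra sanity check beyond info(), you can assert that no NaNs remain and see how many rows survived the drop.

# Sanity check: no missing values should remain anywhere
assert df.isna().sum().sum() == 0
print(df.shape)  # rows and columns left after dropping NaNs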

Let’s look at the dataset after all the transformations.


// What can you do next?

Now that we have a clean dataset, here are a few things you can do next:

  • Perform EDA to understand delivery patterns.
  • Engineer new features, such as the delivery duration or the ratio of busy dashers, to add more meaning to your analysis (see the sketch after this list).
  • Analyze correlations between variables to increase the performance of your model.
  • Build different regression models and find the best-performing one.
  • Predict the delivery duration with the best-performing model.
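To get you started on the feature engineering item above, here is a minimal sketch that derives the delivery duration target from the two datetime columns we fixed earlier; the column name delivery_seconds is illustrative, not part of the original dataset.

# Derive the target: total delivery duration in seconds (column name is illustrative)
df["delivery_seconds"] = (
    df["actual_delivery_time"] - df["created_at"]
).dt.total_seconds()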

# Final thoughts

In this article, we cleaned a real-life dataset from DoorDash by solving common data quality problems, such as fixing incorrect data types and handling missing values. We created a simple data cleaning pipeline tailored to this data project and explored possible next steps.
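If you want the whole pipeline in one reusable piece, here is a minimal sketch that strings together the three steps from this article into a single function; it assumes the same CSV file and column names used above.

import pandas as pd
import numpy as np

def clean_pipeline(path: str) -> pd.DataFrame:
    # Step 0: load the raw data
    df = pd.read_csv(path)
    # Step 1: convert timestamp strings to datetime objects
    for col in ("created_at", "actual_delivery_time"):
        df[col] = pd.to_datetime(df[col], errors="coerce")
    # Step 2: fill missing store categories with the per-store mode,
    # falling back to the global mode
    global_mode = df["store_primary_category"].mode().iloc[0]
    store_mode = df.groupby("store_id")["store_primary_category"].agg(
        lambda s: s.mode().iloc[0] if not s.mode().empty else np.nan
    )
    df["store_primary_category"] = (
        df["store_primary_category"]
        .fillna(df["store_id"].map(store_mode))
        .fillna(global_mode)
    )
    # Step 3: drop any rows that still contain missing values
    return df.dropna()

df_clean = clean_pipeline("historical_data.csv")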

Real-world datasets can be messier than you think, but there are plenty of techniques and tricks for solving these problems. Thanks for reading!

Nate Rosidi is a data scientist and product strategist. He is also an adjunct professor teaching analytics, and the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview tips, shares data science projects, and covers everything SQL.
