How I built a data cleaning pipeline using a messy DoorDash dataset

by SkillAiNest


# Introduction

According to a CrowdFlower survey, data scientists spend around 60% of their time organizing and cleaning data.

In this article, we’ll walk through building a data cleaning pipeline using a real-life dataset from DoorDash. It contains nearly 200,000 food delivery records, each with dozens of attributes such as delivery time, total items, and store category (e.g., Mexican, Thai, or American food).

# Predicting food delivery times with DoorDash data

DoorDash aims to accurately estimate the time it takes to deliver food, from the moment a customer places an order to the time it arrives at their door. In this data project, we are tasked with developing a model that predicts the total delivery duration based on historical delivery data.

However, we will not do the entire project. Instead, we will use the dataset provided in the project and create a data cleaning pipeline.

Our workflow consists of two major steps.


# Exploring the data


Let’s start by loading and viewing the first few rows of the dataset.

// Load and preview the dataset

import pandas as pd

# Load the dataset and preview the first five rows
df = pd.read_csv("historical_data.csv")
df.head()

Here is the output.


This dataset contains datetime columns that capture the order creation time and the actual delivery time, which can be used to calculate the delivery duration. It also includes other features such as store category, total item count, subtotal, and minimum item price, making it suitable for a variety of analyses. We can already see some null values, which we will explore more closely in the next step.

// Exploring the columns with info()

Let’s inspect all the columns with the info() method. We will use this method throughout the article to observe changes in non-null value counts, which is a good indicator of missing data and overall data health.
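The call itself is a one-liner on the DataFrame we loaded above.

# Inspect column names, dtypes, and non-null counts
df.info()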

Here is the output.


As you can see, we have 15 columns, but the number of non-null values in them varies. This means that some columns have missing values, which can affect our analysis if not handled properly. One last thing: created_at and actual_delivery_time are stored as objects; they should be datetime.

# Building a data cleaning pipeline

In this step, we develop a structured data cleaning pipeline to prepare the dataset for modeling. Each step addresses a common problem, such as incorrect timestamp formats, missing values, or irrelevant attributes.

// Fixing date and time column data types

Before analyzing the data, we need to fix the columns that represent time. Otherwise, the calculation we mentioned (actual_delivery_time − created_at) will fail.

What we are fixing:

  • created_at: when the order was placed
  • actual_delivery_time: when the food arrived

Both of these columns are stored as objects, so to perform calculations on them, we need to convert them to datetime format. To do this, we can use the to_datetime() function from pandas. Here is the code.

import pandas as pd

df = pd.read_csv("historical_data.csv")

# Convert timestamp strings to datetime objects
df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")
df["actual_delivery_time"] = pd.to_datetime(df["actual_delivery_time"], errors="coerce")
df.info()

Here is the output.


As you can see from the output above, created_at and actual_delivery_time are now datetime objects.


Among the key columns, store_primary_category has the fewest non-null values (192,668), which means it has the most missing data. That’s why we will focus on cleaning it first.
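If you would rather see the missing-value counts directly instead of reading them off info(), a quick one-liner sorts the columns by how many NaNs they contain (a small addition, using the same df as above).

# Count missing values per column, most affected first
df.isna().sum().sort_values(ascending=False)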

// Data imputation with mode()

The dirtiest column in the dataset, as indicated by its large number of missing values, is store_primary_category. It tells us what kind of food a store serves, such as Mexican, American, or Thai. However, many rows are missing this information, which is a problem: for example, it may limit how we can group or analyze the data. So how do we fix it?

We will fill in these rows instead of dropping them. To do this, we will use a smart imputation strategy.

We build a mapping from each store_id to its most frequent category, and then use this mapping to fill in the missing values. Let’s look at the dataset before we do that.


Here is the code.

import numpy as np

# Global most-frequent category as a fallback
global_mode = df["store_primary_category"].mode().iloc[0]

# Build store-level mapping to the most frequent category (fast and robust)
store_mode = (
    df.groupby("store_id")["store_primary_category"]
      .agg(lambda s: s.mode().iloc[0] if not s.mode().empty else np.nan)
)

# Fill missing categories using the store-level mode, then fall back to the global mode
df["store_primary_category"] = (
    df["store_primary_category"]
      .fillna(df["store_id"].map(store_mode))
      .fillna(global_mode)
)

df.info()

Here is the output.


As you can see from the output above, the store_primary_category column now has a higher non-null count. But let’s double-check with this code.

df("store_primary_category").isna().sum()

Here is the output showing the number of null values. It is zero; we got rid of them all.


And let’s look at the dataset after the imputation.


// Dropping the remaining NaNs

In the previous step, we fixed store_primary_category, but did you notice anything? The non-null counts across columns still don’t match!

This is a clear sign that we are still dealing with missing values in some parts of the dataset. Now, when it comes to data cleaning, we have two options:

  • Fill in the missing values
  • Drop the rows that contain them

Given that this dataset has about 200,000 rows, we can afford to lose some. With smaller datasets, you’d need to be more careful. In that case, it is advisable to analyze each column, establish a criterion (decide how the missing values will be filled, that is, with the mean, the median, the most frequent value, or a domain-specific default), and then fill them.
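For illustration only, here is a minimal sketch of what that fill-based approach could look like. The column names total_items and subtotal come from the preview earlier; the median is just one example of a reasonable fill policy. Since our dataset is large enough, though, we will simply drop the incomplete rows instead.

# Illustrative alternative: fill numeric columns with the median instead of dropping rows
df["total_items"] = df["total_items"].fillna(df["total_items"].median())
df["subtotal"] = df["subtotal"].fillna(df["subtotal"].median())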

To remove the NaNs, we will use the dropna() method from the pandas library. We set inplace=True to apply the changes directly to the DataFrame without needing to reassign it. Let’s look at the dataset at this point.


Here is the code.

# Drop all rows that still contain missing values
df.dropna(inplace=True)
df.info()

Here is the output.


As you can see from the screenshot above, each column now has the same number of non-null values.
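As an extra sanity check beyond info(), you can assert that no NaNs remain and see how many rows survived the drop.

# Sanity check: no missing values should remain anywhere
assert df.isna().sum().sum() == 0
print(df.shape)  # rows and columns left after dropping NaNs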

Let’s look at the dataset after all the transformations.


// What can you do next?

Now that we have a clean dataset, here are a few things you can do next:

  • Perform EDA to understand delivery patterns.
  • Engineer new features, such as the delivery duration or the ratio of busy dashers, to add more meaning to your analysis (see the sketch after this list).
  • Analyze correlations between variables to increase the performance of your model.
  • Build different regression models and find the best-performing one.
  • Predict the delivery duration with the best-performing model.
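To get you started on the feature engineering item above, here is a minimal sketch that derives the delivery duration target from the two datetime columns we fixed earlier; the column name delivery_seconds is illustrative, not part of the original dataset.

# Derive the target: total delivery duration in seconds (column name is illustrative)
df["delivery_seconds"] = (
    df["actual_delivery_time"] - df["created_at"]
).dt.total_seconds()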

# Final thoughts

In this article, we cleaned a real-life dataset from DoorDash by solving common data quality problems, such as fixing incorrect data types and handling missing values. We created a simple data cleaning pipeline tailored to this data project and explored possible next steps.
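If you want the whole pipeline in one reusable piece, here is a minimal sketch that strings together the three steps from this article into a single function; it assumes the same CSV file and column names used above.

import pandas as pd
import numpy as np

def clean_pipeline(path: str) -> pd.DataFrame:
    # Step 0: load the raw data
    df = pd.read_csv(path)
    # Step 1: convert timestamp strings to datetime objects
    for col in ("created_at", "actual_delivery_time"):
        df[col] = pd.to_datetime(df[col], errors="coerce")
    # Step 2: fill missing store categories with the per-store mode,
    # falling back to the global mode
    global_mode = df["store_primary_category"].mode().iloc[0]
    store_mode = df.groupby("store_id")["store_primary_category"].agg(
        lambda s: s.mode().iloc[0] if not s.mode().empty else np.nan
    )
    df["store_primary_category"] = (
        df["store_primary_category"]
        .fillna(df["store_id"].map(store_mode))
        .fillna(global_mode)
    )
    # Step 3: drop any rows that still contain missing values
    return df.dropna()

df_clean = clean_pipeline("historical_data.csv")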

Real-world datasets can be messier than you think, but there are plenty of techniques and tricks for solving these problems. Thanks for reading!

Nate Rosidi is a data scientist and product strategist. He is also an adjunct professor teaching analytics, and the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview tips, shares data science projects, and covers everything SQL.
