

Photo by Author | Canva
Ever run a script and immediately wish you hadn't hit Enter?
Data science is not just a process; it is a set of survival skills – especially when you are wrangling dirty datasets or deploying models that real people depend on.
In this article, we will cover the basics of debugging in data science workflows, using a real-life dataset from DoorDash delivery data, and most importantly, how to debug like a pro.
DoorDash Delivery Duration Prediction: What Are We Dealing With?
In this data project, DoorDash asked its data science candidates to predict delivery duration. Let's first look at the data information.
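A minimal sketch of this step might look like the following, assuming the dataset has been downloaded locally as historical_data.csv (the file path is an assumption):
import pandas as pd

# Load the DoorDash historical delivery data (file location is an assumption)
historical_data = pd.read_csv("historical_data.csv")

# Inspect column names, non-null counts, and dtypes
historical_data.info()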
Here is the output:
It seems they did not provide the delivery duration, so we need to calculate it ourselves. It's easy, but if you are a beginner, don't worry. Let's see how it can be calculated.
import pandas as pd
from datetime import datetime

# Assuming historical_data is your DataFrame
historical_data["created_at"] = pd.to_datetime(historical_data["created_at"])
historical_data["actual_delivery_time"] = pd.to_datetime(historical_data["actual_delivery_time"])
historical_data["actual_total_delivery_duration"] = (historical_data["actual_delivery_time"] - historical_data["created_at"]).dt.total_seconds()
historical_data.head()
Here is the output head. You can see the new actual_total_delivery_duration column.
Well, now we can start! But before that, here is the data dictionary for this dataset.
Columns in historical_data.csv
Time Features:
- market_id: A city or region in which DoorDash operates (e.g., Los Angeles), given in the data as an id.
- created_at: Timestamp in UTC when the order was submitted by the consumer to DoorDash. (Note: this timestamp is in UTC, but if you need it, the actual timezone of the region was US/Pacific.)
- actual_delivery_time: Timestamp in UTC when the order was delivered to the consumer.
Store features:
- store_id: An id representing the restaurant the order was submitted to.
- store_primary_category: Cuisine category of the restaurant (e.g., Italian, Asian).
- order_protocol: A store can receive orders from DoorDash through many modes; this field represents an id denoting the protocol.
Order features:
- total_items: Total number of items in the order.
- subtotal: Total value of the order submitted (in cents).
- num_distinct_items: Number of distinct items included in the order.
- min_item_price: Price of the item with the least cost in the order (in cents).
- max_item_price: Price of the item with the highest cost in the order (in cents).
Market Features:
For each market, we have information about the state of the marketplace when the order was placed, which can be used to estimate delivery time. The following features are values at the time of created_at (order submission time):
- total_onshift_dashers: Number of available dashers who are within 10 miles of the store at the time of order creation.
- total_busy_dashers: Subset of the onshift dashers above who are currently working on an order.
- total_outstanding_orders: Number of orders within 10 miles of this order that are currently being processed.
Predictions from other models:
We have predictions from other models for various stages of the delivery process that we can use:
- estimated_order_place_duration: Estimated time for the restaurant to receive the order from DoorDash (in seconds).
- estimated_store_to_consumer_driving_duration: Estimated travel time between the store and the consumer (in seconds).
Great, so let’s start!
Common Debugging Errors in Data Science Projects
In this section, we will walk through common debugging errors in a data science project, from reading the dataset all the way to the most important part: modeling.
Reading the Dataset: FileNotFoundError, Dtype Warnings, and Corrections
Case 1: File not found – classic
In data science, your first bug often greets you at read_csv. And not with a hello. Let's debug this moment together, line by line. The code is:
import pandas as pd

try:
    df = pd.read_csv('Strata Questions/historical_data.csv')
    df.head(3)
except FileNotFoundError as e:
    import os
    print("File not found. Here's where Python is looking:")
    print("Working directory:", os.getcwd())
    print("Available files:", os.listdir())
    raise e
Here is the output.
You don't just hit the error – you interrogate it. This shows where the code thinks it is and what it sees around it. If your file is not on the list, now you know. No guessing. Just facts.
Change the path to the full one, and voilà!
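For example, one way to avoid relative-path surprises is to build the full path explicitly; the directory below is only a hypothetical location for illustration:
import pandas as pd
from pathlib import Path

# Hypothetical absolute location of the data folder; adjust to your machine
data_dir = Path.home() / "Strata Questions"
df = pd.read_csv(data_dir / "historical_data.csv")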
Case 2: Dtype Misinterpretation – A Silent Saboteur
You load the data, but something is off. The bug is hidden within your types.
# Assuming df is your loaded DataFrame
try:
    print("Column Types:\n", df.dtypes)
except Exception as e:
    print("Error reading dtypes:", e)
Here is the output.
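If a column that should be numeric comes back as object, a common correction is to coerce it explicitly. A small sketch, using order_protocol purely as an example column (in this dataset it may already be numeric):
# Coerce a column that should be numeric; unparseable values become NaN
df["order_protocol"] = pd.to_numeric(df["order_protocol"], errors="coerce")
print(df["order_protocol"].dtype)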
Case 3: Date Parsing – A Silent Subtlety
We discovered earlier that we needed to calculate the delivery duration first, and we did it with this approach.
try:
    # This code was shown earlier to calculate the delivery duration
    df["created_at"] = pd.to_datetime(df["created_at"])
    df["actual_delivery_time"] = pd.to_datetime(df["actual_delivery_time"])
    df["actual_total_delivery_duration"] = (df["actual_delivery_time"] - df["created_at"]).dt.total_seconds()
    print("Successfully calculated delivery duration and checked dtypes.")
    print("Relevant dtypes:\n", df[["created_at", "actual_delivery_time", "actual_total_delivery_duration"]].dtypes)
except Exception as e:
    print("Error during date processing:", e)
Here is the output.
Clean and professional! Now we avoid those red error messages that ruin the mood – I know seeing them can be discouraging.
Handling Missing Data: KeyErrors, NaNs, and Logical Gaps
Some bugs do not crash your code. They just quietly give you the wrong results, until you wonder why your model is garbage.
This section digs into missing data – not only how to clean it, but how to debug it properly.
Case 1: KeyError – You Thought That Column Existed
Here is our code.
try:
    print(df['store_rating'])
except KeyError as e:
    print("Column not found:", e)
    print("Here are the available columns:\n", df.columns.tolist())
Here is the output.
The code did not break because of logic. It broke because of an assumption. This is exactly where debugging lives. Always list your columns before relying on them.
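One way to turn that habit into code is a quick guard before you touch anything. This is just a sketch; the required column names below are only an example set:
# Fail fast if an expected column is missing
required_cols = {"created_at", "actual_delivery_time", "subtotal"}
missing = required_cols - set(df.columns)
if missing:
    raise KeyError(f"Missing expected columns: {missing}")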
Case 2: Null Counts – Missing Values You Did Not Expect
You assume everything is clean. But real-world data always hides gaps. Let's check for them.
try:
    null_counts = df.isnull().sum()
    print("Nulls per column:\n", null_counts[null_counts > 0])
except Exception as e:
    print("Failed to inspect nulls:", e)
Here is the output.
This exposes silent problems. Maybe store_primary_category is missing in thousands of rows. Maybe timestamps failed to convert and are now NaT.
You don't know until you check. Debugging is confirming every assumption.
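For instance, a quick check for timestamps that fail to parse (and would end up as NaT) might look like this sketch:
# Count values that fail datetime parsing and would become NaT
for col in ["created_at", "actual_delivery_time"]:
    n_bad = pd.to_datetime(df[col], errors="coerce").isna().sum()
    print(f"{col}: {n_bad} values could not be parsed")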
Case 3: Logical Gaps – Missing Data That Is Not Really Missing
Say you try to filter for orders where the subtotal is greater than 1,000,000, expecting hundreds of rows. But you get zero:
try:
    filtered = df[df['subtotal'] > 1000000]
    print("Rows with subtotal > 1,000,000:", filtered.shape[0])
except Exception as e:
    print("Filtering error:", e)
This is not a code error; it is a logic error. You expected high-value orders, but perhaps none exist above that threshold. Debug it with a range check:
print("Subtotal range:", df('subtotal').min(), "to", df('subtotal').max())
Here is an output.
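If the range alone is not enough, looking at the distribution helps you pick a realistic threshold. A small sketch:
# Inspect the spread of subtotal (in cents) before choosing a filter threshold
print(df["subtotal"].describe(percentiles=[0.5, 0.9, 0.99]))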
Case 4: isna() Returning Zero Does Not Mean the Data Is Clean
Even if isna().sum() shows zero, the data may still be dirty, with things like whitespace or 'None' strings. Run more aggressive checks:
try:
    fake_nulls = df[df['store_primary_category'].isin(['', ' ', 'None', None])]
    print("Rows with fake missing categories:", fake_nulls.shape[0])
except Exception as e:
    print("Fake missing value check failed:", e)
This catches the hidden junk that isnull() misses. Remember: zero nulls does not always mean clean data.
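Once found, one option is to normalize those fake nulls into real NaN so the rest of the pipeline treats them consistently. A sketch:
import numpy as np

# Turn empty strings, whitespace, and 'None' strings into real NaN
df["store_primary_category"] = df["store_primary_category"].replace(
    {"": np.nan, " ": np.nan, "None": np.nan}
)
print(df["store_primary_category"].isna().sum())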
Feature Engineering Errors: TypeErrors, Date Parsing, and More
Feature engineering seems fun at first – until a new column breaks every model or throws a TypeError mid-pipeline. Here is how to debug this stage like someone who has been burned before.
Case 1: You Think You Can Divide, But You Can't
Let's create a new feature. If an error occurs, our try-except block will catch it.
try:
    df['value_per_item'] = df['subtotal'] / df['total_items']
    print("value_per_item created successfully")
except Exception as e:
    print("Error occurred:", e)
Here is the output.
No errors? Good. But let's look closer.
print(df[['subtotal', 'total_items', 'value_per_item']].sample(3))
Here is the output.
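Beyond eyeballing a sample, it is worth checking for values the division may have produced silently, such as infinities or NaN. A sketch:
import numpy as np

# Division problems often show up as inf or NaN, not as exceptions
print("Inf values:", np.isinf(df["value_per_item"]).sum())
print("NaN values:", df["value_per_item"].isna().sum())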
Case 2: Date Parsing Gone Wrong
Now, converting your dtypes correctly matters. But what happens when you think everything was done right, yet the problems persist?
# This is the standard way, but it can fail silently on mixed types
df["created_at"] = pd.to_datetime(df["created_at"])
df["actual_delivery_time"] = pd.to_datetime(df["actual_delivery_time"])
You think this is fine, but if you have mixed types in your column, it may fail quietly or break your pipeline later. That is why it is better to use a robust, debug-friendly function instead of converting directly.
from datetime import datetime

def parse_date_debug(df, col):
    try:
        parsed = pd.to_datetime(df[col])
        print(f"(SUCCESS) '{col}' parsed successfully.")
        return parsed
    except Exception as e:
        print(f"(ERROR) Failed to parse '{col}':", e)
        # Find non-date-like values to debug
        non_datetimes = df[pd.to_datetime(df[col], errors="coerce").isna()][col].unique()
        print("Sample values causing issue:", non_datetimes[:5])
        raise

df["created_at"] = parse_date_debug(df, "created_at")
df["actual_delivery_time"] = parse_date_debug(df, "actual_delivery_time")
Here is the output.
If datetime parsing ever crashes, this helps you spot the offending rows immediately.
Case 3: A Ratio That Can Mislead
This will not raise any error in our DataFrame, since the columns are already numeric. But here is the problem: some datasets store numbers as object dtypes, even when they look numeric. That leads to:
- Misleading ratios
- Broken model behavior
- No warnings at all
df("busy_dashers_ratio") = df("total_busy_dashers") / df("total_onshift_dashers")
Let's verify the types before computing, even if the operation does not throw an error.
import numpy as np

def create_ratio_debug(df, num_col, denom_col, new_col):
    num_type = df[num_col].dtype
    denom_type = df[denom_col].dtype
    if not np.issubdtype(num_type, np.number) or not np.issubdtype(denom_type, np.number):
        print(f"(TYPE WARNING) '{num_col}' or '{denom_col}' is not numeric.")
        print(f"{num_col}: {num_type}, {denom_col}: {denom_type}")
        df[new_col] = np.nan
        return df
    if (df[denom_col] == 0).any():
        print(f"(DIVISION WARNING) '{denom_col}' contains zeros.")
    df[new_col] = df[num_col] / df[denom_col]
    return df

df = create_ratio_debug(df, "total_busy_dashers", "total_onshift_dashers", "busy_dashers_ratio")
Here is the output.
This surfaces potential division-by-zero cases and prevents silent bugs.
Modeling Errors: Shape Mismatches and NaN Confusion
Case 1: NaN Values in Features Crash the Model
Say we want to build a linear regression model. LinearRegression() does not support NaN values, so if X has a missing value in any row, the model refuses to train.
Here is the code, which creates a deliberate shape mismatch to trigger the error:
from sklearn.linear_model import LinearRegression

X_train = df[["estimated_order_place_duration", "estimated_store_to_consumer_driving_duration"]].iloc[:-10]
y_train = df["actual_total_delivery_duration"].iloc[:-5]

model = LinearRegression()
model.fit(X_train, y_train)
Here is the output.
Let's debug this issue. First, we check X_train for NaNs.
print(X_train.isna().sum())
Here is the output.
Well, let’s check the other variable too.
print(y_train.isna().sum())
Here is the output.
Both the length mismatch and the NaN values must be resolved. Here is the code to fix it.
from sklearn.linear_model import LinearRegression

# Re-align X and y to have the same length
X = df[["estimated_order_place_duration", "estimated_store_to_consumer_driving_duration"]]
y = df["actual_total_delivery_duration"]

# Step 1: Drop rows with NaN in features (X)
valid_X = X.dropna()

# Step 2: Align y to match the remaining indices of X
y_aligned = y.loc[valid_X.index]

# Step 3: Find indices where y is not NaN
valid_idx = y_aligned.dropna().index

# Step 4: Create final clean datasets
X_clean = valid_X.loc[valid_idx]
y_clean = y_aligned.loc[valid_idx]

model = LinearRegression()
model.fit(X_clean, y_clean)
print("✅ Model trained successfully!")
And voilà! Here is the output.
Case 2: Object Columns (Dates) Crash the Model
Say you try to train the model with a timestamp column like actual_delivery_time.
But – oh no – it is still an object or datetime dtype, and you accidentally mix it in with the numeric columns. Linear regression does not like that one bit.
from sklearn.linear_model import LinearRegression

X = df[["actual_delivery_time", "estimated_order_place_duration"]]
y = df["actual_total_delivery_duration"]

model = LinearRegression()
model.fit(X, y)
The error output is this:
You are combining two incompatible data types in the X matrix:
- One column (actual_delivery_time) is datetime64.
- The other (estimated_order_place_duration) is int64.
Scikit-learn expects all features to share the same numeric dtype. It cannot handle mixed types like datetime and int. Let's resolve this by converting the datetime column into a numeric representation.
# Ensure datetime columns are parsed correctly, coercing errors to NaT
df["actual_delivery_time"] = pd.to_datetime(df["actual_delivery_time"], errors="coerce")
df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")

# Recalculate duration in case of new NaNs
df["actual_total_delivery_duration"] = (df["actual_delivery_time"] - df["created_at"]).dt.total_seconds()

# Convert datetime to a numeric feature (Unix timestamp in seconds)
df["delivery_time_timestamp"] = df["actual_delivery_time"].astype("int64") // 10**9
Now that the dtype is numeric, let's apply the ML model.
from sklearn.linear_model import LinearRegression

# Use the new numeric timestamp feature
X = df[["delivery_time_timestamp", "estimated_order_place_duration"]]
y = df["actual_total_delivery_duration"]

# Drop any remaining NaNs from our feature set and target
X_clean = X.dropna()
y_clean = y.loc[X_clean.index].dropna()
X_clean = X_clean.loc[y_clean.index]

model = LinearRegression()
model.fit(X_clean, y_clean)
print("✅ Model trained successfully!")
Here is the output.
Great work!
Final Thoughts: Debug Smarter, Not Harder
Model crashes are not always caused by complex bugs – sometimes it is just a stray NaN or an unconverted date column sneaking into your data pipeline.
Instead of wrestling with cryptic stack traces or tossing try-except blocks around like darts in the dark, dig into your DataFrame. Peek at .info(), check .isna().sum(), and do not shy away from .dtypes. These simple steps uncover hidden landmines before you ever hit fit().
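If you want those habits in one place, a tiny pre-flight helper along these lines can run before every fit(). This is only a sketch (the preflight name is made up, and it reuses the X_clean and y_clean variables from the modeling section above):
def preflight(X, y):
    # Surface the usual suspects before they crash model.fit()
    print("X dtypes:\n", X.dtypes)
    print("NaNs in X:\n", X.isna().sum())
    print("NaNs in y:", y.isna().sum())
    print("Shapes:", X.shape, y.shape)
    assert len(X) == len(y), "X and y are misaligned"

preflight(X_clean, y_clean)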
I have shown you how even an overlooked object dtype or a sneaky missing value can sabotage a model. But with a sharp eye, cleaner preparation, and deliberate feature engineering, you shift from debugging reactively to building intelligently.
Nate Rosidi is a data scientist and in product strategy. He is also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.