The Basics of Debugging Python Problems

by SkillAiNest

Photo by Author | Canva

Ever run a script and immediately wish you hadn't hit Enter?

Debugging is not just a process; it is a survival skill – especially when you are wrangling messy datasets or building predictive models that real people depend on.

In this article, we will cover the basics of debugging, especially in data science workflows, using a real-life dataset from a DoorDash delivery project – and, most importantly, how to debug like a pro.

DoorDash Delivery Dataset: What Are We Dealing With?

In this data project, DoorDash asked its data science candidates to predict the delivery duration. Let's first look at the data.
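
Something like this minimal sketch reproduces that step, assuming the historical_data.csv path used later in this article and a quick .info() call to see the columns:

import pandas as pd

# Load the DoorDash historical delivery data (path assumed from the examples below)
historical_data = pd.read_csv('Strata Questions/historical_data.csv')

# Inspect column names, dtypes, and non-null counts
historical_data.info()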

The output is:


It seems they did not provide the delivery duration, so we have to calculate it here. It's easy, and if you are a beginner, don't worry. Let's see how it can be calculated.

import pandas as pd
from datetime import datetime

# Assuming historical_data is your DataFrame
historical_data("created_at") = pd.to_datetime(historical_data('created_at'))
historical_data("actual_delivery_time") = pd.to_datetime(historical_data('actual_delivery_time'))
historical_data("actual_total_delivery_duration") = (historical_data("actual_delivery_time") - historical_data("created_at")).dt.total_seconds()
historical_data.head()

Here is the head of the output. You can see the new actual_total_delivery_duration column.


Good, now we can start! But before that, here is the data dictionary for this dataset.

Columns in historical_data.csv

Time Features:

  • market_id: A city/region in which DoorDash operates, e.g., Los Angeles; given in the data as an id.
  • created_at: Timestamp in UTC when the order was submitted by the consumer to DoorDash. (Note: this timestamp is in UTC, but if you need it, the actual timezone of the region is US/Pacific.)
  • actual_delivery_time: Timestamp in UTC when the order was delivered to the consumer.

Store features:

  • store_id: An id representing the restaurant the order was submitted to.
  • store_primary_category: Cuisine category of the restaurant, e.g., Italian, Asian.
  • order_protocol: A store can receive orders from DoorDash through several modes; this field represents an id denoting the protocol.

Order features:

  • total_items: Total number of items in the order.
  • subtotal: Total value of the order submitted (in cents).
  • num_distinct_items: Number of distinct items included in the order.
  • min_item_price: Price of the item with the lowest cost in the order (in cents).
  • max_item_price: Price of the item with the highest cost in the order (in cents).

Market Features:

For the market, we have information about the state of the marketplace when the order was placed, which can be used to estimate delivery time. The following features are values at the time of created_at (order submission time):

  • total_onshift_dashers: Number of available dashers who are within 10 miles of the store at the time of order creation.
  • total_busy_dashers: Subset of the above total_onshift_dashers who are currently working on an order.
  • total_outstanding_orders: Number of orders within 10 miles of this order that are currently being processed.

Predictions from other models:

We have predictions from other models for various stages of the delivery process that we can use:

  • estimated_order_place_duration: Estimated time for the restaurant to receive the order from DoorDash (in seconds).
  • estimated_store_to_consumer_driving_duration: Estimated travel time between the store and the consumer (in seconds).

Great, so let’s start!

Common Debugging Errors in Data Science Projects


In this section, we will explore common debugging errors in a data science project, starting from reading the dataset and working through to the most important part: modeling.

Reading the Dataset: FileNotFoundError, dtype Warnings, and Fixes

Case 1: File Not Found – the Classic

In data science, your first bug often greets you at read_csv – and not with a hello. Let's debug this moment together, line by line. The code is:

import pandas as pd

try:
    df = pd.read_csv('Strata Questions/historical_data.csv')
    df.head(3)
except FileNotFoundError as e:
    import os
    print("File not found. Here's where Python is looking:")
    print("Working directory:", os.getcwd())
    print("Available files:", os.listdir())
    raise e

Here is an output.


You don't just hit the error – you interrogate it. This shows where the code thinks it is running and what it sees around it. If your file is not in that list, now you know. No guessing. Only facts.

Change the path to the full one, and voilà!
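
One way to make the path robust (a sketch, assuming the CSV sits in a 'Strata Questions' folder next to the script) is to build it from the script's own location rather than the working directory:

import os
import pandas as pd

# Build an absolute path relative to this script so the working directory no longer matters
base_dir = os.path.dirname(os.path.abspath(__file__))
df = pd.read_csv(os.path.join(base_dir, 'Strata Questions', 'historical_data.csv'))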


Case 2: Dtype Misinterpretation – Python's Silent Misunderstanding

You load the data, but something is off. The bug is hidden within your types.

# Assuming df is your loaded DataFrame
try:
    print("Column Types:\n", df.dtypes)
except Exception as e:
    print("Error reading dtypes:", e)

Here is an output.

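If a column was read with the wrong dtype, one option (a sketch using this dataset's timestamp columns) is to tell read_csv how to parse them up front rather than fixing them afterwards:

# Parse the timestamp columns while loading instead of converting them later
df = pd.read_csv(
    'Strata Questions/historical_data.csv',
    parse_dates=['created_at', 'actual_delivery_time'],
)
print(df.dtypes)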

Case 3: Date Parsing – the Silent Saboteur

We discovered earlier that we needed to calculate the delivery duration first, and we did it with this method.

try:
    # This code was shown earlier to calculate the delivery duration
    df("created_at") = pd.to_datetime(df('created_at'))
    df("actual_delivery_time") = pd.to_datetime(df('actual_delivery_time'))
    df("actual_total_delivery_duration") = (df("actual_delivery_time") - df("created_at")).dt.total_seconds()
    print("Successfully calculated delivery duration and checked dtypes.")
    print("Relevant dtypes:\n", df(('created_at', 'actual_delivery_time', 'actual_total_delivery_duration')).dtypes)
except Exception as e:
    print("Error during date processing:", e)

Here is an output.


Clean and professional! From now on we avoid those red errors that spoil the mood – I know seeing them can be discouraging.

Handling Missing Data: KeyErrors, NaNs, and Logical Gaps

Some bugs don't crash your code. They just quietly give you the wrong results, until you wonder why your model is garbage.

This section digs into missing data – not just how to clean it, but how to debug it properly.

Case 1: KeyError – You Thought That Column Existed

Here is our code.

try:
    print(df['store_rating'])
except KeyError as e:
    print("Column not found:", e)
    print("Here are the available columns:\n", df.columns.tolist())

Here is an output.


The code did not break because of logic. It broke because of an assumption. That is exactly where debugging lives. Always list your columns before relying on them.

Case 2: NaN Counts – Missing Values You Did Not Expect

You assume everything is clean. But real-world data always hides gaps. Let's check for them.

try:
    null_counts = df.isnull().sum()
    print("Nulls per column:\n", null_counts(null_counts > 0))
except Exception as e:
    print("Failed to inspect nulls:", e)

Here is an output.


This exposes silent problems. Maybe store_primary_category is missing in thousands of rows. Maybe some timestamps failed to convert and are now NaT.

You don't know until you check. Debugging means confirming every assumption.
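
For example, a quick way to surface silent NaT values (a sketch, using this dataset's actual_delivery_time column) is to coerce the conversion and count what fails:

# errors="coerce" turns unparseable timestamps into NaT instead of raising
converted = pd.to_datetime(df['actual_delivery_time'], errors='coerce')
print("Timestamps that became NaT:", converted.isna().sum())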

Case 3: Logical Gaps – Missing Data That Isn't Really Missing

Say you try to filter the orders where the subtotal is greater than 1,000,000, expecting hundreds of rows. But it gives you zero:

try:
    filtered = df[df['subtotal'] > 1000000]
    print("Rows with subtotal > 1,000,000:", filtered.shape[0])
except Exception as e:
    print("Filtering error:", e)

This is not a code error. It is a logic error. You expected high-value orders, but perhaps none exist above that threshold. Debug it with a range check:

print("Subtotal range:", df('subtotal').min(), "to", df('subtotal').max())

Here is an output.


Case 4: isna() Equal to Zero Does Not Mean It's Clean

Even if isna().sum() shows zero, there may be dirty data, such as whitespace or 'None' strings. Run more aggressive checks:

try:
    fake_nulls = df[df['store_primary_category'].isin(['', ' ', 'None', None])]
    print("Rows with fake missing categories:", fake_nulls.shape[0])
except Exception as e:
    print("Fake missing value check failed:", e)

This catches the hidden garbage that isnull() misses.
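
Once found, one option (a sketch, assuming you want these placeholders treated as real missing values) is to normalize them to NaN so the usual checks can see them:

import numpy as np

# Replace placeholder strings with real NaN so isnull() reports them
df['store_primary_category'] = df['store_primary_category'].replace(['', ' ', 'None'], np.nan)
print("Nulls after normalization:", df['store_primary_category'].isnull().sum())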


Feature Engineering Errors: TypeErrors, Date Parsing, and More

Feature engineering seems fun at first, until your new column breaks every model or throws a TypeError mid-pipeline. Here is how to debug this stage like someone who has been burned before.

Case 1: You Think You Can Divide, but You Can't

Let's create a new feature. If an error occurs, our try-except block will catch it.

try:
    df['value_per_item'] = df['subtotal'] / df['total_items']
    print("value_per_item created successfully")
except Exception as e:
    print("Error occurred:", e)

Here is an output.


No errors? Good, but let's look closer.

print(df[['subtotal', 'total_items', 'value_per_item']].sample(3))

Here is an output.

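Even when no exception is raised, it is worth checking whether the division silently produced inf or NaN values (a sketch; whether total_items actually contains zeros depends on the data):

import numpy as np

# Division by zero in pandas produces inf (or NaN), not an error -- check for it
print("inf values:", np.isinf(df['value_per_item']).sum())
print("NaN values:", df['value_per_item'].isna().sum())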

Case 2: Date Parsing Gone Wrong

Now, converting your dtypes matters, but what happens when you think everything has been done correctly, yet the problems persist?

# This is the standard way, but it can fail silently on mixed types
df("created_at") = pd.to_datetime(df("created_at"))
df("actual_delivery_time") = pd.to_datetime(df("actual_delivery_time"))

You think this is fine, but if your column contains mixed types, it may fail quietly or break your pipeline later. That is why it is better to use a robust function instead of converting directly.

from datetime import datetime

def parse_date_debug(df, col):
    try:
        parsed = pd.to_datetime(df[col])
        print(f"(SUCCESS) '{col}' parsed successfully.")
        return parsed
    except Exception as e:
        print(f"(ERROR) Failed to parse '{col}':", e)
        # Find non-date-like values to debug
        non_datetimes = df[pd.to_datetime(df[col], errors="coerce").isna()][col].unique()
        print("Sample values causing issue:", non_datetimes[:5])
        raise

df["created_at"] = parse_date_debug(df, "created_at")
df["actual_delivery_time"] = parse_date_debug(df, "actual_delivery_time")

Here is an output.


When datetime parsing crashes, this helps you detect the offending rows.

Case 3: Bad Division That Can Mislead

This will not throw any error on our DataFrame because the columns are already numeric. But here is the problem: some datasets store these as object types, even when they look like numbers. That leads to:

  • Misleading ratios
  • Invalid model behavior
  • No warnings at all
df["busy_dashers_ratio"] = df["total_busy_dashers"] / df["total_onshift_dashers"]

Let's verify the types before computing the ratio, even if the operation does not throw an error.

import numpy as np

def create_ratio_debug(df, num_col, denom_col, new_col):
    num_type = df[num_col].dtype
    denom_type = df[denom_col].dtype

    if not np.issubdtype(num_type, np.number) or not np.issubdtype(denom_type, np.number):
        print(f"(TYPE WARNING) '{num_col}' or '{denom_col}' is not numeric.")
        print(f"{num_col}: {num_type}, {denom_col}: {denom_type}")
        df[new_col] = np.nan
        return df

    if (df[denom_col] == 0).any():
        print(f"(DIVISION WARNING) '{denom_col}' contains zeros.")

    df[new_col] = df[num_col] / df[denom_col]
    return df

df = create_ratio_debug(df, "total_busy_dashers", "total_onshift_dashers", "busy_dashers_ratio")

Here is an output.


This surfaces potential division-by-zero cases and prevents silent bugs.

Modeling Errors: Shape Mismatches and Dtype Confusion

Case 1: NaN Values in Features Crash the Model

Say we want to build a linear regression model. LinearRegression() does not support NaN values. If X has a missing value in any row, the model refuses to train.

Here is the code, which creates a deliberate shape mismatch to trigger the error:

from sklearn.linear_model import LinearRegression

X_train = df(("estimated_order_place_duration", "estimated_store_to_consumer_driving_duration")).iloc(:-10)
y_train = df("actual_total_delivery_duration").iloc(:-5) 
model = LinearRegression()
model.fit(X_train, y_train)

Here is an output.


Let's debug this issue. First, we check for NaNs.

print(X_train.isna().sum())

Here is an output.


Well, let’s check the other variable too.

print(y_train.isna().sum())

Here is an output.


Both the misalignment and the NaN values must be resolved. Here is the code to fix it.

from sklearn.linear_model import LinearRegression

# Re-align X and y to have the same length
X = df(("estimated_order_place_duration", "estimated_store_to_consumer_driving_duration"))
y = df("actual_total_delivery_duration")

# Step 1: Drop rows with NaN in features (X)
valid_X = X.dropna()

# Step 2: Align y to match the remaining indices of X
y_aligned = y.loc(valid_X.index)

# Step 3: Find indices where y is not NaN
valid_idx = y_aligned.dropna().index

# Step 4: Create final clean datasets
X_clean = valid_X.loc(valid_idx)
y_clean = y_aligned.loc(valid_idx)

model = LinearRegression()
model.fit(X_clean, y_clean)
print("✅ Model trained successfully!")

And voilà! Here is the output.


Case 2: Object Columns (Dates) Crash the Model

Say you try to train the model on a timestamp like actual_delivery_time.

But – oh no – it is still an object or datetime column, and you mistakenly mix it with numeric columns. Linear regression does not like that one bit.

from sklearn.linear_model import LinearRegression

X = df(("actual_delivery_time", "estimated_order_place_duration"))
y = df("actual_total_delivery_duration")

model = LinearRegression()
model.fit(X, y)

Here is the error:


You are combining two conflicting data types in the X matrix:

  • One column (actual_delivery_time) is datetime64.
  • The other (estimated_order_place_duration) is int64.

Scikit-learn expects all features to be of the same numeric dtype. It cannot handle mixed types such as datetime and int. Let's resolve this by converting the datetime column into a numeric representation.

# Ensure datetime columns are parsed correctly, coercing errors to NaT
df("actual_delivery_time") = pd.to_datetime(df("actual_delivery_time"), errors="coerce")
df("created_at") = pd.to_datetime(df("created_at"), errors="coerce")

# Recalculate duration in case of new NaNs
df("actual_total_delivery_duration") = (df("actual_delivery_time") - df("created_at")).dt.total_seconds()

# Convert datetime to a numeric feature (Unix timestamp in seconds)
df("delivery_time_timestamp") = df("actual_delivery_time").astype("int64") // 10**9

Now that the dtype is numeric, let's apply the ML model.

from sklearn.linear_model import LinearRegression

# Use the new numeric timestamp feature
X = df(("delivery_time_timestamp", "estimated_order_place_duration"))
y = df("actual_total_delivery_duration")

# Drop any remaining NaNs from our feature set and target
X_clean = X.dropna()
y_clean = y.loc(X_clean.index).dropna()
X_clean = X_clean.loc(y_clean.index)

model = LinearRegression()
model.fit(X_clean, y_clean)
print("✅ Model trained successfully!")

Here is an output.


Great work!

Final Thoughts: Debug Smarter, Not Harder

Model crashes are not always caused by complex bugs – sometimes it is just a stray NaN or an unconverted datetime column jamming your data pipeline.

Instead of wrestling with cryptic stack traces or tossing try-except blocks around like darts in the dark, dig into your DataFrame. Peek at .info(), check .isna().sum(), and do not skip .dtypes. These simple steps defuse hidden landmines before you ever hit fit().
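
As a quick pre-flight routine before any fit(), something along these lines (a sketch) covers most of the checks above:

def preflight_check(df):
    # Quick sanity checks before modeling
    df.info()  # dtypes and non-null counts per column
    print("Missing values per column:\n", df.isna().sum())
    print("Object columns:", df.select_dtypes(include="object").columns.tolist())

preflight_check(df)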

I have shown you that even an overlooked object dtype or a sneaky missing value can sabotage a model. But with a sharp eye, cleaner preparation, and deliberate feature engineering, you will shift from reactive debugging to building with intelligence.

Nate Rosidi is a data scientist and works in product strategy. He is also an adjunct professor teaching analytics, and the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.
