Quick engineering for outlier detection

Photo by author

# Introduction

Outliers represent extreme values in a given dataset. They are so extreme that they can spoil your analysis like heavily skewed statistics. For example, in a player height dataset, 12 feet is an outlier even for NBA players and pulls the mean up significantly.

How do we handle them? We will answer this question by conducting a real-life data project requested by physician partners during the data scientist recruitment process.

First, we’ll explore detection methods, define outliers, and finally craft prompts to implement the process.

# What are the methods of outlier detection and removal?

Outlier detection depends on the dataset you have. How?

For example, if your dataset has a normal distribution, you can use the standard deviation or z-score to detect them. However, if your dataset does not follow a normal distribution, you can use the percentile method, principal component analysis (PCA), or the interquartile range (IQR) method.

You can check This article To see how to detect outliers using a box plot.

In this section, we’ll explore the procedures and Python code to implement these techniques.

// Standard deviation method

In this method, we can define outliers by measuring how much each value deviates from the mean.

For example, in the graph below, you can see the normal distribution and $\pm3$ standard deviation from the mean.

To use this method, first measure the mean and calculate the standard deviation. Next, determine the threshold by adding and subtracting three standard deviations from the mean, and filter the dataset to retain only values within that range. This is Pandas The code that performs this operation.

import pandas as pd
import numpy as np

col = df('column')

mean = col.mean()
std = col.std()

lower = mean - 3 * std
upper = mean + 3 * std

# Keep values within the 3 std dev range
filtered_df = df((col >= lower) & (col <= upper))

We make an assumption: the dataset should follow a normal distribution. What is a Normal distribution? This means that the data follows a balanced, bell-shaped distribution. Here is an example:

Using this method, you will flag about 0.3% of the data as outliers, because about 99.99.7% of the data 3 standard deviations from the mean are included.
Quick engineering for outlier detection

// IQR

The interquartile range (IQR) represents the middle 50% of your data and shows the most common values in your dataset, as shown in the graph below.

To detect outliers using the IQR, first calculate the IQR. In the following code, we define the first and third quartiles and subtract the first quartile from the third to find the IQR (\ (0.75 – 0.25 = 0.5 \)).

Q1 = df('column').quantile(0.25)
Q3 = df('column').quantile(0.75)

IQR = Q3 - Q1

Once you have the IQR, you need to create a filter, defining the limits.

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

Any value outside these ranges will be flagged as an outlier.

filtered_df = df((df('column') >= lower) & (df('column') <= upper))

As you can see from the image below, the IQR represents the box in the middle. You can clearly see the limits we have defined ($\(\pm1.5\text{IQR}$).

You can apply IQR to any distribution, but it works best if the distribution is not skewed too much.

// One hundred percent

The percentile method involves removing values based on a chosen threshold.

This threshold is commonly used because it removes the most extreme 1% to 5% of the data, which are usually outliers.

We did the same thing in the last section while calculating the IQR, like this:

Q1 = df('column').quantile(0.25)
Q3 = df('column').quantile(0.75)

For example, let’s define the top 99% and lower 1% of the dataset as outliers.

lower_p = df('column').quantile(0.01)
upper_p = df('column').quantile(0.99)

Finally, filter the dataset based on these limits.

filtered_df = df((df('column') >= lower_p) & (df('column') <= upper_p))

This method does not rely on assumptions unlike standard deviation (normal distribution) and IQR methods (no maximum skewed distribution).

# Detecting Outliers Data Project from Physician Partners

Physician Partners is a healthcare group that helps physicians coordinate patient care more effectively. i This data projectthey asked us to create an algorithm that can find outliers in data in one or more columns.

First, let’s explore the dataset using this code.

sfrs = pd.read_csv('sfr_test.csv')
sfrs.head()

The output is:

member_unique_id	gender	sink	Eligible	Ahl_Monit	affinity_type	pbp_group	plan_name	NPI	line_of_business
1	f	21/06/1990	2020	202006	associated with	Non-SNP	Medicare – Careless	1	We
2	M	02/01/1948	2020	202006	associated with	Non-SNP	Naan	1	We
3	M	14/06/1948	2020	202006	associated with	Non-SNP	Medicare – Careless	1	We
4	M	10/02/1954	2020	202006	associated with	D-snp	Medicare – Carinades	1	We
5	M	31/12/1953	2020	202006	associated with	Non-SNP	Naan	1	We

However, there are more columns that we haven’t seen head() To see them, let’s use the method info() method

And let’s see the output.
Quick engineering for outlier detection

This dataset contains simulated health care and financial information, including demographics, plan details, clinical flags, and financial columns used to identify unusually high spending members.

Here are those columns and their descriptions.

Column	Explanation
member_unique_id	Member ID
gender	Gender of the member
sink	Member’s date of birth
Eligible	The year
Ahl_Monit	The month
affinity_type	Kind of doctor
pbp_group	Health Plan Group
plan_name	Name of the health plan
NPI	Doctor’s ID
line_of_business	Type of health plan
ESRD	True if the patient is on dialysis
Hospice	True if the patient is in hospice

As you can see from the project data detail, there is a catch: some data points include a dollar sign (“$”), so care needs to be taken.

Let’s take a closer look at this column.

Here is the output.

Quick engineering for outlier detection

Dollar signs and those commas need to be resolved so we can analyze the data properly.

# Quick manipulation for outlier detection

Now we are familiar with the details of the dataset. Now it’s time to write two different notations: one to detect outliers and one to remove them.

// A signal to detect outliers

We’ve learned three different techniques, so we should add them right away.

Also, as you can see info() As the procedure outputs, the dataset has NANs (missing values): most columns have 10,530 entries, but some columns have missing values (eg, plan_name 6,606 columns with non-null values). It should be taken care of.

The hint is:

You are a data analysis assistant. I have attached a dataset. Your task is to detect outliers using three methods: standard deviation, IQR, and percentile.
Follow these steps:
1. Load the linked dataset and remove both the “$” sign and any comma separators (“,”) from the financial columns, then convert them to numeric.
2. Handle missing values by removing rows with NA in numeric columns that we analyze.
3. Apply the three methods to the financial columns:
Standard deviation method: Flag values are mean +/- 3 * std
IQR method: Q1 – 1.5 * IQR and Q3 + 1.5 * Flag values outside IQR
Percentile Method: Use the 1st and 99th percentiles as cutoffs
4. Instead of listing all the results for each column, just calculate and output:
– Total number of outliers found in all financial columns for each procedure
– Average number per column for each procedure
Additionally, save the row indices of detected outliers in three separate CSV files:
– sd_outlier_indices.csv
– iqr_outlier_indices.csv
– Percentage
The output simply calculates the summary and saves the indices to CSV.
financial_columns = (
“IPA_FUNDING”,
“ma_premium”,
“ma_risk_score”,
“mbr_with_rx_rebates”,
“part_size”,
“pcp_cap”,
“PCP_FFS”,
“plan_paymum”,
“Professor”,
“Insurance”,
“risk_score_parted”,
“rx”,
“rx_rebates”,
“rx_with_rebates”,
“rx_without_rebates”,
“spec_cap”
Jeez

The above command will first load the dataset and handle the missing values by removing them. Next, it will output the number of exits using the financial columns and create three CSV files. These will include indicators of missing values for each of these techniques.

// A signal to remove the outliers

After finding the indexes, the next step is to eliminate them. To do this, we will also write a hint.

You are a data analysis assistant. I’ve also attached a dataset to the csv that contains the indexes that are going out.
Your job is to remove these outliers and return a clean version of the dataset.
1. Load the dataset.
2. Remove all the given outliers using the given indices.
3. Confirm how many values were removed.
4. Return the cleaned dataset.

This indicator first loads the dataset and removes outliers using the given indices.

# Test pointers

Let’s examine how these work. First, download the dataset.

// Outlier detection signal

Now, attach the dataset you want to ChatGPT (or the Large Language Model (LLM) of your choice). Paste indicator to detect outliers after connecting to the dataset. Let’s see the output.

Quick engineering for outlier detection

The output shows how many outliers each method detects, averaged per column, and, as requested, contains CSV files containing the identities of these outliers.

We then tell it to make all CSVs downloadable with this hint:

Generate clean CSVs for download

Here is the output with links.

Quick engineering for outlier detection

// Hint to remove outliers

This is the last step. Select the method you want to use to remove outliers, then copy the outlier removal prompt. Attach the CSV to this prompt and send it.

We removed the outliers. Now, let’s validate this using Python. The following code will read the cleaned dataset and compare the shapes to display before and after.

cleaned = pd.read_csv("/cleaned_dataset.csv")

print("Before:", sfrs.shape)
print("After :", cleaned.shape)
print("Removed rows:", sfrs.shape(0) - cleaned.shape(0))

Here is the output.

This confirms that we removed 791 outliers, using the standard deviation method with ChatGPT.

# Final thoughts

Removing outliers not only increases the performance of your machine learning model, but also makes your analysis more robust. Extreme values can distort your analysis. The reason for these outliers? They could be simple typos, or they could be values that appear in the dataset but aren’t representative of the real population, like 7-foot guys like Shaquille O’Neal.

To remove outliers, you can use these techniques using Python or go a step further and add AI to the process using your pointers. Always be very careful because your dataset may contain details that AI cannot understand at first glance like “$” signs.

Nate Rosedy A data scientist and product strategist. He is also an adjunct professor teaching analytics, and the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Netcareer writes on the latest trends in the market, gives interview tips, shares data science projects, and covers everything SQL.

# Introduction

# What are the methods of outlier detection and removal?

// Standard deviation method

// IQR

// One hundred percent

# Detecting Outliers Data Project from Physician Partners

# Quick manipulation for outlier detection

// A signal to detect outliers

// A signal to remove the outliers

# Test pointers

// Outlier detection signal

// Hint to remove outliers

# Final thoughts

Editor's pick

Get latest news

Quick engineering for outlier detection

# Introduction

# What are the methods of outlier detection and removal?

// Standard deviation method

// IQR

// One hundred percent

# Detecting Outliers Data Project from Physician Partners

# Quick manipulation for outlier detection

// A signal to detect outliers

// A signal to remove the outliers

# Test pointers

// Outlier detection signal

// Hint to remove outliers

# Final thoughts

5 Innovative Automotive Technologies to Watch in 2026

Brand Contextual AI: The Missing Need for Marketing AI

You may also like

Leave a Comment Cancel Reply

Editor's pick

Get latest news