

# Introduction
Outliers are extreme values in a dataset: so extreme that they can spoil your analysis by heavily skewing its statistics. For example, in a player height dataset, 12 feet is an outlier even for NBA players and pulls the mean up significantly.
How do we handle them? We will answer this question by working through a real-life data project that Physician Partners used in its data scientist recruitment process.
First, we’ll explore detection methods, define outliers, and finally craft prompts to implement the process.
# What are the methods of outlier detection and removal?
Outlier detection depends on the dataset you have. How?
For example, if your dataset has a normal distribution, you can use the standard deviation or z-score to detect them. However, if your dataset does not follow a normal distribution, you can use the percentile method, principal component analysis (PCA), or the interquartile range (IQR) method.
You can check this article to see how to detect outliers using a box plot.
In this section, we’ll explore the procedures and Python code to implement these techniques.
// Standard deviation method
In this method, we can define outliers by measuring how much each value deviates from the mean.
In a normal distribution, nearly all values fall within \(\pm3\) standard deviations of the mean.
To use this method, first calculate the mean and the standard deviation. Next, set the thresholds by adding and subtracting three standard deviations from the mean, and filter the dataset to retain only values within that range. Here is the pandas code that performs this operation.
```python
import pandas as pd

col = df["column"]
mean = col.mean()
std = col.std()
lower = mean - 3 * std
upper = mean + 3 * std

# Keep only values within three standard deviations of the mean
filtered_df = df[(col >= lower) & (col <= upper)]
```

We make an assumption here: the dataset should follow a normal distribution. What is a normal distribution? It means the data follows a balanced, bell-shaped distribution.
Using this method, you will flag about 0.3% of the data as outliers, because about 99.7% of the data lies within 3 standard deviations of the mean.
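The same cutoff can be phrased with z-scores, which were mentioned earlier: a value's z-score is its distance from the mean measured in standard deviations, so keeping values within three standard deviations is the same as keeping \(|z| \le 3\). A minimal sketch on toy data (the column name and values are made up for illustration):

```python
import pandas as pd

# Toy data: 29 ordinary values plus one extreme point
df = pd.DataFrame({"column": [10] * 29 + [100]})

col = df["column"]
z = (col - col.mean()) / col.std()  # z-score of each value

# Same rule as mean +/- 3 * std, expressed as |z| <= 3
filtered_df = df[z.abs() <= 3]
```

Here the lone value of 100 has a z-score above 3 and is dropped, while the ordinary values are kept.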

// IQR method
The interquartile range (IQR) covers the middle 50% of your data, the most common values in your dataset.
To detect outliers using the IQR, first calculate it. In the following code, we take the first and third quartiles (the 0.25 and 0.75 quantiles) and subtract the first from the third (\(\text{IQR} = Q3 - Q1\)).
```python
Q1 = df["column"].quantile(0.25)
Q3 = df["column"].quantile(0.75)
IQR = Q3 - Q1
```

Once you have the IQR, define the limits to create a filter.
```python
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
```

Any value outside these limits is flagged as an outlier.

```python
filtered_df = df[(df["column"] >= lower) & (df["column"] <= upper)]
```

On a box plot, the IQR is the box in the middle, and the limits we defined sit \(\pm 1.5\,\text{IQR}\) beyond it.
You can apply the IQR method to any distribution, but it works best when the distribution is not too heavily skewed.
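Before choosing the IQR method, you can check how skewed a column is with pandas' `skew()`. The cutoff of 1 below is a common rule of thumb, not a hard rule, and the data is made up for illustration:

```python
import pandas as pd

# Toy data: a small, heavily right-skewed column
df = pd.DataFrame({"column": [1, 2, 2, 3, 3, 3, 4, 4, 5, 50]})

skew = df["column"].skew()  # Fisher-Pearson skewness; 0 means symmetric

# Rule of thumb (an assumption, not a hard rule): |skew| < 1 is mild
if abs(skew) < 1:
    print("Roughly symmetric: IQR fences should work well")
else:
    print("Heavily skewed: consider percentile cutoffs instead")
```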
// Percentile method
The percentile method removes values beyond a chosen percentile threshold.
Cutting the most extreme 1% to 5% of the data is a common choice, since those tails usually contain the outliers.
We did something similar in the last section while calculating the IQR:

```python
Q1 = df["column"].quantile(0.25)
Q3 = df["column"].quantile(0.75)
```

For example, let's flag values above the 99th percentile and below the 1st percentile as outliers.
```python
lower_p = df["column"].quantile(0.01)
upper_p = df["column"].quantile(0.99)
```

Finally, filter the dataset based on these limits.

```python
filtered_df = df[(df["column"] >= lower_p) & (df["column"] <= upper_p)]
```

Unlike the standard deviation method (which assumes a normal distribution) and the IQR method (which assumes the distribution is not too skewed), this method makes no distributional assumptions.
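To see how the three methods compare, here is a sketch that runs all of them on the same synthetic column and prints how many values each flags. The data and counts are illustrative, not from the project dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# 1,000 roughly normal values plus three extreme points
values = np.concatenate([rng.normal(50, 5, 1000), [200, 250, -100]])
col = pd.Series(values, name="column")

# Standard deviation fences
sd_mask = (col - col.mean()).abs() > 3 * col.std()

# IQR fences
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
iqr_mask = (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)

# Percentile cutoffs
pct_mask = (col < col.quantile(0.01)) | (col > col.quantile(0.99))

print({"sd": int(sd_mask.sum()),
       "iqr": int(iqr_mask.sum()),
       "pct": int(pct_mask.sum())})
```

All three catch the injected extremes, but the IQR and percentile methods also flag some ordinary tail values, which illustrates why the choice of method matters.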
# Detecting Outliers: A Data Project from Physician Partners
Physician Partners is a healthcare group that helps physicians coordinate patient care more effectively. In this data project, they asked us to create an algorithm that can find outliers in one or more columns of a dataset.
First, let’s explore the dataset using this code.
```python
sfrs = pd.read_csv("sfr_test.csv")
sfrs.head()
```

The output is:
| member_unique_id | gender | sink | Eligible | Ahl_Monit | affinity_type | pbp_group | plan_name | NPI | line_of_business |
|---|---|---|---|---|---|---|---|---|---|
| 1 | f | 21/06/1990 | 2020 | 202006 | associated with | Non-SNP | Medicare – Careless | 1 | We |
| 2 | M | 02/01/1948 | 2020 | 202006 | associated with | Non-SNP | Naan | 1 | We |
| 3 | M | 14/06/1948 | 2020 | 202006 | associated with | Non-SNP | Medicare – Careless | 1 | We |
| 4 | M | 10/02/1954 | 2020 | 202006 | associated with | D-snp | Medicare – Carinades | 1 | We |
| 5 | M | 31/12/1953 | 2020 | 202006 | associated with | Non-SNP | Naan | 1 | We |
However, the dataset has more columns than head() shows. To see them all, let's use the info() method and examine the output.
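The call is simply `sfrs.info()`. Here is a self-contained sketch on a toy frame (the real `sfrs` has many more columns), capturing the report into a string so it can also be inspected programmatically:

```python
import io

import pandas as pd

# Toy frame standing in for sfrs (the real CSV has far more columns)
sfrs = pd.DataFrame({
    "member_unique_id": [1, 2, 3],
    "plan_name": ["A", None, "B"],  # note the missing value
})

buf = io.StringIO()
sfrs.info(buf=buf)  # column names, non-null counts, and dtypes
report = buf.getvalue()
print(report)
```

The non-null counts in this report are exactly how we will spot columns with missing values below.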

This dataset contains simulated health care and financial information, including demographics, plan details, clinical flags, and financial columns used to identify unusually high spending members.
Here are those columns and their descriptions.
| Column | Explanation |
|---|---|
| member_unique_id | Member ID |
| gender | Gender of the member |
| sink | Member’s date of birth |
| Eligible | Eligibility year |
| Ahl_Monit | Eligibility month |
| affinity_type | Type of physician affiliation |
| pbp_group | Health Plan Group |
| plan_name | Name of the health plan |
| NPI | Doctor’s ID |
| line_of_business | Type of health plan |
| ESRD | True if the patient is on dialysis |
| Hospice | True if the patient is in hospice |
As the project data details note, there is a catch: some data points include a dollar sign (“$”), so extra care is needed.
Let’s take a closer look at one of these columns.
Here is the output.


The dollar signs and commas need to be removed so we can analyze the data properly.
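Here is a minimal sketch of that cleanup with pandas, using pcp_cap (one of the financial columns) on made-up values:

```python
import pandas as pd

# Toy column mimicking the financial fields (values are invented)
df = pd.DataFrame({"pcp_cap": ["$1,200.50", "$980.00", "$15,000.75"]})

# Strip "$" and "," with a regex, then convert to numeric;
# errors="coerce" turns anything unparseable into NaN instead of raising
df["pcp_cap"] = pd.to_numeric(
    df["pcp_cap"].str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)
print(df["pcp_cap"].tolist())
```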
# Writing prompts for outlier detection
Now that we are familiar with the dataset, it’s time to write two different prompts: one to detect outliers and one to remove them.
// A prompt to detect outliers
We’ve learned three different techniques, so let’s include all of them in the prompt.
Also, as the info() output shows, the dataset has NaNs (missing values): most columns have 10,530 entries, but some have fewer (e.g., plan_name has only 6,606 non-null values). These need to be handled.
The prompt is:
You are a data analysis assistant. I have attached a dataset. Your task is to detect outliers using three methods: standard deviation, IQR, and percentile.
Follow these steps:
1. Load the attached dataset and remove both the “$” sign and any comma separators (“,”) from the financial columns, then convert them to numeric.
2. Handle missing values by dropping rows with NA in the numeric columns we analyze.
3. Apply the three methods to the financial columns:
- Standard deviation method: flag values outside mean +/- 3 * std
- IQR method: flag values outside Q1 - 1.5 * IQR and Q3 + 1.5 * IQR
- Percentile method: use the 1st and 99th percentiles as cutoffs
4. Instead of listing all the results for each column, just calculate and output:
- the total number of outliers found across all financial columns for each method
- the average number per column for each method
Additionally, save the row indices of detected outliers in three separate CSV files:
- sd_outlier_indices.csv
- iqr_outlier_indices.csv
- percentile_outlier_indices.csv
Output only the summary and save the indices to CSV.
```python
financial_columns = [
    "IPA_FUNDING",
    "ma_premium",
    "ma_risk_score",
    "mbr_with_rx_rebates",
    "part_size",
    "pcp_cap",
    "PCP_FFS",
    "plan_paymum",
    "Professor",
    "Insurance",
    "risk_score_parted",
    "rx",
    "rx_rebates",
    "rx_with_rebates",
    "rx_without_rebates",
    "spec_cap",
]
```
This prompt first loads the dataset and handles missing values by dropping them. Next, it outputs the outlier counts for the financial columns and creates three CSV files containing the outlier indices for each technique.
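If you’d rather run the same steps locally instead of delegating them to an LLM, here is a sketch of the pipeline the prompt describes. The helper names and the toy data are mine; for real use, load sfr_test.csv and loop over financial_columns:

```python
import pandas as pd

def clean_financial(series):
    """Strip '$' and ',' and coerce to numeric (NaN on failure)."""
    return pd.to_numeric(
        series.astype(str).str.replace(r"[$,]", "", regex=True),
        errors="coerce",
    )

def outlier_indices(col):
    """Return the row indices flagged by each of the three methods."""
    col = col.dropna()
    mean, std = col.mean(), col.std()
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    p1, p99 = col.quantile(0.01), col.quantile(0.99)
    return {
        "sd": set(col.index[(col < mean - 3 * std) | (col > mean + 3 * std)]),
        "iqr": set(col.index[(col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)]),
        "pct": set(col.index[(col < p1) | (col > p99)]),
    }

# Toy column standing in for one of the real financial columns
df = pd.DataFrame({"pcp_cap": ["$10"] * 98 + ["$9", "$1,000,000"]})
df["pcp_cap"] = clean_financial(df["pcp_cap"])
flags = outlier_indices(df["pcp_cap"])
# Indices could then be saved per method, e.g.:
# pd.Series(sorted(flags["sd"])).to_csv("sd_outlier_indices.csv", index=False)
```

Note how the methods disagree even here: the huge value is caught by all three, while the mildly low value is caught only by the IQR and percentile fences.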
// A prompt to remove the outliers
After finding the indices, the next step is to remove them. For this, we write another prompt.
You are a data analysis assistant. I have attached the dataset and a CSV containing the indices of the outliers.
Your job is to remove these outliers and return a clean version of the dataset.
1. Load the dataset.
2. Remove all the given outliers using the given indices.
3. Confirm how many values were removed.
4. Return the cleaned dataset.
This prompt first loads the dataset and then removes the outliers using the given indices.
# Testing the prompts
Let’s examine how these work. First, download the dataset.
// The outlier detection prompt
Now, attach the dataset to ChatGPT (or the large language model (LLM) of your choice) and paste the outlier detection prompt. Let’s see the output.


The output shows how many outliers each method detected, in total and on average per column, and, as requested, it saved CSV files containing the indices of these outliers.
We then ask it to make all the CSVs downloadable with this prompt:
Generate clean CSVs for download
Here is the output with links.


// The outlier removal prompt
This is the last step. Choose the method you want to use to remove outliers, then copy the outlier removal prompt, attach the corresponding CSV, and send it.
We removed the outliers. Now let’s validate this with Python. The following code reads the cleaned dataset and compares the shapes before and after.

```python
cleaned = pd.read_csv("/cleaned_dataset.csv")
print("Before:", sfrs.shape)
print("After :", cleaned.shape)
print("Removed rows:", sfrs.shape[0] - cleaned.shape[0])
```

Here is the output.
This confirms that we removed 791 outliers using the standard deviation method with ChatGPT.
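As an extra sanity check, the number of removed rows should equal the number of indices in the CSV the detection prompt saved. Here is a self-contained sketch of that check on toy stand-ins (the real sfrs and cleaned frames come from the CSVs above):

```python
import pandas as pd

# Toy stand-ins for sfrs and cleaned
sfrs = pd.DataFrame({"x": range(10)})
outlier_idx = [3, 7]  # indices the detection prompt would have saved
cleaned = sfrs.drop(index=outlier_idx)

removed = sfrs.shape[0] - cleaned.shape[0]
assert removed == len(outlier_idx)
print("Removed rows:", removed)
```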
# Final thoughts
Removing outliers not only improves the performance of your machine learning model but also makes your analysis more robust, because extreme values can distort it. Where do these outliers come from? They may be simple typos, or values that appear in the dataset but aren’t representative of the real population, such as 7-footers like Shaquille O’Neal.
To remove outliers, you can apply these techniques in Python, or go a step further and add AI to the process using prompts. Always be careful, though: your dataset may contain details that AI cannot understand at first glance, such as “$” signs.
Nate Rosidi is a data scientist and product strategist. He is also an adjunct professor teaching analytics and the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview tips, shares data science projects, and covers everything SQL.