Quick engineering for outlier detection

by SkillAiNest

Quick engineering for outlier detectionQuick engineering for outlier detection
Photo by author

# Introduction

Outliers represent extreme values ​​in a given dataset. They are so extreme that they can spoil your analysis like heavily skewed statistics. For example, in a player height dataset, 12 feet is an outlier even for NBA players and pulls the mean up significantly.

How do we handle them? We will answer this question by conducting a real-life data project requested by physician partners during the data scientist recruitment process.

First, we’ll explore detection methods, define outliers, and finally craft prompts to implement the process.

# What are the methods of outlier detection and removal?

Outlier detection depends on the dataset you have. How?

For example, if your dataset has a normal distribution, you can use the standard deviation or z-score to detect them. However, if your dataset does not follow a normal distribution, you can use the percentile method, principal component analysis (PCA), or the interquartile range (IQR) method.

You can check This article To see how to detect outliers using a box plot.

In this section, we’ll explore the procedures and Python code to implement these techniques.

// Standard deviation method

In this method, we can define outliers by measuring how much each value deviates from the mean.

For example, in the graph below, you can see the normal distribution and \(\pm3\) standard deviation from the mean.

Quick engineering for outlier detectionQuick engineering for outlier detection

To use this method, first measure the mean and calculate the standard deviation. Next, determine the threshold by adding and subtracting three standard deviations from the mean, and filter the dataset to retain only values ​​within that range. This is Pandas The code that performs this operation.

import pandas as pd
import numpy as np

col = df('column')

mean = col.mean()
std = col.std()

lower = mean - 3 * std
upper = mean + 3 * std

# Keep values within the 3 std dev range
filtered_df = df((col >= lower) & (col <= upper))

We make an assumption: the dataset should follow a normal distribution. What is a Normal distribution? This means that the data follows a balanced, bell-shaped distribution. Here is an example:

Quick engineering for outlier detectionQuick engineering for outlier detection

Using this method, you will flag about 0.3% of the data as outliers, because about 99.99.7% of the data 3 standard deviations from the mean are included.
Quick engineering for outlier detectionQuick engineering for outlier detection

// IQR

The interquartile range (IQR) represents the middle 50% of your data and shows the most common values ​​in your dataset, as shown in the graph below.

Quick engineering for outlier detectionQuick engineering for outlier detection

To detect outliers using the IQR, first calculate the IQR. In the following code, we define the first and third quartiles and subtract the first quartile from the third to find the IQR (\ (0.75 – 0.25 = 0.5 \)).

Q1 = df('column').quantile(0.25)
Q3 = df('column').quantile(0.75)

IQR = Q3 - Q1

Once you have the IQR, you need to create a filter, defining the limits.

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

Any value outside these ranges will be flagged as an outlier.

filtered_df = df((df('column') >= lower) & (df('column') <= upper))

As you can see from the image below, the IQR represents the box in the middle. You can clearly see the limits we have defined (\(\(\pm1.5\text{IQR}\)).
Quick engineering for outlier detectionQuick engineering for outlier detection

You can apply IQR to any distribution, but it works best if the distribution is not skewed too much.

// One hundred percent

The percentile method involves removing values ​​based on a chosen threshold.

This threshold is commonly used because it removes the most extreme 1% to 5% of the data, which are usually outliers.

We did the same thing in the last section while calculating the IQR, like this:

Q1 = df('column').quantile(0.25)
Q3 = df('column').quantile(0.75)

For example, let’s define the top 99% and lower 1% of the dataset as outliers.

lower_p = df('column').quantile(0.01)
upper_p = df('column').quantile(0.99)

Finally, filter the dataset based on these limits.

filtered_df = df((df('column') >= lower_p) & (df('column') <= upper_p))

This method does not rely on assumptions unlike standard deviation (normal distribution) and IQR methods (no maximum skewed distribution).

# Detecting Outliers Data Project from Physician Partners

Physician Partners is a healthcare group that helps physicians coordinate patient care more effectively. i This data projectthey asked us to create an algorithm that can find outliers in data in one or more columns.

First, let’s explore the dataset using this code.

sfrs = pd.read_csv('sfr_test.csv')
sfrs.head()

The output is:

member_unique_idgendersinkEligibleAhl_Monitaffinity_typepbp_groupplan_nameNPIline_of_business
1f21/06/19902020202006associated withNon-SNPMedicare – Careless1We
2M02/01/19482020202006associated withNon-SNPNaan1We
3M14/06/19482020202006associated withNon-SNPMedicare – Careless1We
4M10/02/19542020202006associated withD-snpMedicare – Carinades1We
5M31/12/19532020202006associated withNon-SNPNaan1We

However, there are more columns that we haven’t seen head() To see them, let’s use the method info() method

And let’s see the output.
Quick engineering for outlier detectionQuick engineering for outlier detection

This dataset contains simulated health care and financial information, including demographics, plan details, clinical flags, and financial columns used to identify unusually high spending members.

Here are those columns and their descriptions.

ColumnExplanation
member_unique_idMember ID
genderGender of the member
sinkMember’s date of birth
EligibleThe year
Ahl_MonitThe month
affinity_typeKind of doctor
pbp_groupHealth Plan Group
plan_nameName of the health plan
NPIDoctor’s ID
line_of_businessType of health plan
ESRDTrue if the patient is on dialysis
HospiceTrue if the patient is in hospice

As you can see from the project data detail, there is a catch: some data points include a dollar sign (“$”), so care needs to be taken.

Quick engineering for outlier detectionQuick engineering for outlier detection

Let’s take a closer look at this column.

Here is the output.

Quick engineering for outlier detectionQuick engineering for outlier detection

Dollar signs and those commas need to be resolved so we can analyze the data properly.

# Quick manipulation for outlier detection

Now we are familiar with the details of the dataset. Now it’s time to write two different notations: one to detect outliers and one to remove them.

// A signal to detect outliers

We’ve learned three different techniques, so we should add them right away.

Also, as you can see info() As the procedure outputs, the dataset has NANs (missing values): most columns have 10,530 entries, but some columns have missing values ​​(eg, plan_name 6,606 columns with non-null values). It should be taken care of.

The hint is:

You are a data analysis assistant. I have attached a dataset. Your task is to detect outliers using three methods: standard deviation, IQR, and percentile.

Follow these steps:

1. Load the linked dataset and remove both the “$” sign and any comma separators (“,”) from the financial columns, then convert them to numeric.

2. Handle missing values ​​by removing rows with NA in numeric columns that we analyze.

3. Apply the three methods to the financial columns:

Standard deviation method: Flag values ​​are mean +/- 3 * std

IQR method: Q1 – 1.5 * IQR and Q3 + 1.5 * Flag values ​​outside IQR

Percentile Method: Use the 1st and 99th percentiles as cutoffs

4. Instead of listing all the results for each column, just calculate and output:

– Total number of outliers found in all financial columns for each procedure
– Average number per column for each procedure

Additionally, save the row indices of detected outliers in three separate CSV files:
– sd_outlier_indices.csv
– iqr_outlier_indices.csv
– Percentage

The output simply calculates the summary and saves the indices to CSV.

financial_columns = (
“IPA_FUNDING”,
“ma_premium”,
“ma_risk_score”,
“mbr_with_rx_rebates”,
“part_size”,
“pcp_cap”,
“PCP_FFS”,
“plan_paymum”,
“Professor”,
“Insurance”,
“risk_score_parted”,
“rx”,
“rx_rebates”,
“rx_with_rebates”,
“rx_without_rebates”,
“spec_cap”
Jeez

The above command will first load the dataset and handle the missing values ​​by removing them. Next, it will output the number of exits using the financial columns and create three CSV files. These will include indicators of missing values ​​for each of these techniques.

// A signal to remove the outliers

After finding the indexes, the next step is to eliminate them. To do this, we will also write a hint.

You are a data analysis assistant. I’ve also attached a dataset to the csv that contains the indexes that are going out.

Your job is to remove these outliers and return a clean version of the dataset.

1. Load the dataset.
2. Remove all the given outliers using the given indices.
3. Confirm how many values ​​were removed.
4. Return the cleaned dataset.

This indicator first loads the dataset and removes outliers using the given indices.

# Test pointers

Let’s examine how these work. First, download the dataset.

// Outlier detection signal

Now, attach the dataset you want to ChatGPT (or the Large Language Model (LLM) of your choice). Paste indicator to detect outliers after connecting to the dataset. Let’s see the output.

Quick engineering for outlier detectionQuick engineering for outlier detection

The output shows how many outliers each method detects, averaged per column, and, as requested, contains CSV files containing the identities of these outliers.

We then tell it to make all CSVs downloadable with this hint:

Generate clean CSVs for download

Here is the output with links.

Quick engineering for outlier detectionQuick engineering for outlier detection

// Hint to remove outliers

This is the last step. Select the method you want to use to remove outliers, then copy the outlier removal prompt. Attach the CSV to this prompt and send it.

Quick engineering for outlier detectionQuick engineering for outlier detection

We removed the outliers. Now, let’s validate this using Python. The following code will read the cleaned dataset and compare the shapes to display before and after.

cleaned = pd.read_csv("/cleaned_dataset.csv")

print("Before:", sfrs.shape)
print("After :", cleaned.shape)
print("Removed rows:", sfrs.shape(0) - cleaned.shape(0))

Here is the output.

Quick engineering for outlier detectionQuick engineering for outlier detection

This confirms that we removed 791 outliers, using the standard deviation method with ChatGPT.

# Final thoughts

Removing outliers not only increases the performance of your machine learning model, but also makes your analysis more robust. Extreme values ​​can distort your analysis. The reason for these outliers? They could be simple typos, or they could be values ​​that appear in the dataset but aren’t representative of the real population, like 7-foot guys like Shaquille O’Neal.

To remove outliers, you can use these techniques using Python or go a step further and add AI to the process using your pointers. Always be very careful because your dataset may contain details that AI cannot understand at first glance like “$” signs.

Nate Rosedy A data scientist and product strategist. He is also an adjunct professor teaching analytics, and the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Netcareer writes on the latest trends in the market, gives interview tips, shares data science projects, and covers everything SQL.

You may also like

Leave a Comment

At Skillainest, we believe the future belongs to those who embrace AI, upgrade their skills, and stay ahead of the curve.

Get latest news

Subscribe my Newsletter for new blog posts, tips & new photos. Let's stay updated!

@2025 Skillainest.Designed and Developed by Pro