We used 5 outlier detection methods on the same dataset: they disagreed on 96% of the flagged samples.

by SkillAiNest

[Cover image by author: outlier detection methods]

# Introduction

Data science tutorials make finding outliers look easy: remove every value more than three standard deviations from the mean, and you’re done. But once you start working with a real dataset where the distributions are skewed and a stakeholder asks, “Why did you remove that data point?”, you suddenly realize you don’t have a good answer.

So we ran an experiment. We tested five of the most commonly used outlier detection methods on a real dataset (6,497 Portuguese wines) to answer one question: do these methods give consistent results?

They didn’t. What we learned from the disagreement turned out to be more valuable than anything we could have learned from a textbook.


We’ve built this analysis as an interactive notebook, a Data Project on StrataScratch, in a format you can reuse for your own experiments. You can view and run the full code there.

# Setting up

Our data comes from the Wine Quality dataset, publicly available through the UCI Machine Learning Repository. It includes physicochemical measurements of 6,497 Portuguese “Vinho Verde” wines (1,599 red, 4,898 white), along with quality ratings from expert tasters.

We chose it for several reasons. It is production data, not something artificially generated. The distributions are skewed (6 of the 11 features have skewness \(> 1\)), so the data violate textbook assumptions. And the quality ratings let us check whether flagged “outliers” are more common among wines with exceptional ratings.

Below are five methods we tested:

  • Robust Z-Score
  • IQR (interquartile range)
  • Isolation Forest
  • Local Outlier Factor (LOF)
  • Elliptic Envelope

# Discovering the first surprise: Inflated results from multiple testing

Before we could compare methods, we hit a wall. With 11 features, the simple approach (flag a sample if at least one feature has an extreme value) produced highly inflated results.

IQR flagged about 23% of the wines as outliers. The Z-Score flagged around 26%.

When roughly 1 in 4 wines is flagged as an outlier, something is off. Real datasets do not contain 25% outliers. The problem was that we were testing 11 features independently, and false positives accumulate across features.

The math is straightforward. If each feature has a 5% probability of showing a “random” extreme value, then with 11 independent features:
\( P(\text{at least one extreme}) = 1 - (0.95)^{11} \approx 43\% \)

Simply put: even if every feature were perfectly normal, you’d expect about 43% of your samples to have at least one extreme value, purely by chance.
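That number is easy to sanity-check in a couple of lines:

```python
# Probability that at least one of 11 independent features shows an
# extreme value, if each feature has a 5% chance on its own
p_single = 0.05
n_features = 11

p_any = 1 - (1 - p_single) ** n_features
print(round(p_any, 3))  # about 0.431
```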

To fix this, we changed the requirement: flag a sample only if at least 2 features are simultaneously extreme.

Changing min_features from 1 to 2 changed the definition from “any feature of the sample is extreme” to “the sample is extreme in more than one feature”.

Here is the corrected code:

# Count extreme features per sample
# (z_scores is the n_samples x n_features matrix of robust z-scores)
outlier_counts = (np.abs(z_scores) > 3.5).sum(axis=1)
outliers = outlier_counts >= 2

# Comparing 5 methods on 1 dataset

Once the multiple-testing issue was fixed, we counted how many samples each method flagged:

[Table by author: number of samples flagged by each method]
Here is how we set up the ML methods:

from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# contamination fixes the fraction of samples each method will flag
iforest = IsolationForest(contamination=0.05, random_state=42)
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)

Why do the ML methods show exactly 5%? Because of the contamination parameter. It forces them to flag exactly five percent of the samples: it is a quota, not a threshold. In other words, Isolation Forest will flag 5% regardless of whether your data contains 1% true outliers or 20%.
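A quick sketch (on synthetic data, not the wine dataset) shows the quota in action: even with no injected anomalies at all, roughly 5% of samples get flagged.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Clean Gaussian data with no unusual samples injected
X = rng.normal(size=(1000, 5))

iforest = IsolationForest(contamination=0.05, random_state=42)
labels = iforest.fit_predict(X)  # -1 marks a flagged sample

flagged_fraction = (labels == -1).mean()
# close to 0.05, even though nothing unusual was planted in the data
```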

# Discovering real differences: They identify different things.

Here’s what surprised us the most. When we measured how much the methods agreed, the Jaccard similarity between method pairs ranged from 0.10 to 0.30. That is remarkably low agreement.

Of the 6,497 wines:

  • Only 32 samples (0.5%) were flagged by all 4 primary methods.
  • 143 samples (2.2%) were flagged by 3+ methods.
  • The remaining “outliers” were flagged by only 1 or 2 methods.
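Jaccard similarity here is just the overlap between two sets of flags: the size of the intersection divided by the size of the union. A minimal sketch:

```python
import numpy as np

def jaccard(flags_a, flags_b):
    """Jaccard similarity between two boolean outlier masks."""
    a = np.asarray(flags_a, dtype=bool)
    b = np.asarray(flags_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # neither method flagged anything
    return np.logical_and(a, b).sum() / union

# Two methods that each flag 2 of 4 samples but share only 1 flag
print(jaccard([1, 1, 0, 0], [1, 0, 1, 0]))  # 1/3, about 0.333
```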

You might think that’s a bug, but it’s actually the point. Each method has its own definition of “abnormal”:

[Table by author: what each method treats as “abnormal”]
If a wine has much higher residual sugar than average, it is a univariate outlier (Z-Score/IQR will catch it). But if it sits among other wines with similarly high sugar levels, LOF won’t flag it: it is normal in its local context.

So the real question is not “Which method is best?” It’s “What kind of unusual am I looking for?”
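The distinction can be sketched with synthetic data (hypothetical numbers, not the wine dataset): a tight group of sweet wines is globally extreme, yet locally ordinary, so a robust z-score flags it while LOF does not.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Hypothetical 1-D "residual sugar" data: a big cluster of dry wines
# around 2 g/L and a tight cluster of sweet wines around 15 g/L
dry = rng.normal(2.0, 0.5, size=(200, 1))
sweet = rng.normal(15.0, 0.3, size=(30, 1))
X = np.vstack([dry, sweet])

# Global view: robust z-score marks every sweet wine as extreme
med = np.median(X)
mad = np.median(np.abs(X - med))
robust_z = 0.6745 * (X - med) / mad
globally_extreme = bool((np.abs(robust_z[200:]) > 3.5).all())

# Local view: LOF compares each point's density to its neighbors'
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)
lof_scores = -lof.negative_outlier_factor_  # ~1.0 means locally normal
# lof_scores[200:] stay near 1: the sweet wines blend into their cluster
```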

# Testing for plausibility: do outliers relate to wine quality?

The dataset includes expert quality ratings (3-9). We wanted to know: do outliers appear more frequently among highly rated wines?

[Chart by author: outlier rates by quality rating]
Highly rated wines were twice as likely to be consensus outliers. This is a good sanity check. In some cases the relationship is clear: wines with very high volatile acidity taste vinegary, get poor ratings, and are flagged as outliers; the chemistry drives both outcomes. But we cannot assume this explains every case. There may be patterns we are not seeing, or confounding factors we have not accounted for.

# Making three decisions that shape our results.


// 1. Using a robust Z-Score instead of a standard Z-Score

A standard Z-Score uses the mean and standard deviation of the data, both of which are distorted by the outliers in our dataset. A robust Z-Score instead uses the median and median absolute deviation (MAD), both of which are far more resistant to outliers.

As a result, the Standard Z-Score identified 0.8% of the data as outliers, while the Robust Z-Score identified 3.5%.

# Robust Z-Score using median and MAD
# (0.6745 scales MAD to match the standard deviation for normal data)
median = np.median(data, axis=0)
mad = np.median(np.abs(data - median), axis=0)
robust_z = 0.6745 * (data - median) / mad
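A toy example (made-up numbers) of why this matters: a single huge value inflates the mean and standard deviation, shrinking the standard z-score of the very point that caused the problem, while the median and MAD stay put.

```python
import numpy as np

data = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 1000.0])

# Standard z-score: the outlier inflates both the mean and the std
standard_z = (data - data.mean()) / data.std()

# Robust z-score: median and MAD are barely moved by the outlier
median = np.median(data)
mad = np.median(np.abs(data - median))
robust_z = 0.6745 * (data - median) / mad

# The extreme point scores only ~2.2 on the standard scale (below a
# 3.5 threshold) but lands in the hundreds on the robust scale
```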

// 2. Scaling red and white wines separately

Red and white wines have different chemical baselines. White wines, for example, average higher sulfur dioxide levels, so in a combined dataset a red wine with perfectly average chemistry for a red could be flagged as an outlier purely because its sulfur level sits far from the combined mean. So we scaled each wine type separately, using its own median and interquartile range (IQR), and then combined the two groups.

# Scale each wine type separately, then recombine
import numpy as np
from sklearn.preprocessing import RobustScaler

scaled_parts = []
for wine_type in ('red', 'white'):
    subset = df[df['type'] == wine_type][features]
    scaled_parts.append(RobustScaler().fit_transform(subset))
scaled = np.vstack(scaled_parts)

// 3. Knowing when to exclude a method

The Elliptic Envelope assumes your data follows a multivariate normal distribution. Ours doesn’t. Six of the eleven features had skewness above 1, and one reached 5.4. We kept the Elliptic Envelope in the comparison, but excluded it from the consensus vote.

# Determining which method performs best for this wine dataset.


Can we pick a “winner” given the characteristics of our data (heavy skew, mixed population, no known ground truth)?

Robust Z-Score, IQR, Isolation Forest, and LOF all handle skewed data reasonably well. If forced to choose, we’d go with Isolation Forest: it makes no distributional assumptions, considers all features at once, and handles mixed populations well.

But no one method does it all:

  • Isolation Forest can miss samples that are extreme on only one feature (Z-Score/IQR catch those).
  • Z-Score/IQR can miss samples that are only abnormal in combination across several features (multivariate outliers).

The better approach: use multiple methods and rely on consensus. The 143 wines flagged by 3 or more methods are more trustworthy than anything flagged by a single method alone.

Here’s how we calculated consensus:

# Count how many methods flagged each sample
# (each *_out is a boolean mask; True counts as 1 when summed)
consensus = zscore_out + iqr_out + iforest_out + lof_out
high_confidence = df[consensus >= 3]  # flagged by 3+ methods

Without ground truth (as in most real-world projects), agreement between methods is the closest thing you have to a measure of confidence.

# Understanding what all this means for your own projects.

Define your problem before choosing your method. What kind of “unusual” are you actually looking for? Data entry errors look different from measurement anomalies, and both look different from true rare cases. The type of problem suggests different approaches.

Check your assumptions. If your data is highly skewed, the standard Z-Score and Elliptic Envelope will steer you in the wrong direction. Check your distribution before committing to a method.

Use multiple methods. Samples flagged by three or more methods with different definitions of “outlier” are more reliable than samples flagged by only one.

Do not assume that all outliers must be removed. An outlier may be an error. It may also be your most interesting data point. Domain knowledge makes that call, not algorithms.

# Concluding Remarks

The point here is not that outlier detection is broken. It’s that “outlier” means different things depending on who’s asking. Z-Score and IQR catch values that are extreme on a single dimension. Isolation Forest and LOF find samples that stand out from the overall pattern. The Elliptic Envelope works well when your data is actually Gaussian (ours wasn’t).

Before choosing a method, figure out what you are really looking for. And if you are not sure? Run multiple methods and go with the consensus.

# Frequently Asked Questions

// 1. Determining which technique to start with

A good place to start is Isolation Forest. It doesn’t assume how your data is distributed and uses all your features at once. However, if you want to identify extreme values for a particular measurement (such as very high blood pressure readings), the Z-Score or IQR may be more appropriate.

// 2. Choosing a contamination rate for scikit-learn methods

It depends on the problem you are trying to solve. A commonly used value is 5% (or 0.05). But remember that contamination is a quota: exactly 5% of your samples will be flagged as outliers, regardless of whether your data actually contains 1% or 20% true outliers. Choose a contamination rate based on your knowledge of the likely proportion of outliers in your data.

// 3. Removing outliers before splitting the train/test data

No. Fit the outlier detection model on your training data only, then apply the fitted model to your test data. Doing otherwise lets the test data influence your preprocessing, which introduces leakage.
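A minimal sketch of the leakage-free pattern (synthetic data here; in practice X would be your feature matrix):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 4))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=7)

# Fit the detector on the training data only...
detector = IsolationForest(contamination=0.05, random_state=7)
detector.fit(X_train)

# ...then apply the already-fitted detector to the test set
train_labels = detector.predict(X_train)  # -1 = outlier, 1 = inlier
test_labels = detector.predict(X_test)
```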

// 4. Handling of categorical features

The techniques covered here work on numerical data. For categorical features, there are three common alternatives:

  • Encode your categorical variables and proceed;
  • use techniques designed for mixed-type data (e.g., HBOS);
  • run outlier detection on the numeric columns and use frequency-based methods for the categorical ones.
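The third option can be as simple as flagging values from rare categories; a sketch with made-up labels and a 2% cutoff (both of which are assumptions you would tune):

```python
from collections import Counter

# Hypothetical categorical column
values = ['red'] * 50 + ['white'] * 45 + ['rose'] * 4 + ['orange'] * 1

counts = Counter(values)
n = len(values)

# Flag any row whose category covers less than 2% of the data
flags = [counts[v] / n < 0.02 for v in values]
# only the single 'orange' row falls below the cutoff
```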

// 5. Knowing whether a flagged outlier is an error or just an anomaly

The algorithm alone cannot tell you whether a flagged outlier is an error or a genuine anomaly. It flags what is unusual, not what is wrong. For example, a wine with very high residual sugar may reflect a data entry error, or it may be a dessert wine that is meant to be that sweet. Only your domain expertise can answer that. If you’re not sure, mark it for review instead of removing it automatically.

Nate Rosidi is a data scientist and product strategist. He is also an adjunct professor teaching analytics, and the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers all things SQL.
