An article exploring techniques for outlier detection in datasets. Learn how to use data visualization, z-scores, and clustering techniques to identify outliers in your dataset.
Nassim Taleb writes about how “tail” events define a large part of the success (or failure) of a phenomenon in the world.
Everybody knows that you need more prevention than treatment, but few reward acts of prevention.
N. Taleb — The Black Swan
A tail event is a rare event, one whose probability lies in the tail of the distribution, on the left or the right.
According to Taleb, we live our lives focusing mainly on the most plausible events, those that are most likely to occur. By doing this, we fail to prepare ourselves to cope with the rare events that may occur.
When rare events do occur (especially negative ones), they take us by surprise, and the actions we typically take have no effect.
Just consider our behavior when a rare event occurs, such as the bankruptcy of the FTX cryptocurrency exchange or a powerful earthquake that devastates a region. For those directly involved, the typical response is panic.
Anomalies are present everywhere, and when we draw a distribution and its probability function we are actually obtaining useful information to protect ourselves, or to implement strategies for these tail events should they occur.
It is therefore essential to learn how to identify these anomalies and, above all, to be ready to act when they are observed.
In this article, we will focus on the methods and techniques used to identify outliers (the anomalies just mentioned) in data. Specifically, we will explore data visualization techniques and the use of descriptive statistics and statistical testing.
An outlier is a value that deviates significantly from the other values in the dataset. This deviation can be numerical or even categorical.
For instance, a numeric outlier is a value that is much larger or much smaller than most other values in the dataset.
A categorical outlier, on the other hand, occurs when labels such as “other” or “unknown” appear in a much higher proportion than the other labels in the dataset.
Outliers can be caused by measurement errors, input errors, transcription errors, or simply by data that does not follow the normal trend of the dataset.
In some cases, outliers can point to broader problems in the dataset or in the process that produced the data, and can offer important insights to the people who designed the data collection process.
There are several techniques we can use to identify outliers in our data. These are the ones we will cover in this article:
- data visualization: lets you spot anomalies in the distribution of the data through charts designed for this purpose
- use of descriptive statistics, such as the interquartile range
- use of z-scores
- use of clustering techniques: lets you identify groups of similar data and spot any “isolated” or “unclassifiable” points
Each of these methods is valid for identifying outliers and should be chosen based on our data. Let's look at them one by one.
Data visualization
One of the most common techniques for finding anomalies is exploratory data analysis, and in particular data visualization.
In Python, you can use libraries like Matplotlib or Seaborn to visualize the data in a way that makes anomalies easy to spot.
For instance, you can create a histogram or a boxplot to visualize the distribution of your data and spot any values that deviate significantly from the mean.
The anatomy of the boxplot can be understood from this Kaggle post:
https://www.kaggle.com/discussions/general/219871
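As a minimal sketch of the idea (the data array here is hypothetical: normally distributed values with a few extremes appended), a histogram and a boxplot can be drawn like this:
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: normally distributed values plus a few extreme values
data = np.concatenate([np.random.normal(0, 1, 500), [6, 7, -5]])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
# Histogram: outliers show up as isolated bars far from the bulk of the data
ax1.hist(data, bins=50)
ax1.set_title("Histogram")
# Boxplot: outliers show up as individual points beyond the whiskers
ax2.boxplot(data)
ax2.set_title("Boxplot")
plt.show()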
Use of descriptive statistics
Another approach to identifying anomalies is the use of descriptive statistics. For instance, the interquartile range (IQR) can be used to identify values that deviate significantly from the bulk of the data.
The interquartile range (IQR) is defined as the difference between the third quartile (Q3) and the first quartile (Q1) of the dataset. Outliers are defined as values that fall more than a coefficient (typically 1.5) times the IQR below Q1 or above Q3.
The boxplot discussed earlier is just one method that uses these descriptive metrics to identify anomalies.
An example in Python of identifying outliers using the interquartile range is as follows:
import numpy as np

def find_outliers_IQR(data, threshold=1.5):
    # Find the first and third quartiles
    Q1, Q3 = np.percentile(data, [25, 75])
    # Compute the IQR (interquartile range)
    IQR = Q3 - Q1
    # Compute the lower and upper bounds
    lower_bound = Q1 - (threshold * IQR)
    upper_bound = Q3 + (threshold * IQR)
    # Select outliers
    outliers = [x for x in data if x < lower_bound or x > upper_bound]
    return outliers
This method calculates the first and third quartiles of the dataset, then computes the IQR and the lower and upper bounds. Finally, it identifies outliers as the values that fall outside those bounds.
This handy function can be used to identify outliers in a dataset and can be added to your toolkit of utility functions in almost any project.
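For instance, a quick hypothetical call might look like this:
# Hypothetical sample data: mostly similar values plus one extreme value
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
print(find_outliers_IQR(data))  # prints [102]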
Use of z-scores
Another way to spot anomalies is through z-scores. Z-scores measure how far a value deviates from the mean in terms of standard deviations.
The formula for converting data to z-scores is as follows:
z = (x − μ) / σ
where x is the original value, μ is the dataset mean, and σ is the dataset standard deviation. The z-score indicates how many standard deviations the original value is from the mean. A z-score greater than 3 (or lower than -3) is generally considered an outlier.
This method is particularly useful when working with large datasets and when you need to identify anomalies in an objective and reproducible way.
In Python, the conversion to z-scores can be done with scikit-learn like this:
import numpy as np
from sklearn.preprocessing import StandardScaler

def find_outliers_zscore(data, threshold=3):
    # Standardize the data to z-scores (mean 0, standard deviation 1)
    scaler = StandardScaler()
    standardized = scaler.fit_transform(np.asarray(data).reshape(-1, 1)).flatten()
    # Select outliers: values whose z-score exceeds the threshold in absolute value
    outliers = [data[i] for i, z in enumerate(standardized) if z < -threshold or z > threshold]
    return outliers
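As a quick, hypothetical check of the function above:
import numpy as np

# Hypothetical sample: 100 standard-normal values plus one extreme value
data = np.concatenate([np.random.normal(0, 1, 100), [8.0]])
print(find_outliers_zscore(data))  # usually flags only the extreme value, 8.0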
Use of clustering techniques
Finally, clustering techniques can be used to identify “isolated” or “unclassifiable” data points. This can be useful when working with very large and complex datasets, where data visualization alone is not enough to spot anomalies.
In this case, one option is the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm, a clustering algorithm that can identify groups of data based on their density and locate any points that do not belong to any cluster. These points are regarded as outliers.
The DBSCAN algorithm is again available in Python's scikit-learn library.
Take this visualized dataset as an example.
Applying DBSCAN produces this visualization.
The code to create these charts is as follows:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

def generate_data_with_outliers(n_samples=100, noise=0.05, outlier_fraction=0.05, random_state=42):
    # Create random data drawn from two Gaussian clusters
    X = np.concatenate([np.random.normal(0.5, 0.1, size=(n_samples // 2, 2)),
                        np.random.normal(1.5, 0.1, size=(n_samples // 2, 2))], axis=0)
    # Add outliers scattered uniformly over a wider range
    n_outliers = int(outlier_fraction * n_samples)
    outliers = np.random.RandomState(seed=random_state).rand(n_outliers, 2) * 3 - 1.5
    X = np.concatenate((X, outliers), axis=0)
    # Add noise to the data to resemble real-world data
    X = X + np.random.randn(n_samples + n_outliers, 2) * noise
    return X

# Generate data
X = generate_data_with_outliers(outlier_fraction=0.2)

# Apply DBSCAN to cluster the data and find outliers
dbscan = DBSCAN(eps=0.2, min_samples=5)
dbscan.fit(X)

# Select outliers: DBSCAN labels noise points as -1
outlier_indices = np.where(dbscan.labels_ == -1)[0]

# Visualize clusters and outliers
plt.scatter(X[:, 0], X[:, 1], c=dbscan.labels_, cmap="viridis")
plt.scatter(X[outlier_indices, 0], X[outlier_indices, 1], c="red", label="Outliers", marker="x")
plt.xticks([])
plt.yticks([])
plt.legend()
plt.show()
This method creates a DBSCAN object with the parameters eps (the neighborhood radius) and min_samples (the minimum number of points needed to form a dense region) and fits it to the data. It then identifies outliers as the points that do not belong to any cluster, i.e. those labeled -1.
This is just one of many clustering techniques that can be used to identify anomalies. For instance, one deep-learning-based approach relies on autoencoders: special neural networks that exploit a compressed representation of the data to identify distinctive features in the input data.
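The autoencoder approach is beyond the scope of this article, but a rough, hypothetical sketch of the idea is to train a small bottleneck network to reconstruct its own input and flag points with unusually high reconstruction error as anomalies. Here scikit-learn's MLPRegressor stands in for a real autoencoder (an assumption made for brevity; a deep learning framework would normally be used):
import numpy as np
from sklearn.neural_network import MLPRegressor

def find_outliers_autoencoder(X, error_quantile=0.95):
    # A small "bottleneck" network trained to reproduce its own input:
    # the narrow 2-unit middle layer forces a compressed representation
    model = MLPRegressor(hidden_layer_sizes=(8, 2, 8), max_iter=2000, random_state=42)
    model.fit(X, X)
    # Per-point reconstruction error: poorly reconstructed points are suspect
    errors = np.mean((model.predict(X) - X) ** 2, axis=1)
    # Flag points whose error exceeds the chosen quantile
    return np.where(errors > np.quantile(errors, error_quantile))[0]

# Reusing the data generator from the DBSCAN example above
X = generate_data_with_outliers(outlier_fraction=0.2)
print(find_outliers_autoencoder(X))  # indices of the suspected outliers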
In this article we have seen several techniques that can be used to identify outliers in data.
We talked about data visualization, the use of descriptive statistics and z-scores, and clustering techniques.
Each of these techniques is valid and should be chosen based on the type of data you are analyzing. The important thing to remember is that identifying outliers can provide valuable information for improving data collection processes and for making better decisions based on the results obtained.