Anomaly Detection: How to Find Outliers Using the Grubbs Test


In this tutorial, you will learn how to find outliers in data using a popular statistical technique called the Grubbs test.

In the world of data analytics, detecting anomalies is crucial for uncovering patterns that deviate from the norm. Whether it’s identifying fraudulent transactions, spotting manufacturing defects, or analyzing climate data, the ability to find outliers can significantly enhance decision-making processes. One of the robust statistical methods employed for this purpose is the Grubbs test. Named after Frank E. Grubbs, this test is specifically designed to detect a single outlier in a normally distributed dataset.

The Grubbs test works by measuring the maximum deviation of the data points from the mean, relative to the standard deviation. It then determines whether that extreme value is genuinely an outlier or just part of the natural variation in the data.

In this blog post, we will delve into the mechanics of the Grubbs test, its application in anomaly detection, and provide a practical guide on how to implement it using real-world data.

This lesson is the last of a 4-part series on Anomaly Detection:

  1. Credit Card Fraud Detection Using Spectral Clustering
  2. Predictive Maintenance Using Isolation Forest
  3. Build a Network Intrusion Detection System with Variational Autoencoders
  4. Anomaly Detection: How to Find Outliers Using the Grubbs Test (this tutorial)

To learn how to use the Grubbs test for outlier detection, just keep reading.


An outlier (Figure 1) is a data point that significantly differs from other observations in a dataset. It stands out due to its unusually high or low value compared to the rest of the data, which can indicate that it’s an anomaly, error, or something noteworthy. Outliers can arise for various reasons (e.g., measurement errors, data entry mistakes, or genuine but rare events). Regardless of their origin, identifying outliers is crucial because they can distort statistical analyses, leading to incorrect conclusions.

Figure 1: What is outlier detection (source: Medium).

Outliers are especially important in fields (e.g., healthcare and engineering), where detecting anomalies can have significant implications. For instance, in fraud detection, an outlier might represent an unusual transaction that warrants further investigation. In quality control, an outlier could indicate a defect in a manufacturing process. By understanding and identifying outliers, we can improve data quality, make better decisions, and gain deeper insights into the underlying patterns of the data. The Grubbs test is one of the statistical methods used to identify these outliers, ensuring that the integrity of the analysis is maintained.


The Grubbs test is a statistical method used to identify outliers in a normally distributed dataset. It is particularly useful when you suspect that a single data point is significantly different from the rest of the data. Before applying the Grubbs test, you should first verify that the data can be reasonably approximated by a normal distribution.
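As a quick sanity check, the following minimal sketch (assuming SciPy is installed; neither SciPy nor this exact check is used elsewhere in this tutorial) shows how a Shapiro-Wilk test could be used to verify that normality is a reasonable assumption before running the Grubbs test.

import numpy as np
from scipy import stats

# toy dataset similar to the ones used in the code examples later in this post
data = np.array([20, 21, 26, 24, 29, 22, 21, 30, 28, 27, 5])

# Shapiro-Wilk test: the null hypothesis here is that the data are normally distributed
stat, p_value = stats.shapiro(data)
print(f"Shapiro-Wilk statistic = {stat:.3f}, p-value = {p_value:.3f}")

# a p-value above the chosen significance level (e.g., 0.05) means we cannot
# reject normality, so applying the Grubbs test is reasonable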

The Grubbs test evaluates whether the most extreme value in the dataset deviates significantly from the mean of the rest of the data. This involves comparing the largest deviation from the mean, normalized by the standard deviation, against a critical value.

Performing the Grubbs test involves the following steps.


Formulating the Hypotheses

Formulating the hypotheses is a crucial step in statistical testing, including the Grubbs test for outlier detection. Hypotheses provide a framework for making decisions based on statistical evidence. In the context of the Grubbs test, we define two competing hypotheses (Figure 2): the null hypothesis H_0 and the alternative hypothesis H_1.

Figure 2: Null vs. Alternate Hypothesis (source: ThoughtCo).

Null Hypothesis

The null hypothesis (H_0) represents a statement of no effect or no difference. It is the hypothesis that there is no significant outlier in the dataset. Essentially, H_0 assumes that all data points belong to the same distribution and that none of them are extreme enough to be considered outliers.

Mathematically, the null hypothesis for the Grubbs test can be stated as:

H_0: \text{There are no outliers in the dataset.}


Alternative Hypothesis

The alternative hypothesis (H_1) represents a statement that contradicts the null hypothesis. It is the hypothesis that there is at least one significant outlier in the dataset. If the data provide sufficient evidence against H_0, then we reject H_0 in favor of H_1.

For the Grubbs test, the alternative hypothesis can be stated as:

H_1: \text{There is at least one outlier in the dataset.}


Calculate the Test Statistic

Calculating the test statistic helps us quantify how extreme the most distant data point is compared to the rest of the data. The Grubbs test statistic (G) is designed to measure the largest standardized deviation from the mean. It is given by the formula:

G = \displaystyle\frac{\max_i |x_i - \bar{x}|}{s}

where:

  • x_i are the individual data points,
  • \bar{x} is the sample mean, and
  • s is the sample standard deviation.
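To make the formula concrete, here is a minimal NumPy sketch (using the same toy data as the code examples later in this post) of how G could be computed by hand:

import numpy as np

data = np.array([20, 21, 26, 24, 29, 22, 21, 50, 28, 27, 5])

mean = data.mean()                       # sample mean (x-bar)
std = data.std(ddof=1)                   # sample standard deviation s (ddof=1)
G = np.max(np.abs(data - mean)) / std    # largest standardized deviation from the mean

print(G)   # roughly 2.38 for this data; the largest deviation comes from the value 50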

Determining the Critical Value with t-Distribution

The t-distribution (Figure 3), also known as Student’s t-distribution, is a probability distribution that is symmetric and bell-shaped like the normal distribution but has heavier tails. This distribution is particularly useful when dealing with small sample sizes or when the population standard deviation is unknown.


Key Characteristics of the t-Distribution

  • Degrees of Freedom: The shape of the t-distribution is determined by the degrees of freedom, which are typically the sample size minus one (n-1). As the degrees of freedom increase, the t-distribution approaches the normal distribution.
  • Heavier Tails: Compared to the normal distribution, the t-distribution has thicker tails, meaning it gives more weight to extreme values. This characteristic accounts for the increased variability expected in smaller samples.
  • Symmetry: Like the normal distribution, the t-distribution is symmetric around its mean, which is zero.
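As a quick illustration of the heavier tails (a small sketch assuming SciPy), the upper 2.5% critical value of the t-distribution shrinks toward the standard normal value of about 1.96 as the degrees of freedom grow:

from scipy import stats

# upper 2.5% critical values of the t-distribution for increasing degrees of freedom
for df in (2, 5, 10, 30, 100):
    print(df, round(stats.t.ppf(0.975, df), 3))

# approximate output: 2 -> 4.303, 5 -> 2.571, 10 -> 2.228, 30 -> 2.042, 100 -> 1.984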

In the context of the Grubbs test for outlier detection, the t-distribution is used to determine the critical value for the test statistic. The critical value is a threshold against which the calculated test statistic G is compared: it defines the boundary for deciding whether the most extreme data point is an outlier.

If the test statistic G exceeds this critical value (Figure 4), we reject the null hypothesis H_0 and conclude that there is an outlier in the dataset. Conversely, if G is less than or equal to the critical value, we do not reject H_0.

Figure 4: Critical values in left- (top), right- (middle), and two-tailed (bottom) distributions (source: Cuemath).

The critical value for the Grubbs test, G_{\text{critical}}, for a “two-tailed” distribution (Figure 4, bottom plot) is calculated using the following formula:

G_{\text{critical}} = \displaystyle\frac{(n-1)\sqrt{t^2_{\alpha/(2n),\, n-2}}}{\sqrt{n}\sqrt{n-2 + t^2_{\alpha/(2n),\, n-2}}}

where:

  • n is the number of data points in the dataset,
  • \alpha is the chosen significance level, and
  • t_{\alpha/(2n),\, n-2} is the upper critical value of the t-distribution with n-2 degrees of freedom at significance level \alpha/(2n).
For a “left-tailed” or “right-tailed” Grubbs test (Figure 4, top and middle plots), the critical value can be calculated using a similar formula:

G_{\text{critical}} = \displaystyle\frac{(n-1)\sqrt{t^2_{\alpha/n,\, n-2}}}{\sqrt{n}\sqrt{n-2 + t^2_{\alpha/n,\, n-2}}}

where instead of using t_{\alpha/(2n),\, n-2}, we use t_{\alpha/n,\, n-2} as the critical value from the t-distribution.

Note: We need to use statistical tables (Table 1) or software (e.g., Python or R) to find the critical value from the t-distribution for the chosen significance level \alpha and degrees of freedom (df).

Table 1: Critical value of t-distribution for different degrees of freedom and significance levels (source: Grade Calculator).
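As a sketch of the software route (assuming SciPy; the helper grubbs_critical_value below is our own illustration, not part of any package), the critical value from the formulas above can be computed and compared against the test statistic G as follows:

import numpy as np
from scipy import stats

def grubbs_critical_value(n, alpha=0.05, two_tailed=True):
    # the significance level is split as alpha/(2n) for the two-tailed test
    # and alpha/n for a one-tailed (left- or right-tailed) test
    q = alpha / (2 * n) if two_tailed else alpha / n
    t_crit = stats.t.ppf(1 - q, df=n - 2)   # upper critical value of the t-distribution
    return ((n - 1) / np.sqrt(n)) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))

data = np.array([20, 21, 26, 24, 29, 22, 21, 50, 28, 27, 5])
G = np.max(np.abs(data - data.mean())) / data.std(ddof=1)
G_crit = grubbs_critical_value(len(data), alpha=0.05)

# reject H_0 (declare an outlier) if G exceeds the critical value
print(f"G = {G:.3f}, G_critical = {G_crit:.3f}, outlier detected: {G > G_crit}")

Passing two_tailed=False gives the one-sided critical value with \alpha/n in place of \alpha/(2n), matching the left- and right-tailed formula above.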

In this section, we will see how to perform the Grubbs test in Python on small sample datasets. We will use the outliers package, which provides predefined implementations of the Grubbs test for left-, right-, and two-tailed cases. The package can be installed via pip install outlier_utils.


Left-Tailed Grubbs Test

The left-tailed Grubbs test is a variation of the traditional Grubbs test, specifically designed to detect a single outlier that is significantly lower than the other values in a dataset (Figure 4, top plot). While the standard Grubbs test can identify both high and low outliers, the left-tailed version focuses exclusively on identifying values that are unusually small compared to the rest of the data.

The following code snippet uses smirnov_grubbs.min_test() from the outliers package to detect and remove an outlier on the left side of the distribution.

import numpy as np
from outliers import smirnov_grubbs as grubbs
 
# define sample data
data1 = np.array([20, 21, 26, 24, 29,
                 22, 21, 50, 28, 27, 5])


# perform min Grubbs' test
data1_without_outlier = grubbs.min_test(data1, alpha=.05)
print(f"After removing left outlier(s) in {data1} is : ", data1_without_outlier)


# define sample data
data2 = np.array([20, 21, 26, 24, 29,
                 22, 21, 30, 28, 27, 5])


# perform min Grubbs' test
data2_without_outlier = grubbs.min_test(data2, alpha=.05)
print(f"After removing left outlier(s) in {data2} is : ", data2_without_outlier)
Output:

After removing left outlier(s) in [20 21 26 24 29 22 21 50 28 27  5] is :  [20 21 26 24 29 22 21 50 28 27  5]
After removing left outlier(s) in [20 21 26 24 29 22 21 30 28 27  5] is :  [20 21 26 24 29 22 21 30 28 27]

This Python code demonstrates how to detect and remove left-sided outliers from datasets using the Grubbs test provided by the smirnov_grubbs module. On Lines 1 and 2, we import the necessary libraries, numpy for numerical operations and smirnov_grubbs for the Grubbs test. Lines 5 and 6 define the first sample dataset, data1, which includes a potential outlier. On Lines 10 and 11, we perform the left-tailed Grubbs test on data1 to detect and remove any low outliers, printing the dataset after the outlier removal.

Similarly, Lines 15 and 16 define the second sample dataset, data2. On Lines 20 and 21, we again apply the left-tailed Grubbs test to data2 to identify and remove any left outliers and then print the resulting dataset. This process helps us clean the data by removing values that are significantly lower than the others, ensuring more accurate analysis.

The output of the code shows the result of applying the left-tailed Grubbs test to each dataset to detect and remove low outliers.

  1. For the first dataset, [20, 21, 26, 24, 29, 22, 21, 50, 28, 27, 5], the test does not flag the value 5 as a significant low outlier (the high value 50 inflates the sample standard deviation, so 5 does not stand out), and the dataset remains unchanged.
  2. In contrast, for the second dataset, [20, 21, 26, 24, 29, 22, 21, 30, 28, 27, 5], the test detects the value 5 as a low outlier and removes it, resulting in the updated dataset [20, 21, 26, 24, 29, 22, 21, 30, 28, 27].

Thus, this process helps clean the data by eliminating extreme values that could skew the analysis.


Right-Tailed Grubbs Test

Similar to the left-tailed Grubbs test, the right-tailed Grubbs test is used to detect a single outlier that is significantly larger than the other values in a dataset (Figure 4, middle plot).

The following code snippet uses smirnov_grubbs.max_test() from the outliers package to detect and remove an outlier on the right side of the distribution.

import numpy as np
from outliers import smirnov_grubbs as grubbs
 
# define sample data
data1 = np.array([20, 21, 26, 24, 29,
                 22, 21, 50, 28, 27, 5])


# perform max Grubbs' test
data1_without_outlier = grubbs.max_test(data1, alpha=.05)
print(f"After removing right outlier(s) in {data1} is : ", data1_without_outlier)


# define sample data
data2 = np.array([20, 21, 26, 24, 29,
                 22, 21, 30, 28, 27, 5])


# perform max Grubbs' test
data2_without_outlier = grubbs.max_test(data2, alpha=.05)
print(f"After removing right outlier(s) in {data2} is : ", data2_without_outlier)

Output:

After removing right outlier(s) in [20 21 26 24 29 22 21 50 28 27  5] is :  [20 21 26 24 29 22 21 28 27  5]
After removing right outlier(s) in [20 21 26 24 29 22 21 30 28 27  5] is :  [20 21 26 24 29 22 21 30 28 27  5]

The output of the code shows the result of applying the right-tailed Grubbs test to each dataset to detect and remove high outliers.

  1. For the first dataset, [20, 21, 26, 24, 29, 22, 21, 50, 28, 27, 5], the test detects the value 50 as a high outlier and removes it, resulting in the updated dataset [20, 21, 26, 24, 29, 22, 21, 28, 27, 5].
  2. In contrast, for the second dataset, [20, 21, 26, 24, 29, 22, 21, 30, 28, 27, 5], the test does not identify any significant high outlier, so the dataset remains unchanged.

Two-Tailed Grubbs Test

The two-tailed Grubbs test (Figure 4, bottom plot) is designed to identify both high and low outliers (as discussed in the theory section). It extends the original Grubbs test, which finds a single outlier that significantly deviates from the mean of the dataset, by checking both ends of the distribution simultaneously.

The following code snippet uses smirnov_grubbs.test() from the outliers package to detect and remove outliers on either side of the distribution.

import numpy as np
from outliers import smirnov_grubbs as grubbs
 
# define sample data
data1 = np.array([20, 21, 26, 24, 29,
                 22, 21, 55, 28, 27, -1])

# perform two-sided Grubbs' test
data1_without_outlier = grubbs.test(data1, alpha=.05)
print(f"After removing outliers in {data1} is : ", data1_without_outlier)

# define sample data
data2 = np.array([20, 21, 26, 24, 29,
                 22, 21, 30, 28, 27, 5])

# perform two-sided Grubbs' test
data2_without_outlier = grubbs.test(data2, alpha=.05)
print(f"After removing outliers in {data2} is : ", data2_without_outlier)
Output:

After removing outliers in [20 21 26 24 29 22 21 55 28 27 -1] is :  [20 21 26 24 29 22 21 28 27]
After removing outliers in [20 21 26 24 29 22 21 30 28 27  5] is :  [20 21 26 24 29 22 21 30 28 27]

The output of the code shows the result of applying the two-tailed Grubbs test to each dataset to detect and remove outliers at both ends of the distribution.

  1. For the first dataset, [20, 21, 26, 24, 29, 22, 21, 55, 28, 27, -1], the test detects the values 55 and -1 as two outliers and removes them, resulting in the updated dataset [20, 21, 26, 24, 29, 22, 21, 28, 27].
  2. In contrast, for the second dataset, [20, 21, 26, 24, 29, 22, 21, 30, 28, 27, 5], the test detects the value 5 as the only outlier and removes it, resulting in the updated dataset [20, 21, 26, 24, 29, 22, 21, 30, 28, 27].

In this blog post, we delve into the essential task of identifying anomalies within datasets, a critical step for improving data integrity and analysis accuracy. We start by defining what an outlier is and explaining its importance in various fields (e.g., finance, healthcare, and quality control). Understanding outliers helps enhance decision-making processes by identifying irregularities that could indicate significant events or errors.

The core of the blog post focuses on the Grubbs test, a powerful statistical method for detecting outliers in normally distributed data. We provide detailed steps on how to formulate hypotheses, calculate the test statistic, and determine the critical value using the t-distribution. We explain the significance of both the null and alternative hypotheses, and cover different variations of the Grubbs test, including left-tailed, right-tailed, and two-tailed tests, each tailored to detect specific types of outliers.

By the end of the post, readers will have a comprehensive understanding of how to apply the Grubbs test to their datasets, effectively identifying and addressing outliers to maintain data accuracy.


Citation Information

Mangla, P. “Anomaly Detection: How to Find Outliers Using the Grubbs Test,” PyImageSearch, P. Chugh, S. Huot, and P. Thakur, eds., 2025, https://pyimg.co/a5mlu

@incollection{Mangla_2025_anomaly-detection-find-outliers-using-grubbs-test,
  author = {Puneet Mangla},
  title = {{Anomaly Detection: How to Find Outliers Using the Grubbs Test}},
  booktitle = {PyImageSearch},
  editor = {Puneet Chugh and Susan Huot and Piyush Thakur},
  year = {2025},
  url = {https://pyimg.co/a5mlu},
}
