## P-values below a certain threshold are sometimes used as a technique to pick relevant features. Advice below suggests methods to use them accurately.

Multiple hypothesis testing occurs after we repeatedly test models on various features, because the probability of obtaining a number of false discoveries increases with the variety of tests. For instance, in the sphere of genomics, scientists often need to test whether any of the hundreds of genes have a significantly different activity in an consequence of interest. Or whether jellybeans cause pimples.

On this blog post, we are going to cover few of the favored methods used to account for multiple hypothesis testing by adjusting model p-values:

- False Positive Rate (FPR)
- Family-Clever Error Rate (FWER)
- False Discovery Rate (FDR)

and explain when it is smart to make use of them.

This document might be summarized in the next image:

We’ll create a simulated example to raised understand how various manipulation of p-values can result in different conclusions. To run this code, we want Python with `pandas`

, `numpy`

, `scipy`

and `statsmodels`

libraries installed.

For the aim of this instance, we start by making a Pandas DataFrame of 1000 features. 990 of which (99%) may have their values generated from a Normal distribution with mean = 0, called a Null model. (In a function `norm.rvs()`

used below, mean is ready using a `loc`

argument.) The remaining 1% of the features will probably be generated from a Normal distribution mean = 3, called a Non-Null model. We’ll use these as representing interesting features that we would really like to find.

`import pandas as pd`

import numpy as np

from scipy.stats import norm

from statsmodels.stats.multitest import multipletestsnp.random.seed(42)

n_null = 9900

n_nonnull = 100

df = pd.DataFrame({

'hypothesis': np.concatenate((

['null'] * n_null,

['non-null'] * n_nonnull,

)),

'feature': range(n_null + n_nonnull),

'x': np.concatenate((

norm.rvs(loc=0, scale=1, size=n_null),

norm.rvs(loc=3, scale=1, size=n_nonnull),

))

})

For every of the 1000 features, p-value is a probability of observing the worth a minimum of as large, if we assume it was generated from a Null distribution.

P-values might be calculated from a cumulative distribution ( `norm.cdf()`

from `scipy.stats`

) which represents the probability of obtaining a worth equal to or **lower than** the one observed. Then to calculate the p-value we calculate `1 - norm.cdf()`

to search out the probability **greater than** the one observed:

`df['p_value'] = 1 - norm.cdf(df['x'], loc = 0, scale = 1)`

df

The primary concept is named a False Positive Rate and is defined as a fraction of null hypotheses that we flag as “significant” (also called Type I errors). The p-values we calculated earlier might be interpreted as a false positive rate by their very definition: they’re probabilities of obtaining a worth a minimum of as large as a specified value, after we sample a Null distribution.

For illustrative purposes, we are going to apply a typical (magical 🧙) p-value threshold of 0.05, but any threshold might be used:

`df['is_raw_p_value_significant'] = df['p_value'] <= 0.05`

df.groupby(['hypothesis', 'is_raw_p_value_significant']).size()

`hypothesis is_raw_p_value_significant`

non-null False 8

True 92

null False 9407

True 493

dtype: int64

notice that out of our 9900 null hypotheses, 493 are flagged as “significant”. Due to this fact, a False Positive Rate is: FPR = 493 / (493 + 9940) = 0.053.

The important problem with FPR is that in an actual scenario we don’t a priori know which hypotheses are null and which are usually not. Then, the raw p-value by itself (False Positive Rate) is of limited use. In our case when the fraction of non-null features could be very small, a lot of the features flagged as significant will probably be null, because there are numerous more of them. Specifically, out of 92 + 493 = 585 features flagged true (“positive”), only 92 are from our non-null distribution. That implies that a majority or about 84% of reported significant features (493 / 585) are false positives!

So, what can we do about this? There are two common methods of addressing this issue: as an alternative of False Positive Rate, we are able to calculate Family-Clever Error Rate (FWER) or a False Discovery Rate (FDR). Each of those methods takes the set of raw, unadjusted, p-values as an input, and produces a brand new set of “adjusted p-values” as an output. These “adjusted p-values” represent estimates of *upper bounds* on FWER and FDR. They might be obtained from `multipletests()`

function, which is a component of the `statsmodels`

Python library:

`def adjust_pvalues(p_values, method):`

return multipletests(p_values, method = method)[1]

Family-Clever Error Rate is a probability of falsely rejecting a number of null hypotheses, or in other words flagging true Null as Non-null, or a probability of seeing a number of false positives.

When there is just one hypothesis being tested, this is the same as the raw p-value (false positive rate). Nevertheless, the more hypotheses are tested, the more likely we’re going to get a number of false positives. There are two popular ways to estimate FWER: Bonferroni and Holm procedures. Although neither Bonferroni nor Holm procedures make any assumptions concerning the dependence of tests run on individual features, they will probably be overly conservative. For instance, in the intense case when the entire features are similar (same model repeated 10,000 times), no correction is required. While in the opposite extreme, where no features are correlated, some kind of correction is required.

## Bonferroni procedure

One of the crucial popular methods for correcting for multiple hypothesis testing is a Bonferroni procedure. The rationale this method is popular is because it is extremely easy to calculate, even by hand. This procedure multiplies each p-value by the full variety of tests performed or sets it to 1 if this multiplication would push it past 1.

`df['p_value_bonf'] = adjust_pvalues(df['p_value'], 'bonferroni')`

df.sort_values('p_value_bonf')

## Holm procedure

Holm’s procedure provides a correction that’s more powerful than Bonferroni’s procedure. The one difference is that the p-values are usually not all multiplied by the full variety of tests (here, 10000). As an alternative, each sorted p-value is multiplied progressively by a decreasing sequence 10000, 9999, 9998, 9997, …, 3, 2, 1.

`df['p_value_holm'] = adjust_pvalues(df['p_value'], 'holm')`

df.sort_values('p_value_holm').head(10)

We are able to confirm this ourselves: the last tenth p-value on this output is multiplied by 9991: 7.943832e-06 * 9991 = 0.079367. Holm’s correction can also be the default method for adjusting p-values in `p.adjust()`

function in R language.

If we again apply our p-value threshold of 0.05, let’s have a look how these adjusted p-values affect our predictions:

`df['is_p_value_holm_significant'] = df['p_value_holm'] <= 0.05`

df.groupby(['hypothesis', 'is_p_value_holm_significant']).size()

`hypothesis is_p_value_holm_significant`

non-null False 92

True 8

null False 9900

dtype: int64

These results are much different than after we applied the identical threshold to the raw p-values! Now, only 8 features are flagged as “significant”, and all 8 are correct — they were generated from our Non-null distribution. It’s because the probability of getting even one feature flagged incorrectly is just 0.05 (5%).

Nevertheless, this approach has a downside: it didn’t flag other 92 Non-null features as significant. While it was very stringent to ensure not one of the null features slipped in, it was in a position to find only 8% (8 out of 100) non-null features. This might be seen as taking a distinct extreme than the False Positive Rate approach.

Is there a more middle ground? The reply is “yes”, and that middle ground is False Discovery Rate.

What if we’re OK with letting some false positives in, but capturing greater than single-digit percent of true positives? Perhaps we’re OK with having *some* false positive, just not that many who they overwhelm the entire features we flag as significant — as was the case within the FPR example.

This might be done by controlling for False Discovery Rate (slightly than FWER or FPR) at a specified threshold level, say 0.05. False Discovery Rate is defined a fraction of false positives amongst all features flagged as positive: FDR = FP / (FP + TP), where FP is the variety of False Positives and TP is the variety of True Positives. By setting FDR threshold to 0.05, we’re saying we’re OK with having 5% (on average) false positives amongst all of our features we flag as positive.

There are several methods to regulate FDR and here we are going to describe methods to use two popular ones: Benjamini-Hochberg and Benjamini-Yekutieli procedures. Each of those procedures are similar although more involved than FWER procedures. They still depend on sorting the p-values, multiplying them with a particular number, after which using a cut-off criterion.

## Benjamini-Hochberg procedure

Benjamini-Hochberg (BH) procedure assumes that every of the tests are *independent*. Dependent tests occur, for instance, if the features being tested are correlated with one another. Let’s calculate the BH-adjusted p-values and compare it to our earlier result from FWER using Holm’s correction:

`df['p_value_bh'] = adjust_pvalues(df['p_value'], 'fdr_bh')`

df[['hypothesis', 'feature', 'x', 'p_value', 'p_value_holm', 'p_value_bh']]

.sort_values('p_value_bh')

.head(10)

`df['is_p_value_holm_significant'] = df['p_value_holm'] <= 0.05`

df.groupby(['hypothesis', 'is_p_value_holm_significant']).size()

`hypothesis is_p_value_holm_significant`

non-null False 92

True 8

null False 9900

dtype: int64

`df['is_p_value_bh_significant'] = df['p_value_bh'] <= 0.05`

df.groupby(['hypothesis', 'is_p_value_bh_significant']).size()

`hypothesis is_p_value_bh_significant`

non-null False 67

True 33

null False 9898

True 2

dtype: int64

BH procedure now accurately flagged 33 out of 100 non-null features as significant — an improvement from the 8 with the Holm’s correction. Nevertheless, it also flagged 2 null features as significant. So, out of the 35 features flagged as significant, the fraction of incorrect features is: 2 / 33 = 0.06 so 6%.

Note that on this case now we have 6% FDR rate, though we aimed to regulate it at 5%. FDR will probably be controlled at a 5% rate *on average*: sometimes it could be lower and sometimes it could be higher.

## Benjamini-Yekutieli procedure

Benjamini-Yekutieli (BY) procedure controls FDR no matter whether tests are independent or not. Again, it’s value noting that each one of those procedures try to determine *upper bounds* on FDR (or FWER), in order that they could also be less or more conservative. Let’s compare the BY procedure with a BH and Holm procedures above:

`df['p_value_by'] = adjust_pvalues(df['p_value'], 'fdr_by')`

df[['hypothesis', 'feature', 'x', 'p_value', 'p_value_holm', 'p_value_bh', 'p_value_by']]

.sort_values('p_value_by')

.head(10)

`df['is_p_value_by_significant'] = df['p_value_by'] <= 0.05`

df.groupby(['hypothesis', 'is_p_value_by_significant']).size()

`hypothesis is_p_value_by_significant`

non-null False 93

True 7

null False 9900

dtype: int64

BY procedure is stricter in controlling FDR; on this case much more so than the Holm’s procedure for controlling FWER, by flagging only 7 non-null features as significant! The important advantage of using it’s after we know the info may contain a high variety of correlated features. Nevertheless, in that case we may want to contemplate filtering out correlated features in order that we don’t must test all of them.

At the tip, the alternative of procedure is left to the user and relies on what the evaluation is attempting to do. Quoting Benjamini, Hochberg (Royal Stat. Soc. 1995):

Often the control of the FWER just isn’t quite needed. The control of the FWER is essential when a conclusion from the assorted individual inferences is more likely to be erroneous when a minimum of considered one of them is.

This stands out as the case, for instance, when several latest treatments are competing against an ordinary, and a single treatment is chosen from the set of treatments that are declared significantly higher than the usual.

In other cases, where we could also be OK to have some false positives, FDR methods akin to BH correction provide less stringent p-value adjustments and should be preferrable if we primarily need to increase the variety of true positives that pass a certain p-value threshold.

There are other adjustment methods not mentioned here, notably a q-value which can also be used for FDR control, and on the time of writing exists only as an R package.