## p-value

Enter the infamous p-value. It’s a number that answers the question: what’s the probability of observing the chi-2 value we got, or an even more extreme one, given that the null hypothesis is true? Or, using some notation, the p-value represents the probability of observing the data assuming the null hypothesis is true: P(data|H₀). (To be precise, the p-value is defined as P(test_statistic(data) > T | H₀), where T is the chosen threshold for the test statistic.) Notice how that is different from what we are actually interested in, which is the probability that our hypothesis is true given the data we’ve observed: P(H₀|data).

**what the p-value represents: P(data|H₀)**

**what we normally want: P(H₀|data)**

Graphically speaking, the p-value is the sum of the blue probability density to the right of the red line. The easiest way to compute it is to take one minus the cumulative distribution function at the observed value, that is, one minus the probability mass to the left of it.

```python
from scipy.stats import chi2

# tail probability to the right of the observed statistic
1 - chi2.cdf(chisq, df=1)
```

This gives us 0.0396. If there was no data drift, we would get the test statistic we got, or an even larger one, in roughly 4% of the cases. Not that rare, after all. In most use cases, the p-value is conventionally compared against a significance level of 1% or 5%. If it’s lower than that, one rejects the null. Let’s be conservative and follow the 1% significance threshold. In our case, with a p-value of almost 4%, there isn’t enough evidence to reject it. Hence, no data drift was detected.

To make sure our test was correct, let’s confirm it with scipy’s built-in test function.

```python
from scipy.stats import chi2_contingency

chisq, pvalue, df, expected = chi2_contingency(cont_table)
print(chisq, pvalue)
```

```
4.232914541135393 0.03964730311588313
```

That is how hypothesis testing works. But how relevant is it for data drift detection in a production machine learning system?

Statistics, in its broadest sense, is the science of making inferences about entire populations based on small samples. When the famous t-test was first published at the beginning of the twentieth century, all calculations were made with pen and paper. Even today, students in STATS101 courses learn that a “large sample” starts at 30 observations.

Back in the times when data was hard to collect and store, and manual calculations were tedious, statistically rigorous tests were an important way to answer questions about broader populations. Nowadays, however, with data often abundant, many tests diminish in usefulness.

The reason is that many statistical tests treat the amount of data as evidence. With less data, the observed effect is more prone to random variation resulting from sampling error, and with more data, its variance decreases. Consequently, the very same observed effect constitutes stronger evidence against the null hypothesis with more data than with less.

To illustrate this phenomenon, consider comparing two companies, A and B, in terms of the gender ratio among their employees. Let’s imagine two scenarios. First, we take random samples of 10 employees from each company. At company A, 6 out of 10 are women, while at company B, 4 out of 10 are women. Second, we increase our sample size to 1000. At company A, 600 out of 1000 are women, and at B, it’s 400. In both scenarios, the gender ratios are the same. However, more data seems to provide stronger evidence that company A employs proportionally more women than company B, doesn’t it?
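This intuition is easy to check with scipy’s chi-2 test on the two scenarios (a minimal sketch; the counts are just the illustrative numbers from above):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Scenario 1: samples of 10 employees per company (women, men)
small = np.array([[6, 4],    # company A
                  [4, 6]])   # company B

# Scenario 2: identical 60/40 vs 40/60 ratios, samples of 1000
large = small * 100

_, p_small, _, _ = chi2_contingency(small)
_, p_large, _, _ = chi2_contingency(large)

print(f"n=10 per company:   p = {p_small:.3f}")   # far above any threshold
print(f"n=1000 per company: p = {p_large:.2e}")   # overwhelmingly significant
```

The same observed ratios yield a p-value nowhere near significance in the first scenario, and a vanishingly small one in the second.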

This phenomenon often manifests itself in hypothesis testing with large data samples. The more data, the lower the p-value, and the more likely we are to reject the null hypothesis and declare the detection of some statistical effect, such as data drift.

Let’s see whether this holds for our chi-2 test for the difference in frequencies of a categorical variable. In the original example, the serving set was roughly ten times smaller than the training set. Let’s multiply the frequencies in the serving set by a set of scaling factors between 1/100 and 10 and calculate the chi-2 statistic and the test’s p-value each time. Notice that multiplying all frequencies in the serving set by the same constant doesn’t affect their distribution: the only thing we’re changing is the size of one of the sets.

```python
training_freqs = np.array([10_322, 24_930, 30_299])
serving_freqs = np.array([1_015, 2_501, 3_187])

p_values, chi_sqs = [], []
multipliers = [0.01, 0.03, 0.05, 0.07, 0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]

for serving_size_multiplier in multipliers:
    augmented_serving_freqs = serving_freqs * serving_size_multiplier
    cont_table = pd.DataFrame([
        training_freqs,
        augmented_serving_freqs,
    ])
    chi_sq, pvalue, _, _ = chi2_contingency(cont_table)
    p_values.append(pvalue)
    chi_sqs.append(chi_sq)
```

The values at the multiplier equal to one are those we’ve calculated before. Notice how with a serving set just 3 times larger (marked with a vertical dashed line) our conclusion changes completely: we get a chi-2 statistic of 11 and a p-value of almost zero, which in our case corresponds to indicating data drift.
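As a sanity check, the single multiplier of 3 can be rerun in isolation (the frequency arrays are repeated here so the snippet runs on its own):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

training_freqs = np.array([10_322, 24_930, 30_299])
serving_freqs = np.array([1_015, 2_501, 3_187])

# Same serving distribution, three times the serving sample size
cont_table = pd.DataFrame([training_freqs, serving_freqs * 3])
chi_sq, pvalue, _, _ = chi2_contingency(cont_table)
print(chi_sq, pvalue)  # statistic above 10, p-value below the 1% threshold
```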

The consequence of this is an increasing number of false alarms. Although these effects will be statistically significant, they will not necessarily be significant from the performance monitoring standpoint. With a large enough data set, even the tiniest of data drifts will be flagged, even if it is so weak that it doesn’t deteriorate the model’s performance.

Having learned this, you might be tempted to suggest dividing the serving data into a number of chunks and running multiple tests on smaller data sets. Unfortunately, this is not a good idea either. To see why, we need to deeply understand what the p-value really means.

We have already defined the p-value as the probability of observing a test statistic at least as unlikely as the one we’ve actually observed, given that the null hypothesis is true. Let’s try to unpack this mouthful.

The null hypothesis means no effect, in our case: no data drift. This means that whatever differences there are between the training and serving data, they have emerged as a result of random sampling. The p-value can therefore be seen as the probability of getting the differences we got, given that they come from randomness alone.

Hence, a p-value of roughly 0.1 implies that in the complete absence of data drift, 10% of such tests would erroneously signal data drift due to random chance. This is consistent with the notation for what the p-value represents that we introduced earlier: P(data|H₀). If this probability is 0.1, then given that H₀ is true (no drift), we have a 10% chance of observing data at least as different as what we observed (according to the test statistic).

This is the reason why running more tests on smaller data samples is not a good idea: if instead of testing the serving data from the entire day once a day, we split it into 10 chunks and ran 10 tests each day, we would end up with one false alarm every day, on average! This can lead to so-called alert fatigue, a situation in which you are bombarded with alerts to the extent that you stop paying attention to them. And when data drift really does occur, you may miss it.
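The false-alarm arithmetic is easy to simulate. The sketch below (hypothetical chunk size and seed, not from the original experiment) draws many serving samples from exactly the same distribution as the training data, so the null hypothesis is true by construction, and counts how often the test still fires at a 10% significance level:

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(42)

# Ground truth: no drift — serving chunks are drawn from the training distribution
training_freqs = np.array([10_322, 24_930, 30_299])
probs = training_freqs / training_freqs.sum()

n_tests, alpha = 1_000, 0.10
false_alarms = 0
for _ in range(n_tests):
    # one serving chunk of ~670 rows, sampled with no drift at all
    serving_freqs = rng.multinomial(670, probs)
    _, pvalue, _, _ = chi2_contingency([training_freqs, serving_freqs])
    false_alarms += pvalue < alpha

print(f"False alarm rate: {false_alarms / n_tests:.2%}")  # close to alpha
```

By construction, roughly a fraction alpha of the tests reject the null even though nothing drifted, which is exactly the mechanism behind the one-alarm-per-day estimate above.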

We have seen that detecting data drift based on a test’s p-value can be unreliable, leading to many false alarms. How can we do better? One solution is to turn 180 degrees and resort to Bayesian testing, which allows us to directly estimate what we want, P(H₀|data), rather than the p-value, P(data|H₀).