How the conditional probability changes as a function of the three probability elements
I recently wrote about the causes of model performance degradation, that is, when a model's prediction quality drops relative to the moment we trained and deployed it. In that other post, I proposed a new way of thinking about the causes of model degradation. In that framework, the so-called conditional probability comes out as the global cause.
The conditional probability is, by definition, composed of three probabilities, which I call the specific causes. The most important lesson of this restructuring of concepts is that covariate shift and conditional shift are not two separate or parallel concepts: a conditional shift can occur as a function of a covariate shift.
With this restructuring, I believe it becomes easier to reason about the causes and more natural to interpret the shifts we observe in our applications.
This is the scheme of causes and model performance for machine learning models:
In this scheme, we see the clear path that connects the causes to the prediction performance of our estimated models. One fundamental assumption we need to make in statistical learning is that our models are “good” estimators of the real models (real decision boundaries, real regression functions, etc.). “Good” can have different meanings, such as unbiased estimators, precise estimators, complete estimators, sufficient estimators, and so on. But, for the sake of simplicity and the upcoming discussion, let’s say that they are good in the sense that they have a small prediction error. In other words, we assume that they are representative of the real models.
With this assumption, we can search for the causes of degradation of the estimated model in the probabilities P(X), P(Y), P(X|Y), and consequently, P(Y|X).
So, what we will do today is walk through different scenarios to see how P(Y|X) changes as a function of the three probabilities P(X|Y), P(X), and P(Y). We will do so by using a population of just a few points in a 2D space and calculating the probabilities from these sample points the way Laplace would: by counting. The aim is to digest the hierarchy of causes of model degradation, keeping P(Y|X) as the global cause and the other three as the specific causes. In that way, we can understand, for instance, how a covariate shift can sometimes be the argument of a conditional shift rather than being a separate shift of its own.
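Before diving in, here is a minimal sketch of what “counting like Laplace” means in code. The sample points, the threshold a, and the function name are hypothetical, chosen only so that every probability of the scheme comes out as 1/2, which is the reference situation we start from below.

```python
import numpy as np

def counting_probs(points, labels, a):
    """Estimate the four probabilities of the scheme by plain counting.

    points : (n, 2) array-like with the covariates (X1, X2)
    labels : (n,) array-like with the binary output Y
    a      : threshold on X1 defining the showcase event X1 > a
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n = len(labels)

    right = points[:, 0] > a   # event X1 > a
    pos = labels == 1          # event Y = 1

    p_y1 = pos.sum() / n                              # prior P(Y=1)
    p_x = right.sum() / n                             # covariate P(X1>a)
    p_x_given_y1 = (right & pos).sum() / pos.sum()    # inverse P(X1>a | Y=1)
    p_y1_given_x = (right & pos).sum() / right.sum()  # conditional P(Y=1 | X1>a)
    return p_y1, p_x, p_x_given_y1, p_y1_given_x


# Hypothetical reference sample: one point per quadrant, one point of each
# class on either side of X1 = a, so every probability comes out as 1/2.
a = 0.0
points = [(-1, 1), (-1, -1), (1, 1), (1, -1)]
labels = [1, 0, 0, 1]
print(counting_probs(points, labels, a))  # (0.5, 0.5, 0.5, 0.5)
```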
The example
The case we will use for our lesson today is a very simple one. We have a space of two covariates, X1 and X2, and the output Y is a binary variable. This is what our model space looks like:
You see there that the space is organized into four quadrants, and the decision boundary in this space is the cross. This means the model classifies samples into class 1 if they lie in the 1st and 3rd quadrants, and into class 0 otherwise. For the sake of this exercise, we will walk through the different cases comparing P(Y=1|X1>a). This will be our showcase conditional probability. If you are wondering why we do not also condition on X2, it is just for the simplicity of the exercise. It does not affect the insight we want to gain.
If you are still left with a bittersweet feeling, note that taking P(Y=1|X1>a) is the same as taking P(Y=1|X1>a, -∞<X2<+∞), so, theoretically, we are still taking X2 into account.
Reference model
So, to begin with, we calculate our showcase probability and we obtain 1/2. At this point, our group of samples is roughly uniform across the space and the prior probabilities are also uniform:
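To put numbers on it with a hypothetical configuration consistent with this picture (the same one used in the counting sketch above): one sample per quadrant, two of class 1 and two of class 0, with one point of each class on either side of X1=a. Counting gives P(Y=1)=2/4=1/2, P(X1>a)=2/4=1/2, P(X1>a|Y=1)=1/2, and therefore P(Y=1|X1>a)=1/2.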
Shifts are coming up
1. One extra sample appears in the bottom-right quadrant. So the first thing we ask is: are we talking about a covariate shift?
Well, yes, because there is more sampling in X1>a than there was before. So, is this only a covariate shift and not a conditional shift? Let’s see. Here is the calculation of all the same probabilities as before with the updated set of points (the probabilities that changed are in orange):
What did we see here? In fact, not only did we get a covariate shift; overall, all the probabilities changed. The prior probability also changed because the covariate shift brought a new point of class 1, making the incidence of this class greater than that of class 0. The inverse probability P(X1>a|Y=1) also changed, precisely because of the prior shift. All of that led to a conditional shift, so we now get P(Y=1|X1>a)=2/3 instead of 1/2.
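With the hypothetical four-point configuration from before, this case corresponds to adding one class-1 sample on the X1>a side. We now have 5 samples, 3 of class 1, 3 with X1>a, and 2 that are both, so P(Y=1)=3/5, P(X1>a)=3/5, P(X1>a|Y=1)=2/3, and P(Y=1|X1>a)=2/3: every probability in the scheme moved, and the conditional lands exactly at the 2/3 above.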
Here is a thought bubble. An important one, actually.
With this shift in the sampling distribution, we obtained shifts in all the probabilities that play a role in the whole scheme of our models. Yet, the decision boundary that existed based on the initial sampling remains valid under this shift.
What does this mean?
Even though we obtained a conditional shift, the decision boundary did not necessarily degrade. Since the decision boundary comes from the expected value, if we recalculate this value under the current shift, the boundary may remain the same, only with a different conditional probability.
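One way to see it: the classifier’s rule is ŷ(x1, x2) = argmax_y P(Y=y|X1=x1, X2=x2), so the boundary only moves where this argmax flips. A shift that takes the marginal conditional P(Y=1|X1>a) from 1/2 to 2/3 changes the value of a probability, but it does not have to flip the winning class anywhere in the space, so the cross-shaped boundary can stay exactly where it was.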
2. The samples in the first quadrant do not exist anymore.
So, for X1>a, things remain unchanged. Let’s see what happens to the conditional probability we are showcasing and to its elements.
Intuitively, because inside X1>a things remain unchanged, the conditional probability stayed the same. Yet, when we look at P(X1>a), we obtain 2/3 instead of the 1/2 of the training sampling. So here we have a covariate shift without a conditional shift.
From a math perspective, how can the covariate probability change without the conditional probability changing? It is because P(Y=1) and P(X1>a|Y=1) changed along with the covariate probability. Their compensation is what keeps the conditional probability unchanged.
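Bayes’ rule makes the compensation explicit: P(Y=1|X1>a) = P(X1>a|Y=1)·P(Y=1)/P(X1>a). In the hypothetical four-point configuration, taking the removed first-quadrant sample to be the class-1 point on the X1<a side, we are left with 3 points: P(X1>a) rises from 1/2 to 2/3, but P(Y=1) drops from 1/2 to 1/3 and P(X1>a|Y=1) rises from 1/2 to 1, so the ratio is 1·(1/3)/(2/3) = 1/2 and the conditional probability does not move.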
With these changes, just as before, the decision boundary remained valid.
3. Throwing in some samples in different quadrants while the decision boundary remains valid.
We have here two extra combinations. In one case, the prior remained the same while the other two probabilities changed, still without changing the conditional probability. In the second case, only the inverse probability changed, and it came with a conditional shift. Check the shifts here below. The latter is a rather important one, so don’t miss it!
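The second combination is worth spelling out with Bayes’ rule: since P(Y=1|X1>a) = P(X1>a|Y=1)·P(Y=1)/P(X1>a), if the prior and the covariate probability hold still, any movement of the inverse probability translates one-to-one into a movement of the conditional probability.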
With this, we now have a rather solid perspective on how the conditional probability can change as a function of the other three probabilities. But most importantly, we also know that not all conditional shifts invalidate the existing decision boundary. So what is the deal with that?
Concept drift
In the previous post, I also proposed a more specific way of defining concept drift (or concept shift). The proposal is:
We speak of a change in the concept when the decision boundary or regression function becomes invalid while the probabilities at play are shifting.
So, the crucial point here is that if the decision boundary becomes invalid, there is surely a conditional shift. The reverse, as we discussed in the previous post and as we saw in the examples above, is not necessarily true.
This may not be so fantastic from a practical perspective, because it means that to really know whether there is a concept drift, we might be forced to re-estimate the boundary or function. But at least, for our theoretical understanding, it is just as fascinating.
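As a rough illustration of what “re-estimating the boundary” could look like in practice, here is a minimal sketch. The data, the choice of scikit-learn’s DecisionTreeClassifier, and the grid-based disagreement measure are all assumptions made for the example: fit the same simple model on the reference sample and on the fresh sample, then check how much the two decision functions disagree over the input space. Probability shifts alone should leave the disagreement near zero; a concept drift shows up as a region where the two boundaries predict different classes.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boundary_disagreement(X_ref, y_ref, X_new, y_new, grid_size=50):
    """Fit the same simple model on the reference and on the new sample,
    then measure the fraction of the input space where their predictions differ."""
    model_ref = DecisionTreeClassifier(max_depth=2).fit(X_ref, y_ref)
    model_new = DecisionTreeClassifier(max_depth=2).fit(X_new, y_new)

    # Evaluate both decision functions on a grid covering the observed space.
    all_X = np.vstack([X_ref, X_new])
    x1 = np.linspace(all_X[:, 0].min(), all_X[:, 0].max(), grid_size)
    x2 = np.linspace(all_X[:, 1].min(), all_X[:, 1].max(), grid_size)
    g1, g2 = np.meshgrid(x1, x2)
    grid = np.column_stack([g1.ravel(), g2.ravel()])

    return np.mean(model_ref.predict(grid) != model_new.predict(grid))

# Hypothetical usage: X_ref, y_ref come from training time, X_new, y_new from production.
# A disagreement close to 0 means the re-estimated boundary matches the old one, even if
# individual probabilities shifted; a large value points to a concept drift.
```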
Here is an example in which we have a concept drift, naturally with a conditional shift, but actually without a covariate or a prior shift.
How cool is this separation of components? The only element that changed here was the inverse probability, but, contrary to the previous shift we studied above, this change in the inverse probability was linked to a change in the decision boundary. Now, a valid decision boundary is only the separation according to X1>a, discarding the boundary dictated by X2.
What have we learned?
We have walked very slowly through the decomposition of the causes of model degradation. We studied different shifts of the probability elements and how they relate to the degradation of the prediction performance of our machine learning models. The most important insights are:
- A conditional shift is the global cause of prediction degradation in machine learning models
- The specific causes are covariate shift, prior shift, and inverse probability shift
- We can have many different cases of probability shifts while the decision boundary stays valid
- A change in the decision boundary causes a conditional shift, but the reverse is not necessarily true!
- Concept drift may be more specifically related to the decision boundary rather than to the overall conditional probability distribution
What follows from this? Reorganizing our practical solutions in light of this hierarchy of definitions is the main invitation I make. We might find many of the answers we need to our current questions about the way we can monitor our models.
If you are currently working on model performance monitoring using these definitions, don’t hesitate to share your thoughts on this framework.
Happy thinking to everyone!