An Intuitive View on Mutual Information
The Textbook Definition
Bob vs Mutual Information
Beyond Linear Correlation
My “Layman” Definition of Mutual Information

We will break down the Mutual Information formula into the following parts:
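For reference, this is the standard formula for Mutual Information between two discrete variables; the sections below take it apart piece by piece:

```latex
I(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \, \log\!\left( \frac{p(x,y)}{p(x)\,p(y)} \right)
```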

The x, X and y, Y

x and y are the individual observations/values that we see in our data. X and Y are just the sets of those individual values. An example can be as follows:

Discrete/Binary statement of umbrella-wielding and weather

And assume we have 5 days of observations of Bob, in this exact sequence:

Discrete/Binary statement of umbrella-wielding and weather over 5 days
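The original table is an image, so here is a hypothetical reconstruction of the 5 days — one assignment that is consistent with every number quoted in this article (umbrella on 2 of 5 days, pairs matching 80% of the time, and a final Mutual Information of ~0.223 nats):

```python
# Hypothetical 5-day data, reconstructed to match the article's quoted numbers.
umbrella = [0, 0, 0, 1, 1]  # x: did Bob carry an umbrella that day?
rain     = [0, 0, 0, 1, 0]  # y: did it rain that day?

# Marginal probability of a value: simply its share of the observations.
def marginal(values, v):
    return values.count(v) / len(values)

print(marginal(umbrella, 1))  # 0.4 -> umbrella on 2 out of 5 days
print(marginal(rain, 1))      # 0.2
```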

Individual/Marginal Probability

These are simply the probabilities of observing a particular x or y from their respective sets of possible X and Y values.

Take x = 1 for example: the probability is simply 0.4 (Bob carried an umbrella on 2 out of 5 days of his vacation).

Joint Probability

This is the probability of observing a particular x and y together in the joint set (X, Y). The joint set (X, Y) is simply the set of paired observations. We pair them up according to their index.

In our case with Bob, we pair the observations up based on which day they occurred.
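Using the hypothetical data from before, this day-by-day pairing can be sketched as:

```python
from collections import Counter

umbrella = [0, 0, 0, 1, 1]  # hypothetical x-values, as before
rain     = [0, 0, 0, 1, 0]  # hypothetical y-values

# Pair the observations by day (index), then count each distinct pair
# to get the joint probabilities.
pairs = list(zip(umbrella, rain))
joint = {pair: count / len(pairs) for pair, count in Counter(pairs).items()}
print(joint)  # {(0, 0): 0.6, (1, 1): 0.2, (1, 0): 0.2}
```

Note that the equal-value pairs, (0,0) and (1,1), together account for 0.8 of the observations.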

You could be tempted to leap to a conclusion after taking a look at the pairs:

Since there are equal-value pairs occurring 80% of the time, it clearly means that people carry umbrellas BECAUSE it’s raining!

Well, I’m here to play devil’s advocate and say that this could just be a freak coincidence:

If the chance of rain is very low in Singapore, and, independently, the chance of Bob carrying an umbrella is also equally low (because he hates holding extra stuff), can you see that the odds of getting (0,0) paired observations will naturally be very high?

So what can we do to show that these paired observations are not a coincidence?

Joint Versus Individual Probabilities

We can take the ratio of the two probabilities to give us a clue about the “extent of coincidence”.

In the denominator, we take the product of the individual probabilities of a particular x and a particular y occurring. Why do we do so?

Peering into the humble coin toss

Recall the first lesson you took in statistics class: calculating the probability of getting 2 heads in 2 tosses of a fair coin.

  • 1st Toss [ p(x) ]: There’s a 50% chance of getting heads
  • 2nd Toss [ p(y) ]: There’s still a 50% chance of getting heads, because the outcome is independent of what happened in the 1st toss
  • These two tosses make up your individual probabilities
  • Therefore, the theoretical probability of getting both heads in 2 independent tosses is 0.5 * 0.5 = 0.25 ( p(x).p(y) )

And if you actually do, say, 100 sets of that double-coin-toss experiment, you will likely see that you get the (heads, heads) result about 25% of the time. Those 100 sets of experiments are effectively your (X, Y) joint probability set!
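A quick simulation of this double-toss experiment (scaled up to 100,000 sets to smooth out noise) can make the point concrete:

```python
import random

random.seed(42)
n = 100_000  # many more than 100 sets, to reduce sampling noise

# Each set is two independent fair-coin tosses; True means heads.
results = [(random.random() < 0.5, random.random() < 0.5) for _ in range(n)]

# Observed joint frequency of (heads, heads) vs the product of marginals.
joint_hh = sum(1 for a, b in results if a and b) / n
product_of_marginals = 0.5 * 0.5  # p(x) * p(y) = 0.25

print(round(joint_hh, 2))                         # ~0.25
print(round(joint_hh / product_of_marginals, 2))  # ratio ~1.0
```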

Hence, when you take the ratio of the joint versus the combined individual probabilities, you get a value of 1.

This is exactly the expectation for independent events: the joint probability of a particular pair of values occurring is precisely equal to the product of their individual probabilities! Just like what you were taught in fundamental statistics.

Now imagine that your 100-set experiment yielded (heads, heads) 90% of the time. Surely that can’t be a coincidence…

You expected 25% since you know that they are independent events, yet what was observed is an extreme skew away from this expectation.

To put this qualitative feeling into numbers, the ratio of probabilities is now a whopping 3.6 (0.9 / 0.25), essentially 3.6x more frequent than we expected.
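The arithmetic in one line:

```python
observed = 0.90       # (heads, heads) seen 90% of the time
expected = 0.5 * 0.5  # 0.25 under independence

print(round(observed / expected, 1))  # 3.6
```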

As such, we begin to suspect that perhaps the coin tosses were not independent. Perhaps the result of the 1st toss actually has some unexplained effect on the 2nd toss. Perhaps there is some level of association/dependence between the 1st and 2nd toss.

That is what Mutual Information tries to tell us!

Expected Value of Observations

To be fair to Bob, we should not only look at the days where his claim fails, i.e. calculate the ratio of probabilities for (0,0) and (1,1).

We should also calculate the ratio of probabilities for when his claim holds, i.e. (0,1) and (1,0).

Thereafter, we can aggregate all 4 scenarios as an expected value, which here just means “taking the average”: add up the ratios of probabilities for every observed pair in (X, Y), then divide by the number of observations.

That is the purpose of those two summation terms. For continuous variables, like my stock market example, we would use integrals instead.

Logarithm of Ratios

Just like how we calculated the probability of getting 2 consecutive heads in the coin toss, we are now calculating the compounded probability of seeing the 5 pairs that we observed.

For the coin toss, we calculated it by multiplying the probabilities of each toss. For Bob, it is the same: the probabilities have a multiplicative effect on one another to give us the sequence that we observed in the joint set.

With logarithms, we turn multiplicative effects into additive ones:

By converting the ratios of probabilities to their logarithmic variants, we can now simply calculate the expected value as described above using a summation of their logarithms.
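A one-liner demonstrating the product-to-sum property that makes this work:

```python
import math

# Multiplying probabilities compounds them; taking logs turns the
# product into a sum, which is exactly what the summation needs.
p1, p2 = 0.5, 0.5
assert math.isclose(math.log(p1 * p2), math.log(p1) + math.log(p2))
print(math.log(p1 * p2))  # -1.386... (natural log of 0.25)
```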

Feel free to use log base 2, e, or 10; it does not matter for the purposes of this article.

Putting It All Together

Formula for Mutual Information for Discrete Observations

Let’s now prove Bob wrong by calculating the Mutual Information. I’ll use log base e (the natural logarithm) for my calculations:

So what does the value of 0.223 tell us?

Let’s first assume Bob is right, and that using umbrellas is independent of the presence of rain:

  • We know that the joint probability will exactly equal the product of the individual probabilities.
  • Therefore, for every x and y combination, the ratio of probabilities = 1.
  • Taking the logarithm, that equates to 0.
  • Thus, the expected value over all combinations (i.e. the Mutual Information) is 0.

But since the Mutual Information score that we calculated is non-zero, we can show Bob that he is wrong!

