
The LLN is interesting as much for what it doesn’t say as for what it does

On August 24, 1966, a talented playwright by the name of Tom Stoppard staged a play in Edinburgh, Scotland. The play had a curious title, "Rosencrantz and Guildenstern Are Dead." Its central characters, Rosencrantz and Guildenstern, are childhood friends of Hamlet (of Shakespearean fame). The play opens with Guildenstern repeatedly tossing coins that keep coming up Heads. Each toss makes Guildenstern's money-bag lighter and Rosencrantz's heavier. As the drumbeat of Heads continues with pitiless persistence, Guildenstern grows apprehensive. He wonders whether he is secretly willing each coin to come up Heads as a self-inflicted punishment for some long-forgotten sin. Or whether time stopped after the first flip, and he and Rosencrantz are experiencing the same outcome over and over again.
Stoppard does an excellent job of showing how the laws of probability are woven into our view of the world, into our sense of expectation, into the very fabric of human thought. When the 92nd flip also comes up Heads, Guildenstern asks whether he and Rosencrantz are within the grip of an unnatural reality where the laws of probability no longer operate.
Guildenstern's fears are, of course, unfounded. Granted, the probability of getting 92 Heads in a row is unimaginably small. In fact, it is 0.5 raised to the power of 92, which works out to a decimal point followed by 27 zeroes and then a 2. Guildenstern is more likely to be hit on the head by a meteorite.
Guildenstern only has to come back the next day and flip another sequence of 92 coin tosses, and the result will almost certainly be vastly different. If he were to follow this routine every day, he would discover that on most days the number of Heads roughly matches the number of Tails. Guildenstern is experiencing an interesting behavior of our universe known as the Law of Large Numbers.
The LLN, as it is called, comes in two flavors: the weak and the strong. The weak LLN is arguably the more intuitive of the two and easier to relate to. But it is also easy to misinterpret. I'll cover the weak version in this article and leave the discussion of the strong version for a later article.
The weak Law of Large Numbers concerns itself with the relationship between the sample mean and the population mean. I'll explain what it says in plain language:
Suppose you draw a random sample of a certain size, say 100, from the population. By the way, make a mental note of the term sample size. The size of the sample is the ringmaster, the grand pooh-bah of this law. Now calculate the mean of this sample and set it aside. Next, repeat this process many times over. What you'll get is a set of imperfect means. The means are imperfect because there will always be a 'gap', a delta, a deviation between them and the true population mean. Let's assume you are willing to tolerate a certain deviation. If you pick a sample mean at random from this set of means, there will be some probability that the absolute difference between that sample mean and the population mean exceeds your tolerance.
The weak Law of Large Numbers says that the probability of this deviation exceeding your chosen level of tolerance shrinks to zero as the sample size grows toward infinity (or toward the size of the population).
No matter how tiny your chosen level of tolerance is, as you draw sets of samples of ever-increasing size, it becomes increasingly unlikely that the mean of a randomly chosen sample from the set deviates from the population mean by more than this tolerance.
To see how the weak LLN works, we'll run through an example. And for that, allow me, if you will, to take you to the cold, brooding expanse of the northeastern North Atlantic Ocean.
Every day, the Government of Ireland publishes a dataset of water temperature measurements taken from the surface of the North East North Atlantic. This dataset contains hundreds of thousands of measurements of surface water temperature indexed by latitude and longitude. For instance, the data for June 21, 2023 is as follows:
It's kind of hard to imagine what eight hundred thousand surface temperature values look like. So let's create a scatter plot to visualize this data. I've shown this plot below. The vacant white areas in the plot represent Ireland and the UK.
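If you'd like to reproduce a plot of this kind, a minimal sketch along the following lines should work. The file name and the column names ('longitude', 'latitude', 'sea_surface_temperature') are assumptions on my part, so rename them to whatever the downloaded dataset actually uses:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load one day's worth of measurements. File name and column names are assumed.
df = pd.read_csv("north_atlantic_sst_2023_06_21.csv")

# One dot per measurement location, colored by its surface temperature.
plt.scatter(df["longitude"], df["latitude"],
            c=df["sea_surface_temperature"], s=1, cmap="viridis")
plt.colorbar(label="Surface temperature (°C)")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("North East North Atlantic sea surface temperature, June 21, 2023")
plt.show()
```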
As a student of statistics, you will almost never have access to the 'population'. So you would be right to chide me severely for declaring this collection of 800,000-odd temperature measurements to be the 'population'. But bear with me for a little while. You'll soon see why, in our quest to understand the LLN, it helps to think of this data as the 'population'.
So let's assume that this data is (ahem…cough) the population. The average surface water temperature across the 810,219 locations in this population of values is 17.25840 degrees Celsius. 17.25840 is simply the average of the 810K temperature measurements. We'll designate this value as the population mean, μ. Remember this value. You'll need to refer to it often.
Now suppose this population of 810,219 values is not accessible to you. Instead, all you have access to is a meager little sample of 20 random locations drawn from this population. Here's one such random sample:
The mean temperature of the sample is 16.9452414 degrees C. This is our sample mean X_bar, which is computed as follows:
X_bar = (X1 + X2 + X3 + … + X20) / 20
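Here's a minimal sketch of the same computation in Python, reusing the (assumed) DataFrame and column name from the loading step above. The seed is arbitrary, so your sample, and hence your X_bar, will differ from mine:

```python
import numpy as np

# Population mean over all ~810K surface temperature measurements.
mu = df["sea_surface_temperature"].mean()

# Draw one random sample of 20 measurements and compute its mean.
sample = df["sea_surface_temperature"].sample(n=20, random_state=42)
x_bar = sample.mean()

print(f"Population mean μ = {mu:.5f} °C, sample mean X_bar = {x_bar:.5f} °C")
```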
You could just as easily draw a second, a third, indeed any number of such random samples of size 20 from the same population. Here are a few random samples for illustration:
A quick aside on what a random sample really is
Before moving ahead, let's pause a bit to gain some perspective on the concept of a random sample. It will make it easier to grasp how the weak LLN works. And to acquire this perspective, I want to introduce you to the casino slot machine:
The slot machine shown above contains three slots. Each time you crank down the arm of the machine, it fills each slot with an image chosen at random from an internally maintained population of images, such as a list of fruit pictures. Now imagine a slot machine with 20 slots named X1 through X20. Assume that the machine is designed to pick values from our population of 810,219 temperature measurements. When you pull down the arm, each one of the 20 slots, X1 through X20, fills with a randomly chosen value from the population of 810,219 values. Therefore, X1 through X20 are random variables that can each hold any value from the population. Taken together, they form a random sample. Put another way, each element of a random sample is itself a random variable.
X1 through X20 have a few interesting properties:
- The value that X1 acquires is independent of the values that X2 through X20 acquire. The same applies to X2, X3, …, X20. Thus X1 through X20 are independent random variables.
- Because X1, X2, …, X20 can each hold any value from the population, the expected value of each of them is the population mean, μ. Using the notation E() for expectation, we write this result as follows: E(X1) = E(X2) = … = E(X20) = μ.
- X1 through X20 have identical probability distributions.
Thus, X1, X2,…,X20 are independent, identically distributed (i.i.d.) random variables.
…and now we get back to showing how the weak LLN works
Let's compute the mean (denoted by X_bar) of this 20-element sample and set it aside. Now let's once more crank down the machine's arm, and out will pop another 20-element random sample. We'll compute its mean and set it aside too. If we repeat this process one thousand times, we will have computed one thousand sample means.
Here's a table of 1000 sample means computed this way. We'll designate them as X_bar_1 to X_bar_1000:
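Here's one way to generate such a table, continuing with the assumed DataFrame from earlier; each pass through the loop is one crank of the 20-slot machine:

```python
# Crank the 20-slot machine 1000 times: draw 1000 random samples of size 20
# and record each sample's mean.
sample_means = np.array([
    df["sea_surface_temperature"].sample(n=20).mean()
    for _ in range(1000)
])
```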
Now consider the following statement carefully:
Because the sample mean is calculated from a random sample, the sample mean is itself a random variable.
At this point, if you are sagely nodding your head and stroking your chin, it is very much the right thing to do. The realization that the sample mean is itself a random variable is one of the most penetrating insights one can have in statistics.
Notice also how each sample mean in the table above is some distance away from the population mean, μ. Let's plot a histogram of these sample means to see how they're distributed around μ:
Most of the sample means appear to lie near the population mean of 17.25840 degrees Celsius. However, there are some that are considerably distant from μ. Suppose your tolerance for this distance is 0.25 degrees Celsius. Suppose you were to plunge your hand into this bucket of 1000 sample means, grab whichever mean falls within your grasp, and pull it out. What would be the probability that the absolute difference between this mean and μ is equal to or greater than 0.25 degrees C? To estimate this probability, you count the number of sample means that are at least 0.25 degrees away from μ and divide this count by 1000.
In the above table, this count happens to be 422, and so the probability P(|X_bar − μ| ≥ 0.25) works out to be 422/1000 = 0.422.
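In code, this estimate is a one-liner over the array of sample means computed above. The count of 422 is specific to the particular samples I happened to draw; your own random draws will produce a similar, but not identical, count:

```python
tolerance = 0.25

# Fraction of the 1000 sample means that are at least 0.25 °C away from μ.
prob_exceeds = np.mean(np.abs(sample_means - mu) >= tolerance)
print(f"Estimated P(|X_bar − μ| ≥ {tolerance}) = {prob_exceeds:.3f}")
```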
Let’s park this probability for a minute.
Now repeat all of the above steps, but this time use a sample size of 100 instead of 20. So here's what you'll do: draw 1000 random samples, each of size 100, take the mean of each sample, store away all those means, count the ones that are at least 0.25 degrees C away from μ, and divide this count by 1000. If that sounded like the labors of Hercules, you weren't mistaken. So take a moment to catch your breath. And once you're all caught up, notice below what you've got as the fruit of your labors.
The table below contains the means from the 1000 random samples, each of size 100:
Out of these one thousand means, fifty-six happen to deviate by at least 0.25 degrees C from μ. That gives you the probability of running into such a mean as 56/1000 = 0.056. This probability is decidedly smaller than the 0.422 we computed earlier, when the sample size was only 20.
If you repeat this sequence of steps multiple times, each time with a different, incrementally larger sample size, you will get yourself a table full of probabilities. I've done this exercise for you by dialing up the sample size from 10 through 490 in steps of 10. Here's the result:
Each row in this table corresponds to 1000 different samples that I drew at random from the population of 810,219 temperature measurements. The sample_size column states the size of each of those 1000 samples. Once the samples were drawn, I took the mean of each sample and counted the ones that were at least 0.25 degrees C away from μ. The num_exceeds_tolerance column holds this count. The probability column is num_exceeds_tolerance / 1000.
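Here's a sketch of the experiment that produced this table, again under the assumed DataFrame and column name. Expect the run to take a little while, since it draws 49,000 samples in all:

```python
results = []
for sample_size in range(10, 500, 10):
    # 1000 random samples of the current size, and their means.
    means = np.array([
        df["sea_surface_temperature"].sample(n=sample_size).mean()
        for _ in range(1000)
    ])
    # How many of those 1000 means are at least 0.25 °C away from μ?
    num_exceeds = int(np.sum(np.abs(means - mu) >= 0.25))
    results.append((sample_size, num_exceeds, num_exceeds / 1000))

prob_table = pd.DataFrame(
    results, columns=["sample_size", "num_exceeds_tolerance", "probability"])
print(prob_table)
```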
Notice how this count attenuates rapidly as the sample size increases. And so does the corresponding probability P(|X_bar − μ| ≥ 0.25). By the time the sample size reaches 320, the probability has decayed to zero. It blips up to 0.001 occasionally, but that's only because I have drawn a finite number of samples. If I were to draw 10,000 samples each time instead of 1000, not only would the occasional blips flatten out, but the decay of the probabilities would also become smoother.
The following graph plots P(|X_bar − μ| ≥ 0.25) against sample size. It puts in sharp relief how the probability plunges to zero as the sample size grows.
Rather than 0.25 degrees C, what if you chose a different tolerance, either a lower or a higher value? Will the probability decay regardless of your chosen level of tolerance? The following family of plots illustrates the answer to this question.
No matter how frugal, how tiny, your choice of the tolerance (ε) is, the probability P(|X_bar − μ| ≥ ε) will always converge to zero as the sample size grows. This is the weak Law of Large Numbers in action.
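To reproduce a family of curves like this, one option is to wrap the previous experiment in a loop over a handful of tolerance values. The particular values of ε below are illustrative choices of mine, not necessarily the ones used for the plots above:

```python
sizes = list(range(10, 500, 10))
for eps in [0.1, 0.25, 0.5, 1.0]:  # illustrative tolerance values
    probs = []
    for n in sizes:
        means = np.array([
            df["sea_surface_temperature"].sample(n=n).mean()
            for _ in range(1000)
        ])
        probs.append(np.mean(np.abs(means - mu) >= eps))
    plt.plot(sizes, probs, label=f"ε = {eps}")

plt.xlabel("Sample size")
plt.ylabel("P(|X_bar − μ| ≥ ε)")
plt.legend()
plt.show()
```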
The behavior of the weak LLN can be formally stated as follows:
Suppose X1, X2, …, Xn are i.i.d. random variables that together form a random sample of size n. Suppose X_bar_n is the mean of this sample. Suppose also that E(X1) = E(X2) = … = E(Xn) = μ. Then for any positive real number ε, the probability of X_bar_n being at least ε away from μ tends to zero as the size of the sample tends to infinity. The following exquisite equation captures this behavior:

lim (n → ∞) P(|X_bar_n − μ| ≥ ε) = 0
Over the 310-year history of this law, mathematicians have been able to progressively relax the requirement that X1 through Xn be independent and identically distributed while still preserving the spirit of the law.
The concept of "convergence in probability", the "plim" notation, and the art of saying really important things in really few words
This particular way of converging to some value, with probability as the means of transport, is called convergence in probability. In general, it is stated as follows:

lim (n → ∞) P(|X_n − X| ≥ ε) = 0
In the above equation, X_n and X are random variables, and ε is a positive real number. The equation says that as n tends to infinity, X_n converges in probability to X.
Throughout the immense expanse of statistics, you'll keep running into a quietly unassuming notation called plim. It's pronounced 'p lim', or 'plim' (like the word 'plum' but with an 'i'), or 'probability limit'. plim is the shorthand way of saying that a measure such as the mean converges in probability to a particular value. Using plim, the weak Law of Large Numbers can be stated pithily as follows:

plim (n → ∞) X_bar_n = μ
Or simply as:
The brevity of the notation is not in the least surprising. Mathematicians are drawn to brevity like bees to nectar. When it comes to conveying profound truths, mathematics may well be the most ink-efficient field there is. And within this efficiency-obsessed field, plim occupies a podium position. You'll struggle to unearth as profound a concept as plim expressed in a smaller amount of ink, or electrons.
But struggle no more. If the laconic beauty of plim left you wanting more, here's another, possibly even more efficient, notation that conveys the same meaning as plim:

X_bar_n →ᵖ μ
At the top of this article, I mentioned that the weak Law of Large Numbers is noteworthy as much for what it doesn't say as for what it does say. Let me explain what I mean by that. The weak LLN is commonly misinterpreted to mean that as the sample size increases, the sample mean approaches the population mean, or various generalizations of that idea. As we'll see, such notions about the weak LLN have no attachment to reality.
In fact, let's bust a couple of myths regarding the weak LLN right away.
MYTH #1: As the sample size grows, the sample mean tends to the population mean.
This is quite possibly the most frequent misinterpretation of the weak LLN. However, the weak LLN makes no such assertion. To see why, consider the following situation: you've managed to get your arms around a really large sample. While you gleefully admire your achievement, you should also ask yourself the following questions: Just because your sample is large, must it also be well-balanced? What's stopping nature from sucker-punching you with a giant sample that contains an equally giant amount of bias? The answer is absolutely nothing! In fact, isn't that what happened to Guildenstern with his sequence of 92 Heads? It was, after all, a perfectly random sample! If a sample just happens to carry a large bias, then despite the large sample size, that bias will blast the sample mean away to a point that is far from the true population value. Conversely, a small sample can prove to be exquisitely well-balanced. The point is, as the sample size increases, the sample mean isn't guaranteed to dutifully advance toward the population mean. Nature doesn't provide such unnecessary guarantees.
MYTH #2: As the sample size increases, just about everything about the sample (its median, its variance, its standard deviation) converges to the corresponding population values.
This sentence is two myths bundled into one easy-to-carry package. Firstly, the weak LLN postulates a convergence in probability, not in value. Secondly, the weak LLN applies to the convergence in probability of only the sample mean, not any other statistic. The weak LLN doesn't address the convergence of other measures such as the median, variance, or standard deviation.
It's one thing to state the weak LLN, and even to demonstrate how it works using real-world data. But how can you be sure that it always works? Are there circumstances in which it will play spoilsport, situations in which the sample mean simply doesn't converge in probability to the population value? To know that, you must prove the weak LLN and, in doing so, precisely define the conditions under which it applies.
It so happens that the weak LLN has a deliciously mouth-watering proof that uses, as one of its ingredients, the endlessly tantalizing Chebyshev's Inequality. If that whets your appetite, stay tuned for my next article on the proof of the weak Law of Large Numbers.
It would be impolite to take leave of this topic without assuaging our friend Guildenstern's worries. Let's develop an appreciation for just how spectacularly unlikely the result he experienced was. We'll simulate the act of tossing 92 unbiased coins using a pseudo-random generator. Heads will be encoded as 1 and Tails as 0. We'll record the mean value of the 92 outcomes; the mean value is the fraction of tosses that came up Heads. We'll repeat this experiment ten thousand times to obtain ten thousand means of 92 coin tosses each, and we'll plot their frequency distribution. After completing this exercise, we'll get the following kind of histogram plot:
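Here's a minimal sketch of that simulation:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()

# Each row is one experiment: 92 fair coin tosses, with 1 = Heads and 0 = Tails.
tosses = rng.integers(low=0, high=2, size=(10_000, 92))

# The mean of each row is the fraction of Heads in that run of 92 tosses.
means = tosses.mean(axis=1)

plt.hist(means, bins=50)
plt.xlabel("Fraction of Heads in 92 tosses")
plt.ylabel("Frequency")
plt.show()
```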
We see that most of the sample means are grouped around the population mean of 0.5. Guildenstern's result of getting 92 Heads in a row is an exceptionally unlikely outcome, and so the frequency of this outcome is vanishingly small. But contrary to Guildenstern's fears, there's nothing unnatural about the outcome, and the laws of probability continue to operate with their usual gusto. Guildenstern's outcome is simply lurking in the distant regions of the right tail of the plot, waiting with infinite patience to pounce upon some luckless coin-flipper whose only mistake was to be unimaginably unlucky.