In the previous article, we focused on two types of distributions: the Gaussian distribution and the Power Law distribution. We saw that these distributions have diametrically opposite statistical properties. Namely, Power Laws are driven by rare events, while Gaussians are not.
This rare-event-driven property raised three problems with many of our favorite statistical tools (e.g. mean, standard deviation, regression, etc.) when analyzing Power Laws. The takeaway was that if data are Gaussian-like, one can use common approaches like regression and computing expectation values without worry. However, if data are more Power Law-like, these techniques can give incorrect and misleading results.
We also saw a third (more mischievous) distribution that can resemble both a Gaussian and a Power Law (despite their opposite properties): the Log Normal distribution.
This ambiguity presents challenges for practitioners in deciding the best way to analyze a given dataset. To help overcome these challenges, it can be advantageous to determine whether data fit a Power Law, Log Normal, or some other type of distribution.
A popular way of fitting a Power Law to real-world data is what I’ll call the “Log-Log approach” [1]. The idea comes from taking the logarithm of the Power Law’s probability density function (PDF), as derived below.
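As a sketch of this derivation, assuming a Power Law PDF of the form f(x) = C x^(-α) (the convention matching the powerlaw library used later), taking the logarithm of both sides gives

$$\log f(x) = -\alpha \log x + \log C$$

That is, the log-density is linear in log x, i.e. of the form y = mx + b. The same holds for the complementary CDF,

$$\log P(X > x) = (1 - \alpha)\log x + \text{const},$$

whose slope m gives α = -m + 1.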
The above derivation translates the Power Law’s PDF definition into a linear equation, as shown in the figure below.
This implies that the histogram of data following a Power Law will follow a straight line on a log-log plot. In practice, this looks like generating a histogram for some data and plotting it on a log-log plot [1]. One might go even further and perform a linear regression to estimate the distribution’s α value (here, α = -m+1).
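Here is a minimal sketch of this regression procedure (all names are illustrative; note that for a density histogram the slope gives α = -m, while the α = -m + 1 relation applies to the CCDF, as sketched above):

import numpy as np

# illustrative sample from a Power Law (PDF ~ x^(-3) for x >= 1)
rng = np.random.default_rng(0)
data = rng.pareto(2, 10_000) + 1

# histogram with logarithmic bins, normalized to a density
counts, edges = np.histogram(data, bins=np.logspace(0, 3, 30), density=True)
centers = np.sqrt(edges[:-1] * edges[1:])  # geometric bin centers

# linear regression on the log-log points (ignoring empty bins)
mask = counts > 0
m, b = np.polyfit(np.log10(centers[mask]), np.log10(counts[mask]), 1)
print("slope m =", m)  # roughly -3 here, i.e. alpha = -m = 3 for a density plot

Note that this simple regression is exactly the kind of estimate that suffers from the systematic errors discussed next.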
However, there are significant limitations to this approach. These are described in reference [1] and summarized below.
- Slope (hence α) estimates are subject to systematic errors
- Regression errors can be hard to estimate
- Fit can look good even when the distribution doesn’t follow a Power Law
- Fits may not obey basic requirements for probability distributions, e.g. normalization
While the Log-Log approach is straightforward to implement, its limitations make it less than optimal. Instead, we can turn to a more mathematically sound approach via Maximum Likelihood, a widely used statistical method for inferring the best parameters of a model given some data.
Maximum Likelihood consists of two key steps. Step 1: obtain a likelihood function. Step 2: maximize the likelihood with respect to your model parameters.
Step 1: Write Likelihood Function
Likelihood is a special kind of probability. Put simply, it quantifies the probability of our data given a particular model. We can express it as the joint probability over all our observed data [3]. In the case of a Pareto distribution, we can write this as follows.
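As a sketch, assuming the continuous Pareto PDF with known x_min (the form used in [1]):

$$L(\alpha) = \prod_{i=1}^{n} f(x_i;\alpha) = \prod_{i=1}^{n} \frac{\alpha - 1}{x_{\min}} \left(\frac{x_i}{x_{\min}}\right)^{-\alpha}$$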
To make maximizing the likelihood a little easier, it’s customary to work with the log-likelihood (both are maximized by the same value of α).
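Under the same Pareto form, taking the logarithm of the likelihood above gives:

$$l(\alpha) = \ln L(\alpha) = n \ln(\alpha - 1) - n \ln x_{\min} - \alpha \sum_{i=1}^{n} \ln\frac{x_i}{x_{\min}}$$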
Step 2: Maximize Likelihood
With a (log) likelihood function in hand, we can now frame the task of determining the best choice of parameters as an optimization problem. To find the optimal α value for our data, this boils down to setting the derivative of l(α) with respect to α equal to zero and then solving for α. A derivation of this is given below.
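Continuing the sketch from the log-likelihood above:

$$\frac{dl}{d\alpha} = \frac{n}{\alpha - 1} - \sum_{i=1}^{n} \ln\frac{x_i}{x_{\min}} = 0 \quad\Longrightarrow\quad \hat{\alpha} = 1 + n\left[\sum_{i=1}^{n} \ln\frac{x_i}{x_{\min}}\right]^{-1}$$

This matches the estimator given in [1].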
This gives us the so-called Maximum Likelihood estimator for α. With it, we can plug in observed values of x to estimate a Pareto distribution’s α value.
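As a minimal sketch (not the powerlaw library’s implementation, which we use later), the estimator can be computed directly:

import numpy as np

def alpha_mle(x, x_min):
    # Maximum Likelihood estimate of the Pareto exponent alpha,
    # assuming the continuous Pareto form with known x_min
    x = np.asarray(x)
    x = x[x >= x_min]  # the Pareto PDF is only defined for x >= x_min
    return 1 + len(x) / np.sum(np.log(x / x_min))

For the artificial Pareto data generated in the next section, alpha_mle(s_pareto, 1) should land near the true exponent of 3.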
With the theoretical foundation set, let’s see what this looks like when applied to real-world data (from my social media accounts).
One domain in which fat-tailed data are prevalent is social media. For instance, a small percentage of creators get the majority of the attention, a minority of Medium blogs get the majority of reads, and so on.
Here we’ll use the powerlaw Python library to determine whether data from my various social media channels (i.e. Medium, YouTube, LinkedIn) truly follow a Power Law distribution. The data and code for these examples are available on the GitHub repository.
Artificial Data
Before applying the Maximum Likelihood-based approach to messy data from the real world, let’s see what happens when we apply this method to artificial data (truly) generated from Pareto and Log Normal distributions, respectively. This will help ground our expectations before using the approach on data for which we do not know the “true” underlying distribution class.
First, we import some helpful libraries.
import numpy as np
import matplotlib.pyplot as plt
import powerlaw
import pandas as pd

np.random.seed(0)
Next, let’s generate data from Pareto and Log Normal distributions.
# power law data
a = 2
x_min = 1
n = 1_000
x = np.linspace(0, n, n+1)
s_pareto = (np.random.pareto(a, len(x)) + 1) * x_min

# log normal data
m = 10
s = 1
s_lognormal = np.random.lognormal(m, s, len(x)) * s * np.sqrt(2*np.pi)
To get a sense of what these data look like, it’s helpful to plot histograms. Here, I plot a histogram of each sample’s raw values and of the log of the raw values, as sketched below. The latter makes it easier to distinguish between Power Law and Log Normal data visually.
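A minimal sketch of this plotting step (the binning and styling of the original figures may differ):

# histograms of raw and log values for both samples
fig, axes = plt.subplots(2, 2, figsize=(10, 6))
axes[0, 0].hist(s_pareto, bins=100)
axes[0, 0].set_title("Pareto: raw values")
axes[0, 1].hist(np.log(s_pareto), bins=100)
axes[0, 1].set_title("Pareto: log values")
axes[1, 0].hist(s_lognormal, bins=100)
axes[1, 0].set_title("Log Normal: raw values")
axes[1, 1].hist(np.log(s_lognormal), bins=100)
axes[1, 1].set_title("Log Normal: log values")
plt.tight_layout()
plt.show()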
As we can see from the above histograms, the distributions of raw values look qualitatively similar for both distributions. However, there is a stark difference in the log distributions. Namely, the log of the Power Law data is highly skewed and not mean-centered, while the log of the Log Normal data is reminiscent of a Gaussian distribution.
Now, we can use the powerlaw library to fit a Power Law to each sample and estimate values for α and x_min. Here’s what that looks like for our Power Law sample.
# fit power law to power law data
results = powerlaw.Fit(s_pareto)

# printing results
print("alpha = " + str(results.power_law.alpha)) # note: powerlaw lib's alpha definition differs from the standard one, i.e. a_powerlawlib = a_standard + 1
print("x_min = " + str(results.power_law.xmin))
print('p = ' + str(compute_power_law_p_val(results))) # helper function defined in the GitHub repo
# Calculating best minimal value for power law fit
# alpha = 2.9331912195958676
# x_min = 1.2703447024073973
# p = 0.999
The fit does a decent job of estimating the true parameter values (i.e. a=3, x_min=1), as seen by the alpha and x_min values printed above. The value p quantifies the quality of the fit; a higher p means a better fit (more on this value in section 4.1 of ref [1]).
We can do the same thing for the Log Normal sample.
# fit power law to log normal data
results = powerlaw.Fit(s_lognormal)
print("alpha = " + str(results.power_law.alpha)) # note: powerlaw lib's alpha definition differs from the standard one, i.e. a_powerlawlib = a_standard + 1
print("x_min = " + str(results.power_law.xmin))
print('p = ' + str(compute_power_law_p_val(results)))

# Calculating best minimal value for power law fit
# alpha = 2.5508694755027337
# x_min = 76574.4701482522
# p = 0.999
We can see that the Log Normal sample also fits a Power Law distribution well (p=0.999). Notice, however, that the x_min value is far in the tail. While this can be useful for some use cases, it doesn’t tell us much about the distribution that best fits all the data in the sample.
To overcome this, we can manually set the x_min value to the sample minimum and redo the fit.
# fixing xmin so that the fit must include all the data
results = powerlaw.Fit(s_lognormal, xmin=np.min(s_lognormal))
print("alpha = " + str(results.power_law.alpha))
print("x_min = " + str(results.power_law.xmin))

# alpha = 1.3087955873576855
# x_min = 2201.318351239509
The .Fit() method also automatically generates estimates for a Log Normal distribution.
print("mu = " + str(results.lognormal.mu))
print("sigma = " + str(results.lognormal.sigma))# mu = 10.933481999687547
# sigma = 0.9834599169175509
The estimated Log Normal parameter values are close to the actual values (mu=10, sigma=1), so the fit did a good job once again!
However, by fixing x_min, we lost our quality metric p (for whatever reason, the method doesn’t generate values for it when x_min is provided). This raises the question: which distribution’s parameters should we go with, the Power Law’s or the Log Normal’s?
To answer this question, we can compare the Power Law fit to other candidate distributions via log-likelihood ratios (R). A positive R implies the Power Law is a better fit, while a negative R implies the alternative distribution is better. Additionally, each comparison gives us a significance value (p). This is demonstrated in the code block below.
distribution_list = ['lognormal', 'exponential', 'truncated_power_law',
                     'stretched_exponential', 'lognormal_positive']

for distribution in distribution_list:
    R, p = results.distribution_compare('power_law', distribution)
    print("power law vs " + distribution +
          ": R = " + str(np.round(R,3)) +
          ", p = " + str(np.round(p,3)))
# power law vs lognormal: R = -776.987, p = 0.0
# power law vs exponential: R = -737.24, p = 0.0
# power law vs truncated_power_law: R = -419.958, p = 0.0
# power law vs stretched_exponential: R = -737.289, p = 0.0
# power law vs lognormal_positive: R = -776.987, p = 0.0
As shown above, every alternative distribution is preferred over the Power Law when including all the data in the Log Normal sample. Additionally, based on the likelihood ratios, the lognormal and lognormal_positive fits work best.
Real-world Data
Now that we’ve applied the powerlaw library to data where we know the ground truth, let’s try it on data for which the underlying distribution is unknown.
We will follow a similar procedure as above, but with data from the real world. Here, we’ll analyze the following data: monthly followers gained on my Medium profile, earnings across all my YouTube videos, and daily impressions on my LinkedIn posts over the past year.
We’ll start by plotting histograms.
Two things jump out at me from these plots. One, all three look more like the Log Normal histograms than the Power Law histograms we saw before. Two, the Medium and YouTube distributions are sparse, meaning they may have insufficient data for drawing strong conclusions.
Next, we’ll apply the Power Law fit to all three distributions while setting x_min as the smallest value in each sample, as sketched below. The results of this are printed below.
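A minimal sketch of this step, assuming each dataset has already been loaded as a 1-D numpy array (the variable names medium_followers, youtube_earnings, and linkedin_impressions are hypothetical):

# fit a Power Law to each dataset, forcing x_min to the sample minimum
datasets = {"Medium followers": medium_followers,
            "YouTube earnings": youtube_earnings,
            "LinkedIn impressions": linkedin_impressions}

for name, data in datasets.items():
    results = powerlaw.Fit(data, xmin=np.min(data))
    print(name + ": alpha = " + str(results.power_law.alpha) +
          ", x_min = " + str(results.power_law.xmin))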
To determine which distribution is best, we can again do head-to-head comparisons of the Power Law fit against the alternatives. These results are given below.
Using the rule-of-thumb significance cutoff of p < 0.1, we can draw the following conclusions: Medium followers and LinkedIn impressions best fit a Log Normal distribution, while a Power Law best represents YouTube earnings.
Of course, since the Medium followers and YouTube earnings data here are limited (N<100), we should take any conclusions from those data with a grain of salt.
Many standard statistical tools break down when applied to data following a Power Law distribution. Accordingly, detecting Power Laws in empirical data can help practitioners avoid incorrect analyses and misleading conclusions.
However, Power Laws are an extreme case of the more general phenomenon of fat tails. In the next article of this series, we’ll take this work one step further and quantify fat-tailedness for any given dataset via four handy heuristics.