Exploring model specification with Google’s LightweightMMM
Bayesian marketing mix modeling has been receiving increasing attention, especially with the recent releases of open-source tools like LightweightMMM (Google) or PyMC Marketing (PyMC Labs). Although these frameworks simplify the complexities of Bayesian modeling, it is still crucial for the user to understand fundamental Bayesian concepts and be able to grasp the model specification.
In this article, I take Google's LightweightMMM as a practical example and show the intuition and meaning behind the prior specifications of this framework. I demonstrate how to simulate prior samples using Python and the scipy library.
I use the dataset made available by Robyn under the MIT License.
The dataset consists of 208 weeks of revenue (from 2015-11-23 to 2019-11-11) and contains:
- 5 media spend channels: tv_S, ooh_S, print_S, facebook_S, search_S
- 2 media channels that additionally have exposure information (impressions, clicks): facebook_I, search_clicks_P
- Organic media without spend: newsletter
- Control variables: events, holidays, competitor sales (competitor_sales_B)
The specification of the LightweightMMM model is defined as follows:
This specification represents an additive linear regression model that explains the value of the response (target variable) at a particular time point t.
Let’s break down each component within the equation:
- α: This component represents the intercept, or the baseline value of the response. It is the expected value of the response when all other factors are zero.
- trend: This component captures the increasing or decreasing trend of the response over time.
- seasonality: This component represents periodic fluctuations within the response.
- media_channels: This component accounts for the influence of media channels (television, radio, online ads) on the response.
- other_factors: This component encompasses other variables that influence the response, such as weather, economic indicators, or competitor activities.
Below, I go through each of the components in detail and explain how to interpret the prior specifications. As a reminder, a prior distribution is an assumed distribution of a parameter without any knowledge of the underlying data.
Intercept
The intercept is defined to follow a half-normal distribution with a standard deviation of 2. A half-normal distribution is a continuous probability distribution that resembles a normal distribution but is restricted to positive values only. The distribution is characterized by a single parameter, the standard deviation (scale). The half-normal distribution implies that the intercept can take only positive values.
The following code generates samples from the prior distribution of the intercept and visualizes the probability density function (PDF) of a half-normal distribution with a scale of 2. For visualizations of the other components, please refer to the accompanying source code in the GitHub repo.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

scale = 2
halfnormal_dist = stats.halfnorm(scale=scale)
samples = halfnormal_dist.rvs(size=1000)
plt.figure(figsize=(20, 6))
sns.histplot(samples, bins=50, kde=False, stat='density', alpha=0.5)
sns.lineplot(x=np.linspace(0, 6, 100),
             y=halfnormal_dist.pdf(np.linspace(0, 6, 100)), color='r')
plt.title(f"Half-Normal Distribution with scale={scale}")
plt.xlabel('x')
plt.ylabel('P(X=x)')
plt.show()
Trend
The trend is defined as a power-law relationship between time t and the trend value. The parameter μ represents the amplitude or magnitude of the trend, while k controls the steepness or curvature of the trend.
The parameter μ is drawn from a normal distribution with a mean of 0 and a standard deviation of 1, i.e. a standard normal distribution centered around 0. The normal distribution allows for both positive and negative values of μ, representing upward or downward trends, respectively.
The parameter k is drawn from a uniform distribution between 0.5 and 1.5. The uniform distribution ensures that k takes values that result in a reasonable and meaningful curvature of the trend.
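To build intuition, we can draw one sample of each trend parameter from its prior and compute the resulting trend curve. This is a minimal sketch with scipy; the variable names are mine, not LightweightMMM's:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
t = np.arange(208)  # weekly time index, matching the 208-week dataset

# One prior sample per trend parameter
mu = stats.norm(loc=0, scale=1).rvs(random_state=rng)        # amplitude, may be negative
k = stats.uniform(loc=0.5, scale=1.0).rvs(random_state=rng)  # exponent in [0.5, 1.5]

trend = mu * t ** k  # power-law trend over time
```

Drawing many such samples shows the range of trend shapes the prior considers plausible before seeing any data.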
The plot below depicts separate components obtained from the prior distributions: a sample of the intercept and trend, each represented individually.
Seasonality
Each coefficient γ is drawn from a normal distribution with a mean of 0 and a standard deviation of 1.
By combining cosine and sine functions with different γ, cyclic patterns can be modeled to capture the seasonality present in the data. The cosine and sine functions represent the oscillating behavior observed over a period of 52 units (weeks).
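A prior sample of the seasonality can be simulated by drawing the γ coefficients and summing the corresponding cosine and sine terms. The sketch below assumes two harmonics over a 52-week period; the number of harmonics is my assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(208)
period = 52       # weeks per yearly cycle
n_harmonics = 2   # assumed number of cosine/sine pairs

seasonality = np.zeros(len(t))
for d in range(1, n_harmonics + 1):
    gamma_cos, gamma_sin = rng.normal(0, 1, size=2)  # gamma ~ Normal(0, 1)
    seasonality += (gamma_cos * np.cos(2 * np.pi * d * t / period)
                    + gamma_sin * np.sin(2 * np.pi * d * t / period))
```

By construction the resulting curve repeats exactly every 52 weeks.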
The plot below illustrates a sample of the seasonality, intercept and trend obtained from the prior distributions.
Other factors (control variables)
Each factor coefficient λ is drawn from a normal distribution with a mean of 0 and a standard deviation of 1, which means that λ can take positive or negative values, representing the direction and magnitude of the influence each factor has on the outcome.
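As a small sketch (with hypothetical standardized control data, since the real series come from the Robyn dataset), the control contribution at each time point is simply a weighted sum of the factors:

```python
import numpy as np

rng = np.random.default_rng(1)
n_weeks, n_controls = 208, 4  # e.g. competitor_sales_B, newsletter, holidays, events

# Hypothetical standardized control variables
Z = rng.normal(size=(n_weeks, n_controls))

lam = rng.normal(0, 1, size=n_controls)  # lambda_i ~ Normal(0, 1)
control_effect = Z @ lam                 # additive contribution per week
```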
The plot below depicts separate components obtained from the prior distributions: a sample of the intercept, trend, seasonality and control variables (competitor_sales_B, newsletter, holidays and events) each represented individually.
Media Channels
The β coefficient of a media channel m is specified to follow a half-normal distribution, where the standard deviation parameter v is determined by the sum of the total cost associated with media channel m. The total cost reflects the investment or resources allocated to that particular media channel.
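A rough sketch of this prior: each channel's β is drawn from a half-normal whose scale depends on the channel's spend. The exact scaling used below (share of total spend) is my assumption for illustration; refer to the LightweightMMM source for the precise definition:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical total spend per channel: tv_S, ooh_S, print_S, facebook_S, search_S
total_cost = np.array([5.0, 2.0, 1.5, 3.0, 2.5])
v = total_cost / total_cost.sum()  # assumed scale: share of total spend

# beta_m ~ HalfNormal(v_m): larger spend implies a wider prior for that channel
beta = np.array([stats.halfnorm(scale=s).rvs(random_state=rng) for s in v])
```

The half-normal support guarantees non-negative media effects, which encodes the assumption that advertising cannot reduce revenue.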
Media Transformations
In these equations, we model the media channels' behavior using a series of transformations, such as adstock and Hill saturation.
The variable media_channels represents the transformed media channel value at time point t. It is obtained by applying a transformation to the raw media channel value x. The Hill transformation is controlled by the parameters K, the half-saturation point (0 < K ≤ 1), and the shape S, which controls the steepness of the curve (S > 0).
The variable x* represents the transformed media channel value at time t after undergoing the adstock transformation. It is calculated by adding the current raw media channel value to the product of the previous transformed value and the adstock decay parameter λ.
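The two transformations can be written as short functions. The sketch below implements geometric adstock and the Hill curve as described above; it is an illustration, not LightweightMMM's exact code:

```python
import numpy as np

def adstock(x, lam):
    """Geometric adstock: x*_t = x_t + lam * x*_{t-1}."""
    out = np.empty(len(x))
    carry = 0.0
    for i, value in enumerate(x):
        carry = value + lam * carry
        out[i] = carry
    return out

def hill(x, K, S):
    """Hill saturation: x^S / (x^S + K^S), equal to 0.5 at x = K."""
    return x ** S / (x ** S + K ** S)

spend = np.array([1.0, 0.0, 0.0, 2.0])
carried = adstock(spend, lam=0.5)        # carryover of past spend
saturated = hill(carried, K=0.5, S=1.0)  # diminishing returns
```

Note how the single spend of 1.0 in the first week decays geometrically (1.0, 0.5, 0.25, ...) until the next spend arrives.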
Parameters K and S follow gamma distributions with shape and scale parameters each set to 1, while λ follows a beta distribution with shape parameters 2 and 1.
The probability density function of the Hill saturation parameters K and S is illustrated in the plot below:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

shape = 1
scale = 1
gamma_dist = stats.gamma(a=shape, scale=scale)
samples = gamma_dist.rvs(size=1000)
plt.figure(figsize=(20, 6))
sns.histplot(samples, bins=50, kde=False, stat='density', alpha=0.5)
sns.lineplot(x=np.linspace(0, 6, 100), y=gamma_dist.pdf(np.linspace(0, 6, 100)), color='r')
plt.title(f"Gamma Distribution for $K_m$ and $S_m$ with shape={shape} and scale={scale}")
plt.xlabel('x')
plt.ylabel('P(X=x)')
# Show the plot
plt.show()
The probability density function of the adstock parameter λ is shown in the plot below:
A Note on the specification of the adstock parameter λ:
The probability density function of the Beta(α = 2, β = 1) distribution increases with λ, meaning that higher values have a higher probability density. In media analysis, different industries and media activities may exhibit different decay rates, with most media channels typically showing small decay rates. For instance, Robyn suggests the following ranges of λ decay for common media channels: TV (0.3–0.8), OOH/Print/Radio (0.1–0.4), and digital (0–0.3).
In the context of the Beta(α = 2, β = 1) distribution, higher probabilities are assigned to λ values closer to 1, while lower probabilities are assigned to values closer to 0. Consequently, outcomes near the upper end of the interval [0, 1] are more likely to occur than outcomes near the lower end.
Alternatively, in Bayesian Methods for Media Mix Modeling with Carryover and Shape Effects, the decay parameter is defined as Beta(α = 3, β = 3), whose probability density function is illustrated below. This distribution is symmetric around 0.5, assigning equal likelihood to outcomes at the same distance below and above 0.5.
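The difference between the two priors is easy to verify numerically: Beta(2, 1) has density 2λ, rising toward 1, while Beta(3, 3) is symmetric around 0.5:

```python
import numpy as np
from scipy import stats

lam = np.linspace(0, 1, 101)
pdf_lmm = stats.beta(a=2, b=1).pdf(lam)        # density 2*lam, favors slow decay (lam near 1)
pdf_carryover = stats.beta(a=3, b=3).pdf(lam)  # symmetric, mass concentrated around 0.5
```

In practice, this means the Beta(2, 1) prior pulls estimated carryover upward relative to the symmetric alternative, which matters for channels where Robyn's suggested decay ranges are small.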
The plot below depicts separate components obtained from the prior distributions: a sample of the intercept, trend, seasonality, control variables and media channels, each represented individually.
Combining all components
As mentioned earlier, LightweightMMM specifies an additive linear regression that combines the various components, such as intercept, trend, seasonality, media channels, and other factors, sampled from their prior distributions to obtain the predicted response. The plot below visualizes the true response and a predicted response sampled from the prior predictive distribution.
Visualizing a single sample against the true response allows us to examine how the model's prediction compares to the actual outcome for a particular set of parameter values. It can provide an intuitive understanding of how the model performs in that particular instance.
Prior predictive check
In order to get more robust insights, it is advisable to sample from the prior predictive distribution multiple times and measure the uncertainty. The prior predictive check helps assess the adequacy of the chosen model and evaluate whether the model's predictions align with our expectations, before observing any actual data.
The plot below visualizes the prior predictive distribution by showing the predicted revenue (mean) at each point, together with measures of uncertainty. We can see that the true revenue falls within one standard deviation of the mean, indicating that the model specification is suitable for the observed data.
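The idea can be sketched by repeatedly sampling every parameter from its prior, computing the implied response, and summarizing the draws with a mean and standard deviation. The simplified sketch below omits the media and control terms for brevity:

```python
import numpy as np

rng = np.random.default_rng(3)
n_draws, n_weeks = 500, 208
t = np.arange(n_weeks)

draws = np.empty((n_draws, n_weeks))
for i in range(n_draws):
    intercept = abs(rng.normal(0, 2))                # HalfNormal(2)
    mu, k = rng.normal(0, 1), rng.uniform(0.5, 1.5)  # trend priors
    gamma_cos, gamma_sin = rng.normal(0, 1, size=2)  # seasonality priors
    season = (gamma_cos * np.cos(2 * np.pi * t / 52)
              + gamma_sin * np.sin(2 * np.pi * t / 52))
    draws[i] = intercept + mu * t ** k + season

mean = draws.mean(axis=0)  # prior predictive mean per week
std = draws.std(axis=0)    # uncertainty band per week
```

Plotting mean ± std against the true revenue gives the kind of visual check described above.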
Bayesian marketing mix modeling may take considerable time to master. I hope this article helped you strengthen your understanding of prior distributions and Bayesian marketing model specifications.
The complete code can be downloaded from my GitHub repo.
Thanks for reading!