We can use other prior distributions on our parameters to create more interesting regularizations. You could even say that your parameters *w* are normally distributed but **correlated** with some covariance matrix Σ.

Let us assume that Σ is positive-definite, i.e. we are in the non-degenerate case. Otherwise, there is no density p(*w*).

If you do the math, you can find that we then have to minimize

|*y* − *Xw*|² + |Γ*w*|²

for some matrix Γ. **Note:** Γ is invertible and we have Σ⁻¹ = ΓᵀΓ. This is called **Tikhonov regularization**.

**Hint:** start with the indisputable fact that

p(*w* | *X*, *y*) ∝ p(*y* | *X*, *w*) · p(*w*)

and do not forget that positive-definite matrices can be decomposed into a product of an invertible matrix and its transpose.
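Carrying the hint through looks roughly as follows (a sketch, assuming the usual Gaussian noise model *y* = *Xw* + ε with ε ~ N(0, σ²*I*) and the prior *w* ~ N(0, Σ)):

```latex
\hat{w} = \arg\max_w \, p(w \mid X, y)
        = \arg\max_w \, p(y \mid X, w)\, p(w)
        = \arg\min_w \big[ -\log p(y \mid X, w) - \log p(w) \big]
        = \arg\min_w \left[ \tfrac{1}{2\sigma^2} \lVert y - Xw \rVert^2
                            + \tfrac{1}{2}\, w^\top \Sigma^{-1} w \right].
```

Writing Σ⁻¹ = ΓᵀΓ (possible because Σ⁻¹ is positive-definite) and absorbing the constant σ into Γ turns the second term into |Γ*w*|², which gives the Tikhonov objective.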

Great, so we have defined our model and know what we want to optimize. But how can we optimize it, i.e. learn the best parameters that minimize the loss function? And when is there a unique solution? Let's find out.

## Ordinary Least Squares

Let us assume that we don't regularize and don't use sample weights. Then, the MSE can be written as

MSE(*w*) = (1/*n*) Σᵢ (*y*ᵢ − *w*ᵀ*x*ᵢ)².

This is quite abstract, so let us write it differently as

MSE(*w*) = (1/*n*) |*y* − *Xw*|².

Using matrix calculus, you can take the derivative of this function with respect to *w* (we assume that the bias term *b* is included there):

∇MSE(*w*) = (2/*n*) *X*ᵀ(*Xw* − *y*).

If you set this gradient to zero, you end up with

*X*ᵀ*Xw* = *X*ᵀ*y*.

If the (*n* × *k*)-matrix *X* has rank *k*, so does the (*k* × *k*)-matrix *X*ᵀ*X*, i.e. it is invertible. *Why?* It follows from rank(*X*) = rank(*X*ᵀ*X*).

In this case, we get the **unique solution**

*w* = (*X*ᵀ*X*)⁻¹*X*ᵀ*y*.

**Note:** Software packages don't compute the solution like this but instead use gradient descent or other iterative techniques, since that is faster. Still, the formula is nice and gives us some high-level insights into the problem.

But is this really a minimum? We can find out by computing the Hessian, which is (up to a positive factor) *X*ᵀ*X*. The matrix is positive-semidefinite since *w*ᵀ*X*ᵀ*Xw* = |*Xw*|² ≥ 0 for any *w*. It is even **strictly** positive-definite since *X*ᵀ*X* is invertible, i.e. 0 is not an eigenvalue, so our optimal *w* is indeed a minimizer of our problem.
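The closed-form solution is easy to verify numerically. Here is a minimal NumPy sketch on synthetic data (the toy setup and all names are illustrative), cross-checked against NumPy's own least-squares solver:

```python
import numpy as np

# Toy data: n = 100 samples, k = 3 features
# (a column of ones plays the role of the bias term b).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
true_w = np.array([1.0, 2.0, -3.0])
y = X @ true_w + 0.1 * rng.normal(size=100)

# Closed-form OLS solution: solve XᵀX w = Xᵀy.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_normal, w_lstsq))  # True
```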

## Perfect Multicollinearity

That was the friendly case. But what happens if *X* has a rank smaller than *k*? This can happen if we have two features in our dataset where one is a multiple of the other, e.g. we use the features *height (in m)* and *height (in cm)* in our dataset. Then we have *height (in cm)* = 100 · *height (in m)*.

It can also happen if we one-hot encode categorical data and don't drop one of the columns. For example, if we have a feature *color* in our dataset that can be red, green, or blue, then we can one-hot encode it and end up with three columns *color_red*, *color_green*, and *color_blue*. For these features, we have *color_red* + *color_green* + *color_blue* = 1, which induces perfect multicollinearity as well.

In these cases, the rank of *X*ᵀ*X* is smaller than *k*, so this matrix is not invertible.
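The *height (in m)* / *height (in cm)* example can be reproduced in a few lines (a NumPy sketch with made-up data):

```python
import numpy as np

# Features: intercept, height in m, height in cm.
# The third column is exactly 100x the second, so rank(X) = 2 < k = 3.
rng = np.random.default_rng(1)
height_m = rng.uniform(1.5, 2.0, size=50)
X = np.column_stack([np.ones(50), height_m, 100 * height_m])

print(np.linalg.matrix_rank(X))  # 2

# The smallest singular value of XᵀX is numerically zero: XᵀX is singular.
print(np.linalg.svd(X.T @ X, compute_uv=False))
```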

End of story.

Or not? Actually, no, because it can mean one of two things: (*X*ᵀ*X*)*w* = *X*ᵀ*y* has

- no solution or
- infinitely many solutions.

It turns out that in our case, we can obtain one solution using the Moore-Penrose inverse. This means that we are in the case of infinitely many solutions, all of them giving us the same (training) mean squared error loss.

If we denote the Moore-Penrose inverse of *A* by *A*⁺, we can solve the linear system of equations as

*w* = (*X*ᵀ*X*)⁺*X*ᵀ*y*.

To get the other infinitely many solutions, just add any element of the null space of *X*ᵀ*X* to this particular solution.
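A sketch of this in NumPy, reusing the rank-deficient height data (`null_dir` is one hand-picked null-space direction, chosen so that *X* maps it to zero):

```python
import numpy as np

rng = np.random.default_rng(2)
height_m = rng.uniform(1.5, 2.0, size=50)
X = np.column_stack([np.ones(50), height_m, 100 * height_m])  # rank-deficient
y = 5 * height_m + rng.normal(size=50)

# One particular solution of (XᵀX) w = Xᵀy via the Moore-Penrose inverse.
w_pinv = np.linalg.pinv(X.T @ X) @ (X.T @ y)

# Any vector in the null space of XᵀX (equivalently, of X) can be added
# without changing the predictions X @ w.
null_dir = np.array([0.0, 100.0, -1.0])  # X @ null_dir = 0
w_other = w_pinv + 3.14 * null_dir
print(np.allclose(X @ w_pinv, X @ w_other))  # True
```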

## Minimization With Tikhonov Regularization

Recall that we could put a prior distribution on our weights. We then had to minimize

|*y* − *Xw*|² + |Γ*w*|²

for some invertible matrix Γ. Following the same steps as in ordinary least squares, i.e. taking the derivative with respect to *w* and setting the result to zero, the solution is

*w* = (*X*ᵀ*X* + ΓᵀΓ)⁻¹*X*ᵀ*y*.

The neat part:

*X*ᵀ*X* + ΓᵀΓ is always invertible!

Let us find out why. It suffices to show that the null space of *X*ᵀ*X* + ΓᵀΓ is only {0}. So, let us take a *w* with (*X*ᵀ*X* + ΓᵀΓ)*w* = 0. Now, our goal is to show that *w* = 0.

From (*X*ᵀ*X* + ΓᵀΓ)*w* = 0 it follows that

0 = *w*ᵀ(*X*ᵀ*X* + ΓᵀΓ)*w* = |*Xw*|² + |Γ*w*|²,

which in turn implies |Γ*w*| = 0, i.e. Γ*w* = 0. Since Γ is invertible, *w* has to be 0. Using the same calculation, we can see that the Hessian is positive-definite here as well.
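To see the invertibility claim numerically, here is a sketch that reuses the rank-deficient height data with the simple choice Γ = √α·*I* (which is plain ridge regression; any invertible Γ coming from a correlated prior would work the same way):

```python
import numpy as np

rng = np.random.default_rng(3)
height_m = rng.uniform(1.5, 2.0, size=50)
X = np.column_stack([np.ones(50), height_m, 100 * height_m])  # XᵀX is singular
y = 5 * height_m + rng.normal(size=50)

# Gamma = sqrt(alpha) * I recovers plain ridge regularization.
alpha = 0.1
Gamma = np.sqrt(alpha) * np.eye(X.shape[1])

A = X.T @ X + Gamma.T @ Gamma
print(np.linalg.matrix_rank(A))  # 3: full rank, hence invertible

# The regularized normal equations now have a unique solution.
w = np.linalg.solve(A, X.T @ y)
print(w)
```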