## Understand the logic behind the fundamental algorithm used inside gradient descent

In time series analysis, there is often a need to estimate the trend direction of a sequence by taking previous values into account. Approximating the next values in a sequence can be done in several ways, from simple baselines to advanced machine learning models.

An **exponential (weighted) moving average** is a robust trade-off between these two approaches. Having a simple recursive formula under the hood makes the algorithm efficient to implement. At the same time, it is very flexible and can be successfully adapted to many varieties of sequences.

This article covers the motivation behind the method, a description of its workflow, and bias correction, an effective technique to overcome a bias obstacle in approximation.

Imagine the problem of approximating a given parameter that changes in time. On every iteration, we are aware of all of its previous values. The objective is to predict the next value, which depends on the previous values.

One of the naive strategies is to simply take the average of the last several values. This might work in certain cases, but it is not very suitable for scenarios where a parameter depends more strongly on the most recent values.
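A minimal sketch of this baseline (the window size n and the function name are my own choices for illustration):

```python
def simple_moving_average(values, n=3):
    """Predict the next value as the plain average of the last n observations."""
    window = values[-n:]  # every value in the window counts equally, regardless of age
    return sum(window) / len(window)

# The oldest and newest values in the window have identical influence
print(simple_moving_average([10, 12, 30], n=3))
```

Because every value in the window gets the same weight, a sudden recent change gets diluted by older observations, which is exactly the drawback described above.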

One of the possible ways to overcome this issue is to assign higher weights to more recent values and lower weights to older values. The exponential moving average is exactly a technique that follows this principle. **It is based on the assumption that more recent values of a variable contribute more to the formation of the next value than older values.**

To understand how the exponential moving average works, let us look at its recursive equation:

vₜ = β · vₜ₋₁ + (1 − β) · θₜ

- vₜ is a time series that approximates a given variable. Its index t corresponds to the timestamp t. Since this formula is recursive, the value v₀ for the initial timestamp t = 0 is required. In practice, v₀ is usually taken as 0.
- θₜ is the observation at the current iteration.
- β is a hyperparameter between 0 and 1 which defines how the weight should be distributed between the previous average value vₜ₋₁ and the current observation θₜ.
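The recursion above takes only a few lines of Python (a minimal sketch; the function and variable names are mine):

```python
def exponential_moving_average(observations, beta=0.9, v0=0.0):
    """Apply v_t = beta * v_{t-1} + (1 - beta) * theta_t to each observation."""
    averages = []
    v = v0
    for theta in observations:
        v = beta * v + (1 - beta) * theta  # the recursive update
        averages.append(v)
    return averages

# With beta = 0.5, each new observation and the running average weigh equally
print(exponential_moving_average([1.0, 2.0, 3.0], beta=0.5))  # [0.5, 1.25, 2.125]
```

Note that the whole history is summarized by the single number v, which is what makes the method so cheap to run online.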

Let us write out this formula for the first several parameter values:

v₁ = β · v₀ + (1 − β) · θ₁

v₂ = β · v₁ + (1 − β) · θ₂ = β² · v₀ + (1 − β) · (β · θ₁ + θ₂)

v₃ = β · v₂ + (1 − β) · θ₃ = β³ · v₀ + (1 − β) · (β² · θ₁ + β · θ₂ + θ₃)

As a result, the final formula looks like this:

vₜ = βᵗ · v₀ + (1 − β) · (θₜ + β · θₜ₋₁ + β² · θₜ₋₂ + … + βᵗ⁻¹ · θ₁)

We can see that the most recent observation θₜ has a weight of 1, the second most recent β, the third β², and so on. Since 0 < β < 1, the multiplication term βᵏ decays exponentially as k grows, *so the older the observation, the less important it is*. Finally, every sum term is multiplied by (1 − β).
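We can verify numerically that the unrolled weighted sum matches the recursion (a quick sketch with arbitrary sample data):

```python
beta = 0.9
thetas = [3.0, 7.0, 2.0, 5.0]  # arbitrary observations, with v0 = 0

# Recursive form: v_t = beta * v_{t-1} + (1 - beta) * theta_t
v = 0.0
for theta in thetas:
    v = beta * v + (1 - beta) * theta

# Unrolled form: v_t = (1 - beta) * sum over k of beta^k * theta_{t-k}
t = len(thetas)
unrolled = (1 - beta) * sum(beta ** k * thetas[t - 1 - k] for k in range(t))

print(abs(v - unrolled) < 1e-12)  # True: both forms agree
```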

In practice, the value of β is usually chosen close to 0.9.

Using the famous second remarkable limit from mathematical analysis, it is possible to prove the following limit:

lim₍ₓ → ₀₎ (1 − x)¹ᐟˣ = 1 / e

By making the substitution β = 1 − x, we can rewrite it in the form below:

lim₍β → ₁₎ β¹ᐟ⁽¹⁻ᵝ⁾ = 1 / e

We also know that in the equation for the exponential moving average, every observation value is multiplied by a term βᵏ, where k indicates how many timestamps ago the observation was computed. Since the base β is the same in both cases, we can equate the exponents of both formulas:

t = 1 / (1 − β)

Using this equation, for a chosen value of β, we can compute the approximate number of timestamps t it takes for the weight terms to reach the value of 1 / e ≈ 0.368. It means that observations made within the last t iterations have a weight term greater than 1 / e, while observations older than t timestamps come with weights lower than 1 / e and have much less significance.
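This relation is easy to check numerically: as β approaches 1, the weight β^(1 / (1 − β)) approaches 1 / e (a quick sketch):

```python
import math

for beta in [0.9, 0.99, 0.999]:
    t = 1 / (1 - beta)       # approximate number of timestamps
    print(f"beta = {beta}: beta^t = {beta ** t:.4f}")

print(f"1 / e = {1 / math.e:.4f}")  # the limiting value
```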

In reality, weights lower than 1 / e make only a tiny impact on the exponentially weighted average. That is why it is said that **for a given value of β, the exponential weighted average takes into account the last t = 1 / (1 − β) observations**.

To get a better sense of the formula, let us plug in different values for β:

For instance, taking β = 0.9 indicates that after roughly t = 10 iterations, the weight decays to 1 / e of the weight of the current observation. In other words, the exponential weighted average mostly depends only on the last t = 10 observations.
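Assuming the rule of thumb t = 1 / (1 − β), a few common choices of β give the following effective windows (an illustrative sketch; the chosen β values are mine):

```python
# Effective number of observations t = 1 / (1 - beta) for several betas
for beta in [0.5, 0.9, 0.98, 0.999]:
    t = 1 / (1 - beta)
    print(f"beta = {beta:<6} -> roughly the last {t:.0f} observations matter")
```

So moving β closer to 1 makes the average smoother but slower to react to recent changes.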

The common problem with the exponential weighted average is that, in most problems, it cannot approximate the first series values well. This happens due to the absence of a sufficient amount of data during the first iterations. For example, imagine we are given the following time series sequence:

The goal is to approximate it with the exponential weighted average. However, if we use the normal formula, then the first several values will put a large weight on v₀, which is 0, whereas most of the points on the scatterplot are above 20. As a consequence, the sequence of the first weighted averages will be too low to accurately approximate the original sequence.

One of the naive solutions is to take a value of v₀ close to the first observation θ₁. Although this approach works well in some situations, it is still not perfect, especially when a given sequence is volatile. For example, if θ₂ differs too much from θ₁, then while calculating the second value v₂, the weighted average will normally put much more importance on the previous trend v₁ than on the current observation θ₂. As a result, the approximation will be very poor.

A much more flexible solution is to use a technique called "**bias correction**". Instead of simply using the computed values vₖ, they are divided by (1 − βᵏ). Assuming that β is chosen close to 0.9–1, this expression is close to 0 for the first iterations where k is small. Thus, instead of slowly accumulating the first several values where v₀ = 0, they are now divided by a comparatively small number, scaling them up into larger values.

In general, this scaling works very well and accurately adapts the first several terms. When k becomes larger, the denominator gradually approaches 1, progressively removing the effect of the scaling, which is no longer needed: starting from a certain iteration, the algorithm can rely with high confidence on its recent values without any additional scaling.
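Bias correction changes the implementation by only one extra division (a minimal sketch; the sample data and names are mine):

```python
def ewma_bias_corrected(observations, beta=0.9):
    """Exponential moving average rescaled by 1 / (1 - beta^k) on iteration k."""
    corrected = []
    v = 0.0
    for k, theta in enumerate(observations, start=1):
        v = beta * v + (1 - beta) * theta
        corrected.append(v / (1 - beta ** k))  # rescale the under-weighted early values
    return corrected

# Without correction, the first average would be (1 - 0.9) * 25 = 2.5,
# far below the data; with correction, it matches the first observation.
print(ewma_bias_corrected([25.0, 26.0, 24.0])[0])  # ~25.0
```

For k = 1 the correction reduces to v₁ / (1 − β) = θ₁, so the corrected series always starts exactly at the first observation.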

In this article, we have covered an extremely useful technique for approximating a time series sequence. The robustness of the exponential weighted average algorithm is primarily achieved through its hyperparameter β, which can be adapted to a particular type of sequence. Apart from that, the introduced bias correction mechanism makes it possible to efficiently approximate data even at early timestamps when there is too little information.

The exponential weighted average has a wide application scope in time series analysis. Moreover, it is used in variations of the gradient descent algorithm for convergence acceleration. One of the most popular of these is the Momentum optimizer in deep learning, which removes unnecessary oscillations of an optimized function, aligning it more precisely towards a local minimum.

*All images unless otherwise noted are by the author*