
Techniques for Optimizing Step-Size/Learning Rate in Gradient Descent for Machine Learning

- Introduction
- Method 1: Fixed Step Size
- Method 2: Exact Line Search
- Method 3: Backtracking Line Search
- Conclusion
Introduction

When training a machine learning model, Gradient Descent is one of the most commonly used techniques for optimizing its parameters. Gradient descent offers an efficient way of minimizing the loss function over the incoming data, especially in cases where there may not be a closed-form solution to the problem. In general, consider a machine learning problem defined by a convex and differentiable function f: Rᵈ → R (most loss functions satisfy these properties). The goal is to find the x* ∈ Rᵈ that minimizes the loss function:

x* = argmin_{x ∈ Rᵈ} f(x)
Gradient Descent provides an iterative approach to solving this problem. The update rule is given as:

x⁽ᵏ⁾ = x⁽ᵏ ⁻ ¹⁾ − tₖ∇ₓ f(x⁽ᵏ ⁻ ¹⁾)
where x⁽ᵏ⁾ refers to the value of x in the kth iteration of the algorithm, and tₖ refers to the step size or the learning rate of the model in the kth iteration. The general workflow of the algorithm is as follows:
- Determine the loss function f and compute its gradient ∇f.
- Start with a random choice of x ∈ Rᵈ; call it x⁽⁰⁾ (the starting iterate).
- Until you reach the stopping criterion (e.g., the error falls below a certain threshold), do the following:
A) Determine the direction along which x must be updated. Under gradient descent, this is the direction opposite to the gradient of the loss function evaluated at the current iterate: vₖ = −∇ₓ f(x⁽ᵏ ⁻ ¹⁾)
B) Determine the step size or the magnitude of the change: tₖ.
C) Update the iterate: x⁽ᵏ⁾= x⁽ᵏ ⁻ ¹⁾ − tₖ∇ₓ f(x⁽ᵏ ⁻ ¹⁾)
That’s the complete workflow in a nutshell: take the current iterate, find the direction in which it must be updated (vₖ), determine the magnitude of the update (tₖ), and update it.
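To make this loop concrete, here is a minimal Python sketch of the generic workflow (the helper name step_size_rule and the max_iters cap are illustrative assumptions, not part of any library); each of the three methods below simply plugs in a different rule for step 3B:

import numpy as np

# A minimal sketch of the generic workflow: the step-size rule (step 3B)
# is left pluggable, since choosing it is the focus of this article.
def gradient_descent(df, x0, step_size_rule, tol=0.001, max_iters=1000):
    x = np.asarray(x0, dtype=float)      # the starting iterate x^(0)
    for k in range(1, max_iters + 1):
        g = df(x)                        # gradient at the current iterate
        if np.linalg.norm(g) <= tol:     # stopping criterion
            break
        v = -g                           # step 3A: direction of descent
        t = step_size_rule(x, v, k)      # step 3B: magnitude of the update
        x = x + t*v                      # step 3C: update the iterate
    return x

# Example usage: minimize ||x||^2 (so df(x) = 2x) with a fixed step size of 0.1
print(gradient_descent(lambda x: 2*x, [1.0, 1.0], lambda x, v, k: 0.1))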
So, what is this article about? Our focus will be on step 3B: finding the optimal step size, or the magnitude of tₖ. When it comes to gradient descent, this is one of the most neglected aspects of optimizing a model. The size of the step can greatly affect how fast the algorithm converges to the solution, as well as the accuracy of the solution it converges to. Most often, data scientists simply set a fixed value for the step size for the entire learning process, or occasionally use validation techniques to tune it. But there are many more efficient ways to go about solving this problem. In this article, we’ll discuss 3 different ways to determine the step size tₖ:
- Fixed Step Size
- Exact Line Search
- Backtracking Line Search (Armijo’s Rule)
For each of these methods, we’ll discuss the theory and implement it to calculate the first few iterates for an example. Specifically, we’ll consider the following loss function:

f(x, y) = 2x² + 3y² − 2xy − 1
The 3D plot of the function is shown below:

From the figure, it is evident that the global minimum is x* = [0; 0]. Throughout this article, we’ll manually calculate the first few iterates and computationally determine the number of steps to convergence for each of these methods. We will also trace the path of the descent (aka the iterate trajectory) to understand how each of these techniques affects the process of convergence. Usually, it’s easier to refer to the contour plot of the function (instead of its 3D plot) to evaluate the different trajectories. The contour plot of the function can be easily generated via the following code:
# Load packages
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style="darkgrid")
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
from mpl_toolkits.mplot3d import Axes3D

# Define the function
f = lambda x, y: 2*x**2 + 3*y**2 - 2*x*y - 1

# Plot the contours
X = np.arange(-1, 1, 0.005)
Y = np.arange(-1, 1, 0.005)
X, Y = np.meshgrid(X, Y)
Z = f(X, Y)
plt.figure(figsize=(12, 7))
cmap = plt.get_cmap('viridis')
plt.contour(X, Y, Z, 250, cmap=cmap)
Let’s start!

Method 1: Fixed Step Size

This method is the simplest to use, and it is the one most commonly used for training ML models. It involves setting:

tₖ = t (a constant) for every iteration k
One must be very careful while selecting the right t under this method. While a small value of t can lead to very accurate solutions, the convergence can become quite slow. On the other hand, a large t makes the algorithm faster, but at the cost of accuracy. Using this method requires the implementer to carefully balance the trade-off between the speed of convergence and the accuracy of the solution yielded.
In practice, most data scientists use validation techniques such as hold-out validation or k-fold cross-validation to optimize for t. This involves creating a partition of the training data (called the validation data), which is used to optimize performance by running the algorithm on a discrete set of values that t can take, as in the sketch below.
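As a rough, self-contained sketch of this idea (the candidate grid, the iteration budget, and the use of the training loss as the score are illustrative assumptions; in practice, the score would come from a held-out validation set):

# Sketch: tune t over a discrete grid by scoring the loss reached
# after a fixed budget of iterations (a stand-in for validation loss).
import numpy as np
f = lambda x, y: 2*x**2 + 3*y**2 - 2*x*y - 1
df = lambda x, y: np.array([4*x - 2*y, 6*y - 2*x])

def loss_after(t, steps=50):
    x, y = 1.0, 1.0                    # fixed initialization
    for _ in range(steps):
        v = -df(x, y)                  # direction of descent
        x, y = x + t*v[0], y + t*v[1]  # fixed-step update
    return f(x, y)

candidates = [0.01, 0.05, 0.1, 0.2, 0.3]
best_t = min(candidates, key=loss_after)
print(f"Best fixed step size on this grid: {best_t}")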
Let’s look at our example. The first step is to compute the gradient of f:

∇ₓ f(x, y) = [4x − 2y; 6y − 2x]
For all subsequent calculations, we’ll take the initialization to be x⁽⁰⁾ = [1; 1]. Under this strategy, we set:

tₖ = t = 0.1 for all k
The first two iterates are calculated as follows:
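∇ₓ f(x⁽⁰⁾) = [4(1) − 2(1); 6(1) − 2(1)] = [2; 4]
x⁽¹⁾ = x⁽⁰⁾ − t ∇ₓ f(x⁽⁰⁾) = [1; 1] − 0.1 [2; 4] = [0.8; 0.6]

∇ₓ f(x⁽¹⁾) = [4(0.8) − 2(0.6); 6(0.6) − 2(0.8)] = [2; 2]
x⁽²⁾ = x⁽¹⁾ − t ∇ₓ f(x⁽¹⁾) = [0.8; 0.6] − 0.1 [2; 2] = [0.6; 0.4]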
We compute the remaining iterates programmatically via the following Python code:
# Define the function f(x, y)
f = lambda x, y: 2*x**2 + 3*y**2 - 2*x*y - 1

# Define the gradient of f(x, y)
def df(x, y):
    return np.array([4*x - 2*y, 6*y - 2*x])

# Perform gradient descent with a fixed step size
def grad_desc(f, df, x0, y0, t=0.1, tol=0.001):
    x, y = [x0], [y0]  # Initialize lists to store x and y coordinates
    num_steps = 0      # Initialize the number of steps taken
    # Continue until the norm of the gradient is below the tolerance
    while np.linalg.norm(df(x0, y0)) > tol:
        v = -df(x0, y0)   # Compute the direction of descent
        x0 = x0 + t*v[0]  # Update x coordinate
        y0 = y0 + t*v[1]  # Update y coordinate
        x.append(x0)      # Append updated x coordinate to the list
        y.append(y0)      # Append updated y coordinate to the list
        num_steps += 1    # Increment the number of steps taken
    return x, y, num_steps

# Run the gradient descent algorithm with initial point (1, 1)
a, b, n = grad_desc(f, df, 1, 1)

# Print the number of steps taken for convergence
print(f"Number of Steps to Convergence: {n}")
In the above code, we have defined the following convergence criterion (which will be used consistently for all methods): stop when ‖∇ₓ f(x⁽ᵏ⁾)‖ ≤ 0.001.
On running the above code, we find that it takes around 26 steps to converge. The following plot shows the trajectory of the iterates during the gradient descent:
# Plot the contours
X = np.arange(-1.1, 1.1, 0.005)
Y = np.arange(-1.1, 1.1, 0.005)
X, Y = np.meshgrid(X, Y)
Z = f(X, Y)
plt.figure(figsize=(12, 7))
plt.contour(X, Y, Z, 250, cmap=cmap, alpha=0.6)

# Trace the iterate trajectory
n = len(a)
for i in range(n - 1):
    plt.plot([a[i]], [b[i]], marker='o', markersize=7, color='r')
    plt.plot([a[i + 1]], [b[i + 1]], marker='o', markersize=7, color='r')
    plt.arrow(a[i], b[i], a[i + 1] - a[i], b[i + 1] - b[i],
              head_width=0, head_length=0, fc='r', ec='r', linewidth=2.0)
To better understand how critical it is to choose the right t in this method, let’s gauge the effect of increasing or decreasing it. If we decrease the value of t from 0.1 to 0.01, the number of steps to converge increases drastically from 26 to 295. The iterate trajectory for this case is shown below:
However, on increasing t from 0.1 to 0.2, the number of steps to converge decreases from 26 to a mere 11, as shown by the following trajectory:
However, it is important to note that this will not always be the case. If the step size is too large, it is possible that the iterates simply bounce away from the optimal solution and never converge. In fact, increasing t from 0.2 to just around 0.3 causes the iterate values to shoot up, making it impossible to converge. This is seen from the following contour plot (with t = 0.3) for just the first 8 steps:
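To reproduce these first 8 steps numerically, here is a small sketch (using the f and df defined above; the 8-step cap is only so the loop terminates, since the convergence criterion is never met):

# Sketch: watch the iterates bounce away from the minimum with t = 0.3
x0, y0, t = 1.0, 1.0, 0.3
for k in range(8):
    v = -df(x0, y0)
    x0, y0 = x0 + t*v[0], y0 + t*v[1]
    print(f"Step {k + 1}: x = ({x0:.3f}, {y0:.3f}), f(x) = {f(x0, y0):.3f}")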
Thus, it is evident that finding the right value of t is incredibly important in this method, and even a small increase or decrease can greatly affect the algorithm’s ability to converge. Now, let’s talk about the next method to determine t.

Method 2: Exact Line Search
In this method, we don’t assign a simple pre-fixed value of t at every iteration. Instead, we treat the problem of finding the optimal t itself as a 1D optimization problem. In other words, we’re interested in finding the best update t, the one that minimizes the value of the function:

g(t) = f(x⁽ᵏ ⁻ ¹⁾ + t vₖ)
Notice how cool this is! We have a multi-dimensional optimization problem (minimizing f) that we are trying to solve using gradient descent. We know the best direction to update our iterate (vₖ = −∇ₓ f(x⁽ᵏ ⁻ ¹⁾)), but we need to find the optimal step size tₖ. In other words, the value of the function at the next iterate depends only on the value of tₖ that we choose to use. So, we treat this as another (but simpler!) optimization problem:

tₖ = argmin_{t ≥ 0} g(t)
So, we update x⁽ᵏ⁾ to be the iterate that best minimizes the loss function f. This really helps increase the speed of convergence. However, it also adds an extra time requirement: computing the minimizer of g(t). Often, this isn’t an issue since g is a 1D function, but sometimes it can take longer than expected. Thus, while using this method, it’s necessary to balance the trade-off between the number of steps saved in converging and the extra time required to compute the argmin at each step. Let’s look at our example:
The first two iterates are calculated as follows:
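For the first iterate, v₁ = −∇ₓ f(x⁽⁰⁾) = −[2; 4], so

g(t) = f(1 − 2t, 1 − 4t) = 2 − 20t + 40t²

Setting g′(t) = −20 + 80t = 0 gives t₁ = 0.25, and hence x⁽¹⁾ = [1 − 2(0.25); 1 − 4(0.25)] = [0.5; 0]. For the second iterate, ∇ₓ f(x⁽¹⁾) = [2; −1], so

g(t) = f(0.5 − 2t, t) = −0.5 − 5t + 15t²

Setting g′(t) = −5 + 30t = 0 gives t₂ = 1/6, and hence x⁽²⁾ = [0.5 − 2/6; 1/6] = [1/6; 1/6] ≈ [0.167; 0.167]. (As a side note: for a quadratic f with Hessian H, the exact minimizer has the closed form tₖ = ‖∇ₓ f‖² / (∇ₓ fᵀ H ∇ₓ f); here H = [4, −2; −2, 6], which reproduces t₁ = 20/80 = 0.25.)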
We compute the remaining iterates programmatically via the following Python code:
# Import package for 1D optimization
from scipy.optimize import minimize_scalar

# Perform gradient descent with exact line search
def grad_desc(f, df, x0, y0, tol=0.001):
    x, y = [x0], [y0]  # Initialize lists to store x and y coordinates
    num_steps = 0      # Initialize the number of steps taken
    # Continue until the norm of the gradient is below the tolerance
    while np.linalg.norm(df(x0, y0)) > tol:
        v = -df(x0, y0)  # Compute the direction of descent
        # Define the 1D function g(t) to minimize over the step size t
        g = lambda t: f(x0 + t*v[0], y0 + t*v[1])
        t = minimize_scalar(g).x  # Find the minimizing t
        x0 = x0 + t*v[0]  # Update x coordinate
        y0 = y0 + t*v[1]  # Update y coordinate
        x.append(x0)      # Append updated x coordinate to the list
        y.append(y0)      # Append updated y coordinate to the list
        num_steps += 1    # Increment the number of steps taken
    return x, y, num_steps

# Run the gradient descent algorithm with initial point (1, 1)
a, b, n = grad_desc(f, df, 1, 1)

# Print the number of steps taken for convergence
print(f"Number of Steps to Convergence: {n}")
Just as before, in the above code, we have defined the same convergence criterion: stop when ‖∇ₓ f(x⁽ᵏ⁾)‖ ≤ 0.001.
On running the above code, we find that it takes only 10 steps to converge (a major improvement over the fixed step size). The following plot shows the trajectory of the iterates during the gradient descent:
# Plot the contours
X = np.arange(-1.1, 1.1, 0.005)
Y = np.arange(-1.1, 1.1, 0.005)
X, Y = np.meshgrid(X, Y)
Z = f(X, Y)
plt.figure(figsize=(12, 7))
plt.contour(X, Y, Z, 250, cmap=cmap, alpha=0.6)

# Trace the iterate trajectory
n = len(a)
for i in range(n - 1):
    plt.plot([a[i]], [b[i]], marker='o', markersize=7, color='r')
    plt.plot([a[i + 1]], [b[i + 1]], marker='o', markersize=7, color='r')
    plt.arrow(a[i], b[i], a[i + 1] - a[i], b[i + 1] - b[i],
              head_width=0, head_length=0, fc='r', ec='r', linewidth=2.0)
Now, let’s talk about the next method to determine t.

Method 3: Backtracking Line Search
Backtracking is an adaptive method of choosing the optimal step size. In my experience, I have found this to be one of the most useful strategies for optimizing the step size. The convergence is usually much faster than with a fixed step size, without the complications of minimizing the 1D function g(t) in an exact line search. This method involves starting with a rather large step size (t̄ = 1) and continuing to decrease it until the required decrease in f(x) is observed. Let us first take a look at the algorithm, and then we’ll discuss the specifics:

- Set t = t̄.
- While f(x⁽ᵏ ⁻ ¹⁾ − t ∇ₓ f(x⁽ᵏ ⁻ ¹⁾)) > f(x⁽ᵏ ⁻ ¹⁾) − σt ‖∇ₓ f(x⁽ᵏ ⁻ ¹⁾)‖², set t = βt.
- Set tₖ = t.
In other words, we start with a large step size (which is important, particularly in the initial stages of the algorithm) and check if it helps us improve the current iterate by a given threshold. If the step size is found to be too large, we reduce it by multiplying it by a scalar constant β ∈ (0, 1). We repeat this process until the desired decrease in f is obtained. Specifically, we choose the largest t such that:

f(x⁽ᵏ ⁻ ¹⁾ − t ∇ₓ f(x⁽ᵏ ⁻ ¹⁾)) ≤ f(x⁽ᵏ ⁻ ¹⁾) − σt ‖∇ₓ f(x⁽ᵏ ⁻ ¹⁾)‖²
i.e., the decrease is at least σt ‖∇ₓ f(x⁽ᵏ ⁻ ¹⁾)‖². But why this value? It can be shown mathematically (via Taylor’s first-order expansion) that t ‖∇ₓ f(x⁽ᵏ ⁻ ¹⁾)‖² is the decrease in f that the first-order approximation predicts for the current iteration. There is an additional factor of σ in the condition. This is to account for the fact that even if we cannot achieve the full predicted decrease t ‖∇ₓ f(x⁽ᵏ ⁻ ¹⁾)‖², we at least hope to achieve a fraction of it, scaled by σ. That is to say, we require that the achieved reduction in f be at least a fixed fraction σ of the reduction promised by the first-order Taylor approximation of f at x⁽ᵏ ⁻ ¹⁾. If the condition isn’t fulfilled, we scale t down to a smaller value via β. Let’s look at our example (setting t̄ = 1, σ = β = 0.5):
The first two iterates are calculated as follows:
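Starting from x⁽⁰⁾ = [1; 1], we have f(x⁽⁰⁾) = 2 and ‖∇ₓ f(x⁽⁰⁾)‖² = ‖[2; 4]‖² = 20, so the condition requires f(x⁽⁰⁾ − t ∇ₓ f(x⁽⁰⁾)) ≤ 2 − 10t:

- t = 1: f(−1, −3) = 22 > −8 → reject, set t = 0.5
- t = 0.5: f(0, −1) = 2 > −3 → reject, set t = 0.25
- t = 0.25: f(0.5, 0) = −0.5 ≤ −0.5 → accept

So t₁ = 0.25 and x⁽¹⁾ = [0.5; 0].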
Likewise,
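from x⁽¹⁾ = [0.5; 0], we have f(x⁽¹⁾) = −0.5, ∇ₓ f(x⁽¹⁾) = [2; −1], and ‖∇ₓ f(x⁽¹⁾)‖² = 5, so the condition requires f(x⁽¹⁾ − t ∇ₓ f(x⁽¹⁾)) ≤ −0.5 − 2.5t:

- t = 1: f(−1.5, 1) = 9.5 > −3 → reject, set t = 0.5
- t = 0.5: f(−0.5, 0.5) = 0.75 > −1.75 → reject, set t = 0.25
- t = 0.25: f(0, 0.25) = −0.8125 > −1.125 → reject, set t = 0.125
- t = 0.125: f(0.25, 0.125) ≈ −0.891 ≤ −0.8125 → accept

So t₂ = 0.125 and x⁽²⁾ = [0.25; 0.125].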
We compute the remaining iterates programmatically via the following Python code:
# Perform gradient descent with backtracking (Armijo) line search
def grad_desc(f, df, x0, y0, tol=0.001):
    x, y = [x0], [y0]  # Initialize lists to store x and y coordinates
    num_steps = 0      # Initialize the number of steps taken
    # Continue until the norm of the gradient is below the tolerance
    while np.linalg.norm(df(x0, y0)) > tol:
        v = -df(x0, y0)  # Compute the direction of descent
        # Compute the step size using Armijo line search
        t = armijo(f, df, x0, y0, v[0], v[1])
        x0 = x0 + t*v[0]  # Update x coordinate
        y0 = y0 + t*v[1]  # Update y coordinate
        x.append(x0)      # Append updated x coordinate to the list
        y.append(y0)      # Append updated y coordinate to the list
        num_steps += 1    # Increment the number of steps taken
    return x, y, num_steps

def armijo(f, df, x1, x2, v1, v2, s=0.5, b=0.5):
    t = 1
    # Shrink t until the Armijo sufficient-decrease condition is satisfied
    while (f(x1 + t*v1, x2 + t*v2) > f(x1, x2) +
           t*s*np.matmul(df(x1, x2).T, np.array([v1, v2]))):
        t = t*b  # Reduce the step size by a factor of b
    return t

# Run the gradient descent algorithm with initial point (1, 1)
a, b, n = grad_desc(f, df, 1, 1)

# Print the number of steps taken for convergence
print(f"Number of Steps to Convergence: {n}")
Just as before, in the above code, we have defined the same convergence criterion: stop when ‖∇ₓ f(x⁽ᵏ⁾)‖ ≤ 0.001.
On running the above code, we find that it takes only 10 steps to converge. The following plot shows the trajectory of the iterates during the gradient descent:
# Plot the contours
X = np.arange(-1.1, 1.1, 0.005)
Y = np.arange(-1.1, 1.1, 0.005)
X, Y = np.meshgrid(X, Y)
Z = f(X, Y)
plt.figure(figsize=(12, 7))
plt.contour(X, Y, Z, 250, cmap=cmap, alpha=0.6)

# Trace the iterate trajectory
n = len(a)
for i in range(n - 1):
    plt.plot([a[i]], [b[i]], marker='o', markersize=7, color='r')
    plt.plot([a[i + 1]], [b[i + 1]], marker='o', markersize=7, color='r')
    plt.arrow(a[i], b[i], a[i + 1] - a[i], b[i + 1] - b[i],
              head_width=0, head_length=0, fc='r', ec='r', linewidth=2.0)
Conclusion

In this article, we familiarised ourselves with some useful techniques for optimizing the step size in the gradient descent algorithm. Specifically, we covered 3 main techniques: Fixed Step Size, which involves maintaining the same step size or learning rate throughout the training process; Exact Line Search, which involves minimizing the loss as a function of t; and Armijo Backtracking, which involves a gradual reduction of the step size until a sufficient decrease is met. While these are some of the most fundamental techniques you can use to tune your optimization, there exists a vast array of other methods (e.g., setting t as a function of the number of iterations, as sketched below). These tools are generally used in more complex settings, such as Stochastic Gradient Descent. The aim of this article was not only to introduce you to these techniques but also to make you aware of the intricacies that can affect your optimization algorithm. While most of these techniques are used in the context of Gradient Descent, they can also be applied to other optimization algorithms (e.g., the Newton-Raphson Method). Each of these techniques has its own merits and may be preferred over the others for specific applications and algorithms.
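As a flavour of that family, here is a minimal sketch of one such schedule (the 1/(1 + decay·k) form and the constants are illustrative assumptions, not a prescription):

# Sketch of a decaying step-size schedule: t_k shrinks as iterations progress.
def decayed_step(k, t0=0.2, decay=0.1):
    return t0 / (1 + decay*k)

for k in [0, 10, 100]:
    print(f"t_{k} = {decayed_step(k):.4f}")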
Hope you enjoyed reading this article! If you have any doubts or suggestions, do reply in the comment box. Please feel free to contact me via mail.
If you liked my article and want to read more of them, please follow me.
Note: All images have been made by the author.