Understanding Gradient Descent for Machine Learning
What’s Loss Function?
1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini-Batch Gradient Descent
Conclusion

A deep dive into Batch, Stochastic, and Mini-Batch Gradient Descent algorithms using Python

Towards Data Science
Photo by Lucas Clara on Unsplash

Gradient descent is a popular optimization algorithm used in machine learning and deep learning models such as linear regression, logistic regression, and neural networks. It iteratively uses first-order derivatives to minimize the cost function by updating model coefficients (for regression) and weights (for neural networks).

In this article, we will delve into the mathematical theory of gradient descent and explore how to perform the calculations using Python. We will examine various implementations including Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent, and assess their effectiveness on a range of test cases.

While following the article, you can check out the Jupyter Notebook on my GitHub for the complete analysis and code.

Before diving into gradient descent, let's first go over the loss function.

Loss and cost are used interchangeably to describe the error in a prediction. A loss value indicates how different a prediction is from the actual value, and the loss function aggregates all the loss values from multiple data points into a single number.

As you can see in the image below, the model on the left has high loss whereas the model on the right has low loss and fits the data better.

High loss vs low loss (blue lines) from the corresponding regression line in yellow.

The loss function (J) is used as a performance measure for prediction algorithms, and the main goal of a predictive model is to minimize its loss function, which is determined by the values of the model parameters (i.e., θ0 and θ1).

For example, linear regression models frequently use squared loss to compute the loss value, and mean squared error is the loss function that averages all the squared losses.

Squared Loss value (L2 Loss) and Mean Squared Error (MSE)

The linear regression model works behind the scenes by going through several iterations to optimize its coefficients and reach the lowest possible mean squared error.
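The squared loss and MSE described above can be sketched in a few lines of Python (function names here are illustrative, not from the original notebook):

```python
import numpy as np

def squared_loss(y_true, y_pred):
    # L2 loss for a single prediction: (y - y_hat)^2
    return (y_true - y_pred) ** 2

def mean_squared_error(y_true, y_pred):
    # MSE averages the squared losses over all data points
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

# A model that fits the data better yields a lower MSE
y = [3.0, 5.0, 7.0]
good_fit = [2.9, 5.1, 7.0]
bad_fit = [1.0, 8.0, 4.0]
print(mean_squared_error(y, good_fit))  # small
print(mean_squared_error(y, bad_fit))   # large
```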

What’s Gradient Descent?

The gradient descent algorithm is frequently described with a mountain analogy:

⛰ Imagine yourself standing atop a mountain, with limited visibility, and you want to reach the bottom. While descending, you will encounter slopes and traverse them using larger or smaller steps. Once you reach a slope that is almost level, you will know you have arrived at the lowest point. ⛰

In technical terms, the gradient refers to these slopes. When the slope is zero, it may indicate that you have reached a function's minimum or maximum value.

As in the mountain analogy, GD minimizes the starting loss value by taking repeated steps in the opposite direction of the gradient to reduce the loss function.

At any given point on a curve, the steepness of the slope can be determined by a tangent line, a straight line that touches the point (red lines in the image above). Similar to the tangent line, the gradient at a point on the loss function is calculated with respect to the parameters, and a small step is taken in the opposite direction to reduce the loss.

To summarize, the process of gradient descent can be broken down into the following steps:

  1. Select a starting point for the model parameters.
  2. Determine the gradient of the cost function with respect to the parameters and continually adjust the parameter values through iterative steps to minimize the cost function.
  3. Repeat step 2 until the cost function no longer decreases or the maximum number of iterations is reached.
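The three steps above can be sketched as a minimal gradient-descent loop for the two-parameter linear regression model (a toy implementation with illustrative names, not the code from the original notebook):

```python
import numpy as np

def gradient_descent(x, y, learning_rate=0.01, n_iterations=1000, tol=1e-8):
    theta0, theta1 = 0.0, 0.0           # step 1: starting point
    n = len(x)
    prev_cost = np.inf
    for _ in range(n_iterations):       # step 3: repeat until convergence
        y_pred = theta0 + theta1 * x
        error = y_pred - y
        # step 2: gradients of MSE with respect to each parameter
        grad0 = (2 / n) * error.sum()
        grad1 = (2 / n) * (error * x).sum()
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
        cost = (error ** 2).mean()
        if abs(prev_cost - cost) < tol:  # cost no longer decreases
            break
        prev_cost = cost
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x                        # ground truth: theta0=2, theta1=3
t0, t1 = gradient_descent(x, y, learning_rate=0.05, n_iterations=20000)
print(t0, t1)
```

The loop recovers parameters close to the true intercept and slope.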

We can examine the gradient calculation for the previously defined cost (loss) function. Although we are using linear regression with an intercept and one coefficient, this reasoning can be extended to regression models with several variables.

Linear regression function with 2 parameters, cost function, and objective function
Partial derivatives calculated wrt model parameters
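Since the original images are not reproduced here, the formulas behind those two captions can be reconstructed in standard notation as:

```latex
\hat{y}_i = \theta_0 + \theta_1 x_i
\qquad
J(\theta_0, \theta_1) = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2

\frac{\partial J}{\partial \theta_0} = \frac{2}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)
\qquad
\frac{\partial J}{\partial \theta_1} = \frac{2}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right) x_i
```

Each iteration then updates both parameters in the opposite direction of their partial derivatives: θj ← θj − η · ∂J/∂θj, where η is the learning rate.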

💡 Sometimes, the point reached may only be a local minimum or a plateau. In such cases, the model must continue iterating until it reaches the global minimum. Reaching the global minimum is unfortunately not guaranteed, but with a proper number of iterations and learning rate, we can increase the chances.

When using gradient descent, it is important to be aware of the potential challenge of stopping at a local minimum or on a plateau. To avoid this, it is crucial to choose an appropriate number of iterations and learning rate. We will discuss this further in the following sections.

Learning_rate is the hyperparameter of gradient descent that defines the size of the learning step. It can be tuned using hyperparameter tuning techniques.

  • If the learning_rate is set too high, it can result in a jump that produces a loss value greater than the starting point. A high learning_rate may cause gradient descent to diverge, leading it to repeatedly obtain higher loss values and preventing it from finding the minimum.
Example case: A high learning rate causes GD to diverge
  • If the learning_rate is set too low, it can result in a lengthy computation process where gradient descent iterates through numerous rounds of gradient calculations to reach convergence and find the minimum loss value.
Example case: A low learning rate causes GD to take an excessive amount of time to converge
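Both failure modes can be demonstrated on a simple quadratic loss J(θ) = θ², whose gradient is 2θ (a toy example, not from the original notebook):

```python
def descend(learning_rate, n_steps=50, theta=5.0):
    # Minimize J(theta) = theta^2 using its gradient dJ/dtheta = 2*theta
    for _ in range(n_steps):
        theta -= learning_rate * 2 * theta
    return theta

print(abs(descend(0.1)))     # converges close to the minimum at 0
print(abs(descend(1.1)))     # diverges: |theta| grows with every step
print(abs(descend(0.0001)))  # too low: barely moved after 50 steps
```

With a rate of 1.1 each update overshoots the minimum and lands farther away than where it started, while 0.0001 makes steps so small that 50 iterations leave θ almost unchanged.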

The size of the learning step is determined by the slope of the curve, which means that as we approach the minimum point, the learning steps become smaller.

When using low learning rates, progress will be steady, whereas high learning rates may result in either exponential progress or getting stuck at low points.

Image adapted from https://cs231n.github.io/neural-networks-3/
