Vectorized Logistic Regression

17 hours ago
The underlying math behind any Artificial Neural Network (ANN) algorithm will be overwhelming to grasp. Furthermore, the matrix and vector operations used to represent feed-forward and back-propagation computations during batch training of the model can add to the comprehension overload. While succinct matrix and vector notations make sense, peeling through such notations all the way down to subtle working details of such matrix operations would bring more clarity. I spotted that the most effective approach to understand such subtle details is to think about a bare minimum network model. I couldn’t find a greater algorithm than Logistic Regression to explore what goes under the hood since it has all of the bells and whistles of ANN, similar to multidimensional inputs, the network weights, the bias, forward propagation operations, activations that apply non-linear function, loss function, and gradients-based back-propagation. My intent for this blog is to share my notes and findings of the matrix and vector operations which might be core to Logistic Regression model.
Temporary Synopsis of Logistic Regression
Despite its name, Logistic Regression is a classification algorithm and never a regression algorithm. Typically it’s used for binary classification to predict the probability of an instance belonging to one among two classes, for instance, predicting if an email is spam or not. As such, in Logistic Regression, the dependent or goal variable is taken into account a categorical variable. For instance, an email being spam is represented as 1 and never spam as 0. The first goal of the Logistic Regression model is to determine a relationship between the input variables (features) and the probability of the goal variable. For instance, given the characteristics of an email as a set of input features, a Logistic Regression model would discover a relationship between such features and the probability of the e-mail being spam. If ‘Y’ represents the output class, similar to an email being spam, ‘X’ represents the input features, the probability will be designated as π = Pr( Y = 1 | X, βi), where βi represents the logistic regression parameters that include model weights ‘wi’ and a bias parameter ‘b’. Effectively, a Logistic Regression predicts the probability of Y = 1 given the input features and the model parameters. Specifically, the probability π is modeled as an S-Shaped logistic function called the Sigmoid function, given by π = e^z/(1 + e^z) or equivalently by π = 1/(1 + e^-z), where z = βi . X. The sigmoid function allows for a smooth curve bounded between 0 and 1, making it suitable for estimating probabilities. Essentially, a Logistic Regression model applies the sigmoid function on a linear combination of the input features to predict a probability between 0 and 1. A typical approach to determining an instance’s output class is thresholding the expected probability. For instance, if the expected probability is bigger than or equal to 0.5, the instance is classed as belonging to class 1; otherwise, it is classed as class 0.
A Logistic Regression model is trained by fitting the model to the training data after which minimizing a loss function to regulate the model parameters. A loss function estimates the difference between the expected and actual probabilities of the output class. Probably the most common loss function utilized in training a Logistic Regression model is the Log Loss function, also often known as Binary Cross Entropy Loss function. The formula for the Log Loss function is as follows:
L = — ( y * ln(p) + (1 — y) * ln(1 — p) )
Where:
- L represents the Log Loss.
- y is the ground-truth binary label (0 or 1).
- p is the expected probability of the output class.
A Logistic Regression model adjusts its parameters by minimizing the loss function using techniques similar to gradient descent. Given a batch of input features and their ground-truth class labels, training of the model is carried out in several iterations, called epochs. In each epoch, the model carries forward propagation operations to estimate losses and backward propagation operations to reduce the loss function and adjust the parameters. All such operations in an epoch employ matrix and vector computations as illustrated in the following sections.
Matrix and Vector Notations
Please note that I used LaTeX scripts to create the mathematical equations and matrix/vector representations embedded as images on this blog. If anyone is interested by the LaTeX scripts, don’t hesitate to contact me; I shall be completely happy to share.
As shown within the schematic diagram above, a binary Logistic Regression classifier is used for instance to maintain the illustrations easy. As shown below, a matrix X represents the ‘m’ variety of input instances. Each input instance comprises an ’n’ variety of features and is represented as a column, an input features vector, throughout the matrix X, making it a (n x m) sized matrix. The super-script (i) represents the ordinal variety of the input vector within the matrix X. The sub-script ‘j’ represents the ordinal index of the feature inside an input vector. The matrix Y of size (1 x m) captures the ground-truth labels corresponding to every input vector within the matrix X. The model weights are represented by a column vector W of size (n x 1) comprising ’n’ weight parameters corresponding to every feature within the input vector. While there is simply one bias parameter ‘b’, for illustrating matrix/vector operations, a matrix B of size (1 x m) comprising ‘m’ variety of the identical bias b parameter is taken into account.
Forward Propagation
Step one within the forward propagation operation is to compute a linear combination of model parameters and input features. The notation for such matrix operation is shown below where a brand new matrix Z is evaluated:
Note using the transpose of weight matrix W. The above operation within the matrix expanded representation is as follows:
The above matrix operation leads to the computation of matrix Z of size (1 x m) as shown below:
The subsequent step is to derive activations by applying the sigmoid function on the computed linear mixtures for every input as shown in the next matrix operation. This leads to an activation matrix A of size (1 x m).
Backward Propagation
Backward propagation or back-propagation is a method to compute the contributions of every parameter to the general error or loss brought on by incorrect predictions at the top of every epoch. The person loss contributions are evaluated by computing the gradients of the loss function with respect to (w.r.t) each model parameter. A gradient or derivative of a function is the speed of change or the slope of that function w.r.t a parameter considering other parameters as constants. When evaluated for a particular parameter value or point, the sign of the gradient indicates during which direction the function increases, and the gradient magnitude indicates the steepness of the slope. The log loss function as shown below is a bowl-shaped convex function with one global minimum point. As such, normally, the gradient of the log loss function w.r.t a parameter points in the wrong way to the worldwide minima. Once gradients are evaluated, each parameter value is updated using the parameter’s gradient, typically by utilizing a method called gradient descent.
The gradient for every parameter is computed using the chain rule. The chain rule enables the computation of derivatives of functions which might be composed of other functions. Within the case of Logistic Regression, the log loss L is a function of activation ‘a’ and ground-truth label ‘y’, while ‘a’ itself is a sigmoid function of ‘z’ and ‘z’ is a linear function of weights ‘w’ and bias ‘b’ implying that the loss function L is a function composed of other functions as shown below.
Using the chain rule of partial derivatives, the gradients of weight and bias parameters will be computed as follows:
Derivation of Gradients for Single Input Instance
Before we review the matrix and vector representations that come into play as a part of updating the parameters in a single shot, we’ll first derive the gradients using a single input instance to grasp the premise for such representations higher.
Assuming that ‘a’ and ‘z’ represent computed values for a single input instance with the ground-truth label ‘y’, the gradient of the loss function w.r.t ‘a’ will be derived as follows. Note that this gradient is the primary quantity required to guage the chain rule to derive parameter gradients later.
Given the gradient of loss function w.r.t ‘a’, the gradient of loss function w.r.t ‘z’ will be derived using the next chain rule:
The above chain rule implies that the gradient of ‘a’ w.r.t ‘z’ must even be derived. Note that ‘a’ is computed by applying the sigmoid function on ‘z’. Due to this fact, the gradient of ‘a’ w.r.t ‘z’ will be derived by utilizing the sigmoid function expression as follows:
The above derivation is expressed by way of ‘e’, and it seems that additional computations are needed to guage the gradient of ‘a’ w.r.t ‘z’. We all know that ‘a’ gets computed as a part of forward propagation. Due to this fact to eliminate any additional computations, the above derivative will be fully expressed by way of ‘a’ as a substitute as follows:
Plugging within the above terms expressed in ‘a’, the gradient of ‘a’ w.r.t ‘z’ is as follows:
Now that we’ve got the gradient of loss function w.r.t ‘a’ and the gradient of ‘a’ w.r.t ‘z’, the gradient of loss function w.r.t ‘z’ will be evaluated as follows:
We got here a good distance in evaluating the gradient of loss function w.r.t ‘z’. We still need to guage the gradients of loss function w.r.t model parameters. We all know that ‘z’ is a linear combination of model parameters and features of an input instance ‘x’ as shown below:
Using the chain rule the gradient of loss function w.r.t weight parameter ‘wi’ gets evaluated as shown below:
Similarly, the gradient of the loss function w.r.t ‘b’ gets evaluated as follows:
Matrix and Vector Representation of Parameter Updates using Gradients
Now that we understand gradient formulas for model parameters derived using a single input instance, we will represent the formulas in matrix and vector forms accounting for all the training batch. We’ll first vectorize gradients of the loss function w.r.t ‘z’ given by the next expression:
The vector type of the above for all the ‘m’ variety of instances is:
Similarly, the gradients of the loss function w.r.t each weight ‘wi’ will be vectorized. The gradient of the loss function w.r.t weight ‘wi’ for a single instance is given by:
The vector type of the above for all weights across all ‘m’ input instances is evaluated because the mean of ‘m’ gradients as follows:
Similarly, the resultant gradient of loss function w.r.t ‘b’ across all ‘m’ input instances is computed as a mean of the person instance gradients as follows:
Given the model weights gradient vector and the general gradient for bias, the model parameters get updated as follows. The parameter updates as shown below are based on the technique called gradient descent where a learning rate is used. A learning rate is a hyper-parameter utilized in optimization techniques similar to gradient descent to manage the step size of adjustments made at each epoch to the model parameters based on computed gradients. Effectively, a learning rate acts as a scaling factor, influencing the speed and convergence of the optimization algorithm.
Conclusion
As evident from the matrix and vector representations illustrated on this blog, Logistic Regression enables a bare minimum network model to grasp the subtle working details of such matrix and vector operations. Most machine-learning libraries encapsulate such nitty-gritty mathematical details but as a substitute expose well-defined programming interfaces at the next level, similar to forward or backward propagation. While understanding all such subtle details might not be required to develop models using such libraries, such details do make clear the mathematical intuitions behind such algorithms. Nonetheless, such understanding will definitely help carry forward the underlying mathematical intuitions to other models similar to ANN, Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), and Generative Adversarial Networks (GAN).