
The next plot shows the log loss when y = 1:
The log loss equals 0 only for a perfect prediction (p = 1 and y = 1, or p = 0 and y = 0), and approaches infinity as the prediction gets worse (i.e., when y = 1 and p → 0, or when y = 0 and p → 1).
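For example, evaluating the loss for the y = 1 case at a few arbitrary probability values (a small illustrative check, not part of the model code) shows this behavior:

import numpy as np

# Log loss for the y = 1 case: -log(p)
for p in [0.999, 0.9, 0.5, 0.1, 0.001]:
    print(f'p = {p}: loss = {-np.log(p):.3f}')
# The loss is close to 0 when p is close to 1, and grows without bound as p approaches 0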
The cost function computes the average loss over the entire data set:

J(w) = -(1/n) Σᵢ [yᵢ log(pᵢ) + (1 - yᵢ) log(1 - pᵢ)]
The same cost function can be written in a vectorized form as follows:

J(w) = -(1/n) [yᵗ log(p) + (1 - y)ᵗ log(1 - p)]

where y = (y₁, …, yₙ) is a vector that contains all the labels of the training samples, and p = (p₁, …, pₙ) is a vector that contains all the predicted probabilities of the model for all the training samples.
This cost function is convex, i.e., it has a single global minimum. However, there is no closed-form solution for finding the optimal w* (due to the non-linearities introduced by the log function). Therefore, we need to use iterative optimization methods such as gradient descent in order to find the minimum.
Gradient descent is an iterative approach for finding a minimum of a function, where we take small steps in the opposite direction of the gradient in order to get closer to the minimum:
In order to use gradient descent to find the minimum of the log loss cost, we need to compute the partial derivatives of J(w) with respect to each one of the weights.
The partial derivative of J(w) with respect to any of the weights wⱼ is:

∂J/∂wⱼ = (1/n) Σᵢ (pᵢ - yᵢ)xᵢⱼ

Proof:

By the chain rule, ∂J/∂wⱼ = -(1/n) Σᵢ [yᵢ/pᵢ - (1 - yᵢ)/(1 - pᵢ)] · ∂pᵢ/∂wⱼ. Since pᵢ = σ(wᵗxᵢ) and the derivative of the sigmoid is σ′(z) = σ(z)(1 - σ(z)), we have ∂pᵢ/∂wⱼ = pᵢ(1 - pᵢ)xᵢⱼ. Substituting this in gives ∂J/∂wⱼ = -(1/n) Σᵢ [yᵢ(1 - pᵢ) - (1 - yᵢ)pᵢ]xᵢⱼ = (1/n) Σᵢ (pᵢ - yᵢ)xᵢⱼ.
In vectorized form, we can write the gradient vector as follows:

∇J(w) = (1/n) Xᵗ(p - y)

And the gradient descent update rule is:

w ← w - α∇J(w)
where α is a learning rate that controls the step size (0 < α < 1).
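As a minimal illustration of this update rule (applied here to a simple one-dimensional quadratic rather than the logistic regression cost), each iteration takes a small step against the gradient:

# Minimize f(w) = (w - 3)^2, whose gradient is f'(w) = 2 * (w - 3)
w = 0.0        # initial guess
alpha = 0.1    # learning rate
for i in range(100):
    grad = 2 * (w - 3)
    w = w - alpha * grad   # the gradient descent update rule
print(w)  # approaches the minimum at w = 3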
Note that whenever you use gradient descent, you must make sure that your data set is normalized (otherwise gradient descent may take steps of different sizes in different directions, which will make it unstable).
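If your features do have very different scales, one common option is to standardize them to zero mean and unit variance, for example with scikit-learn's StandardScaler (shown here only as a sketch for a generic feature matrix X; we will not need it for the data set used below):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_normalized = scaler.fit_transform(X)  # every column now has zero mean and unit variance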
We will now implement the logistic regression model in Python from scratch, including its cost function and gradient computation, optimize the model using gradient descent, evaluate the model, and plot the final decision boundary.
For the demonstration we will use the Iris data set (BSD license). The original data set contains 150 samples of Iris flowers that belong to one of three species (Setosa, Versicolor and Virginica). We will turn it into a binary classification problem by using only the first two types of flowers (Setosa and Versicolor). In addition, we will use only the first two features of each flower (sepal length and sepal width).
Loading the Data Set
Let's first import the required libraries and fix the random seed in order to get reproducible results:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(0)
Next, we load the data set:
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data[:, :2]  # Take only the first two features
y = iris.target
# Take only the setosa and versicolor flowers
X = X[(y == 0) | (y == 1)]
y = y[(y == 0) | (y == 1)]
Let's plot the data:
def plot_data(X, y):
    sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=iris.target_names[y], style=iris.target_names[y],
                    palette=['r', 'b'], markers=('s', 'o'), edgecolor='k')
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[1])
    plt.legend()
plot_data(X, y)
As can be seen, the data set is linearly separable, therefore logistic regression should be able to find the boundary between the two classes.
Next, we need to add a column of ones to the feature matrix X in order to represent the bias (w₀):
# Add a column for the bias
n = X.shape[0]
X_with_bias = np.hstack((np.ones((n, 1)), X))
We now split the data set into training and test sets:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_with_bias, y, random_state=0)
Model Implementation
We are now ready to implement the logistic regression model. We start by defining a helper function to compute the sigmoid function:
def sigmoid(z):
    """ Compute the sigmoid of z (z can be a scalar or a vector). """
    z = np.array(z)
    return 1 / (1 + np.exp(-z))
Next, we implement the cost function, which returns the cost of a logistic regression model with parameters w on a given data set (X, y), as well as its gradient with respect to w.
def cost_function(X, y, w):
    """ J, grad = cost_function(X, y, w) computes the cost of a logistic regression model
        with parameters w and the gradient of the cost w.r.t. the parameters. """
    # Compute the cost
    p = sigmoid(X @ w)
    J = -(1/n) * (y @ np.log(p) + (1 - y) @ np.log(1 - p))

    # Compute the gradient
    grad = (1/n) * X.T @ (p - y)
    return J, grad
Note that we are using the vectorized forms of the cost and gradient functions that were shown earlier.
To sanity check this function, let's compute the cost and gradient of the model at some random weight vector:
w = np.random.rand(X_train.shape[1])
cost, grad = cost_function(X_train, y_train, w)

print('w:', w)
print('Cost at w:', cost)
print('Gradient at w:', grad)
The output we get is:
w: [0.5488135 0.71518937 0.60276338]
Cost at w: 2.314505839067951
Gradient at w: [0.36855061 1.86634895 1.27264487]
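As an optional extra check, we can compare this analytic gradient against a numerical finite-difference approximation (a quick verification sketch, not part of the model itself); the two should agree closely:

eps = 1e-6
num_grad = np.zeros_like(w)
for j in range(len(w)):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[j] += eps
    w_minus[j] -= eps
    J_plus, _ = cost_function(X_train, y_train, w_plus)
    J_minus, _ = cost_function(X_train, y_train, w_minus)
    num_grad[j] = (J_plus - J_minus) / (2 * eps)
print('Numerical gradient:', num_grad)  # should be very close to grad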
Gradient Descent Implementation
We now implement gradient descent in order to find the optimal w* that minimizes the cost function of the model on a given training set. The algorithm will run at most max_iter passes over the training set (defaults to 5000), unless the cost has not decreased by at least tol (defaults to 0.0001) since the previous iteration, in which case training stops immediately.
def optimize_model(X, y, alpha=0.01, max_iter=5000, tol=0.0001):
    """ Optimize the model using gradient descent.
        X, y: The training set
        alpha: The learning rate
        max_iter: The maximum number of passes over the training set (epochs)
        tol: The stopping criterion. Training will stop when (new_cost > cost - tol)
    """
    w = np.random.rand(X.shape[1])
    cost, grad = cost_function(X, y, w)

    for i in range(max_iter + 1):
        w = w - alpha * grad
        new_cost, grad = cost_function(X, y, w)
        if new_cost > cost - tol:
            print(f'Converged after {i} iterations')
            return w, new_cost
        cost = new_cost

    print('Maximum number of iterations reached')
    return w, cost
Normally at this point you would want to normalize your data set, since gradient descent does not work well with features that have different scales. In our specific data set normalization is not needed, since the ranges of the two features are similar.
Let’s now call this function to optimize our model:
opt_w, cost = optimize_model(X_train, y_train)

print('opt_w:', opt_w)
print('Cost at opt_w:', cost)
The algorithm converges after 1,413 iterations and the optimal w* we get is:
Converged after 1413 iterations
opt_w: [ 0.28014029 0.80541854 -1.48367938]
Cost at opt_w: 0.28389717767222555
There are other optimizers you can use that are often faster than gradient descent, such as conjugate gradient (CG) and truncated Newton (TNC). See scipy.optimize.minimize for more details on how to use these optimizers.
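For example, here is a sketch of how our cost_function could be plugged into scipy.optimize.minimize (jac=True tells the optimizer that the function returns both the cost and its gradient; the exact settings are illustrative):

from scipy.optimize import minimize

w0 = np.random.rand(X_train.shape[1])  # random starting point
res = minimize(lambda w: cost_function(X_train, y_train, w), w0, jac=True, method='TNC')
print('opt_w:', res.x)           # the optimal weights found by the optimizer
print('Cost at opt_w:', res.fun)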
Using the Model for Predictions
Now that we’ve found the optimal parameters of the model, we will use it for predictions.
First, let's write a function that takes a matrix of new samples X and returns their probabilities of belonging to the positive class:
def predict_prob(X, w):
    """ Return the probability that samples in X belong to the positive class
        X: the feature matrix (every row in X represents one sample)
        w: the learned logistic regression parameters
    """
    p = sigmoid(X @ w)
    return p
The function computes the predictions of the model by simply taking the sigmoid of Xᵗw (which computes σ(wᵗx) for every row x in the matrix).
For example, let's find the probability that a sample located at (6, 2) belongs to the versicolor class:
predict_prob([[1, 6, 2]], opt_w)
array([0.89522808])
This sample has an 89.52% probability of being a versicolor flower. This makes sense, since this sample is located well inside the area of the versicolor flowers, far away from the border between the classes.
On the other hand, the probability that a sample located at (5.5, 3) belongs to the versicolor class is:
predict_prob([[1, 5.5, 3]], opt_w)
array([0.56436688])
This time the probability is much lower (only 56.44%), since this sample is close to the border between the classes.
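To turn these probabilities into hard class predictions, we can simply threshold them at 0.5 (a small helper sketch; the predict function below is not part of the code above):

def predict(X, w, threshold=0.5):
    """ Predict the class (0 or 1) of each sample in X. """
    return (predict_prob(X, w) >= threshold).astype(int)

predict([[1, 6, 2], [1, 5.5, 3]], opt_w)  # => array([1, 1]), i.e., both samples are classified as versicolor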