The next plot shows the log loss when *y* = 1:

The log loss equals to 0 only in case of an ideal prediction (*p* = 1 and *y* = 1, or *p* = 0 and *y* = 0), and approaches infinity because the prediction gets worse (i.e., when *y* = 1 and *p* → 0 or *y* = 0 and *p* → 1).

The** cost function **calculates the common loss over the entire data set:

The associated fee function might be written in a vectorized form as follows:

where **y** = (*y*₁, …, *yₙ*) is a vector that accommodates all of the labels of the training samples, and **p** = (*p*₁, …, *pₙ*) is a vector that accommodates all the expected probabilities of the model for all of the training samples.

This cost function is convex, i.e., it has a single global minimum. Nonetheless, there isn’t a closed-form solution for locating the optimal **w*** (as a consequence of the non-linearities introduced by the log function). Subsequently, we’d like to make use of iterative optimization methods reminiscent of gradient descent with the intention to find the minimum.

Gradient descent is an iterative approach for locating a minimum of a function, where we take small steps in the wrong way of the gradient with the intention to catch up with to the minimum:

With a view to use gradient descent to search out the minimum of the least squares cost, we’d like to compute the partial derivatives of *J*(**w**) with respect to every certainly one of the weights.

The partial derivative of *J*(**w**) with respect to any of the weights *wⱼ* is:

**Proof**:

In vectorized form, we will write the gradient vector as follows:

And the gradient descent update rule is:

where *α* is a learning rate that controls the step size (0 < *α *< 1).

Note that each time you utilize gradient descent, it’s essential to ensure that your data set is **normalized **(otherwise gradient descent may take steps of various sizes in numerous directions, which is able to make it unstable).

We are going to now implement the logistic regression model in Python from scratch, including its cost function and gradient computation, optimizing the model using gradient descent, evaluation of the model, and plotting the ultimate decision boundary.

For the demonstration we are going to use the Iris data set (BSD license). The unique data set accommodates 150 samples of Iris flowers that belong to certainly one of three species (Setosa, Versicolor and Virginica). We are going to make it right into a binary classification problem through the use of only the primary two sorts of flowers (Setosa and Versicolor). As well as, we are going to use only first two features of every flower (sepal width and sepal length).

## Loading the Data Set

Let’s first import the required libraries and fix the random seed with the intention to get reproducible results:

`import numpy as np`

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as snsnp.random.seed(0)

Next, we load the information set:

`from sklearn.datasets import load_iris`iris = load_iris()

X = iris.data[:, :2] # Take only the primary two features

y = iris.goal

# Take only the setosa and versicolor flowers

X = X[(y == 0) | (y == 1)]

y = y[(y == 0) | (y == 1)]

Let’s plot the information:

`def plot_data(X, y):`

sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=iris.target_names[y], style=iris.target_names[y],

palette=['r','b'], markers=('s','o'), edgecolor='k')

plt.xlabel(iris.feature_names[0])

plt.ylabel(iris.feature_names[1])

plt.legend()

`plot_data(X, y)`

As might be seen, the information set is linearly separable, subsequently logistic regression should give you the chance to search out the boundary between the 2 classes.

Next, we’d like so as to add a column of ones to the features matrix *X* with the intention to represent the bias (*w*₀):

`# Add a column for the bias`

n = X.shape[0]

X_with_bias = np.hstack((np.ones((n, 1)), X))

We now split the information set into training and test sets:

`from sklearn.model_selection import train_test_split`X_train, X_test, y_train, y_test = train_test_split(X_with_bias, y, random_state=0)

## Model Implementation

We at the moment are able to implement the logistic regression model. We start by defining a helper function to compute the sigmoid function:

`def sigmoid(z):`

""" Compute the sigmoid of z (z generally is a scalar or a vector). """

z = np.array(z)

return 1 / (1 + np.exp(-z))

Next, we implement the fee function that returns the fee of a logistic regression model with parameters **w** on a given data set (*X*, **y**), and in addition its gradient with respect to **w**.

`def cost_function(X, y, w):`

""" J, grad = cost_function(X, y, w) computes the fee of a logistic regression model

with parameters w and the gradient of the fee w.r.t. to the parameters. """

# Compute the fee

p = sigmoid(X @ w)

J = -(1/n) * (y @ np.log(p) + (1-y) @ np.log(1-p)) # Compute the gradient

grad = (1/n) * X.T @ (p - y)

return J, grad

Note that we’re using the vectorized types of the fee and the gradient functions which have been shown previously.

To sanity check this function, let’s compute the fee and gradient of the model on some random weight vector:

`w = np.random.rand(X_train.shape[1])`

cost, grad = cost_function(X_train, y_train, w)print('w:', w)

print('Cost at w:', cost)

print('Gradient at w (zeros):', grad)

The output we get is:

`w: [0.5488135 0.71518937 0.60276338]`

Cost at w: 2.314505839067951

Gradient at w (zeros): [0.36855061 1.86634895 1.27264487]

## Gradient Descent Implementation

We now implement gradient descent with the intention to find the optimal **w*** that minimizes the fee function of the model on a given training set. The algorithm will run at most *max_iter* passes over the training set (defaults to 5000), unless the fee has not decreased by a minimum of *tol* (defaults to 0.0001) because the previous iteration, wherein case the training stops immediately.

`def optimize_model(X, y, alpha=0.01, max_iter=5000, tol=0.0001):`

""" Optimize the model using gradient descent.

X, y: The training set

alpha: The training rate

max_iter: The utmost variety of passes over the training set (epochs)

tol: The stopping criterion. Training will stop when (new_cost > cost - tol)

"""

w = np.random.rand(X.shape[1])

cost, grad = cost_function(X, y, w)for i in range(max_iter + 1):

w = w - alpha * grad

new_cost, grad = cost_function(X, y, w)

if new_cost > cost - tol:

print(f'Converged after {i} iterations')

return w, new_cost

cost = new_cost

print('Maximum variety of iterations reached')

return w, cost

Normally at this point you would need to normalize your data set, since gradient descent doesn’t work well with features which have different scales. In our specific data set normalization shouldn’t be needed because the ranges of the 2 features are similar.

Let’s now call this function to optimize our model:

`opt_w, cost = optimize_model(X_train, y_train)`print('opt_w:', opt_w)

print('Cost at opt_w:', cost)

The algorithm converges after 1,413 iterations and the optimal **w*** we get is:

`Converged after 1413 iterations`

opt_w: [ 0.28014029 0.80541854 -1.48367938]

Cost at opt_w: 0.28389717767222555

There are other optimizers you need to use which are sometimes faster than gradient descent, reminiscent of conjugate gradient (CG) and truncated Newton (TNC). See scipy.optimize.minimize for more details on methods to use these optimizers.

## Using the Model for Predictions

Now that we’ve found the optimal parameters of the model, we will use it for predictions.

First, let’s write a function that gets a matrix of recent samples *X* and returns their probabilities of belonging to the positive class:

`def predict_prob(X, w):`

""" Return the probability that samples in X belong to the positive class

X: the feature matrix (every row in X represents one sample)

w: the learned logistic regression parameters

"""

p = sigmoid(X @ w)

return p

The function computes the predictions of the model by simply taking the sigmoid of *Xᵗ***w **(which computes *σ*(**w***ᵗ***x**) for each row **x** within the matrix).

For instance, let’s discover the probability that a sample positioned at (6, 2) belongs to the versicolor class:

`predict_prob([[1, 6, 2]], opt_w)`

`array([0.89522808])`

This sample has 89.52% probability of being a versicolor flower. This is sensible since this sample is positioned well inside the area of the versicolor flowers removed from the border between the classes.

Alternatively, the probability that a sample positioned at (5.5, 3) belongs to the versicolor class is:

`predict_prob([[1, 5.5, 3]], opt_w)`

`array([0.56436688])`

This time the probability is way lower (only 56.44%), since this sample is near the border between the classes.