
How to Evaluate the Performance of Your ML/AI Models
1. Split the dataset for better evaluation.
2. Define your evaluation metrics.
3. Validate and tune the model’s hyperparameters.
4. Iterate and refine
Final Thoughts

An accurate evaluation is the only way to improve performance

Towards Data Science
Photo by Scott Graham on Unsplash

Learning by doing is one of the best approaches to learning anything, from tech to a new language or cooking a new dish. Once you have learned the basics of a field or an application, you can build on that knowledge by acting. Building models for different applications is the best way to make your knowledge of machine learning and artificial intelligence concrete.

Though both fields (or really sub-fields, since they do overlap) have applications in a wide range of contexts, the steps for learning how to build a model are more or less the same regardless of the target application field.

AI language models such as ChatGPT and Bard are gaining popularity and interest from both tech novices and general audiences because they can be very useful in our daily lives.

Now that more models are being released and presented, one may ask: what makes a “good” AI/ML model, and how can we evaluate the performance of one?

That is what we’re going to cover in this article. We assume you already have an AI or ML model built and now want to evaluate and improve its performance (if needed). Regardless of the type of model you have and your end application, you can take steps to evaluate your model and improve its performance.

To help us follow along with the concepts, let’s use the Wine dataset from sklearn [1], apply the support vector classifier (SVC), and then test its metrics.

So, let’s jump right in…

First, let’s import the libraries we’ll use (don’t worry about what each of these does for now, we’ll get to that!).

import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
import matplotlib.pyplot as plt

Now, we read our dataset, apply the classifier, and evaluate it.

wine_data = datasets.load_wine()
X = wine_data.data
y = wine_data.target

1. Split the dataset for better evaluation.

Depending on your stage in the learning process, you may need access to a large amount of data that you can use for training, testing, and evaluating. Also, you can’t use the same data to train and test your model because that would prevent you from genuinely assessing your model’s performance.

To overcome that challenge, split your data into three smaller random sets and use them for training, validating, and testing.

A good rule of thumb for this split is a 60/20/20 approach: you use 60% of the data for training, 20% for validation, and 20% for testing. You need to shuffle your data before you do the split to ensure a better representation of that data.

I know that may sound complicated, but luckily, scikit-learn comes to the rescue by offering a function that performs the split for you, train_test_split().

So, we can take our dataset and split it like so:

X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.20, train_size=0.60, random_state=1, stratify=y)
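
Note that the call above returns only the training and test portions; the remaining 20% of the data is not returned. Here is a minimal sketch of one way to get all three sets, assuming two consecutive calls to train_test_split(). The names X_rest, X_val, and y_val are illustrative and not part of the original code.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_wine(return_X_y=True)

# Step 1: hold out 20% of the data as the test set (shuffled and stratified).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1, stratify=y)

# Step 2: split the remaining 80% into 60% training and 20% validation.
# 0.25 of the remaining 80% equals 20% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1, stratify=y_rest)

print(len(X_train), len(X_val), len(X_test))
```

The validation set is then used while tuning the model, and the test set is kept aside for the final check.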

Then use the training portion of it as input to the classifier.

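#Scale data (fit the scaler on the training set only, then reuse it on the test set)
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
#Apply SVC model
#Note: the model is fit on the unscaled features; pass X_train_std / X_test_std instead to use the scaled data
svc = SVC(kernel='linear', C=10.0, random_state=1)
svc.fit(X_train, Y_train)
#Obtain predictions
Y_pred = svc.predict(X_test)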
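
If you want the classifier to actually consume the scaled features, one option is to chain the scaler and the model. The sketch below assumes scikit-learn’s make_pipeline() helper, which the original code does not use; the names scaled_svc and Y_pred_scaled are illustrative. Its scores may differ from the ones reported later in this article.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = datasets.load_wine(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.20, train_size=0.60, random_state=1, stratify=y)

# The pipeline fits the scaler on the training data only and applies the
# same scaling automatically whenever predict() is called.
scaled_svc = make_pipeline(
    StandardScaler(),
    SVC(kernel='linear', C=10.0, random_state=1))
scaled_svc.fit(X_train, Y_train)

Y_pred_scaled = scaled_svc.predict(X_test)
print(Y_pred_scaled[:10])
```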

At this point, we have some results to “evaluate.”

2. Define your evaluation metrics.

Before starting the evaluation process, we must ask ourselves an essential question about the model we use: what would make this model good?

The answer to this question depends on the model and how you plan to use it. That being said, there are standard evaluation metrics that data scientists use when they want to test the performance of an AI/ML model, including:

  1. Accuracy is the percentage of correct predictions by the model out of the total predictions. Meaning, when I run the model, how many predictions are correct among all predictions? This article goes into depth about testing the accuracy of a model.
  2. Precision is the percentage of true positive predictions by the model out of all positive predictions. Unfortunately, precision and accuracy are often confused; one way to make the difference between them clear is to think of accuracy as the closeness of the predictions to the actual values, while precision is how close the correct predictions are to each other. So, accuracy is an absolute measure, yet both are important to evaluate the model’s performance.
  3. Recall is the percentage of true positive predictions out of all actual positive instances in the dataset. Recall aims to find all the relevant instances within a dataset. In practice, increasing recall usually comes at the cost of lower precision.
  4. F1 score is the harmonic mean of precision and recall, providing a balanced measure of a model’s performance using both precision and recall. This video by CodeBasics discusses the relation between precision, recall, and F1 score and how to find the optimal balance of those evaluation metrics.
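
To make the four definitions concrete, here is a small sketch that computes them by hand from the counts of true/false positives and negatives. The labels are made up for illustration and are not taken from the Wine dataset.

```python
# Toy binary labels, made up for illustration (1 = positive, 0 = negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)        # correct predictions / all predictions
precision = tp / (tp + fp)               # true positives / predicted positives
recall = tp / (tp + fn)                  # true positives / actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print('Accuracy: %.3f' % accuracy)
print('Precision: %.3f' % precision)
print('Recall: %.3f' % recall)
print('F1 Score: %.3f' % f1)
```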

Now, let’s calculate the different metrics for the predicted data. The way we’ll do that is by first displaying the confusion matrix. The confusion matrix is simply the actual results of the data vs. the predicted results.

conf_matrix = confusion_matrix(y_true=Y_test, y_pred=Y_pred)
#Plot the confusion matrix
fig, ax = plt.subplots(figsize=(5, 5))
ax.matshow(conf_matrix, cmap=plt.cm.Oranges, alpha=0.3)
#Write each count into its cell (rows are actual classes, columns are predicted classes)
for i in range(conf_matrix.shape[0]):
    for j in range(conf_matrix.shape[1]):
        ax.text(x=j, y=i, s=conf_matrix[i, j], va='center', ha='center', size='xx-large')
plt.xlabel('Predicted Values', fontsize=18)
plt.ylabel('Actual Values', fontsize=18)
plt.show()

If we look at the confusion matrix for our dataset, we can see that the actual value was “1” in some cases while the predicted value was “0”, which means the classifier is not 100% accurate.
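
As a rough guide to reading such a matrix, here is a small sketch with made-up labels (not the Wine predictions). In scikit-learn’s confusion_matrix(), rows correspond to the actual classes and columns to the predicted classes, so every off-diagonal entry is a misclassification.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for three classes, for illustration only.
y_actual = [0, 0, 1, 1, 1, 2, 2]
y_predicted = [0, 0, 0, 1, 1, 2, 2]

cm = confusion_matrix(y_actual, y_predicted)
print(cm)

# cm[i, j] counts the samples whose actual class is i and predicted class is j.
misclassified = cm.sum() - np.trace(cm)
print('Actual 1 predicted as 0:', cm[1, 0])
print('Total misclassified:', misclassified)

# Per-class recall: diagonal entry divided by the row total.
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print(per_class_recall)
```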

We can calculate this classifier’s accuracy, precision, recall, and F1 score using this code.

print('Precision: %.3f' % precision_score(Y_test, Y_pred, average='micro'))
print('Recall: %.3f' % recall_score(Y_test, Y_pred, average='micro'))
print('Accuracy: %.3f' % accuracy_score(Y_test, Y_pred))
print('F1 Score: %.3f' % f1_score(Y_test, Y_pred, average='micro'))

For this particular example, the results are:

  1. Precision = 0.889
  2. Recall = 0.889
  3. Accuracy = 0.889
  4. F1 score = 0.889
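
The four values are identical because, for a single-label multiclass problem, average='micro' pools all classes together, which makes precision, recall, and F1 all equal to accuracy. The sketch below uses made-up labels to show how average='macro' (the unweighted mean of the per-class scores) can give different numbers; treat it as an illustration, not as the article’s results.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels for three classes, for illustration only.
y_true = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 1, 2, 2, 0]

accuracy = accuracy_score(y_true, y_pred)
micro_precision = precision_score(y_true, y_pred, average='micro')
micro_recall = recall_score(y_true, y_pred, average='micro')
macro_precision = precision_score(y_true, y_pred, average='macro')
macro_recall = recall_score(y_true, y_pred, average='macro')

print('Accuracy: %.3f' % accuracy)
print('Micro precision / recall: %.3f / %.3f' % (micro_precision, micro_recall))
print('Macro precision / recall: %.3f / %.3f' % (macro_precision, macro_recall))
```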

Though you can use different approaches to evaluate your models, some evaluation methods will better estimate the model’s performance based on the model type. For example, in addition to the above methods, if the model you’re evaluating is a regression model (or includes regression), you can also use:

– Mean Squared Error (MSE) is the average of the squared differences between predicted and actual values.

– Mean Absolute Error (MAE) is the average of the absolute differences between predicted and actual values.

Those two metrics are closely related, but implementation-wise, MAE is simpler (at least mathematically) than MSE. However, MAE weighs every error in proportion to its size and so does not single out large errors, unlike MSE, which emphasizes them (since it squares them).
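
A small sketch of that difference, using made-up numbers: both sets of predictions below have the same MAE, but the one with a single large error has a much higher MSE.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up regression targets and two sets of predictions, for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
pred_even = [11.0, 21.0, 31.0, 41.0]     # four small errors of 1
pred_outlier = [10.0, 20.0, 30.0, 44.0]  # one large error of 4

mae_even = mean_absolute_error(actual, pred_even)
mse_even = mean_squared_error(actual, pred_even)
mae_outlier = mean_absolute_error(actual, pred_outlier)
mse_outlier = mean_squared_error(actual, pred_outlier)

print('Even errors:     MAE = %.2f, MSE = %.2f' % (mae_even, mse_even))
print('One large error: MAE = %.2f, MSE = %.2f' % (mae_outlier, mse_outlier))
```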

3. Validate and tune the model’s hyperparameters.

Before discussing hyperparameters, let’s first differentiate between a hyperparameter and a parameter. A parameter is part of how a model is defined to solve a problem; its value is learned from the data during training. In contrast, hyperparameters are settings used to test, validate, and optimize the model’s performance. They are often chosen by the data scientists (or the client, in some cases) to control and validate the learning process of the model and, hence, its performance.
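
As a rough illustration of the distinction, using the same SVC settings as earlier in this article: kernel and C are hyperparameters we pick before training, while the coefficients and intercepts are parameters that only exist after the model has been fit. The scaling step and the variable names here are assumptions made for the sketch.

```python
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = datasets.load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# Hyperparameters: chosen by us before any training happens.
svc = SVC(kernel='linear', C=10.0, random_state=1)
hyperparameters = svc.get_params()
print(hyperparameters['kernel'], hyperparameters['C'])

# Parameters: learned from the data when fit() runs.
svc.fit(X_std, y)
print(svc.coef_.shape)       # weights of the linear decision functions
print(svc.intercept_.shape)  # one intercept per decision function
```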

There are different types of hyperparameters that you can use to validate your model; some are general and can be used on any model, such as:

  • Learning Rate: this hyperparameter controls how much the model needs to change in response to some error when the model’s parameters are updated. Choosing the optimal learning rate is a trade-off with the time needed for the training process. If the learning rate is low, it may slow down the training process. In contrast, if the learning rate is too high, the training process will be faster, but the model performance may suffer.
  • Batch Size: the number of training samples the model processes before each update significantly affects the model’s training time and how it interacts with the learning rate. Finding the optimal batch size is a skill that develops as you build more models and grow your experience.
  • Number of Epochs: an epoch is one complete pass over the training data. The number of epochs to use varies from one model to another. Theoretically, more epochs lead to fewer errors in the validation process, although too many can cause the model to overfit.
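
To see where these three hyperparameters enter a training loop, here is a minimal sketch of mini-batch gradient descent on a made-up linear regression problem. The data, the train() helper, and the chosen values are all assumptions for illustration; they are unrelated to the SVC example, which is not trained this way.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: y is roughly 3 * x + 0.5 plus a little noise.
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 0.5 + rng.normal(scale=0.1, size=200)

def train(learning_rate, batch_size, n_epochs):
    """Fit y = w * x + b with mini-batch gradient descent; return (w, b, mse)."""
    w, b = 0.0, 0.0
    for _ in range(n_epochs):                       # one epoch = one full pass
        order = rng.permutation(len(X))             # reshuffle every epoch
        for start in range(0, len(X), batch_size):  # one update per batch
            idx = order[start:start + batch_size]
            error = w * X[idx, 0] + b - y[idx]
            # The learning rate scales how far each update moves w and b.
            w -= learning_rate * 2 * np.mean(error * X[idx, 0])
            b -= learning_rate * 2 * np.mean(error)
    mse = np.mean((w * X[:, 0] + b - y) ** 2)
    return w, b, mse

for lr in (0.001, 0.1):
    w, b, mse = train(learning_rate=lr, batch_size=20, n_epochs=5)
    print('learning_rate=%.3f -> w=%.2f, b=%.2f, MSE=%.4f' % (lr, w, b, mse))
```

With the same batch size and number of epochs, the smaller learning rate leaves the model far from the underlying relationship, while the larger one gets close to it within a few epochs.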

In addition to the above hyperparameters, there are model-specific hyperparameters, such as regularization strength or the number of hidden layers when implementing a neural network. This 15-minute video by APMonitor explores various hyperparameters and their differences.
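
For the SVC used in this article, the regularization strength is C. One common way to tune such hyperparameters is a cross-validated grid search; the sketch below assumes scikit-learn’s GridSearchCV, which the original code does not use, and the candidate values in param_grid are arbitrary choices for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = datasets.load_wine(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.20, random_state=1, stratify=y)

pipeline = make_pipeline(StandardScaler(), SVC(random_state=1))

# Candidate hyperparameter values (arbitrary, for illustration).
# The 'svc__' prefix targets the SVC step created by make_pipeline().
param_grid = {
    'svc__C': [0.1, 1.0, 10.0],
    'svc__kernel': ['linear', 'rbf'],
}

# 5-fold cross-validation on the training data plays the role of the validation set.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
search.fit(X_train, Y_train)

test_accuracy = search.score(X_test, Y_test)
print(search.best_params_)
print('Test accuracy: %.3f' % test_accuracy)
```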


4. Iterate and refine

Validating an AI/ML model is not a linear process but an iterative one. You go through the data split, hyperparameter tuning, analysis, and validation of the results, often more than once. The number of times you repeat that process depends on the analysis of the results. For some models, you may only need to do this once; for others, you may need to do it several times.

If you need to repeat the process, you’ll use the insights from the previous evaluation to improve the model’s architecture, training process, or hyperparameter settings until you’re satisfied with the model’s performance.
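
One possible shape for that loop is sketched below, assuming a 60/20/20 split and a handful of candidate values for C (both the candidates and the variable names are illustrative). Each iteration is scored on the validation set, and the test set is touched only once, at the very end.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = datasets.load_wine(return_X_y=True)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1, stratify=y_rest)

def build_model(C):
    """Scaler plus linear SVC with the given regularization strength."""
    return make_pipeline(StandardScaler(), SVC(kernel='linear', C=C, random_state=1))

candidates = (0.01, 0.1, 1.0, 10.0)
best_C, best_val_score = None, -1.0
for C in candidates:
    model = build_model(C).fit(X_train, y_train)
    val_score = accuracy_score(y_val, model.predict(X_val))
    print('C=%.2f -> validation accuracy %.3f' % (C, val_score))
    if val_score > best_val_score:
        best_C, best_val_score = C, val_score

# Final check on data the tuning loop never saw.
final_model = build_model(best_C).fit(X_train, y_train)
test_score = accuracy_score(y_test, final_model.predict(X_test))
print('Best C: %s, test accuracy: %.3f' % (best_C, test_score))
```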

Final Thoughts

When you start building your own ML and AI models, you’ll quickly realize that choosing and implementing the model is the easy part of the workflow. Testing and evaluation, however, is the part that will take most of the development process. Evaluating an AI/ML model is an iterative and often time-consuming process, and it requires careful analysis, experimentation, and fine-tuning to achieve the desired performance.

Luckily, the more experience you have building models, the more systematic the process of evaluating your model’s performance will get. And it’s a worthwhile skill considering the importance of evaluating your model:

  1. Evaluating our models allows us to objectively measure the model’s metrics, which helps in understanding its strengths and weaknesses and provides insights into its predictive or decision-making capabilities.
  2. If different models that can solve the same problem exist, then evaluating them enables us to compare their performance and choose the one that suits our application best.
  3. Evaluation provides insights into the model’s weaknesses, allowing for improvements through analyzing the errors and areas where the model underperforms.

So, have patience and keep building models; the process gets better and more efficient the more models you build. Don’t let the process details discourage you. It may look like a complex process, but once you understand the steps, it will become second nature to you.

Reference

[1] Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. (CC BY 4.0)

