Deep Dive into PFI for Model Interpretability
What’s the Permutation Feature Importance
Interpreting the PFI
Training Interpretation
Test Interpretation
The issues with PFI
Implementing the PFI in Python
Conclusion

Another interpretability tool for your toolbox

Towards Data Science
Photo by fabio on Unsplash

Knowing how to assess your model is essential for your work as a data scientist. Nobody will sign off on your solution if you're not able to fully understand and communicate it to your stakeholders. This is why knowing interpretability methods is so important.

The lack of interpretability can kill an excellent model. I haven't developed a model where my stakeholders weren't interested in understanding how the predictions were made. Therefore, knowing how to interpret a model and communicate it to the business is an essential skill for a data scientist.

In this post, we're going to explore Permutation Feature Importance (PFI), a model-agnostic method that can help us identify the most important features of our model and, therefore, communicate better what the model is considering when making its predictions.

The PFI method estimates how important a feature is for the model's results based on what happens to the model's performance when we break the feature's connection to the target variable.

To do that, for each feature whose importance we want to analyze, we randomly shuffle it while keeping all the other features and the target unchanged.

This makes the feature useless for predicting the target, since we broke the connection between them by changing their joint distribution.

Then we can use our model to make predictions on the shuffled dataset. The amount of performance degradation indicates how important that feature is.

The algorithm then looks something like this:

  • We train a model on a training dataset and then assess its performance on both the training and the test dataset
  • For each feature, we create a new dataset where that feature is shuffled
  • We then use the trained model to predict the output on the new dataset
  • The ratio of the new error metric to the original one gives us the importance of the feature

Notice that if a feature is not important, the performance of the model should not vary much. If it is important, the performance must suffer a lot.
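To make the recipe concrete, here is a minimal sketch of that loop. The function name, the error_fn argument, and the choice of reporting an error ratio are my own, for illustration; the full walkthrough on a real dataset comes later in the post.

import numpy as np

def permutation_error_ratio(model, X, y, error_fn, n_repeats=10, seed=None):
    """For each column of X, return the mean ratio between the error after
    shuffling that column and the original, unshuffled error."""
    rng = np.random.default_rng(seed)
    baseline = error_fn(y, model.predict(X))  # assumed to be greater than zero
    importances = []
    for col in range(X.shape[1]):
        ratios = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            rng.shuffle(X_perm[:, col])  # break the link between this feature and the target
            ratios.append(error_fn(y, model.predict(X_perm)) / baseline)
        importances.append(float(np.mean(ratios)))
    return importances

A ratio close to 1 means the feature barely matters; the further above 1 it goes, the more important the feature is.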

Now that we know how to calculate the PFI, how do we interpret it?

It depends on which fold we apply the PFI to. We usually have two options: applying it to the training set or to the test set.

During training, our model learns the patterns in the data and tries to represent them. Of course, during training we have no idea how well our model generalizes to unseen data.

Therefore, by applying the PFI to the training dataset we are going to see which features were the most relevant for the representation of the data learned by the model.

In business terms, this indicates which features were the most important for building the model.

Now, if we apply the method to the test dataset, we are going to see the feature's impact on the generalization of the model.

Let's think about it. If the performance of the model goes down on the test set after we shuffle a feature, it means that that feature was important for the performance on that set. Since the test set is what we use to test generalization (if you're doing everything right), we can say that the feature is important for generalization.

The PFI analyzes the effect of a feature on your model's performance; therefore, it doesn't say anything about the raw data itself. If your model's performance is poor, then any relation you find with PFI will be meaningless.

This is true for both sets: if your model is underfitting (low predictive power on the training set) or overfitting (low predictive power on the test set), then you cannot take insights from this method.

Also, when two features are highly correlated, the PFI can mislead your interpretation. If you shuffle one feature but the required information is encoded in another one, then the performance may not suffer at all, which may make you think the feature is useless when that is not the case.
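A quick sanity check before leaning on the PFI is to look at the correlation matrix of your features. The snippet below is a minimal sketch of my own, assuming X is your feature matrix; the 0.8 threshold is arbitrary.

import pandas as pd

# Flag feature pairs whose absolute correlation is high enough to make
# the PFI of either feature hard to interpret on its own.
corr = pd.DataFrame(X).corr().abs()
suspect_pairs = [
    (i, j, round(corr.iloc[i, j], 2))
    for i in range(corr.shape[1])
    for j in range(i + 1, corr.shape[1])
    if corr.iloc[i, j] > 0.8  # arbitrary threshold
]
print(suspect_pairs)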

To implement the PFI in Python, we must first import the required libraries. For this, we are going to use mainly numpy, pandas, tqdm, and sklearn:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes, load_iris
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import accuracy_score, r2_score

Now we must load our dataset, which is going to be the Iris dataset. Then we are going to fit a Random Forest to the data.

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=12, shuffle=True
)

rf = RandomForestClassifier(
    n_estimators=3, random_state=32
).fit(X_train, y_train)

With our model fitted, let's analyze its performance to see whether we can safely apply the PFI to see how the features impact our model:

print(accuracy_score(rf.predict(X_train), y_train))
print(accuracy_score(rf.predict(X_test), y_test))

We can see we achieved 99% accuracy on the training set and 95.5% accuracy on the test set. Looks good for now. Let's save the original error scores for a later comparison:

original_error_train = 1 - accuracy_score(rf.predict(X_train), y_train)
original_error_test = 1 - accuracy_score(rf.predict(X_test), y_test)

Now let's calculate the permutation scores. For that, it is usual to run the shuffle for each feature several times to get statistics of the feature scores and avoid any coincidences. In our case, let's do 10 repetitions for each feature:

n_steps = 10

feature_values = {}
for feature in tqdm(range(X.shape[1])):
    # We'll save every new error measurement for each feature
    errors_permuted_train = []
    errors_permuted_test = []

    for step in range(n_steps):
        # We reload the data because np.random.shuffle shuffles in place
        X, y = load_iris(return_X_y=True)
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, random_state=12, shuffle=True
        )
        np.random.shuffle(X_train[:, feature])
        np.random.shuffle(X_test[:, feature])

        # Apply our previously fitted model to the shuffled data to get the new error
        errors_permuted_train.append(1 - accuracy_score(rf.predict(X_train), y_train))
        errors_permuted_test.append(1 - accuracy_score(rf.predict(X_test), y_test))

    feature_values[f'{feature}_train'] = errors_permuted_train
    feature_values[f'{feature}_test'] = errors_permuted_test

We now have a dictionary with the error for each shuffle we did. Next, let's generate a table that has, for each feature on each fold, the average and the standard deviation of the error ratio compared to the original performance of our model:

rows = []
for feature in feature_values:
    if 'train' in feature:
        aux = np.array(feature_values[feature]) / original_error_train
        fold = 'train'
    elif 'test' in feature:
        aux = np.array(feature_values[feature]) / original_error_test
        fold = 'test'

    rows.append({
        'feature': feature.replace(f'_{fold}', ''),
        'fold': fold,
        'mean': np.mean(aux),
        'std': np.std(aux),
    })

PFI = pd.DataFrame(rows).pivot(
    index='feature', columns='fold', values=['mean', 'std']
).reset_index().sort_values(('mean', 'test'), ascending=False)

We'll end up with a table like this:

We can see that feature 2 appears to be the most important feature in our dataset on both folds, followed by feature 3. Since we're not fixing the random seed for numpy's shuffle function, we can expect these numbers to vary.

We can then plot the importances on a chart to get a better visualization:
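The original chart isn't reproduced here, but a minimal matplotlib sketch of my own, using the PFI table built above and its test-fold columns, could look like this:

plot_df = PFI.sort_values(('mean', 'test'))  # ascending, so the most important feature ends up on top
plt.barh(
    plot_df[('feature', '')].astype(str),  # feature names
    plot_df[('mean', 'test')],             # mean error ratio on the test fold
    xerr=plot_df[('std', 'test')],         # standard deviation of the error ratio
)
plt.axvline(1.0, color='grey', linestyle='--')  # a ratio of 1 means no performance change
plt.xlabel('Permuted error / original error (test fold)')
plt.title('Permutation Feature Importance')
plt.tight_layout()
plt.show()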

The PFI is a simple method that can help you quickly identify the most important features. Go ahead and try to apply it to a model you're developing to see how it is behaving.
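If you'd rather not maintain your own loop, scikit-learn also ships a ready-made implementation in sklearn.inspection.permutation_importance. Note that it reports the average drop in score after shuffling rather than the error ratio we computed in this post:

from sklearn.inspection import permutation_importance

# Importance here is the mean decrease in accuracy after shuffling each
# feature on the test set, repeated n_repeats times.
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
print(result.importances_mean)
print(result.importances_std)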

But also pay attention to the limitations of the method. Not knowing where a method falls short will end up leading you to an incorrect interpretation.

Also, notice that the PFI shows the importance of a feature but doesn't state in which direction it is influencing the model output.

So, tell me, how are you going to use this in your next models?

Stay tuned for more posts about interpretability methods that can improve your overall understanding of a model.
