Encoding Categorical Variables: A Deep Dive into Target Encoding
The issue with one-hot encoding
Target encoding — overview of the basic principle
Defining a Target encoding class
Target encoding with the scikit-learn API
References

Data comes in many shapes and forms. One of those shapes and forms is known as categorical data.


This poses an issue because most Machine Learning algorithms use only numerical data as input. However, categorical data is usually not a challenge to deal with, thanks to simple, well-defined functions that transform it into numerical values. If you have taken any data science course, you will be familiar with the one-hot encoding strategy for categorical features. This strategy is great when your features have limited categories. However, you will run into some issues when dealing with high-cardinality features (features with many categories).

Here is how you can use target encoding to transform categorical features into numerical values.

Photo by Sonika Agarwal on Unsplash

Early in any data science course, you are introduced to one-hot encoding as a key technique to deal with categorical values, and rightfully so, as this strategy works very well on low-cardinality features (features with limited categories).

In a nutshell, one-hot encoding transforms each category into a binary vector, where the corresponding category is marked as 'True' or '1', and all other categories are marked with 'False' or '0'.

import pandas as pd

# Sample categorical data
data = {'Category': ['Red', 'Green', 'Blue', 'Red', 'Green']}

# Create a DataFrame
df = pd.DataFrame(data)

# Perform one-hot encoding
one_hot_encoded = pd.get_dummies(df['Category'])

# Display the result
print(one_hot_encoded)

One-hot encoding output. We could improve this by dropping one column: if we know Blue and Green, we can figure out the value of Red. Image by author

While this works great for features with limited categories (fewer than 10–20 categories), as the number of categories increases, the one-hot encoded vectors become longer and sparser, potentially resulting in increased memory usage and computational complexity. Let's take a look at an example.

The code below uses the Amazon Employee Access data, made publicly available on Kaggle: https://www.kaggle.com/datasets/lucamassaron/amazon-employee-access-challenge

The data contains eight categorical feature columns indicating characteristics of the requested resource, role, and workgroup of the employee at Amazon.

data.info()
Column information. Image by author
# Display the number of unique values in each column
unique_values_per_column = data.nunique()

print("Number of unique values in each column:")
print(unique_values_per_column)

The eight features have high cardinality. Image by author

Using one-hot encoding could be difficult in a dataset like this due to the high number of distinct categories for each feature.

# Initial data memory usage
memory_usage = data.memory_usage(deep=True)
total_memory_usage = memory_usage.sum()
print(f"\nTotal memory usage of the DataFrame: {total_memory_usage / (1024 ** 2):.2f} MB")
The initial dataset is 11.24 MB. Image by author
# One-hot encoding the categorical features
data_encoded = pd.get_dummies(data,
                              columns=data.select_dtypes(include='object').columns,
                              drop_first=True)

data_encoded.shape

After one-hot encoding, the dataset has 15,618 columns. Image by author
The resulting dataset is extremely sparse, meaning it contains a lot of 0s and only a few 1s. Image by author
# Memory usage for the one-hot encoded dataset
memory_usage = data_encoded.memory_usage(deep=True)
total_memory_usage = memory_usage.sum()
print(f"\nTotal memory usage of the DataFrame: {total_memory_usage / (1024 ** 2):.2f} MB")
Dataset memory usage increased to 488.08 MB due to the increased number of columns. Image by author

As you can see, one-hot encoding is not a viable solution for dealing with high-cardinality categorical features, as it significantly increases the size of the dataset.

In cases with high-cardinality features, target encoding is a better option.

Target encoding transforms a categorical feature into a numeric feature without adding any extra columns, avoiding turning the dataset into a larger and sparser one.

Target encoding works by converting each category of a categorical feature into its corresponding expected value. The approach to calculating the expected value depends on the value you are trying to predict.

For regression problems, the expected value is simply the average of the target for that category.

For classification problems, the expected value is the conditional probability of the target given that category.

In both cases, we can get the results by simply using the 'groupby' function in pandas.
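For the regression case, here is a minimal sketch; the 'City' and 'Price' columns are made up for illustration and are not part of the dataset used in this article:

import pandas as pd

# Hypothetical regression example: encode 'City' by the average 'Price' per city
df = pd.DataFrame({
    'City': ['Paris', 'London', 'Paris', 'Berlin', 'London'],
    'Price': [300, 250, 320, 200, 270]
})

# Expected value per category = average target value for that category
city_means = df.groupby('City')['Price'].mean()
df['City_encoded'] = df['City'].map(city_means)
print(df)

The classification case, on the Amazon dataset, looks like this: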

# Example of how to calculate the expected value for target encoding of a binary outcome
expected_values = data.groupby('ROLE_TITLE')['ACTION'].value_counts(normalize=True).unstack()
expected_values
The resulting table indicates the probability of each `ACTION` outcome by unique `ROLE_TITLE` ID. Image by author

The resulting table indicates the probability of each "ACTION" outcome by unique "ROLE_TITLE" id. All that is left to do is replace the "ROLE_TITLE" id with the probability of "ACTION" being 1 in the original dataset (i.e., instead of category 117879, the dataset will show 0.889331).
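As a sketch of that replacement step (since ACTION is coded as 0/1, the group mean equals the probability of ACTION being 1; the new column name is purely illustrative):

# Probability of ACTION being 1 for each ROLE_TITLE
role_title_encoding = data.groupby('ROLE_TITLE')['ACTION'].mean()

# Replace each ROLE_TITLE id with its expected value
data['ROLE_TITLE_encoded'] = data['ROLE_TITLE'].map(role_title_encoding)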

While this gives us an intuition of how target encoding works, using this simple method runs the risk of overfitting, especially for rare categories, since in those cases target encoding essentially leaks the target value to the model. Also, the above method can only deal with seen categories, so if your test data has a new category, it won't be able to handle it.

To avoid those errors, you need to make the target encoding transformer more robust.

To make target encoding more robust, you can create a custom transformer class and integrate it with scikit-learn so that it can be used in any model pipeline.

NOTE: The code below is taken from the book "The Kaggle Book" and can be found on Kaggle: https://www.kaggle.com/code/lucamassaron/meta-features-and-target-encoding

import numpy as np
import pandas as pd

from sklearn.base import BaseEstimator, TransformerMixin


class TargetEncode(BaseEstimator, TransformerMixin):

    def __init__(self, categories='auto', k=1, f=1,
                 noise_level=0, random_state=None):
        if isinstance(categories, str) and categories != 'auto':
            self.categories = [categories]
        else:
            self.categories = categories
        self.k = k
        self.f = f
        self.noise_level = noise_level
        self.encodings = dict()
        self.prior = None
        self.random_state = random_state

    def add_noise(self, series, noise_level):
        return series * (1 + noise_level *
                         np.random.randn(len(series)))

    def fit(self, X, y=None):
        if isinstance(self.categories, str) and self.categories == 'auto':
            # Automatically select the object-dtype (categorical) columns
            self.categories = X.columns[X.dtypes == object]

        temp = X.loc[:, self.categories].copy()
        temp['target'] = y
        self.prior = np.mean(y)
        for variable in self.categories:
            avg = (temp.groupby(by=variable)['target']
                       .agg(['mean', 'count']))
            # Compute smoothing
            smoothing = (1 / (1 + np.exp(-(avg['count'] - self.k) /
                         self.f)))
            # The higher the count, the less weight the prior (overall mean) gets
            self.encodings[variable] = dict(self.prior * (1 - smoothing) +
                                            avg['mean'] * smoothing)

        return self

    def transform(self, X):
        Xt = X.copy()
        for variable in self.categories:
            Xt[variable].replace(self.encodings[variable],
                                 inplace=True)
            unknown_value = {value: self.prior for value in
                             X[variable].unique()
                             if value not in
                             self.encodings[variable].keys()}
            if len(unknown_value) > 0:
                Xt[variable].replace(unknown_value, inplace=True)
            Xt[variable] = Xt[variable].astype(float)
            if self.noise_level > 0:
                if self.random_state is not None:
                    np.random.seed(self.random_state)
                Xt[variable] = self.add_noise(Xt[variable],
                                              self.noise_level)
        return Xt

    def fit_transform(self, X, y=None):
        self.fit(X, y)
        return self.transform(X)

It might look daunting at first, but let's break down each part of the code to understand how to create a robust target encoder.

Class Definition

class TargetEncode(BaseEstimator, TransformerMixin):

This first step ensures that you can use this transformer class in scikit-learn pipelines for data preprocessing, feature engineering, and machine learning workflows. It achieves this by inheriting from the scikit-learn classes BaseEstimator and TransformerMixin.

Inheritance allows the TargetEncode class to reuse or override methods and attributes defined in the base classes, in this case, BaseEstimator and TransformerMixin.

BaseEstimator is a base class for all scikit-learn estimators. Estimators are objects in scikit-learn with a “fit” method for training on data and a “predict” method for making predictions.

TransformerMixin is a mixin class for transformers in scikit-learn; it provides additional methods such as "fit_transform", which combines fitting and transforming in a single step.

Inheriting from BaseEstimator and TransformerMixin allows TargetEncode to implement these methods, making it compatible with the scikit-learn API.
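As a minimal sketch of what that compatibility buys us, the finished class can sit directly inside a scikit-learn pipeline. The LogisticRegression model is chosen purely for illustration, and the features and y variables are the ones defined later in this article:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Hypothetical pipeline: target-encode every feature column, then fit a model
pipe = Pipeline([
    ('target_encoding', TargetEncode(categories=list(features.columns))),
    ('model', LogisticRegression(max_iter=1000)),
])

# Fitting the pipeline calls TargetEncode.fit_transform before fitting the model
pipe.fit(features, y)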

Defining the constructor

def __init__(self, categories='auto', k=1, f=1,
             noise_level=0, random_state=None):
    if isinstance(categories, str) and categories != 'auto':
        self.categories = [categories]
    else:
        self.categories = categories
    self.k = k
    self.f = f
    self.noise_level = noise_level
    self.encodings = dict()
    self.prior = None
    self.random_state = random_state

This second step defines the constructor for the “TargetEncode” class and initializes the instance variables with default or user-specified values.

The "categories" parameter determines which columns in the input data should be treated as categorical variables for target encoding. It is set to 'auto' by default to automatically identify categorical columns during the fitting process.

The parameters k, f, and noise_level control the smoothing effect during target encoding and the level of noise added during transformation.
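To get a feel for how k and f interact with the category counts, here is a small illustration of the smoothing formula used in the fit method below; the prior and category mean are made-up values:

import numpy as np

prior = 0.9          # overall mean of the target (made up)
category_mean = 0.5  # mean of the target within one category (made up)
k, f = 1, 1

for count in [1, 5, 50]:
    smoothing = 1 / (1 + np.exp(-(count - k) / f))
    encoding = prior * (1 - smoothing) + category_mean * smoothing
    print(f"count={count}: smoothing={smoothing:.3f}, encoding={encoding:.3f}")

With a single observation, the encoding sits halfway between the prior and the category mean; with 50 observations, it is essentially the category mean.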

Adding noise

This next step is very important to avoid overfitting.

def add_noise(self, series, noise_level):
    return series * (1 + noise_level *
                     np.random.randn(len(series)))

The "add_noise" method adds random noise to introduce variability and prevent overfitting during the transformation phase.

"np.random.randn(len(series))" generates an array of random numbers drawn from a standard normal distribution (mean = 0, standard deviation = 1).

Multiplying this array by "noise_level" scales the random noise to the specified noise level.

This step contributes to the robustness and generalization capabilities of the target encoding process.
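As a quick illustration with made-up encoded values, a noise_level of 0.05 perturbs each value by a few percent:

import numpy as np
import pandas as pd

np.random.seed(0)
encoded = pd.Series([0.7, 0.5, 0.9])

# Same operation as add_noise: multiply by (1 + noise_level * standard normal noise)
noisy = encoded * (1 + 0.05 * np.random.randn(len(encoded)))
print(noisy)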

Fitting the Target encoder

This part of the code trains the target encoder on the provided data by calculating the target encodings for the categorical columns and storing them for later use during transformation.

def fit(self, X, y=None):
    if isinstance(self.categories, str) and self.categories == 'auto':
        # Automatically select the object-dtype (categorical) columns
        self.categories = X.columns[X.dtypes == object]

    temp = X.loc[:, self.categories].copy()
    temp['target'] = y
    self.prior = np.mean(y)
    for variable in self.categories:
        avg = (temp.groupby(by=variable)['target']
                   .agg(['mean', 'count']))
        # Compute smoothing
        smoothing = (1 / (1 + np.exp(-(avg['count'] - self.k) /
                     self.f)))
        # The higher the count, the less weight the prior (overall mean) gets
        self.encodings[variable] = dict(self.prior * (1 - smoothing) +
                                        avg['mean'] * smoothing)

The smoothing term helps prevent overfitting, especially when dealing with categories with few samples.

The method follows the scikit-learn convention for fit methods in transformers.

It starts by identifying the categorical columns and creating a temporary DataFrame containing only the selected categorical columns from the input X and the target variable y.

The prior mean of the target variable is calculated and stored in the prior attribute. This represents the overall mean of the target variable across the entire dataset.

Then, it calculates the mean and count of the target variable for each category using the groupby method, as seen previously.

There is an additional smoothing step to prevent overfitting on categories with small numbers of samples. Smoothing is calculated based on the number of samples in each category: the larger the count, the more the encoding relies on the category mean and the less it is pulled toward the overall prior.

The calculated encodings for each category of the current variable are stored in the encodings dictionary. This dictionary will be used later during the transformation phase.

Transforming the data

This part of the code replaces the original categorical values with their corresponding target-encoded values stored in self.encodings.

def transform(self, X):
    Xt = X.copy()
    for variable in self.categories:
        Xt[variable].replace(self.encodings[variable],
                             inplace=True)
        unknown_value = {value: self.prior for value in
                         X[variable].unique()
                         if value not in
                         self.encodings[variable].keys()}
        if len(unknown_value) > 0:
            Xt[variable].replace(unknown_value, inplace=True)
        Xt[variable] = Xt[variable].astype(float)
        if self.noise_level > 0:
            if self.random_state is not None:
                np.random.seed(self.random_state)
            Xt[variable] = self.add_noise(Xt[variable],
                                          self.noise_level)
    return Xt

This step has an additional robustness check to ensure the target encoder can handle new or unseen categories. Those new or unknown categories are replaced with the mean of the target variable stored in the prior attribute.
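Here is a small sketch of that fallback, assuming te has been fitted on ROLE_TITLE as in the example further below; the id 999999 is made up and was never seen during fitting:

# Hypothetical check: 999999 is a ROLE_TITLE id that was not present during fit
unseen = pd.DataFrame({'ROLE_TITLE': [117879, 999999]})
print(te.transform(unseen))
# 117879 gets its learned encoding, while 999999 falls back to te.prior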

If you need more robustness against overfitting, you can set a noise_level greater than 0 to add random noise to the encoded values.

The fit_transform method combines the functionality of fitting and transforming the data by first fitting the transformer to the training data and then transforming it based on the calculated encodings.

Now that you understand how the code works, let's see it in action.

# Instantiate the TargetEncode class
te = TargetEncode(categories='ROLE_TITLE')
te.fit(data, data['ACTION'])
te.transform(data[['ROLE_TITLE']])
Output with the target-encoded ROLE_TITLE. Image by author

The target encoder replaced each "ROLE_TITLE" id with its encoded probability. Now, let's do the same for all features and check the memory usage after using target encoding.

y = data['ACTION']
features = data.drop('ACTION', axis=1)

te = TargetEncode(categories=features.columns)
te.fit(features, y)
te_data = te.transform(features)

te_data.head()

Output, target-encoded features. Image by author
memory_usage = te_data.memory_usage(deep=True)
total_memory_usage = memory_usage.sum()
print(f"\nTotal memory usage of the DataFrame: {total_memory_usage / (1024 ** 2):.2f} MB")
The resulting dataset only uses 2.25 MB, compared to 488.08 MB from the one-hot encoder. Image by author

Target encoding successfully transformed the categorical data into numerical values without creating extra columns or increasing memory usage.

So far we have created our own target encoder class; however, you don't need to do this anymore.

In the scikit-learn 1.3 release, around June 2023, the TargetEncoder class was added to the API. Here is how you can use target encoding with scikit-learn.

from sklearn.preprocessing import TargetEncoder

# Splitting the data
y = data['ACTION']
features = data.drop('ACTION', axis=1)

# Specify the target type
te = TargetEncoder(smooth="auto", target_type='binary')
X_trans = te.fit_transform(features, y)

# Creating a DataFrame
features_encoded = pd.DataFrame(X_trans, columns=features.columns)

Output from the sklearn TargetEncoder transformation. Image by author

Note that we get slightly different results from the manual target encoder class. This is partly due to the smooth parameter and partly because scikit-learn's fit_transform applies internal cross fitting, in which each sample is encoded using encodings learned on the other folds after a random shuffle.
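If you want encodings that come from a single fit on all the data, closer in spirit to the manual class above, one option (sketched below) is to call fit and transform separately, since scikit-learn only applies cross fitting inside fit_transform:

# fit followed by transform uses the encodings learned on the full data,
# whereas fit_transform encodes each sample with internal cross fitting
te_full = TargetEncoder(smooth="auto", target_type='binary')
X_full = te_full.fit(features, y).transform(features)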

As you can see, sklearn makes it easy to run target encoding transformations. However, it is important to understand how the transformation works under the hood in order to interpret and explain the output.

While target encoding is a powerful encoding method, it is important to consider the specific requirements and characteristics of your dataset and choose the encoding method that best fits your needs and the requirements of the machine learning algorithm you plan to use.

[1] Banachewicz, K. & Massaron, L. (2022). The Kaggle Book: Data Analysis and Machine Learning for Competitive Data Science. Packt Publishing.

[2] Massaron, L. (2022, January). Amazon Employee Access Challenge. Retrieved February 1, 2024, from https://www.kaggle.com/datasets/lucamassaron/amazon-employee-access-challenge

[3] Massaron, L. Meta-features and target encoding. Retrieved February 1, 2024, from https://www.kaggle.com/luca-massaron/meta-features-and-target-encoding

[4] Scikit-learn. sklearn.preprocessing.TargetEncoder. In scikit-learn: Machine learning in Python (Version 1.3). Retrieved February 1, 2024, from https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html
