Data comes in many shapes and forms. One of those forms is known as categorical data.

This poses an issue because most machine learning algorithms only accept numerical data as input. Fortunately, categorical data is not hard to deal with, thanks to simple, well-defined functions that transform it into numerical values. If you have taken any data science course, you will be familiar with the one-hot encoding strategy for categorical features. This strategy works well when your features have a limited number of categories. However, you will run into issues when dealing with high-cardinality features (features with many categories).
Here is how you can use target encoding to transform categorical features into numerical values.
Early in any data science course, you are introduced to one-hot encoding as a key technique for dealing with categorical values, and rightfully so, as this strategy works very well on low-cardinality features (features with a limited number of categories).
In a nutshell, one-hot encoding transforms each category into a binary vector, where the corresponding category is marked as ‘True’ or ‘1’, and all other categories are marked as ‘False’ or ‘0’.
import pandas as pd

# Sample categorical data
data = {'Category': ['Red', 'Green', 'Blue', 'Red', 'Green']}
# Create a DataFrame
df = pd.DataFrame(data)
# Perform one-hot encoding
one_hot_encoded = pd.get_dummies(df['Category'])
# Display the result
print(one_hot_encoded)
While this works great for features with a limited number of categories (fewer than 10–20), as the number of categories increases, the one-hot encoded vectors become longer and sparser, potentially leading to increased memory usage and computational complexity. Let’s look at an example.
The code below uses the Amazon Employee Access data, made publicly available on Kaggle: https://www.kaggle.com/datasets/lucamassaron/amazon-employee-access-challenge
The data contains eight categorical feature columns indicating characteristics of the requested resource, role, and workgroup of the employee at Amazon.
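To follow along, the dataset first needs to be loaded into a pandas DataFrame (a minimal sketch, assuming the competition’s training file has been downloaded locally as train.csv):

import pandas as pd

# Assumed local path to the downloaded Kaggle file; adjust as needed
data = pd.read_csv('train.csv')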
data.info()
# Display the number of unique values in each column
unique_values_per_column = data.nunique()

print("Number of unique values in each column:")
print(unique_values_per_column)
Using one-hot encoding could be difficult in a dataset like this due to the high number of distinct categories for each feature.
#Initial data memory usage
memory_usage = data.memory_usage(deep=True)
total_memory_usage = memory_usage.sum()
print(f"nTotal memory usage of the DataFrame: {total_memory_usage / (1024 ** 2):.2f} MB")
#one-hot encoding categorical features
data_encoded = pd.get_dummies(data,
                              columns=data.select_dtypes(include='object').columns,
                              drop_first=True)

data_encoded.shape
# Memory usage for the one-hot encoded dataset
memory_usage = data_encoded.memory_usage(deep=True)
total_memory_usage = memory_usage.sum()
print(f"nTotal memory usage of the DataFrame: {total_memory_usage / (1024 ** 2):.2f} MB")
As you can see, one-hot encoding is not a viable solution for high-cardinality categorical features, as it significantly increases the size of the dataset.
In cases with high-cardinality features, target encoding is a better option.
Target encoding transforms a categorical feature into a numeric feature without adding any extra columns, avoiding turning the dataset into a larger and sparser one.
Target encoding works by converting each category of a categorical feature into its corresponding expected value. The approach to calculating the expected value depends on the value you are trying to predict.
For regression problems, the expected value is simply the average of the target for that category.
For classification problems, the expected value is the conditional probability of the target given that category.
In both cases, we can get the results by simply using the ‘groupby’ function in pandas, as shown below.
# Example of how to calculate the expected value for target encoding of a binary outcome
expected_values = data.groupby('ROLE_TITLE')['ACTION'].value_counts(normalize=True).unstack()
expected_values
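For the regression case, the idea is the same. Here is a small hypothetical example (the ‘city’ and ‘price’ columns are made up purely for illustration):

# Hypothetical regression example: replace each category with its average target value
toy = pd.DataFrame({'city': ['A', 'A', 'B', 'B', 'B'],
                    'price': [100, 120, 300, 280, 320]})
city_means = toy.groupby('city')['price'].mean()
toy['city_encoded'] = toy['city'].map(city_means)
print(toy)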
The resulting table indicates the probability of each “ACTION” outcome for every unique “ROLE_TITLE” id. All that’s left to do is replace the “ROLE_TITLE” id in the original dataset with the probability of “ACTION” being 1 (i.e., instead of category 117879, the dataset will show 0.889331).
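That replacement step could look something like this (a sketch; the new column name is arbitrary, and any role id that never received an “ACTION” of 1 would map to NaN, filled with 0 here):

# Map each ROLE_TITLE to the probability of ACTION being 1
data['ROLE_TITLE_encoded'] = data['ROLE_TITLE'].map(expected_values[1].fillna(0))
data[['ROLE_TITLE', 'ROLE_TITLE_encoded']].head()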
While this gives us an intuition of how target encoding works, using this simple method runs the risk of overfitting, especially for rare categories, as in those cases target encoding essentially leaks the target value to the model. Also, the above method can only deal with seen categories, so if your test data has a new category, it won’t be able to handle it.
To avoid those errors, you need to make the target encoding transformer more robust.
To make target encoding more robust, you can create a custom transformer class and integrate it with scikit-learn so that it can be used in any model pipeline.
NOTE: The code below is taken from the book “The Kaggle Book” and can be found on Kaggle: https://www.kaggle.com/code/lucamassaron/meta-features-and-target-encoding
import numpy as np
import pandas as pd

from sklearn.base import BaseEstimator, TransformerMixin

class TargetEncode(BaseEstimator, TransformerMixin):

    def __init__(self, categories='auto', k=1, f=1,
                 noise_level=0, random_state=None):
        if type(categories) == str and categories != 'auto':
            self.categories = [categories]
        else:
            self.categories = categories
        self.k = k
        self.f = f
        self.noise_level = noise_level
        self.encodings = dict()
        self.prior = None
        self.random_state = random_state

    def add_noise(self, series, noise_level):
        return series * (1 + noise_level *
                         np.random.randn(len(series)))

    def fit(self, X, y=None):
        if isinstance(self.categories, str) and self.categories == 'auto':
            self.categories = X.columns[np.where(X.dtypes == type(object()))[0]]
        temp = X.loc[:, self.categories].copy()
        temp['target'] = y
        self.prior = np.mean(y)
        for variable in self.categories:
            avg = (temp.groupby(by=variable)['target']
                       .agg(['mean', 'count']))
            # Compute smoothing
            smoothing = (1 / (1 + np.exp(-(avg['count'] - self.k) /
                         self.f)))
            # The bigger the count, the less the prior (overall mean) is accounted for
            self.encodings[variable] = dict(self.prior * (1 - smoothing) +
                                            avg['mean'] * smoothing)
        return self

    def transform(self, X):
        Xt = X.copy()
        for variable in self.categories:
            Xt[variable].replace(self.encodings[variable],
                                 inplace=True)
            unknown_value = {value: self.prior for value in
                             X[variable].unique()
                             if value not in
                             self.encodings[variable].keys()}
            if len(unknown_value) > 0:
                Xt[variable].replace(unknown_value, inplace=True)
            Xt[variable] = Xt[variable].astype(float)
            if self.noise_level > 0:
                if self.random_state is not None:
                    np.random.seed(self.random_state)
                Xt[variable] = self.add_noise(Xt[variable],
                                              self.noise_level)
        return Xt

    def fit_transform(self, X, y=None):
        self.fit(X, y)
        return self.transform(X)
It might look daunting at first, but let’s break down each part of the code to understand how to create a robust target encoder.
Class Definition
class TargetEncode(BaseEstimator, TransformerMixin):
This first step ensures that you can use this transformer class in scikit-learn pipelines for data preprocessing, feature engineering, and machine learning workflows. It achieves this by inheriting from the scikit-learn classes BaseEstimator and TransformerMixin.
Inheritance allows the TargetEncode class to reuse or override methods and attributes defined in the base classes, in this case BaseEstimator and TransformerMixin.
BaseEstimator is a base class for all scikit-learn estimators. Estimators are objects in scikit-learn with a “fit” method for training on data and a “predict” method for making predictions.
TransformerMixin is a mixin class for transformers in scikit-learn; it provides additional methods such as “fit_transform”, which combines fitting and transforming in a single step.
Inheriting from BaseEstimator and TransformerMixin allows TargetEncode to implement these methods, making it compatible with the scikit-learn API.
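To see what this buys us in isolation, here is a tiny, purely illustrative transformer (unrelated to target encoding) that defines only fit and transform, yet gets fit_transform for free from TransformerMixin:

from sklearn.base import BaseEstimator, TransformerMixin

class AddOne(BaseEstimator, TransformerMixin):
    # Toy transformer: fit learns nothing, transform adds 1 to every value
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X + 1

# fit_transform is inherited from TransformerMixin
print(AddOne().fit_transform(pd.DataFrame({'x': [1, 2, 3]})))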
Defining the constructor
def __init__(self, categories='auto', k=1, f=1,
             noise_level=0, random_state=None):
    if type(categories) == str and categories != 'auto':
        self.categories = [categories]
    else:
        self.categories = categories
    self.k = k
    self.f = f
    self.noise_level = noise_level
    self.encodings = dict()
    self.prior = None
    self.random_state = random_state
This second step defines the constructor for the “TargetEncode” class and initializes the instance variables with default or user-specified values.
The “categories” parameter determines which columns in the input data should be treated as categorical variables for target encoding. It is set to ‘auto’ by default to automatically identify categorical columns during the fitting process.
The parameters k, f, and noise_level control the smoothing effect during target encoding and the amount of noise added during transformation.
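For example, an instantiation with heavier smoothing and a little noise might look like this (the parameter values here are arbitrary, chosen only for illustration):

te = TargetEncode(categories=['ROLE_TITLE'], k=5, f=2,
                  noise_level=0.01, random_state=42)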
Adding noise
This next step is very important for avoiding overfitting.
def add_noise(self, series, noise_level):
    return series * (1 + noise_level *
                     np.random.randn(len(series)))
The “add_noise” method adds random noise to introduce variability and prevent overfitting during the transformation phase.
“np.random.randn(len(series))” generates an array of random numbers drawn from a standard normal distribution (mean = 0, standard deviation = 1).
Multiplying this array by “noise_level” scales the random noise to the specified noise level.
This step contributes to the robustness and generalization capabilities of the target encoding process.
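A quick numeric illustration of the formula (the values are made up):

np.random.seed(42)
encoded = pd.Series([0.2, 0.5, 0.8])
noise_level = 0.05
# Each encoded value is nudged by a small, random percentage of itself
print(encoded * (1 + noise_level * np.random.randn(len(encoded))))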
Fitting the target encoder
This part of the code trains the target encoder on the provided data by calculating the target encodings for the categorical columns and storing them for later use during transformation.
def fit(self, X, y=None):
    if isinstance(self.categories, str) and self.categories == 'auto':
        self.categories = X.columns[np.where(X.dtypes == type(object()))[0]]
    temp = X.loc[:, self.categories].copy()
    temp['target'] = y
    self.prior = np.mean(y)
    for variable in self.categories:
        avg = (temp.groupby(by=variable)['target']
                   .agg(['mean', 'count']))
        # Compute smoothing
        smoothing = (1 / (1 + np.exp(-(avg['count'] - self.k) /
                     self.f)))
        # The bigger the count, the less the prior (overall mean) is accounted for
        self.encodings[variable] = dict(self.prior * (1 - smoothing) +
                                        avg['mean'] * smoothing)
    return self
The smoothing term helps prevent overfitting, especially when dealing with categories that have few samples.
The method follows the scikit-learn convention for fit methods in transformers.
It starts by identifying the categorical columns and creating a temporary DataFrame containing only the selected categorical columns from the input X and the target variable y.
The prior mean of the target variable is calculated and stored in the prior attribute. This represents the overall mean of the target variable across the entire dataset.
Then, it calculates the mean and count of the target variable for each category using the groupby method, as seen previously.
There is an additional smoothing step to prevent overfitting on categories with small numbers of samples. Smoothing is calculated based on the number of samples in each category: the larger the count, the more weight the observed category mean gets and the less the encoding is pulled toward the prior.
The calculated encodings for each category in the current variable are stored in the encodings dictionary. This dictionary will be used later during the transformation phase.
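To make the smoothing concrete, here is a small worked example with made-up numbers (the prior and category mean are hypothetical; k and f use the class defaults):

prior = 0.94          # hypothetical overall mean of the target
category_mean = 0.50  # hypothetical mean for one category
k, f = 1, 1

for count in [1, 5, 100]:
    smoothing = 1 / (1 + np.exp(-(count - k) / f))
    encoding = prior * (1 - smoothing) + category_mean * smoothing
    print(f"count={count:3d}  smoothing={smoothing:.3f}  encoding={encoding:.3f}")

With a single sample, the encoding sits halfway between the prior and the category mean; with many samples, it converges to the category mean.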
Transforming the data
This part of the code replaces the original categorical values with their corresponding target-encoded values stored in self.encodings.
def transform(self, X):
    Xt = X.copy()
    for variable in self.categories:
        Xt[variable].replace(self.encodings[variable],
                             inplace=True)
        unknown_value = {value: self.prior for value in
                         X[variable].unique()
                         if value not in
                         self.encodings[variable].keys()}
        if len(unknown_value) > 0:
            Xt[variable].replace(unknown_value, inplace=True)
        Xt[variable] = Xt[variable].astype(float)
        if self.noise_level > 0:
            if self.random_state is not None:
                np.random.seed(self.random_state)
            Xt[variable] = self.add_noise(Xt[variable],
                                          self.noise_level)
    return Xt
This step has an additional robustness check to ensure the target encoder can handle new or unseen categories. For those new or unknown categories, it replaces them with the overall mean of the target variable stored in the prior attribute.
If you need more robustness against overfitting, you can set a noise_level greater than 0 to add random noise to the encoded values.
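A minimal sketch of the unseen-category fallback, using a hypothetical role id that does not appear in the training data:

te = TargetEncode(categories='ROLE_TITLE')
te.fit(data, data['ACTION'])
unseen = pd.DataFrame({'ROLE_TITLE': [-1]})  # id not present in the training data
print(te.transform(unseen))  # encoded with the prior (overall mean of ACTION)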
The fit_transform method combines the functionality of fitting and transforming the data by first fitting the transformer to the training data and then transforming it based on the calculated encodings.
Now that you understand how the code works, let’s see it in action.
#Instantiate TargetEncode class
te = TargetEncode(categories='ROLE_TITLE')
te.fit(data, data['ACTION'])
te.transform(data[['ROLE_TITLE']])
The target encoder replaced each “ROLE_TITLE” id with the corresponding probability for that category. Now, let’s do the same for all features and check the memory usage after using target encoding.
y = data['ACTION']
features = data.drop('ACTION', axis=1)

te = TargetEncode(categories=features.columns)
te.fit(features,y)
te_data = te.transform(features)
te_data.head()
memory_usage = te_data.memory_usage(deep=True)
total_memory_usage = memory_usage.sum()
print(f"nTotal memory usage of the DataFrame: {total_memory_usage / (1024 ** 2):.2f} MB")
Target encoding successfully transformed the categorical data into numerical values without creating extra columns or increasing memory usage.
So far we have created our own target encoder class; however, you don’t need to do that anymore.
In the scikit-learn 1.3 release, around June 2023, they introduced the TargetEncoder class to their API. Here is how you can use target encoding with scikit-learn.
from sklearn.preprocessing import TargetEncoder

# Splitting the data
y = data['ACTION']
features = data.drop('ACTION',axis=1)
# Specify the target type
te = TargetEncoder(smooth="auto", target_type='binary')
X_trans = te.fit_transform(features, y)
# Creating a DataFrame from the encoded output
features_encoded = pd.DataFrame(X_trans, columns=features.columns)
Note that we get slightly different results from the manual target encoder class because of the smooth parameter and the cross-fitting that scikit-learn applies internally in fit_transform.
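One quick way to see the difference side by side (a sketch, assuming both encoded DataFrames from above are still in memory):

comparison = pd.DataFrame({
    'manual_encoder': te_data['ROLE_TITLE'],
    'sklearn_encoder': features_encoded['ROLE_TITLE'],
})
print(comparison.head())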
As you can see, scikit-learn makes it easy to run target encoding transformations. However, it is important to understand how the transformation works under the hood in order to interpret and explain the output.
While target encoding is a powerful encoding method, it’s important to consider the specific requirements and characteristics of your dataset and choose the encoding method that best fits your needs and the requirements of the machine learning algorithm you plan to use.
[1] Banachewicz, K. & Massaron, L. (2022). The Kaggle Book: Data Analysis and Machine Learning for Competitive Data Science. Packt.
[2] Massaron, L. (2022, January). Amazon Employee Access Challenge. Retrieved February 1, 2024, from https://www.kaggle.com/datasets/lucamassaron/amazon-employee-access-challenge
[3] Massaron, L. Meta-features and target encoding. Retrieved February 1, 2024, from https://www.kaggle.com/luca-massaron/meta-features-and-target-encoding
[4] scikit-learn. sklearn.preprocessing.TargetEncoder. In scikit-learn: Machine Learning in Python (Version 1.3). Retrieved February 1, 2024, from https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html