How should we choose among label, one-hot, and target encoding?
Why Do We Need Encoding?
In machine learning, most algorithms demand inputs in numeric form, especially in many popular Python frameworks. For instance, in scikit-learn, linear regression and neural networks require numerical variables. This means we need to transform categorical variables into numeric ones for these models to understand them. However, this step isn't always necessary for models like tree-based ones.
Today, I'm thrilled to introduce three fundamental encoding techniques that are essential for every budding data scientist! Plus, I've included a practical tip to help you see these techniques in action at the end! Unless otherwise stated, all code and images are created by the author.
Label Encoding / Ordinal Encoding
Both label encoding and ordinal encoding involve assigning integers to different classes. The distinction lies in whether the categorical variable inherently has an order. For example, responses like 'strongly agree,' 'agree,' 'neutral,' 'disagree,' and 'strongly disagree' are ordinal because they follow a specific sequence. When a variable doesn't have such an order, we use label encoding.
Let’s delve into label encoding.
I've prepared a synthetic dataset with math test scores and students' favorite subjects. This dataset is designed so that students who prefer STEM subjects tend to score higher. The following code shows how it is synthesized.
import numpy as np
import pandas as pd
import random

math_score = [60, 70, 80, 90]
favorite_subject = ["History", "English", "Science", "Math"]
std_deviation = 5   # Standard deviation of the scores
num_samples = 30    # Number of samples per subject

# Generate 30 samples per subject with a normal distribution
scores = []
subjects = []
for i in range(4):
    scores.extend(np.random.normal(math_score[i], std_deviation, num_samples))
    subjects.extend([favorite_subject[i]] * num_samples)

data = {'Score': scores, 'Subject': subjects}
df_math = pd.DataFrame(data)

# Print five randomly sampled rows of the DataFrame
sampled_index = random.sample(range(len(df_math)), 5)
sampled = df_math.iloc[sampled_index]
print(sampled)
You'll be amazed at how easy it is to encode your data: it takes only a single line of code! You can pass a dictionary that maps each subject name to a number to the pandas dataframe's `replace` method, like the following.
# Easy way
df_math['Subject_num'] = df_math['Subject'].replace({'History': 0, 'Science': 1, 'English': 2, 'Math': 3})
print(df_math.iloc[sampled_index])
But what if you're dealing with a vast array of classes, or perhaps you're looking for a more straightforward approach? That's where scikit-learn's `LabelEncoder` comes in handy. It automatically encodes your classes based on their alphabetical order. For the best experience, I recommend using version 1.4.0, which supports all the encoders we're discussing.
# Scikit-learn
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df_math["Subject_num_scikit"] = le.fit_transform(df_math["Subject"])  # LabelEncoder expects a 1D array
print(df_math.iloc[sampled_index])
However, there's a catch. Consider this: our dataset doesn't imply an ordinal relationship between favorite subjects. For instance, 'History' is encoded as 0, but that doesn't mean it's 'inferior' to 'Math,' which is encoded as 3. Similarly, the numerical gap between 'English' and 'Science' is smaller than that between 'English' and 'History,' but this doesn't necessarily reflect their relative similarity.
This encoding approach also affects interpretability in some algorithms. For instance, in linear regression, each coefficient indicates the expected change in the outcome variable for a one-unit change in a predictor. But how do we interpret a 'unit change' in a subject that's been numerically encoded? Let's put this into perspective with a linear regression on our dataset.
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(df_math[["Subject_num"]], df_math[["Score"]])
coefficients = model.coef_
print("Coefficients:", coefficients)
How can we interpret the coefficient of 8.26 here? The naive reading would be that when the label changes by one unit, the test score changes by about 8. However, that is not really true going from Science (encoded as 1) to English (encoded as 2), since I synthesized the data so that their mean scores are 80 and 70, respectively. So we should not interpret the coefficient when there is no meaning in the way we label each class!
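As a minimal sanity check (using the `df_math` dataframe synthesized above; nothing here is new data), the per-subject means show that the gap between consecutive labels is anything but constant:

# Mean score per subject, alongside the label each subject received
print(df_math.groupby("Subject")["Score"].mean())
print(df_math[["Subject", "Subject_num"]].drop_duplicates().sort_values("Subject_num"))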
Now, moving on to ordinal encoding, let's apply it to another synthetic dataset, this time focusing on height and school categories. I've tailored this dataset to reflect average heights for different school levels: 110 cm for kindergarten, 140 cm for elementary school, and so on. Let's see how this plays out.
import numpy as np
import pandas as pd
import random

# Set the parameters
mean_height = [110, 140, 160, 175, 180]  # Mean height in cm
grade = ["kindergarten", "elementary school", "middle school", "high school", "college"]
std_deviation = 5   # Standard deviation in cm
num_samples = 10    # Number of samples per grade

# Generate 10 samples per grade with a normal distribution
heights = []
grades = []
for i in range(5):
    heights.extend(np.random.normal(mean_height[i], std_deviation, num_samples))
    grades.extend([grade[i]] * num_samples)

data = {'Grade': grades, 'Height': heights}
df = pd.DataFrame(data)

# Print five randomly sampled rows
sampled_index = random.sample(range(len(df)), 5)
sampled = df.iloc[sampled_index]
print(sampled)
The `OrdinalEncoder` from scikit-learn's preprocessing toolkit is a real gem for handling ordinal variables. You pass the intended order through the `categories` parameter, and the encoder maps each class to an integer accordingly. If you look at `encoder.categories_`, you can check how the variable was encoded.
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(categories=[grade])
df['Category'] = encoder.fit_transform(df[['Grade']])
print(encoder.categories_)
print(df.iloc[sampled_index])
When it comes to ordinal categorical variables, interpreting linear regression models becomes more straightforward. The encoding reflects the level of education in numerical order: the higher the education level, the higher its corresponding value.
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(df[["Category"]], df[["Height"]])
coefficients = model.coef_
print("Coefficients:", coefficients)
height_diff = [mean_height[i] - mean_height[i-1] for i in range(1, len(mean_height),1)]
print("Average Height Difference:", sum(height_diff)/len(height_diff))
The model reveals something quite intuitive: a one-unit change in school category corresponds to roughly a 17.5 cm increase in height. This makes perfect sense given our dataset!
So, let's wrap up with a quick summary of label/ordinal encoding:
Pros:
– Simplicity: It’s user-friendly and simple to implement.
– Efficiency: This method is light on computational resources and memory, creating only one new numerical feature.
– Ideal for Ordinal Categories: It shines when dealing with categorical variables that have a natural order.
Cons:
– Implied Order: One potential downside is that it can introduce a sense of order where none exists, potentially leading to misinterpretation (like assuming a category labeled '3' is superior to one labeled '2').
– Not Always Suitable: Certain algorithms, such as linear or logistic regression, might incorrectly interpret the encoded numerical values as having ordinal significance.
One-hot encoding
Next up, let's dive into another encoding technique that addresses the interpretability issue: one-hot encoding.
The core issue with label encoding is that it imposes an ordinal structure on variables that don't inherently have one, by replacing categories with numerical values. One-hot encoding tackles this by creating a separate column for each class. Each of these columns contains binary values indicating whether the row belongs to that class. It's like pivoting the data to a wider format, for those who are familiar with that concept. To make this clearer, let's see an example using the score and subject data. The `OneHotEncoder` from sklearn.preprocessing is ideal for this task.
from sklearn.preprocessing import OneHotEncoder

data = {'Score': scores, 'Subject': subjects}
df_math = pd.DataFrame(data)
y = df_math["Score"]  # Target
x = df_math.drop('Score', axis=1)
# Define encoder
encoder = OneHotEncoder()
x_ohe = encoder.fit_transform(x)
print("Type:",type(x_ohe))
# Convert x_ohe to an array so that it's more compatible
x_ohe = x_ohe.toarray()
print("Dimension:", x_ohe.shape)
# Convert back to a pandas dataframe
x_ohe = pd.DataFrame(x_ohe, columns=encoder.get_feature_names_out())
df_math_ohe = pd.concat([y, x_ohe], axis=1)
sampled_ohe_idx = random.sample(range(len(df_math_ohe)), 5)
print(df_math_ohe.iloc[sampled_ohe_idx])
Now, instead of having a single 'Subject' column, our dataset features individual columns for each subject. This effectively eliminates any unintended ordinal structure! However, the process here is a little more involved, so let me explain.
As with label/ordinal encoding, you first need to define your encoder. But the output of one-hot encoding differs: while label/ordinal encoding returns a numpy array, one-hot encoding typically produces a `scipy.sparse._csr.csr_matrix`. To integrate this with a pandas dataframe, you need to convert it into an array. Then, create a new dataframe from this array and assign column names, which you can get from the encoder's `get_feature_names_out()` method. Alternatively, you can get a numpy array directly by setting `sparse_output=False` when defining the encoder, as in the short sketch below.
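Here is a minimal sketch of that alternative, reusing the same `x` dataframe defined above (the variable names are just for illustration); it skips the `toarray()` step entirely:

# Dense output straight away, so no sparse-to-array conversion is needed
encoder_dense = OneHotEncoder(sparse_output=False)
x_ohe_dense = pd.DataFrame(
    encoder_dense.fit_transform(x),
    columns=encoder_dense.get_feature_names_out(),
)
print(x_ohe_dense.head())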
However, in practical applications, you don't need to go through all these steps. I'll show you a more streamlined approach using `make_column_transformer` towards the end of our discussion!
Now, let's proceed with running a linear regression on our one-hot encoded data. This should make the interpretation much easier, right?
model = LinearRegression()
model.fit(x_ohe, y)

coefficients = model.coef_
intercept = model.intercept_
print("Coefficients:", coefficients)
print(encoder.get_feature_names_out())
print("Intercept:",intercept)
But wait, why are the coefficients so tiny and the intercept so large? What's going wrong here? This conundrum is a particular issue in linear regression known as perfect multicollinearity. Perfect multicollinearity occurs when one variable in a linear regression model can be perfectly predicted from the others, which in the case of one-hot encoding happens because one class can be inferred when all the other classes are zero. To sidestep this problem, we can drop one of the classes by setting `OneHotEncoder(drop="first")`. Let's check the impact of this adjustment.
encoder_with_drop = OneHotEncoder(drop="first")
x_ohe_drop = encoder_with_drop.fit_transform(x)

# if you don't set sparse_output=False, you need to run the following to convert the type
x_ohe_drop = x_ohe_drop.toarray()
x_ohe_drop = pd.DataFrame(x_ohe_drop, columns=encoder_with_drop.get_feature_names_out())
model = LinearRegression()
model.fit(x_ohe_drop, y)
coefficients = model.coef_
intercept = model.intercept_
print("Coefficients:", coefficients)
print(encoder_with_drop.get_feature_names_out())
print("Intercept:",intercept)
Here, the column for English has been dropped, and now the coefficients look much more reasonable! Plus, they're easier to interpret. When all the one-hot encoded columns are zero (indicating English as the favorite subject), we predict the test score to be around 71 (aligned with our defined average score for English). For History, it would be 71 minus 11, about 60; for Math, 71 plus 19; and so forth.
However, there's a significant caveat with one-hot encoding: it can result in high-dimensional datasets, especially when the variable has a large number of classes. Let's consider a dataset that includes 1000 rows, each representing a unique product with various features, including a category that spans close to 200 different types.
# Define 199 product categories (for simplicity, these are just numbered)
categories = [f"Category_{i}" for i in range(1, 200)]
manufacturers = ["Manufacturer_A", "Manufacturer_B", "Manufacturer_C"]
satisfied = ["Satisfied", "Not Satisfied"]
n_rows = 1000
# Generate random data
data = {
"Product_ID": [f"Product_{i}" for i in range(n_rows)],
"Category": [random.choice(categories) for _ in range(n_rows)],
"Price": [round(random.uniform(10, 500), 2) for _ in range(n_rows)],
"Quality": [random.choice(satisfied) for _ in range(n_rows)],
"Manufacturer": [random.choice(manufacturers) for _ in range(n_rows)],
}
df = pd.DataFrame(data)
print("Dimension before one-hot encoding:",df.shape)
print(df.head())
Note that the dataset’s dimensions are 1000 rows by 5 columns. Now, let’s observe the changes after applying a one-hot encoder.
# Now do one-hot encoding
encoder = OneHotEncoder(sparse_output=False)

# Reshape the 'Category' column to a 2D array as required by the OneHotEncoder
category_array = df['Category'].values.reshape(-1, 1)
one_hot_encoded_array = encoder.fit_transform(category_array)
one_hot_encoded_df = pd.DataFrame(one_hot_encoded_array, columns=encoder.get_feature_names_out(['Category']))
encoded_df = pd.concat([df.drop('Category', axis=1), one_hot_encoded_df], axis=1)
print("Dimension after one-hot encoding:", encoded_df.shape)
After applying one-hot encoding, our dataset's dimension balloons to 1000 x 201, roughly 40 times wider than before. This increase is a concern, as it demands more memory. Furthermore, you'll notice that most of the values in the newly created columns are zeros, resulting in what we call a sparse dataset. Certain models, especially tree-based ones, can struggle with sparse data. Moreover, other challenges arise when dealing with high-dimensional data, often referred to as the 'curse of dimensionality.' Also, since one-hot encoding treats each class as an individual column, we lose any ordinal information. Therefore, if the classes in your variable inherently have a hierarchical order, one-hot encoding may not be your best option.
How do we tackle these disadvantages? One approach is to use a different encoding method. Alternatively, you can limit the number of classes in the variable. Often, even with a large number of classes, the majority of values for a variable are concentrated in only a few of them. In such cases, treating the minority classes as 'others' can be effective. This can be achieved by setting parameters like `min_frequency` or `max_categories` in OneHotEncoder, as sketched below. Another strategy for dealing with sparse data involves techniques like feature hashing, which essentially simplifies the representation by mapping multiple categories to a lower-dimensional space using a hash function, or dimension reduction techniques like PCA.
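As a rough sketch of that grouping idea on the product dataframe `df` from above (the parameter values and variable names here are purely illustrative, not a recommendation), rare categories can be folded into a single 'infrequent' column:

# Illustrative only: categories seen fewer than 10 times are grouped as "infrequent",
# and the number of output categories is capped at 20 (including the infrequent bucket)
encoder_grouped = OneHotEncoder(min_frequency=10, max_categories=20, sparse_output=False)
category_grouped = encoder_grouped.fit_transform(df[['Category']])
print("Dimension with grouped rare categories:", category_grouped.shape)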
Here's a quick summary of one-hot encoding:
Pros:
– Prevents Misleading Interpretations: It avoids the risk of models misinterpreting the data as having some sort of order, a problem that can arise in label/target encoding.
– Suitable for Non-Ordinal Features: Ideal for categorical data without an ordinal relationship.
Cons:
– Dimensionality Increase: Leads to a significant increase in the dataset's dimensionality, which can be problematic, especially for variables with many categories.
– Sparse Matrix: Results in many columns filled with zeros, creating sparse data.
– Not Efficient with High-Cardinality Features: Less effective for variables with a large number of categories.
Target Encoding
Let's now explore target encoding, a technique that is particularly effective with high-cardinality data and with models like tree-based algorithms.
The essence of target encoding is to leverage the information in the value of the dependent variable. Its implementation varies depending on the task. In regression, we encode the categorical variable with the mean of the dependent variable for each class. For binary classification, we encode it with the probability of belonging to one class (calculated as the number of rows in that class where the outcome is 1, divided by the total number of rows in the class). In multiclass classification, the categorical variable is encoded based on the probability of belonging to each class, resulting in as many new columns as there are classes in the dependent variable. To clarify, let's use the same product dataset we employed for one-hot encoding.
Let's begin with target encoding for a regression task. Imagine we want to predict the price of products and aim to encode the product category. As with the other encodings, we use `TargetEncoder` from sklearn.preprocessing!
from sklearn.preprocessing import TargetEncoder
x = df.drop(["Price"], axis=1)
x_need_encode = df["Category"].to_frame()
y = df["Price"]

# Define encoder
encoder = TargetEncoder()
x_encoded = encoder.fit_transform(x_need_encode, y)
# Encoder with 0 smoothing
encoder_no_smooth = TargetEncoder(smooth=0)
x_encoded_no_smooth = encoder_no_smooth.fit_transform(x_need_encode, y)
x_encoded = pd.DataFrame(x_encoded, columns=["encoded_category"])
data_target = pd.concat([x, x_encoded], axis=1)
print("Dimension before encoding:", df.shape)
print("Dimension after encoding:", data_target.shape)
print("---------")
print("Encoding")
print(encoder.encodings_[0][:5])
print(encoder.categories_[0][:5])
print(" ")
print("Encoding with no smooth")
print(encoder_no_smooth.encodings_[0][:5])
print(encoder_no_smooth.categories_[0][:5])
print("---------")
print("Mean by Category")
print(df.groupby("Category")["Price"].mean().head())
print("---------")
print("dataset:")
print(data_target.head())
After the encoding, you'll notice that, despite the variable having many classes, the dataset's dimension remains unchanged (1000 x 5). You can also observe how each class is encoded. Although I mentioned that the encoding for each class is based on the mean of the target variable for that class, you'll find that the actual mean differs slightly from the encoding obtained with the default settings. This discrepancy arises because, by default, the function automatically selects a smoothing parameter. This parameter blends the local category mean with the overall global mean, which is particularly useful to prevent overfitting in categories with few samples. If we set `smooth=0`, the encoded values align precisely with the actual means.
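To convince yourself of this, here is a minimal check (assuming the `df` and `encoder_no_smooth` objects from above; `manual_means` is just an illustrative name) that compares the raw per-category means of Price with the unsmoothed encodings:

# With smooth=0 the encodings should equal the per-category means of Price
manual_means = df.groupby("Category")["Price"].mean()  # alphabetical order, same as categories_
print(manual_means.head())
print(encoder_no_smooth.encodings_[0][:5])  # should match the five means printed above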
Now, let's consider binary classification. Imagine our goal is to classify whether the quality of a product is satisfactory. In this scenario, the encoded value represents the probability of the product being 'Satisfied.'
x = df.drop(["Quality"], axis=1)
x_need_encode = df["Category"].to_frame()
y = df["Quality"]

# Define encoder
encoder = TargetEncoder()
x_encoded = encoder.fit_transform(x_need_encode, y)
x_encoded = pd.DataFrame(x_encoded, columns=["encoded_category"])
data_target = pd.concat([x, x_encoded], axis=1)
print("Dimension:", data_target.shape)
print("---------")
print("Encoding")
print(encoder.encodings_[0][:5])
print(encoder.categories_[0][:5])
print("---------")
print(encoder.classes_)
print("---------")
print("dataset:")
print(data_target.head())
You can indeed see that `encoded_category` represents the probability of being 'Satisfied' (a float value between 0 and 1). To see which class the encoding refers to, you can check the `classes_` attribute of the encoder. For binary classification, the first value in the list is dropped, meaning that the column here indicates the probability of being 'Satisfied.' Conveniently, the encoder automatically detects the type of task, so there's no need to specify that it's a binary classification. A quick check of this claim is sketched below.
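As a minimal check (assuming `df` and the fitted `encoder` from the binary example above; `satisfied_rate` is just an illustrative name), the raw share of 'Satisfied' rows per category should sit close to the learned encodings, up to the smoothing toward the global rate:

# Raw fraction of "Satisfied" products per category, before any smoothing
satisfied_rate = df.groupby("Category")["Quality"].apply(lambda s: (s == "Satisfied").mean())
print(satisfied_rate.head())
print(encoder.encodings_[0][:5])  # close to the rates above, shrunk toward the global mean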
Lastly, let's look at a multiclass classification example. Suppose we're predicting which manufacturer produced a product.
x = df.drop(["Manufacturer"], axis=1)
x_need_encode = df["Category"].to_frame()
y = df["Manufacturer"]

# Define encoder
encoder = TargetEncoder()
x_encoded = encoder.fit_transform(x_need_encode, y)
x_encoded = pd.DataFrame(x_encoded, columns=encoder.classes_)
data_target = pd.concat([x, x_encoded], axis=1)
print("Dimension:", data_target.shape)
print("---------")
print("Encoding")
print(encoder.encodings_[0][:5])
print(encoder.categories_[0][:5])
print("---------")
print("dataset:")
print(data_target.head())
After encoding, you'll see that we now have columns for each manufacturer. These columns indicate the probability that a product in a given category was produced by that manufacturer. Although our dataset has expanded slightly, the number of classes in the dependent variable is usually much smaller, so it's unlikely to cause issues.
Target encoding is especially advantageous for tree-based models. These models make splits based on the feature values that most effectively separate the target variable. By directly incorporating the mean of the target variable, target encoding gives the model a clear and efficient signal for making these splits, often more so than other encoding methods.
However, caution is required with target encoding. If there are only a few observations for a category, and these don't represent the true mean for that class, there's a risk of overfitting.
This leads to another crucial point: it's vital to perform target encoding after splitting your data into training and testing sets. Doing it beforehand can lead to data leakage, as the encoding would be influenced by the outcomes in the test dataset. This could result in the model performing exceptionally well on the training dataset, giving you a false impression of its efficacy. Therefore, to accurately assess your model's performance, make sure target encoding is done after the train-test split, as sketched below.
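A minimal sketch of this order of operations, reusing the product dataframe from above (the split ratio, random seed, and variable names are arbitrary choices for illustration):

from sklearn.model_selection import train_test_split

# Split first, then fit the encoder on the training portion only
x_train, x_test, y_train, y_test = train_test_split(
    df[["Category"]], df["Price"], test_size=0.2, random_state=42
)
encoder = TargetEncoder()
x_train_encoded = encoder.fit_transform(x_train, y_train)  # learns encodings from training targets only
x_test_encoded = encoder.transform(x_test)  # test rows are encoded without ever seeing y_test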
Here's a quick summary of target encoding:
Pros:
– Keeps Cardinality in Check: It's highly effective for high-cardinality features because it doesn't increase the feature space.
– Can Capture Information Within Labels: By incorporating target information, it often enhances predictive performance.
– Useful for Tree-Based Models: Particularly advantageous for complex models such as random forests or gradient boosting machines.
Cons:
– Risk of Overfitting: There's a heightened risk of overfitting, especially when categories have a limited number of observations.
– Target Leakage: It can inadvertently introduce information into the features that wouldn't be accessible during actual predictions, i.e., details derived from the target variable.
– Less Interpretable: Since the transformations are based on the target, they can be harder to interpret compared with methods like one-hot or label encoding.
Final tip
To wrap up, I'd like to offer some practical advice. Throughout this discussion, we've looked at different encoding techniques, but in reality you'll often apply different encodings to different variables within a dataset. That's where `make_column_transformer` from sklearn.compose comes in handy. For example, suppose you're predicting product prices and decide to use target encoding for 'Category' (due to its high cardinality) and 'Quality', while applying one-hot encoding for 'Manufacturer'. To do this, you define lists containing the names of the variables for each encoding type and apply the transformer as shown below. This approach allows you to handle the transformed data seamlessly, giving you an efficiently encoded dataset ready for your analyses!
from sklearn.compose import make_column_transformer
ohe_cols = ["Manufacturer"]
te_cols = ["Category", "Quality"]

encoding = make_column_transformer(
(OneHotEncoder(), ohe_cols),
(TargetEncoder(), te_cols)
)
x = df.drop(["Price"], axis=1)
y = df["Price"]
# Fit the transformer
x_encoded = encoding.fit_transform(x, y)
x_encoded = pd.DataFrame(x_encoded, columns=encoding.get_feature_names_out())
x_rest = x.drop(ohe_cols+te_cols, axis=1)
print(pd.concat([x_rest, x_encoded],axis=1).head())
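If you plan to feed the encoded features straight into a model, one convenient option (a sketch, not the only way to do it; `pipeline` is just an illustrative name) is to chain the column transformer with an estimator in a pipeline, so encoding and fitting happen in a single step:

from sklearn.pipeline import make_pipeline

# The transformer drops the remaining columns (like Product_ID) by default,
# so the regression only sees the encoded features
pipeline = make_pipeline(encoding, LinearRegression())
pipeline.fit(x, y)
print("R^2 on the training data:", pipeline.score(x, y))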
Thank you so much for taking the time to read through this! When I first embarked on my machine learning journey, choosing the right encoding techniques and understanding their implementation was quite a maze for me. I genuinely hope this article has shed some light for you and made your path a bit clearer!
Source:
Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825–2830, 2011.
Documentation of Scikit-learn:
Ordinal encoder: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder
Target encoder: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html#sklearn.preprocessing.TargetEncoder
One-hot encoder: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder