Home Artificial Intelligence 4 Ways to Encode Categorical Features with High Cardinality — with Python Implementation Introducing categorical features

4 Ways to Encode Categorical Features with High Cardinality — with Python Implementation Introducing categorical features

0
4 Ways to Encode Categorical Features with High Cardinality — with Python Implementation
Introducing categorical features

Learn to use goal encoding, count encoding, feature hashing and Embedding using scikit-learn and TensorFlow

Towards Data Science
“Click” — Photo by Cleo Vermij on Unsplash

In this text, we are going to undergo 4 popular methods to encode categorical variables with high cardinality: (1) Goal encoding, (2) Count encoding, (3) Feature hashing and (4) Embedding.

We are going to explain how each method works, discuss its pros and cons and observe its impact on the performance of a classification task.

Table of content

— Introducing categorical features
(1) Why do we’d like to encode categorical features?
(2)
Why one-hot encoding will not be suited to high cardinality?
— Application on an AdTech dataset
— Overview of every encoding method
(1) Goal encoding
(2)
Count encoding
(3)
Feature hashing
(4)
Embedding
— Benchmarking the performance to predict CTR
— Conclusion
— To go further

Categorical features are a style of variables that describe categories or groups (e.g. gender, color, country), versus numerical features that measure a quantity (e.g. age, height, temperature).

There are two forms of categorical data: ordinal features which categories may be ranked and sorted (e.g. sizes of T-shirt or restaurant rankings from 1 to five star) and nominal features which categories don’t imply any meaningful order (e.g. name of an individual, of a city).

Why do we’d like to encode categorical features?

Encoding a categorical variable means finding a mapping that converts a category to a numerical value.

While some algorithms can work with categorical data directly (like decision trees), most machine learning models cannot handle categorical features and were designed to operate with…

LEAVE A REPLY

Please enter your comment!
Please enter your name here