Introduction
Label encoding is a technique used in machine learning and data analysis to convert categorical variables into a numerical format. It is especially useful when working with algorithms that require numerical input, as most machine learning models can only operate on numerical data. In this article, we’ll explore how label encoding works and how to implement it in Python.
Let’s consider a simple example with a dataset containing details about different kinds of fruits, where the “Fruit” column has categorical values such as “Apple,” “Orange,” and “Banana.” Label encoding assigns a unique numerical label to each distinct category, transforming the categorical data into a numerical representation.
To perform label encoding in Python, we can use the scikit-learn library, which provides a range of preprocessing utilities, including the LabelEncoder class. Here’s a step-by-step guide:
- Import the necessary libraries:
from sklearn.preprocessing import LabelEncoder
- Create an instance of the LabelEncoder class:
label_encoder = LabelEncoder()
- Fit the label encoder to the categorical data:
label_encoder.fit(categorical_data)
Here, categorical_data refers to the column or array containing the categorical values you would like to encode.
- Transform the categorical data into numerical labels:
encoded_data = label_encoder.transform(categorical_data)
The transform method takes the original categorical data and returns an array with the corresponding numerical labels.
- If needed, you can also reverse the encoding to recover the original categorical values using the inverse_transform method:
original_data = label_encoder.inverse_transform(encoded_data)
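The steps above can be put together in a minimal sketch. The fruit values here are hypothetical sample data standing in for the “Fruit” column; LabelEncoder assigns codes in the sorted order of the distinct values.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values standing in for the "Fruit" column
categorical_data = ["Apple", "Orange", "Banana", "Apple", "Banana"]

label_encoder = LabelEncoder()
label_encoder.fit(categorical_data)  # learns the sorted distinct classes

# Map each value to its integer label, then map the labels back again
encoded_data = label_encoder.transform(categorical_data)
original_data = label_encoder.inverse_transform(encoded_data)

print(label_encoder.classes_.tolist())  # ['Apple', 'Banana', 'Orange']
print(encoded_data.tolist())            # [0, 2, 1, 0, 1]
print(original_data.tolist())           # the original fruit names
```

Note that “Orange” gets the label 2 even though it appears second in the data, because labels follow alphabetical order rather than order of appearance.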
Label encoding can also be applied to multiple columns or features. Repeat the fit and transform steps, with a separate encoder instance, for every categorical column you would like to encode.
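As a rough illustration of encoding several columns, the sketch below loops over the columns of a small DataFrame. The column names (“Fruit”, “Color”), their values, and the “_encoded” suffix are assumptions made up for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical two-column frame; names and values are illustrative only
df = pd.DataFrame({
    "Fruit": ["Apple", "Orange", "Banana", "Apple"],
    "Color": ["Red", "Orange", "Yellow", "Green"],
})

# Keep one fitted encoder per column so each column can be decoded later
encoders = {}
for column in ["Fruit", "Color"]:
    encoder = LabelEncoder()
    df[column + "_encoded"] = encoder.fit_transform(df[column])
    encoders[column] = encoder

print(df)
```

Keeping the encoders in a dictionary is one possible design choice; it makes it possible to call inverse_transform on any column afterwards.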
It is important to note that label encoding introduces an arbitrary order to the categorical values, which can lead the model to make incorrect assumptions. To avoid this issue, consider one-hot encoding for categories with no natural order, or ordinal encoding with an explicitly defined order when the categories are genuinely ranked.
Label encoding is a simple and effective way to convert categorical variables into numerical form. Using the LabelEncoder class from scikit-learn, you can easily encode your categorical data and prepare it for further analysis or as input to machine learning algorithms.
Now, let us first briefly understand what data types are and how they are scaled. It is important to know this before we proceed with categorical variable encoding. Data can be classified into three types, namely structured, semi-structured, and unstructured data.
Structured data is data represented in matrix form with rows and columns. It can be stored in a SQL database table, a delimiter-separated CSV file, or an Excel sheet with rows and columns.
Data that is not in matrix form can be classified as semi-structured data (data in XML or JSON format) or unstructured data (emails, images, log data, videos, and textual data).
Let us say that, for a given data science or machine learning business problem, we are dealing with only structured data and the data collected is a mix of both categorical variables and continuous variables. Most machine learning algorithms won’t understand, or won’t be able to deal with, categorical variables. This means that machine learning algorithms will perform better in terms of accuracy and other performance metrics when the data is presented to the model as numbers instead of categories for training and testing.
Deep learning techniques such as artificial neural networks expect data to be numerical. Thus, categorical data must be encoded as numbers before we can use it to fit and evaluate a model.
A few ML algorithms, such as tree-based ones (Decision Tree, Random Forest), do a better job of handling categorical variables. Still, the best practice in any data science project is to transform categorical data into numeric values.
Now, our objective is clear. Before building any statistical, machine learning, or deep learning models, we need to transform or encode categorical data into numeric values. Before we get there, let us understand the different types of categorical data.
Nominal Scale
A nominal scale refers to variables that are just names and are used for labeling. Note that these categories don’t overlap with one another, and none of them has any numerical significance.
Below is an example of nominal scale data. Once the data is collected, we normally assign a numerical code to represent each value of a nominal variable.
For instance, we can assign the numerical code 1 to represent Bangalore, 2 for Delhi, 3 for Mumbai, and 4 for Chennai for the categorical variable “In which place do you live?”. It is important to note that the numerical values assigned don’t have any mathematical meaning attached to them. This means that basic mathematical operations such as addition, subtraction, multiplication, or division are pointless: Bangalore + Delhi or Mumbai / Chennai doesn’t make any sense.
Ordinal Scale
An ordinal scale applies to a variable whose values are captured from an ordered set. For instance, customer feedback survey data uses a Likert scale that is finite, as shown below.

In this case, let’s say the feedback data is collected using a five-point Likert scale. The numerical code 1 is assigned to Poor, 2 to Fair, 3 to Good, 4 to Very Good, and 5 to Excellent. We can observe that 5 is better than 4, and 5 is much better than 3. But if you look at Excellent minus Good, it’s meaningless.
We know well that most machine learning algorithms work exclusively with numeric data. That is why we need to encode categorical features into a representation compatible with the models. Hence, we will cover some popular encoding approaches:
- Label encoding
- One-hot encoding
- Ordinal Encoding
Label Encoding
In label encoding in Python, we replace each categorical value with a numeric value between 0 and the number of classes minus 1. If the categorical variable contains 5 distinct classes, we use 0, 1, 2, 3, and 4.
To understand label encoding with an example, let us take COVID-19 cases in India across states. If we observe the data frame below, the State column contains categorical values, which are not very machine-friendly, and the rest of the columns contain numerical values. Let us perform label encoding on the State column.
In the image below, after label encoding, a numeric value is assigned to each of the categorical values. You might be wondering why the numbering is not in sequence (top-down); the answer is that the numbering is assigned in alphabetical order. Delhi is assigned 0, followed by Gujarat as 1, and so on.

Label Encoding using Python
- Before we proceed with label encoding in Python, let us import the necessary data science libraries, such as pandas and NumPy.
- Then, with the help of pandas, we will read the Covid19_India data file, which is in CSV format, and check that the data file is loaded properly with the help of info(). We can notice that the State datatype is object. Now we can proceed with label encoding.
Label encoding can be performed in 2 ways, namely:
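The loading step could look roughly like the sketch below. Since the article's CSV file is not reproduced here, the example reads an inline stand-in through io.StringIO; the column name “Confirmed”, its counts, and the exact set of six states are made-up placeholders, not the real dataset.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for the article's Covid19_India CSV file. The state list and the
# "Confirmed" counts are made-up placeholders, not real figures.
csv_text = """State,Confirmed
Maharashtra,100
Delhi,80
Gujarat,60
Tamil Nadu,50
Rajasthan,40
Uttar Pradesh,30
"""

covid19 = pd.read_csv(io.StringIO(csv_text))

# info() prints the column names, non-null counts, and dtypes
covid19.info()

# The State column holds text (reported as "object" in many pandas versions),
# while the placeholder count column is numeric
print(covid19["State"].dtype)
print(np.issubdtype(covid19["Confirmed"].dtype, np.number))
```

With a real file on disk, the same call would take a file path instead of the StringIO object.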
- LabelEncoder class using scikit-learn library
- Category codes
Approach 1 – scikit-learn library approach
As label encoding in Python is part of data preprocessing, we will use the preprocessing module from the sklearn package and import the LabelEncoder class as below:
Then:
- Create an instance of LabelEncoder() and store it in the labelencoder variable/object
- Apply fit and transform, which does the trick of assigning a numerical value to each categorical value; the result is stored in a new column called “State_N”
- Note that we have added a new column called “State_N”, which contains the numerical value associated with each categorical value, and the column called State is still present in the dataframe. This column needs to be removed before we feed the final preprocessed data to a machine learning model
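A minimal sketch of this approach follows. The DataFrame is a small placeholder for the article's data: the “Confirmed” column, its values, and the six state names are assumptions, and model_input is a hypothetical name for the frame without the text column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Placeholder frame standing in for the article's data (values are made up)
covid19 = pd.DataFrame({
    "State": ["Maharashtra", "Delhi", "Gujarat",
              "Tamil Nadu", "Rajasthan", "Uttar Pradesh"],
    "Confirmed": [100, 80, 60, 50, 40, 30],
})

# Fit and transform in one call; the codes go into the new State_N column
labelencoder = LabelEncoder()
covid19["State_N"] = labelencoder.fit_transform(covid19["State"])

# Codes follow the alphabetical order of the state names, not the row order
print(covid19[["State", "State_N"]])

# Drop the original text column before passing the data to a model
model_input = covid19.drop(columns=["State"])
print(model_input.columns.tolist())  # ['Confirmed', 'State_N']
```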
Approach 2 – Category Codes
- As you have already observed, the “State” column datatype is object by default; hence, we need to convert “State” to a category type with the help of pandas
- We can access the codes of the categories by running covid19["State"].cat.codes
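The category-codes approach can be sketched as follows, again on a placeholder DataFrame whose “Confirmed” values and state list are assumptions. Storing the codes in a “State_N” column mirrors the first approach and is a choice made for this example.

```python
import pandas as pd

# Placeholder frame standing in for the article's data (values are made up)
covid19 = pd.DataFrame({
    "State": ["Maharashtra", "Delhi", "Gujarat",
              "Tamil Nadu", "Rajasthan", "Uttar Pradesh"],
    "Confirmed": [100, 80, 60, 50, 40, 30],
})

# Convert the text column to the pandas category dtype
covid19["State"] = covid19["State"].astype("category")

# Each category gets an integer code, assigned in sorted category order
covid19["State_N"] = covid19["State"].cat.codes

print(covid19["State"].cat.categories.tolist())
print(covid19["State_N"].tolist())  # [2, 0, 1, 4, 3, 5]
```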
One potential issue with label encoding is that, most of the time, there is no relationship of any kind between categories, while label encoding introduces one.
In the above six-class example for the “State” column, the relationship looks as follows: 0 < 1 < 2 < 3 < 4 < 5. This means that the numeric values can be misinterpreted by algorithms as having some kind of order. This doesn't make much sense if the categories are, for instance, states.
There is no such relation in the original data with the actual state names, but by using numerical values as we did, a number-related connection between the encoded data might be inferred. To overcome this problem, we can use one-hot encoding, as explained below.
One-Hot Encoding
In this approach, for each category of a feature, we create a new column (sometimes called a dummy variable) with binary encoding (0 or 1) to indicate whether a particular row belongs to this category.
Let us consider the previous State column. From the image below, we can see that new columns are created, from the state name Maharashtra through Uttar Pradesh, for a total of 6 new columns. 1 is assigned to a row that belongs to the category, and 0 is assigned to the rows that don’t belong to it.
A possible drawback of this method is a significant increase in the dimensionality of the dataset (which is known as the curse of dimensionality).
That is, one-hot encoding creates additional columns, one for each unique value in the set of the categorical attribute we’d like to encode. So, if we have a categorical attribute that contains, say, 1,000 unique values, one-hot encoding will generate 1,000 additional new attributes, which is not desirable.
To keep it simple, one-hot encoding is quite a powerful tool, but it is only applicable to categorical data that has a low number of unique values.
Creating dummy variables introduces a form of redundancy to the dataset. If a feature has three categories, we only need two dummy variables because, if an observation is neither of the two, it must be the third one. This is often referred to as the dummy-variable trap, and it’s a best practice to always remove one dummy variable column (known as the reference) from such an encoding.
Data shouldn’t fall into the dummy variable trap, which leads to a problem known as multicollinearity. Multicollinearity occurs when there is a relationship between the independent variables, and it’s a serious threat to multiple linear regression and logistic regression problems.
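One way to see the redundancy is to compare the rank of the design matrix with and without the reference column. The sketch below is only an illustration: the three-category “color” feature is hypothetical, and an intercept column is added on the assumption that the model is a regression with an intercept.

```python
import numpy as np
import pandas as pd

# Hypothetical feature with three categories
colors = pd.Series(["red", "green", "blue", "green", "red"], name="color")

# All three dummy columns plus an intercept column of ones
full = pd.get_dummies(colors).astype(float)
full.insert(0, "intercept", 1.0)

# Same again, but with the first (reference) dummy column dropped
reduced = pd.get_dummies(colors, drop_first=True).astype(float)
reduced.insert(0, "intercept", 1.0)

full_rank = np.linalg.matrix_rank(full.to_numpy())
reduced_rank = np.linalg.matrix_rank(reduced.to_numpy())

print(full.shape[1], full_rank)        # 4 columns, but rank 3
print(reduced.shape[1], reduced_rank)  # 3 columns, rank 3
```

In the full matrix the three dummy columns always add up to the intercept column, so one column is a linear combination of the others. Dropping the reference column removes that dependency.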
To sum up, we should avoid label encoding in Python when it introduces a false order to the data, which can, in turn, lead to incorrect conclusions. Tree-based methods (decision trees, Random Forest) can work with categorical data and label encoding. However, for algorithms such as linear regression, models calculating distance metrics between features (k-means clustering, k-Nearest Neighbors), or Artificial Neural Networks (ANN), one-hot encoding is the preferred choice.
One-Hot Encoding using Python
Now, let’s see how to apply one-hot encoding in Python. Getting back to our example, this process can be implemented in Python using 2 approaches:
- scikit-learn library
- Using Pandas
Approach 1 – scikit-learn library approach
- As one-hot encoding is also part of data preprocessing, we will use the preprocessing module from the sklearn package and import the OneHotEncoder class as below
- Instantiate the OneHotEncoder object; note that the parameter drop = ‘first’ will handle the dummy variable trap
- Perform one-hot encoding for the categorical variable
- Merge the one-hot encoded dummy variables into the actual data frame, but don’t forget to remove the original column called “State”
- From the output below, we can observe that the dummy variable trap has been taken care of
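The scikit-learn steps above might be sketched like this. The DataFrame is a placeholder (its “Confirmed” values and state list are assumptions), and the “State_” column prefix and the names dummies and final_df are choices made for this example. The sparse result is converted with toarray() so the sketch does not depend on version-specific constructor arguments.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Placeholder frame standing in for the article's data (values are made up)
covid19 = pd.DataFrame({
    "State": ["Maharashtra", "Delhi", "Gujarat",
              "Tamil Nadu", "Rajasthan", "Uttar Pradesh"],
    "Confirmed": [100, 80, 60, 50, 40, 30],
})

# drop="first" leaves out the first category to avoid the dummy variable trap
onehotencoder = OneHotEncoder(drop="first")
encoded = onehotencoder.fit_transform(covid19[["State"]]).toarray()

# Name the new columns after the categories that were kept (all but the first)
kept = onehotencoder.categories_[0][1:]
dummies = pd.DataFrame(
    encoded,
    columns=["State_" + str(name) for name in kept],
    index=covid19.index,
)

# Merge the dummy columns and remove the original State column
final_df = pd.concat([covid19.drop(columns=["State"]), dummies], axis=1)
print(final_df)
```

With these placeholder states, Delhi is the first category alphabetically, so it becomes the reference: its row has 0 in every dummy column.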
Approach 2 – Using Pandas: with the assistance of get_dummies function
- As we all know, one-hot encoding is such a common operation in analytics that pandas provides a function to get the corresponding new features representing the categorical variable.
- We consider the same dataframe called “covid19” and import the pandas library, which is sufficient to perform one-hot encoding
- As you can see in the code below, this generates a new DataFrame containing five indicator columns because, as explained earlier, for modeling we don’t need one indicator variable for each category; for a categorical feature with K categories, we need only K-1 indicator variables. In our example, “State_Delhi” was removed
- In the case of 6 categories, we need only five indicator variables to preserve the information (and avoid collinearity). That is why the function has another Boolean argument, drop_first=True, which drops the first category
- As the function generates another DataFrame, we need to concatenate (or add) the columns to our original DataFrame and also remember to remove the column called “State”
- Here, we use the pd.concat function, indicating with the axis=1 argument that we want to concatenate the columns of the two DataFrames given in the list (which is the first argument of pd.concat). Don’t forget to remove the actual “State” column
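The pandas steps above can be sketched as follows, again on a placeholder DataFrame whose “Confirmed” values and state list are assumptions. The prefix="State" argument is used so that the indicator columns are named in the “State_Delhi” style mentioned above.

```python
import pandas as pd

# Placeholder frame standing in for the article's data (values are made up)
covid19 = pd.DataFrame({
    "State": ["Maharashtra", "Delhi", "Gujarat",
              "Tamil Nadu", "Rajasthan", "Uttar Pradesh"],
    "Confirmed": [100, 80, 60, 50, 40, 30],
})

# drop_first=True keeps K-1 indicator columns (the first category is dropped)
dummies = pd.get_dummies(covid19["State"], prefix="State", drop_first=True)

# Concatenate column-wise (axis=1), then remove the original State column
covid19 = pd.concat([covid19, dummies], axis=1).drop(columns=["State"])

print(covid19.columns.tolist())
```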
Ordinal Encoding
An ordinal encoder is used to encode categorical features into ordinal numerical values (an ordered set). This approach transforms categorical values into numerical values that preserve their order.
This encoding technique appears almost similar to label encoding. However, label encoding does not consider whether a variable is ordinal or not, whereas ordinal encoding assigns a sequence of numerical values according to the order of the data.
Let’s create sample ordinal categorical data related to the customer feedback survey, and then apply the ordinal encoding technique. In this case, let’s say the feedback data is collected using a Likert scale in which the numerical code 1 is assigned to Poor, 2 to Good, 3 to Very Good, and 4 to Excellent. We know that 4 is better than 3, and 4 is much better than 2, but taking the difference between 4 and 2 is meaningless (Excellent minus Good is meaningless).
Ordinal Encoding using Python
With the help of pandas, we will assign customer survey data to a variable called “Customer_Rating” through a dictionary, and then we can map each row of the variable according to a second dictionary that defines the rating order.
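A minimal sketch of this dictionary-based mapping is shown below. The survey responses are hypothetical, and the names survey, rating_order, and “Customer_Rating_N” are assumptions made for this example; the codes follow the four-point scale described above.

```python
import pandas as pd

# Hypothetical survey responses, supplied through a dictionary
survey = pd.DataFrame({
    "Customer_Rating": ["Good", "Excellent", "Poor", "Very Good", "Good"],
})

# Explicit order of the ratings, from lowest to highest
rating_order = {"Poor": 1, "Good": 2, "Very Good": 3, "Excellent": 4}

# Map each row to its ordinal code according to the dictionary
survey["Customer_Rating_N"] = survey["Customer_Rating"].map(rating_order)

print(survey["Customer_Rating_N"].tolist())  # [2, 4, 1, 3, 2]
```

Unlike label encoding, the codes here follow the order given in the dictionary rather than alphabetical order, so “Excellent” receives the highest value.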
That brings us to the end of the blog on Label Encoding in Python. We hope you enjoyed this blog.