
Random Forest Algorithm in Machine Learning: An Overview


Introduction to Random Forest Algorithm

In the field of data analytics, every algorithm comes at a price, and in the general scenario, a majority of business problems involve a classification task. It becomes quite difficult to know intuitively which algorithm to adopt given the nature of the data. Random Forests have various applications across domains such as finance, healthcare, marketing, and more. They are widely used for tasks like fraud detection, customer churn prediction, image classification, and stock market forecasting.

But today we will be discussing one of the top classifier techniques, and the one most trusted by data experts: the Random Forest Classifier. Random Forest also has a regression variant, which will be covered here as well.

If you want to learn in-depth, do try our free random forest course at Great Learning Academy. Recognising the importance of tree-based classifiers, this course has been curated around them and will help you understand decision trees, random forests, and how to implement them in Python.

The word ‘Forest’ in the name suggests that it will contain a lot of trees. The algorithm combines a bundle of decision trees to make a classification, and it is also considered a saving technique when it comes to the overfitting of a decision tree model. A decision tree model has high variance and low bias, which can give us quite unstable output, unlike the commonly adopted logistic regression, which has high bias and low variance. That is exactly the point where Random Forest comes to the rescue. But before discussing Random Forest in detail, let’s take a quick look at the tree concept.

In the real world, a forest is a combination of trees, and in the machine learning world, a Random Forest is a combination, or ensemble, of Decision Trees.

So, let us understand what a decision tree is before we combine trees to create a forest.

Imagine you’re going to make a major expense, say, buy a car. Assuming you would want the best model that fits your budget, you wouldn’t just walk into a showroom and walk out, or rather drive out, with your car, would you?

So, let’s assume you want to buy a car for 4 adults and 2 children. You prefer an SUV with maximum fuel efficiency, and you prefer a little luxury, like good speakers, a sunroof, and cosy seating. Say you have shortlisted models A and B.

Model A is recommended by your friend X because the speakers are good and the fuel efficiency is the best.

Model B is recommended by your friend Y because it has 6 comfortable seats, the speakers are good, and the sunroof is nice. The fuel efficiency is low, but she feels the other features make up for it and that it is the best.

Model B is recommended by your friend Z as well, because it has 6 comfortable seats, the speakers are better, the sunroof is nice, and the fuel efficiency is good by her rating.

It is very likely that you would go with Model B, as you have a majority vote for this model from your friends. Your friends have voted considering the features of their choice and a decision model based on their own logic.

Imagine your friends X, Y, and Z as decision trees: you created a random forest with a few decision trees and, based on the outcomes, you chose the one that was recommended by the majority.

This is how a Random Forest classifier works.

What is Random Forest?

Definition from Wikipedia

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the mean or average prediction of the individual trees is returned.

Random Forest Features

Some interesting facts about Random Forests:

  • Accuracy of Random Forest is generally very high
  • Its efficiency is particularly notable in large data sets
  • It provides an estimate of the important variables in a classification (see the sketch after this list)
  • Forests generated can be saved and reused
  • Unlike other models, it does not overfit with more features
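
As a minimal sketch of that variable-importance estimate (synthetic data; feature_importances_ is scikit-learn’s standard attribute, not something built in this article):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 10 features, only 3 of which actually carry signal
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importance: how much each feature reduces impurity,
# averaged across all trees in the forest
for i, imp in enumerate(forest.feature_importances_):
    print(f"feature {i}: {imp:.3f}")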

How does Random Forest work?

Let’s get it working

A Random Forest is a collection of Decision Trees. Each tree independently makes a prediction; the values are then averaged (regression) or max-voted (classification) to arrive at the final value.

The strength of this model lies in creating different trees with different sub-features drawn from the full feature set. The features chosen for each tree are random, so the trees do not grow deep and stay focused only on their own set of features.

Finally, when they are put together, we create an ensemble of Decision Trees that gives a well-learned prediction.
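
As a minimal sketch of this idea (hand-rolled on a synthetic dataset; the variable names and sizes are our own illustration, not from a real case study):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Each tree gets its own bootstrap sample of rows and its own random feature subset
rng = np.random.default_rng(0)
trees, feats = [], []
for i in range(5):
    rows = rng.integers(0, len(X), len(X))                # sample rows with replacement
    cols = rng.choice(X.shape[1], size=3, replace=False)  # random sub-features
    trees.append(DecisionTreeClassifier(random_state=i).fit(X[rows][:, cols], y[rows]))
    feats.append(cols)

# Max vote (classification): each tree predicts, the majority class wins
votes = np.array([t.predict(X[:, c]) for t, c in zip(trees, feats)])
majority = (votes.mean(axis=0) > 0.5).astype(int)
print("ensemble training accuracy:", (majority == y).mean())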

An Illustration of Constructing a Random Forest

Let us now construct a Random Forest model for, say, buying a car.

One of the decision trees could check for features such as Number of Seats and Sunroof availability and decide yes or no.

Here the decision tree requires the number-of-seats parameter to be greater than 6, as the customer prefers an SUV, and it also prefers a car with a sunroof. The tree would give the highest value to the model that satisfies both the criteria, rate it lower if either of the parameters is not met, and rate it lowest if both the parameters are No. Let us see an illustration of the same below:

Another decision tree could check for features such as Quality of Stereo, Comfort of Seats and Sunroof availability and decide yes or no. This tree would also rate the model based on the outcome of these parameters and decide yes or no depending upon the criteria met. The same has been illustrated below.

Another decision tree could check for features such as Number of Seats, Comfort of Seats, Fuel Efficiency and Sunroof availability and decide yes or no. The Decision Tree for the same is given below.

Each of the decision trees may give you a Yes or a No based on the data set. The trees are independent, and a decision made using one tree would depend purely on the features that particular tree looks at. If a decision tree considered all the features, its depth would keep increasing, causing an overfit model.

A more efficient way would be to combine these decision trees and create an ultimate decision-maker based on the output from each tree. That would be a random forest.

Once we receive the output from every decision tree, we use the majority vote to arrive at the decision. To use this as a regression model, we would take the average of the values instead.

Let us see how a random forest would look for the above scenario.

The data for each tree is selected using a technique called bagging, which draws a random set of data points from the data set for each tree. A data point that has been drawn can be drawn again (sampling with replacement) or set aside (without replacement). Each tree also randomly picks its features from the subset of data provided. This randomness makes it possible to estimate feature importance: the feature that influences the majority of the decision trees is the feature of maximum importance.
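
A tiny sketch of that bootstrap step (the row indices here are made up purely for illustration):

import numpy as np

rng = np.random.default_rng(42)
rows = np.arange(10)  # pretend these are the row indices of the training set

# Bagging: draw n rows *with* replacement; some rows repeat, others are left out
bootstrap = rng.choice(rows, size=len(rows), replace=True)
out_of_bag = np.setdiff1d(rows, bootstrap)  # rows never drawn ("out-of-bag" data)

print("bootstrap sample:", bootstrap)
print("out-of-bag rows :", out_of_bag)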

Now, once the trees are built with a subset of the data and their own set of features, each tree independently executes to give its decision. This decision will be a Yes or a No in the case of classification.

There will then be an ensemble of trees, created by aggregating their outputs, which helps reduce classification errors. The final output is determined by the max-vote method for classification.

Let us see an illustration of the same below.

Each decision tree decides independently based on its own subset of data and features, so the results will not all be the same. Assuming Decision Tree 1 suggests ‘Buy’, Decision Tree 2 suggests ‘Don’t Buy’ and Decision Tree 3 suggests ‘Buy’, the max vote would be for Buy, and the result from the Random Forest would be ‘Buy’.

Each tree has 3 major types of node:

  • Root Node
  • Leaf Node
  • Decision Node

The node where the final decision is made is called the ‘Leaf Node’; the function that makes a decision sits in a ‘Decision Node’; and the ‘Root Node’ is where the full data set enters the tree and the first split is made.

Please note that the features chosen are random and may repeat across trees; this increases the efficiency and compensates for missing data. While splitting a node, only a subset of features is considered, and the best feature among this subset is used for the split; this diversity results in better efficiency.

When we create a Random Forest machine learning model, the decision trees are built on random subsets of features, and the trees are split further and further. The entropy, or rather the information gained from reducing it, is an important parameter used to decide each tree split: when branches are created, the total entropy of the sub-branches should be lower than the entropy of the parent node. If the entropy no longer drops, the information gained approaches zero, which is a criterion used to stop further splitting of the tree. You can learn more with the help of a random forest machine learning course.
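
As a minimal, self-contained sketch (the parent/child label arrays below are invented purely to illustrate the arithmetic):

import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Hypothetical parent node: 5 'Buy' (1) and 5 'Don't Buy' (0) labels
parent = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
# A candidate split produces two purer sub-branches
left, right = np.array([1, 1, 1, 1, 0]), np.array([1, 0, 0, 0, 0])

weighted_child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
info_gain = entropy(parent) - weighted_child
print(f"parent entropy      : {entropy(parent):.3f}")  # 1.000
print(f"children (weighted) : {weighted_child:.3f}")   # ~0.722
print(f"information gain    : {info_gain:.3f}")        # ~0.278, so this split helps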

How does it differ from a Decision Tree?

A decision tree offers a single path and considers all the features at once. This can create deeper trees, making the model overfit. A Random Forest creates multiple trees with random features, and the trees are not very deep.

Using an ensemble of decision trees also maximizes the efficiency, as averaging the results provides generalized output.

While a decision tree’s structure largely depends on the training data and may change drastically even for a slight change in the training data, the random selection of features keeps the structural deviation small when the data changes. With the addition of techniques such as bagging for the selection of data, this can be minimized further.

Having said that, the storage and computational capacities required are greater for a Random Forest than for a decision tree.

In summary, a Random Forest provides much better accuracy and efficiency than a decision tree, but this comes at a cost in storage and computational power.
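
A minimal sketch of this trade-off (the synthetic dataset and parameter values are assumptions for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# A single fully grown tree tends to overfit; the forest averages that away
tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("tree   CV accuracy:", cross_val_score(tree, X, y, cv=5).mean())
print("forest CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())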

Let’s Regularize through Hyperparameters

Hyperparameters give us a certain degree of control over the model to ensure better efficiency. Some of the commonly tuned hyperparameters are below.

n_estimators – This parameter determines the number of trees in the forest. The higher the number, the more robust the aggregate model, but at the cost of more computational power.

max_depth – This parameter restricts the number of levels of each tree. Creating more levels increases the possibility of considering more features in each tree. A deep tree would create an overfit model, but in a Random Forest this is mitigated because we ensemble at the end.

max_features – This parameter restricts the maximum number of features to be considered at every split. It is one of the most important parameters for efficiency. Generally, a grid search with cross-validation is performed over various values of this parameter to arrive at the best value.

bootstrap – This determines the method used for sampling data points: with or without replacement.

max_samples – This decides the fraction of the training data used to train each tree. This parameter is generally not touched, as the samples that are not used for training (out-of-bag data) can be used to evaluate the forest, and it is preferred to use the entire training data set for training.
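
A minimal sketch of tuning these knobs (the grid values below are arbitrary assumptions, and the out-of-bag score is shown as the free validation signal mentioned above):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'max_features': ['sqrt', 0.5],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print('best parameters:', search.best_params_)

# With bootstrap=True (the default), out-of-bag data doubles as a validation set
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)
print('OOB accuracy:', forest.oob_score_)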

Real World Random Forests

Being a machine learning model that can be used for both classification and regression, combined with good efficiency, this is a popular model in various arenas.

Random Forest can be applied to any multi-dimensional data set, so it is a popular choice for identifying customer loyalty in retail, predicting stock prices in finance, recommending products to customers, and even identifying the right composition of chemicals in the manufacturing industry.

With its ability to do both regression and classification, it produces better efficiency than most of the classical models in most arenas.

Real-Time Use cases

Random Forest has been the go-to model for price prediction and fraud detection in financial statements; various research papers published in these areas recommend Random Forest as the model producing the best accuracy. (Ref 1, 2)

The Random Forest model has proved to give good accuracy in predicting disease based on patient features. (Ref 3)

The Random Forest model has also been used to detect Parkinson-related lesions within the midbrain in 3D transcranial ultrasound. This was developed by training the model to understand the organ arrangement, size and shape from prior knowledge, with the leaf nodes predicting the organ class and spatial location. With this, it provides improved class predictability. (Ref 4)

Furthermore, the random forest technique focuses on both the observations and the variables of the training data when developing individual decision trees, taking the maximum vote for classification and the overall average for regression, respectively. It also uses a bagging technique that draws observations at random and restricts the columns available to each tree, so that a few dominant variables cannot sit at the root of every decision tree. In this way, a random forest builds trees that are largely de-correlated from one another, trading a little single-tree accuracy for diversity. There is a rule of thumb for selecting sub-samples: if we take 2/3 of the observations as training data and let p be the number of columns, then

  1. For classification, we take sqrt(p) columns.
  2. For regression, we take p/3 columns.

The above rule of thumb can be tuned if you want to increase the accuracy of the model.
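
For instance (with p chosen arbitrarily), the rule of thumb works out as below; note that scikit-learn’s RandomForestClassifier defaults to the sqrt rule via max_features='sqrt', while p/3 is a classical recommendation for regression rather than the current library default:

import math

p = 36  # hypothetical number of columns in the training data

print('classification:', int(math.sqrt(p)), 'features per split')  # sqrt(36) = 6
print('regression    :', p // 3, 'features per split')             # 36 / 3 = 12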

Let us interpret both the bagging and random forest techniques, where we draw two samples, one in blue and another in pink.

From the above diagram, we can see that the bagging technique has selected a few observations but all columns. On the other hand, the random forest selected a few observations and a few columns, creating uncorrelated individual trees.

A sample idea of a random forest classifier is given below

The above diagram gives us an idea of how each tree grows and how the depth of the trees varies with the sample selected; ultimately, voting is performed for the final classification, and averaging is performed when we deal with a regression problem.

Classifier Vs. Regressor

A random forest classifier works with data having discrete labels, better known as classes.

Example – whether a patient is suffering from cancer or not, whether a person is eligible for a loan or not, etc.

A random forest regressor works with data having a numeric or continuous output, which cannot be defined by classes.

Example – the price of houses, the milk production of cows, the gross income of companies, etc.
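
A minimal side-by-side sketch of the two variants (synthetic data and parameters are assumptions):

from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Discrete labels -> classifier, majority vote across trees
Xc, yc = make_classification(n_samples=200, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xc, yc)
print('predicted class :', clf.predict(Xc[:1]))

# Continuous target -> regressor, average across trees
Xr, yr = make_regression(n_samples=200, random_state=0)
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(Xr, yr)
print('predicted value :', reg.predict(Xr[:1]))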

Advantages and Disadvantages of Random Forest

  1. It reduces overfitting in decision trees and helps to improve accuracy
  2. It is flexible to both classification and regression problems
  3. It works well with both categorical and continuous values
  4. It handles missing values present in the data
  5. Normalising the data is not required, as it uses a rule-based approach

However, despite these benefits, the random forest algorithm also has some drawbacks.

  1. It requires a lot of computational power as well as resources, as it builds numerous trees and combines their outputs.
  2. It also requires a lot of training time, as it combines many decision trees to determine the class.
  3. Due to the ensemble of decision trees, it also suffers in interpretability, making it hard to trace how each variable contributed to a given prediction.

Applications of Random Forest

Banking Sector

Banking analysis requires a lot of effort, as it carries a high risk of profit and loss. Customer analysis is one of the most widely used studies in the banking sector. For problems such as a customer’s loan-default probability or the detection of fraudulent transactions, a random forest can be a great choice.

The above representation is a tree that decides whether a customer is eligible for loan credit based on conditions such as account balance, duration of credit, payment status, etc.

Healthcare Sector

In the pharmaceutical industry, a random forest can be used to identify the potential of a certain medicine or the composition of chemicals required for a medicine. It can also be used in hospitals to identify the diseases a patient suffers from, the risk of cancer in a patient, and many other conditions where early analysis and research play a crucial role.

Applying Random Forest with Python and R

We will perform case studies in Python and R for both Random Forest regression and classification.

Random Forest Regression in Python

For regression, we will be dealing with data that contains the salaries of employees based on their positions. We will use this to predict the salary of an employee based on his position.

Let us start with the libraries and the data:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('Salaries.csv')
df.head()
X = df.iloc[:, 1:2].values  # position level, kept two-dimensional for sklearn
y = df.iloc[:, 2].values    # salary

As the dataset is very small, we won’t perform any splitting. We will proceed directly to fitting the data.

from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators = 10, random_state = 0)
model.fit(X, y)

Did you notice that we built just 10 trees by setting n_estimators=10? It’s up to you to play around with the number of trees. As this is a small dataset, 10 trees are enough.

Now we’ll predict the salary of a person who is at level 6.5.

y_pred =model.predict([[6.5]])

After the prediction, we can see that the employee should get a salary of 167,000 on reaching level 6.5. Let us visualise this to interpret it in a better way.

X_grid_data = np.arange(min(X), max(X), 0.01)
X_grid_data = X_grid_data.reshape((len(X_grid_data), 1))  # column vector for predict()
plt.scatter(X, y, color="red")
plt.plot(X_grid_data, model.predict(X_grid_data), color="blue")
plt.title('Random Forest Regression')
plt.xlabel('Position')
plt.ylabel('Salary')
plt.show()

Random Forest Regression in R

Now we will build the same model in R and see how its prediction compares.

We’ll first import the dataset:

df = read.csv('Position_Salaries.csv')
df = df[2:3]

In R too, we won’t perform splitting, as the data is too small. We will use the entire data for training and make an individual prediction, as we did in Python.

We will use the ‘randomForest’ library. In case you have not installed the package, the code below will help you out.

install.packages('randomForest')
library(randomForest)
set.seed(1234)

The seed function will help you get the same results that we got during training and testing.

model= randomForest(x = df[-2],
                         y = df$Salary,
                         ntree = 500)

Now we’ll predict the salary of a level 6.5 employee and see how much it differs from the one predicted using Python.

y_prediction = predict(model, data.frame(Level = 6.5))

As we can see, the prediction gives a salary of 160,908, but in Python we got a prediction of 167,000. It is up to the data analyst to decide which works better. We are done with the prediction; now it’s time to visualise the data.

install.packages('ggplot2')
library(ggplot2)
x_grid_data = seq(min(df$Level), max(df$Level), 0.01)
ggplot() +
  geom_point(aes(x = df$Level, y = df$Salary), color = 'red') +
  geom_line(aes(x = x_grid_data,
                y = predict(model, newdata = data.frame(Level = x_grid_data))),
            color = 'blue') +
  ggtitle('Truth or Bluff (Random Forest Regression)') +
  xlab('Level') +
  ylab('Salary')

So that is it for regression using R. Now let us quickly move on to classification to see how Random Forest works there.

Random Forest Classifier in Python

For classification, we’ll use the Social Network Ads data, which contains information about whether a product was purchased based on a person’s age and salary. Let us import the libraries:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Now let us look at the dataset:

df = pd.read_csv('Social_Network_Ads.csv')
df

For your information, the dataset contains 400 rows and 5 columns.

X = df.iloc[:, [2, 3]].values  # Age and EstimatedSalary
y = df.iloc[:, 4].values       # Purchased (0/1)

Now we’ll split the data for training and testing. We’ll take 75% for training and the rest for testing.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

Now we’ll standardise the data using StandardScaler from the sklearn library.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

After scaling, let us look at the head of the data.


Now it’s time to fit our model.

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
model.fit(X_train, y_train)

We have built 10 trees and used the criterion ‘entropy’, as it is used to decrease the impurity in the data. You can increase the number of trees if you wish, but we are keeping it limited to 10 for now.
Now that the fitting is done, we will predict on the test data.

y_prediction = model.predict(X_test)

After the prediction, we can evaluate with a confusion matrix and see how well our model performs.

from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(y_test, y_prediction)
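
To put a single number on it (a small addition on our part; accuracy_score is a standard scikit-learn helper):

from sklearn.metrics import accuracy_score
print(conf_mat)
print('Test accuracy:', accuracy_score(y_test, y_prediction))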

Great. As we can see, our model is doing well, as the rate of misclassification is very low. Now let us visualise our training results.

from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(
    np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
    np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2,
             model.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Random Forest Classification (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

Now let us visualise the test results in the same way.

from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(
    np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
    np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2,
             model.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Random Forest Classification (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

So that’s it for Python. We’ll now build the same model in R.

Random Forest Classifier in R

Let us import the dataset and check the head of the data.

df = read.csv('SocialNetwork_Ads.csv')
df = df[3:5]

Now, in R, we need to convert the class to a factor, so we need further encoding.

df$Purchased = factor(df$Purchased, levels = c(0, 1))

Now we’ll split the data and see the results. The splitting ratio will be the same as we used in Python.

install.packages('caTools')
library(caTools)
set.seed(123)
split_data = sample.split(df$Purchased, SplitRatio = 0.75)
training_set = subset(df, split_data == TRUE)
test_set = subset(df, split_data == FALSE)

Also, we’ll perform the standardisation of the data and see how it performs during testing.

training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])

Now we fit the model using the ‘randomForest’ package provided by R.

install.packages('randomForest')
library(randomForest)
set.seed(123)
model= randomForest(x = training_set[-3],
                          y = training_set$Purchased,
                          ntree = 10)

We set the number of trees to 10 to see how it performs. You can set any number of trees to improve accuracy.

y_prediction = predict(model, newdata = test_set[-3])

Now the prediction is done, and we’ll evaluate it using a confusion matrix.

conf_mat = table(test_set[, 3], y_prediction)
conf_mat

As we can see, the model underperforms compared to the Python version, as the rate of misclassification is higher.

Now let us interpret our results using visualisation. We will be using the ElemStatLearn package for smooth visualisation.

library(ElemStatLearn)
train_set = training_set
X1 = seq(min(train_set[, 1]) - 1, max(train_set[, 1]) + 1, by = 0.01)
X2 = seq(min(train_set[, 2]) - 1, max(train_set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(model, grid_set)
plot(train_set[, -3],
     main = 'Random Forest Classification (Training set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = ".", col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(train_set, pch = 21, bg = ifelse(train_set[, 3] == 1, 'green4', 'red3'))

The model works fine, as is evident from the visualisation of the training data. Now let us see how it performs with the test data.

library(ElemStatLearn)
testset = test_set
X1 = seq(min(testset[, 1]) - 1, max(testset[, 1]) + 1, by = 0.01)
X2 = seq(min(testset[, 2]) - 1, max(testset[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(model, grid_set)
plot(testset[, -3],
     main = 'Random Forest Classification (Test set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = ".", col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(testset, pch = 21, bg = ifelse(testset[, 3] == 1, 'green4', 'red3'))

That’s it for now. The test data worked just fine, as expected.

Inference

Random Forest works well when we are trying to avoid the overfitting that comes from building a single decision tree. It also works well when the data mostly contains categorical variables. Other algorithms, like logistic regression, can outperform it when it comes to numeric variables, but when it comes to making a decision based on conditions, the random forest is the best choice. It is entirely up to the analyst to play around with the parameters to improve accuracy. There is generally less chance of overfitting, as it uses a rule-based approach. But then again, it depends on the data and on the analyst to choose the best algorithm. Random Forest is a very popular machine learning model because it provides good efficiency and its decision-making closely resembles human thinking. The ability to inspect feature importance helps us explain the model, even though it is otherwise something of a black box. The efficiency it provides and how hard it is to overfit are the great benefits of this model. It can be used in virtually any industry, and the published research papers are evidence of the efficacy of this simple yet great model.
