## Or how, in a hypothetical world affected by zombies, decision trees could make the difference between being out of the woods or not

*Outside the garage, the growls and snarls didn’t stop. He couldn’t believe that the Zombie Apocalypse he had watched again and again in series and films was finally on his front porch. He could stay hidden in the garage for a while but would have to go back out eventually. Should he take an axe with him or would the rifle be enough? He could try to search for some food but, should he go alone? He tried to recall all of the zombie movies he had seen but couldn’t settle on a single strategy. If he only had a way of remembering every scene where a character is killed by zombies, would that be enough to increase his chances of survival? If he just had a decision guide everything could be simpler…*

Have you ever watched one of those zombie apocalypse movies where there is one character that always seems to know where the zombies are hidden or whether it is better to fight them or run away? Does this person really know what will happen? Did someone tell him/her beforehand? Perhaps there is nothing magical about this. Perhaps this person has read a lot of comics about zombies and is really good at knowing what to do in each case and at learning from others’ mistakes. How important it is to find the best way of using the events of the past as a guide for our decisions! This guide, also known as a decision tree, is a widely used supervised learning algorithm. This article is an introductory discussion about decision trees, how to build them and why many of them make a random forest.

You are in the midst of the zombie mayhem and you want to know how to increase your chances of survival. At this point, you only have information from 15 of your friends. For each one of them you know if they were alone, if they had a vehicle or a weapon, or if they were trained to fight. Most importantly, you know if they were able to survive or not. How can you use this information to your advantage?

Table 1 summarizes the outcomes and characteristics of your 15 friends. You want to be like the 3 of them that survived in the end. What do these 3 friends have in common? A quick inspection of the table will tell us that the three survivors had all these things in common: they were not alone, they were trained to fight, and they had a vehicle and a weapon. So, will you be able to survive if you have all these 4 things? Past experiences are telling us that you might! If you had to decide what to take with you and whether to be on your own or not, at least you now have some historical data to support your decision.

Zombie apocalypses are never as simple as they look. Let’s say that instead of the 15 friends from the previous example, you have the following friends:

This time, reaching a conclusion only by visual inspection is not that easy. The only thing we know for sure is that if you want to survive, you had better have someone by your side. The 5 people who survived were not alone (Figure 1). Besides this, it is difficult to see if there is a particular combination of things that will lead you to survival. Some people were not able to survive even though they were not alone. What did they miss? If you know you will not be alone, what else can you do to increase your chances of surviving? Is there anything like a decision roadmap?

We can find some answers to the previous questions in a decision tree. A decision tree is a model of the expected outcome we can find according to the decisions we make. This model is built using previous experiences. In our example, we can build a decision tree using the characteristics of the 15 friends and their outcomes. A decision tree consists of multiple decision nodes or branches. In each one of these nodes, we make a decision that will take us to the next node until we reach an outcome.

## Growing a decision tree

If someone asks you to draw a family tree, you will probably start with your grandparents or great-grandparents. From there, the tree will grow through your parents, uncles and cousins, until it reaches you. In a similar way, to grow a decision tree you always start from the node that produces the best separation of your data. From that point, the tree will keep growing according to the feature that best divides the data. There are many algorithms you can use to grow a decision tree. This article explains how to use the information gain and the Shannon entropy.

Let’s focus on Table 2. We can see that there are 5 people who survived and 10 who died. This means that the probability of surviving is 5/15 = ⅓ and the probability of dying is ⅔. With this information, we can calculate the entropy of this distribution. In this case, entropy refers to the average level of surprise or uncertainty in this distribution. To calculate the entropy we use the following equation:

*H = −p(surv)·log₂ p(surv) − p(die)·log₂ p(die)*

Note that this equation can also be expressed in terms of only one of the probabilities since *p(surv)*+*p(die)*=1. If we plot this function you can see how the entropy reaches its highest value of 1 when both *p(surv)* and *p(die)* are equal to 0.5. On the contrary, if the whole distribution corresponds to people who all survived or all died, the entropy is zero. So, the higher the entropy, the higher the uncertainty. The lower the entropy, the more homogeneous the distribution is and the less “surprised” we will be about the outcome.

In our case, the number of survivors is less than half of the whole population. It would be reasonable to think that most people do not survive the zombie apocalypse. The entropy in this case is 0.92, which is what you get from the blue curve of Figure 2 when you look for x=⅓ or ⅔, or when you apply the following equation:

*H = −⅓·log₂(⅓) − ⅔·log₂(⅔) ≈ 0.92*
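That calculation can be reproduced in a few lines of Python (a minimal sketch; the `entropy` helper is a name chosen here for illustration, not a library function):

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a class-count distribution."""
    total = sum(counts)
    # Skip zero counts: by convention, 0 * log2(0) contributes 0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * log2(p) for p in probs)

# 5 survivors and 10 deaths out of 15 friends
print(round(entropy([5, 10]), 2))  # -> 0.92
```

Note how a perfectly homogeneous group such as `[6, 0]` gives an entropy of 0, while an even split such as `[1, 1]` gives the maximum of 1.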

Now that we know the entropy or the degree of uncertainty of the whole distribution, what should we do? The next step is to find how to divide the data so we can reduce that level of uncertainty.

The premise of the information gain consists of choosing the decision node that most reduces the level of entropy of the previous node. At this point, we are trying to find the best first separation of the data. Is it the fact that we are alone, that we know how to fight, or that we have a vehicle or a weapon? To know the answer we can calculate the information gain of each one of these decisions and then pick the one with the biggest gain. Remember that we are trying to maximize the reduction in entropy, which measures the heterogeneity or level of surprise in the outcome distribution.

## Are you trained to fight?

Is this the first question you should ask yourself in this case? Will this question produce the biggest reduction in the entropy of the outcome distribution? To know this, let’s calculate the entropy of each one of the two cases: we know how to fight and we do not know how to fight. Figure 3 shows that of the 9 people who knew how to fight, only 5 of them survived. On the contrary, none of the 6 people who were not trained to fight survived.

To calculate the entropy of the previous cases we can apply the same equation we used before. Figure 3 shows how the entropy for the case in which people were trained to fight is 0.99, whereas in the other case the entropy is zero. Remember that an entropy of zero means no surprise, a homogeneous distribution, which is exactly what happens here since none of the people who were not trained to fight survived. At this point, it is important to note that the calculation of the entropy in this second scenario contains an undefined term, since we end up with a logarithm of zero. In these cases, you can always apply L’Hôpital’s rule, as explained in this article.

Now we have to calculate the information gain from this decision. This is the same as asking how much the uncertainty in the outcomes would change if we split them according to this question. The information gain is calculated by subtracting the entropy of each decision from the entropy of the parent node. An important thing to notice is that this operation is weighted according to the number of people in each decision. So a big entropy can have a small effect on the information gain calculation if the number of people who took that decision is small. For this example, the information gain from the ability to fight is 0.32, as shown in Figure 3.
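The weighted subtraction just described can be sketched in code (the `entropy` and `info_gain` helpers are illustrative names, not library functions):

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a class-count distribution."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def info_gain(parent, splits):
    """Parent entropy minus the size-weighted entropy of each split."""
    n = sum(parent)
    weighted = sum(sum(s) / n * entropy(s) for s in splits)
    return entropy(parent) - weighted

# Trained to fight: 9 people (5 survived, 4 died); not trained: 6 (all died)
print(round(info_gain([5, 10], [[5, 4], [0, 6]]), 2))  # -> 0.32
```

The `[0, 6]` split has entropy zero, so only the trained group contributes to the weighted term.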

## Can you survive the zombies all by yourself?

We can do a similar analysis of the possibility of surviving the zombie apocalypse alone or with someone else. Figure 4 shows the calculation. In this case, the information gain is 0.52. Note how being alone never led to survival, whereas among the people who were not alone, 5 of the 7 survived.
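As a quick check of these numbers, the same entropy formula applied to the alone/not-alone split gives (illustrative sketch, reusing the helper from before):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Alone: 8 people (0 survived); not alone: 7 people (5 survived, 2 died)
h_not_alone = entropy([5, 2])
gain_alone = entropy([5, 10]) - (8 / 15) * entropy([0, 8]) - (7 / 15) * h_not_alone
print(round(h_not_alone, 2), round(gain_alone, 2))  # -> 0.86 0.52
```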

## What about having a vehicle or a weapon?

For these two cases, we can calculate the information gain as we did before (Figure 5). You can see how these information gains are smaller than the ones calculated previously. This means that, at this point, it is better to divide our data according to the two previous features than according to these ones. Remember that the biggest information gain corresponds to the feature that produces the biggest reduction in entropy. Once we have calculated the information gains for every feature, we can decide what the first node of the decision tree will be.

## The first node

Table 3 shows the information gains for each feature. The biggest information gain corresponds to the fact of being alone or having a companion. This node takes us to the first decision in our tree: you will not survive alone. The 8 people who were alone did not make it regardless of whether they had a weapon, a car, or were trained to fight. So this is the first thing we can infer from our analysis, and it supports what we had concluded by just inspecting the data.

At this point, the decision tree looks like Figure 5. We know that there is no chance of surviving if we are alone (taking into account the data we have). If we are not alone, then we might survive, but not in all cases. Since we can calculate the entropy at the right node in Figure 5, which is 0.86 (this calculation is shown in Figure 4), we can also calculate the information gain from the other three features and decide which the next decision node will be.

## The second node

Figure 5 shows that the biggest information gain at this point comes from the weapon feature, so that is the next decision node, as shown in Figure 6. Note how all the people who were not alone and had a weapon survived, which is why the left side of the weapon node ends in a survival decision.

## The tree is complete

There are still 3 people who were not alone and did not have a weapon that we need to classify. If we follow the same process explained previously, we will find that the next feature with the biggest information gain is the vehicle. So we can add an additional node to our tree in which we ask if a particular person had a vehicle. That will divide the remaining 3 people into a group of 2 people who did have a vehicle but did not survive, and one single person with no vehicle who survived. The final decision tree is presented in Figure 7.
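Written out as plain conditionals, the finished tree amounts to something like the following sketch (the function name is hypothetical; `True` means survival):

```python
def predict_survival(alone, has_weapon, has_vehicle):
    """Walk the decision tree of Figure 7 node by node."""
    if alone:
        return False   # all 8 people who were alone died
    if has_weapon:
        return True    # everyone not alone with a weapon survived
    if has_vehicle:
        return False   # 2 people with a vehicle but no weapon died
    return True        # the single remaining person survived

print(predict_survival(alone=False, has_weapon=True, has_vehicle=False))  # -> True
```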

## The problem with decision trees

As you can see, the decision tree is a model built with previous experiences. Depending on the number of features in your data, you will encounter multiple questions that will guide you to the final answer. It is important to note how in this case one of the features is not represented in the decision tree. The ability to fight was never chosen as a decision node since the other features always had a bigger information gain. This means that, according to the input data, being trained to fight is not important to survive a zombie apocalypse. However, this could also mean that we did not have enough samples to determine whether the ability to fight was important or not. The key here is to remember that a decision tree is only as good as the input data we use to build it. In this case, a sample of 15 people might not be enough to get a good estimation of the importance of being trained to fight. This is one of the problems of decision trees.

As with other supervised learning approaches, decision trees are not perfect. On the one hand, they rely heavily on the input data. This means that a small change in the input data can lead to important changes in the final tree. Decision trees are not really good at generalizing. On the other hand, they tend to have overfitting problems. In other words, we can end up with a complex decision tree that works perfectly with the input data but fails dramatically when we use a test set. This can also affect the results if we use the decision tree with continuous variables instead of categorical ones like those in the example presented.

One way of making decision trees more efficient is to prune them. This means stopping the algorithm before reaching a pure node like the ones we reached in our example. This can lead to the removal of a branch that is not providing any improvement in the accuracy of the decision tree. Pruning gives the decision tree more generalization power. However, if we decide to prune our decision tree, then we might start asking additional questions such as: when is the right moment to stop the algorithm? Should we stop when we reach a minimum number of samples? Or after a predefined number of nodes? How do we determine these numbers? Pruning can definitely help us to avoid overfitting, but it can lead to additional questions that are not so easy to answer.
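In scikit-learn, several of these stopping criteria are exposed as hyperparameters of `DecisionTreeClassifier`, such as `max_depth`, `min_samples_leaf` and `ccp_alpha` (cost-complexity pruning). A minimal sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 4))    # 4 binary features, stand-in data
y = (X[:, 0] & X[:, 1]).astype(int)      # synthetic rule, purely for illustration

# Pre-pruning: cap the depth and require a minimum number of samples per leaf
pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pruned.fit(X, y)
print(pruned.get_depth() <= 3)  # -> True
```

How deep to allow the tree, or how small a leaf may be, are exactly the tuning questions raised above.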

What if instead of a single decision tree, we have multiple decision trees? They will differ according to the portion of the input data they take, the features they read and their pruning characteristics. We will end up with many decision trees and many different answers, but we can always go with the majority in the case of a classification task, or with an average if we are working on regressions. This will help us to better generalize the distribution of our data. We might think that one decision tree is misclassifying, but if we find 10 or 20 trees that reach the same conclusion, then that is telling us that there might be no misclassification after all. Basically, we are letting the majority decide instead of guiding ourselves by one single decision tree. This technique is known as Random Forest.

The concept of Random Forests is usually related to the concept of Bagging, which is a process where a random sample of data in a training set is chosen with replacement. This means that individual data points can be chosen more than once. In the Random Forest methodology, we select a random number of points, build a decision tree, and then do this again until we have multiple trees. Then, the final decision will come from all the answers that were obtained from the trees.
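Sampling with replacement is easy to see in code. A quick NumPy sketch shows that a bootstrap sample of size n typically contains only about 63% (1 − 1/e) of the distinct points, because many points are drawn more than once:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
# Bootstrap: draw n points from n candidates, with replacement
sample = rng.choice(n, size=n, replace=True)
unique_fraction = len(np.unique(sample)) / n
print(round(unique_fraction, 2))  # close to 1 - 1/e ≈ 0.63
```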

Random Forests is a well-known ensemble method used for classification and regression problems. This method has been applied in many industries such as finance, healthcare and e-commerce [1]. Although the original idea of Random Forests was slowly developed by many researchers, Leo Breiman is usually known as the creator of this technique [2]. His personal webpage contains a detailed description of Random Forests and a thorough explanation of how and why they work. It is a long but worthy read.

An additional and important thing to understand about random forests is the way in which they work with the features of the dataset. At each node, the random forest will randomly select a pre-defined number of features, instead of all of them, to decide how to split the node. Remember that in the previous example, we analyzed the information gain of every feature at each level of the decision tree. On the contrary, a random forest will only analyze the information gain of a subset of the features at each node. So, the random forest mixes Bagging with random feature selection at each node.

Let’s go back to the zombies! The previous example was really simple: we had data from 15 people and we only knew 4 things about each one of them. Let’s make this harder! Let’s say that now we have a dataset with more than a thousand entries and for each one of them we have 10 features. This dataset was randomly generated in Excel and doesn’t belong to any business or private repository; you can access it from this GitHub page.

As is common with these kinds of methodologies, it is a good idea to split the whole dataset into a training and a testing set. We will use the training set to build the decision tree and random forest models, and then we will evaluate them with the test set. For this purpose, we will use the scikit-learn libraries. This Jupyter Notebook contains a detailed explanation of the dataset, how to load it and how to build the models using the library.
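A split like the one described can be done with scikit-learn’s `train_test_split`; the data below is a random stand-in with the same shape as the dataset, not the actual file from the GitHub page:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(1024, 10))  # stand-in for the 1024-entry, 10-feature dataset
y = rng.integers(0, 2, size=1024)        # stand-in survival labels

# 80/20 split; stratify keeps the survival/death ratio the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(len(X_train), len(X_test))  # -> 819 205
```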

The full dataset contains 1024 entries, of which 212 (21%) correspond to survivals and 812 (79%) to deaths. We divided this dataset into a training set that corresponds to 80% of the data (819 entries) and a testing set which contains 205 entries. Figure 8 shows how the proportion between survivals and deaths is maintained in all sets.

Regarding the features, this time we have 6 additional characteristics for each individual:

- Do you have a radio?
- Do you have food?
- Have you taken a course in outdoor survival?
- Have you taken a first aid course?
- Have you had a zombie encounter before?
- Do you have a GPS?

These 6 features, combined with the 4 features we already had, represent 10 different characteristics for each individual or entry. With this information, we can build a decision tree following the previously explained steps. The Jupyter Notebook uses the function DecisionTreeClassifier to generate a Decision Tree. Note that this function is not meant to work with categorical variables. In this case, we have converted all the answers for each category to -1 or +1. This means that every time we see a -1 in the results it means “No”, whereas a +1 means “Yes”. This is better explained in the Jupyter Notebook.
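A minimal sketch of that step, using randomly generated stand-in data instead of the actual dataset (the synthetic label rule is purely for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
# 10 yes/no features encoded as +1/-1, as in the article
X = rng.choice([-1, 1], size=(819, 10))
y = (X[:, 0] == 1).astype(int)  # hypothetical label rule for illustration only

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)
print(tree.score(X, y))  # -> 1.0 (a fully grown tree fits its own training data)
```

A perfect training score is exactly the overfitting symptom discussed earlier; the real test is performance on held-out data.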

The Notebook explains how to load the data, call the decision tree function and plot the results. Figure 9 shows the decision tree that was built with the 819 entries that corresponded to the training set (click here for a bigger picture). The dark blue boxes correspond to final decision nodes in which the answer was survival, whereas the dark orange boxes represent final decision nodes where the final answer was not survival. You can see how the first decision node corresponds to the vehicle and from there, the tree grows according to the different features.

We can evaluate how good this tree is by using the test set inputs to predict the final categories and then comparing these results with the original ones. Table 4 shows a confusion matrix with the number of times the decision tree misclassified an entry. We can see that the test set had 40 cases that represented survival and the decision tree only classified 25 of them correctly. On the other hand, of the 165 cases that did not survive, the decision tree misclassified 11. The ratio between the correct classifications and the whole test set of 205 points is 0.87, which is usually known as the prediction accuracy score.
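The accuracy score follows directly from the counts in Table 4:

```python
# Confusion-matrix counts from Table 4
true_surv, surv_correct = 40, 25        # 15 survivors misclassified
true_dead, dead_misclassified = 165, 11 # 154 deaths classified correctly

correct = surv_correct + (true_dead - dead_misclassified)
accuracy = correct / (true_surv + true_dead)
print(round(accuracy, 2))  # -> 0.87
```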

An accuracy of 87% does not look bad but, can we improve it using a random forest? The next section of the Jupyter Notebook contains an implementation of a random forest using the sklearn function RandomForestClassifier. This random forest will contain 10 decision trees that consider only 3 features at each split. Each of the decision trees in the random forest is built with a random subset of 682 entries, which represents 83% of the total training set. So, just to be clear, the random forest process will:

- Take a random subset of 682 entries from the training set
- Build a decision tree that considers 3 randomly chosen features at each node
- Repeat the previous steps 9 more times
- The predictions will correspond to the majority vote over the 10 decision trees
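The steps above map directly onto `RandomForestClassifier` parameters: `n_estimators` for the number of trees, `max_features` for the features considered at each split and `max_samples` for the bootstrap size. A sketch with stand-in data (the label rule is hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.choice([-1, 1], size=(819, 10))              # stand-in training data
y = ((X[:, 0] == 1) & (X[:, 1] == 1)).astype(int)    # illustrative label rule

# 10 trees, 3 candidate features per split, 682-entry bootstrap per tree
forest = RandomForestClassifier(
    n_estimators=10, max_features=3, max_samples=682, random_state=0
)
forest.fit(X, y)
print(len(forest.estimators_))  # -> 10
```

Calling `forest.predict` then returns the majority vote over the 10 fitted trees.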

Table 5 shows the confusion matrix for the results coming from the random forest. We can see that these results are better than what we were getting before with a single decision tree. This random forest misclassifies 11 entries and has a prediction accuracy score of 0.95, which is higher than that of the decision tree.

It is important to keep in mind that the random forest methodology is not only as good as the input data we have, but also as good as the parameters that we select. The number of decision trees we build and the number of features we analyze at each split will have an important effect on the outcome. So, as is the case with many other supervised learning algorithms, it is necessary to spend some time tuning the parameters until we find the best possible result.

Going through this article is similar to that guy in the movies who managed to escape from the zombie that was chasing him because a tree branch fell on the zombie’s head just at the right time! This is not the only zombie he will encounter and he is definitely not out of the woods yet! There are many things about Random Forests and Decision Trees that were not even mentioned in this article. However, it is enough to understand the usage and applicability of this technique. Currently, there are multiple libraries and programs that build these models in seconds. So you probably don’t need to go through the entropy and information gain calculations again. Still, it is important to understand what is happening behind the scenes and how to correctly interpret the results. In a world where topics such as “Machine Learning”, “Ensemble Methods” and “Data Analytics” are every day more common, it is important to have a clear idea of what these methods are and how to apply them to everyday problems. Unlike in the zombie apocalypse survival movies, being ready doesn’t happen by chance.

- [1] IBM. What is random forest?
- [2] Louppe, Gilles (2014). Understanding Random Forests. PhD dissertation. University of Liège.