A deep dive into a robust and popular algorithm
From drug discovery to species classification, credit scoring to cybersecurity and more, the random forest is a popular and powerful algorithm for modeling our complex world. Its versatility and predictive prowess would seem to require cutting-edge complexity, but when we dig into what a random forest actually is, we see an incredibly simple set of repeating steps.
I find that the best way to learn something is to play with it. So to build an intuition for how random forests work, let’s construct one by hand in Python, starting with a decision tree and expanding to the full forest. We’ll see first-hand how flexible and interpretable this algorithm is for both classification and regression. And while this project may sound complicated, there are really only a few core concepts we’ll need to learn: 1) how to iteratively partition data, and 2) how to quantify how well data is partitioned.
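Those two core concepts can be sketched in a few lines of Python. The sketch below is illustrative, not the article's final implementation: it splits a single feature at a threshold and scores the partition with Gini impurity, one common way to quantify how well data is partitioned. The function and variable names are my own assumptions.

```python
def gini_impurity(labels):
    """Probability that two labels drawn at random from `labels` disagree."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_score(feature, labels, threshold):
    """Weighted Gini impurity of the two groups created by splitting
    at `threshold` -- lower is better, 0.0 means perfectly pure children."""
    left = [y for x, y in zip(feature, labels) if x <= threshold]
    right = [y for x, y in zip(feature, labels) if x > threshold]
    n = len(labels)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

# Toy data: session length in minutes vs. whether a purchase happened.
session_minutes = [1, 2, 3, 8, 9, 10]
bought = [0, 0, 0, 1, 1, 1]
print(split_score(session_minutes, bought, 3))  # → 0.0 (perfect split)
```

Building a tree is then just a matter of trying candidate thresholds, keeping the one with the lowest score, and repeating on each child partition.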
Decision tree inference
A decision tree is a supervised learning algorithm that identifies a branching set of binary rules that map features to labels in a dataset. Unlike algorithms like logistic regression where the output is an equation, the decision tree algorithm is nonparametric, meaning it doesn’t make strong assumptions about the relationship between features and labels. This means decision trees are free to grow in whatever way best partitions their training data, so the resulting structure will vary between datasets.
One major benefit of decision trees is their explainability: each step the tree takes in deciding how to predict a class (for classification) or a continuous value (for regression) can be seen in the tree’s nodes. A model that predicts whether a customer will buy a product they viewed online, for example, might look like this.
Starting with the root, each node in the tree asks a binary question (e.g., “Was the session length longer than 5 minutes?”) and passes the feature vector to one of two child nodes depending on the answer.