
The generalization of gradient boosting to other types of problems (e.g., classification problems) and other loss functions follows from the observation that the residuals hₘ(xᵢ) are proportional to the negative gradients of the squared loss function with respect to Fₘ₋₁(xᵢ):

−∂(yᵢ − Fₘ₋₁(xᵢ))² / ∂Fₘ₋₁(xᵢ) = 2(yᵢ − Fₘ₋₁(xᵢ))

Therefore, we can generalize this method to any differentiable loss function by using the negative gradients of the loss function instead of the residuals.
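To make this concrete, here is a small numerical check (with toy values of my own, not taken from the article) showing that the residuals are proportional to the negative gradients of the squared loss:

```python
import numpy as np

# Toy check: for the squared loss L(y, F) = (y - F)**2, the negative gradient
# with respect to F is 2 * (y - F), i.e., proportional to the residual y - F.
y = np.array([3.0, -1.0, 2.5])
F = np.array([2.0, 0.5, 2.0])              # current predictions F_{m-1}(x_i)
residuals = y - F

eps = 1e-6                                 # central-difference approximation of dL/dF
loss = lambda F_: (y - F_) ** 2
neg_gradients = -(loss(F + eps) - loss(F - eps)) / (2 * eps)

print(np.allclose(neg_gradients, 2 * residuals))  # True
```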
We will now derive the general gradient boosting algorithm for any differentiable loss function.
Boosting approximates the true mapping from the features to the labels, y = f(x), using an additive expansion (ensemble) of the form:

F(x) = ∑ₘ hₘ(x),  m = 1, …, M

where hₘ(x) are base learners from some class H (typically decision trees of a fixed size), and M is the number of learners.
Given a loss function L(y, F(x)), our goal is to find an approximation F̂(x) that minimizes the average loss (1/n) ∑ᵢ L(yᵢ, F(xᵢ)) on the training set.

Gradient boosting uses an iterative approach to find this approximation. It starts from a constant model F₀(x) = c, where c is the constant that minimizes the total training loss ∑ᵢ L(yᵢ, c).

For example, if the loss function is the squared loss (used in regression problems), F₀(x) would be the mean of the target values.
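As a quick illustration (again with toy targets of my own), we can verify numerically that the constant minimizing the average squared loss is simply the mean of the targets:

```python
import numpy as np

# Toy check: the constant c that minimizes the average squared loss over the
# targets is their mean, so for squared loss F_0(x) = mean(y).
y = np.array([1.0, 4.0, 2.0, 5.0])
candidates = np.linspace(0.0, 6.0, 601)
avg_losses = [np.mean((y - c) ** 2) for c in candidates]
print(candidates[np.argmin(avg_losses)], y.mean())  # both are 3.0
```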
Then, it incrementally expands the model in a greedy fashion:

Fₘ(x) = Fₘ₋₁(x) + hₘ(x)

where the newly added base learner hₘ is fitted to minimize the sum of losses of the ensemble Fₘ, i.e., hₘ is the function h ∈ H that minimizes ∑ᵢ L(yᵢ, Fₘ₋₁(xᵢ) + h(xᵢ)).
Finding the best function hₘ at each iteration for an arbitrary loss function L is computationally infeasible. Therefore, we use an iterative optimization approach: in every iteration we choose a base learner hₘ that points in the negative gradient direction of the loss function. As a result, adding hₘ to the ensemble moves us closer to the minimum loss.
This process is similar to gradient descent, but it operates in function space rather than parameter space: in every iteration we move to a different function in the hypothesis space H, rather than taking a step in the parameter space of a specific function h. This allows h to be a non-parametric machine learning model, such as a decision tree. This process is called functional gradient descent.
In functional gradient descent, our parameters are the values of F(x) at the points x₁, …, xₙ, and we seek to minimize L(yᵢ, F(xᵢ)) at each individual point xᵢ. The direction of steepest descent of the loss at each point xᵢ is its negative gradient −gₘ(xᵢ), where gₘ(xᵢ) is the derivative of the loss with respect to its second argument, evaluated at Fₘ₋₁(xᵢ):

gₘ(xᵢ) = ∂L(yᵢ, F(xᵢ)) / ∂F(xᵢ),  evaluated at F(xᵢ) = Fₘ₋₁(xᵢ)
Therefore, the vector

−gₘ = (−gₘ(x₁), …, −gₘ(xₙ))

gives the steepest-descent direction in the n-dimensional data space at Fₘ₋₁. However, this gradient is defined only at the data points x₁, …, xₙ, and cannot be generalized to other x values.
In the continuous case, where H is the set of arbitrary differentiable functions on R, we could have simply chosen a function hₘ ∈ H with hₘ(xᵢ) = −gₘ(xᵢ).
In the discrete case (i.e., when the set H is finite), we choose hₘ to be the function in H that is closest to −gₘ(xᵢ) at the data points xᵢ, i.e., the hₘ that is most parallel to the vector −gₘ in Rⁿ. This function can be obtained by fitting a base learner hₘ to the training set {(xᵢ, ỹᵢₘ)}, with the labels

ỹᵢₘ = −gₘ(xᵢ)

These labels are called pseudo-residuals. In other words, in every boosting iteration we fit a base learner to predict the negative gradients of the loss function with respect to the ensemble's predictions from the previous iteration.
Note that this approach is heuristic and does not necessarily yield an exact solution to the optimization problem.
The complete pseudocode of the algorithm is shown below:

1. Initialize the model with a constant value: F₀(x) = the constant c that minimizes ∑ᵢ L(yᵢ, c).
2. For m = 1, …, M:
   (a) Compute the pseudo-residuals ỹᵢₘ = −gₘ(xᵢ) for i = 1, …, n.
   (b) Fit a base learner hₘ(x) to the training set {(xᵢ, ỹᵢₘ)}.
   (c) Update the model: Fₘ(x) = Fₘ₋₁(x) + hₘ(x).
3. Output the final model after M iterations.
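The following is a minimal Python sketch of this generic procedure, not the article's original implementation; it assumes scikit-learn regression trees as the base learners and leaves out refinements such as shrinkage, which is discussed later:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, neg_gradient, f0, n_estimators=100, max_depth=3):
    """Generic gradient boosting: fit each tree to the pseudo-residuals."""
    F = np.full(len(y), f0, dtype=float)          # step 1: constant model F_0
    trees = []
    for m in range(n_estimators):                 # step 2: boosting iterations
        pseudo_residuals = neg_gradient(y, F)     # (a) pseudo-residuals -g_m(x_i)
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, pseudo_residuals)  # (b)
        F += tree.predict(X)                      # (c) F_m = F_{m-1} + h_m
        trees.append(tree)
    return f0, trees

def predict(f0, trees, X):
    return f0 + sum(tree.predict(X) for tree in trees)

# Usage with squared loss, whose negative gradient is (proportional to) the residual y - F.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)
f0, trees = gradient_boost(X, y, neg_gradient=lambda y, F: y - F, f0=y.mean())
print(np.mean((predict(f0, trees, X) - y) ** 2))  # training MSE
```

Switching to another differentiable loss only requires passing a different neg_gradient function.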
Gradient tree boosting is a specialization of the gradient boosting algorithm to the case where the base learner h(x) is a fixed-size regression tree.
In each iteration, a regression tree hₘ(x) is fit to the pseudo-residuals. Let Kₘ be the number of its leaves. The tree partitions the input space into Kₘ disjoint regions R₁ₘ, …, Rₖₘ, and predicts a constant value in each region j, namely the mean of the pseudo-residuals in that region:

γⱼₘ = mean of ỹᵢₘ over the samples xᵢ ∈ Rⱼₘ

Therefore, the function hₘ(x) can be written as the following sum:

hₘ(x) = ∑ⱼ γⱼₘ · 1(x ∈ Rⱼₘ),  j = 1, …, Kₘ

where 1(·) is the indicator function.
These regression trees are built in a top-down greedy fashion using the mean squared error as the splitting criterion (see this article for more details on regression trees).
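As a small sanity check (synthetic data, not from the article), we can confirm that a scikit-learn regression tree trained with the squared-error criterion indeed predicts the mean of the pseudo-residuals in each leaf region:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic "pseudo-residuals" with a step structure plus noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
pseudo_residuals = np.where(X[:, 0] < 5, -1.0, 2.0) + rng.normal(scale=0.1, size=100)

tree = DecisionTreeRegressor(max_leaf_nodes=4).fit(X, pseudo_residuals)

# Compare each prediction to the mean of the pseudo-residuals in the same leaf.
leaf_ids = tree.apply(X)
leaf_means = {leaf: pseudo_residuals[leaf_ids == leaf].mean() for leaf in np.unique(leaf_ids)}
print(np.allclose(tree.predict(X), [leaf_means[leaf] for leaf in leaf_ids]))  # True
```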
The same gradient boosting algorithm can also be used for classification tasks. However, since the sum of the trees Fₘ(x) can be any continuous value, it must be mapped to a class or a probability. This mapping depends on the type of classification problem:
1. In binary classification problems, we use the sigmoid function to model the probability that xᵢ belongs to the positive class (similar to logistic regression):

p(xᵢ) = σ(Fₘ(xᵢ)) = 1 / (1 + exp(−Fₘ(xᵢ)))

The initial model in this case is given by the log-odds of the prior probability of the positive class, and the loss function is the binary log loss (see the code sketch after this list).
2. In multiclass classification problems, K trees (one for each of the K classes) are built at each of the M iterations. The probability that xᵢ belongs to class k is modeled as the softmax of the Fₘ,ₖ(xᵢ) values:

pₖ(xᵢ) = exp(Fₘ,ₖ(xᵢ)) / ∑ₗ exp(Fₘ,ₗ(xᵢ))

The initial model in this case is given by the prior probability of each class, and the loss function is the cross-entropy loss.
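The snippet below is a rough illustration of both mappings rather than the article's exact implementation. The binary part runs the generic boosting loop with the log loss, whose negative gradient with respect to F is y − p with p = σ(F); the multiclass part simply inspects scikit-learn's GradientBoostingClassifier, which builds K trees per iteration and applies the softmax:

```python
import numpy as np
from sklearn.datasets import make_classification, load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeRegressor

# Binary case: fit each tree to the pseudo-residuals y - p, where p = sigmoid(F).
# (Full implementations also adjust the leaf values with a Newton step; this
# sketch keeps the plain regression-tree leaf means.)
X, y = make_classification(n_samples=300, random_state=0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

p_prior = y.mean()
F = np.full(len(y), np.log(p_prior / (1 - p_prior)))     # F_0: log-odds of the prior

for m in range(50):
    p = sigmoid(F)
    tree = DecisionTreeRegressor(max_depth=2).fit(X, y - p)   # pseudo-residuals y - p
    F += 0.1 * tree.predict(X)

print("train accuracy:", np.mean((sigmoid(F) >= 0.5) == y))

# Multiclass case: K regression trees are built per iteration, and their outputs
# are mapped to class probabilities with the softmax function.
X_iris, y_iris = load_iris(return_X_y=True)
clf = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X_iris, y_iris)
print(clf.estimators_.shape)                      # (10, 3): 10 iterations x 3 classes
print(clf.predict_proba(X_iris[:2]).sum(axis=1))  # each row of probabilities sums to 1
```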
As with other ensemble methods based on decision trees, we need to control the complexity of the model in order to avoid overfitting. Several regularization techniques are commonly used with gradient-boosted trees.
First, we can use the same regularization techniques that we have in standard decision trees, such as limiting the depth of the tree, the number of leaves, or the minimum number of samples required to split a node. We can also use post-pruning techniques to remove branches from the tree that fail to reduce the loss by a predefined threshold.
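For example, in scikit-learn's GradientBoostingRegressor these controls correspond to the following constructor parameters (the values here are illustrative, not recommendations):

```python
from sklearn.ensemble import GradientBoostingRegressor

# Typical tree-complexity controls for gradient-boosted trees in scikit-learn.
reg = GradientBoostingRegressor(
    max_depth=3,           # limit the depth of each tree
    max_leaf_nodes=8,      # limit the number of leaves per tree
    min_samples_split=10,  # minimum samples required to split a node
    ccp_alpha=0.01,        # cost-complexity post-pruning threshold
)
```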
Second, we can control the number of boosting iterations (i.e., the number of trees in the ensemble). Increasing the number of trees reduces the ensemble's error on the training set, but may lead to overfitting. The optimal number of trees is typically found by early stopping, i.e., the algorithm is terminated once the score on the validation set stops improving for a specified number of iterations.
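A sketch of early stopping with scikit-learn's built-in validation split (the settings and synthetic data are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Hold out 10% of the training data and stop once the validation score has not
# improved for 10 consecutive iterations.
X, y = make_regression(n_samples=1000, noise=10.0, random_state=0)
reg = GradientBoostingRegressor(
    n_estimators=1000,        # upper bound on the number of trees
    validation_fraction=0.1,  # fraction of the data held out for validation
    n_iter_no_change=10,      # patience in iterations
    random_state=0,
).fit(X, y)

print(reg.n_estimators_)      # number of trees actually built (<= 1000)
```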
Lastly, Friedman [1, 2] suggested the following regularization techniques, which are more specific to gradient-boosted trees:
Shrinkage
Shrinkage [1] scales the contribution of each base learner by a constant factor ν:

Fₘ(x) = Fₘ₋₁(x) + ν·hₘ(x)

The parameter ν (0 < ν ≤ 1) is called the learning rate, as it controls the step size of the gradient descent procedure.
Empirically, it has been found that using small learning rates (e.g., ν ≤ 0.1) can significantly improve the model's generalization ability. However, smaller learning rates also require more boosting iterations in order to maintain the same training error, thereby increasing the computational cost of both training and prediction.
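An illustrative comparison on synthetic data (untuned settings, not results from the article) of a large learning rate with few trees versus a small learning rate with many trees:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# A smaller learning rate paired with more trees trades extra computation for
# potentially better generalization.
X, y = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=0)
for nu, n_trees in [(1.0, 100), (0.1, 1000)]:
    reg = GradientBoostingRegressor(learning_rate=nu, n_estimators=n_trees, random_state=0)
    score = cross_val_score(reg, X, y, cv=5).mean()
    print(f"learning_rate={nu}, n_estimators={n_trees}: mean CV R^2 = {score:.3f}")
```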
Stochastic Gradient Boosting (Subsampling)
In a follow-up paper [2], Friedman proposed stochastic gradient boosting, which combines gradient boosting with bagging.
In each iteration, a base learner is trained on only a fraction (typically 0.5) of the training set, drawn at random without replacement. This subsampling procedure introduces randomness into the algorithm and helps prevent the model from overfitting.
As in bagging, subsampling also allows us to use the out-of-bag samples (samples that were not involved in building the next base learner) to evaluate the performance of the model, instead of setting aside an independent validation data set. Out-of-bag estimates often underestimate the actual performance of the model, so they are typically used only when cross-validation would take too much time.
Another way to reduce the variance of the model is to randomly sample the features considered at each split in the tree (similar to random forests).
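Both forms of randomization are available in scikit-learn, and the out-of-bag improvements are recorded whenever row subsampling is used (the values below are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# subsample < 1.0 trains each tree on a random fraction of the rows, and
# max_features restricts the features considered at each split.
X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)
reg = GradientBoostingRegressor(
    subsample=0.5,        # each tree sees a random half of the training set
    max_features="sqrt",  # random subset of features per split, as in random forests
    random_state=0,
).fit(X, y)

# With subsample < 1.0, the improvement in loss on the out-of-bag samples is
# recorded at each iteration and can serve as a rough validation signal.
print(np.cumsum(reg.oob_improvement_)[-1])
```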
You can find the code examples of this article on my GitHub: https://github.com/roiyeho/medium/tree/main/gradient_boosting
Thanks for reading!
References
[1] Friedman, J.H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29, 1189–1232.
[2] Friedman, J.H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis, 38, 367–378.