Random Forests have come a long way
In terms of Machine Learning timelines, Random Forests (RFs), introduced in the seminal paper of Breiman ([1]), are ancient. Despite their age, they keep impressing with their performance and are a topic of active research. The goal of this article is to highlight what a versatile toolbox Random Forest methods have become, focusing on Generalized Random Forest (GRF) and Distributional Random Forest (DRF).
In short, the main idea underlying both methods is that the weights implicitly produced by an RF can be used to estimate targets other than the conditional expectation. The idea of GRF is to use a Random Forest with a splitting criterion that is tailored to the target one has in mind (e.g., conditional mean, conditional quantiles, or the conditional treatment effect). The idea of DRF is to adapt the splitting criterion such that the whole conditional distribution can be estimated. From this object, many different targets can then be derived in a second step. In fact, I mostly talk about DRF in this article, as I am more familiar with this method and it is somewhat more streamlined (only one forest needs to be fitted for a wide range of targets). However, all the advantages indicated in the figure above also apply to GRF, and in fact, the DRF package in R is built upon the professional implementation of GRF. Moreover, the fact that the splitting criterion of GRF forests is tailored to the target means it can have better performance than DRF. This is especially true for binary Y, where probability_forests() should be used. So, even though I talk mostly about DRF, GRF should be kept in mind throughout this article.
The goal of this article is to give an overview with links to deeper reading in the corresponding sections. We will go through each of the points in the above figure clockwise, reference the corresponding articles, and highlight each with a small example. I first quickly summarize the most important links to further reading below:
Versatility/Performance: Medium Article and Original Papers (DRF/GRF)
Missing Values Incorporated: Medium Article
Uncertainty Measures: Medium Article
Variable Importance: Medium Article
The Example
We take X_1, X_2, X_4, …, X_10 independently uniform on (-1,1) and create dependence between X_1 and X_3 by taking X_3 = X_1 + uniform error. Then we simulate Y as

Y = 0.8 · 1(X_1 > 0) + (1 + 1(X_2 > 0)) · ε,   with ε ~ N(0,1),

that is, a mean shift depending on X_1 and a standard deviation shift depending on X_2:
## Load packages and functions needed
library(drf)
library(mice)
source("drfnew_v2.R")
## The function drfnew_v2.R is available in the Appendix below, or at
## https://github.com/JeffNaef/drfupdate

## Set parameters
set.seed(10)
n<-1000
## Simulate data that experiences both a mean shift as well as an sd shift
# Simulate from X
x1 <- runif(n,-1,1)
x2 <- runif(n,-1,1)
x3 <- x1+ runif(n,-1,1)
X0 <- matrix(runif(7*n,-1,1), nrow=n, ncol=7)
Xfull <- cbind(x1,x2, x3, X0)
colnames(Xfull)<-paste0("X", 1:10)
# Simulate dependent variable Y
Y <- as.matrix(rnorm(n,mean = 0.8*(x1 > 0), sd = 1 + 1*(x2 > 0)))
##Also add MAR missing values using ampute from the mice package
X<-ampute(Xfull)$amp
head(cbind(Y,X))
Y X1 X2 X3 X4 X5
1 -3.0327466 -0.4689827 0.06161759 0.27462737 NA -0.624463079
2 1.2582824 -0.2557522 0.36972181 NA -0.04100963 0.009518047
3 -0.8781940 0.1457067 -0.23343321 NA -0.64519687 -0.945426305
4 3.1595623 0.8164156 0.90997600 0.69184618 -0.20573331 -0.007404298
5 1.1176545 -0.5966361 NA -1.21276055 0.62845399 0.894703422
6 -0.4377359 0.7967794 -0.92179989 -0.03863182 0.88271415 -0.237635732
X6 X7 X8 X9 X10
1 -0.9290009 0.5401628 0.39735433 -0.7434697 0.8807558
2 -0.2885927 0.3805251 -0.09051334 -0.7446170 0.9935311
3 -0.5022541 0.3009541 0.29424395 0.5554647 -0.5341800
4 0.7583608 -0.8506881 0.22758566 -0.1596993 -0.7161976
5 -0.3640422 0.8051613 -0.46714833 0.4318039 -0.8674060
6 -0.3577590 -0.7341207 0.85504668 -0.6933918 0.4656891
Notice that with the function ampute from the mice package, we introduce missing at random (MAR) values in X to highlight the ability of GRF/DRF to deal with missing values. Moreover, in the above process only X_1 and X_2 are relevant for predicting Y; all other variables are “noise” variables. Such a “sparse” setting might actually be common in real-life datasets.
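As a quick check, we can look at the fraction of missing values that ampute introduced in each column (everything here uses only base R functions on the X from above):

## Proportion of NAs per column of X
colMeans(is.na(X))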
We now select a test point for this example that we will use throughout:
x<-matrix(c(0.2, 0.4, runif(8,-1,1)), nrow=1, ncol=10)
print(x)

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 0.2 0.4 0.7061058 0.8364877 0.2284314 0.7971179 0.78581 0.5310279
[,9] [,10]
[1,] -0.5067102 0.6918785
Versatility
DRF estimates the conditional distribution of Y given X = x in the form of simple weights:

P_{Y|X=x} ≈ sum_{i=1}^n w_i(x) · δ_{Y_i},

that is, a weighted sum of point masses at the observed training responses Y_i, with weights w_i(x) that depend on the test point x.
From these weights, a wide range of targets can be calculated, or they can be used to simulate from the conditional distribution. A great reference for its versatility is the original research article, where a lot of examples were used, as well as the corresponding medium article.
In the example, we first simulate from this distribution:
DRF<-drfCI(X=X, Y=Y, B=50,num.trees=1000, min.node.size = 5)
DRFpred<-predictdrf(DRF, newdata=x)

## Sample from P_{Y|X=x}
Yxs<-Y[sample(1:n, size=n, replace = T, DRFpred$weights[1,])]
hist(Yxs, prob=T)
z<-seq(-6,7,by=0.01)
d<-dnorm(z, mean=0.8 * (x[1] > 0), sd=(1+(x[2] > 0)))
lines(z,d, col="darkred" )
The plot shows the approximate draws from the conditional distribution overlaid with the true density in red. We now use this to estimate the conditional expectation and the conditional (0.05, 0.95) quantiles at x:
# Calculate quantile prediction as weighted quantiles from Y
qx <- quantile(Yxs, probs = c(0.05,0.95))

# Calculate conditional mean prediction
mux <- mean(Yxs)
# True quantiles
q1<-qnorm(0.05, mean=0.8 * (x[1] > 0), sd=(1+(x[2] > 0)))
q2<-qnorm(0.95, mean=0.8 * (x[1] > 0), sd=(1+(x[2] > 0)))
mu<-0.8 * (x[1] > 0)
hist(Yxs, prob=T)
z<-seq(-6,7,by=0.01)
d<-dnorm(z, mean=0.8 * (x[1] > 0), sd=(1+(x[2] > 0)))
lines(z,d, col="darkred" )
abline(v=q1,col="darkred" )
abline(v=q2, col="darkred" )
abline(v=qx[1], col="darkblue")
abline(v=qx[2], col="darkblue")
abline(v=mu, col="darkred")
abline(v=mux, col="darkblue")
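As a side note, the same targets can also be computed directly from the weights, without the intermediate resampling step. A minimal sketch (reusing the DRFpred object from above; the small helper for weighted quantiles is my own addition for illustration):

# Conditional mean as a weighted average of the training responses
w <- DRFpred$weights[1,]
mux.direct <- sum(w * Y[,1])

# Weighted quantiles computed directly from the weights
weighted.quantile <- function(y, w, p) {
  o <- order(y)
  sapply(p, function(pp) y[o][which(cumsum(w[o]) >= pp)[1]])
}
qx.direct <- weighted.quantile(Y[,1], w, c(0.05, 0.95))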
Likewise, many targets can be calculated with GRF, only in this case a different forest would need to be fitted for each of these two targets. In particular, regression_forest() for the conditional expectation and quantile_forest() for the quantiles.
What GRF cannot do is deal with multivariate targets, which is also possible with DRF.
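As a side-by-side sketch of the GRF calls (my own illustration; I assume here that the installed grf version handles the NAs in X via the MIA criterion discussed below, otherwise one would impute first):

library(grf)

# One forest for the conditional mean ...
forest.mean <- regression_forest(X, Y[,1])
predict(forest.mean, x)$predictions

# ... and a separate forest for the conditional quantiles
forest.q <- quantile_forest(X, Y[,1], quantiles = c(0.05, 0.95))
predict(forest.q, x, quantiles = c(0.05, 0.95))$predictions

# For binary Y, probability_forest() would be the analogous choice

Each call grows its own forest, which is exactly the contrast to the single DRF fit above.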
Performance
Despite all the work on powerful new (nonparametric) methods, such as neural nets, tree-based methods consistently manage to beat competitors on tabular data. See, e.g., this fascinating paper, or this older paper on the strength of RFs in classification.
To be fair, with parameter tuning, boosted tree methods, such as XGBoost, often take the lead, at least when it comes to classical prediction (which corresponds to conditional expectation estimation). Nevertheless, the robust performance RF methods tend to have without any tuning is remarkable. Moreover, there has also been work on improving the performance of Random Forests, for instance, the hedged Random Forest approach.
Missing Values Incorporated
The “missing incorporated in attributes” (MIA) criterion from this paper is a very simple but powerful idea that allows tree-based methods to handle missing data. This was implemented in the GRF R package, and so it is also available in DRF. The details are also explained in this medium article. As simple as the idea is, it works remarkably well in practice: in the above example, DRF had no trouble handling substantial MAR missingness in the training data X (!)
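To give an idea of what MIA does, here is a toy sketch of the split evaluation (my own illustration with a simple CART-style variance criterion, not the actual grf/drf internals): for a candidate split of variable j at a threshold s, the observations with a missing value are sent to both children in turn, and the assignment with the better criterion is kept.

mia_split <- function(xj, y, s) {
  miss <- is.na(xj)
  # CART-style criterion: sum of within-child squared deviations (smaller is better)
  crit <- function(left) {
    right <- !left
    sum((y[left] - mean(y[left]))^2) + sum((y[right] - mean(y[right]))^2)
  }
  # Option A: missing observations go to the left child; Option B: to the right child
  left_A <- (!miss & xj <= s) | miss
  left_B <- (!miss & xj <= s)
  if (crit(left_A) <= crit(left_B)) list(left = left_A, missing_go = "left")
  else list(left = left_B, missing_go = "right")
}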
Uncertainty Measures
As a statistician, I don’t just want point estimates (even of a distribution), but also a measure of the estimation uncertainty of my parameters (even if the “parameter” is my whole distribution). It turns out that a simple additional subsampling step baked into DRF/GRF allows for principled uncertainty quantification for large sample sizes. The theory behind this in the case of DRF is derived in this research article, and I also explain it in this medium article. For GRF, all the theory is in the original paper.
We adapt this for the above example:
# Calculate uncertainty
alpha<-0.05
B<-length(DRFpred$weightsb)
qxb<-matrix(NaN, nrow=B, ncol=2)
muxb<-matrix(NaN, nrow=B, ncol=1)
for (b in 1:B){
Yxsb<-Y[sample(1:n, size=n, replace = T, DRFpred$weightsb[[b]][1,])]
qxb[b,] <- quantile(Yxsb, probs = c(0.05,0.95))
muxb[b] <- mean(Yxsb)
}
CI.lower.q1 <- qx[1] - qnorm(1-alpha/2)*sqrt(var(qxb[,1]))
CI.upper.q1 <- qx[1] + qnorm(1-alpha/2)*sqrt(var(qxb[,1]))

CI.lower.q2 <- qx[2] - qnorm(1-alpha/2)*sqrt(var(qxb[,2]))
CI.upper.q2 <- qx[2] + qnorm(1-alpha/2)*sqrt(var(qxb[,2]))
CI.lower.mu <- mux - qnorm(1-alpha/2)*sqrt(var(muxb))
CI.upper.mu <- mux + qnorm(1-alpha/2)*sqrt(var(muxb))
hist(Yxs, prob=T)
z<-seq(-6,7,by=0.01)
d<-dnorm(z, mean=0.8 * (x[1] > 0), sd=(1+(x[2] > 0)))
lines(z,d, col="darkred" )
abline(v=q1,col="darkred" )
abline(v=q2, col="darkred" )
abline(v=qx[1], col="darkblue")
abline(v=qx[2], col="darkblue")
abline(v=mu, col="darkred")
abline(v=mux, col="darkblue")
abline(v=CI.lower.q1, col="darkblue", lty=2)
abline(v=CI.upper.q1, col="darkblue", lty=2)
abline(v=CI.lower.q2, col="darkblue", lty=2)
abline(v=CI.upper.q2, col="darkblue", lty=2)
abline(v=CI.lower.mu, col="darkblue", lty=2)
abline(v=CI.upper.mu, col="darkblue", lty=2)
As can be seen from the above code, we essentially have B sub-forests fitted on half-samples, each of which can be used to recalculate the measure of interest. From these B samples of the mean and the quantiles, we can then calculate variances and use a normal approximation to obtain the (asymptotic) confidence intervals shown as dashed lines in the Figure. Again, all of this can be done despite the missing values in X (!)
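The same B sets of weights readily give uncertainty estimates for any other target as well. As a small additional sketch (reusing the objects DRFpred, Y and alpha from above), here is the conditional exceedance probability P(Y > 1 | X = x) with an asymptotic confidence interval:

# Point estimate: weighted fraction of training responses above 1
prob.x <- sum(DRFpred$weights[1,] * (Y[,1] > 1))

# The same functional recomputed on each of the B half-sample forests
prob.xb <- sapply(DRFpred$weightsb, function(w) sum(w[1,] * (Y[,1] > 1)))

# Normal approximation, as for the mean and the quantiles above
CI.lower.prob <- prob.x - qnorm(1-alpha/2)*sd(prob.xb)
CI.upper.prob <- prob.x + qnorm(1-alpha/2)*sd(prob.xb)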
Variable Importance
A final important aspect of Random Forests is efficiently calculated variable importance measures. While traditional measures are somewhat ad hoc, for the traditional RF and for DRF there are now principled measures available, as explained in this medium article. For RF, the Sobol-MDA method reliably identifies the variables most important for conditional expectation estimation, whereas for DRF, the MMD-MDA identifies the variables most important for the estimation of the distribution overall. As discussed in the article, using the idea of projected Random Forests, these measures can be implemented very efficiently. We demonstrate this in the example with a less efficient implementation of the MMD variable importance measure:
## Variable importance for conditional quantile estimation

## For the conditional quantiles we use a measure that considers the whole distribution,
## i.e. the MMD based measure of DRF.
MMDVimp <- compute_drf_vimp(X=X, Y=Y, silent=TRUE)
sort(MMDVimp, decreasing = T)
X2 X1 X8 X6 X3 X10
0.852954299 0.124110913 0.012194176 0.009578300 0.008191663 0.007517931
X9 X7 X5 X4
0.006861688 0.006632175 0.005257195 0.002401974
Here both X_1 and X_2 are correctly identified as the most relevant variables when trying to estimate the distribution of Y. Remarkably, despite the dependence between X_3 and X_1, the measure correctly quantifies that X_3 is not important for the prediction of the distribution of Y. This is something that the original MDA measure of Random Forests tends to get wrong, as demonstrated in the medium article. Moreover, notice again that the missing values in X are no problem here.
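For comparison, a rough sketch of the classical permutation importance (MDA) on this example could look as follows (my own addition; it assumes the ranger package is installed, and since ranger does not handle missing values natively, X is first imputed with mice, where the particular imputation choice is arbitrary):

library(ranger)

# Single imputation of the missing values, then classical permutation importance
Ximp <- complete(mice(X, m = 1, printFlag = FALSE))
rf <- ranger(x = Ximp, y = Y[,1], importance = "permutation")
sort(rf$variable.importance, decreasing = TRUE)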
Conclusion
GRF/DRF, and also the traditional Random Forest, should not be missing from the toolbox of any data scientist. While methods like XGBoost may have better performance in traditional prediction, the many strengths of modern RF-based approaches make them an incredibly versatile tool.
Of course, one should keep in mind that these methods are still fully nonparametric, and a lot of data points are needed for the fit to make sense. This is particularly true for the uncertainty quantification, which is only valid asymptotically, i.e., for “large” samples.
Literature
[1] Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.
Appendix: Code
require(drf)
require(Matrix)
require(kernlab)

drfCI <- function(X, Y, B, sampling = "binomial", ...) {
n <- dim(X)[1]
# compute point estimator and DRF per halfsample
# weightsb: B times n matrix of weights
DRFlist <- lapply(seq_len(B), function(b) {
# half-sample index
indexb <- if (sampling == "binomial") {
seq_len(n)[as.logical(rbinom(n, size = 1, prob = 0.5))]
} else {
sample(seq_len(n), floor(n / 2), replace = FALSE)
}
## Refit DRF on the half-sample of the data
DRFb <-
drfown(X = X[indexb, , drop = F], Y = Y[indexb, , drop = F], ...)
return(list(DRF = DRFb, indices = indexb))
})
return(list(DRFlist = DRFlist, X = X, Y = Y))
}
predictdrf <- function(DRF, newdata, functional = NULL, ...) {
##Predict for test points in newdata, with B sets of weights for each point,
##representing the uncertainty
ntest <- nrow(newdata)
n <- nrow(DRF$Y)
weightsb <- lapply(DRF$DRFlist, function(l) {
weightsbfinal <- Matrix(0, nrow = ntest, ncol = n, sparse = TRUE)
weightsbfinal[, l$indices] <- predict(l$DRF, newdata)$weights
return(weightsbfinal)
})
weightsall <- Reduce("+", weightsb) / length(weightsb)
if (!is.null(functional)) {
stopifnot("Not yet implemented for several x" = ntest == 1)
thetahatb <-
lapply(weightsb, function(w)
functional(weights = w, X = DRF$X, Y = DRF$Y, x = newdata))
thetahatbforvar <- do.call(rbind, thetahatb)
thetahat <- functional(weights = weightsall, X = DRF$X, Y = DRF$Y, x = newdata)
thetahat <- matrix(thetahat, nrow = dim(newdata)[1])
var_est <- if (dim(thetahat)[2] > 1) {
a <- sweep(thetahatbforvar, 2, thetahat, FUN = "-")
crossprod(a, a) / B
} else {
mean((c(thetahatbforvar) - c(thetahat)) ^ 2)
}
return(list(weights = weightsall, thetahat = thetahat, weightsb = weightsb,
var_est = var_est))
} else {
return(list(weights = weightsall, weightsb = weightsb))
}
}
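## A small usage sketch for the 'functional' argument of predictdrf (my own
## illustration, not used in the main text): any function of the weights can be
## supplied, and its variance over the B half-sample forests is returned.
# meanfun <- function(weights, X, Y, x) sum(weights[1, ] * Y[, 1])
# fit <- drfCI(X = X, Y = Y, B = 50, num.trees = 1000, min.node.size = 5)
# out <- predictdrf(fit, newdata = x, functional = meanfun)
# out$thetahat   # point estimate of the conditional mean
# out$var_est    # variance estimate over the B half-sample forests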
drfown <- function(X, Y,
num.trees = 500,
splitting.rule = "FourierMMD",
num.features = 10,
bandwidth = NULL,
response.scaling = TRUE,
node.scaling = FALSE,
sample.weights = NULL,
sample.fraction = 0.5,
mtry = min(ceiling(sqrt(ncol(X)) + 20), ncol(X)),
min.node.size = 15,
honesty = TRUE,
honesty.fraction = 0.5,
honesty.prune.leaves = TRUE,
alpha = 0.05,
imbalance.penalty = 0,
compute.oob.predictions = TRUE,
num.threads = NULL,
seed = stats::runif(1, 0, .Machine$integer.max),
compute.variable.importance = FALSE) {
# initial checks for X and Y
if (is.data.frame(X)) {
if (is.null(names(X))) {
stop("the regressor ought to be named if provided under data.frame format.")
}
if (any(apply(X, 2, class) %in% c("factor", "character"))) {
any.factor.or.character <- TRUE
X.mat <- as.matrix(fastDummies::dummy_cols(X, remove_selected_columns = TRUE))
} else {
any.factor.or.character <- FALSE
X.mat <- as.matrix(X)
}
mat.col.names.df <- names(X)
mat.col.names <- colnames(X.mat)
} else {
X.mat <- X
mat.col.names <- NULL
mat.col.names.df <- NULL
any.factor.or.character <- FALSE
}
if (is.data.frame(Y)) {
if (any(apply(Y, 2, class) %in% c("factor", "character"))) {
stop("Y should only contain numeric variables.")
}
Y <- as.matrix(Y)
}
if (is.vector(Y)) {
Y <- matrix(Y,ncol=1)
}
#validate_X(X.mat)
if (inherits(X, "Matrix") && !(inherits(X, "dgCMatrix"))) {
stop("Currently only sparse data of sophistication 'dgCMatrix' is supported.")
}
drf:::validate_sample_weights(sample.weights, X.mat)
#Y <- validate_observations(Y, X)
# set legacy GRF parameters
clusters <- vector(mode = "numeric", length = 0)
samples.per.cluster <- 0
equalize.cluster.weights <- FALSE
ci.group.size <- 1
num.threads <- drf:::validate_num_threads(num.threads)
all.tunable.params <- c("sample.fraction", "mtry", "min.node.size", "honesty.fraction",
"honesty.prune.leaves", "alpha", "imbalance.penalty")
# should we scale the response or not
if (response.scaling) {
Y.transformed <- scale(Y)
} else {
Y.transformed <- Y
}
data <- drf:::create_data_matrices(X.mat, outcome = Y.transformed, sample.weights = sample.weights)
# bandwidth using median heuristic by default
if (is.null(bandwidth)) {
bandwidth <- drf:::medianHeuristic(Y.transformed)
}
args <- list(num.trees = num.trees,
clusters = clusters,
samples.per.cluster = samples.per.cluster,
sample.fraction = sample.fraction,
mtry = mtry,
min.node.size = min.node.size,
honesty = honesty,
honesty.fraction = honesty.fraction,
honesty.prune.leaves = honesty.prune.leaves,
alpha = alpha,
imbalance.penalty = imbalance.penalty,
ci.group.size = ci.group.size,
compute.oob.predictions = compute.oob.predictions,
num.threads = num.threads,
seed = seed,
num_features = num.features,
bandwidth = bandwidth,
node_scaling = ifelse(node.scaling, 1, 0))
if (splitting.rule == "CART") {
##forest <- do.call(gini_train, c(data, args))
forest <- drf:::do.call.rcpp(drf:::gini_train, c(data, args))
##forest <- do.call(gini_train, c(data, args))
} else if (splitting.rule == "FourierMMD") {
forest <- drf:::do.call.rcpp(drf:::fourier_train, c(data, args))
} else {
stop("splitting rule not available.")
}
class(forest) <- c("drf")
forest[["ci.group.size"]] <- ci.group.size
forest[["X.orig"]] <- X.mat
forest[["is.df.X"]] <- is.data.frame(X)
forest[["Y.orig"]] <- Y
forest[["sample.weights"]] <- sample.weights
forest[["clusters"]] <- clusters
forest[["equalize.cluster.weights"]] <- equalize.cluster.weights
forest[["tunable.params"]] <- args[all.tunable.params]
forest[["mat.col.names"]] <- mat.col.names
forest[["mat.col.names.df"]] <- mat.col.names.df
forest[["any.factor.or.character"]] <- any.factor.or.character
if (compute.variable.importance) {
forest[['variable.importance']] <- variableImportance(forest, h = bandwidth)
}
forest
}
#' Variable importance for Distributional Random Forests
#'
#' @param X Matrix with input training data.
#' @param Y Matrix with output training data.
#' @param X_test Matrix with input testing data. If NULL, out-of-bag estimates are used.
#' @param num.trees Number of trees to fit the DRF. Default value is 500 trees.
#' @param silent If FALSE, print the iteration number for each variable, otherwise nothing is printed. Default is FALSE.
#'
#' @return The list of importance values for all input variables.
#' @export
#'
#' @examples
compute_drf_vimp <- function(X, Y, X_test = NULL, num.trees = 500, silent = FALSE){
# fit initial DRF
bandwidth_Y <- drf:::medianHeuristic(Y)
k_Y <- rbfdot(sigma = bandwidth_Y)
K <- kernelMatrix(k_Y, Y, Y)
DRF <- drfown(X, Y, num.trees = num.trees)
wall <- predict(DRF, X_test)$weights
# compute normalization constant
wbar <- colMeans(wall)
wall_wbar <- sweep(wall, 2, wbar, "-")
I0 <- as.numeric(sum(diag(wall_wbar %*% K %*% t(wall_wbar))))
# compute drf importance dropping variables one after the other
I <- sapply(1:ncol(X), function(j) {
if (!silent){print(paste0('Running importance for variable X', j, '...'))}
DRFj <- drfown(X = X[, -j, drop=F], Y = Y, num.trees = num.trees)
DRFpredj <- predict(DRFj, if (is.null(X_test)) NULL else X_test[, -j, drop = FALSE])
wj <- DRFpredj$weights
Ij <- sum(diag((wj - wall) %*% K %*% t(wj - wall)))/I0
return(Ij)
})
# compute retraining bias
DRF0 <- drfown(X = X, Y = Y, num.trees = num.trees)
DRFpred0 = predict(DRF0, X_test)
w0 <- DRFpred0$weights
vimp0 <- sum(diag((w0 - wall) %*% K %*% t(w0 - wall)))/I0
# compute final importance (remove bias & truncate negative values)
vimp <- sapply(I - vimp0, function(x){max(0,x)})
names(vimp)<-colnames(X)
return(vimp)
}