## Traditional Methods and Latest Developments

Random Forest and generalizations (particularly, Generalized Random Forests (GRF) and Distributional Random Forests (DRF) ) are powerful and easy-to-use machine learning methods that mustn’t be absent within the toolbox of any data scientist. They not only show robust performance over a wide range of datasets without the necessity for tuning, but may easily handle missing values, and even provide confidence intervals. In this text, we give attention to one other feature they’re able to provide: notions of feature importance. Particularly, we give attention to:

- Traditional Random Forest (RF), which is used to predict the conditional expectation of a variable
*Y*given p predictors.*X* - The Distributional Random Forest, which is used to predict the entire conditional distribution of a d-variate
given*Y**p*predictors.*X*

Unfortunately, like many modern machine learning methods, each forests lack interpretability. That’s, there are such a lot of operations involved, it seems not possible to find out what the functional relationship between the predictors and *Y* actually is. A standard strategy to tackle this problem is to define Variable Importance measures (VIMP), that no less than help resolve which predictors are necessary. Generally, this has two different objectives:

(1) finding a small variety of variables with maximal accuracy,

(2) detecting and rating all influential variables to give attention to for further exploration.

The difference between (1) and (2) matters as soon as there may be dependence between the weather in ** X** (so just about all the time). For instance, if two variables are highly correlated together and with

*Y*, one in every of the 2 inputs may be removed without hurting accuracy for objective (1), since each variables convey the identical information. Nevertheless, each must be included for objective (2), since these two variables can have different meanings in practice for domain experts.

Today we give attention to (1) and check out to seek out a smaller variety of predictors that display roughly the identical predictive accuracy. As an example, within the wage example below, we’re capable of reduce the variety of predictors from 79 to about 20, with only a small reduction in accuracy. These most significant predictors contain variables akin to age and education that are well-known to influence wages. There are also many great articles on medium about (2), using Shapley values akin to this one or this one. There’s also very recent and exciting academic literature on tips on how to efficiently calculate Shapley values with Random Forest. But that is material for a second article.

The 2 measures we take a look at today are literally more general variable importance measures that may be used for any method, based on the drop-and-relearn principle which we’ll take a look at below. We focus exclusively on tree-based methods here, nevertheless. Furthermore, we don’t go into great detail explaining the methods, but relatively attempt to give attention to their applications and why newer versions are preferable to the more traditional ones.

## The Beginnings

Variable importance measures for RFs are actually as old as RF itself. The primary accuracy the **Mean Decrease Accuracy (MDA)** was proposed by Breiman in his seminal Random Forest paper [1]. The concept is easy: For each dimension *j=1,…,p*, one compares the accuracy of the total prediction with the accuracy of the prediction when *X_j* is randomly permuted. The concept of that is to interrupt the connection between *X_j* and *Y* and compare the accuracy when *X_j* just isn’t helping to predict *Y* by design, to the case when it’s potentially of use.

There are numerous different versions of MDA implemented in R and Python:

Unfortunately, permuting variable *X_j* in this manner not only breaks its relationship to *Y*, but in addition to the opposite variables in ** X**. This just isn’t an issue if

*X_j*is independent from all other variables, however it becomes an issue once there may be dependence. Consequently, [3] is capable of show that as soon as there may be dependence in

**, the MDA converges to something nonsensical. Particularly, MDA can provide high importance to a variable**

*X**X_j*that just isn’t necessary to predict

*Y*, but is very correlated with one other variable, say

*X_l*, that is definitely necessary for predicting

*Y*(as demonstrated in the instance below). At the identical time, it will possibly fail to detect variables which can be actually relevant, as demonstrated by a protracted list of papers in [3, Section 2.1]. Intuitively, what we’d need to measure is the performance of the model if

*X_j*just isn’t included, and as an alternative, we measure the performance of a model with a permuted

*X_j*variable.

The second traditional accuracy measure is **Mean Decrease Impurity (MDI)**, which sums the weighted decreases of impurity over all nodes that split on a given covariate, averaged over all trees within the forest. Unfortunately, MDI is ill-defined from the beginning (it is not clear what it should measure) and several other papers highlight the sensible problem of this approach (e.g. [5]) As such, we won’t go into detail about MDI, as MDA is commonly the popular alternative.

## Modern Developments I: Sobol-MDA

For the longest time, I believed these somewhat informal measures were the very best we could do. One paper that modified that, got here out only very recently. On this paper, the authors exhibit theoretically that the favored measures above are literally quite flawed and don’t measure what we would like to measure. So the primary query is perhaps: What will we actually need to measure? One potential answer: The Sobol-index (originally proposed in the pc science literature):

Let’s unpack this. First, *tau(**X**)=E[ Y | **X**] *is the conditional expectation function we would love to estimate. It is a random variable since it is a function of the random ** X**. Now

*X**^{(-j)}*is the

*p-1*vector with covariate

*j*removed. Thus

*ST^{(j)}*is the reduction in output explained variance if the

*j*th output variable is removed.

The above is the more traditional way of writing the measure. Nevertheless, for me writing:

is way more intuitive. Here *d* is a distance between two random vectors and for the *ST^{(j)}* above, this distance is just the standard Euclidean distance. Thus the upper a part of *ST^{(j)}* is just measuring the typical squared distance between what we would like (*tau(**X**)*) and what we get without variable *j*. The latter is

The query becomes tips on how to estimate this efficiently. It seems that the intuitive drop-and-relearn principle could be enough: Simply estimating *tau(**X**) *using RF after which dropping *X_j* and refitting the RF to acquire an estimate of *tau(**X^{(-j)}**), *one obtains the consistent estimator*:*

where *tau_n(**X**_i)* is the RF estimate for a test point *X**_i* using all *p* predictors and similarly *tau_n(**X**_i^{(-j)})* is the refitted forest using only *p-1* predictors.

Nevertheless, this implies the forest must be refitted *p* times, not very efficient when *p* is large! As such the authors in [3] develop what they call the **Sobol-MDA**. As an alternative of refitting the forest every time, the forest is just fitted once. Then test points are dropped down the identical forest and the resulting prediction is “projected” to form the measure in (1). That’s, splits on *X_j* are simply ignored (remember the goal is to acquire an estimate without *X_j*). The authors are capable of show that calculating (1) above with this projected approach also leads to a consistent estimator! That is a stupendous idea indeed and renders the algorithm applicable even in high dimensions.

The tactic is implemented in R within the soboldMDA package, based on the very fast ranger package.

## Modern Developments II: MMD-based sensitivity index

Taking a look at the formulation using the space *d*, a natural query is to ask whether different distances might be used to get variable importance measures for harder problems. One such recent example is to make use of the MMD distance as *d:*

The MMD distance is a superb tool, that permits to quite easily construct a distance between distributions using a kernel *k* (akin to the Gaussian kernel):

For the moment I leave the small print to further articles. An important takeaway is just that *I^{(j)}* considers a more general goal than the conditional expectation. It recognizes a variable *X_j *as necessary, as soon because it influences the distribution of *Y* in any way. It is perhaps that *X_j* only changes the variance or the quantiles and leaves the conditional mean of *Y* untouched (see example below). On this case, the Sobol-MDA wouldn’t recognize *X_j* as necessary, however the MMD method would. This doesn’t necessarily make it higher, it is just a unique tool: If one is keen on predicting the conditional expectation, *ST^{(j)}* is the proper measure. Nevertheless, if one is keen on predicting other points of the distribution, especially quantiles, *I^{(j)}* could be higher suited. Again *I^{(j)}* may be consistently estimated using the drop-and-relearn principle (refitting DRF for *j=1,…,p* eacht time with variable $j$ removed), or the identical projection approach as for Sobol-MDA may be used. A drop-and-relearn-based implementation is attached at the tip of this text. We check with this method here as **MMD-MDA**.

## Simulated Data

We now illustrate these two modern measures on a straightforward simulated example: We first download and install the Sobol-MDA package from Gitlab after which load all of the packages mandatory for this instance:

`library(kernlab)`

library(drf)

library(Matrix)

library(DescTools)

library(mice)

library(sobolMDA)

source("compute_drf_vimp.R") ##Contents of this file may be found below

source("evaluation.R") ##Contents of this file may be found below

Then we simulate from this easy example: We take *X_1, X_2, X_4, …, X_10* independently uniform between (-1,1) and create dependence between *X_1* and *X_3* by taking *X_3=X_1 + uniform error*. Then we simulate *Y* as

`##Simulate Data that experiences each a mean in addition to sd shift`# Simulate from X

x1 <- runif(n,-1,1)

x2 <- runif(n,-1,1)

X0 <- matrix(runif(7*n,-1,1), nrow=n, ncol=7)

x3 <- x1+ runif(n,-1,1)

X <- cbind(x1,x2, x3, X0)

# Simulate dependent variable Y

Y <- as.matrix(rnorm(n,mean = 0.8*(x1 > 0), sd = 1 + 1*(x2 > 0)))

colnames(X)<-paste0("X", 1:10)

head(cbind(Y,X))

We then analyze the Sobol-MDA approach to estimate the conditional expectation of *Y* given ** X**:

`## Variable importance for conditional Expectation Estimation`XY <- as.data.frame(cbind(Xfull, Y))

colnames(XY) <- c(paste('X', 1:(ncol(XY)-1), sep=''), 'Y')

num.trees <- 500

forest <- sobolMDA::ranger(Y ~., data = XY, num.trees = num.trees, importance = 'sobolMDA')

sobolMDA <- forest$variable.importance

names(sobolMDA) <- colnames(X)

sort(sobolMDA, decreasing = T)

X1 X8 X7 X6 X5 X9

0.062220958 0.021946135 0.016818860 0.016777223 -0.001290326 -0.001540919

X3 X10 X4 X2

-0.001578540 -0.007400854 -0.008299478 -0.020334150

As may be seen, it appropriately identifies that *X_1* is a very powerful variable, while the others are ranked equally (un)necessary. This is sensible since the conditional expectation of *Y* is just modified by *X_1*. Crucially, the measure manages to do that despite the dependence between *X_1 *and *X_3*. Thus we successfully pursued goal (1), as explained above, in this instance. However, we may have a take a look at the normal MDA:

`forest <- sobolMDA::ranger(Y ~., data = XY, num.trees = num.trees, importance = 'permutation')`

MDA <- forest$variable.importance

names(MDA) <- colnames(X)sort(MDA, decreasing = T)

X1 X3 X6 X7 X8 X2

0.464516976 0.118147061 0.063969310 0.032741521 0.029004312 -0.004494380

X4 X9 X10 X5

-0.009977733 -0.011030996 -0.014281844 -0.018062544

On this case, while it appropriately identifies *X_1 *as a very powerful variable, it also places *X_3 *in second place, with a worth that seems quite a bit higher than the remaining variables. This despite the actual fact, that *X_3 *is just as unimportant as *X_2, X_4,…, X_10*!

But what if we’re keen on predicting the distribution of *Y* more generally, say for estimating quantiles? On this case, we’d like a measure that’s capable of recognize the influence of *X_2* on the conditional variance of *Y*. Here the MMD variable importance measure comes into play:

`MMDVimp <- compute_drf_vimp(X=X,Y=Y)`

sort(MMDVimp, decreasing = T)X2 X1 X10 X6 X8 X3

0.683315006 0.318517259 0.014066410 0.009904518 0.006859128 0.005529749

X7 X9 X4 X5

0.003476256 0.003290550 0.002417677 0.002036174

Again the measure is capable of appropriately discover what matters: *X_1* and *X_2* are the 2 most significant variables. And again, it does this despite the dependence between *X_1* and *X_3*. Interestingly, it also gives the variance shift from *X_2* a better importance than the expectation shift from *X_1*.

## Real Data

Finally, I present an actual data application to exhibit the variable importance measure. Note that with DRF, we could look even at multivariate ** Y** but to maintain things more easy, we give attention to a univariate setting and consider the US wage data from the 2018 American Community Survey by the US Census Bureau. In the primary DRF paper, we obtained data on roughly 1 million full-time employees from the 2018 American Community Survey by the US Census Bureau from which we extracted the salary information and all covariates that is perhaps relevant for salaries. This wealth of information is good to experiment with a way like DRF (actually we’ll only use a tiny subset for this evaluation). The information we load may be found here.

`# Load data (https://github.com/lorismichel/drf/blob/master/applications/wage_data/data/datasets/wage_benchmark.Rdata)`

load("wage_benchmark.Rdata")##Define the training data

n<-1000

Xtrain<-X[1:n,]

Ytrain<-Y[1:n,]

Xtrain<-cbind(Xtrain,Ytrain[,"male"])

colnames(Xtrain)[ncol(Xtrain)]<-"male"

Ytrain<-Ytrain[,1, drop=F]

##Define the test data

ntest<-2000

Xtest<-X[(n+1):(n+ntest),]

Ytest<-Y[(n+1):(n+ntest),]

Xtest<-cbind(Xtest,Ytest[,"male"])

colnames(Xtest)[ncol(Xtest)]<-"male"

Ytest<-Ytest[,1, drop=F]

We now calculate each variable importance measures (this can take some time as only the drop-and-relearn method is implemented for DRF):

`# Calculate variable importance for each measures`

# 1. Sobol-MDA

XY <- as.data.frame(cbind(Xtrain, Ytrain))

colnames(XY) <- c(paste('X', 1:(ncol(XY)-1), sep=''), 'Y')

num.trees <- 500

forest <- sobolMDA::ranger(Y ~., data = XY, num.trees = num.trees, importance = 'sobolMDA')

SobolMDA <- forest$variable.importance

names(SobolMDA) <- colnames(Xtrain)# 2. MMD-MDA

MMDVimp <- compute_drf_vimp(X=Xtrain,Y=Ytrain,silent=T)

print("Top 10 most significant variables for conditional Expectation estimation")

sort(SobolMDA, decreasing = T)[1:10]

print("Top 5 most significant variables for conditional Distribution estimation")

sort(MMDVimp, decreasing = T)[1:10]

`Sobol-MDA:`education_level age male

0.073506769 0.027079349 0.013722756

occupation_11 occupation_43 industry_54

0.013550320 0.010025332 0.007744589

industry_44 occupation_23 occupation_15

0.006657918 0.005772662 0.004610835

marital_never married

0.004545964

`MMD-MDA:`education_level age male

0.420316085 0.109212519 0.027356393

occupation_43 occupation_11 marital_never married

0.016861954 0.014122583 0.003449910

occupation_29 marital_married industry_81

0.002272629 0.002085207 0.001152210

industry_72

0.000984725

On this case, the 2 variable importance measures agree quite a bit on which variables are necessary. While this just isn’t a causal evaluation, additionally it is nice that variables which can be known to be necessary to predict wages, specifically “age”, “education_level” and “gender”, are indeed seen as very necessary by the 2 measures.

To acquire a small set of predictive variables, one could now for *j=1,…p-1,*

(I) Remove the least necessary variable

(II) Calculate the loss (e.g. mean squared error) on a test set

(III) Recalculate the variable importance for the remaining variable

(IV) Repeat until a certain stopping criterion is met

One could stop, for example, if the loss increased by greater than 5%. To make my life easier in this text, I just use the identical variable importance values saved in “SobolMDA” and “MMDVimp” above. That’s, I ignore step (III) and only consider (I), (II) and (IV). When the goal of estimation is the total conditional distribution, step (II) can be not entirely clear. We use what we check with as MMD loss, described in additional detail in our paper ([4]). This loss considers the error we’re making within the prediction of the distribution. For the conditional mean, we simply use the mean-squared error. This is completed within the function “evalall” found below:

`# Remove variables one-by-one accoring to the importance values saved in SobolMDA`

# and MMDVimp.

evallistSobol<-evalall(SobolMDA, X=Xtrain ,Y=Ytrain ,Xtest, Ytest, metrics=c("MSE"), num.trees )

evallistMMD<-evalall(MMDVimp, X=Xtrain ,Y=Ytrain ,Xtest, Ytest, metrics=c("MMD"), num.trees )plot(evallistSobol$evalMSE, type="l", lwd=2, cex=0.8, col="darkgreen", essential="MSE loss" , xlab="Variety of Variables removed", ylab="Values")

plot(evallistMMD$evalMMD, type="l", lwd=2, cex=0.8, col="darkgreen", essential="MMD loss" , xlab="Variety of Variables removed", ylab="Values")

This leads to the next two pictures:

Notice that each have somewhat wiggly lines, which is first as a consequence of the indisputable fact that I didn’t recalculate the importance measure, e.g., neglected step (III), and second as a consequence of the randomness of the forests. Apart from this, the graphs nicely show how the errors successively increase with each variable that’s removed. This increase is first slow for the least necessary variables after which gets quicker for a very powerful ones, exactly as one would expect. Particularly, the loss in each cases stays virtually unchanged if one removes the 50 least necessary variables! In truth, one could remove about 70 variables in each cases without increasing the loss by greater than 6%. One has to notice though that many predictors are a part of one-hot encoded categorical variables and thus one must be somewhat careful when removing predictors, as they correspond to levels of 1 categorical variable. Nevertheless, in an actual application, this might still be desirable.

## Conclusion

In this text, we checked out modern approaches to variable importance in Random Forests, with the goal of obtaining a small set of predictors or covariates, each with respect to the conditional expectation and for the conditional distribution more generally. We’ve seen within the wage data example, that this may result in a considerable reduction in predictors with virtually the identical accuracy.

As noted above the measures presented aren’t strictly constrained to Random Forest, but may be used more generally in principle. Nevertheless, forests allow for the elegant projection approach that permits for the calculation of the importance measure for all variables *j*, without having to refit the forest every time (!) That is described in each [3] and [4].

## Literature

[1] Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.

[2] Breiman, L. (2003a). Organising, using, and understanding random forests v3.1. Technical report, UC Berkeley, Department of Statistics

[3] Bénard, C., Da Veiga, S., and Scornet, E. (2022). Mean decrease accuracy for random forests: inconsistency, and a practical solution via the Sobol-MDA. Biometrika, 109(4):881–900.

[4] Clément Bénard, Jeffrey Näf, and Julie Josse. MMD-based variable importance for distributional random forest, 2023.

[5] Strobl, C., Boulesteix, A.-L., Zeileis, A., and Hothorn, T. (2007). Bias in random forest variable importance measures: illustrations, sources and an answer. BMC Bioinformatics, 8:25.

## Appendix : Code

#### Contents of compute_drf_vimp.R #######' Variable importance for Distributional Random Forests

#'

#' @param X Matrix with input training data.

#' @param Y Matrix with output training data.

#' @param X_test Matrix with input testing data. If NULL, out-of-bag estimates are used.

#' @param num.trees Variety of trees to suit DRF. Default value is 500 trees.

#' @param silent If FALSE, print variable iteration number, otherwise nothing is print. Default is FALSE.

#'

#' @return The list of importance values for all input variables.

#' @export

#'

#' @examples

compute_drf_vimp <- function(X, Y, X_test = NULL, num.trees = 500, silent = FALSE){

# fit initial DRF

bandwidth_Y <- drf:::medianHeuristic(Y)

k_Y <- rbfdot(sigma = bandwidth_Y)

K <- kernelMatrix(k_Y, Y, Y)

DRF <- drf(X, Y, num.trees = num.trees)

wall <- predict(DRF, X_test)$weights

# compute normalization constant

wbar <- colMeans(wall)

wall_wbar <- sweep(wall, 2, wbar, "-")

I0 <- as.numeric(sum(diag(wall_wbar %*% K %*% t(wall_wbar))))

# compute drf importance dropping variables one after the other

I <- sapply(1:ncol(X), function(j) {

if (!silent){print(paste0('Running importance for variable X', j, '...'))}

DRFj <- drf(X = X[, -j, drop=F], Y = Y, num.trees = num.trees)

DRFpredj <- predict(DRFj, X_test[, -j])

wj <- DRFpredj$weights

Ij <- sum(diag((wj - wall) %*% K %*% t(wj - wall)))/I0

return(Ij)

})

# compute retraining bias

DRF0 <- drf(X = X, Y = Y, num.trees = num.trees)

DRFpred0 = predict(DRF0, X_test)

w0 <- DRFpred0$weights

vimp0 <- sum(diag((w0 - wall) %*% K %*% t(w0 - wall)))/I0

# compute final importance (remove bias & truncate negative values)

vimp <- sapply(I - vimp0, function(x){max(0,x)})

names(vimp)<-colnames(X)

return(vimp)

}

`#### Contents of evaluation.R ######`compute_mmd_loss <- function(Y_train, Y_test, weights){

# Y_train <- scale(Y_train)

# Y_test <- scale(Y_test)

bandwidth_Y <- (1/drf:::medianHeuristic(Y_train))^2

k_Y <- rbfdot(sigma = bandwidth_Y)

K_train <- matrix(kernelMatrix(k_Y, Y_train, Y_train), ncol = nrow(Y_train))

K_cross <- matrix(kernelMatrix(k_Y, Y_test, Y_train), ncol = nrow(Y_train))

weights <- matrix(weights, ncol = ncol(weights))

t1 <- diag(weights%*%K_train%*%t(weights))

t2 <- diag(K_cross%*%t(weights))

mmd_loss <- mean(t1) - 2*mean(t2)

mmd_loss

}

evalall <- function(Vimp, X ,Y ,Xtest, Ytest, metrics=c("MMD","MSE"), num.trees ){

if (ncol(Ytest) > 1 & "MSE" %in% metrics){

metrics <- metrics[!( metrics %in% "MSE") ]

}

# Sort for increasing importance, such that the least necessary variables are removed first

Vimp<-sort(Vimp)

if ( is.null(names(Vimp)) ){

stop("Need names for later")

}

evalMMD<-matrix(0, nrow=ncol(X))

evalMSE<-matrix(0, nrow=ncol(X))

###Idea: Create a function that takes a variable importance measure and does this loop!!

for (j in 1:ncol(X)){

if (j==1){

if ("MMD" %in% metrics){

DRFred<- drf(X=X,Y=Y)

weights<- predict(DRFred, newdata=Xtest)$weights

evalMMD[j]<-compute_mmd_loss(Y_train=Y, Y_test=Ytest, weights)

}

if ("MSE" %in% metrics){

XY <- as.data.frame(cbind(X, Y))

colnames(XY) <- c(paste('X', 1:(ncol(XY)-1), sep=''), 'Y')

RFfull <- sobolMDA::ranger(Y ~., data = XY, num.trees = num.trees)

XtestRF<-Xtest

colnames(XtestRF) <- paste('X', 1:ncol(XtestRF), sep='')

predRF<-predict(RFfull, data=XtestRF)

evalMSE[j] <- mean((Ytest - predRF$predictions)^2)

}

}else{

if ("MMD" %in% metrics){

DRFred<- drf(X=X[,!(colnames(X) %in% names(Vimp[1:(j-1)])), drop=F],Y=Y)

weights<- predict(DRFred, newdata=Xtest[,!(colnames(Xtest) %in% names(Vimp[1:(j-1)])), drop=F])$weights

evalMMD[j]<-compute_mmd_loss(Y_train=Y, Y_test=Ytest, weights)

}

if ("MSE" %in% metrics){

XY <- as.data.frame(cbind(X[,!(colnames(X) %in% names(Vimp[1:(j-1)])), drop=F], Y))

colnames(XY) <- c(paste('X', 1:(ncol(XY)-1), sep=''), 'Y')

RFfull <- sobolMDA::ranger(Y ~., data = XY, num.trees = num.trees)

XtestRF<-Xtest[,!(colnames(Xtest) %in% names(Vimp[1:(j-1)])), drop=F]

colnames(XtestRF) <- paste('X', 1:ncol(XtestRF), sep='')

predRF<-predict(RFfull, data=XtestRF)

evalMSE[j] <- mean((Ytest - predRF$predictions)^2)

# DRFall <- drf(X=X[,!(colnames(X) %in% names(Vimp[1:(j-1)])), drop=F], Y=Y, num.trees=num.trees)

# quantpredictall<-predict(DRFall, newdata=Xtest[,!(colnames(Xtest) %in% names(Vimp[1:(j-1)])), drop=F], functional="quantile",quantiles=c(0.5))

# evalMAD[j] <- mean(sapply(1:nrow(Xtest), function(j) abs(Ytest[j] - quantpredictall$quantile[,,"q=0.5"][j]) ))

}

}

}

return(list(Vimp=Vimp, evalMMD=evalMMD, evalMSE=evalMSE ))

}