# Variable importance with EvoTrees.jl

# Introduction

Variable importance in a predictive modeling context can have several meanings. Today I want to investigate two of them:

- How important is a feature with respect to a given machine learning model?
- How important is a feature with respect to a given learning algorithm and a given dataset?

In this post we will discuss why and how these two concepts can differ, using boosted trees as an example. We will use their implementation available in the EvoTrees.jl package.

The post was written under Julia 1.9.2, DataFrames.jl 1.9.2, Distributions.jl 0.25.98, and EvoTrees.jl 0.15.2.

# Generating the test data

Let us start with loading the packages we are going to use and generating the test data.

I want a data set with 10,000 observations. It will have 9 continuous features, denoted `x1` to `x9`, and a continuous target variable `y`. Our goal is to have triplets of features (`x1` to `x3`, `x4` to `x6`, `x7` to `x9`) that are highly correlated within each triplet (but independent between groups) and that have a different correlation level with the `y` target. I have made the within-group variables highly correlated but non-identical (so they are distinguishable and have a slightly different correlation with `y` in the sampled data set).

The code generating such data is as follows:

```
using DataFrames
using Distributions
using EvoTrees
using LinearAlgebra
using Random

# small perturbation making within-group correlations close to, but not exactly, 1
δ = 1.0e-6
b = fill(1.0 - δ, 3, 3) + δ * I  # within-group covariance block
z = zeros(3, 3)                  # cross-group covariance block (groups are independent)
y = fill(0.5, 3)                 # base covariance of a feature group with the target

# covariance matrix of the 9 features and the target;
# the three feature groups have covariance 0.4, 0.5, and 0.6 with y
dist = MvNormal([b      z      z      0.8*y
                 z      b      z      y
                 z      z      b      1.2*y
                 0.8*y' y'     1.2*y' 1.0])

Random.seed!(1)
mat = rand(dist, 10_000);
df = DataFrame(transpose(mat), [string.("x", 1:9); "y"]);
```

Let us have a peek at the data:

```
julia> names(df)
10-element Vector{String}:
 "x1"
 "x2"
 "x3"
 "x4"
 "x5"
 "x6"
 "x7"
 "x8"
 "x9"
 "y"

julia> df
10000×10 DataFrame
   Row │ x1           x2           x3           x4           x5           x6           ⋯
       │ Float64      Float64      Float64      Float64      Float64      Float64      ⋯
───────┼──────────────────────────────────────────────────────────────────────────────
     1 │   0.0619327    0.0623264    0.0613998    0.0466594    0.0481949    0.0454962  ⋯
     2 │    0.217879     0.216951     0.217742   0.00738607   0.00888615   0.00626926
     3 │     1.54641      1.54598      1.54328     -1.00261      -1.0013    -0.999863
     4 │    0.208777     0.207593     0.209145     -1.21253     -1.21556     -1.21462
     5 │   -0.458805    -0.458081    -0.457956     0.103491      0.10537     0.103313  ⋯
     6 │    0.392072     0.390938     0.390447    -0.354123    -0.354995    -0.353026
     7 │   -0.313095    -0.310223    -0.311185      1.09256      1.09373      1.09443
   ⋮   │      ⋮            ⋮            ⋮            ⋮            ⋮            ⋮       ⋱
  9994 │    -1.24411     -1.24363     -1.24439    -0.789893    -0.793004    -0.792177
  9995 │    0.199036     0.199199     0.199344     0.945945     0.945308     0.943717  ⋯
  9996 │     1.81075      1.80926      1.81064     -2.53813     -2.53805     -2.53996
  9997 │ -0.00896532   -0.0079907  -0.00876527    -0.629303    -0.629402    -0.630129
  9998 │    -1.62881     -1.62626     -1.62703    -0.222873    -0.222469     -0.22166
  9999 │     1.45152      1.44833      1.45131       -2.543     -2.54377       -2.544  ⋯
 10000 │    0.436075     0.435492     0.436974     -0.28131    -0.281519    -0.283039
                                                      4 columns and 9986 rows omitted
```

Indeed, we see that there are 9 features and one target variable. We can also see visually that the variables `x1`, `x2`, and `x3` are almost the same but not identical. The same holds for `x4`, `x5`, and `x6`. (I have cropped the rest of the printout as it was too wide for the post.)
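
If you want to double-check the correlation structure of the sampled data numerically, a quick sketch using the Statistics standard library could look like this (I do not show its output here):

```
using Statistics

# correlation of each feature with the target;
# by design it should be roughly 0.4 for x1-x3, 0.5 for x4-x6, and 0.6 for x7-x9
[cor(df[!, "x$i"], df.y) for i in 1:9]

# within-group correlations should be very close to (but not exactly) 1.0
cor(df.x1, df.x2), cor(df.x4, df.x5), cor(df.x7, df.x8)
```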

I chose such a data generation scheme because *a priori*, that is, with respect to a given machine learning algorithm and a given dataset, the importance of the features is as follows:

- `x1`, `x2`, and `x3` should have very similar importance, the lowest of the three groups (their correlation with `y` is lowest by design);
- `x4`, `x5`, and `x6` should have very similar, medium importance;
- `x7`, `x8`, and `x9` should have very similar importance, the highest of the three groups (their correlation with `y` is highest by design).

However, if we build a specific boosted tree model, can we expect the same relationship? Let us check.

# Variable importance with respect to a specific model

We build a boosted tree model (using the default settings) and evaluate the variable importance of its features:

```
julia> model = fit_evotree(EvoTreeRegressor(),
                           df;
                           target_name="y",
                           verbosity=0);

julia> EvoTrees.importance(model)
9-element Vector{Pair{String, Float64}}:
 "x9" => 0.33002820995126636
 "x4" => 0.17950260124468856
 "x5" => 0.10630471720405912
 "x7" => 0.1002898622306779
 "x1" => 0.09023808819243322
 "x8" => 0.060680998291169054
 "x3" => 0.04789330560493748
 "x6" => 0.044689013127277216
 "x2" => 0.040373204153491105
```

We see that the `x9` feature has the highest importance, but it is quite different from the importance of `x8` and `x7`. The `x4` feature, although it has a lower correlation with `y` than e.g. the `x8` feature, has a higher variable importance (the same holds for `x1` vs `x8`).

What is the reason for this situation? It seems that when the boosted tree model was built, the `x9` variable captured most of the value of explaining `y` from the `x7` and `x8` variables, as they are very similar. Therefore, in this specific model, `x7` and `x8` are not that important.
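
One way to probe this explanation (a sketch I have not run above, so I do not show its output) is to drop `x9` from the data and refit the model; with `x9` gone we would expect `x7` and `x8` to take over most of its importance:

```
# hypothetical check: refit without x9 and inspect how importance redistributes
model_no_x9 = fit_evotree(EvoTreeRegressor(),
                          select(df, Not("x9"));
                          target_name="y",
                          verbosity=0)
EvoTrees.importance(model_no_x9)
```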

Let us try estimating the model a second time to see if we notice any difference:

```
julia> model = fit_evotree(EvoTreeRegressor(),
                           df;
                           target_name="y",
                           verbosity=0);

julia> EvoTrees.importance(model)
9-element Vector{Pair{String, Float64}}:
 "x9" => 0.33002820995126636
 "x4" => 0.17950260124468856
 "x5" => 0.10630471720405912
 "x7" => 0.1002898622306779
 "x1" => 0.09023808819243322
 "x8" => 0.060680998291169054
 "x3" => 0.04789330560493748
 "x6" => 0.044689013127277216
 "x2" => 0.040373204153491105
```

The results are identical. You might wonder what the reason for this is. The cause is that the `fit_evotree` function uses a default seed when doing its computations, so we get the same model twice. To be precise, when we call `EvoTreeRegressor()` it sets the seed of the default random number generator in the current task to `123`.
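
If you want to convince yourself of this reseeding behavior, here is a small sketch (it assumes the behavior described above, which may change in future EvoTrees.jl versions):

```
# if EvoTreeRegressor() reseeds the task-local default RNG to 123, a draw taken
# right after the constructor matches a draw taken right after Random.seed!(123)
EvoTreeRegressor()
a = rand()
Random.seed!(123)
b = rand()
a == b  # expected to be true under the behavior described above
```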

So let us try shuffling the variables to see if we get a different result:

```
julia> model = fit_evotree(EvoTreeRegressor(),
                           df[!, randperm(10)];
                           target_name="y",
                           verbosity=0);

julia> EvoTrees.importance(model)
9-element Vector{Pair{String, Float64}}:
 "x9" => 0.23187113718728977
 "x8" => 0.20285271199278873
 "x4" => 0.16779901582722756
 "x5" => 0.15415181545562057
 "x3" => 0.08494533205347177
 "x2" => 0.06781415123236784
 "x7" => 0.05788796227269619
 "x1" => 0.024214826049282448
 "x6" => 0.008463047929255035
```

Indeed, the variable importance changed. Would it still be different if we did another randomized run?

```
julia> model = fit_evotree(EvoTreeRegressor(),
                           df[!, randperm(10)];
                           target_name="y",
                           verbosity=0);

julia> EvoTrees.importance(model)
9-element Vector{Pair{String, Float64}}:
 "x9" => 0.23187113718728977
 "x8" => 0.20285271199278873
 "x4" => 0.16779901582722756
 "x5" => 0.15415181545562057
 "x3" => 0.08494533205347177
 "x2" => 0.06781415123236784
 "x7" => 0.05788796227269619
 "x1" => 0.024214826049282448
 "x6" => 0.008463047929255035
```

This might come as a surprise, but the answer is no: we get the same results. What is going on?

This time the answer is that the `randperm` and `fit_evotree` functions share the same random number generator (as we run them in the same task), and its state is reset to `123` every time we call `EvoTreeRegressor()`. This means that the generator was in the same state on both invocations of `randperm`, so the `randperm(10)` call produced the same permutation both times.

# Variable importance with respect to an algorithm and a data set

This gives us an important general lesson: we need to properly initialize the random number generators used by any functions in our code that rely on randomness. Let us leverage this knowledge to try to assess variable importance with respect to the boosted trees algorithm and the data set we have generated.

In the code below I fit 1000 boosted tree models and compute the mean variable importance (along with some additional statistics) of each feature across all of them:

```
julia> rng = Xoshiro(1);

julia> reduce(vcat,
              map(1:1000) do _
                  fit_evotree(EvoTreeRegressor(rng=rng),
                              df;
                              target_name="y",
                              verbosity=0) |>
                  EvoTrees.importance |>
                  sort |>
                  DataFrame
              end) |> describe
9×7 DataFrame
 Row │ variable  mean       min         median     max        nmissing  eltype
     │ Symbol    Float64    Float64     Float64    Float64    Int64     DataType
─────┼─────────────────────────────────────────────────────────────────────────
   1 │ x1         0.094902   0.0456523  0.0946136     0.1437         0  Float64
   2 │ x2        0.0475577   0.0109109  0.0469086  0.0969166         0  Float64
   3 │ x3        0.0361729  0.00189622  0.0353323  0.0844639         0  Float64
   4 │ x4         0.128568    0.032614   0.127093   0.285664         0  Float64
   5 │ x5         0.111577  0.00310871   0.109548   0.238868         0  Float64
   6 │ x6        0.0891413         0.0  0.0879858   0.196108         0  Float64
   7 │ x7         0.189893   0.0400903   0.179193   0.417476         0  Float64
   8 │ x8         0.155352    0.013377   0.154884   0.371452         0  Float64
   9 │ x9         0.146837  0.00642872   0.141293   0.397151         0  Float64
```

This time we see a better separation between the groups: variables `x1`, `x2`, and `x3` come last, followed by `x4`, `x5`, and `x6`, and finally the `x7`, `x8`, and `x9` group. However, we still see some non-negligible within-group differences (and `x1` even scores better than `x6`). It seems that just making sure we properly pass a random number generator to the `fit_evotree` model is not enough.

Let us then make a final attempt. This time we both properly pass the random number generator to the `fit_evotree` model and randomize the order of the variables in the source data frame (making sure we also properly pass the random number generator to the `randperm` function):

```
julia> reduce(vcat,
              map(1:1000) do _
                  fit_evotree(EvoTreeRegressor(rng=rng),
                              df[!, randperm(rng, 10)];
                              target_name="y",
                              verbosity=0) |>
                  EvoTrees.importance |>
                  sort |>
                  DataFrame
              end) |> describe
9×7 DataFrame
 Row │ variable  mean       min         median     max        nmissing  eltype
     │ Symbol    Float64    Float64     Float64    Float64    Int64     DataType
─────┼─────────────────────────────────────────────────────────────────────────
   1 │ x1        0.0595918  0.00260002  0.0524585   0.153613         0  Float64
   2 │ x2        0.0604078  0.00374867  0.0534607   0.148399         0  Float64
   3 │ x3        0.0586352  0.00219417   0.051398   0.140393         0  Float64
   4 │ x4         0.103219  0.00484802   0.101625   0.252127         0  Float64
   5 │ x5         0.119415  0.00146451   0.116338   0.268475         0  Float64
   6 │ x6         0.106432  0.00672514   0.102708   0.254382         0  Float64
   7 │ x7         0.153185  0.00346922   0.139289   0.388995         0  Float64
   8 │ x8         0.166709  0.00651302   0.161872   0.419284         0  Float64
   9 │ x9         0.172406  0.00979053   0.166798   0.444192         0  Float64
```

This time we have succeeded. While there is still some variability in the within-group variable importance, it is small, and the groups are clearly separated. The worst features have an importance of around 6%, the medium-value features around 11%, and the best features around 16%.

# Conclusions

Let us recap what we have seen today.

First, we have seen that variable importance with respect to a concrete instance of a machine learning model can differ significantly from variable importance with respect to a given learning algorithm and a given dataset. Therefore, one should think carefully about which of them is interesting from the perspective of the problem one wants to solve.

The second lesson is related to implementation details of machine learning algorithms:

- Many of them use pseudorandom numbers when building a model, and proper handling of the pseudorandom number generator is crucial.
- The result of model building can depend on the order of features in the source data set. In our example we have seen that shuffling the columns of the input data table produced significantly different variable importance results (the sketch after this list recaps the pattern that addresses both issues).
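
To recap the pattern in one place, the final experiment above boils down to creating a single explicit random number generator and passing it both to the model constructor and to the column shuffling:

```
# the pattern used in the last experiment of this post
rng = Xoshiro(1)
model = fit_evotree(EvoTreeRegressor(rng=rng),
                    df[!, randperm(rng, 10)];
                    target_name="y",
                    verbosity=0)
```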

I hope you found these examples useful!