# Variable importance with EvoTrees.jl

# Introduction

Variable importance in a predictive modeling context can have several meanings. Today I want to investigate two of them:

- How important is a feature with respect to a given machine learning model?
- How important is a feature with respect to a given learning algorithm and a given dataset?

In this post we will discuss why and how these two concepts can differ, using boosted trees as an example. We will use their implementation available in the EvoTrees.jl package.

The post was written under Julia 1.9.2, DataFrames.jl 1.9.2, Distributions.jl 0.25.98, and EvoTrees.jl 0.15.2.

# Generating the test data

Let us start with loading the packages we are going to use and generating the test data.

I want a data set with 10,000 observations. It will have 9 continuous features, denoted `x1` to `x9`, and a continuous target variable `y`. Our goal is to have triplets of features (`x1` to `x3`, `x4` to `x6`, `x7` to `x9`) that are highly correlated within each triplet (but independent between groups) and that have a different correlation level with the `y` target. I have made the within-group variables highly correlated but non-identical (so they are distinguishable and have a slightly different correlation with `y` in the sampled data set).

The code generating such data is as follows:

```
using DataFrames
using Distributions
using EvoTrees
using LinearAlgebra
using Random

# small perturbation making within-group correlations close to, but not exactly, 1
δ = 1.0e-6
b = fill(1.0 - δ, 3, 3) + δ * I  # within-group covariance block
z = zeros(3, 3)                  # cross-group covariance block (groups are independent)
y = fill(0.5, 3)                 # base covariance of a feature group with the target

# covariance matrix of the 9 features and the target;
# the three feature groups have covariance 0.4, 0.5, and 0.6 with y
dist = MvNormal([b      z      z      0.8*y
                 z      b      z      y
                 z      z      b      1.2*y
                 0.8*y' y'     1.2*y' 1.0])

Random.seed!(1)
mat = rand(dist, 10_000);
df = DataFrame(transpose(mat), [string.("x", 1:9); "y"]);
```

Let us have a peek at the data:

```
julia> names(df)
10-element Vector{String}:
 "x1"
 "x2"
 "x3"
 "x4"
 "x5"
 "x6"
 "x7"
 "x8"
 "x9"
 "y"

julia> df
10000×10 DataFrame
   Row │ x1           x2           x3           x4           x5           x6           ⋯
       │ Float64      Float64      Float64      Float64      Float64      Float64      ⋯
───────┼──────────────────────────────────────────────────────────────────────────────
     1 │   0.0619327    0.0623264    0.0613998    0.0466594    0.0481949    0.0454962  ⋯
     2 │    0.217879     0.216951     0.217742   0.00738607   0.00888615   0.00626926
     3 │     1.54641      1.54598      1.54328     -1.00261      -1.0013    -0.999863
     4 │    0.208777     0.207593     0.209145     -1.21253     -1.21556     -1.21462
     5 │   -0.458805    -0.458081    -0.457956     0.103491      0.10537     0.103313  ⋯
     6 │    0.392072     0.390938     0.390447    -0.354123    -0.354995    -0.353026
     7 │   -0.313095    -0.310223    -0.311185      1.09256      1.09373      1.09443
   ⋮   │      ⋮            ⋮            ⋮            ⋮            ⋮            ⋮       ⋱
  9994 │    -1.24411     -1.24363     -1.24439    -0.789893    -0.793004    -0.792177
  9995 │    0.199036     0.199199     0.199344     0.945945     0.945308     0.943717  ⋯
  9996 │     1.81075      1.80926      1.81064     -2.53813     -2.53805     -2.53996
  9997 │ -0.00896532   -0.0079907  -0.00876527    -0.629303    -0.629402    -0.630129
  9998 │    -1.62881     -1.62626     -1.62703    -0.222873    -0.222469     -0.22166
  9999 │     1.45152      1.44833      1.45131       -2.543     -2.54377       -2.544  ⋯
 10000 │    0.436075     0.435492     0.436974     -0.28131    -0.281519    -0.283039
                                                      4 columns and 9986 rows omitted
```

Indeed, we see that there are 9 features and one target variable. We can also see visually that the variables `x1`, `x2`, and `x3` are almost the same but not identical. The same holds for `x4`, `x5`, and `x6`. (I have cropped the rest of the printout as it was too wide for the post.)
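
If you want to double-check the correlation structure of the sampled data numerically, a quick sketch using the Statistics standard library could look like this (I do not show its output here):

```
using Statistics

# correlation of each feature with the target;
# by design it should be roughly 0.4 for x1-x3, 0.5 for x4-x6, and 0.6 for x7-x9
[cor(df[!, "x$i"], df.y) for i in 1:9]

# within-group correlations should be very close to (but not exactly) 1.0
cor(df.x1, df.x2), cor(df.x4, df.x5), cor(df.x7, df.x8)
```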

I chose such a data generation scheme because *a priori*, that is, with respect to a given machine learning algorithm and a given dataset, the importance of the features is as follows:

- `x1`, `x2`, and `x3` should have very similar importance, the lowest of the three groups (their correlation with `y` is lowest by design);
- `x4`, `x5`, and `x6` should have very similar, medium importance;
- `x7`, `x8`, and `x9` should have very similar importance, the highest of the three groups (their correlation with `y` is highest by design).

However, if we build a specific boosted tree model, can we expect the same relationship? Let us check.

# Variable importance with respect to a specific model

We build a boosted tree model (using the default settings) and evaluate the variable importance of its features:

```
julia> model = fit_evotree(EvoTreeRegressor(),
                           df;
                           target_name="y",
                           verbosity=0);

julia> EvoTrees.importance(model)
9-element Vector{Pair{String, Float64}}:
 "x9" => 0.33002820995126636
 "x4" => 0.17950260124468856
 "x5" => 0.10630471720405912
 "x7" => 0.1002898622306779
 "x1" => 0.09023808819243322
 "x8" => 0.060680998291169054
 "x3" => 0.04789330560493748
 "x6" => 0.044689013127277216
 "x2" => 0.040373204153491105
```

We see that the `x9` feature has the highest importance, but it is quite different from the importance of `x8` and `x7`. The `x4` feature, although it has a lower correlation with `y` than e.g. the `x8` feature, has a higher variable importance (the same holds for `x1` vs `x8`).

What is the reason for this situation? It seems that when the boosted tree model was built, the `x9` variable captured most of the value of explaining `y` from the `x7` and `x8` variables, as they are very similar. Therefore, in this specific model, `x7` and `x8` are not that important.
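
One way to probe this explanation (a sketch I have not run above, so I do not show its output) is to drop `x9` from the data and refit the model; with `x9` gone we would expect `x7` and `x8` to take over most of its importance:

```
# hypothetical check: refit without x9 and inspect how importance redistributes
model_no_x9 = fit_evotree(EvoTreeRegressor(),
                          select(df, Not("x9"));
                          target_name="y",
                          verbosity=0)
EvoTrees.importance(model_no_x9)
```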

Let us try estimating the model a second time to see if we notice any difference:

```
julia> model = fit_evotree(EvoTreeRegressor(),
                           df;
                           target_name="y",
                           verbosity=0);

julia> EvoTrees.importance(model)
9-element Vector{Pair{String, Float64}}:
 "x9" => 0.33002820995126636
 "x4" => 0.17950260124468856
 "x5" => 0.10630471720405912
 "x7" => 0.1002898622306779
 "x1" => 0.09023808819243322
 "x8" => 0.060680998291169054
 "x3" => 0.04789330560493748
 "x6" => 0.044689013127277216
 "x2" => 0.040373204153491105
```

The results are identical. You might wonder what the reason for this is. The cause is that the `fit_evotree` function uses a default seed when doing its computations, so we get the same model twice. To be precise, when we call `EvoTreeRegressor()` it sets the seed of the default random number generator in the current task to `123`.
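
If you want to convince yourself of this reseeding behavior, here is a small sketch (it assumes the behavior described above, which may change in future EvoTrees.jl versions):

```
# if EvoTreeRegressor() reseeds the task-local default RNG to 123, a draw taken
# right after the constructor matches a draw taken right after Random.seed!(123)
EvoTreeRegressor()
a = rand()
Random.seed!(123)
b = rand()
a == b  # expected to be true under the behavior described above
```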

So let us try shuffling the variables to see if we get a different result:

```
julia> model = fit_evotree(EvoTreeRegressor(),
                           df[!, randperm(10)];
                           target_name="y",
                           verbosity=0);

julia> EvoTrees.importance(model)
9-element Vector{Pair{String, Float64}}:
 "x9" => 0.23187113718728977
 "x8" => 0.20285271199278873
 "x4" => 0.16779901582722756
 "x5" => 0.15415181545562057
 "x3" => 0.08494533205347177
 "x2" => 0.06781415123236784
 "x7" => 0.05788796227269619
 "x1" => 0.024214826049282448
 "x6" => 0.008463047929255035
```

Indeed, the variable importance changed. Would it still be different if we did another randomized run?

```
julia> model = fit_evotree(EvoTreeRegressor(),
                           df[!, randperm(10)];
                           target_name="y",
                           verbosity=0);

julia> EvoTrees.importance(model)
9-element Vector{Pair{String, Float64}}:
 "x9" => 0.23187113718728977
 "x8" => 0.20285271199278873
 "x4" => 0.16779901582722756
 "x5" => 0.15415181545562057
 "x3" => 0.08494533205347177
 "x2" => 0.06781415123236784
 "x7" => 0.05788796227269619
 "x1" => 0.024214826049282448
 "x6" => 0.008463047929255035
```

This might come as a surprise, but the answer is no: we get the same results. What is going on?

This time the answer is that the `randperm` and `fit_evotree` functions share the same random number generator (as we run them in the same task), and its state is reset to `123` every time we call `EvoTreeRegressor()`. This means that the generator was in the same state on both invocations of `randperm`, so the `randperm(10)` call produced the same permutation both times.

# Variable importance with respect to an algorithm and a data set

This gives us an important general lesson: we need to properly initialize the random number generators used by any functions in our code that rely on randomness. Let us leverage this knowledge to try to assess variable importance with respect to the boosted trees algorithm and the data set we have generated.

In the code below I fit 1000 boosted tree models and compute the mean variable importance (along with some additional statistics) of each feature across all of them:

```
julia> rng = Xoshiro(1);

julia> reduce(vcat,
              map(1:1000) do _
                  fit_evotree(EvoTreeRegressor(rng=rng),
                              df;
                              target_name="y",
                              verbosity=0) |>
                  EvoTrees.importance |>
                  sort |>
                  DataFrame
              end) |> describe
9×7 DataFrame
 Row │ variable  mean       min         median     max        nmissing  eltype
     │ Symbol    Float64    Float64     Float64    Float64    Int64     DataType
─────┼─────────────────────────────────────────────────────────────────────────
   1 │ x1         0.094902   0.0456523  0.0946136     0.1437         0  Float64
   2 │ x2        0.0475577   0.0109109  0.0469086  0.0969166         0  Float64
   3 │ x3        0.0361729  0.00189622  0.0353323  0.0844639         0  Float64
   4 │ x4         0.128568    0.032614   0.127093   0.285664         0  Float64
   5 │ x5         0.111577  0.00310871   0.109548   0.238868         0  Float64
   6 │ x6        0.0891413         0.0  0.0879858   0.196108         0  Float64
   7 │ x7         0.189893   0.0400903   0.179193   0.417476         0  Float64
   8 │ x8         0.155352    0.013377   0.154884   0.371452         0  Float64
   9 │ x9         0.146837  0.00642872   0.141293   0.397151         0  Float64
```

This time we see a better separation between the groups: variables `x1`, `x2`, and `x3` come last, followed by `x4`, `x5`, and `x6`, and finally the `x7`, `x8`, and `x9` group. However, we still see some non-negligible within-group differences (and `x1` even scores better than `x6`). It seems that just making sure we properly pass a random number generator to the `fit_evotree` model is not enough.

Let us then make a final attempt. This time we both properly pass the random number generator to the `fit_evotree` model and randomize the order of the variables in the source data frame (making sure we also properly pass the random number generator to the `randperm` function):

```
julia> reduce(vcat,
              map(1:1000) do _
                  fit_evotree(EvoTreeRegressor(rng=rng),
                              df[!, randperm(rng, 10)];
                              target_name="y",
                              verbosity=0) |>
                  EvoTrees.importance |>
                  sort |>
                  DataFrame
              end) |> describe
9×7 DataFrame
 Row │ variable  mean       min         median     max        nmissing  eltype
     │ Symbol    Float64    Float64     Float64    Float64    Int64     DataType
─────┼─────────────────────────────────────────────────────────────────────────
   1 │ x1        0.0595918  0.00260002  0.0524585   0.153613         0  Float64
   2 │ x2        0.0604078  0.00374867  0.0534607   0.148399         0  Float64
   3 │ x3        0.0586352  0.00219417   0.051398   0.140393         0  Float64
   4 │ x4         0.103219  0.00484802   0.101625   0.252127         0  Float64
   5 │ x5         0.119415  0.00146451   0.116338   0.268475         0  Float64
   6 │ x6         0.106432  0.00672514   0.102708   0.254382         0  Float64
   7 │ x7         0.153185  0.00346922   0.139289   0.388995         0  Float64
   8 │ x8         0.166709  0.00651302   0.161872   0.419284         0  Float64
   9 │ x9         0.172406  0.00979053   0.166798   0.444192         0  Float64
```

This time we have succeeded. While there is still some variability in the within-group variable importance, it is small, and the groups are clearly separated. The worst features have an importance of around 6%, the medium-value features around 11%, and the best features around 16%.

# Conclusions

Let us recap what we have seen today.

First, we have seen that variable importance with respect to a concrete instance of a machine learning model can differ significantly from variable importance with respect to a given learning algorithm and a given dataset. Therefore, one should think carefully about which of them is interesting from the perspective of the problem one wants to solve.

The second lesson is related to implementation details of machine learning algorithms:

- Many of them use pseudorandom numbers when building a model, and proper handling of the pseudorandom number generator is crucial.
- The result of model building can depend on the order of features in the source data set. In our example we have seen that shuffling the columns of the input data table produced significantly different variable importance results (the sketch after this list recaps the pattern that addresses both issues).
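
To recap the pattern in one place, the final experiment above boils down to creating a single explicit random number generator and passing it both to the model constructor and to the column shuffling:

```
# the pattern used in the last experiment of this post
rng = Xoshiro(1)
model = fit_evotree(EvoTreeRegressor(rng=rng),
                    df[!, randperm(rng, 10)];
                    target_name="y",
                    verbosity=0)
```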

I hope you found these examples useful!