Variable importance with EvoTrees.jl

By: Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/08/11/evotrees.html

Introduction

Variable importance in a predictive modeling context
can have several meanings. Today I want to investigate two of them:

  • How important is a feature with respect to a given machine learning model?
  • How important is a feature with respect to a given learning algorithm and a given dataset?

In this post we will discuss why and how these two concepts can differ, using
boosted trees as an example. We will use their implementation available in the EvoTrees.jl package.

The post was written under Julia 1.9.2, DataFrames.jl 1.6.1, Distributions.jl 0.25.98,
and EvoTrees.jl 0.15.2.

Generating the test data

Let us start with loading the packages we are going to use and generating the test data.

What I want to have is a data set with 10,000 observations.
It will have 9 continuous features denoted x1 to x9 and a continuous target variable y.
The goal is to have triplets of features (x1 to x3, x4 to x6, x7 to x9)
that are highly correlated within each triplet (but independent between groups)
and that have a different correlation level with the target y. I have made the within-group
variables highly correlated but non-identical (so they are distinguishable and have
a slightly different correlation with y in the sampled data set).

The code generating such data is as follows:

using DataFrames
using Distributions
using EvoTrees
using LinearAlgebra
using Random

# within-group correlation is 1 - δ, so features in a group are
# almost (but not exactly) identical
δ = 1.0e-6
b = fill(1.0 - δ, 3, 3) + δ * I  # covariance block of a single feature group
z = zeros(3, 3)                  # zero covariance between different groups
y = fill(0.5, 3)                 # base covariance of a group with the target
# the full covariance matrix: three independent groups of three features plus
# the target, whose covariance with the groups is 0.4, 0.5, and 0.6 respectively
dist = MvNormal([b      z  z      0.8*y
                 z      b  z      y
                 z      z  b      1.2*y
                 0.8*y' y' 1.2*y' 1.0])
Random.seed!(1)
mat = rand(dist, 10_000);  # 10 rows (variables) × 10,000 columns (observations)
df = DataFrame(transpose(mat), [string.("x", 1:9); "y"]);

Let us have a peek at the data:

julia> names(df)
10-element Vector{String}:
 "x1"
 "x2"
 "x3"
 "x4"
 "x5"
 "x6"
 "x7"
 "x8"
 "x9"
 "y"

julia> df
10000×10 DataFrame
   Row │ x1           x2          x3           x4           x5           x6          ⋯
       │ Float64      Float64     Float64      Float64      Float64      Float64     ⋯
───────┼──────────────────────────────────────────────────────────────────────────────
     1 │  0.0619327    0.0623264   0.0613998    0.0466594    0.0481949    0.0454962  ⋯
     2 │  0.217879     0.216951    0.217742     0.00738607   0.00888615   0.00626926
     3 │  1.54641      1.54598     1.54328     -1.00261     -1.0013      -0.999863
     4 │  0.208777     0.207593    0.209145    -1.21253     -1.21556     -1.21462
     5 │ -0.458805    -0.458081   -0.457956     0.103491     0.10537      0.103313   ⋯
     6 │  0.392072     0.390938    0.390447    -0.354123    -0.354995    -0.353026
     7 │ -0.313095    -0.310223   -0.311185     1.09256      1.09373      1.09443
   ⋮   │      ⋮           ⋮            ⋮            ⋮            ⋮            ⋮      ⋱
  9994 │ -1.24411     -1.24363    -1.24439     -0.789893    -0.793004    -0.792177
  9995 │  0.199036     0.199199    0.199344     0.945945     0.945308     0.943717   ⋯
  9996 │  1.81075      1.80926     1.81064     -2.53813     -2.53805     -2.53996
  9997 │ -0.00896532  -0.0079907  -0.00876527  -0.629303    -0.629402    -0.630129
  9998 │ -1.62881     -1.62626    -1.62703     -0.222873    -0.222469    -0.22166
  9999 │  1.45152      1.44833     1.45131     -2.543       -2.54377     -2.544      ⋯
 10000 │  0.436075     0.435492    0.436974    -0.28131     -0.281519    -0.283039
                                                       4 columns and 9986 rows omitted

Indeed, we see that there are 9 features and one target variable. We can also visually
confirm that variables x1, x2, and x3 are almost the same but not identical.
The same holds for x4, x5, and x6.
(I have cropped the rest of the printout as it was too wide for the post.)

I chose such a data generation scheme because a priori, that is
with respect to a given machine learning algorithm and a given dataset,
the importance of the features is as follows (a quick correlation check is sketched after the list):

  • x1, x2, and x3 should have very similar importance, and it should be the lowest
    (their correlation with y is the lowest by design);
  • x4, x5, and x6 should have very similar, medium importance;
  • x7, x8, and x9 should have very similar importance, and it should be the highest
    (their correlation with y is the highest by design).
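
As a quick sanity check of this design (this check is not part of the original analysis;
it only uses the Statistics standard library), we can compute the sample correlations
of the features with the target:

using Statistics

# sample correlation of each feature with the target;
# by construction these should be close to 0.4 for x1-x3,
# 0.5 for x4-x6, and 0.6 for x7-x9
[cor(df[!, "x$i"], df.y) for i in 1:9]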

However, if we build a specific boosted tree model, can we expect the same relationship?
Let us check.

Variable importance with respect to a specific model

We build a boosted tree model (using the default settings) and evaluate
variable importance of the features:

julia> model = fit_evotree(EvoTreeRegressor(),
                           df;
                           target_name="y",
                           verbosity=0);

julia> EvoTrees.importance(model)
9-element Vector{Pair{String, Float64}}:
 "x9" => 0.33002820995126636
 "x4" => 0.17950260124468856
 "x5" => 0.10630471720405912
 "x7" => 0.1002898622306779
 "x1" => 0.09023808819243322
 "x8" => 0.060680998291169054
 "x3" => 0.04789330560493748
 "x6" => 0.044689013127277216
 "x2" => 0.040373204153491105

We see that the x9 feature has the highest importance, but it is quite different
from that of x8 and x7. The x4 feature, although it has a lower correlation with
y than e.g. the x8 feature, has a higher variable importance (the same holds
for x1 vs. x8).

What is the reason for this? It seems that, when the boosted tree model was built,
the x9 variable captured most of the explanatory value for y from the
x7 and x8 variables, as they are very similar to it.
Therefore, in this specific model, x7 and x8 are not that important.
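
One rough way to probe this explanation (this experiment is not part of the original post, and
the model_no_x9 name is only illustrative) would be to refit the model on the data with x9
dropped and check whether x7 and x8 pick up the importance that x9 previously absorbed:

# sketch: refit on a copy of the data without x9 (not from the original run)
model_no_x9 = fit_evotree(EvoTreeRegressor(),
                          select(df, Not("x9"));
                          target_name="y",
                          verbosity=0);
EvoTrees.importance(model_no_x9)

If the explanation above is correct, one would expect a larger share of the importance
to shift to x7 and x8 in such a refit.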

Let us try estimating the model a second time to see if we notice any difference:

julia> model = fit_evotree(EvoTreeRegressor(),
                           df;
                           target_name="y",
                           verbosity=0);

julia> EvoTrees.importance(model)
9-element Vector{Pair{String, Float64}}:
 "x9" => 0.33002820995126636
 "x4" => 0.17950260124468856
 "x5" => 0.10630471720405912
 "x7" => 0.1002898622306779
 "x1" => 0.09023808819243322
 "x8" => 0.060680998291169054
 "x3" => 0.04789330560493748
 "x6" => 0.044689013127277216
 "x2" => 0.040373204153491105

The results are identical. You might wonder why. The reason
is that the fit_evotree function uses a default seed in its
computations, so we get the same trees twice. To be precise, when we call
EvoTreeRegressor() it sets the seed of the default random number generator in the
current task to 123.
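
A quick way to observe this side effect (a sketch based on the behavior just described,
not taken from the original post) is to draw random numbers right after constructing the regressor:

# constructing the regressor reseeds the default RNG of the current task,
# so the draws after each construction should be identical
EvoTreeRegressor(); a = rand(3)
EvoTreeRegressor(); b = rand(3)
a == b  # expected to return true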

So let us try shuffling the variables to see if we would get a different result:

julia> model = fit_evotree(EvoTreeRegressor(),
                           df[!, randperm(10)];
                           target_name="y",
                           verbosity=0);

julia> EvoTrees.importance(model)
9-element Vector{Pair{String, Float64}}:
 "x9" => 0.23187113718728977
 "x8" => 0.20285271199278873
 "x4" => 0.16779901582722756
 "x5" => 0.15415181545562057
 "x3" => 0.08494533205347177
 "x2" => 0.06781415123236784
 "x7" => 0.05788796227269619
 "x1" => 0.024214826049282448
 "x6" => 0.008463047929255035

Indeed, the variable importance changed. Would it still be different if we did another
randomized run?

julia> model = fit_evotree(EvoTreeRegressor(),
                           df[!, randperm(10)];
                           target_name="y",
                           verbosity=0);

julia> EvoTrees.importance(model)
9-element Vector{Pair{String, Float64}}:
 "x9" => 0.23187113718728977
 "x8" => 0.20285271199278873
 "x4" => 0.16779901582722756
 "x5" => 0.15415181545562057
 "x3" => 0.08494533205347177
 "x2" => 0.06781415123236784
 "x7" => 0.05788796227269619
 "x1" => 0.024214826049282448
 "x6" => 0.008463047929255035

Maybe this was a surprise to you, but the answer is: no. We get the same results.
What is going on?

This time the reason is that the fit_evotree and randperm functions share the same
random number generator (as we run them in the same task), and its state is reset
to 123 when we call EvoTreeRegressor().
This means that when we invoked randperm the generator was in the same state both times, so the
randperm(10) call produced the same permutation.
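
To see this mechanism in isolation (again a sketch that is not part of the original post), note
that Julia evaluates function arguments from left to right, so in the calls above EvoTreeRegressor()
reseeds the generator before randperm(10) draws from it:

# the regressor is constructed (and the default RNG reseeded) before randperm
# draws, so both permutations should come out identical
EvoTreeRegressor(); p1 = randperm(10)
EvoTreeRegressor(); p2 = randperm(10)
p1 == p2  # expected to return true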

Variable importance with respect to an algorithm and a data set

This teaches us an important general lesson: we need to properly initialize the functions
that use randomness in our code. Let us leverage this knowledge to
try assessing variable importance with respect to the boosted trees algorithm and the data set
we have generated.

In the code below I generate 1000 boosted tree models
and compute the mean variable importance (along with some additional statistics)
across all of them:

julia> rng = Xoshiro(1);

julia> reduce(vcat,
           map(1:1000) do _
               fit_evotree(EvoTreeRegressor(rng=rng),
                           df;
                           target_name="y",
                           verbosity=0) |>
               EvoTrees.importance |>
               sort |>
               DataFrame
           end) |> describe
9×7 DataFrame
 Row │ variable  mean       min         median     max        nmissing  eltype
     │ Symbol    Float64    Float64     Float64    Float64    Int64     DataType
─────┼───────────────────────────────────────────────────────────────────────────
   1 │ x1        0.094902   0.0456523   0.0946136  0.1437            0  Float64
   2 │ x2        0.0475577  0.0109109   0.0469086  0.0969166         0  Float64
   3 │ x3        0.0361729  0.00189622  0.0353323  0.0844639         0  Float64
   4 │ x4        0.128568   0.032614    0.127093   0.285664          0  Float64
   5 │ x5        0.111577   0.00310871  0.109548   0.238868          0  Float64
   6 │ x6        0.0891413  0.0         0.0879858  0.196108          0  Float64
   7 │ x7        0.189893   0.0400903   0.179193   0.417476          0  Float64
   8 │ x8        0.155352   0.013377    0.154884   0.371452          0  Float64
   9 │ x9        0.146837   0.00642872  0.141293   0.397151          0  Float64

This time we see a better separation between the x1, x2, and x3 group, followed
by the x4, x5, and x6 group, and finally the x7, x8, and x9 group.
However, we still see some non-negligible within-group differences (and x1 even scores higher than x6).
It seems that properly passing a random number generator to the fit_evotree model
is not enough.

Let us then make a final attempt. This time we both pass the random number generator to
the fit_evotree model and randomize the order of the variables in the source data frame
(making sure that we also pass the random number generator to the randperm function):

julia> reduce(vcat,
           map(1:1000) do _
               fit_evotree(EvoTreeRegressor(rng=rng),
                           df[!, randperm(rng, 10)];
                           target_name="y",
                           verbosity=0) |>
               EvoTrees.importance |>
               sort |>
               DataFrame
           end) |> describe
9×7 DataFrame
 Row │ variable  mean       min         median     max       nmissing  eltype
     │ Symbol    Float64    Float64     Float64    Float64   Int64     DataType
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ x1        0.0595918  0.00260002  0.0524585  0.153613         0  Float64
   2 │ x2        0.0604078  0.00374867  0.0534607  0.148399         0  Float64
   3 │ x3        0.0586352  0.00219417  0.051398   0.140393         0  Float64
   4 │ x4        0.103219   0.00484802  0.101625   0.252127         0  Float64
   5 │ x5        0.119415   0.00146451  0.116338   0.268475         0  Float64
   6 │ x6        0.106432   0.00672514  0.102708   0.254382         0  Float64
   7 │ x7        0.153185   0.00346922  0.139289   0.388995         0  Float64
   8 │ x8        0.166709   0.00651302  0.161872   0.419284         0  Float64
   9 │ x9        0.172406   0.00979053  0.166798   0.444192         0  Float64

This time we have succeeded. While there is still some variability in the within-group
variable importance, it is small, and the groups are clearly separated. The worst
features have their importance around 6%, the medium-value features around 11%,
and the best features around 16%.

Conclusions

Let us recap what we have seen today.

First, we have seen that variable importance with respect to a given concrete instance
of a machine learning model can be significantly different from variable importance
with respect to a given learning algorithm and a given dataset. Therefore, one should
carefully think about which of the two is relevant for the problem one
wants to solve.

The second lesson is related to implementation details of
machine learning algorithms:

  1. Many of them use pseudorandom numbers when building a model, so proper
    handling of the pseudorandom number generator is crucial.
  2. The result of model building can depend on the order of features in the
    source data set. In our example we have seen that shuffling the columns
    of an input data table produced significantly different variable
    importance results.

I hope you found these examples useful!