Re-posted from: https://bkamins.github.io/julialang/2023/08/11/evotrees.html
Introduction
Variable importance in the context of predictive modeling can have several meanings. Today I want to investigate two of them:
- How important is a feature with respect to a given machine learning model?
- How important is a feature with respect to a given learning algorithm and a given dataset?
In this post we will discuss why and how these two concepts can differ, using boosted trees as an example. We will use their implementation available in the EvoTrees.jl package.
The post was written under Julia 1.9.2, DataFrames.jl 1.6.1, Distributions.jl 0.25.98,
and EvoTrees.jl 0.15.2.
Generating the test data
Let us start by loading the packages we are going to use and generating the test data.
I want a data set with 10,000 observations. We will have 9 continuous features, denoted x1 to x9, and a continuous target variable y.
Our goal is to have the triplets of features (x1 to x3, x4 to x6, x7 to x9)
highly correlated within each group (but independent between groups) and having a different
correlation level with the y target. I have made the within-group variables
highly correlated but non-identical (so they are distinguishable and have
a slightly different correlation with y in the sampled data set).
The code generating such data is as follows:
using DataFrames
using Distributions
using EvoTrees
using LinearAlgebra
using Random
δ = 1.0e-6  # small perturbation so within-group correlations are slightly below 1
b = fill(1.0 - δ, 3, 3) + δ * I  # within-group correlation block (1 - δ off the diagonal)
z = zeros(3, 3)  # between-group correlations are zero
y = fill(0.5, 3)  # base correlation of a group with the target
dist = MvNormal([b z z 0.8*y
                 z b z y
                 z z b 1.2*y
                 0.8*y' y' 1.2*y' 1.0])
Random.seed!(1)
mat = rand(dist, 10_000);
df = DataFrame(transpose(mat), [string.("x", 1:9); "y"]);
Let us have a peek at the data:
julia> names(df)
10-element Vector{String}:
"x1"
"x2"
"x3"
"x4"
"x5"
"x6"
"x7"
"x8"
"x9"
"y"
julia> df
10000×10 DataFrame
Row │ x1 x2 x3 x4 x5 x6 ⋯
│ Float64 Float64 Float64 Float64 Float64 Float64 ⋯
───────┼──────────────────────────────────────────────────────────────────────────────
1 │ 0.0619327 0.0623264 0.0613998 0.0466594 0.0481949 0.0454962 ⋯
2 │ 0.217879 0.216951 0.217742 0.00738607 0.00888615 0.00626926
3 │ 1.54641 1.54598 1.54328 -1.00261 -1.0013 -0.999863
4 │ 0.208777 0.207593 0.209145 -1.21253 -1.21556 -1.21462
5 │ -0.458805 -0.458081 -0.457956 0.103491 0.10537 0.103313 ⋯
6 │ 0.392072 0.390938 0.390447 -0.354123 -0.354995 -0.353026
7 │ -0.313095 -0.310223 -0.311185 1.09256 1.09373 1.09443
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
9994 │ -1.24411 -1.24363 -1.24439 -0.789893 -0.793004 -0.792177
9995 │ 0.199036 0.199199 0.199344 0.945945 0.945308 0.943717 ⋯
9996 │ 1.81075 1.80926 1.81064 -2.53813 -2.53805 -2.53996
9997 │ -0.00896532 -0.0079907 -0.00876527 -0.629303 -0.629402 -0.630129
9998 │ -1.62881 -1.62626 -1.62703 -0.222873 -0.222469 -0.22166
9999 │ 1.45152 1.44833 1.45131 -2.543 -2.54377 -2.544 ⋯
10000 │ 0.436075 0.435492 0.436974 -0.28131 -0.281519 -0.283039
4 columns and 9986 rows omitted
Indeed, we see that there are 9 features and one target variable. We can also see visually
that variables x1, x2, and x3 are almost the same, but not identical. Similarly for x4, x5, and x6.
(I have cropped the rest of the printout as it was too wide for the post.)
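We can verify this numerically as well. Here is a quick check (a minimal sketch; it additionally assumes the Statistics standard library, which is not loaded in the code above):

using Statistics

# within-group correlation is just below 1.0 by construction (we used 1 - δ),
# while between-group correlation is close to 0.0
cor(df.x1, df.x2), cor(df.x1, df.x4)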
I chose such a data generation scheme since a priori, that is
with respect to a given machine learning algorithm and a given dataset,
the importance of the features is as follows:
- x1, x2, and x3 should have a very similar and lowest importance (their correlation with y is lowest by design);
- x4, x5, and x6 should have a very similar and medium importance;
- x7, x8, and x9 should have a very similar and highest importance (their correlation with y is highest by design).
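Since these correlations with y drive the expected ordering, we can also check them in the sampled data (a small sketch, again using Statistics):

# correlation of each feature with the target; by design roughly
# 0.4 for x1 to x3, 0.5 for x4 to x6, and 0.6 for x7 to x9
[cor(df[!, "x$i"], df.y) for i in 1:9]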
However, if we build a specific boosted tree model, can we expect the same relationship?
Let us check.
Variable importance with respect to a specific model
We build a boosted tree model (using the default settings) and evaluate
the variable importance of its features:
julia> model = fit_evotree(EvoTreeRegressor(),
df;
target_name="y",
verbosity=0);
julia> EvoTrees.importance(model)
9-element Vector{Pair{String, Float64}}:
"x9" => 0.33002820995126636
"x4" => 0.17950260124468856
"x5" => 0.10630471720405912
"x7" => 0.1002898622306779
"x1" => 0.09023808819243322
"x8" => 0.060680998291169054
"x3" => 0.04789330560493748
"x6" => 0.044689013127277216
"x2" => 0.040373204153491105
We see that the x9 feature has the highest importance, but its importance is quite different
from that of x8 and x7. The x4 feature, although it has a lower correlation with
y than e.g. the x8 feature, has a higher variable importance (the same holds
for x1 vs x8).
What is the reason for such a situation? It seems that when the boosted tree model
was built, the x9 variable captured most of the explanatory value for y from the
x7 and x8 variables, as they are very similar.
Therefore, in this specific model, x7 and x8 are not that important.
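One way to see that the groups as a whole still behave as expected is to sum the importances within each a priori group. A small sketch, reusing the model fitted above:

# sum the importances within each of the three feature groups
imp = Dict(EvoTrees.importance(model))
[sum(imp["x$i"] for i in g) for g in (1:3, 4:6, 7:9)]

With the importances printed above this gives roughly 0.18 for x1 to x3, 0.33 for x4 to x6, and 0.49 for x7 to x9, so the group totals follow the a priori ordering even though the individual features do not.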
Let us try estimating the model a second time to see if we notice any difference:
julia> model = fit_evotree(EvoTreeRegressor(),
df;
target_name="y",
verbosity=0);
julia> EvoTrees.importance(model)
9-element Vector{Pair{String, Float64}}:
"x9" => 0.33002820995126636
"x4" => 0.17950260124468856
"x5" => 0.10630471720405912
"x7" => 0.1002898622306779
"x1" => 0.09023808819243322
"x8" => 0.060680998291169054
"x3" => 0.04789330560493748
"x6" => 0.044689013127277216
"x2" => 0.040373204153491105
The results are identical. You might wonder what the reason for this is. The cause
is that the fit_evotree function uses a default seed when doing its
computations, so we get the same trees twice. To be precise, when we call
EvoTreeRegressor() it sets the seed of the default random number generator in the
current task to 123.
So let us try shuffling the variables to see whether we get a different result:
julia> model = fit_evotree(EvoTreeRegressor(),
df[!, randperm(10)];
target_name="y",
verbosity=0);
julia> EvoTrees.importance(model)
9-element Vector{Pair{String, Float64}}:
"x9" => 0.23187113718728977
"x8" => 0.20285271199278873
"x4" => 0.16779901582722756
"x5" => 0.15415181545562057
"x3" => 0.08494533205347177
"x2" => 0.06781415123236784
"x7" => 0.05788796227269619
"x1" => 0.024214826049282448
"x6" => 0.008463047929255035
Indeed, the variable importances changed. Would they be different again if we did another
randomized run?
julia> model = fit_evotree(EvoTreeRegressor(),
df[!, randperm(10)];
target_name="y",
verbosity=0);
julia> EvoTrees.importance(model)
9-element Vector{Pair{String, Float64}}:
"x9" => 0.23187113718728977
"x8" => 0.20285271199278873
"x4" => 0.16779901582722756
"x5" => 0.15415181545562057
"x3" => 0.08494533205347177
"x2" => 0.06781415123236784
"x7" => 0.05788796227269619
"x1" => 0.024214826049282448
"x6" => 0.008463047929255035
Maybe this was a surprise to you, but the answer is: no. We get the same results.
What is going on?
This time the answer is that the fit_evotree and randperm functions share the same
random number generator (as we run them in the same task) and fit_evotree resets its state
to 123 when we call EvoTreeRegressor().
This means that when we invoked randperm the generator was in the same state both times, so the
randperm(10) call produced the same sequence of numbers.
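We can demonstrate this side effect directly. A minimal sketch, relying only on the seeding behavior described above:

EvoTreeRegressor();  # resets the seed of the task-local default RNG to 123
p1 = randperm(10)
EvoTreeRegressor();  # resets the seed again
p2 = randperm(10)
p1 == p2  # true: both randperm calls saw the generator in the same state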
Variable importance with respect to an algorithm and a data set
This teaches us an important general lesson: we need to properly initialize the functions
that use randomness in our code. Let us leverage this knowledge to
assess variable importance with respect to the boosted trees algorithm and the data set
we have generated.
In the code below I generate 1000 boosted tree models
and compute the mean variable importance (along with some additional statistics)
across all of them:
julia> rng = Xoshiro(1);
julia> reduce(vcat,
map(1:1000) do _
fit_evotree(EvoTreeRegressor(rng=rng),
df;
target_name="y",
verbosity=0) |>
EvoTrees.importance |>
sort |>
DataFrame
end) |> describe
9×7 DataFrame
Row │ variable mean min median max nmissing eltype
│ Symbol Float64 Float64 Float64 Float64 Int64 DataType
─────┼───────────────────────────────────────────────────────────────────────────
1 │ x1 0.094902 0.0456523 0.0946136 0.1437 0 Float64
2 │ x2 0.0475577 0.0109109 0.0469086 0.0969166 0 Float64
3 │ x3 0.0361729 0.00189622 0.0353323 0.0844639 0 Float64
4 │ x4 0.128568 0.032614 0.127093 0.285664 0 Float64
5 │ x5 0.111577 0.00310871 0.109548 0.238868 0 Float64
6 │ x6 0.0891413 0.0 0.0879858 0.196108 0 Float64
7 │ x7 0.189893 0.0400903 0.179193 0.417476 0 Float64
8 │ x8 0.155352 0.013377 0.154884 0.371452 0 Float64
9 │ x9 0.146837 0.00642872 0.141293 0.397151 0 Float64
This time we see a better separation between variables x1, x2, and x3, followed
by x4, x5, and x6, and finally the x7, x8, and x9 group.
However, we still see some non-negligible within-group differences (and x1 even has a higher mean importance than x6).
It seems that just properly passing a random number generator to fit_evotree is not enough.
Let us then make a final attempt. This time we both properly pass the random number generator to
fit_evotree and randomize the order of variables in the source data frame
(making sure we also properly pass the random number generator to the randperm function):
julia> reduce(vcat,
map(1:1000) do _
fit_evotree(EvoTreeRegressor(rng=rng),
df[!, randperm(rng, 10)];
target_name="y",
verbosity=0) |>
EvoTrees.importance |>
sort |>
DataFrame
end) |> describe
9×7 DataFrame
Row │ variable mean min median max nmissing eltype
│ Symbol Float64 Float64 Float64 Float64 Int64 DataType
─────┼──────────────────────────────────────────────────────────────────────────
1 │ x1 0.0595918 0.00260002 0.0524585 0.153613 0 Float64
2 │ x2 0.0604078 0.00374867 0.0534607 0.148399 0 Float64
3 │ x3 0.0586352 0.00219417 0.051398 0.140393 0 Float64
4 │ x4 0.103219 0.00484802 0.101625 0.252127 0 Float64
5 │ x5 0.119415 0.00146451 0.116338 0.268475 0 Float64
6 │ x6 0.106432 0.00672514 0.102708 0.254382 0 Float64
7 │ x7 0.153185 0.00346922 0.139289 0.388995 0 Float64
8 │ x8 0.166709 0.00651302 0.161872 0.419284 0 Float64
9 │ x9 0.172406 0.00979053 0.166798 0.444192 0 Float64
This time we have succeeded. While there is still some variability in within-group
variable importance, it is small, and the groups are clearly separated. The worst
features have their importance around 6%, the medium-value features around 11%,
and the best features around 16%.
Conclusions
Let us recap what we have seen today.
First, we see that variable importance with respect to a given concrete instance
of a machine learning model can be significantly different from variable importance
with respect to a given learning algorithm and a given dataset. Therefore, one should
carefully consider which one is of interest from the perspective of the problem one
wants to solve.
The second lesson is related to implementation details of
machine learning algorithms:
- Many of them use pseudorandom numbers when building a model, and proper handling of the pseudorandom number generator is crucial.
- The result of model building can depend on the order of features in the source data set. In our example we have seen that shuffling the columns of an input data table produced significantly different variable importance results.
I hope you found these examples useful!