
DataFrames.jl survey: selecting columns of a data frame based on their values

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/08/18/selectcolumn.html

Introduction

Today I want to make a user survey about a future direction
of development of DataFrames.jl.

Most commonly, users want to select columns of a data frame
using their names, and this operation has extensive
support in DataFrames.jl.
However, in some cases one might need to perform such a
selection conditional on the values stored in a column.

I want to ask if we should add a special selector allowing for
picking columns of a data frame based on their values.
The question is given in the conclusions section.
However, before we get to it let us briefly review what is
currently supported.

The post was written using Julia 1.9.2 and DataFrames.jl 1.6.1.

Column selection using column names

First create a simple data frame on which we are going
to perform the column selection examples:

julia> using DataFrames

julia> df = DataFrame(a1 = [1, 2],
                      a2=[1, missing],
                      b1=Any[3, missing],
                      b2=Any[3, 4])
2×4 DataFrame
 Row │ a1     a2       b1       b2
     │ Int64  Int64?   Any      Any
─────┼──────────────────────────────
   1 │     1        1  3        3
   2 │     2  missing  missing  4

Now assume we want to pick columns whose name starts with "b":

julia> select(df, Cols(startswith("b")))
2×2 DataFrame
 Row │ b1       b2
     │ Any      Any
─────┼──────────────
   1 │ 3        3
   2 │ missing  4

As you can see, if you pass a function returning a Bool
(such a function is often called a predicate) to the Cols
selector, you get the columns whose names match the condition
defined by this function.

Today I want to focus on cases where you specify the condition
using a function. However, let me mention that there
are other ways to perform the selection we discussed above. For example,
we could use a regular expression:

julia> select(df, Cols(r"^b"))
2×2 DataFrame
 Row │ b1       b2
     │ Any      Any
─────┼──────────────
   1 │ 3        3
   2 │ missing  4

In general, in DataFrames.jl you currently have the following ways to
select columns (Warning! The list is long. A few of them are illustrated in a short sketch right after it.):

  • a symbol, string, or integer;
  • vector of symbols, strings, integers, or bools;
  • regular expression;
  • All, Between, Cols, Not, and : selectors.
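
To make this list more concrete, here is a short sketch of a few of these selectors applied to our df (each call returns a data frame with the indicated columns):

select(df, :a1)                         # a single column by symbol
select(df, "a1")                        # the same column by string
select(df, 1)                           # the same column by integer position
select(df, ["a1", "b2"])                # a vector of names
select(df, [true, false, false, true])  # a vector of Bools (picks :a1 and :b2)
select(df, r"2$")                       # a regular expression (picks :a2 and :b2)
select(df, Not(:a1))                    # all columns except :a1
select(df, Between(:a2, :b1))           # columns :a2 and :b1
select(df, :)                           # all columns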

Column selection using column values

What if we wanted to select columns using their values?
For example, assume that we want to pick the columns that contain
a missing value. In this case the easiest way to do it is to
use the eachcol(df) iterator over the columns of our data frame:

julia> select(df, any.(ismissing, eachcol(df)))
2×2 DataFrame
 Row │ a2       b1
     │ Int64?   Any
─────┼──────────────────
   1 │       1  3
   2 │ missing  missing

Notice that the any.(ismissing, eachcol(df)) condition
iterates over all columns of df and for each of them returns true
if it contains any missing value (and false otherwise):

julia> any.(ismissing, eachcol(df))
4-element BitVector:
 0
 1
 1
 0

An alternative, similar condition would be to select all columns
that allow storing missing values (without requiring that
they actually contain them). For this we need to use
the eltype function on the columns:

julia> select(df, Missing .<: eltype.(eachcol(df)))
2×3 DataFrame
 Row │ a2       b1       b2
     │ Int64?   Any      Any
─────┼───────────────────────
   1 │       1  3        3
   2 │ missing  missing  4

Note that the difference is column :b2, which does not
contain missing values, but could contain them since its
element type is Any.
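
If you want to double check why :b2 qualifies, here is a minimal sketch (the comments show the values expected for this df):

eltype.(eachcol(df))               # Int64, Union{Missing, Int64}, Any, Any
Missing .<: eltype.(eachcol(df))   # false, true, true, true (Missing <: Any holds, so :b2 is picked)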

Column selection using column names and values

Now, what if we wanted to perform column selection based on both
their names and values?

The general pattern uses pairs(eachcol(df)), which iterates over
pairs of column names and column values:

julia> pairs(eachcol(df))
Iterators.Pairs(::DataFrames.DataFrameColumns{DataFrame}, ::Vector{Symbol})(...):
  :a1 => [1, 2]
  :a2 => Union{Missing, Int64}[1, missing]
  :b1 => Any[3, missing]
  :b2 => Any[3, 4]

So, for example, if we wanted to pick the columns that contain missing values
and whose names start with "a", we can write:

julia> select(df, [startswith(string(n), "a") && any(ismissing, c)
                   for (n,c) in pairs(eachcol(df))])
2×1 DataFrame
 Row │ a2
     │ Int64?
─────┼─────────
   1 │       1
   2 │ missing

This pattern is fully general, but slightly verbose, especially since the column names
returned by pairs are Symbols. The same condition
can be written more naturally as follows:

julia> select(df, Cols(startswith("a"),
                       any.(ismissing, eachcol(df));
                       operator=intersect))
2×1 DataFrame
 Row │ a2
     │ Int64?
─────┼─────────
   1 │       1
   2 │ missing

Here we take advantage of the fact that our condition was a conjunction
and that the Cols selector accepts the operator keyword argument, which allows us
to take the intersection of the two selectors.

If we wanted to select columns that meet
at least one of the conditions this would be even simpler:

julia> select(df, Cols(startswith("a"),
                       any.(ismissing, eachcol(df))))
2×3 DataFrame
 Row │ a1     a2       b1
     │ Int64  Int64?   Any
─────┼─────────────────────────
   1 │     1        1  3
   2 │     2  missing  missing

Note that by default Cols selects the columns that form a union of the selectors
passed to it.

Conclusions

I hope the examples I presented today will be useful when you work with
DataFrames.jl.

You might ask why, when performing column selection
based on values, one needs to invoke eachcol(df). This is indeed
a bit verbose. However, we decided that Cols(predicate), which is
shorter to write, should apply the predicate function to column names,
as this is the more common operation. And if a user wants to apply
the predicate function to column values (which is a less frequent case),
writing predicate.(eachcol(df)) is readable and easy enough.

If we added a built-in way of selecting columns by value,
it would mean that instead of writing
predicate.(eachcol(df)) you would write something like
Vals(predicate) (Vals is just an example name; the actual name
could be decided later).
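
To make the comparison concrete, here is a sketch of the two forms side by side (remember that Vals is only a hypothetical name and does not exist in DataFrames.jl):

# current, working syntax:
select(df, any.(ismissing, eachcol(df)))

# hypothetical syntax if such a selector were added (not implemented):
# select(df, Vals(col -> any(ismissing, col)))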

The benefits of having it are:

  • It is shorter.
  • We avoid the redundancy of having to pass the df data frame in the expression.

Here are the cons of adding it:

  • It makes the list of things to learn longer (and the list is already quite long).
  • It is ambiguous how Vals(predicate) should be interpreted if it were
    used in the context of a GroupedDataFrame, as the question would be how we
    should treat the groups (so most likely it should only be allowed for AbstractDataFrame).
  • It would require a change of internal memory layout of DataFrame object
    (which means that the next release of DataFrames.jl would be incompatible on
    binary level with the current release so serialization/deserialization would
    not work cross-versions).

Now comes the question:

Should we add a new special selector that would allow picking
columns based on their values?

If you have an opinion please vote or comment in this issue on GitHub.

Variable importance with EvoTrees.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/08/11/evotrees.html

Introduction

Variable importance in predictive modeling context
can have several varied meanings. Today I want to investigate two of them:

  • How important is a feature with respect to a given machine learning model?
  • How important is a feature with respect to a given learning algorithm and a given dataset?

In the post we will discuss why and how these two concepts can differ using an example
of boosted trees. We will use their implementation available in the EvoTrees.jl package.

The post was written under Julia 1.9.2, DataFrames.jl 1.6.1, Distributions.jl 0.25.98,
and EvoTrees.jl 0.15.2.

Generating the test data

Let us start with loading the packages we are going to use and generating the test data.

What I want to have is a data set with 10,000 observations.
We will have 9 continuous features, denoted x1 to x9, and a continuous target variable y.
Our goal is to have triplets of features (x1 to x3, x4 to x6, x7 to x9)
that are highly correlated with each other (but independent between groups), with each group
having a different correlation level with the y target. I have made the within-group variables
highly correlated but non-identical (so they are distinguishable and have
slightly different correlations with y in the sampled data set).

The code generating such data is as follows:

using DataFrames
using Distributions
using EvoTrees
using LinearAlgebra
using Random

δ = 1.0e-6
b = fill(1.0 - δ, 3, 3) + δ * I  # within-group block: variance 1.0, correlation 1.0 - δ
z = zeros(3, 3)                  # between-group blocks: zero correlation
y = fill(0.5, 3)
# covariance matrix of (x1, ..., x9, y); the correlations of the three groups with y are 0.4, 0.5, and 0.6
dist = MvNormal([b      z  z      0.8*y
                 z      b  z      y
                 z      z  b      1.2*y
                 0.8*y' y' 1.2*y' 1.0])
Random.seed!(1)
mat = rand(dist, 10_000);
df = DataFrame(transpose(mat), [string.("x", 1:9); "y"]);

Let us have a peek at the data:

julia> names(df)
10-element Vector{String}:
 "x1"
 "x2"
 "x3"
 "x4"
 "x5"
 "x6"
 "x7"
 "x8"
 "x9"
 "y"

julia> df
10000×10 DataFrame
   Row │ x1           x2          x3           x4           x5           x6          ⋯
       │ Float64      Float64     Float64      Float64      Float64      Float64     ⋯
───────┼──────────────────────────────────────────────────────────────────────────────
     1 │  0.0619327    0.0623264   0.0613998    0.0466594    0.0481949    0.0454962  ⋯
     2 │  0.217879     0.216951    0.217742     0.00738607   0.00888615   0.00626926
     3 │  1.54641      1.54598     1.54328     -1.00261     -1.0013      -0.999863
     4 │  0.208777     0.207593    0.209145    -1.21253     -1.21556     -1.21462
     5 │ -0.458805    -0.458081   -0.457956     0.103491     0.10537      0.103313   ⋯
     6 │  0.392072     0.390938    0.390447    -0.354123    -0.354995    -0.353026
     7 │ -0.313095    -0.310223   -0.311185     1.09256      1.09373      1.09443
   ⋮   │      ⋮           ⋮            ⋮            ⋮            ⋮            ⋮      ⋱
  9994 │ -1.24411     -1.24363    -1.24439     -0.789893    -0.793004    -0.792177
  9995 │  0.199036     0.199199    0.199344     0.945945     0.945308     0.943717   ⋯
  9996 │  1.81075      1.80926     1.81064     -2.53813     -2.53805     -2.53996
  9997 │ -0.00896532  -0.0079907  -0.00876527  -0.629303    -0.629402    -0.630129
  9998 │ -1.62881     -1.62626    -1.62703     -0.222873    -0.222469    -0.22166
  9999 │  1.45152      1.44833     1.45131     -2.543       -2.54377     -2.544      ⋯
 10000 │  0.436075     0.435492    0.436974    -0.28131     -0.281519    -0.283039
                                                       4 columns and 9986 rows omitted

Indeed, we see that there are 9 features and one target variable. We can also visually
see that variables x1, x2, and x3 are almost the same but not identical;
similarly for x4, x5, and x6.
(I have cropped the rest of the printout as it was too wide for the post.)

I chose such a data generation scheme because, a priori (that is,
with respect to a given machine learning algorithm and a given dataset),
the importance of the features is as follows (a quick correlation check is sketched after the list):

  • x1, x2, and x3 should have a very similar and lowest importance
    (their correlation with y is lowest by design);
  • x4, x5, and x6 should have a very similar and medium importance;
  • x7, x8, and x9 should have a very similar and highest importance
    (their correlation with y is highest by design).
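
As a quick sanity check of this design, here is a minimal sketch (using the Statistics standard library) that computes the sample correlation of each feature with y; by construction these should be close to 0.4, 0.5, and 0.6 for the three groups:

using Statistics

# sample correlation of each feature with the target y
[cor(df[!, "x$i"], df.y) for i in 1:9]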

However, if we build a specific boosted tree model can we expect the same relationship?
Let us check.

Variable importance with respect to a specific model

We build a boosted tree model (using the default settings) and evaluate
variable importance of the features:

julia> model = fit_evotree(EvoTreeRegressor(),
                           df;
                           target_name="y",
                           verbosity=0);

julia> EvoTrees.importance(model)
9-element Vector{Pair{String, Float64}}:
 "x9" => 0.33002820995126636
 "x4" => 0.17950260124468856
 "x5" => 0.10630471720405912
 "x7" => 0.1002898622306779
 "x1" => 0.09023808819243322
 "x8" => 0.060680998291169054
 "x3" => 0.04789330560493748
 "x6" => 0.044689013127277216
 "x2" => 0.040373204153491105

We see that the x9 feature has the highest importance, and that it is quite different
from x8 and x7. The x4 feature, although it has a lower correlation with
y than, e.g., the x8 feature, has a higher variable importance (the same holds
for x1 vs x8).

What is the reason for this situation? It seems that when the boosted tree model
was built, the x9 variable captured most of the explanatory power for y
from the x7 and x8 variables, as they are very similar.
Therefore, in this specific model, x7 and x8 are not that important.

Let us try estimating the model a second time to see if we notice any difference:

julia> model = fit_evotree(EvoTreeRegressor(),
                           df;
                           target_name="y",
                           verbosity=0);

julia> EvoTrees.importance(model)
9-element Vector{Pair{String, Float64}}:
 "x9" => 0.33002820995126636
 "x4" => 0.17950260124468856
 "x5" => 0.10630471720405912
 "x7" => 0.1002898622306779
 "x1" => 0.09023808819243322
 "x8" => 0.060680998291169054
 "x3" => 0.04789330560493748
 "x6" => 0.044689013127277216
 "x2" => 0.040373204153491105

The results are identical. You might wonder what the reason for this is. The cause
is that the fit_evotree function uses a default seed when doing
the computations, so we get the same tree twice. To be precise, when we call
EvoTreeRegressor() it sets the seed of the default random number generator in the
current task to 123.

So let us try shuffling the variables to see if we would get a different result:

julia> model = fit_evotree(EvoTreeRegressor(),
                           df[!, randperm(10)];
                           target_name="y",
                           verbosity=0);

julia> EvoTrees.importance(model)
9-element Vector{Pair{String, Float64}}:
 "x9" => 0.23187113718728977
 "x8" => 0.20285271199278873
 "x4" => 0.16779901582722756
 "x5" => 0.15415181545562057
 "x3" => 0.08494533205347177
 "x2" => 0.06781415123236784
 "x7" => 0.05788796227269619
 "x1" => 0.024214826049282448
 "x6" => 0.008463047929255035

Indeed, the variable importance changed. Would it be different again if we did another
randomized run?

julia> model = fit_evotree(EvoTreeRegressor(),
                           df[!, randperm(10)];
                           target_name="y",
                           verbosity=0);

julia> EvoTrees.importance(model)
9-element Vector{Pair{String, Float64}}:
 "x9" => 0.23187113718728977
 "x8" => 0.20285271199278873
 "x4" => 0.16779901582722756
 "x5" => 0.15415181545562057
 "x3" => 0.08494533205347177
 "x2" => 0.06781415123236784
 "x7" => 0.05788796227269619
 "x1" => 0.024214826049282448
 "x6" => 0.008463047929255035

Maybe this was a surprise to you, but the answer is: no. We get the same results.
What is going on?

This time the answer is that the fit_evotree and randperm functions share the same
random number generator (as we run them in the same task) and that its state is reset
(seeded with 123) every time we call EvoTreeRegressor().
This means that when we invoked randperm the generator was in the same state both times, so the
randperm(10) call produced the same sequence of numbers.
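
To see the mechanism in isolation (independently of EvoTrees.jl), here is a minimal sketch showing that resetting the task-local default random number generator makes a later randperm call reproduce the same permutation:

using Random

Random.seed!(123)  # imitates the seed reset performed when the model specification is created
p1 = randperm(10)

Random.seed!(123)  # the generator is reset again before the second call
p2 = randperm(10)

p1 == p2           # true: both calls produce the same permutation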

Variable importance with respect to an algorithm and a data set

We have been given an important general lesson: we need to properly initialize the functions
that use randomness in our code. Let us leverage this knowledge to
try assessing variable importance with respect to the boosted trees algorithm and the data set
we have generated.

What I do in the code below is generate 1000 boosted tree models
and compute the mean variable importance (along with some additional statistics)
across all of them:

julia> rng = Xoshiro(1);

julia> reduce(vcat,
           map(1:1000) do _
               fit_evotree(EvoTreeRegressor(rng=rng),
                           df;
                           target_name="y",
                           verbosity=0) |>
               EvoTrees.importance |>
               sort |>
               DataFrame
           end) |> describe
9×7 DataFrame
 Row │ variable  mean       min         median     max        nmissing  eltype
     │ Symbol    Float64    Float64     Float64    Float64    Int64     DataType
─────┼───────────────────────────────────────────────────────────────────────────
   1 │ x1        0.094902   0.0456523   0.0946136  0.1437            0  Float64
   2 │ x2        0.0475577  0.0109109   0.0469086  0.0969166         0  Float64
   3 │ x3        0.0361729  0.00189622  0.0353323  0.0844639         0  Float64
   4 │ x4        0.128568   0.032614    0.127093   0.285664          0  Float64
   5 │ x5        0.111577   0.00310871  0.109548   0.238868          0  Float64
   6 │ x6        0.0891413  0.0         0.0879858  0.196108          0  Float64
   7 │ x7        0.189893   0.0400903   0.179193   0.417476          0  Float64
   8 │ x8        0.155352   0.013377    0.154884   0.371452          0  Float64
   9 │ x9        0.146837   0.00642872  0.141293   0.397151          0  Float64

This time we see a better separation between variables x1, x2, and x3, followed
by x4, x5, and x6, and finally the x7, x8, and x9 group.
However, we still see some non-negligible within-group differences (and x1 even ranks better than x6).
It seems that just ensuring that we properly pass the random number generator
to the fit_evotree model is not enough.

Let us then do a final attempt. This time we both properly pass the random number generator to
the fit_evotree model and randomize the order of variables in the source data frame
(making sure we also properly pass the random number generator to the randperm function):

julia> reduce(vcat,
           map(1:1000) do _
               fit_evotree(EvoTreeRegressor(rng=rng),
                           df[!, randperm(rng, 10)];
                           target_name="y",
                           verbosity=0) |>
               EvoTrees.importance |>
               sort |>
               DataFrame
           end) |> describe
9×7 DataFrame
 Row │ variable  mean       min         median     max       nmissing  eltype
     │ Symbol    Float64    Float64     Float64    Float64   Int64     DataType
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ x1        0.0595918  0.00260002  0.0524585  0.153613         0  Float64
   2 │ x2        0.0604078  0.00374867  0.0534607  0.148399         0  Float64
   3 │ x3        0.0586352  0.00219417  0.051398   0.140393         0  Float64
   4 │ x4        0.103219   0.00484802  0.101625   0.252127         0  Float64
   5 │ x5        0.119415   0.00146451  0.116338   0.268475         0  Float64
   6 │ x6        0.106432   0.00672514  0.102708   0.254382         0  Float64
   7 │ x7        0.153185   0.00346922  0.139289   0.388995         0  Float64
   8 │ x8        0.166709   0.00651302  0.161872   0.419284         0  Float64
   9 │ x9        0.172406   0.00979053  0.166798   0.444192         0  Float64

This time we have succeeded. While there is still some variability in within-group
variable importance it is small, and the groups are clearly separated. The worst
features have their importance around 6%, the medium value features around 11%,
and the best features around 16%.

Conclusions

Let us recap what we have seen today.

First, we have seen that variable importance with respect to a given concrete instance
of a machine learning model can be significantly different from variable importance
with respect to a given learning algorithm and a given dataset. Therefore, one should
carefully consider which of them is relevant from the perspective of the problem that one
wants to solve.

The second lesson is related to implementation details of
machine learning algorithms:

  1. Many of them use pseudorandom numbers when building a model, and proper
    handling of the pseudorandom number generator is crucial.
  2. The result of model building can depend on the order of features in the
    source data set. In our example we have seen that shuffling the columns
    of an input data table produced significantly different variable
    importance results.

I hope you found these examples useful!

The cord tying problem

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/08/03/laces.html

Introduction

During the JuliaCon 2023 conference I got several
suggestions for writing some more puzzle-solving posts.
Therefore this week I want to present a problem that I have recently
learned from Paweł Prałat:

Assume that you have an even number n of cords of equal length
lying in a bunch. They are arranged in such a way that they are
approximately straight so that you see one end of each cord in one
region (call it left) and the other end in another region
(call it right). Imagine, for example, that the cords were wrapped
around together at their middles. However, this unfortunately
means that you cannot distinguish which end belongs to which cord.
Your task is to tie the ends of the cords in such a way that after
removing the middle wrapping they form a single big loop. Assuming that
you tie the cords randomly (left and right ends separately)
compute the probability that you are going to succeed.

The initial setup of the cords (before we start tying them) is
shown in the figure below (I took the image from this source):

[Figure: cords wrapped together in the middle, with free ends visible on the left and right]

As usual, we are going to solve it analytically and computationally.

The post was written using Julia 1.9.2 and DataFrames.jl 1.6.1.

Analytical solution

Denote by p(n) the probability that we succeed with n cords.

Notice that we can assume that the n right ends of the cords
are first tied (in pairs) in any order.

Now let us analyze tying the left ends of the cords.
We assume that they are tied randomly.
We start tying the left ends by randomly picking cord
a and tying it to a random cord b.
Let us ask when such a tie is a success and when it is a failure.

If n = 2 we know we succeeded. We have just created a loop
using two cords. Thus p(2) = 1.

If n > 2, what we want to avoid is a situation in which the cords
a and b are already tied together on the right side. Why?
Because then we would create a loop
that would be smaller than the required loop including all cords.
The probability that we pick a wrong end of a cord is 1/(n-1)
as we have n-1 ends to choose from and one of them is bad.
Thus we succeed with probability (n-2)/(n-1).

Now observe that, assuming we succeeded, the tie merges the two right-tied pairs
containing a and b into one longer chain that still has exactly two free left ends.
From the point of view of the remaining left ties this chain behaves like a single
pair of cords tied on the right, so we are left with a situation
where we have n-2 cords and the problem has the same structure as
the one we started with.
So we get that for n > 2 we have p(n) = p(n-2)*(n-2)/(n-1).
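
For example, the recursion gives p(4) = p(2) * 2/3 = 2/3 and p(6) = p(4) * 4/5 = 8/15 ≈ 0.533; we will see these values again in the results table at the end of the post.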

This means that we can write (using Julia code notation):

p(n) = prod((i-1)/i for i in n-1:-2:3)

Note that this function works correctly only for even n
that is at least 4. We will give a more careful implementation of it
in the computational solution section below.

However, first, let us ask ourselves if the formula for this function
can be simplified. Indeed we see that it is equivalent to:

prod(i-1 for i in n-1:-2:3) / prod(i for i in n-1:-2:3)

which in turn can be rewritten as:

prod(i-1 for i in n-1:-2:3)^2 / prod(i for i in n-1:-1:2)

Now observe that the numerator is just 2^(n-2)*factorial(n÷2-1)^2
(remember that n is even)
and the denominator is factorial(n-1). Thus the formula further
simplifies to:

2^(n-2)*factorial(n÷2-1)^2 / factorial(n-1)

And finally:

2^(n-2) / ((n-1)*binomial(n-2, n÷2-1))
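
Before moving on, here is a quick sanity check (a sketch using hypothetical helper names p_prod and p_closed) that the closed form agrees with the original product for small even n:

p_prod(n) = prod((i-1)/i for i in n-1:-2:3)
p_closed(n) = 2^(n-2) / ((n-1) * binomial(n-2, n÷2 - 1))
all(p_prod(n) ≈ p_closed(n) for n in 4:2:20)  # should evaluate to true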

Now from Stirling’s approximation we know that:

binomial(n-2, n÷2-1) ~ 2^(n-2) / sqrt((n÷2-1)*pi)

so for sufficiently large n:

p(n) ~ sqrt((n/2-1)*pi) / (n-1)

Thus we learn that the probability of getting a full circle
is of order O(1/sqrt(n)) for large n.

Let us now check these results computationally.

Computational solution

First start with a more careful implementation of the functions
computing p(n):

function connect_exact(n::Integer)
    @assert n > 3 && iseven(n)
    return prod((i-1)/i for i in n-1:-2:3)
end

function connect_approx(n::Integer)
    @assert n > 3 && iseven(n)
    return sqrt(pi * (n / 2 - 1)) / (n - 1)
end

Let us check how close the exact and approximate formulas are
by computing the relative deviation of the approximation
from the exact result:

julia> [1 - connect_approx(n) / connect_exact(n) for n in 4:2:28]
13-element Vector{Float64}:
 0.11377307454724206
 0.060014397013374854
 0.040631211300167
 0.030689300286045995
 0.024649922854770412
 0.02059439568578203
 0.017683822837349372
 0.015493594528168564
 0.013785863139806342
 0.012417071173843053
 0.011295454766000912
 0.01035962441429672
 0.00956696079055186

We see that the approximation is slightly below the exact value and that
the relative deviation decreases as n goes up. With n=28 we are
below a 1% error.

Let us check some larger value of n:

julia> 1 - connect_approx(10000) / connect_exact(10000)
2.5004688326446534e-5

We see that the values are now really close.
If you were afraid that we might be hitting numerical
issues with connect_exact, since it multiplies a lot of
values, we can easily switch to a more precise computation with Julia:

julia> 1 - connect_approx(big(10000)) / connect_exact(big(10000))
2.500468833607760982625749174941669517305288399515417883351990349709823151295288e-05

We see that using normal Float64 was enough for this range of values
of n to get enough accuracy.

But what if we were not sure if our derivation of the formula for p(n)
was correct? We can use simulation to check it.

Here is the implementation of a simulator:

using Random: randperm  # randperm draws the random left-side matching

function connect_sim(n::Integer)
    @assert iseven(n) && n > 3
    # cords are numbered 1:n; on the right they are tied 1-2, 3-4, ...
    # on the left they are tied randomly: left[1]-left[2], left[3]-left[4], ...
    left = randperm(n)
    neis2 = zeros(Int, n)
    for i in 1:2:n
        neis2[left[i]] = left[i+1]
        neis2[left[i+1]] = left[i]
    end
    # walk along the loop starting from cord 1, alternating right and left ties
    prev = 1
    loc = 2
    visited = 2
    while true
        nei1 = isodd(loc) ? loc+1 : loc-1  # neighbor via the right-side tie
        nei2 = neis2[loc]                  # neighbor via the left-side tie
        loc, prev = (prev == nei1 ? nei2 : nei1), loc
        loc == 1 && return visited == n
        visited += 1
    end
end

In the code we assume that the cords are numbered from 1 to n
and that in the right part they are connected 1-2, 3-4, …
(note that we can always re-number them to get this).

The neis2 vector keeps the information about the connections on the left.
To get a random connection pattern we first draw a random n-element
permutation and store it in the left variable. Then we assume that
the connections are formed by cords left[1]-left[2], left[3]-left[4], …,
and we store these connections in the neis2 vector.

Now we are ready to check if this connection pattern is good, that
is, if it creates one big loop. To do this we start from cord 1
and assume that we first moved to cord 2. The current location of
our walk is kept in the variable loc. Then from each cord we move
either along a right tie or along a left tie to the next cord. The nei1 variable keeps
the cord's neighbor on the right and nei2 its neighbor on the left. We keep track in the prev
variable of which cord we visited last. Using this information
we know which move we should make next. Notice that since we started from
1 we eventually have to come back to it. The number of cords visited before reaching
1 again is tracked by the visited variable. If, when loc == 1, we have
visited == n, this means that we have formed one big cycle and
we return true. Otherwise we return false.

Let us check if our simulation indeed returns values close to theoretical
ones. For this we will record the mean of 100,000 runs of our simulation
(and here the power of Julia shines – it is not a problem to run that many
samples). We check the results for the values of n we investigated above:

using DataFrames
using Random
using Statistics
connect_sim_mean(n) =
    mean(connect_sim(n) for _ in 1:100_000)
Random.seed!(1234)
df = DataFrame(n=[4:2:28; 10_000])
transform(df, :n .=> ByRow.([connect_exact,
                             connect_approx,
                             connect_sim_mean]))

The results of running this code are given below:

14×4 DataFrame
 Row │ n      n_connect_exact  n_connect_approx  n_connect_sim_mean
     │ Int64  Float64          Float64           Float64
─────┼──────────────────────────────────────────────────────────────
   1 │     4        0.666667          0.590818              0.66662
   2 │     6        0.533333          0.501326              0.53622
   3 │     8        0.457143          0.438569              0.45843
   4 │    10        0.406349          0.393879              0.40671
   5 │    12        0.369408          0.360302              0.36978
   6 │    14        0.340992          0.33397               0.33996
   7 │    16        0.31826           0.312631              0.31743
   8 │    18        0.299538          0.294897              0.29848
   9 │    20        0.283773          0.279861              0.28399
  10 │    22        0.27026           0.266904              0.26879
  11 │    24        0.25851           0.25559               0.25528
  12 │    26        0.248169          0.245598              0.24755
  13 │    28        0.238978          0.236692              0.24052
  14 │ 10000        0.0125335         0.0125331             0.01266

We can see that simulation results match the exact calculations well.

Conclusions

I hope you liked the puzzle and the solution. Next week I plan to
present the results of some experiments involving machine learning
models in Julia.