Author Archives: Blog by Bogumił Kamiński

A simple coin-tossing game

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/05/24/probability.html

Introduction

I have been writing my blog for over 4 years now (without missing a single week).
My first post was on May 10, 2020, you can find it here.

There is a small change in how I distribute my content. Starting from last week I made the repository of my blog public,
so if you find any mistake please do not hesitate to open a Pull Request here.

To celebrate this I decided to go back to my favorite topic – mathematical puzzles.
Today I use a classic coin-tossing game example

The post was written under Julia 1.10.1 and StatsBase.jl 0.34.4, and FreqTables.jl 0.4.6, BenchmarkTools.jl 1.5.0.

The problem

Assume Alice and Bob toss a fair coin. Alice wins if after tossing a head (H) tail (T) is tossed, that is we see a HT sequence.
Bob wins if two consecutive heads are tossed, that is we see a HH sequence.

The questions are:

  • Who is more likely to win this game?
  • If only Alice played, how long, on the average, would she wait till HT was tossed?
  • If only Bob played, how long, on the average, would she wait till HH was tossed?

Let us try to answer these questions using Julia.

Both players play

This code simulates the situation when Alice and Bob play together:

function both()
    a = rand(('H', 'T'))
    while true
        b = rand(('H', 'T'))
        if a == 'T'
            a = b
        else
            return b == 'H' ? "Bob" : "Alice"
        end
    end
end

The both function returns "Bob" if Bob wins, and "Alice" otherwise.
From the code it should be already clear that both players have the same probability of winning.
The only way to terminate the simulation is return b == 'H' ? "Bob" : "Alice" and this condition is symmetric
with respect to Alice and Bob. Let us confirm this by running a simulation:

julia> using FreqTables, Random

julia> Random.seed!(1234);

julia> freqtable([both() for _ in 1:100_000_000])
2-element Named Vector{Int64}
Dim1  │
──────┼─────────
A     │ 50000012
B     │ 49999988

Indeed, the number of times Alice and Bob win seem to be the same.

Alice’s waiting time

Now let us check how long, on the average, Alice has to wait to see the HT sequence. Here is Alice’s simulator:

function alice()
    a = rand(('H', 'T'))
    i = 1
    while true
        b = rand(('H', 'T'))
        i += 1
        a == 'H' && b == 'T' && return i
        a = b
    end
end

Let us check it:

julia> using StatsBase

julia> describe([alice() for _ in 1:100_000_000])
Summary Stats:
Length:         100000000
Missing Count:  0
Mean:           3.999890
Std. Deviation: 2.000032
Minimum:        2.000000
1st Quartile:   2.000000
Median:         3.000000
3rd Quartile:   5.000000
Maximum:        31.000000
Type:           Int64

So it seems that, in expectation, Alice finishes her game in 4 tosses.
Can we expect the same for Ben (as we remember – if they play together they have the same chances of finishing first)?
Let us see.

Alice’s waiting time

Now let us check how long, on the average, Bob has to wait to see the HH sequence. Here is Bob’s simulator:

function bob()
    a = rand(('H', 'T'))
    i = 1
    while true
        b = rand(('H', 'T'))
        i += 1
        a == 'H' && b == 'H' && return i
        a = b
    end
end

Let us check it:

julia> describe([bob() for _ in 1:100_000_000])
Summary Stats:
Length:         100000000
Missing Count:  0
Mean:           5.999915
Std. Deviation: 4.690177
Minimum:        2.000000
1st Quartile:   2.000000
Median:         5.000000
3rd Quartile:   8.000000
Maximum:        87.000000
Type:           Int64

To our surprise, Bob needs 6 coin tosses, on the average, to see HH.

What is the reason of this difference? Assume we have just tossed H. Start with Bob. If we hit H we finish. If we hit T we then need to wait till we see H again to be able to consider finishing.
However, if we are Alice if we hit T we finish, but if we hit H we do not have to wait for anything – we are already in a state that gives us a chance to finish the game in the next step.

Conclusions

The difference between joint games and separate games is a bit surprising and I hope you found it interesting if you have not seen this puzzle before.
Today I have approached this problem using simulation. However, it is easy to write down a Markov chain representation of all three scenarios and solve them analytically.
I encourage you to try doing this exercise.

PS.

In the code I use the rand(('H', 'T')) form to generate randomness. It is much faster than e.g. writing rand(["H", "T"]) (which would be a first instinct), for two reasons:

  • using Char instead of String is a more lightweight option;
  • using Tuple instead of Vector avoids allocations.

Let us see a comparison of timing (I cut out the histograms from the output):

julia> using BenchmarkTools

julia> @benchmark rand(('H', 'T'))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.900 ns … 233.700 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.500 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.316 ns ±   2.772 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark rand(["H", "T"])
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  15.816 ns …  2.086 μs  ┊ GC (min … max): 0.00% … 96.30%
 Time  (median):     19.019 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.041 ns ± 60.335 ns  ┊ GC (mean ± σ):  9.44% ±  3.63%

 Memory estimate: 64 bytes, allocs estimate: 1.

In this case we could also just use rand(Bool) (as the coin is fair and has only two states):

julia> @benchmark rand(Bool)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.700 ns … 131.000 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.200 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.218 ns ±   2.112 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 0 bytes, allocs estimate: 0.

but as you can see rand(('H', 'T')) has a similar speed and leads to a much more readable code.

DataFrames.jl: avoiding compilation

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/05/17/compilation.html

Introduction

Today I want to go back to a topic of performance considerations of using anonymous functions in combination with DataFrames.jl.
I have written about it in the past, but it is an issue that new users often ask about.
In the post I will explain you the problem, its causes, and how to avoid it.

The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.

Performance issues caused by anonymous functions

Consider the following sequence of DataFrames.jl operations (I am suppressing the output of the operations with ; as it is irrelevant):

julia> using DataFrames

julia> df = DataFrame(x=1:3);

julia> @time select(df, :x => (x -> 2 * x) => :x2);
  0.077134 seconds (73.33 k allocations: 5.010 MiB, 99.28% compilation time)

julia> @time select(df, :x => (x -> 2 * x) => :x2);
  0.013731 seconds (6.30 k allocations: 450.148 KiB, 98.15% compilation time)

julia> @time subset(df, :x => ByRow(x -> x > 1.5));
  0.094046 seconds (91.86 k allocations: 6.219 MiB, 99.29% compilation time)

julia> @time subset(df, :x => ByRow(x -> x > 1.5));
  0.086597 seconds (43.05 k allocations: 2.881 MiB, 42.62% gc time, 99.44% compilation time)

In both select and subset examples I used an anonymous function. In the first case it was x -> 2 * x, and in the second x -> x > 1.5.

What you can notice is that most of the time (even in consecutive calls) is spent in compilation. What is the reason for this?

Let me explain this by example. When we pass the x -> 2 * x function to select, the select function needs to be compiled. Since the select function
is quite complex its compilation time is long.

Why does this happen. The reason is that each time we write x -> 2 * x Julia defines a new anonymous function. Julia compiler does not recognize that it is in fact the same function. Have a look here:

julia> x -> 2 * x
#9 (generic function with 1 method)

julia> x -> 2 * x
#11 (generic function with 1 method)

We can see that we get a different function (one denoted #9 and the other #11) although the definition of the function is identical.

How to solve the compilation issue?

Fortunately, there is a simple way to resolve this problem. Instead of using an anonymous function, just use a named function:

julia> times2(x) = 2 * x
times2 (generic function with 1 method)

julia> @time select(df, :x => times2 => :x2);
  0.013728 seconds (5.63 k allocations: 401.305 KiB, 98.54% compilation time)

julia> @time select(df, :x => times2 => :x2);
  0.000142 seconds (71 allocations: 3.516 KiB)

julia> gt15(x) = x > 1.5
gt15 (generic function with 1 method)

julia> @time subset(df, :x => ByRow(gt15));
  0.041173 seconds (42.64 k allocations: 2.849 MiB, 99.01% compilation time)

julia> @time subset(df, :x => ByRow(gt15));
  0.000165 seconds (120 allocations: 5.648 KiB)

Now you see that consecutive calls are fast and do not cause compilation.

Actually, instead of defining the gt15 function we could just have written >(1.5):

julia> >(1.5)
(::Base.Fix2{typeof(>), Float64}) (generic function with 1 method)

Which defines a functor that works as a named function (so it requires only one compilation):

julia> @time subset(df, :x => ByRow(>(1.5)));
  0.075423 seconds (41.80 k allocations: 2.804 MiB, 99.32% compilation time)

julia> @time subset(df, :x => ByRow(>(1.5)));
  0.000189 seconds (124 allocations: 5.898 KiB)

If you want to learn how functors work in Julia, have a look here.

Conclusions

Today I have presented some simple examples, but I hope that they are useful for new users of DataFrames.jl in helping them to improve the performance of their code.

Julia for Data Analysis Strikes Back

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/05/10/jda.html

Introduction

This week I got a nice little surprise in my office. A year after my Julia for Data Analysis
book has been published I got a package with a set of printed versions of its Korean translation
데이터 분석을 위한 줄리아. It was really a nice experience and I hope that Julia users from Korea will like it.

데이터 분석을 위한 줄리아

Therefore, for today, I decided to discuss a functionality that is little known, but often quite useful.
It is related to adding conditional columns to a data frame.

The post was written under Julia 1.10.1, DataFrames.jl 1.6.1, and DataFramesMeta.jl 0.15.2.

The problem

Assume you have the following data frame:

julia> using DataFrames

julia> df = DataFrame(x=-2.0:0.5:2.0)
9×1 DataFrame
 Row │ x
     │ Float64
─────┼─────────
   1 │    -2.0
   2 │    -1.5
   3 │    -1.0
   4 │    -0.5
   5 │     0.0
   6 │     0.5
   7 │     1.0
   8 │     1.5
   9 │     2.0

Now we want to add a second column to this data frame that contains a square root of column "x".

A basic approach fails:

julia> df.sqrtx = sqrt.(df.x)
ERROR: DomainError with -2.0:
sqrt was called with a negative real argument but will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).

The reason is that we cannot normally take a square root of a negative number.

We can perform a conditional processing for example like this:

julia> df.sqrtx = (x -> x < 0.0 ? missing : sqrt(x)).(df.x)
9-element Vector{Union{Missing, Float64}}:
  missing
  missing
  missing
  missing
 0.0
 0.7071067811865476
 1.0
 1.224744871391589
 1.4142135623730951

julia> df
9×2 DataFrame
 Row │ x        sqrtx
     │ Float64  Float64?
─────┼─────────────────────────
   1 │    -2.0  missing
   2 │    -1.5  missing
   3 │    -1.0  missing
   4 │    -0.5  missing
   5 │     0.0        0.0
   6 │     0.5        0.707107
   7 │     1.0        1.0
   8 │     1.5        1.22474
   9 │     2.0        1.41421

but I do not find this approach very readable (especially from the perspective of a beginner).

The alternative that I prefer is to work with a view of the source data frame. Let us first create such a view that contains all columns of the original data frame, but only rows in which column "x" is non-negative:

julia> dfv = filter(:x => >=(0.0), df, view=true)
5×2 SubDataFrame
 Row │ x        sqrtx
     │ Float64  Float64?
─────┼───────────────────
   1 │     0.0  0.0
   2 │     0.5  0.707107
   3 │     1.0  1.0
   4 │     1.5  1.22474
   5 │     2.0  1.41421

Now, we can add a column to such a view by using a plain sqrt function without any decorations:

julia> dfv.sqrtx2 = sqrt.(dfv.x)
5-element Vector{Float64}:
 0.0
 0.7071067811865476
 1.0
 1.224744871391589
 1.4142135623730951

julia> dfv
5×3 SubDataFrame
 Row │ x        sqrtx     sqrtx2
     │ Float64  Float64?  Float64?
─────┼─────────────────────────────
   1 │     0.0  0.0       0.0
   2 │     0.5  0.707107  0.707107
   3 │     1.0  1.0       1.0
   4 │     1.5  1.22474   1.22474
   5 │     2.0  1.41421   1.41421

julia> df
9×3 DataFrame
 Row │ x        sqrtx           sqrtx2
     │ Float64  Float64?        Float64?
─────┼─────────────────────────────────────────
   1 │    -2.0  missing         missing
   2 │    -1.5  missing         missing
   3 │    -1.0  missing         missing
   4 │    -0.5  missing         missing
   5 │     0.0        0.0             0.0
   6 │     0.5        0.707107        0.707107
   7 │     1.0        1.0             1.0
   8 │     1.5        1.22474         1.22474
   9 │     2.0        1.41421         1.41421

Note that both dfv and df are updated as expected. The filtered-out rows get missing values.

It is important to highlight that this functionality works if the view (SubDataFrame) was created using all columns of the source data frame (like is done in the case of our filter call above).
The reason for this restriction is that if view contained some subset of columns the operation of adding a column would be unsafe (there would be a risk of accidental and unwanted overwrite of a column present in the source data frame that was not included in the view).

This functionality is especially nice in combination with DataFramesMeta.jl, just have a look:

julia> @chain df begin
           @rsubset(:x >= 0; view=true)
           @rtransform!(:sqrtx3 = sqrt(:x))
           parent
       end
9×4 DataFrame
 Row │ x        sqrtx           sqrtx2          sqrtx3
     │ Float64  Float64?        Float64?        Float64?
─────┼─────────────────────────────────────────────────────────
   1 │    -2.0  missing         missing         missing
   2 │    -1.5  missing         missing         missing
   3 │    -1.0  missing         missing         missing
   4 │    -0.5  missing         missing         missing
   5 │     0.0        0.0             0.0             0.0
   6 │     0.5        0.707107        0.707107        0.707107
   7 │     1.0        1.0             1.0             1.0
   8 │     1.5        1.22474         1.22474         1.22474
   9 │     2.0        1.41421         1.41421         1.41421

In the code above I used parent in the last step to recover the source df.

As a final comment note that an alternative in DataFramesMeta.jl is to just use a plain @rtransform! macro:

julia> @rtransform!(df, :sqrtx4 = :x < 0 ? missing : sqrt(:x))
9×5 DataFrame
 Row │ x        sqrtx           sqrtx2          sqrtx3          sqrtx4
     │ Float64  Float64?        Float64?        Float64?        Float64?
─────┼─────────────────────────────────────────────────────────────────────────
   1 │    -2.0  missing         missing         missing         missing
   2 │    -1.5  missing         missing         missing         missing
   3 │    -1.0  missing         missing         missing         missing
   4 │    -0.5  missing         missing         missing         missing
   5 │     0.0        0.0             0.0             0.0             0.0
   6 │     0.5        0.707107        0.707107        0.707107        0.707107
   7 │     1.0        1.0             1.0             1.0             1.0
   8 │     1.5        1.22474         1.22474         1.22474         1.22474
   9 │     2.0        1.41421         1.41421         1.41421         1.41421

In this case it also quite clean.

Conclusions

I am really happy that we have a Korean version of Julia for Data Analysis.

I hope that the example transformations I have shown today were useful and improved your knowledge of DataFrames.jl and DataFramesMeta.jl packages.