WAW2024 conference: June 3-6, 2024

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/05/31/waw2024.html

Introduction

Next week I am organizing the WAW2024 conference. The event covers various aspects of theoretical and applied modeling of networks.

As an introduction I want to run a simulation of an example problem. Consider a random graph in which an edge between any two nodes is present with probability p. Next, assume that we repeatedly pick an edge uniformly at random from this graph and remove the two nodes forming this edge, marking them as matched, until no edge between two unmatched nodes remains. The question is: what is the expected fraction of nodes that get matched by this process?

Today, I will investigate this problem using simulation.

The post was written under Julia 1.10.1, Graphs.jl 1.11.0, and DataFrames.jl 1.6.1.

The simulation

Here is a simulator of our greedy matching process. In the simulation we traverse all edges of the graph in a random order.
In the matched vector we keep track of which nodes have already been matched.

using Graphs
using Random
using Statistics

function run_sim(n::Integer, p::Real)
    g = erdos_renyi(n, p)
    matched = fill(false, n)
    for e in shuffle!(collect(edges(g)))
        n1, n2 = e.src, e.dst
        if !(matched[n1] || matched[n2])
            matched[n1] = true
            matched[n2] = true
        end
    end
    return mean(matched)
end
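Before running the full experiment, it may help to sanity-check the simulator on two extreme cases where the answer is known (a quick sketch; run_sim is repeated here so the snippet is self-contained):

```julia
using Graphs
using Random
using Statistics

# run_sim as defined above
function run_sim(n::Integer, p::Real)
    g = erdos_renyi(n, p)
    matched = fill(false, n)
    for e in shuffle!(collect(edges(g)))
        n1, n2 = e.src, e.dst
        if !(matched[n1] || matched[n2])
            matched[n1] = true
            matched[n2] = true
        end
    end
    return mean(matched)
end

run_sim(100, 0.0)  # 0.0: with no edges nothing can be matched
run_sim(4, 1.0)    # should be 1.0, assuming p = 1 yields the complete graph;
                   # a greedy maximal matching on a complete graph with an
                   # even number of nodes is always perfect
```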

The experiment

Let us now test our simulator on a graph with 10,000 nodes and p varying from 0.00001 to 0.1 (on a logarithmic scale).

julia> using DataFrames

julia> df = DataFrame(p=Float64[], rep=Int[], res=Float64[])
0×3 DataFrame
 Row │ p        rep    res
     │ Float64  Int64  Float64
─────┴─────────────────────────

julia> ps = [10.0^i for i in -5:-1]
5-element Vector{Float64}:
 1.0e-5
 0.0001
 0.001
 0.010000000000000002
 0.1

julia> Random.seed!(1234);

julia> @time for p in ps, rep in 1:16
           push!(df, (p, rep, run_sim(10_000, p)))
       end
 79.190585 seconds (438.02 M allocations: 14.196 GiB, 7.13% gc time, 0.36% compilation time)

julia> df
80×3 DataFrame
 Row │ p        rep    res
     │ Float64  Int64  Float64
─────┼─────────────────────────
   1 │  1.0e-5      1   0.0948
   2 │  1.0e-5      2   0.094
   3 │  1.0e-5      3   0.097
   4 │  1.0e-5      4   0.0892
   5 │  1.0e-5      5   0.0848
   6 │  1.0e-5      6   0.093
  ⋮  │    ⋮       ⋮       ⋮
  76 │  0.1        12   0.9992
  77 │  0.1        13   0.999
  78 │  0.1        14   0.999
  79 │  0.1        15   0.999
  80 │  0.1        16   0.9988
                69 rows omitted

The simulation took a bit over 1 minute, mainly due to the p=0.1 case, which generates a lot of edges in the graph.
Let us aggregate the obtained data to get the mean, standard error, and range of the results for each value of p:

julia> combine(groupby(df, "p"),
               "p" => (x -> 10_000 * first(x)) => "mean_degree",
               "res" => mean,
               "res" => (x -> std(x) / sqrt(length(x))) => "res_se",
               "res" => extrema)
5×5 DataFrame
 Row │ p        mean_degree  res_mean  res_se       res_extrema
     │ Float64  Float64      Float64   Float64      Tuple…
─────┼───────────────────────────────────────────────────────────────
   1 │  1.0e-5          0.1  0.090975  0.000842986  (0.0848, 0.097)
   2 │  0.0001          1.0  0.499888  0.00190971   (0.4848, 0.5134)
   3 │  0.001          10.0  0.909425  0.000523729  (0.9062, 0.9126)
   4 │  0.01          100.0  0.990162  0.000257694  (0.9888, 0.992)
   5 │  0.1          1000.0  0.999     5.47723e-5   (0.9986, 0.9994)

We can see that the sharp increase in the fraction of matched nodes happens around a mean degree of 1 in the graph.
Additionally, we see that even for high p we do not match every node in the greedy matching process.
Finally, the obtained results are relatively well concentrated around the mean.

Conclusions

If you want to see how this problem can be solved analytically, I recommend reading this paper.
Using the formulas derived there we can compare our simulation results with the asymptotic theory:

julia> (10_000 .* ps) ./ (10_000 .* ps .+ 1)
5-element Vector{Float64}:
 0.09090909090909091
 0.5
 0.9090909090909091
 0.9900990099009901
 0.999000999000999

Indeed we see that the match is quite good.

If such problems are interesting to you, I invite you to join us during the WAW2024 conference.

A simple coin-tossing game

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/05/24/probability.html

Introduction

I have been writing my blog for over 4 years now (without missing a single week).
My first post was published on May 10, 2020; you can find it here.

There is a small change in how I distribute my content. Starting from last week I made the repository of my blog public,
so if you find any mistake please do not hesitate to open a Pull Request here.

To celebrate this I decided to go back to my favorite topic – mathematical puzzles.
Today I use a classic coin-tossing game as an example.

The post was written under Julia 1.10.1, StatsBase.jl 0.34.4, FreqTables.jl 0.4.6, and BenchmarkTools.jl 1.5.0.

The problem

Assume Alice and Bob toss a fair coin. Alice wins if a head (H) is immediately followed by a tail (T), that is, we see an HT sequence.
Bob wins if two consecutive heads are tossed, that is, we see an HH sequence.

The questions are:

  • Who is more likely to win this game?
  • If only Alice played, how long, on the average, would she wait till HT was tossed?
  • If only Bob played, how long, on the average, would he wait till HH was tossed?

Let us try to answer these questions using Julia.

Both players play

This code simulates the situation when Alice and Bob play together:

function both()
    a = rand(('H', 'T'))
    while true
        b = rand(('H', 'T'))
        if a == 'T'
            a = b
        else
            return b == 'H' ? "Bob" : "Alice"
        end
    end
end

The both function returns "Bob" if Bob wins and "Alice" otherwise.
From the code it should already be clear that both players have the same probability of winning:
the only way to terminate the simulation is the return b == 'H' ? "Bob" : "Alice" statement, and this condition is symmetric
with respect to Alice and Bob. Let us confirm this by running a simulation:

julia> using FreqTables, Random

julia> Random.seed!(1234);

julia> freqtable([both() for _ in 1:100_000_000])
2-element Named Vector{Int64}
Dim1  │
──────┼─────────
A     │ 50000012
B     │ 49999988

Indeed, the numbers of times Alice and Bob win appear to be essentially the same.

Alice’s waiting time

Now let us check how long, on the average, Alice has to wait to see the HT sequence. Here is Alice’s simulator:

function alice()
    a = rand(('H', 'T'))
    i = 1
    while true
        b = rand(('H', 'T'))
        i += 1
        a == 'H' && b == 'T' && return i
        a = b
    end
end

Let us check it:

julia> using StatsBase

julia> describe([alice() for _ in 1:100_000_000])
Summary Stats:
Length:         100000000
Missing Count:  0
Mean:           3.999890
Std. Deviation: 2.000032
Minimum:        2.000000
1st Quartile:   2.000000
Median:         3.000000
3rd Quartile:   5.000000
Maximum:        31.000000
Type:           Int64

So it seems that, in expectation, Alice finishes her game in 4 tosses.
Can we expect the same for Bob (as we remember – if they play together they have the same chances of finishing first)?
Let us see.

Bob’s waiting time

Now let us check how long, on the average, Bob has to wait to see the HH sequence. Here is Bob’s simulator:

function bob()
    a = rand(('H', 'T'))
    i = 1
    while true
        b = rand(('H', 'T'))
        i += 1
        a == 'H' && b == 'H' && return i
        a = b
    end
end

Let us check it:

julia> describe([bob() for _ in 1:100_000_000])
Summary Stats:
Length:         100000000
Missing Count:  0
Mean:           5.999915
Std. Deviation: 4.690177
Minimum:        2.000000
1st Quartile:   2.000000
Median:         5.000000
3rd Quartile:   8.000000
Maximum:        87.000000
Type:           Int64

To our surprise, Bob needs 6 coin tosses, on the average, to see HH.

What is the reason for this difference? Assume we have just tossed H. Start with Bob: if the next toss is H we finish, but if it is T we have to wait until we see H again before we can even hope to finish.
For Alice, if the next toss is T we finish, and if it is H we lose nothing – we are still in a state that gives us a chance to finish the game in the next step.

Conclusions

The difference between the joint game and the separate games is a bit surprising, and I hope you found it interesting if you have not seen this puzzle before.
Today I have approached this problem using simulation. However, it is easy to write down a Markov chain representation of all three scenarios and solve them analytically.
I encourage you to try doing this exercise.
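As a starting point for that exercise, here is a minimal sketch of the first-step analysis for the two separate waiting times. The state "S1" (last toss was H) and the 2×2 linear systems below are my own formulation, solved with the backslash operator:

```julia
using LinearAlgebra

# Waiting time for HH. Let E0 = expected remaining tosses from scratch
# and E1 = expected remaining tosses given the last toss was H:
#   E0 = 1 + E0/2 + E1/2   (T keeps us at S0, H moves us to S1)
#   E1 = 1 + E0/2          (H finishes, T sends us back to S0)
A_hh = [0.5 -0.5; -0.5 1.0]
E_hh = A_hh \ [1.0, 1.0]   # E_hh[1] == 6.0

# Waiting time for HT:
#   E0 = 1 + E0/2 + E1/2   (T keeps us at S0, H moves us to S1)
#   E1 = 1 + E1/2          (T finishes, H keeps us at S1)
A_ht = [0.5 -0.5; 0.0 0.5]
E_ht = A_ht \ [1.0, 1.0]   # E_ht[1] == 4.0
```

Both answers agree with the simulated means above: 6 tosses for HH and 4 for HT.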

PS.

In the code I use the rand(('H', 'T')) form to generate randomness. It is much faster than e.g. writing rand(["H", "T"]) (which would be a first instinct) for two reasons:

  • using Char instead of String is a more lightweight option;
  • using Tuple instead of Vector avoids allocations.
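The allocation claim can also be checked directly with @allocated (a quick sketch; exact byte counts may differ across Julia versions):

```julia
# Tuple-based toss: the tuple is stack-allocated, so the call is allocation-free.
toss_tuple() = rand(('H', 'T'))
# Vector-based toss: a fresh vector is heap-allocated on every call.
toss_vector() = rand(["H", "T"])

# warm up so we measure the calls, not their compilation
toss_tuple(); toss_vector()

@allocated(toss_tuple())   # 0 bytes
@allocated(toss_vector())  # > 0 bytes (one vector allocation per call)
```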

Let us see a comparison of timing (I cut out the histograms from the output):

julia> using BenchmarkTools

julia> @benchmark rand(('H', 'T'))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.900 ns … 233.700 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.500 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.316 ns ±   2.772 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark rand(["H", "T"])
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  15.816 ns …  2.086 μs  ┊ GC (min … max): 0.00% … 96.30%
 Time  (median):     19.019 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.041 ns ± 60.335 ns  ┊ GC (mean ± σ):  9.44% ±  3.63%

 Memory estimate: 64 bytes, allocs estimate: 1.

In this case we could also just use rand(Bool) (as the coin is fair and has only two states):

julia> @benchmark rand(Bool)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.700 ns … 131.000 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.200 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.218 ns ±   2.112 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 0 bytes, allocs estimate: 0.

but, as you can see, rand(('H', 'T')) has a similar speed and leads to much more readable code.

DataFrames.jl: avoiding compilation

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/05/17/compilation.html

Introduction

Today I want to go back to the topic of performance considerations when using anonymous functions in combination with DataFrames.jl.
I have written about it in the past, but it is an issue that new users often ask about.
In this post I will explain the problem, its causes, and how to avoid it.

The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.

Performance issues caused by anonymous functions

Consider the following sequence of DataFrames.jl operations (I am suppressing the output of the operations with ; as it is irrelevant):

julia> using DataFrames

julia> df = DataFrame(x=1:3);

julia> @time select(df, :x => (x -> 2 * x) => :x2);
  0.077134 seconds (73.33 k allocations: 5.010 MiB, 99.28% compilation time)

julia> @time select(df, :x => (x -> 2 * x) => :x2);
  0.013731 seconds (6.30 k allocations: 450.148 KiB, 98.15% compilation time)

julia> @time subset(df, :x => ByRow(x -> x > 1.5));
  0.094046 seconds (91.86 k allocations: 6.219 MiB, 99.29% compilation time)

julia> @time subset(df, :x => ByRow(x -> x > 1.5));
  0.086597 seconds (43.05 k allocations: 2.881 MiB, 42.62% gc time, 99.44% compilation time)

In both select and subset examples I used an anonymous function. In the first case it was x -> 2 * x, and in the second x -> x > 1.5.

What you can notice is that most of the time (even in consecutive calls) is spent in compilation. What is the reason for this?

Let me explain this by example. When we pass the x -> 2 * x function to select, the select function needs to be compiled for it. Since the select function
is quite complex, its compilation takes a long time.

Why does this happen? The reason is that each time we write x -> 2 * x, Julia defines a new anonymous function. The Julia compiler does not recognize that it is in fact the same function. Have a look here:

julia> x -> 2 * x
#9 (generic function with 1 method)

julia> x -> 2 * x
#11 (generic function with 1 method)

We can see that we get a different function (one denoted #9 and the other #11) although the definition of the function is identical.
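This can be made concrete by comparing the types of the closures (a small sketch; the numbers in the printed names will differ between sessions):

```julia
# Each anonymous-function literal gets its own closure type,
# while a named function has a single fixed type.
f1 = x -> 2 * x
f2 = x -> 2 * x
times2(x) = 2 * x

typeof(f1) == typeof(f2)          # false, despite identical definitions
f1(3) == f2(3) == times2(3) == 6  # true: they all compute the same thing
```

Since method compilation is keyed on argument types, select compiled for typeof(f1) cannot be reused for typeof(f2), while every call with times2 hits the same compiled method.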

How to solve the compilation issue?

Fortunately, there is a simple way to resolve this problem. Instead of using an anonymous function, just use a named function:

julia> times2(x) = 2 * x
times2 (generic function with 1 method)

julia> @time select(df, :x => times2 => :x2);
  0.013728 seconds (5.63 k allocations: 401.305 KiB, 98.54% compilation time)

julia> @time select(df, :x => times2 => :x2);
  0.000142 seconds (71 allocations: 3.516 KiB)

julia> gt15(x) = x > 1.5
gt15 (generic function with 1 method)

julia> @time subset(df, :x => ByRow(gt15));
  0.041173 seconds (42.64 k allocations: 2.849 MiB, 99.01% compilation time)

julia> @time subset(df, :x => ByRow(gt15));
  0.000165 seconds (120 allocations: 5.648 KiB)

Now you see that consecutive calls are fast and do not cause compilation.

Actually, instead of defining the gt15 function, we could have just written >(1.5):

julia> >(1.5)
(::Base.Fix2{typeof(>), Float64}) (generic function with 1 method)

This defines a functor that works like a named function (so it requires only one compilation):

julia> @time subset(df, :x => ByRow(>(1.5)));
  0.075423 seconds (41.80 k allocations: 2.804 MiB, 99.32% compilation time)

julia> @time subset(df, :x => ByRow(>(1.5)));
  0.000189 seconds (124 allocations: 5.898 KiB)

If you want to learn how functors work in Julia, have a look here.
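To illustrate the idea, here is a hand-rolled functor mimicking >(1.5). This is a sketch with a made-up name; Base.Fix2, shown above, is what Julia actually uses:

```julia
# A functor is a struct made callable by defining a call method for it.
struct GreaterThan   # hypothetical analogue of Base.Fix2(>, 1.5)
    threshold::Float64
end

(g::GreaterThan)(x) = x > g.threshold

gt = GreaterThan(1.5)
gt(2.0)  # true
gt(1.0)  # false
```

Because GreaterThan is a single concrete type, code compiled for it is reused no matter how many instances we create, just like with a named function.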

Conclusions

Today I have presented some simple examples, but I hope they are useful for new users of DataFrames.jl and help them improve the performance of their code.