Author Archives: Blog by Bogumił Kamiński

My understanding of object property access in Julia

Re-posted from: https://bkamins.github.io/julialang/2024/01/26/fields.html

Introduction

Today I wanted to discuss a conceptual aspect of Julia programming.
It is related to the question how you should query some object for its properties.
The topic is especially relevant if you want to write code that is expected to be stable
in the longer term, that means that it is easy to maintain as versions of its dependencies change.

The post was written under Julia 1.10.0 and DataFrames.jl 1.6.1.

The internals

A fundamental element of Julia design are composite types. This kind of object
is a collection of fields, that have names. Each of such fields can hold some value.

To make things non-abstract let us have a look at a SubDataFrame type from DataFrames.jl.
First create an instance of such object:

julia> using DataFrames

julia> df = DataFrame(x=1:3, y=11:13, z=111:113)
3×3 DataFrame
 Row │ x      y      z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1     11    111
   2 │     2     12    112
   3 │     3     13    113

julia> sdf = @view df[1:2, 1:2]
2×2 SubDataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12

To check what fields SubDataFrame contains you can use the the fieldnames function:

julia> fieldnames(SubDataFrame)
(:parent, :colindex, :rows)

Note that we pass a type to fieldnames. It is important – the list of fields is fixed for every
instance of an object of a given type.

In this case we learned that SubDataFrame has three fields. The three functions associated with
fieldnames are: fieldcount returning the number of fields of a type,
fieldtypes returning their declared types, and hasfield allowing you
to query if a specific field is present. There is an example:

julia> fieldcount(SubDataFrame)
3

julia> fieldtypes(SubDataFrame)
(AbstractDataFrame, DataFrames.AbstractIndex, AbstractVector{Int64})

julia> hasfield(SubDataFrame, :parent)
true

julia> hasfield(SubDataFrame, :parentx)
false

For a given instance of a type you can query the field with getfield and set it with setfield!.
For example, let us get the field :parent of our sdf object (a source data frame in this case):

julia> getfield(sdf, :parent)
3×3 DataFrame
 Row │ x      y      z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1     11    111
   2 │     2     12    112
   3 │     3     13    113

Having learned all these methods you might ask yourself when to use it. The short answer is:

Never directly access fields of a type. They might be changed
between versions of code you use without warning.

The longer answer is that you should assume that direct field access is typically considered internal.
The list and fields and their types are an implementation detail and as a user of this type you should
not rely on them. The use of property access is restricted to the designers of a type to allow them
manipulate its inner physical representation.

So how should we work with composite types then?

The composite type interface

Julia introduces a concept of property that is a logical representation of data stored in a given object.
You can query for properties of an object with the propertynames function. You also have the hasproperty,
getproperty and setproperty! functions similar as for fields.

In case of our sdf SubDataFrame we have the following logical representation:

julia> propertynames(sdf)
2-element Vector{Symbol}:
 :x
 :y

julia> hasproperty(sdf, :x)
true

julia> getproperty(sdf, :x)
2-element view(::Vector{Int64}, 1:2) with eltype Int64:
 1
 2

julia> setproperty!(sdf, :x, [1001, 1002])
2-element Vector{Int64}:
 1001
 1002

julia> sdf
2×2 SubDataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │  1001     11
   2 │  1002     12

We immediately see a significant difference. The sdf properties in this case are columns of our data frame.
We do not care how they are mapped to a physical representation of SubDataFrame, this is taken care of
by designers of the DataFrames.jl package.

There are the following important aspects of properties.

The first is that property access is typically considered a public API.
Designers of the type should make sure that the way you can access properties
of an object should remain stable and a change in this area would be breaking, so:

You should access properties of objects in your code (not fields).

The second is that properties are bound to object, not to a type.
This means that different objects of the same type may have different sets of properties.
It is quite useful, e.g. each data frame can have a different set of columns.

The third, practical, information is that by default properties fall back to fields,
as you can read here in the Julia Manual.

The next aspect is convenient syntax.
You do not need to call the getproperty and setproperty! functions explicitly.
The getproperty(a, :b) is equivalent to a.b, and setproperty!(a, :b, v) is the same as a.b = v.

Finally note that the propertynames function optionally takes a second positional argument
that is Bool. If it is passed and set to true you get a list of all properties of some object.
By default the second argument is false and you get a list of public properties of some object
(and in practice you should use the default mode).

Conclusions

Today I have a short conclusion.

Fields represent physical layout of a type.
Properties represent a logical view of an object.

In your code use object properties and not their fields.
Field access is considered internal and typically should be only done by developers of a
package providing a given object.

A little exercise in CSV.jl and DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/01/19/puzzles.html

Introduction

This week I have discussed with my colleague the Lichess puzzle dataset
that I use in my Julia for Data Analysis book.

The dataset contains a list of puzzles along with information about them,
such as puzzle difficulty, puzzle solution, and tags describing puzzle type.

We were discussing if tags assigned to puzzles in this dataset are accurate.
In this post I give you an example how one can check it
(and practice a bit CSV.jl and DataFrames.jl).

The post was written under Julia 1.10.0, CSV.jl 0.10.12, and DataFrames.jl 1.6.1.

Getting the data

In this post I show you a relatively brief code. Therefore I assume that first
you download the file with the puzzle dataset and unpack it manually.
(In the book I show how to do it using Julia. You can find the source code on
GitHub repository of the book.)

Assuming you downloaded and unpacked the dataset into the puzzles.csv file
we read it in. We are interested only in columns 3 and 8 of this file,
so I use the following commands:

julia> using CSV

julia> using DataFrames

julia> df = CSV.read("puzzles.csv", DataFrame; select=[3, 8], header=false)
2132989×2 DataFrame
     Row │ Column3                            Column8
         │ String                             String
─────────┼──────────────────────────────────────────────────────────────────────
       1 │ f2g3 e6e7 b2b1 b3c1 b1c1 h6c1      crushing hangingPiece long middl…
       2 │ d3d6 f8d8 d6d8 f6d8                advantage endgame short
       3 │ b6c5 e2g4 h3g4 d1g4                advantage middlegame short
       4 │ g5e7 a5c3 b2c3 c6e7                advantage master middlegame short
       5 │ e8f7 e2e6 f7f8 e6f7                mate mateIn2 middlegame short
       6 │ a6a5 e5c7 a5b4 c7d8                crushing endgame fork short
       7 │ d4b6 f6e4 h1g1 e4f2                crushing endgame short trappedPi…
       8 │ d8f6 d1h5 h7h6 h5c5                advantage middlegame short
    ⋮    │                 ⋮                                  ⋮
 2132982 │ d2c2 c5d3 c2d3 c4d3                crushing fork middlegame short
 2132983 │ b8d7 c3b5 d6b8 a1c1 e8g8 b5c7      crushing long middlegame quietMo…
 2132984 │ g7g6 d5c6 c5c4 b3c4 b4c4 c6d6      crushing defensiveMove endgame l…
 2132985 │ g1h1 e3e1 f7f1 e1f1                endgame mate mateIn2 short
 2132986 │ g5c1 d5d6 d7f6 h7h8                advantage middlegame short
 2132987 │ d2f3 d8a5 c1d2 a5b5                advantage fork opening short
 2132988 │ f7f2 b2c2 c1b1 e2d1                endgame mate mateIn2 queensideAt…
 2132989 │ c6d4 f1e1 e8d8 b1c3 d4f3 g2f3      advantage long opening
                                                            2132973 rows omitted

julia> rename!(df, ["moves", "tags"])
2132989×2 DataFrame
     Row │ moves                              tags
         │ String                             String
─────────┼──────────────────────────────────────────────────────────────────────
       1 │ f2g3 e6e7 b2b1 b3c1 b1c1 h6c1      crushing hangingPiece long middl…
       2 │ d3d6 f8d8 d6d8 f6d8                advantage endgame short
       3 │ b6c5 e2g4 h3g4 d1g4                advantage middlegame short
       4 │ g5e7 a5c3 b2c3 c6e7                advantage master middlegame short
       5 │ e8f7 e2e6 f7f8 e6f7                mate mateIn2 middlegame short
       6 │ a6a5 e5c7 a5b4 c7d8                crushing endgame fork short
       7 │ d4b6 f6e4 h1g1 e4f2                crushing endgame short trappedPi…
       8 │ d8f6 d1h5 h7h6 h5c5                advantage middlegame short
    ⋮    │                 ⋮                                  ⋮
 2132982 │ d2c2 c5d3 c2d3 c4d3                crushing fork middlegame short
 2132983 │ b8d7 c3b5 d6b8 a1c1 e8g8 b5c7      crushing long middlegame quietMo…
 2132984 │ g7g6 d5c6 c5c4 b3c4 b4c4 c6d6      crushing defensiveMove endgame l…
 2132985 │ g1h1 e3e1 f7f1 e1f1                endgame mate mateIn2 short
 2132986 │ g5c1 d5d6 d7f6 h7h8                advantage middlegame short
 2132987 │ d2f3 d8a5 c1d2 a5b5                advantage fork opening short
 2132988 │ f7f2 b2c2 c1b1 e2d1                endgame mate mateIn2 queensideAt…
 2132989 │ c6d4 f1e1 e8d8 b1c3 d4f3 g2f3      advantage long opening
                                                            2132973 rows omitted

Note that the file does not have a header so when reading it we passed header=false
and then manually named the columns using rename!.

The task

I wanted only these two columns since today I want to check if the tags related
to mating are accurate. You can notice in the above printout that in the "tags"
column we have a tag "mateIn2". It indicates that the puzzle is mate in two moves.
This is the case for example for rows 5, 2132985, and 2132988.
In the matching "moves" column we see that we have 4 corresponding moves.
The reason is that we have two players making the move (and 2 + 2 = 4).

What we want to check if these "mateInX" tags are correct. I will check the
values of X from 1 to 5 (as only these five options are present in tags,
I leave it to you as an exercise to verify).

When should we call the tags correct. There are two conditions:

there is no duplicate tagging (e.g. a puzzle cannot be "mateIn1" and "mateIn2" at the same time);
the number of moves in a puzzle matches the tag.

Let us check it.

The solution

As a first step we (in place, i.e. modifying our df data frame) transform the original columns
into more convenient form. Instead of the raw "moves" I want the "nmoves" column that gives me
a number of moves in the puzzle. Similarly instead of "tags" I want indicator columns "mateInX"
for X ranging from 1 to 5 showing me the puzzle type. Here is how you can achieve this:

julia> select!(df,
               "moves" => ByRow(length∘split) => "nmoves",
               ["tags" => ByRow(contains("mateIn$i")) => "mateIn$i" for i in 1:5])
2132989×6 DataFrame
     Row │ nmoves  mateIn1  mateIn2  mateIn3  mateIn4  mateIn5
         │ Int64   Bool     Bool     Bool     Bool     Bool
─────────┼─────────────────────────────────────────────────────
       1 │      6    false    false    false    false    false
       2 │      4    false    false    false    false    false
       3 │      4    false    false    false    false    false
       4 │      4    false    false    false    false    false
       5 │      4    false     true    false    false    false
       6 │      4    false    false    false    false    false
       7 │      4    false    false    false    false    false
       8 │      4    false    false    false    false    false
    ⋮    │   ⋮        ⋮        ⋮        ⋮        ⋮        ⋮
 2132982 │      4    false    false    false    false    false
 2132983 │      6    false    false    false    false    false
 2132984 │      6    false    false    false    false    false
 2132985 │      4    false     true    false    false    false
 2132986 │      4    false    false    false    false    false
 2132987 │      4    false    false    false    false    false
 2132988 │      4    false     true    false    false    false
 2132989 │      6    false    false    false    false    false
                                           2132973 rows omitted

Now we see that some of the rows are not tagged as "mateInX". Let us filter them out,
to have only tagged rows left (again, we do the operation in-place):

julia> filter!(row -> any(row[Not("nmoves")]), df)
491743×6 DataFrame
    Row │ nmoves  mateIn1  mateIn2  mateIn3  mateIn4  mateIn5
        │ Int64   Bool     Bool     Bool     Bool     Bool
────────┼─────────────────────────────────────────────────────
      1 │      4    false     true    false    false    false
      2 │      4    false     true    false    false    false
      3 │      2     true    false    false    false    false
      4 │      4    false     true    false    false    false
      5 │      2     true    false    false    false    false
      6 │      4    false     true    false    false    false
      7 │      4    false     true    false    false    false
      8 │      2     true    false    false    false    false
   ⋮    │   ⋮        ⋮        ⋮        ⋮        ⋮        ⋮
 491736 │      6    false    false     true    false    false
 491737 │      4    false     true    false    false    false
 491738 │      2     true    false    false    false    false
 491739 │      4    false     true    false    false    false
 491740 │      2     true    false    false    false    false
 491741 │      2     true    false    false    false    false
 491742 │      4    false     true    false    false    false
 491743 │      4    false     true    false    false    false
                                           491727 rows omitted

Note that in the condition I used the row[Not("nmoves")] selector, as I wanted to check all columns except the "nmoves".

Now we are ready to check the correctness of tags:

julia> combine(groupby(df, "nmoves"), Not("nmoves") .=> sum)
10×6 DataFrame
 Row │ nmoves  mateIn1_sum  mateIn2_sum  mateIn3_sum  mateIn4_sum  mateIn5_sum
     │ Int64   Int64        Int64        Int64        Int64        Int64
─────┼─────────────────────────────────────────────────────────────────────────
   1 │      2       136843            0            0            0            0
   2 │      4            0       274135            0            0            0
   3 │      6            0            0        68623            0            0
   4 │      8            0            0            0         9924            0
   5 │     10            0            0            0            0         1691
   6 │     12            0            0            0            0          367
   7 │     14            0            0            0            0          127
   8 │     16            0            0            0            0           25
   9 │     18            0            0            0            0            7
  10 │     20            0            0            0            0            1

The table reads as follows:

There are no duplicates in tags.
Tags "mateInX" for X in 1 to 4 range are correct.
The "mateIn5" tag actually means a situation where there are five or more moves.

So the verdict is that tagging is correct, but we need to know the interpretation of
"mateIn5" column as it is actually five or more moves. We could rename the column to
e.g. "mateIn5+" to reflect that or add a metadata to our df table where we would store
this information (I leave this to you as an exercise).

Conclusions

I hope that CSV.jl and DataFrames.jl users found the examples that I gave today useful and interesting. Enjoy!

What to do when you are stuck with Julia?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/01/12/rgg.html

Introduction

Today I decided to write about code refactoring in Julia.
This is a topic that is, in my experience, a quite big advantage of this language.

A common situation you are faced with when writing your code is as follows.
You need some functionality in your program and it is available in some library.
However, what the library provides does not meet your expectations.
Since in Julia most packages are written in Julia under MIT license, it is easy to solve this issue.
You just take the source code and modify it.

Today I want to show you a practical example of such a situation I have had this week
when working with the Graphs.jl package.

The post was written using Julia 1.10.0, BenchmarkTools.jl 1.4.0, and Graphs.jl 1.9.0.

The problem

In my work I needed to generate random geometric graphs.
This is a simple random graph model that works as follows
(here I describe a general idea, for details please check the
Wikipedia entry on random geometric graphs).
To generate a graph on N vertices you first drop N random points
in some metric space. Next you connect two points with an edge
if their distance is less than some pre-specified distance.

The Graphs.jl library provides the euclidean_graph function
that generates such graphs. Here is a summary of its docstring:

euclidean_graph(N, d; rng=nothing, seed=nothing, L=1., p=2., cutoff=-1., bc=:open)

Generate N uniformly distributed points in the box [0,L]^{d}
and return a Euclidean graph, a map containing the distance on each
edge and a matrix with the points' positions.

An edge between vertices x[i] and x[j] is inserted if norm(x[i]-x[j], p) < cutoff.
In case of negative cutoff instead every edge is inserted.
Set bc=:periodic to impose periodic boundary conditions in the box [0,L]^d.

So what is the problem with this function? Unfortunately it is slow.
Let us, for example check how long it takes to compute an average degree
of a node in such a graph with n nodes and cutoff=sqrt(10/n), when setting
bc=:periodic (periodic boundary, i.e. distance is measured on a torus) for two-dimensional space.

julia> using Graphs

julia> for n in 1_000:1_000:10_000
           println(@time ne(euclidean_graph(n, 2; cutoff=sqrt(10/n), bc=:periodic, seed=1)[1])/n)
       end
091657 seconds (2.50 M allocations: 193.170 MiB, 14.06% gc time)
604
300285 seconds (10.00 M allocations: 765.801 MiB, 12.59% gc time)
661
686230 seconds (22.50 M allocations: 1.686 GiB, 12.47% gc time)
744
175881 seconds (40.00 M allocations: 2.990 GiB, 10.97% gc time)
7065
800561 seconds (62.50 M allocations: 4.666 GiB, 10.76% gc time)
6568
697535 seconds (90.00 M allocations: 6.716 GiB, 14.76% gc time)
641333333333334
690599 seconds (122.50 M allocations: 9.138 GiB, 13.28% gc time)
743571428571428
745701 seconds (160.00 M allocations: 11.932 GiB, 13.53% gc time)
714
962431 seconds (202.51 M allocations: 15.099 GiB, 12.70% gc time)
722222222222221
195257 seconds (250.01 M allocations: 18.638 GiB, 11.42% gc time)
7086

We can see that the euclidean_graph function scales badly with n.
Note that by choosing cutoff=sqrt(10/n) we have a roughly constant
average degree, so the number of edges generates scales linearly, but
the generation time seems to grow much faster.

The investigation

To find out the source of the problem we can investigate the source
of euclidean_graph, which consists of two methods:

function euclidean_graph(
    N::Int,
    d::Int;
    L=1.0,
    rng::Union{Nothing,AbstractRNG}=nothing,
    seed::Union{Nothing,Integer}=nothing,
    kws...,
)
    rng = rng_from_rng_or_seed(rng, seed)
    points = rmul!(rand(rng, d, N), L)
    return (euclidean_graph(points; L=L, kws...)..., points)
end

function euclidean_graph(points::Matrix; L=1.0, p=2.0, cutoff=-1.0, bc=:open)
    d, N = size(points)
    weights = Dict{SimpleEdge{Int},Float64}()
    cutoff < 0.0 && (cutoff = typemax(Float64))
    if bc == :periodic
        maximum(points) > L && throw(
            DomainError(maximum(points), "Some points are outside the box of size $L")
        )
    end
    for i in 1:N
        for j in (i + 1):N
            if bc == :open
                Δ = points[:, i] - points[:, j]
            elseif bc == :periodic
                Δ = abs.(points[:, i] - points[:, j])
                Δ = min.(L .- Δ, Δ)
            else
                throw(ArgumentError("$bc is not a valid boundary condition"))
            end
            dist = norm(Δ, p)
            if dist < cutoff
                e = SimpleEdge(i, j)
                weights[e] = dist
            end
        end
    end
    g = Graphs.SimpleGraphs._SimpleGraphFromIterator(keys(weights), Int)
    if nv(g) < N
        add_vertices!(g, N - nv(g))
    end
    return g, weights
end

The beauty of Julia is that this source is written in Julia and is pretty short.
It immediately allows us to pinpoint the source of our problems. The core of we work
is done in a double-loop iterating over i and j indices. So the complexity of this
algorithm is quadratic in number of vertices.

Fixing the problem

The second beauty of Julia is that we can easily fix this. The idea can be found in
the Wikipedia entry on random geometric graphs in the algorithms section
here.

A simple way to improve the performance of the algorithm is to notice that if
you know L and cutoff you can partition the space into equal sized cells
having side length floor(Int, L / cutoff). Now you see that if you have a vertex
in some cell then it can be connected only to nodes in the same cell or cells directly
adjacent to it (the cells that are more far away contain the points that must be farther
than cutoff from our point). This means that we will have a much lower number of points
to consider. Below I show a code that is a modification of the original source that
adds this feature. The key added function is to_buckets which computes the bucket
identifier for each vertex and creates a dictionary that is a mapping from bucked
identifier to a vector of node numbers that fall into it:

using LinearAlgebra
using Random

function euclidean_graph2(
    N::Int,
    d::Int;
    L=1.0,
    rng::Union{Nothing,AbstractRNG}=nothing,
    seed::Union{Nothing,Integer}=nothing,
    kws...,
)
    rng = Graphs.rng_from_rng_or_seed(rng, seed)
    points = rmul!(rand(rng, d, N), L)
    return (euclidean_graph2(points; L=L, kws...)..., points)
end

function to_buckets(points::Matrix, L, cutoff)
    d, N = size(points)
    dimlen = max(floor(Int, L / max(cutoff, eps())), 1)
    buckets = Dict{Vector{Int}, Vector{Int}}()
    for (i, point) in enumerate(eachcol(points))
        bucket = floor.(Int, point .* dimlen ./ L)
        push!(get!(() -> Int[], buckets, bucket), i)
    end
    return buckets, dimlen
end

function euclidean_graph2(points::Matrix; L=1.0, p=2.0, cutoff=-1.0, bc=:open)
    d, N = size(points)
    weights = Dict{Graphs.SimpleEdge{Int},Float64}()
    cutoff < 0.0 && (cutoff = typemax(Float64))
    if bc == :periodic
        maximum(points) > L && throw(
            DomainError(maximum(points), "Some points are outside the box of size $L")
        )
    end
    buckets, dimlen = to_buckets(points, L, cutoff)
    deltas = collect(Iterators.product((-1:1 for _ in 1:size(points, 1))...))
    void = Int[]
    for (k1, v1) in pairs(buckets)
        for i in v1
            for d in deltas
                k2 = bc == :periodic ? mod.(k1 .+ d, dimlen) : k2 = k1 .+ d
                v2 = get(buckets, k2, void)
                for j in v2
                    i < j || continue
                    if bc == :open
                        Δ = points[:, i] - points[:, j]
                    elseif bc == :periodic
                        Δ = abs.(points[:, i] - points[:, j])
                        Δ = min.(L .- Δ, Δ)
                    else
                        throw(ArgumentError("$bc is not a valid boundary condition"))
                    end
                    dist = norm(Δ, p)
                    if dist < cutoff
                        e = Graphs.SimpleEdge(i, j)
                        weights[e] = dist
                    end
                end
            end
        end
    end

    g = Graphs.SimpleGraphs._SimpleGraphFromIterator(keys(weights), Int)
    if nv(g) < N
        add_vertices!(g, N - nv(g))
    end
    return g, weights
end

Note that it took less than 30 lines of code to add the requested feature to the code.

Performance of the improved code

Let us test our new code:

julia> for n in 1_000:1_000:10_000
           println(@time ne(euclidean_graph2(n, 2; cutoff=sqrt(10/n), bc=:periodic, seed=1)[1])/n)
       end
017221 seconds (274.10 k allocations: 21.751 MiB, 22.15% gc time)
604
022855 seconds (558.52 k allocations: 42.289 MiB, 10.43% gc time)
661
032693 seconds (852.46 k allocations: 69.684 MiB, 8.52% gc time)
744
043141 seconds (1.10 M allocations: 87.196 MiB, 14.73% gc time)
7065
071273 seconds (1.41 M allocations: 109.725 MiB, 7.67% gc time)
6568
068194 seconds (1.70 M allocations: 130.828 MiB, 12.54% gc time)
641333333333334
071277 seconds (1.98 M allocations: 150.712 MiB, 11.85% gc time)
743571428571428
081463 seconds (2.24 M allocations: 169.153 MiB, 10.67% gc time)
714
099957 seconds (2.48 M allocations: 186.492 MiB, 8.08% gc time)
722222222222221
148573 seconds (2.84 M allocations: 213.214 MiB, 18.37% gc time)
7086

We seem to get what we wanted. Our computation time looks to scale quite well.
Also the obtained average degree numbers are identical to the original ones.

Let us compare the performance on an even larger graph:

julia> n = 100_000;

julia> @time ne(euclidean_graph(n, 2; cutoff=sqrt(10/n), bc=:periodic, seed=1)[1])/n
908.976252 seconds (25.00 G allocations: 1.819 TiB, 11.64% gc time, 0.00% compilation time)
15.70797

julia> julia> @time ne(euclidean_graph2(n, 2; cutoff=sqrt(10/n), bc=:periodic, seed=1)[1])/n
  1.640495 seconds (27.53 M allocations: 2.121 GiB, 19.83% gc time)
15.70797

Indeed we see that the timing of the original implementation becomes prohibitive for
larger graphs.

Making sure things are correct

Before we finish there is one important task we need to make. We should check if
indeed the euclidean_graph2 function produces the same results as euclidean_graph.
This is easy to test with the following randomized test:

julia> using Test

julia> Random.seed!(1234);

julia> @time for i in 1:1000
           N = rand(10:500)
           d = rand(1:5)
           L = rand()
           p = 3*rand()
           cutoff = rand() * L / 4
           bc = rand([:open, :periodic])
           seed = rand(UInt32)
           @test euclidean_graph(N, d; L, p, cutoff, bc, seed) ==
                 euclidean_graph2(N, d; L, p, cutoff, bc, seed)
       end
 16.955773 seconds (275.09 M allocations: 20.342 GiB, 12.27% gc time)

We have tested 1000 random setups of the experiments. In each of them both functions
returned the same results.

Conclusions

In my post I have shown you an example that one can easily tweak some package code to your needs.
In this case this was performance, but it can be equally well functionality.

I did not comment too much on the code itself, as it was a bit longer than usual, but let me
discuss as a closing remark one performance aspect of the code. In my to_buckets function I used
the get! function to populate the dictionary with a mutable default value (Int[] in that case).
You might wonder why I preferred to use an anonymous function instead of passing a default
as a third argument. The reason is number of allocations. Check this code:

julia> using BenchmarkTools

julia> function f1()
           d = Dict(1 => Int[])
           for i in 1:10^6
               get!(d, 1, Int[])
           end
           return d
       end
f1 (generic function with 1 method)

julia> function f2()
           d = Dict(1 => Int[])
           for i in 1:10^6
               get!(() -> Int[], d, 1)
           end
           return d
       end
f2 (generic function with 1 method)

julia> @benchmark f1()
BenchmarkTools.Trial: 195 samples with 1 evaluation.
 Range (min … max):  19.961 ms … 45.328 ms  ┊ GC (min … max): 10.26% … 13.19%
 Time  (median):     23.962 ms              ┊ GC (median):    10.45%
 Time  (mean ± σ):   25.660 ms ±  5.050 ms  ┊ GC (mean ± σ):  12.27% ±  3.76%

    ▃▂▂█▃▃ ▃▂▃
  ▆▄██████▇███▇▆▄▄▇▄▄▁▃▆▄▃▅▄▅▄▄▄▃▄▃▁▃▃▁▁▁▃▃▃▃▃▃▁▃▁▁▃▁▁▁▁▁▃▁▁▃ ▃
  20 ms           Histogram: frequency by time        44.3 ms <

 Memory estimate: 61.04 MiB, allocs estimate: 1000005.

julia> @benchmark f2()
BenchmarkTools.Trial: 902 samples with 1 evaluation.
 Range (min … max):  4.564 ms … 11.178 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.149 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.526 ms ±  1.396 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅▄▄▃▃▄▅▆▄▁▁                                           ▃
  ██████████████▆▆▆▆▆▅▇▆▄▆▆▅▅▁▄▅▁▅▁▇▅▆▄▄▇▁▅▅▅▄▁▄▄▆▅▁▅▁▁▅▆██▅ █
  4.56 ms      Histogram: log(frequency) by time       10 ms <

 Memory estimate: 592 bytes, allocs estimate: 5.

As you can see f2 is much faster than f1 and does much less allocations. The issue
is that f1 allocates a fresh Int[] object in every iteration of the loop, while
f2 would allocate it only if get! does not hit a key that already exists in d
(and in our experiment I always queried for 1 which was present in d).

juliabloggers.com

A Julia Language Blog Aggregator

Author Archives: Blog by Bogumił Kamiński

My understanding of object property access in Julia

Introduction

The internals

The composite type interface

Conclusions

A little exercise in CSV.jl and DataFrames.jl

Introduction

Getting the data

The task

The solution

Conclusions

What to do when you are stuck with Julia?

Introduction

The problem

The investigation

Fixing the problem

Performance of the improved code

Making sure things are correct

Conclusions