Author Archives: Blog by Bogumił Kamiński

Yet another graph puzzle

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/07/14/graphpuzzle.html

Introduction

Recently I have written a post that presented a puzzle
that can be solved using graph analysis. I received quite nice feedback
from my readers and one of them has shared with me another interesting
puzzle that can be solved using graphs.

Today I want to discuss this puzzle. Like in the recent post
I solve the problem both analytically and computationally.

The post was written under Julia 1.9.2 and Graphs.jl 1.8.0.

The problem

The problem my friend shared with me goes as follows:

Seven people A, B, C, E, F, G, and H visit person D.
Upon investigation they report that they have seen
the following persons during their visits:

  • A met B, C, F, and G;
  • B met A, C, E, F, and H;
  • C met A, B, and E;
  • E met B, C, and F;
  • F met A, B, E, and H;
  • G met A and H;
  • H met B, F, and G.

Now, interestingly, they all also report that all of
them visited D only once for a continuous period of time.
The questions are:

  1. Show that this is not possible,
    i.e. that one of the visitors must have made more than one visit.
  2. Show that it is enough to assume that
    a single person visited D several times and identify this person.

Analytical solution

First, note that the answers about who has seen whom are consistent
among all reporting people.
We can visualize them using the following graph:

Visit graph

To start let us think why it is not possible that every person visited D only once.
Consider visitors A, B, H, and G. We see that A and H have not met.
This means that their periods of visits are disjoint. Since B and G
both saw A and H, then assuming they made only one visit,
they both must have been present in the period between visits of A and H.
But this would mean that in that period B and H would have seen each other,
which is not the case.
Therefore at least one person must have visited D more than once.

We notice that we have a problem because we have a cycle of length 4 in the graph,
namely A-B-G-H, that does not contain shorter cycles inside
(i.e. an edge connecting A and H or B and G). By the reasoning
we have presented above all such cycles will be problematic. There are three such
cycles: A-B-G-H, A-G-H-F, and A-C-E-F. We notice that in all of these cycles
A is present and this is the only person that is commonly present in all of them.
This means that:

  • removing A from the graph (i.e. assuming that A visited D multiple times) could
    solve the problem;
  • removing any other node will not solve the problem.

So to finish the solution of the puzzle we need to show that if we remove A from
consideration all other people could indeed come only once to visit D. Such a
solution can be relatively quickly found manually to be e.g.:

  • B comes;
  • C comes;
  • E comes;
  • C leaves;
  • F comes;
  • E leaves;
  • H comes;
  • B leaves;
  • F leaves;
  • G comes;
  • G leaves;
  • H leaves.

Computational solution

The solution was not long, but required some thought.
Now we solve the puzzle computationally using brute force.

Start by defining the graph we will work with:

using Graphs
g = SimpleGraph(7)
add_edge!(g, 1, 2)
add_edge!(g, 1, 3)
add_edge!(g, 1, 5)
add_edge!(g, 1, 6)
add_edge!(g, 2, 3)
add_edge!(g, 2, 4)
add_edge!(g, 2, 5)
add_edge!(g, 2, 7)
add_edge!(g, 3, 4)
add_edge!(g, 4, 5)
add_edge!(g, 5, 7)
add_edge!(g, 6, 7)

Note that I mapped people to numbers, A to 1, B to 2, C to 3, E to 4, F to 5, G to 6, H to 7.

Now, define a function that checks that a graph can be solved by each person visiting only once.

The logic of the code is as follows. If we n people we can, without loss of generality,
assume that they arrive or leave in time moments 1, 2, …, 2n-1, 2n, where in a given moment
only one event (arrival or departure of some person) happens.
We will check all possible assignments
of people to pairs of moments (pairs, because a person comes and leaves in two distinct moments).
The process of checking is done iteratively:

  • we try to add a next person (starting from the first and ending with the last);
  • we check all possible paris of time spots still left unoccupied by people already assigned;
  • for each pair we check if this assignment is consistent with the constraints defined by the graph.

I implemented this procedure recursively using depth-first search. The find_assignment
function performs the traversal of options. It has three arguments:

  • g is graph of constraints (who has seen whom);
  • v is a vector of length 2n holding information what visitors already have their
    arrival and departure time allocated (0 indicates a free spot)
  • level is information about the number of visitor that we are currently considering.

The check_g_v checks is adding a given person (with level number) in moments specified by
the v vector is consistent with information given in graph g.

The code is as follows:

function check_g_v(g, v, level)
    lev_v = findall(==(level), v)
    for i in 1:level-1 # need to check only against previous visitors
        lev_i = findall(==(i), v)
        if has_edge(g, i, level)
            if lev_v[2] < lev_i[1] || lev_v[1] > lev_i[2]
                return false
            end
        else
            if !(lev_v[2] < lev_i[1] || lev_v[1] > lev_i[2])
                return false
            end
        end
    end
    return true
end

function find_assignment(g, level=1, v=zeros(Int, 2*nv(g)))
    n = nv(g)
    if level == n # we check last person
        loc = findall(==(0), v) # no choice of free spots
        @assert length(loc) == 2
        v[loc] .= n
        if check_g_v(g, v, level)
            @info "Graph is good! Solution $v"
            return true # signal success up the recursion
        end
        v[loc] .= 0 # clean up on failure
    else
        for i in 1:2*n-1 # i indicates arrival time
            v[i] == 0 || continue
            v[i] = level
            for j in i+1:2*n # j indicates departure time
                v[j] == 0 || continue
                v[j] = level
                if check_g_v(g, v, level)
                    if find_assignment(g, level+1, v)
                        return true # signal success up the recursion
                    end
                end
                v[j] = 0 # clean up on failure
            end
            v[i] = 0 # clean up on failure
        end
    end
    if level == 1 # all attempts to find good allocations failed
        @info "Graph is not good!"
    end
    return false # signal failure up the recursion
end

We can use it to check if the original graph is feasible:

julia> find_assignment(g);
[ Info: Graph is not good!

As we already knew – the graph is not good and the computational
procedure confirmed it.

How can we check graphs when we would remove one of the persons from it?
It is easy with Graphs.jl, as we can just subset our graph
by removing some node from it. Let us see:

julia> for i in 1:nv(g)
           @info "removing $i"
           find_assignment(g[[1:i-1; i+1:7]])
       end
[ Info: removing 1
[ Info: Graph is good! Solution [1, 2, 3, 2, 4, 3, 6, 1, 4, 5, 5, 6]
[ Info: removing 2
[ Info: Graph is not good!
[ Info: removing 3
[ Info: Graph is not good!
[ Info: removing 4
[ Info: Graph is not good!
[ Info: removing 5
[ Info: Graph is not good!
[ Info: removing 6
[ Info: Graph is not good!
[ Info: removing 7
[ Info: Graph is not good!

As we can see indeed removing node 1 (representing person A) solves the problem
and this is the only solution.

Let us do the mapping of numbers to letters to make the solution more readable.
Note that since we removed node 1 (corresponding to person A)
from the graph numbers of all nodes got decreased by 1.
This can be done by comprehension or by broadcasting (I show both approaches
for a reference):

julia> mapping = ["B", "C", "E", "F", "G", "H"]
6-element Vector{String}:
 "B"
 "C"
 "E"
 "F"
 "G"
 "H"

julia> [mapping[i] for i in [1, 2, 3, 2, 4, 3, 6, 1, 4, 5, 5, 6]]
12-element Vector{String}:
 "B"
 "C"
 "E"
 "C"
 "F"
 "E"
 "H"
 "B"
 "F"
 "G"
 "G"
 "H"

julia> getindex.(Ref(mapping), [1, 2, 3, 2, 4, 3, 6, 1, 4, 5, 5, 6])
12-element Vector{String}:
 "B"
 "C"
 "E"
 "C"
 "F"
 "E"
 "H"
 "B"
 "F"
 "G"
 "G"
 "H"

Indeed, this is the same order that we had in the analytical solution.

Conclusions

I hope you find this puzzle and a solution interesting.

Next week I will post about features introduced in
a 1.6 release of DataFrames.jl that we have recently made.

Tips and tricks of broadcasting in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/07/07/dfbroadcast.html

Introduction

Broadcasting, a.k.a. the . operator, is a powerful feature of Julia that allows you to
easily vectorize any function. Today I want to write about some common broadcasting
patterns that can be used in DataFrames.jl.

The post was written under Julia 1.9.2 and DataFrames.jl 1.5.0.

Conditional replacement of values in a data frame

Let us create a data frame with some random contents:

julia> using DataFrames

julia> using Random

julia> df = DataFrame(rand(5, 6), :auto)
5×6 DataFrame
 Row │ x1         x2         x3         x4        x5        x6
     │ Float64    Float64    Float64    Float64   Float64   Float64
─────┼─────────────────────────────────────────────────────────────────
   1 │ 0.364225   0.690894   0.240867   0.720774  0.331573  0.549766
   2 │ 0.225226   0.241412   0.0793279  0.418206  0.775367  0.35275
   3 │ 0.19913    0.0633375  0.767805   0.280096  0.721995  0.917259
   4 │ 0.708132   0.230088   0.702677   0.947402  0.928979  0.66101
   5 │ 0.0267573  0.0122425  0.549734   0.331788  0.32658   0.00476749

Assume we want to get a new data frame that has true when the value stored
in the cell is greater than 0.5 and false otherwise. This is easy.
We just broadcast the > operator:

julia> df .> 0.5
5×6 DataFrame
 Row │ x1     x2     x3     x4     x5     x6
     │ Bool   Bool   Bool   Bool   Bool   Bool
─────┼──────────────────────────────────────────
   1 │ false   true  false   true  false   true
   2 │ false  false  false  false   true  false
   3 │ false  false   true  false   true   true
   4 │  true  false   true   true   true   true
   5 │ false  false   true  false  false  false

Now assume we want to replace all values greater than 0.5 with 0.5 and
keep the lower values untouched. This can be done with ifelse:

julia> ifelse.(df .> 0.5, 0.5, df)
5×6 DataFrame
 Row │ x1         x2         x3         x4        x5        x6
     │ Float64    Float64    Float64    Float64   Float64   Float64
─────┼─────────────────────────────────────────────────────────────────
   1 │ 0.364225   0.5        0.240867   0.5       0.331573  0.5
   2 │ 0.225226   0.241412   0.0793279  0.418206  0.5       0.35275
   3 │ 0.19913    0.0633375  0.5        0.280096  0.5       0.5
   4 │ 0.5        0.230088   0.5        0.5       0.5       0.5
   5 │ 0.0267573  0.0122425  0.5        0.331788  0.32658   0.00476749

Or with clamp:

julia> clamp.(df, -Inf, 0.5)
5×6 DataFrame
 Row │ x1         x2         x3         x4        x5        x6
     │ Float64    Float64    Float64    Float64   Float64   Float64
─────┼─────────────────────────────────────────────────────────────────
   1 │ 0.364225   0.5        0.240867   0.5       0.331573  0.5
   2 │ 0.225226   0.241412   0.0793279  0.418206  0.5       0.35275
   3 │ 0.19913    0.0633375  0.5        0.280096  0.5       0.5
   4 │ 0.5        0.230088   0.5        0.5       0.5       0.5
   5 │ 0.0267573  0.0122425  0.5        0.331788  0.32658   0.00476749

Similarly we could clamp values to the [0.1, 0.9] interval:

julia> clamp.(df, 0.1, 0.9)
5×6 DataFrame
 Row │ x1        x2        x3        x4        x5        x6
     │ Float64   Float64   Float64   Float64   Float64   Float64
─────┼────────────────────────────────────────────────────────────
   1 │ 0.364225  0.690894  0.240867  0.720774  0.331573  0.549766
   2 │ 0.225226  0.241412  0.1       0.418206  0.775367  0.35275
   3 │ 0.19913   0.1       0.767805  0.280096  0.721995  0.9
   4 │ 0.708132  0.230088  0.702677  0.9       0.9       0.66101
   5 │ 0.1       0.1       0.549734  0.331788  0.32658   0.1

Importantly, we do not need to keep the element type of the source column fixed.
Assume that we want to set values greater than 0.5 to missing:

julia> ifelse.(df .> 0.5, missing, df)
5×6 DataFrame
 Row │ x1               x2               x3               x4              x5              x6
     │ Float64?         Float64?         Float64?         Float64?        Float64?        Float64?
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │       0.364225   missing                0.240867   missing               0.331573  missing
   2 │       0.225226         0.241412         0.0793279        0.418206  missing               0.35275
   3 │       0.19913          0.0633375  missing                0.280096  missing         missing
   4 │ missing                0.230088   missing          missing         missing         missing
   5 │       0.0267573        0.0122425  missing                0.331788        0.32658         0.00476749

Note that the operation performed an automatic promotion of column element types.

As a final operation consider taking sign(log(x) + 1) on our data frame:

julia> sign.(log.(df) .+ 1)
5×6 DataFrame
 Row │ x1       x2       x3       x4       x5       x6
     │ Float64  Float64  Float64  Float64  Float64  Float64
─────┼──────────────────────────────────────────────────────
   1 │    -1.0      1.0     -1.0      1.0     -1.0      1.0
   2 │    -1.0     -1.0     -1.0      1.0      1.0     -1.0
   3 │    -1.0     -1.0      1.0     -1.0      1.0      1.0
   4 │     1.0     -1.0      1.0      1.0      1.0      1.0
   5 │    -1.0     -1.0      1.0     -1.0     -1.0     -1.0

Again – things are easy and intuitive. Data frame behaves just like a matrix in all operations.

I hope now you are comfortable with creation of a new data frame using broadcasting.

We can turn to in-place operations on a data frame.

In-place update of values in a data frame

In general it is enough to just put data frame on a right hand side of a broadcasted assignment
operator to update it in-place:

julia> df2 = copy(df)
5×6 DataFrame
 Row │ x1         x2         x3         x4        x5        x6
     │ Float64    Float64    Float64    Float64   Float64   Float64
─────┼─────────────────────────────────────────────────────────────────
   1 │ 0.364225   0.690894   0.240867   0.720774  0.331573  0.549766
   2 │ 0.225226   0.241412   0.0793279  0.418206  0.775367  0.35275
   3 │ 0.19913    0.0633375  0.767805   0.280096  0.721995  0.917259
   4 │ 0.708132   0.230088   0.702677   0.947402  0.928979  0.66101
   5 │ 0.0267573  0.0122425  0.549734   0.331788  0.32658   0.00476749

julia> df3 = df2
5×6 DataFrame
 Row │ x1         x2         x3         x4        x5        x6
     │ Float64    Float64    Float64    Float64   Float64   Float64
─────┼─────────────────────────────────────────────────────────────────
   1 │ 0.364225   0.690894   0.240867   0.720774  0.331573  0.549766
   2 │ 0.225226   0.241412   0.0793279  0.418206  0.775367  0.35275
   3 │ 0.19913    0.0633375  0.767805   0.280096  0.721995  0.917259
   4 │ 0.708132   0.230088   0.702677   0.947402  0.928979  0.66101
   5 │ 0.0267573  0.0122425  0.549734   0.331788  0.32658   0.00476749

julia> df2 .= log.(df)
5×6 DataFrame
 Row │ x1         x2         x3         x4          x5          x6
     │ Float64    Float64    Float64    Float64     Float64     Float64
─────┼─────────────────────────────────────────────────────────────────────
   1 │ -1.00998   -0.369769  -1.42351   -0.32743    -1.10391    -0.598262
   2 │ -1.49065   -1.42125   -2.53417   -0.87178    -0.254419   -1.04199
   3 │ -1.6138    -2.75928   -0.26422   -1.27262    -0.325737   -0.0863653
   4 │ -0.345124  -1.46929   -0.352858  -0.0540319  -0.0736689  -0.413987
   5 │ -3.62095   -4.40285   -0.598321  -1.10326    -1.11908    -5.34594

julia> df2 === df3
true

Note that with the last check I made sure that indeed the df2 .= log.(df) was in-place.
We updated the contents of df2, and not created a new object.

However, sometimes things are more tricky. Consider the df .> 0.5 operation we did above:

julia> df2 .= df .> 0.5
5×6 DataFrame
 Row │ x1       x2       x3       x4       x5       x6
     │ Float64  Float64  Float64  Float64  Float64  Float64
─────┼──────────────────────────────────────────────────────
   1 │     0.0      1.0      0.0      1.0      0.0      1.0
   2 │     0.0      0.0      0.0      0.0      1.0      0.0
   3 │     0.0      0.0      1.0      0.0      1.0      1.0
   4 │     1.0      0.0      1.0      1.0      1.0      1.0
   5 │     0.0      0.0      1.0      0.0      0.0      0.0

Note that there is a difference from creating a new data frame with df .> 0.5.
The issue is that columns of df2 keep their original types. This is expected, as we
wanted a fully in-place operation. However, sometimes you might want to change the
element type of a column when doing broadcasting. This is possible, however,
then you need to use data frame indexing with a special ! row selector which signals
that column replacement is requested:

julia> df2[!, :] .= df .> 0.5
5×6 DataFrame
 Row │ x1     x2     x3     x4     x5     x6
     │ Bool   Bool   Bool   Bool   Bool   Bool
─────┼──────────────────────────────────────────
   1 │ false   true  false   true  false   true
   2 │ false  false  false  false   true  false
   3 │ false  false   true  false   true   true
   4 │  true  false   true   true   true   true
   5 │ false  false   true  false  false  false

julia> df3
5×6 DataFrame
 Row │ x1     x2     x3     x4     x5     x6
     │ Bool   Bool   Bool   Bool   Bool   Bool
─────┼──────────────────────────────────────────
   1 │ false   true  false   true  false   true
   2 │ false  false  false  false   true  false
   3 │ false  false   true  false   true   true
   4 │  true  false   true   true   true   true
   5 │ false  false   true  false  false  false

Indeed we got what we wanted. I showed df3 variable to convince you that
still all operations were done on the same data frame object and df2 and df3
are still pointing to it.

Let me give an example where the difference between in-place and column replace
operations particularly matters and is a common surprise for new users.
It is a case when we want to introduce missing values to a column that initially does not allow them.

julia> df2 = copy(df)
5×6 DataFrame
 Row │ x1         x2         x3         x4        x5        x6
     │ Float64    Float64    Float64    Float64   Float64   Float64
─────┼─────────────────────────────────────────────────────────────────
   1 │ 0.364225   0.690894   0.240867   0.720774  0.331573  0.549766
   2 │ 0.225226   0.241412   0.0793279  0.418206  0.775367  0.35275
   3 │ 0.19913    0.0633375  0.767805   0.280096  0.721995  0.917259
   4 │ 0.708132   0.230088   0.702677   0.947402  0.928979  0.66101
   5 │ 0.0267573  0.0122425  0.549734   0.331788  0.32658   0.00476749

julia> df2 .= ifelse.(df .> 0.5, missing, df)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Float64

julia> df2[!, :] .= ifelse.(df .> 0.5, missing, df)
5×6 DataFrame
 Row │ x1               x2               x3               x4              x5              x6
     │ Float64?         Float64?         Float64?         Float64?        Float64?        Float64?
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │       0.364225   missing                0.240867   missing               0.331573  missing
   2 │       0.225226         0.241412         0.0793279        0.418206  missing               0.35275
   3 │       0.19913          0.0633375  missing                0.280096  missing         missing
   4 │ missing                0.230088   missing          missing         missing         missing
   5 │       0.0267573        0.0122425  missing                0.331788        0.32658         0.00476749

Note that df2 originally does not allow missing values in any of the columns. Therefore
df2 .= ifelse.(df .> 0.5, missing, df) fails. However, replacing df2 .= by df2[!, :] .=
works, because the ! selector explicitly requests overwriting the original columns
with new ones, possibly changing their types.

Conclusions

I hope you found these examples useful and they will help you to work with DataFrames.jl more
easily and confidently.

As a final comment let me explain why df2 .= ifelse.(df .> 0.5, missing, df) is fully in-place
(and does not replace the columns with proper element types like df2[!, :] .=) as this is a common question.
There are three reasons for this:

  • performance: a fully in-place operation is faster and allocates less;
  • safety: in production code we might want to make sure that the type of a column is not changed by mistake;
  • design consistency: the operation was designed to work the same way as broadcasting on matrices
    (broadcasted assignment used with a matrix does allow to change its element type).

My approach to named tuples in Julia

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/06/30/namedtuple.html

Introduction

Named tuples are one of the most commonly used data types in Julia.
In this post I want to discuss how I think of this type and use it in practice.

This post was written under Julia 1.9.1.

The basics

The Julia Manual in the section Named Tuples specifies:

The components of tuples can optionally be named, in which case a named tuple is constructed (…)
Named tuples are very similar to tuples, except that fields can additionally
be accessed by name using dot syntax (x.a) in addition to the regular indexing syntax (x[1]).

And gives the following example:

julia> x = (a=2, b=1+2)
(a = 2, b = 3)

julia> x[1]
2

julia> x.a
2

This definition highlights that NamedTuple is similar to a Tuple except that its
components can have names.

However, I find it more natural to think of a NamedTuple as an anonymous struct,
that optionally can be integer-indexed to get access to its components.

Let me give a few reasons why I think it is a better mental model.

Subtyping behavior

Let us have a look at the Tuple type first. Construct a basic tuple:

julia> t = (1, "a")
(1, "a")

julia> typeof(t)
Tuple{Int64, String}

Note that type of t is a subtype of e.g. Tuple{Integer, AbstractString}:

julia> t isa Tuple{Integer, AbstractString}
true

This subtyping behavior is called covariant.

Moreover Tuple{Integer, AbstractString} is not concrete:

julia> isconcretetype(Tuple{Integer, AbstractString})
false

This means that you cannot create a tuple that has this type.
If you try the following call:

julia> t2 = Tuple{Integer, AbstractString}(t)
(1, "a")

julia> typeof(t2)
Tuple{Int64, String}

Nothing changes. The resulting t2 tuple has concrete parameters Int64 and String.

You might ask how much of these features a NamedTuple shares? The answer is none.
Let us check:

julia> nt = (a=1, b="a")
(a = 1, b = "a")

julia> typeof(nt)
NamedTuple{(:a, :b), Tuple{Int64, String}}

Type of nt is not a subtype of e.g. NamedTuple{(:a, :b), Tuple{Integer, AbstractString}}:

julia> nt isa NamedTuple{(:a, :b), Tuple{Integer, AbstractString}}
false

However, it is a subtype of UnionAll type NamedTuple{(:a, :b), <:Tuple{Integer, AbstractString}}.

julia> nt isa NamedTuple{(:a, :b), <:Tuple{Integer, AbstractString}}
true

This subtyping behavior is called invariant and is shared with struct types.

Moreover, you can construct a value that has NamedTuple{(:a, :b), Tuple{Integer, AbstractString}} type (so it is concrete):

julia> isconcretetype(NamedTuple{(:a, :b), Tuple{Integer, AbstractString}})
true

julia> nt2 = NamedTuple{(:a, :b), Tuple{Integer, AbstractString}}((1, "a"))
NamedTuple{(:a, :b), Tuple{Integer, AbstractString}}((1, "a"))

julia> typeof(nt2)
NamedTuple{(:a, :b), Tuple{Integer, AbstractString}}

Again, this behavior is the same as for struct types.

The fact that you can flexibly set parameters of a named tuple is mostly useful
if you have heterogeneous data. Here is an example.

Permissive code:

julia> [(a=1, b=missing), (a=missing, b="x")]
2-element Vector{NamedTuple{(:a, :b)}}:
 (a = 1, b = missing)
 (a = missing, b = "x")

julia> typeof.([(a=1, b=missing), (a=missing, b="x")])
2-element Vector{DataType}:
 NamedTuple{(:a, :b), Tuple{Int64, Missing}}
 NamedTuple{(:a, :b), Tuple{Missing, String}}

Here we created a vector with elements having different types.
The code using it would not be type stable. However, we could make the types the same
(which would be more compiler friendly; this would be relevant if you processed really large data):

julia> @NamedTuple{a::Union{Int,Missing}, b::Union{String,Missing}}.([(a=1, b=missing), (a=missing, b="x")])
2-element Vector{NamedTuple{(:a, :b), Tuple{Union{Missing, Int64}, Union{Missing, String}}}}:
 NamedTuple{(:a, :b), Tuple{Union{Missing, Int64}, Union{Missing, String}}}((1, missing))
 NamedTuple{(:a, :b), Tuple{Union{Missing, Int64}, Union{Missing, String}}}((missing, "x"))

The difference is that now the compiler will know the type of every field of our named tuple
(and Julia gracefully handles small Unions).

Note that I used the @NamedTuple{a::Union{Int,Missing}, b::Union{String,Missing}} macro call
for a convenient way to specify NamedTuple type. Let us check it (with one more way to write it):

julia> @NamedTuple{a::Union{Int,Missing}, b::Union{String,Missing}}
NamedTuple{(:a, :b), Tuple{Union{Missing, Int64}, Union{Missing, String}}}

julia> @NamedTuple begin
           a::Union{Int,Missing}
           b::Union{String,Missing}
       end
NamedTuple{(:a, :b), Tuple{Union{Missing, Int64}, Union{Missing, String}}}

Now, even on the syntax level, the similarity to struct definition is apparent.

Let us define two types:

julia> const S1 = @NamedTuple begin
           a::Union{Int,Missing}
           b::Union{String,Missing}
       end
NamedTuple{(:a, :b), Tuple{Union{Missing, Int64}, Union{Missing, String}}}

julia> struct S2
           a::Union{Int,Missing}
           b::Union{String,Missing}
       end

The S1 and S2 types are almost undistinguishable. The only differences are:

  • S1 is a subtype of NamedTuple UnionAll, while S2 is just a subtype of Any;
  • S1 accepts a tuple as a single argument, while S2 accepts two arguments;
  • S1 is indexable, while S2 is not;
  • adding methods to foreign functions for S1 is type piracy.

How named tuples are different from structs

Let us create instances of S1 and S2 and discuss the differences mentioned above.

First the difference in the constructor:

julia> x1 = S1((1, "a"))
NamedTuple{(:a, :b), Tuple{Union{Missing, Int64}, Union{Missing, String}}}((1, "a"))

julia> x2 = S2(1, "a")
S2(1, "a")

We can see the difference in the constructor. An extra pair of parentheses is needed for S1.
It would be tempting to define:

julia> S1(x, y) = S1((x, y))
NamedTuple{(:a, :b), Tuple{Union{Missing, Int64}, Union{Missing, String}}}

julia> S1(1, "a")
NamedTuple{(:a, :b), Tuple{Union{Missing, Int64}, Union{Missing, String}}}((1, "a"))

This would work. But if you are writing library code you should never do this.
The reason is that you have just added a method for
NamedTuple{(:a, :b), Tuple{Union{Missing, Int64}, Union{Missing, String}}} type
and this type is part of Base Julia. Such behavior is called type piracy
and potentially could break some third-party code (in this case the risk is minimal,
but it is worth haveing this habit).

The difference is that with your custom type S2 you are free to add methods to
any functions you like (even third party), since you own S2. Therefore
(provided you make a correct method definition) you will not break any code by doing so.

Let us give an example. We have discussed the benefit of NamedTuple over
struct is that it is indexable:

julia> x1[1]
1

julia> x2[1]
ERROR: MethodError: no method matching getindex(::S2, ::Int64)

However, you can easily fix this:

julia> Base.getindex(x::S2, i::Int) = getfield(x, i)

julia> x2[1]
1

What we did by adding a method to Base.getindex is not type piracy since
S2 is a type that we own.

The last difference between S1 is S2 is that, as we mentioned,
S1 is subtype of NamedTuple, so it will get accepted by functions
that work with named tuples, as opposed to S2. Here is a small example:

julia> empty(x1)
NamedTuple()

julia> empty(x2)
ERROR: MethodError: no method matching empty(::S2)

Again – we could easily add a method for Base.empty for S2 if we wanted, just as we did
with Base.getindex above.

Conclusions

Given these examples named tuple shares some features with tuples, and some with structs.
My experience is that it is more similar to a struct in practice. There are two reasons:

  1. I rarely index or iterate named tuples, most often I access their fields by name.
  2. In dispatch named tuple does behave like a struct (this is quite relevant when writing your code).

Still I like to think of named tuple as an anonymous struct. Although we could give it an alias
like we did with S1 definition, this is rarely done. Typically the concrete type of a named tuple
is left dynamic, and I just define my functions for NamedTuple UnionAll.

Happy hacking!