What you can learn from accessing an element of an array in Julia

Re-posted from: http://giordano.github.io/blog/2021-03-25-access-array/

This innocent question from a newcomer in the official Slack
workspace of the Julia programming language

How can I extract the Float64 element from a 1×1 Array{Float64, 1}?

spurred a series of more or less serious answers:

use the only function: only(a)
get the element with index 1: a[1]
omit all indices: a[]
use the first function: first(a)
get the element with index 1, 1: a[1, 1]
use the last function: last(a)
get the element with index 1, 1, 1, 1, 1, …: a[ones(Int, 100)...] (the
... operator is called
splatting)
get the last element with the special end index (end here is automatically
lowered to lastindex(a)): a[end]
get the element in the first row and last column: a[begin, end]
reduce the array with the
sum function: reduce(sum, a)
get the element with index floor(Int, -real(exp(π * im))): a[floor(Int, -real(exp(π * im)))], or the variant x[floor(Int, -real(cispi(1)))]
dereference the pointer
of a
(remember?):
unsafe_load(pointer(a))
use the StarWarsArrays.jl
package and get the fourth element (mentioned in the post about random
indexing):
StarWarsArray(a)[4]
use the
getindex
function (a[n, ...] is a syntactic sugar for getindex(a, n, ...)):
getindex(a, firstindex(a))
get a random element from the array: rand(a)

Benchmarking

We can compare the performance of all these versions. To do this, we can use
the BenchmarkTools.jl package
(benchmarks kindly provided by @Seelengrab):

using BenchmarkTools # to install it: ]add BenchmarkTools
using StarWarsArrays # to install it: ]add https://github.com/giordano/StarWarsArrays.jl
using Random # it's a standard library

only_bench(x)         = only(x)
one_idx_bench(x)      = x[1]
omit_bench(x)         = x[]
first_bench(x)        = first(x)
two_ones_idx_bench(x) = x[1, 1]
splat_bench(x)        = x[ones(Int, 100)...]
end_bench(x)          = x[end]
begin_end_bench(x)    = x[begin, end]
reduce_sum_bench(x)   = reduce(sum, x)
floor_bench(x)        = x[floor(Int, -real(exp(π * im)))]
cispi_bench(x)        = x[floor(Int, -real(cispi(1)))]
dereference_bench(x)  = unsafe_load(pointer(x))
getindex_bench(x)     = getindex(x, firstindex(x))
starwars_bench(x)     = StarWarsArray(x)[4] # ಠ_ಠ
rand_bench(x)         = rand(x)

for f in [
    only_bench,
    one_idx_bench,
    omit_bench,
    first_bench,
    two_ones_idx_bench,
    splat_bench,
    end_bench,
    begin_end_bench,
    reduce_sum_bench,
    floor_bench,
    cispi_bench,
    dereference_bench,
    getindex_bench,
    starwars_bench,
    rand_bench,
]
    println(f, ':')
    @btime $f(x) setup=(x=rand(1,1))
end

On my computer I get

only_bench:
  2.800 ns (0 allocations: 0 bytes)
one_idx_bench:
  1.631 ns (0 allocations: 0 bytes)
omit_bench:
  2.796 ns (0 allocations: 0 bytes)
first_bench:
  1.630 ns (0 allocations: 0 bytes)
two_ones_idx_bench:
  1.685 ns (0 allocations: 0 bytes)
splat_bench:
  4.524 μs (5 allocations: 1.75 KiB)
end_bench:
  1.378 ns (0 allocations: 0 bytes)
begin_end_bench:
  1.644 ns (0 allocations: 0 bytes)
reduce_sum_bench:
  1.901 ns (0 allocations: 0 bytes)
floor_bench:
  1.630 ns (0 allocations: 0 bytes)
cispi_bench:
  1.631 ns (0 allocations: 0 bytes)
dereference_bench:
  1.369 ns (0 allocations: 0 bytes)
getindex_bench:
  1.371 ns (0 allocations: 0 bytes)
starwars_bench:
  1.371 ns (0 allocations: 0 bytes)
rand_bench:
  9.599 ns (0 allocations: 0 bytes)

Unsurprisingly, the fastest solution is dereferencing a pointer, which compiles
down to the following code

julia> x = rand(1,1);

julia> @code_native debuginfo=:none dereference_bench(x)
        .text
        movq    (%rdi), %rax
        vmovsd  (%rax), %xmm0                   # xmm0 = mem[0],zero
        retq
        nopl    (%rax,%rax)

This is pretty much what you’d expect from the corresponding C
code.

What’s remarkable is that the
StarWarsArrays.jl package
achieves about the same performance: accessing an element of a custom data
structure using a funny custom indexing, and written without having performance
in mind, is just as efficient as dereferencing a pointer.

However, subnanosecond differences aren’t credible: most functions that clocked
in the range 1.3-1.6 nanoseconds have basically the same performance. If you
run the benchmarks again you’ll find that their performance get more or less
close to that of dereferencing. The difference with dereferencing is that all
other functions do bounds
checking, which on average lead
to slightly larger runtime. However, in high-performance computing you usually
want to make sure to be within bounds at a higher level and skip bounds checks
when accessing individual elements. If you’re 100% sure that the element of the
array you want to access is within its bounds, you can use the macro
@inbounds to
skip runtime bounds checks.

To see the performance when skipping bounds checks in the above functions, you
can either use @inbounds in their definitions (e.g., one_idx_bench(x) = @inbounds x[1], etc…), or start Julia with the --check-bounds=no flag (this
flag is not recommended for general use, as it will skip all bounds checks,
not only where you know it’s safe to do so). With --check-bounds=no I get

only_bench:
  2.725 ns (0 allocations: 0 bytes)
one_idx_bench:
  1.369 ns (0 allocations: 0 bytes)
omit_bench:
  1.369 ns (0 allocations: 0 bytes)
first_bench:
  1.369 ns (0 allocations: 0 bytes)
two_ones_idx_bench:
  1.369 ns (0 allocations: 0 bytes)
splat_bench:
  4.510 μs (5 allocations: 1.75 KiB)
end_bench:
  1.369 ns (0 allocations: 0 bytes)
begin_end_bench:
  1.630 ns (0 allocations: 0 bytes)
reduce_sum_bench:
  1.960 ns (0 allocations: 0 bytes)
floor_bench:
  1.408 ns (0 allocations: 0 bytes)
cispi_bench:
  1.369 ns (0 allocations: 0 bytes)
dereference_bench:
  1.369 ns (0 allocations: 0 bytes)
getindex_bench:
  1.368 ns (0 allocations: 0 bytes)
starwars_bench:
  1.369 ns (0 allocations: 0 bytes)
rand_bench:
  9.606 ns (0 allocations: 0 bytes)

As expected, most functions now more consistently perform as well as
dereferencing. This is confirmed by the fact they are all compiled down to the
same efficient code, even the most bizarre ones like floor_bench and
starwars_bench:

julia> @code_native debuginfo=:none omit_bench(x)
        .text
        movq    (%rdi), %rax
        vmovsd  (%rax), %xmm0                   # xmm0 = mem[0],zero
        retq
        nopl    (%rax,%rax)

julia> @code_native debuginfo=:none floor_bench(x)
        .text
        movq    (%rdi), %rax
        vmovsd  (%rax), %xmm0                   # xmm0 = mem[0],zero
        retq
        nopl    (%rax,%rax)

julia> @code_native debuginfo=:none getindex_bench(x)
        .text
        movq    (%rdi), %rax
        vmovsd  (%rax), %xmm0                   # xmm0 = mem[0],zero
        retq
        nopl    (%rax,%rax)

julia> @code_native debuginfo=:none starwars_bench(x)
        .text
        movq    (%rdi), %rax
        vmovsd  (%rax), %xmm0                   # xmm0 = mem[0],zero
        retq
        nopl    (%rax,%rax)

Lessons learned

To seriously answer the question that led to this digression, I think the
first answers are all good ways to access the only element of a 1×1 array:
only(a) (which however always does a runtime check to make sure that’s the
only element of the array, incurring a performance hit), a[1], a[begin]
a[1, 1], a[], first(a)
Custom Julia data types are efficient! The fact that a custom data type with
custom indexing like
StarWarsArray has basically
the same performance as dereferencing a pointer (exactly same native code
when skipping all bounds checks) is a testament to the power and expresiveness
of the language: you can write your code in a high-level form, and have it
compiled down to very efficient code
Splatting a long array can be bad for performance, both at compile- and
run-time. Splatting is a convenient syntax, but use it with great care,
especially when you have many elements
Reduction has relatively little overhead

rand(a) performed quite bad, but actually most of the time is spent in
accessing the global default random number generator (RNG): if you care about
performance (and reproducibility, too) you should always pass in the RNG as
argument (see also The need for rand
speed by Bogumił
Kamiński):

  julia> @btime rand(x) setup=(x=rand(1,1))
  9.596 ns (0 allocations: 0 bytes)
0.08218288442687438

julia> @btime rand($(Random.default_rng()), x) setup=(x=rand(1,1))
  4.770 ns (0 allocations: 0 bytes)
0.8112313537560083

In the second benchmark we explicitly passed the default RNG to rand and can
see that avoiding accessing the global RNG saves more than 50% of runtime.
Performance is still far from ideal though. As an aside, drawing a random
element from a short tuple (<= 4 elements) is optimised, and we can get again
the same performance as derefercing a pointer for a 1-element tuple:

julia> @btime rand($(Random.default_rng()), Ref(x)[]) setup=(x=(rand(),))
  1.369 ns (0 allocations: 0 bytes)
0.4308805082102649

julia> @code_native debuginfo=:none rand(Random.default_rng(), (rand(),))
        .text
        vmovsd  (%rsi), %xmm0                   # xmm0 = mem[0],zero
        retq
        nopw    %cs:(%rax,%rax)

juliabloggers.com

A Julia Language Blog Aggregator

What you can learn from accessing an element of an array in Julia

Benchmarking

Lessons learned

Related