By: Mosè Giordano
Re-posted from: http://giordano.github.io/blog/2021-03-25-access-array/
This innocent question from a newcomer in the official Slack workspace of the
Julia programming language

> How can I extract the `Float64` element from a 1×1 `Array{Float64, 2}`?

spurred a series of more or less serious answers:
- use the `only` function: `only(a)`
- get the element with index 1: `a[1]`
- omit all indices: `a[]`
- use the `first` function: `first(a)`
- get the element with index 1, 1: `a[1, 1]`
- use the `last` function: `last(a)`
- get the element with index 1, 1, 1, 1, 1, …: `a[ones(Int, 100)...]` (the
  `...` operator is called splatting)
- get the last element with the special `end` index (`end` here is
  automatically lowered to `lastindex(a)`): `a[end]`
- get the element in the first row and last column: `a[begin, end]`
- reduce the array with the `sum` function: `reduce(sum, a)`
- get the element with index `floor(Int, -real(exp(π * im)))`:
  `a[floor(Int, -real(exp(π * im)))]`, or the variant
  `a[floor(Int, -real(cispi(1)))]`
- dereference the pointer of `a` (remember?): `unsafe_load(pointer(a))`
- use the `StarWarsArrays.jl` package and get the fourth element (mentioned in
  the post about random indexing): `StarWarsArray(a)[4]`
- use the `getindex` function (`a[n, ...]` is syntactic sugar for
  `getindex(a, n, ...)`): `getindex(a, firstindex(a))`
- get a random element from the array: `rand(a)`
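As a quick sanity check (not part of the original thread), we can verify that these answers all agree on a 1×1 matrix; the `StarWarsArrays.jl` entry is skipped here because it is an unregistered package:

```julia
# All the answers above should return the single element of a 1×1 matrix.
a = rand(1, 1)

results = [only(a), a[1], a[], first(a), a[1, 1], last(a),
           a[ones(Int, 100)...],  # trailing indices beyond ndims must be 1
           a[end], a[begin, end],
           reduce(sum, a),        # a one-element reduction returns the element itself
           a[floor(Int, -real(exp(π * im)))],
           unsafe_load(pointer(a)),
           getindex(a, firstindex(a)),
           rand(a)]               # with a single element, rand can only pick it

@assert all(==(a[1]), results)
```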
## Benchmarking
We can compare the performance of all these versions. To do this, we can use
the BenchmarkTools.jl package (benchmarks kindly provided by @Seelengrab):
```julia
using BenchmarkTools # to install it: ]add BenchmarkTools
using StarWarsArrays # to install it: ]add https://github.com/giordano/StarWarsArrays.jl
using Random # it's a standard library

only_bench(x) = only(x)
one_idx_bench(x) = x[1]
omit_bench(x) = x[]
first_bench(x) = first(x)
two_ones_idx_bench(x) = x[1, 1]
splat_bench(x) = x[ones(Int, 100)...]
end_bench(x) = x[end]
begin_end_bench(x) = x[begin, end]
reduce_sum_bench(x) = reduce(sum, x)
floor_bench(x) = x[floor(Int, -real(exp(π * im)))]
cispi_bench(x) = x[floor(Int, -real(cispi(1)))]
dereference_bench(x) = unsafe_load(pointer(x))
getindex_bench(x) = getindex(x, firstindex(x))
starwars_bench(x) = StarWarsArray(x)[4] # ಠ_ಠ
rand_bench(x) = rand(x)

for f in [
    only_bench,
    one_idx_bench,
    omit_bench,
    first_bench,
    two_ones_idx_bench,
    splat_bench,
    end_bench,
    begin_end_bench,
    reduce_sum_bench,
    floor_bench,
    cispi_bench,
    dereference_bench,
    getindex_bench,
    starwars_bench,
    rand_bench,
]
    println(f, ':')
    @btime $f(x) setup=(x=rand(1,1))
end
```
On my computer I get
```
only_bench:
  2.800 ns (0 allocations: 0 bytes)
one_idx_bench:
  1.631 ns (0 allocations: 0 bytes)
omit_bench:
  2.796 ns (0 allocations: 0 bytes)
first_bench:
  1.630 ns (0 allocations: 0 bytes)
two_ones_idx_bench:
  1.685 ns (0 allocations: 0 bytes)
splat_bench:
  4.524 μs (5 allocations: 1.75 KiB)
end_bench:
  1.378 ns (0 allocations: 0 bytes)
begin_end_bench:
  1.644 ns (0 allocations: 0 bytes)
reduce_sum_bench:
  1.901 ns (0 allocations: 0 bytes)
floor_bench:
  1.630 ns (0 allocations: 0 bytes)
cispi_bench:
  1.631 ns (0 allocations: 0 bytes)
dereference_bench:
  1.369 ns (0 allocations: 0 bytes)
getindex_bench:
  1.371 ns (0 allocations: 0 bytes)
starwars_bench:
  1.371 ns (0 allocations: 0 bytes)
rand_bench:
  9.599 ns (0 allocations: 0 bytes)
```
Unsurprisingly, the fastest solution is dereferencing a pointer, which compiles
down to the following code:

```julia
julia> x = rand(1,1);

julia> @code_native debuginfo=:none dereference_bench(x)
        .text
        movq    (%rdi), %rax
        vmovsd  (%rax), %xmm0           # xmm0 = mem[0],zero
        retq
        nopl    (%rax,%rax)
```
This is pretty much what you’d expect from the corresponding C
code.
What’s remarkable is that the StarWarsArrays.jl package achieves about the same
performance: accessing an element of a custom data structure with funny custom
indexing, written without performance in mind, is just as efficient as
dereferencing a pointer.
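To get a feel for how a custom-indexed array can still compile down to a plain memory access, here is a minimal sketch; this is not the actual StarWarsArrays.jl implementation, and the `FromFour` type with its 4-based indexing is made up for illustration:

```julia
# A hypothetical vector type whose first index is 4 instead of 1.
struct FromFour{T} <: AbstractVector{T}
    parent::Vector{T}
end

# Custom axes: valid indices run from 4 to 3 + length of the parent.
Base.size(a::FromFour) = size(a.parent)
Base.axes(a::FromFour) = (4:3 + length(a.parent),)
# Map our indices back to the parent's 1-based indices.
Base.getindex(a::FromFour, i::Int) = a.parent[i - 3]

v = FromFour([10.0, 20.0, 30.0])
@assert v[4] == 10.0  # first element
@assert v[6] == 30.0  # last element
```

The index arithmetic `i - 3` is a compile-time constant shift, so after inlining the compiler can fold it away, which is why such wrappers cost nothing at runtime.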
However, subnanosecond differences aren’t credible: the functions that clocked
in the 1.3–1.6 nanosecond range have basically the same performance. If you
run the benchmarks again you’ll find that their timings get more or less
close to that of dereferencing. The difference from dereferencing is that all
the other functions do bounds checking, which on average leads to slightly
longer runtimes. However, in high-performance computing you usually want to
make sure you are within bounds at a higher level and skip bounds checks when
accessing individual elements. If you’re 100% sure that the element of the
array you want to access is within its bounds, you can use the `@inbounds`
macro to skip runtime bounds checks.
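As a small sketch of the difference (the names `checked_first` and `unchecked_first` are made up for this example):

```julia
a = rand(1, 1)

# Checked access: an invalid index throws a BoundsError at runtime.
checked_first(x) = x[1]

# Unchecked access: @inbounds asks the compiler to elide the bounds check.
# If the index were actually out of bounds the behaviour would be undefined,
# so only use it when validity has been established beforehand.
unchecked_first(x) = @inbounds x[1]

@assert checked_first(a) == unchecked_first(a)

# The checked version still guards against invalid accesses:
@assert try
    checked_first(Float64[])
    false
catch err
    err isa BoundsError
end
```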
To see the performance when skipping bounds checks in the above functions, you
can either use `@inbounds` in their definitions (e.g.,
`one_idx_bench(x) = @inbounds x[1]`, etc…), or start Julia with the
`--check-bounds=no` flag (this flag is not recommended for general use, as it
will skip all bounds checks, not only where you know it’s safe to do so).
With `--check-bounds=no` I get
```
only_bench:
  2.725 ns (0 allocations: 0 bytes)
one_idx_bench:
  1.369 ns (0 allocations: 0 bytes)
omit_bench:
  1.369 ns (0 allocations: 0 bytes)
first_bench:
  1.369 ns (0 allocations: 0 bytes)
two_ones_idx_bench:
  1.369 ns (0 allocations: 0 bytes)
splat_bench:
  4.510 μs (5 allocations: 1.75 KiB)
end_bench:
  1.369 ns (0 allocations: 0 bytes)
begin_end_bench:
  1.630 ns (0 allocations: 0 bytes)
reduce_sum_bench:
  1.960 ns (0 allocations: 0 bytes)
floor_bench:
  1.408 ns (0 allocations: 0 bytes)
cispi_bench:
  1.369 ns (0 allocations: 0 bytes)
dereference_bench:
  1.369 ns (0 allocations: 0 bytes)
getindex_bench:
  1.368 ns (0 allocations: 0 bytes)
starwars_bench:
  1.369 ns (0 allocations: 0 bytes)
rand_bench:
  9.606 ns (0 allocations: 0 bytes)
```
As expected, most functions now more consistently perform as well as
dereferencing. This is confirmed by the fact that they are all compiled down
to the same efficient code, even the most bizarre ones like `floor_bench` and
`starwars_bench`:
```julia
julia> @code_native debuginfo=:none omit_bench(x)
        .text
        movq    (%rdi), %rax
        vmovsd  (%rax), %xmm0           # xmm0 = mem[0],zero
        retq
        nopl    (%rax,%rax)

julia> @code_native debuginfo=:none floor_bench(x)
        .text
        movq    (%rdi), %rax
        vmovsd  (%rax), %xmm0           # xmm0 = mem[0],zero
        retq
        nopl    (%rax,%rax)

julia> @code_native debuginfo=:none getindex_bench(x)
        .text
        movq    (%rdi), %rax
        vmovsd  (%rax), %xmm0           # xmm0 = mem[0],zero
        retq
        nopl    (%rax,%rax)

julia> @code_native debuginfo=:none starwars_bench(x)
        .text
        movq    (%rdi), %rax
        vmovsd  (%rax), %xmm0           # xmm0 = mem[0],zero
        retq
        nopl    (%rax,%rax)
```
## Lessons learned
- To seriously answer the question that led to this digression, I think the
  first answers are all good ways to access the only element of a 1×1 array:
  `only(a)` (which however always does a runtime check to make sure that’s the
  only element of the array, incurring a small performance hit), `a[1]`,
  `a[begin]`, `a[1, 1]`, `a[]`, `first(a)`
- Custom Julia data types are efficient! The fact that a custom data type with
  custom indexing like `StarWarsArray` has basically the same performance as
  dereferencing a pointer (exactly the same native code when skipping all
  bounds checks) is a testament to the power and expressiveness of the
  language: you can write your code in a high-level form and have it compiled
  down to very efficient code
- Splatting a long array can be bad for performance, both at compile time and
  run time. Splatting is a convenient syntax, but use it with great care,
  especially when you have many elements
- Reduction has relatively little overhead
- `rand(a)` performed quite badly, but actually most of the time is spent
  accessing the global default random number generator (RNG): if you care
  about performance (and reproducibility, too) you should always pass the RNG
  as an argument (see also "The need for rand speed" by Bogumił Kamiński):

```julia
julia> @btime rand(x) setup=(x=rand(1,1))
  9.596 ns (0 allocations: 0 bytes)
0.08218288442687438

julia> @btime rand($(Random.default_rng()), x) setup=(x=rand(1,1))
  4.770 ns (0 allocations: 0 bytes)
0.8112313537560083
```
In the second benchmark we explicitly passed the default RNG to `rand` and can
see that avoiding the access to the global RNG saves more than 50% of the
runtime. Performance is still far from ideal, though. As an aside, drawing a
random element from a short tuple (≤ 4 elements) is optimised, and for a
1-element tuple we can again get the same performance as dereferencing a
pointer:

```julia
julia> @btime rand($(Random.default_rng()), Ref(x)[]) setup=(x=(rand(),))
  1.369 ns (0 allocations: 0 bytes)
0.4308805082102649

julia> @code_native debuginfo=:none rand(Random.default_rng(), (rand(),))
        .text
        vmovsd  (%rsi), %xmm0           # xmm0 = mem[0],zero
        retq
        nopw    %cs:(%rax,%rax)
```