Author Archives: Tamás K. Papp

Branch prediction: yet another example

Re-posted from: https://tamaspapp.eu/post/branch_prediction2/

Tomas Lycken linked a very nice discussion on StackOverflow about branch prediction as a comment on the previous post. It has an intuitive explanation (read it if you like good metaphors) and some Java benchmarks. I was curious about how it looks in Julia.

The exercise is to sum elements in a vector only if they are greater than or equal to 128.

function sumabove_if(x)
    s = zero(eltype(x))
    for elt in x
        if elt ≥ 128
            s += elt
        end
    end
    s
end

This calculation naturally has a branch in it, while the branchless version, using ifelse, does not:

function sumabove_ifelse(x)
    s = zero(eltype(x))
    for elt in x
        s += ifelse(elt ≥ 128, elt, zero(eltype(x)))
    end
    s
end

The actual example has something different: using tricky bit-twiddling to calculate the same value. I generally like to leave this sort of thing up to the compiler, because it is much, much better at it than I am, and I make mistakes all the time; worse, I don’t know what I actually did when I reread the code 6 months later. But I included it here for comparison:

function sumabove_tricky(x::Vector{Int64})
    s = Int64(0)
    for elt in x
        s += ~((elt - 128) >> 63) & elt
    end
    s
end

Following the original example on StackOverflow, we sum 2^15 random integers in 1:256. For this, we don’t need to worry about overflow. We also sum the sorted vector: this will facilitate branch predicion, since the various branches will be contiguous.

I also benchmark a simple version using generators:

sumabove_generator(x) = sum(y for y in x if y ≥ 128)

Benchmarks (\(μ\)s)
	random	sorted
`if`	139	28
`ifelse`	21	21
`if` & `sort`	96	n/a
tricky	27	27
generator	219	168

Benchmarks are in the table above. Note that

for the version with if, working on sorted vectors is dramatically faster (about 5x).
the non-branching ifelse version beats them hands down, and naturally it does not care about sorting.
if you have to use if, then you are better off sorting, even if you take the time of that into account.
generators are susprisingly bad.
the tricky bit-twiddling version is actually worse than ifelse (which reinforces my aversion to it).

Self-contained code for everything is available below.

download code as code.jl

CPU pipelines: when more is less

By: Tamás K. Papp

Re-posted from: https://tamaspapp.eu/post/branch_prediction/

I have been working on micro-optimizations for some simulation code, and was reminded of a counter-intuitive artifact of modern CPU architecture, which is worth a short post.

Consider (just for the sake of example) a very simple (if not particularly meaningful) function,

\[
f(x) = \begin{cases}
(x+2)^2 & \text{if } x \ge 0,\\
1-x & \text{otherwise}
\end{cases}
\]

with implementations

f1(x) = ifelse(x ≥ 0, abs2(x+2), 1-x)
f2(x) = x ≥ 0 ? abs2(x+2) : 1-x

f1 calculates both possibilities before choosing between them with ifelse, while f2 will only calculate values on demand. So, intuitively, it should be faster.

But it isn’t…

julia> x = randn(1_000_000);

julia> using BenchmarkTools

julia> @btime f1.($x);
  664.228 μs (2 allocations: 7.63 MiB)

julia> @btime f2.($x);
  6.519 ms (2 allocations: 7.63 MiB)

…it is about 10x slower.

This can be understood as an artifact of the instruction pipeline: your x86 CPU likes to perform similar operations in staggered manner, and it does not like branches (jumps) because they break the flow.

Comparing the native code reveals that while f1 is jump-free, the if in f2 results in a jump (jae):

julia> @code_native f1(1.0)
        .text
Filename: REPL[2]
        pushq   %rbp
        movq    %rsp, %rbp
        movabsq $139862498743472, %rax  # imm = 0x7F34468E14B0
        movsd   (%rax), %xmm2           # xmm2 = mem[0],zero
Source line: 1
        addsd   %xmm0, %xmm2
        mulsd   %xmm2, %xmm2
        movabsq $139862498743480, %rax  # imm = 0x7F34468E14B8
        movsd   (%rax), %xmm3           # xmm3 = mem[0],zero
        subsd   %xmm0, %xmm3
        xorps   %xmm1, %xmm1
        cmpnlesd        %xmm0, %xmm1
        andpd   %xmm1, %xmm3
        andnpd  %xmm2, %xmm1
        orpd    %xmm3, %xmm1
        movapd  %xmm1, %xmm0
        popq    %rbp
        retq
        nopw    %cs:(%rax,%rax)

julia> @code_native f2(1.0)
        .text
Filename: REPL[3]
        pushq   %rbp
        movq    %rsp, %rbp
Source line: 1
        xorps   %xmm1, %xmm1
        ucomisd %xmm1, %xmm0
        jae     L37
        movabsq $139862498680736, %rax  # imm = 0x7F34468D1FA0
        movsd   (%rax), %xmm1           # xmm1 = mem[0],zero
        subsd   %xmm0, %xmm1
        movapd  %xmm1, %xmm0
        popq    %rbp
        retq
L37:
        movabsq $139862498680728, %rax  # imm = 0x7F34468D1F98
        addsd   (%rax), %xmm0
        mulsd   %xmm0, %xmm0
        popq    %rbp
        retq
        nopl    (%rax)

In my application the speed gain was more modest, but still sizeable. Benchmarking a non-branching version of your code is sometimes worth it, especially if it the change is simple and both branches of the conditional can be run error-free. If, for example, we had to calculate

g(x) = x ≥ 0 ? √(x+2) : 1-x

then we could not use ifelse without restricting the domain, since √(x+2) would fail whenever x < -2.

Julia Base contains many optimizations like this: for a particularly nice example see functions that use Base.null_safe_op.

Blog redesign 2.0

By: Tamás K. Papp

Re-posted from: https://tamaspapp.eu/post/blog-redesign-201709/

I have redesigned my blog (again), mainly tweaking the CSS and
hopefully achieving better support on small screens (which are still
not ideal for math, but now you should get a red warning float at the
bottom).

I also re-did the feed code so that it would render better on
juliabloggers.com. Now the whole
content of each post should be scraped seamlessly, and appear
correctly. However, the only way to test the whole toolchain is to do
it live, and I apologize if something is still not perfect and you get
bogus updates (BTW, this post should not show up in the Julia feed).

Highlights of the changes:

responsive design,
nicer fonts,
line breaks in MathJax when necessary and supported,
better highlighting (Julia now looks especially nice),
embedded code blocks with a download link,
better image placement,
Emacs screenshots now use a branch of emacs-htmlize, which hopefully gets merged soon.

As always, the source code for the whole site is available.