Author Archives: Tim Besard

CUDA.jl 3.0

By: Tim Besard

Re-posted from: https://juliagpu.org/post/2021-04-09-cuda_3.0/index.html

CUDA.jl 3.0 is a significant, semi-breaking release that features greatly improved multi-tasking and multi-threading, support for CUDA 11.2 and its new memory allocator, compiler tooling for GPU method overrides, device-side random number generation and a completely revamped cuDNN interface.

Improved multi-tasking and multi-threading

Before this release, CUDA operations were enqueued on a single global stream, and many of these operations (like copying memory, or synchronizing execution) were fully blocking. This posed difficulties when using multiple tasks to perform independent operations: blocking operations prevented any task from making progress, and sharing a single stream introduced unintended dependencies between otherwise independent operations. CUDA.jl now uses a private stream for each Julia task, and avoids blocking operations where possible, enabling task-based concurrent execution. It is also possible to use a different device on each task, and there is experimental support for executing those tasks from different threads.

A snippet of code is worth a thousand words, so let's demonstrate using a computation that involves both a library function (GEMM from CUBLAS) and a native Julia broadcast kernel:

using CUDA, LinearAlgebra

function compute(a,b,c)
    mul!(c, a, b)
    broadcast!(sin, c, c)
    synchronize()
    c
end

To execute multiple invocations of this function concurrently, we can simply use Julia's task-based programming interfaces and wrap each call to compute in an @async block. Then, we synchronize execution again by wrapping in a @sync block:

function iteration(a,b,c)
    results = Vector{Any}(undef, 2)
    NVTX.@range "computation" @sync begin
        @async begin
            results[1] = compute(a,b,c)
        end
        @async begin
            results[2] = compute(a,b,c)
        end
    end
    NVTX.@range "comparison" Array(results[1]) == Array(results[2])
end

The calls to the @range macro from NVTX, a submodule of CUDA.jl, will visualize the different phases of execution when we profile our program. We now invoke our function using some random data:

function main(N=1024)
    a = CUDA.rand(N,N)
    b = CUDA.rand(N,N)
    c = CUDA.rand(N,N)

    # make sure this data can be used by other tasks!
    synchronize()

    # warm-up
    iteration(a,b,c)
    GC.gc(true)

    NVTX.@range "main" iteration(a,b,c)
end

The snippet above illustrates one breaking aspect of this release: Because each task uses its own stream, you now need to synchronize when re-using data in another task. Although it is unlikely that any user code was relying on the old behavior, it is technically a breaking change, and as such we are bumping the major version of the CUDA.jl package.

If we profile our program using NSight Systems, we can see how the execution of both calls to compute was overlapped:

Overlapping execution on the GPU using task-based concurrency

The region highlighted in green was spent enqueueing operations from the CPU, which includes the call to synchronize(). This used to be a blocking operation, whereas now it only synchronizes the task-local stream while yielding to the Julia scheduler so that it can continue execution on another task. For synchronizing the entire device, use the new device_synchronize() function.

The remainder of computation was then spent executing kernels. Here, execution was overlapped, but that obviously depends on the exact characteristics of the computations and your GPU. Also note that copying to and from the CPU is always going to block for some time, unless the memory was page-locked. CUDA.jl now supports locking memory like that using the pin function; for more details refer to the CUDA.jl documentation on tasks and threads.
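As a minimal sketch (assuming a CUDA-capable GPU is available; the array names are illustrative), pinning a host buffer before transferring it might look like:

```julia
using CUDA

# Allocate an ordinary host array and page-lock ("pin") it, so that
# copies to and from the GPU can proceed asynchronously instead of
# blocking the task-local stream.
host = rand(Float32, 1024)
CUDA.pin(host)

dev = CuArray{Float32}(undef, 1024)
copyto!(dev, host)   # upload from pinned memory
copyto!(host, dev)   # download back into the same pinned buffer
```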

CUDA 11.2 and stream-ordered allocations

CUDA.jl now also fully supports CUDA 11.2, and it will default to using that version of the toolkit if your driver supports it. The release came with several new features, such as the new stream-ordered memory allocator. Without going into details, it is now possible to asynchronously allocate memory, obviating much of the need to cache those allocations in a memory pool. Initial benchmarks have shown nice speed-ups from using this allocator, while lowering memory pressure and thus reducing invocations of the Julia garbage collector.

When using CUDA 11.2, CUDA.jl will default to the CUDA-backed memory pool and disable its own caching layer. If you want to compare performance, you can still use the old allocator and caching memory pool by setting the JULIA_CUDA_MEMORY_POOL environment variable to, e.g. binned. On older versions of CUDA, the binned pool is still used by default.
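For instance, opting back into the old pool could look like this (a configuration sketch; the variable must be set before CUDA.jl is initialized):

```julia
# Select the binned caching pool instead of the CUDA 11.2 stream-ordered
# allocator; this must happen before the package initializes, e.g. in
# startup.jl or in the shell environment.
ENV["JULIA_CUDA_MEMORY_POOL"] = "binned"

using CUDA
```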

GPU method overrides

With the new AbstractInterpreter functionality in Julia 1.6, it is now much easier to further customize the Base compiler. This has enabled us to develop a mechanism for overriding methods with GPU-specific counterparts. It used to be required to explicitly pick CUDA-specific versions, e.g. CUDA.sin, because the Base version performed some GPU-incompatible operation. This was problematic as it did not compose with generic code, and the CUDA-specific versions often lacked support for specific combinations of argument types (for example, CUDA.sin(::Complex) was not supported).

With CUDA 3.0, it is possible to define GPU-specific methods that override an existing definition, without requiring a new function type. For now, this functionality is private to CUDA.jl, but we expect to make it available to other packages starting with Julia 1.7.

This functionality has unblocked many issues, as can be seen in the corresponding pull request. It is now no longer needed to prefix a call with the CUDA module to ensure a GPU-compatible version is used. Furthermore, it also protects users from accidentally calling GPU intrinsics, as doing so will now result in an error instead of a crash:

julia> CUDA.saturate(1f0)
ERROR: This function is not intended for use on the CPU
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] saturate(x::Float32)
   @ CUDA ~/Julia/pkg/CUDA/src/device/intrinsics.jl:23
 [3] top-level scope
   @ REPL[10]:1

Device-side random number generation

As an illustration of the value of GPU method overrides, CUDA.jl now provides a device-side random number generator that is accessible by simply calling rand() from a kernel:

julia> function kernel()
         @cushow rand()
         return
       end
kernel (generic function with 1 method)

julia> @cuda kernel()
rand() = 0.668274

This works by overriding the Random.default_rng() method to provide a GPU-compatible random number generator: building on exploratory work by @S-D-R, the current generator is a maximally equidistributed combined Tausworthe RNG that shares 32 bytes of random state across the threads in a warp for performance. The generator performs well, but does not pass the Crush battery of tests, so PRs to improve the implementation are welcome!
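CUDA.jl's device-side generator is more involved (it shares state across a warp), but the underlying technique can be sketched on the CPU using L'Ecuyer's classic taus88, a combined Tausworthe generator with three 32-bit components (an illustrative sketch, not CUDA.jl's actual implementation):

```julia
# CPU sketch of a combined Tausworthe generator (taus88, L'Ecuyer 1996).
# Three small Tausworthe components are stepped independently and their
# outputs XORed together to improve equidistribution.
mutable struct Taus88
    s1::UInt32
    s2::UInt32
    s3::UInt32
end

# seeds must be larger than 1, 7 and 15, respectively
Taus88() = Taus88(0x00000012, 0x00000345, 0x00006789)

function next!(r::Taus88)
    b    = ((r.s1 << 13) ⊻ r.s1) >> 19
    r.s1 = ((r.s1 & 0xfffffffe) << 12) ⊻ b
    b    = ((r.s2 << 2) ⊻ r.s2) >> 25
    r.s2 = ((r.s2 & 0xfffffff8) << 4) ⊻ b
    b    = ((r.s3 << 3) ⊻ r.s3) >> 11
    r.s3 = ((r.s3 & 0xfffffff0) << 17) ⊻ b
    return r.s1 ⊻ r.s2 ⊻ r.s3
end

# uniform Float32 in [0, 1), using the upper 24 bits of the output
randf(r::Taus88) = (next!(r) >> 8) * 2.0f0^-24
```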

Note that for host-side operations, e.g. rand!(::CuArray), the generator is not yet used by default. Instead, we use CURAND whenever possible, and fall back to the slower but more full-featured GPUArrays.jl-generator in other cases.

Revamped cuDNN interface

Finally, the cuDNN wrappers have been completely revamped by @denizyuret. The goal of the redesign is to more faithfully map the cuDNN API to more natural Julia functions, so that packages like Knet.jl or NNlib.jl can more easily use advanced cuDNN features without having to resort to low-level C calls. For more details, refer to the design document. As part of this redesign, the high-level wrappers of CUDNN have been moved to a subpackage of NNlib.jl.

Bugs in Julia with AFL and C-Reduce

By: Tim Besard

Re-posted from: https://blog.maleadt.net/2018/11/16/julia_bugs/

Recently, I’ve been looking at some existing tools to efficiently find bugs in
the Julia language implementation: American Fuzzy
Lop
(AFL) for fuzzing the compiler, and
C-Reduce for reducing test cases. In this
post, I’ll detail the set-up and use of these tools. Skip to the bottom
of this post for a demonstration.

American Fuzzy Lop

Fuzzing Julia used to be impractical due to the significant start-up time of the
Julia VM, rendering typical fuzzing approaches inefficient and calling for
specialized tools such as
TypeFuzz.jl. However, start-up
latency has been improving lately, and AFL now has some features to cope with
slow binaries.

Quick start

Working with AFL 2.52b, the naive approach of compiling Julia with afl-gcc
and afl-g++ (by setting the CC and CXX build variables) results in a
functioning julia binary. This only instruments the parts of the compiler
that are written in C/C++; code written in Julia is not covered.

To trap interesting errors, I built with LLVM_ASSERTIONS=1 and applied the
following patch to at least detect errors in the Julia parts of the compiler:

--- a/julia/src/gf.c
+++ b/julia/src/gf.c
@@ -280,10 +280,11 @@ jl_code_info_t *jl_type_infer(...)
 jl_printf(JL_STDERR, "Internal error: encountered unexpected error in runtime:\n");
 jl_static_show(JL_STDERR, jl_current_exception());
 jl_printf(JL_STDERR, "\n");
 jlbacktrace(); // written to STDERR_FILENO
 linfo_src = NULL;
+abort();

Running the instrumented binary with afl-fuzz crashes the AFL forkserver. This
is due to an incompatibility with the LD_BIND_NOW linker option that AFL uses.
It can be disabled by specifying LD_BIND_LAZY=1 before running the fuzzer.
Furthermore, Julia preallocates quite a bit of memory, and interesting
test cases might take a while to infer and compile, so I specify -m 2048
to use up to 2G per process and -t 5000 to let them run for up to 5 seconds:

$ LD_BIND_LAZY=1 afl-fuzz -i inputs/ -o outputs/
                          -m 2048 -t 5000 -- julia @@
afl-fuzz 2.52b by <lcamtuf@google.com>
[+] You have 16 CPU cores and 2 runnable tasks (utilization: 12%).
[+] Try parallel jobs - see docs/parallel_fuzzing.txt.
[*] Checking CPU core loadout...
[+] Found a free CPU core, binding to #4.
[*] Checking core_pattern...
[*] Checking CPU scaling governor...
[*] Setting up output directories...
[*] Scanning 'inputs'...
[+] No auto-generated dictionary tokens to reuse.
[*] Creating hard links for all input files...
[*] Validating target binary...
[+] Deferred forkserver binary detected.
[*] Attempting dry run with 'id:000000,orig:binary_trees.jl'...
[*] Spinning up the fork server...
[+] All right - fork server is up.
    len = 1324, map size = 12396, exec speed = 249839 us
[+] All test cases processed.
[+] All set and ready to roll!

Sadly, performance is pretty low: between 0.2 executions per second for
inputs that error (without crashing) and 2 for those that succeed.

Let’s try and optimize that.

Only instrument the compiler

A first trivial optimization is to only instrument code that matters.
Specifically, compiling dependencies (such as LLVM and OpenBLAS) with the system
compiler bumps the speed up a notch:

julia$ make -C deps
julia$ make CC=afl/afl-gcc CXX=afl/afl-g++

Fail hard and fast

Fuzzing code results in lots of invalid sources getting processed by the Julia
compiler. Each of these results in a nice exception, which relies on expensive
stack unwinding to provide backtraces. We can disable this functionality by
building with DISABLE_LIBUNWIND=1.

Another costly element of errors is displaying them, as this code is not part of
the precompiled system image. This functionality, courtesy of the
display_error function in julia/base/client.jl, is easily disabled by
preloading a file that redefines the relevant methods:

$ cat fuzz.jl
Base.display_error(er, bt) = println("ERROR")
$ julia -L fuzz.jl -e "error()"
ERROR

The above also applies to deprecation warnings, but those can be disabled more
easily by passing --depwarn=no to Julia.

Deferred fork server

Much of the time launching Julia is spent setting up libraries, initializing
code generation, etc. This recurring cost can be avoided by using a deferred
fork server, where AFL starts the binary at a manually-specified point during
execution. In the case of Julia, this would be right before loading the input
file:

--- a/julia/ui/repl.c
+++ b/julia/ui/repl.c
@@ -89,6 +89,10 @@ static NOINLINE int true_main(int argc, char *argv[])
     jl_function_t *start_client = jl_base_module ?
         (jl_function_t*)jl_get_global(jl_base_module, jl_symbol("_start")) : NULL;
 
+#ifdef __AFL_HAVE_MANUAL_CONTROL
+  __AFL_INIT();
+#endif
+
     if (start_client) {

This approach relies on the binary being compiled with AFL’s LLVM instrumenter,
available as afl-clang-fast and afl-clang-fast++ for compiling respectively
C and C++ code, again specified using the CC and CXX build variables. Note
that this instrumentation seems incompatible with OpenBLAS, so if you were to
instrument dependencies you would need to compile that library using a regular
compiler.

Use of the deferred fork server necessitated several changes. For one, the
symbols AFL defines need to be exported globally:

--- a/julia/src/julia.expmap
+++ b/julia/src/julia.expmap
@@ -1,5 +1,6 @@
 {
   global:
+    __afl*;

Furthermore, despite the __afl_manual_init symbol that enables the deferred
fork server being part of the Julia binary, AFL does not pick it up and defaults
to regular forking. I did not debug this issue, and instead patched AFL to
unconditionally enter deferred mode:

--- a/afl/afl-fuzz.c
+++ b/afl/afl-fuzz.c
@@ -6950,7 +6950,7 @@ EXP_ST void check_binary(u8* fname) {
-  if (memmem(f_data, f_len, DEFER_SIG, strlen(DEFER_SIG) + 1)) {
+  if (1) {
 
     OKF(cPIN "Deferred forkserver binary detected.")

You should now see the following in the afl-fuzz output:

[+] Deferred forkserver binary detected.

Note that the AFL documentation suggests putting the __AFL_INIT call before
the initialization of important threads, as they don’t survive the fork. The
patch above does not do that, as threads are spawned by restore_signals as
well as different module initializers. To be safe, I built with
JULIA_THREADS=0, configured LLVM with ENABLE_THREADS unset and executed with
OPENBLAS_NUM_THREADS=1 to reduce possible issues. Still, your mileage might
vary and you might have to move the initialization point further upwards.

Multithreaded fuzzing

With the above fixes, we go from 0.2-2 executions/s to 3-30; still considered
slow but a significant improvement nonetheless. Luckily, AFL scales pretty well
and can be easily executed across all threads of a system. This brings the final
execution speed to around 500 per second.

Now let’s focus on improving fuzzing quality.

Instrumenting Julia code

AFL can more accurately measure the impact of changes to inputs if all relevant
code is instrumented. Nowadays, part of the Julia compiler is written in Julia,
and compiled during initial bootstrap of the julia binary. Julia code is
compiled with LLVM, so we can re-use the LLVM-based instrumentation passes that
are part of AFL. However, Julia uses its own (heavily patched) version of LLVM,
so we need to recompile AFL and link against Julia’s LLVM library:

afl/llvm_mode$ PATH=julia/usr/tools:$PATH make
# building the tests will fail

In order to use this pass, which is now linked against Julia’s LLVM, we need a
compatible build of Clang that can load the instrumentation pass. We can do so
by rebuilding Julia’s copy of LLVM with the BUILD_LLVM_CLANG variable set:

julia$ make -C deps BUILD_LLVM_CLANG=1

# we can now run AFL's tests
afl/llvm_mode$ PATH=julia/usr/tools:$PATH make

AFL needs to be modified to expose its instrumentation pass:

--- a/afl/llvm_mode/afl-llvm-pass.so.cc
+++ b/afl/llvm_mode/afl-llvm-pass.so.cc
@@ -177,6 +177,10 @@
+namespace llvm {
+Pass *createAFLCoveragePass() {
+  return new AFLCoverage();
+}
+};

The following changes make Julia use these instrumentation passes:

--- a/julia/src/Makefile
+++ b/julia/src/Makefile
@@ -95,6 +95,8 @@
 endif # USE_LLVM_SHLIB == 1
 endif

+LLVMLINK += afl/afl-llvm-pass.so
--- a/julia/src/jitlayers.cpp
+++ b/julia/src/jitlayers.cpp
@@ -32,6 +32,7 @@
 namespace llvm {
     extern Pass *createLowerSimdLoopPass();
+    extern Pass *createAFLCoveragePass();
 }

@@ -230,6 +231,8 @@ void addOptimizationPasses(...)
+    PM->add(createAFLCoveragePass());
 }

Now build Julia while pointing AFL to the compatible clang binary:

julia$ export AFL_CC=julia/usr/tools/clang
julia$ export AFL_CXX=julia/usr/tools/clang++
julia$ make CC=afl/afl-clang-fast CXX=afl/afl-clang-fast++

For some reason, compiled Julia code uses AFL’s afl_prev_loc TLS variable
differently than how it is defined in the containing binary (instead using
__emutls_v.__afl_prev_loc). I worked around this issue by changing the
definition to a regular variable, since we’re not using threads anyway:

--- a/afl/llvm_mode/afl-llvm-rt.o.c
+++ b/afl/llvm_mode/afl-llvm-rt.o.c
@@ -52,7 +52,7 @@
-__thread u32 __afl_prev_loc;
+u32 __afl_prev_loc;
--- a/afl/llvm_mode/afl-llvm-pass.so.cc
+++ b/afl/llvm_mode/afl-llvm-pass.so.cc
@@ -101,8 +101,7 @@ bool AFLCoverage::runOnModule(Module &M) {
 GlobalVariable *AFLPrevLoc = new GlobalVariable(
-M, Int32Ty, false, GlobalValue::ExternalLinkage, 0, "__afl_prev_loc",
-0, GlobalVariable::GeneralDynamicTLSModel, 0, false);
+M, Int32Ty, false, GlobalValue::ExternalLinkage, 0, "__afl_prev_loc");

Make sure to rebuild all of Julia after applying this patch.

We can now compare the generated code with and without instrumentation:

--- julia-original     -e "@code_llvm sin(1.)"
+++ julia-instrumented -e "@code_llvm sin(1.)"
 L4:                                               ; preds = %top
 ; Location: special/trig.jl:32
 ; Function <; {
 ; Location: float.jl:452
-  %4 = fcmp uge double %2, 0x3E50000000000000
+  %10 = load i32, i32* @__afl_prev_loc
+  %11 = load i8*, i8** @__afl_area_ptr
+  %12 = xor i32 %10, 56933
+  %13 = getelementptr i8, i8* %11, i32 %12
+  %14 = load i8, i8* %13
+  %15 = add i8 %14, 1
+  store i8 %15, i8* %13
+  store i32 28466, i32* @__afl_prev_loc
+  %16 = fcmp uge double %8, 0x3E50000000000000
 ;}

Dictionary for Julia code

A dictionary makes it easier for AFL to explore verbose grammars such as
programming languages. I created a dictionary for
Julia
, with
most keywords, important operators, and common syntax such as values and
definitions. You can use this dictionary by passing -x julia.dict to
afl-fuzz.
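AFL dictionaries are plain text files of named, quoted tokens; a few entries to give the flavor (hypothetical contents, not the actual julia.dict):

```
keyword_function="function"
keyword_struct="struct"
keyword_end="end"
op_pair="=>"
val_float="1.0f0"
```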

Test-case reduction

Although AFL provides a test-case minimizer (afl-tmin), I decided to look into
an independent tool. C-Reduce is such a
tool, blindly reducing test cases based on a user-specified condition. As such,
it is more flexible, and independent of the complicated set-up above.

Using C-Reduce is fairly simple: provide a script whose return code indicates
the state of the current test case, and run it over some files. I’ve created
some scripts that improve integration with Julia and its code loading mechanism:
maleadt/creduce_julia. For more information, refer to the repository’s README,
or have a look at this Discourse thread.

Demo

After running AFL for about a day on a 16-core machine, afl-whatsup reported
the following:

$ afl-whatsup -s outputs/
status check tool for afl-fuzz by <lcamtuf@google.com>

Summary stats
=============

       Fuzzers alive : 16
      Total run time : 13 days, 9 hours
         Total execs : 3 million
    Cumulative speed : 42 execs/sec
       Pending paths : 2023 faves, 62710 total
  Pending per fuzzer : 126 faves, 3919 total (on average)
       Crashes found : 1 locally unique

We found a crash! The actual input is written in one of the crashes
subdirectories, and contained the following code:

abstract type BTree end

mutable struct Empty <: BTree
end

mutable struct Node <: BTree
  info
  left::BTree
  right::BTree
end

function make(val, d)
  if d == 0
    Node(val, Empty(), Empty())
  else
    nval = val * 2
    Node(val, make(nval-1, d-1), make(nval, d-1))
  end
end

check(t::Empty) = 0
check(t::Node) = t.info + check(t.left) - check(t.right)

function loop_depths(d, min_depth, max_depth)
  for i = 0:div(max_depth - d, 2)
    niter = 1 << (max_depth - d + min_depth)
    c = 0
    for j = 1:niter
      c += check(make(i, d)) + check(make(-i, d))
    end,#@printf("%i\t trees of depth %i\t check: %i\n", 2*niter, d, c)
    d += 2
  end
end

const N=10
min_depth = 4
max_depth = N
stretch_depth = max_depth + 1

let c = check(make(0, stretch_depth))
  #@printf("stretch tree of depth %i\t check: %i\n", stretch_depth, c)
end

long_lived_tree = make(0, max_depth)

loop_depths(min_depth, min_depth, max_depth)

As you may notice, I used the BaseBenchmarks.jl shootout benchmarks as inputs
to AFL. Pasting this code in a Julia REPL indeed crashes the compiler:

Internal error: encountered unexpected error in runtime:
BoundsError(a=Array{Core.Compiler.NewNode, (0,)}[], i=(3,))

signal (4): Illegal instruction

To reduce this code, I check out
maleadt/creduce_julia and use the
crashing code as the initial input. We need to make sure that the run script
successfully catches the error condition; in this case it is sufficient to
grep for encountered unexpected error in runtime:

--- a/creduce_julia/run
+++ b/creduce_julia/run
@@ -30,4 +30,4 @@ JULIA_LOAD_PATH=$TEMP \
-julia main.jl |& grep "example error message"
+julia main.jl |& grep "encountered unexpected error in runtime"

creduce_julia$ ./run
Internal error: encountered unexpected error in runtime:

creduce_julia$ echo $?
0

Finally, we run the reduce script to kick-off the process. It will
automatically work in parallel on all available cores:

creduce_julia$ ./reduce
===< 30926 >===
running 8 interestingness tests in parallel
...
===================== done ====================

pass statistics:
  method pass_balanced :: parens-inside worked 1 times and failed 2 times
  method pass_indent :: final worked 1 times and failed 0 times
  method pass_clex :: rm-toks-3 worked 1 times and failed 61 times
  method pass_lines :: 4 worked 2 times and failed 50 times
  method pass_lines :: 1 worked 2 times and failed 50 times
  method pass_clex :: rename-toks worked 2 times and failed 3 times
  method pass_blank :: 0 worked 2 times and failed 0 times
  method pass_indent :: regular worked 2 times and failed 0 times
  method pass_lines :: 8 worked 2 times and failed 50 times
  method pass_lines :: 10 worked 2 times and failed 50 times
  method pass_lines :: 6 worked 2 times and failed 50 times
  method pass_lines :: 3 worked 2 times and failed 50 times
  method pass_clex :: rm-toks-2 worked 2 times and failed 64 times
  method pass_lines :: 2 worked 2 times and failed 50 times
  method pass_balanced :: parens-to-zero worked 2 times and failed 6 times
  method pass_clex :: rm-toks-4 worked 3 times and failed 49 times
  method pass_clex :: rm-toks-13 worked 3 times and failed 18 times
  method pass_lines :: 0 worked 15 times and failed 78 times
  method pass_clex :: rm-toks-1 worked 16 times and failed 68 times

The above took about 2 minutes on a 4-core workstation. Doing some manual
clean-up, we end up with the issue as reported in
JuliaLang/julia#30062:

for a = 1 end, b += 2

Future work

Although a tool like AFL isn’t ideal
for fuzzing a compiler as done in this blog post, the technique and its set-up
can be used to fuzz other libraries and tools that are written in Julia (e.g.
HTTP.jl or Images.jl). That would of course require productionising this
proof-of-concept and its many workarounds.

A hypothetical AFL.jl package would:

Fuzzing the compiler itself could also be improved:

  • deferring the fork server even more: there’s significant initialization done
    by Julia code, and the manual initialization point could move right before the
    input eval loop
  • persistent fuzzing: having AFL input new test cases without restarting the
    process. Although Julia is far from stateless, AFL could be modified to verify
    crashes in a fresh process.