Debugging Julia with Address Sanitizer

By: Tim Besard

Re-posted from: https://blog.maleadt.net/2017/02/24/julia-asan/

Address sanitizer is a useful tool for
debugging various memory problems, from invalid accesses to mismanagement or leaks. It is
similar to Valgrind’s
memcheck, but uses compile-time
instrumentation to lower the cost.

In this post I’ll explain how to use Clang’s address sanitizer (ASAN) with Julia. This is
somewhat tricky, as the Julia compiler itself uses LLVM for code generation. Long story
short, this means that all instances of LLVM (i.e., the one Julia is compiled with, and the
one used for code generation) have to match up exactly for the instrumentation to work as
expected.

LLVM toolchain

We’ll start by building a toolchain to compile Julia with. As mentioned before, all LLVM
instances in play have to match up exactly for instrumentation to work, so we’ll use Julia’s
build infrastructure to generate an LLVM toolchain for us.

Start by checking-out Julia, and creating an out-of-tree build directory:

$ git clone https://github.com/JuliaLang/julia
$ cd julia
$ make O=sanitize_toolchain configure

This build will need to provide clang, so create a Make.user in the build directory
containing BUILD_LLVM_CLANG=1. In addition, LLVM does not build its sanitizers with
autotools, so add override LLVM_USE_CMAKE=1 to that file as well. And because that triggers
LLVM bug #23649, also add USE_LLVM_SHLIB=0.
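
Putting those settings together, the toolchain’s Make.user would look something like this:

BUILD_LLVM_CLANG=1
override LLVM_USE_CMAKE=1
USE_LLVM_SHLIB=0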

Now execute make install-llvm from the deps subfolder. When it
finishes, check if binaries have been written to usr/bin (due to what’s probably a bug in
LLVM’s build scripts), and move them to usr/tools if they have.
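
Concretely, that boils down to something like this (assuming the usual out-of-tree layout,
where the build directory mirrors the source tree):

$ cd sanitize_toolchain/deps
$ make install-llvm
$ ls ../usr/bin                    # any binaries that ended up here...
$ mv ../usr/bin/* ../usr/tools/    # ...should be moved to usr/tools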

Sanitized Julia

Now that we have a working toolchain, we’ll use it to compile a sanitized version of the
Julia compiler and libraries. Start by creating a new out-of-tree build directory using
make O=sanitize configure. But this time, our Make.user will be significantly more
complex:

TOOLCHAIN=$(BUILDROOT)/sanitize_toolchain/usr/tools

# use our new toolchain
USECLANG=1
override CC=$(TOOLCHAIN)/clang
override CXX=$(TOOLCHAIN)/clang++
export ASAN_SYMBOLIZER_PATH=$(TOOLCHAIN)/llvm-symbolizer

# enable ASAN
override SANITIZE=1
override LLVM_SANITIZE=1

# autotools doesn't have a self-sanitize mode
override LLVM_USE_CMAKE=1

# make the GC use regular malloc/frees, which are intercepted by ASAN
override WITH_GC_DEBUG_ENV=1

# default to a debug build for better line number reporting
override JULIA_BUILD_MODE=debug

Now kick off the build by running make from the sanitize build directory. Barring any memory
issues triggered during system image generation, this should yield a sanitized julia
binary and system image.
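
That is:

$ cd sanitize
$ make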

Running the test-suite

The test-suite is a beast, and because ASAN keeps track of a lot of information it easily
takes over 128GiB of memory to run it to completion. Instead, we’ll tune ASAN to consume
less memory at the expense of accuracy and report detail.

Julia, however, already configures default ASAN options, which we need to copy when
specifying a different set. Do so by defining the ASAN_OPTIONS environment variable and
assigning it the value
detect_leaks=0:allow_user_segv_handler=1:fast_unwind_on_malloc=0:malloc_context_size=2.
This copies the aforementioned default values, and caps backtrace collection.
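
For example, when running the test-suite from the sanitize build directory (adjust the make
invocation to taste):

$ export ASAN_OPTIONS=detect_leaks=0:allow_user_segv_handler=1:fast_unwind_on_malloc=0:malloc_context_size=2
$ make testall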

Using CUDA packages

If you thought all that was convoluted, prepare for some more. ASAN uses so-called shadow
memory to store information about memory allocations. There is a correspondence between
regular memory addresses and their shadow counterpart, and this mapping is fixed in order to
keep the instrumentation overhead low (on 64-bit x86 it boils down to
shadow = (address >> 3) + offset, with the offset baked in at compile time). Sadly, the
default shadow memory location overlaps with fixed memory allocated by CUDA (presumably for
its unified virtual address space).

Because the shadow memory offset is fixed at compile time, we need to patch both instances
of LLVM (easiest by adding a patch to llvm.mk) and have them pick a different shadow offset:

--- lib/Transforms/Instrumentation/AddressSanitizer.cpp
+++ lib/Transforms/Instrumentation/AddressSanitizer.cpp
@@ -359,7 +359,7 @@
       if (IsKasan)
         Mapping.Offset = kLinuxKasan_ShadowOffset64;
       else
-        Mapping.Offset = kSmallX86_64ShadowOffset;
+        Mapping.Offset = kDefaultShadowOffset64;
     } else if (IsMIPS64)
       Mapping.Offset = kMIPS64_ShadowOffset64;
     else if (IsAArch64)
--- projects/compiler-rt/lib/asan/asan_mapping.h
+++ projects/compiler-rt/lib/asan/asan_mapping.h
@@ -146,7 +146,7 @@
 #  elif SANITIZER_IOS
 #    define SHADOW_OFFSET kIosShadowOffset64
 #  else
-#   define SHADOW_OFFSET kDefaultShort64bitShadowOffset
+#   define SHADOW_OFFSET kDefaultShadowOffset64
 #  endif
 # endif
 #endif

Note that you might need to redefine a different macro for your platform.
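
As for wiring the patch in: assuming the LLVM_PATCH helper in deps/llvm.mk still works the
way it does for the other bundled patches, you would save the diff under deps/patches (the
file name below is made up) and register it like so:

# in deps/llvm.mk, next to the other bundled patches
$(eval $(call LLVM_PATCH,llvm-asan-shadow-offset))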

Sanitizing older versions of Julia

If you want to sanitize older versions of Julia, before the switch to LLVM 3.9, there are
yet more issues: only LLVM 3.9 is compatible with recent versions of glibc, while the CMake
build system of LLVM 3.7 doesn’t export all necessary public symbols. You can work around
these issues by using a sufficiently old system, and either overriding the LLVM version to
3.8 (by specifying override LLVM_VER=3.8.1 in the Make.user of both build directories) or
preventing it from generating a shared library (by specifying USE_LLVM_SHLIB=0 in the
Make.user of the final Julia build).

Compiling Julia for NVIDIA GPUs

By: Tim Besard

Re-posted from: https://blog.maleadt.net/2015/01/15/julia-cuda/

For the last few months, I have been working on CUDA support for the Julia language. It is
now possible to write kernels in Julia and execute them on an NVIDIA GPU without much
hassle, but there are still many limitations. As I unexpectedly won’t have much time to work
on this anymore, I’m publishing and documenting my work already.

Note from 2018: I ended up being able to continue this work, and so the following blog post
has become pretty stale. Check out CUDAnative.jl for more details about the current state
of affairs.

My work allows for code such as:

using CUDA

# define a kernel
@target ptx function kernel_vadd(a, b, c)
    i = blockId_x() + (threadId_x()-1) * numBlocks_x()
    c[i] = a[i] + b[i]

    return nothing
end

# set-up
dev = CuDevice(0)
ctx = CuContext(dev)
cgctx = CuCodegenContext(ctx, dev)

# create some data
dims = (3, 4)
a = round(rand(Float32, dims) * 100)
b = round(rand(Float32, dims) * 100)
c = Array(Float32, dims)

# execute!
len = prod(dims)
@cuda (len, 1) kernel_vadd(CuIn(a), CuIn(b), CuOut(c))

# verify
@show a+b == c

# tear-down
destroy(cgctx)
destroy(ctx)

which is pretty neat I think 🙂

I’ll start by giving a quick description of the modifications. Jump to the
bottom of this post for usage instructions.

Overview

Compiling Julia for GPUs requires support at multiple levels. I’ve tried to avoid touching
too much of the core compiler; as a consequence most functionality is part of the CUDA.jl
package. This should make it easier to maintain and eventually merge the code.

All of the relevant repositories are hosted at my
Github page, and contain README and TODO
files. If you have any questions though, feel free to contact me.

Julia compiler

Using the NVPTX back-end of LLVM, I have
modified the Julia compiler so that it can
generate PTX assembly. A non-exhaustive list of modifications:

  • @target macro for annotating functions with target information
  • per-target compiler state (module, pass manager, etc)
  • diverse small changes to generate suitable IR
  • exported functions for accessing the PTX code

Most of the code churn comes from using an address-preserving bitcast, which is
already being upstreamed thanks
to Valentin Churavy.

CUDA.jl support package

Generating PTX assembly is only one part of the puzzle: hardware needs to be
configured, code needs to be uploaded, etc. This functionality is exposed
through the CUDA runtime driver, which already was conveniently wrapped in the
CUDA.jl package.

I have extended this package with
functionality required for GPU code generation, and developed user-friendly
wrappers which should make it easier to interact with PTX code:

  • @cuda macro for invoking GPU kernels
  • automatic argument management
  • lightweight on-device arrays
  • improved API consistency
  • many new features

The most significant part is obviously the @cuda macro, allowing for seamless execution of
kernel functions on your GPU. The macro compiles the kernel you’re calling to PTX assembly,
and generates code for interacting with the driver (creating a module, uploading code,
managing arguments, etc).

The argument management is also pretty interesting. Depending on the argument type, it
generates type conversions and/or memory operations in order to mimic Julia’s
pass-by-sharing convention. For example, if you pass an array to a kernel, @cuda will
automatically up- and download it when required1.
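
To make this concrete, here’s a hypothetical sketch building on those wrappers
(kernel_scale is made up, the device/context set-up from the example at the top of this
post is assumed, and the intrinsics and launch syntax mirror the vadd kernel):

# in-place kernel: the array is both read and written, so we wrap it with CuInOut
@target ptx function kernel_scale(a, factor)
    i = blockId_x() + (threadId_x()-1) * numBlocks_x()
    a[i] = a[i] * factor
    return nothing
end

a = round(rand(Float32, 100) * 100)
@cuda (100, 1) kernel_scale(CuInOut(a), 2f0)  # uploaded, scaled on the GPU, downloaded again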

Most functionality of @cuda is built using staged functions, and thus only executes once,
without a recurring runtime cost. This means that it should be possible to reach the same
average performance as a traditional, precompiled CUDA application 🙂

GPU Ocelot emulator

I have also forked the GPU Ocelot project, a research project providing a dynamic
compilation framework (read: emulator) for CUDA hardware. By extending its API support and
fixing certain bugs, I made it usable as a drop-in replacement for libcuda.so, fully
compatible with CUDA.jl.

In practice, I used this emulator for everyday development on a system without
an NVIDIA GPU, while testing happened on real hardware.

Limitations

The code is far from production ready: it is not cross-platform (Linux only),
several changes should be discussed with upstream, and only a restricted subset
of the language is supported. Most notable shortcomings:

  • cannot alloc in PTX mode (which breaks most of the language)
  • can only pass bitstypes, arrays or pointers
  • standard library functionality is unavailable, because it lives in another
    LLVM module

In short: unless you’re only using relatively simple kernels with straightforward data
interactions, this code is not yet usable for you.

Usage

Even though all code is pretty functional and well-maintained, you need some
basic development skills to put the pieces together. Don’t expect a polished
product!

Julia compiler

Compile the modified compiler from source, using LLVM 3.5:

$ git clone https://github.com/maleadt/julia.git
$ cd julia
$ make LLVM_VER=3.5.0

Optionally, make sure Julia is not broken (this does not include GPU tests):

$ make LLVM_VER=3.5.0 testall

Note: the compiler will require libdevice to link kernel binaries. This library is only part
of recent CUDA toolkits (version 5.5 or greater). If you use an older CUDA release, you will
need to get a hold of these files. Afterwards, you can point Julia to them using the
NVVMIR_LIBRARY_DIR environment variable.
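
For example (the path here is just an illustration; point it at wherever the libdevice
bitcode files live on your system):

$ NVVMIR_LIBRARY_DIR=/opt/cuda-5.5/nvvm/libdevice ./julia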

GPU Ocelot emulator

If you don’t have any suitable CUDA hardware, you can use GPU Ocelot:

$ git clone --recursive https://github.com/maleadt/gpuocelot.git
$ cd gpuocelot
$ $PAGER README.md
$ CUDA_BIN_PATH=/opt/cuda-5.0/bin \
  CUDA_LIB_PATH=/opt/cuda-5.0/lib \
  CUDA_INC_PATH=/opt/cuda-5.0/include \
  python2 build.py --install -p $(realpath ../julia/usr)

Note: this probably will not build flawlessly. You’ll need at least the CUDA
toolkit2 (headers and tools, not the driver), gcc 4.6, scons, LLVM 3.5 and
Boost. Check the README!

Now if you load CUDA.jl and it doesn’t find libcuda.so, it will look for
libocelot.so instead:

$ ./julia
  > using CUDA
  > CUDA_VENDOR
    "Ocelot"

CUDA.jl support package

Installing packages is easy3 (just make sure you use the correct julia
binary):

$ ./julia
  > Pkg.clone("https://github.com/maleadt/CUDA.jl.git")

Optional but recommended: test GPU support:

$ ./julia
  > Pkg.test("CUDA")

What now?

You tell me! I think this work can be a good start for future GPU endeavours in Julia land,
even if most of the code isn’t directly re-usable. For me at least it has been a very
interesting project, but it’s in the hands of the community now.


  1. You can influence this behaviour using the CuIn, CuOut, and CuInOut
    wrapper types.

  2. GPU Ocelot is only compatible with CUDA 5.0 or older.
    This means you’ll need to get libdevice separately.

  3. If you don’t want to pollute your main package directory with this
    experimental stuff, redefine the JULIA_PKGDIR environment variable.
