
CUDA.jl 3.5-3.8

By: Tim Besard

Re-posted from: https://juliagpu.org/post/2022-01-28-cuda_3.5_3.8/index.html

CUDA.jl versions 3.5 to 3.8 have brought several new features to improve performance and productivity. This blog post highlights a few of them: direct copies between devices, better performance thanks to preserved array index types and a more aggressive memory pool, and a much-improved interface to the compute-sanitizer utility.

Copies between devices

Typically, when sending data between devices you need to stage through the CPU. CUDA.jl now does this automatically, making it possible to directly copy between CuArrays on different devices:

julia> device!(0);

julia> a = CUDA.rand(2,2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.440147  0.986939
 0.622901  0.698119

julia> device!(1);

julia> b = CUDA.zeros(2,2);

julia> copyto!(b, a)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.440147  0.986939
 0.622901  0.698119

When your hardware supports it, CUDA.jl will automatically enable so-called peer-to-peer mode, making it possible to copy data directly without going through the CPU. This can result in significant bandwidth and latency reductions. You can check if this mode of communication is possible:

julia> src = CuDevice(0)
CuDevice(0): NVIDIA A100-PCIE-40GB

julia> dst = CuDevice(1)
CuDevice(1): Tesla V100-PCIE-32GB

julia> can_access_peer(src, dst)
false

In this case, peer-to-peer communication is not possible because the devices have a different compute capability major revision number. With a compatible device, the function reports true:

julia> src = CuDevice(1)
CuDevice(1): Tesla V100-PCIE-32GB

julia> dst = CuDevice(2)
CuDevice(2): Tesla V100-PCIE-16GB

julia> can_access_peer(src, dst)
true

Thanks to @kshyatt for help with this change!

Helper function to use compute-sanitizer

The CUDA toolkit comes with a powerful tool to check GPU kernels for common issues like memory errors and race conditions: the compute sanitizer. To make it easier to use this tool, CUDA.jl now ships the binary as part of its artifacts, and provides a helper function to restart Julia under compute-sanitizer. Let's demonstrate by triggering a memory error to show what the compute sanitizer can detect:

julia> using CUDA

julia> CUDA.run_compute_sanitizer()
Re-starting your active Julia session...
========= COMPUTE-SANITIZER

julia> using CUDA

julia> unsafe_wrap(CuArray, pointer(CuArray([1])), 2) .= 1
========= Invalid __global__ write of size 8 bytes
=========     at 0x2a0 in LLVM/src/interop/base.jl:45:julia_broadcast_kernel_1892(CuKernelContext, CuDeviceArray<Int64, (int)1, (int)1>, Broadcasted<void, Tuple<OneTo<Int64>>, _identity, Broadcasted<Int64>>, Int64)
=========     by thread (1,0,0) in block (0,0,0)
=========     Address 0xa64000008 is out of bounds
=========     and is 1 bytes after the nearest allocation at 0xa64000000 of size 8 bytes

Other tools are available too, e.g. racecheck for detecting races or synccheck for finding synchronization issues. These tools can be selected using the tool keyword argument to run_compute_sanitizer.
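
For example, to run under the race detector instead of the default memory checker (the exact value accepted by the keyword is an assumption here; check the run_compute_sanitizer docstring for the accepted spellings):

julia> CUDA.run_compute_sanitizer(tool="racecheck")
Re-starting your active Julia session...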

Updated binary dependencies

As with every release, CUDA.jl now supports newer versions of NVIDIA's tools and libraries:

The update to CUDA toolkit 11.6 comes with improved debug info compatibility. If you need to debug Julia GPU code with tools like compute-sanitizer or cuda-gdb, and you need debug info (the equivalent of nvcc -G), ensure CUDA.jl can use the latest version of the CUDA toolkit.
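
If CUDA.jl ends up selecting an older toolkit (e.g. a local installation), you can request a specific version before loading the package. A minimal sketch, assuming the JULIA_CUDA_VERSION environment variable that CUDA.jl 3.x uses for toolkit selection:

$ JULIA_CUDA_VERSION=11.6 julia -e 'using CUDA; CUDA.versioninfo()'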

To make it easier to use the latest supported toolkit, CUDA.jl now implements CUDA's so-called Forward Compatibility mode: When your driver is outdated, CUDA.jl will attempt to load a newer version of the CUDA driver library, enabling use of a newer CUDA toolkit and libraries. Note that this is only supported on select hardware, refer to the NVIDIA documentation for more details.
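
Forward compatibility is attempted automatically. If it causes trouble on your system, it should be possible to opt out; the environment variable below is an assumption based on CUDA.jl's JULIA_CUDA_* settings, so verify it against the installation documentation:

$ JULIA_CUDA_USE_COMPAT=false julia -e 'using CUDA; CUDA.versioninfo()'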

Preserving array indices

Julia's integers are typically 64 bits wide, which can be wasteful when dealing with GPU indexing intrinsics that are only 32 bits wide. CUDA.jl's device array type now carefully preserves the type of indices so that 32-bit indices aren't unnecessarily promoted to 64 bits. With some careful kernel programming (note the use of 0x1 instead of 1 below), this makes it possible to significantly reduce the register pressure surrounding indexing operations, which may be useful in register-constrained situations:

julia> function memset(arr, val)
           i = (blockIdx().x-0x1) * blockDim().x + threadIdx().x
           @inbounds arr[i] = val
           return
       end

julia> CUDA.code_ptx(memset, Tuple{CuDeviceArray{Float32,1,AS.Global},Float32})
.func julia_memset(.param .b64 arr, .param .b32 val) {
        .reg .f32       %f<2>;
        .reg .b32       %r<5>;
        .reg .b64       %rd<5>;

        ld.param.u64    %rd1, [arr];
        ld.param.f32    %f1, [val];
        mov.u32         %r1, %ctaid.x;
        mov.u32         %r2, %ntid.x;
        mov.u32         %r3, %tid.x;
        mad.lo.s32      %r4, %r2, %r1, %r3;
        ld.u64          %rd2, [%rd1];
        mul.wide.s32    %rd3, %r4, 4;
        add.s64         %rd4, %rd2, %rd3;
        st.global.f32   [%rd4], %f1;
        ret;
}

On CUDA.jl 3.4, this simple function used 3 more 64-bit registers:

.func julia_memset(.param .b64 arr, .param .b32 val) {
        .reg .f32       %f<2>;
        .reg .b32       %r<5>;
        .reg .b64       %rd<8>;

        ld.param.u64    %rd1, [arr];
        ld.param.f32    %f1, [val];
        mov.u32         %r1, %ctaid.x;
        mov.u32         %r2, %ntid.x;
        mul.wide.u32    %rd2, %r2, %r1;
        mov.u32         %r3, %tid.x;
        add.s32         %r4, %r3, 1;
        cvt.u64.u32     %rd3, %r4;
        ld.u64          %rd4, [%rd1];
        add.s64         %rd5, %rd2, %rd3;
        shl.b64         %rd6, %rd5, 2;
        add.s64         %rd7, %rd4, %rd6;
        st.global.f32   [%rd7+-4], %f1;
        ret;
}

More aggressive memory management

Starting with CUDA.jl 3.8, the memory pool used to allocate CuArrays is configured differently: the pool is now allowed to use all available GPU memory, whereas previously all cached memory was released at each synchronization point. This can significantly improve performance, and makes synchronization much cheaper.

This behavior can be observed by calling the memory_status() function:

julia> CUDA.memory_status()
Effective GPU memory usage: 13.57% (2.001 GiB/14.751 GiB)
Memory pool usage: 0 bytes (0 bytes reserved)

julia> a = CuArray{Float32}(undef, (1024, 1024, 1024));

julia> Base.format_bytes(sizeof(a))
"4.000 GiB"

julia> a = nothing

julia> GC.gc()

julia> CUDA.memory_status()
Effective GPU memory usage: 40.59% (5.988 GiB/14.751 GiB)
Memory pool usage: 0 bytes (4.000 GiB reserved)

So far nothing new. On previous versions of CUDA.jl, however, any subsequent synchronization of the GPU (e.g., by copying memory to the CPU) would have released this reserved memory. This is no longer the case:

julia> synchronize()

julia> CUDA.memory_status()
Effective GPU memory usage: 40.59% (5.988 GiB/14.751 GiB)
Memory pool usage: 0 bytes (4.000 GiB reserved)

If you still want to release this memory, you can call the reclaim() function:

julia> CUDA.reclaim()

julia> CUDA.memory_status()
Effective GPU memory usage: 13.48% (1.988 GiB/14.751 GiB)
Memory pool usage: 0 bytes (0 bytes reserved)

In interactive Julia sessions, this function is called periodically so that the GPU's memory isn't held on to unnecessarily. Otherwise, it shouldn't be necessary to call this function, as memory is freed automatically when it is needed.

Minor changes and improvements

CUDA.jl 3.4

By: Tim Besard

Re-posted from: https://juliagpu.org/post/2021-08-13-cuda_3.4/index.html

The latest version of CUDA.jl brings several new features, from improved atomic operations to initial support for arrays with unified memory. The native random number generator introduced in CUDA.jl 3.0 is now used as the default fallback, and support for memory pools other than the CUDA stream-ordered one has been removed.

Streamlined atomic operations

In preparation for integrating with the new standard @atomic macro introduced in Julia 1.7, we have streamlined the capabilities of atomic operations in CUDA.jl. The API is now split into two levels: low-level atomic_ methods for atomic functionality that's directly supported by the hardware, and a high-level @atomic macro that tries to perform operations natively or falls back to a loop with compare-and-swap. This fallback implementation makes it possible to use more complex operations that do not map onto a single atomic operation:

julia> a = CuArray([1]);

julia> function kernel(a)
         CUDA.@atomic a[] <<= 1
         return
       end

julia> @cuda threads=16 kernel(a)

julia> a
1-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
 65536

julia> 1<<16
65536

The only requirement is that the types being used are supported by CUDA.atomic_cas!. This includes common types like 32 and 64-bit integers and floating-point numbers, as well as 16-bit numbers on devices with compute capability 7.0 or higher.
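
For completeness, here is a minimal sketch of the low-level interface, which operates on raw pointers and maps directly onto a hardware atomic instruction (exact type support depends on your device):

julia> a = CuArray(Int32[0]);

julia> function kernel(a)
           # emit a hardware atomic add on the array's first element
           CUDA.atomic_add!(pointer(a, 1), Int32(1))
           return
       end;

julia> @cuda threads=16 kernel(a);

julia> Array(a)
1-element Vector{Int32}:
 16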

Note that on Julia 1.7 and higher, CUDA.jl does not export the @atomic macro anymore to avoid conflicts with the version in Base. That means it is recommended to always fully specify uses of the macro, i.e., use CUDA.@atomic as in the example above.

Arrays with unified memory

You may have noticed that the CuArray type in the example above included an additional parameter, Mem.DeviceBuffer. This has been introduced to support arrays backed by different kinds of buffers. By default, we will use an ordinary device buffer, but it's now possible to allocate arrays backed by unified buffers that can be used on multiple devices:

julia> a = cu([0]; unified=true)
1-element CuArray{Int64, 1, CUDA.Mem.UnifiedBuffer}:
 0

julia> a .+= 1
1-element CuArray{Int64, 1, CUDA.Mem.UnifiedBuffer}:
 1

julia> device!(1)

julia> a .+= 1
1-element CuArray{Int64, 1, CUDA.Mem.UnifiedBuffer}:
 2

Although all operations should work equally well with arrays backed by unified memory, they have not been optimized yet. For example, copying memory to the device could be avoided as the driver can automatically page in unified memory on-demand.

New default random number generator

CUDA.jl 3.0 introduced a new random number generator, and starting with CUDA.jl 3.2 its performance and quality were improved to the point that it could be used by applications. A couple of features were still missing though, such as generating normally-distributed random numbers and support for complex numbers. These features have been added in CUDA.jl 3.3, and the generator is now used as the default fallback when CURAND does not support the requested element types.
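
Concretely, generating such numbers now just works with the native generator. A quick sketch (output suppressed with semicolons; rand! and randn! come from the Random standard library):

julia> using CUDA, Random

julia> rng = CUDA.RNG();

julia> randn!(rng, CUDA.zeros(Float32, 1024));     # normally-distributed values

julia> rand!(rng, CUDA.zeros(ComplexF32, 1024));   # complex values

julia> CUDA.rand(ComplexF32, 1024);   # falls back to the native RNG (no CURAND support)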

Both the performance and quality of this generator are much better than the previous, GPUArrays.jl-based one:

julia> using BenchmarkTools

julia> cuda_rng = CUDA.RNG();

julia> gpuarrays_rng = GPUArrays.default_rng(CuArray);

julia> a = CUDA.zeros(1024,1024);

julia> @benchmark CUDA.@sync rand!($cuda_rng, $a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  17.040 μs …  2.430 ms  ┊ GC (min … max): 0.00% … 99.04%
 Time  (median):     18.500 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.604 μs ± 34.734 μs  ┊ GC (mean ± σ):  1.17% ±  0.99%

         ▃▆█▇▇▅▄▂▁
  ▂▂▂▃▄▆███████████▇▆▆▅▅▄▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂ ▄
  17 μs           Histogram: frequency by time        24.1 μs <

julia> @benchmark CUDA.@sync rand!($gpuarrays_rng, $a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  72.489 μs …  2.790 ms  ┊ GC (min … max): 0.00% … 98.44%
 Time  (median):     74.479 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   81.211 μs ± 61.598 μs  ┊ GC (mean ± σ):  0.67% ±  1.40%

  █                                                           ▁
  █▆▃▁▃▃▅▆▅▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▄▆▁▁▁▁▁▁▁▁▄▄▃▄▃▁▁▁▁▁▁▁▁▁▃▃▄▆▄▁▄▃▆ █
  72.5 μs      Histogram: log(frequency) by time       443 μs <

julia> using RNGTest

julia> test_cuda_rng = RNGTest.wrap(cuda_rng, UInt32);

julia> test_gpuarrays_rng = RNGTest.wrap(gpuarrays_rng, UInt32);

julia> RNGTest.smallcrushTestU01(test_cuda_rng)
 All tests were passed

julia> RNGTest.smallcrushTestU01(test_gpuarrays_rng)
 The following tests gave p-values outside [0.001, 0.9990]:
       Test                          p-value
 ----------------------------------------------
  1  BirthdaySpacings                 eps
  2  Collision                        eps
  3  Gap                              eps
  4  SimpPoker                       1.0e-4
  5  CouponCollector                  eps
  6  MaxOft                           eps
  7  WeightDistrib                    eps
 10  RandomWalk1 M                   6.0e-4
 ----------------------------------------------
 (eps  means a value < 1.0e-300)

Removal of old memory pools

With the new stream-ordered allocator caching memory allocations at the CUDA library level, much of the need for Julia-managed memory pools has disappeared. To simplify the allocation code, we have removed support for those pools (i.e., binned, split and simple). You can now only use the cuda memory pool, or disable pooling altogether by setting the JULIA_CUDA_MEMORY_POOL environment variable to none.
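
For example, to run without a memory pool (the variable needs to be set before CUDA.jl initializes):

$ JULIA_CUDA_MEMORY_POOL=none julia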

Not using a memory pool degrades performance, so if you are stuck on an NVIDIA driver that does not support CUDA 11.2, it is advised to remain on CUDA.jl 3.3 until you can upgrade.

Also note that the new stream-ordered allocator has turned out to be incompatible with the legacy cuIpc APIs as used by OpenMPI. If that applies to you, consider disabling the memory pool, or reverting to CUDA.jl 3.3 if your application's allocation pattern benefits from one.

Because of this, we will be maintaining CUDA.jl 3.3 longer than usual. All bug fixes in CUDA.jl 3.4 have already been backported to the previous release, which is currently at version 3.3.6.

Device capability-dependent kernel code

Some of the improvements in this release depend on the ability to write generic code that only uses certain hardware features when they are available. To facilitate writing such code, the compiler now embeds metadata in the generated code that kernel code can branch on.

Currently, the device capability and PTX ISA version are embedded and made available using respectively the compute_capability and ptx_isa_version functions. A simplified version number type, constructable using the sv"..." string macro, can be used to test against these properties. For example:

julia> function kernel(a)
           a[] = compute_capability() >= sv"6.0" ? 1 : 2
           return
       end
kernel (generic function with 1 method)

julia> CUDA.code_llvm(kernel, Tuple{CuDeviceVector{Float32, AS.Global}})
define void @julia_kernel_1({ i8 addrspace(1)*, i64, [1 x i64] }* %0) {
top:
  %1 = bitcast { i8 addrspace(1)*, i64, [1 x i64] }* %0 to float addrspace(1)**
  %2 = load float addrspace(1)*, float addrspace(1)** %1, align 8
  store float 1.000000e+00, float addrspace(1)* %2, align 4
  ret void
}

julia> capability(device!(1))
v"3.5.0"

julia> CUDA.code_llvm(kernel, Tuple{CuDeviceVector{Float32, AS.Global}})
define void @julia_kernel_2({ i8 addrspace(1)*, i64, [1 x i64] }* %0) {
top:
  %1 = bitcast { i8 addrspace(1)*, i64, [1 x i64] }* %0 to float addrspace(1)**
  %2 = load float addrspace(1)*, float addrspace(1)** %1, align 8
  store float 2.000000e+00, float addrspace(1)* %2, align 4
  ret void
}

The branch on the compute capability is completely optimized away. At the same time, this does not require re-inferring the function as the optimization happens at the LLVM level.

Other changes

CUDA.jl 3.3

By: Tim Besard

Re-posted from: https://juliagpu.org/post/2021-06-10-cuda_3.3/index.html

There have been several releases of CUDA.jl in the past couple of months, with many bugfixes and many exciting new features to improve GPU programming in Julia: CuArray now supports isbits Unions, CUDA.jl can emit debug info for use with NVIDIA tools, and changes to the compiler make it even easier to use the latest version of the CUDA toolkit.

CuArray support for isbits Unions

Unions are a way to represent values of one type or another, e.g., a value that can be an integer or a floating-point number. If all possible element types of a Union are so-called bitstypes, which can be stored contiguously in memory, the Union of these types can be stored contiguously too. This kind of optimization is implemented by the Array type, which can store such "isbits Unions" inline, as opposed to storing a pointer to a heap-allocated box. For more details, refer to the Julia documentation.

With CUDA.jl 3.3, the CuArray GPU array type now supports this optimization too. That means you can safely allocate CuArrays with isbits union element types and perform GPU-accelerated operations on them:

julia> a = CuArray([1, nothing, 3])
3-element CuArray{Union{Nothing, Int64}, 1}:
 1
  nothing
 3

julia> findfirst(isnothing, a)
2

It is also safe to pass these CuArrays to a kernel and use unions there:

julia> function kernel(a)
         i = threadIdx().x
         if a[i] !== nothing
           a[i] += 1
         end
         return
       end

julia> @cuda threads=3 kernel(a)

julia> a
3-element CuArray{Union{Nothing, Int64}, 1}:
 2
  nothing
 4

This feature is especially valuable to represent missing values, and is an important step towards GPU support for DataFrames.jl.

Debug and location information

Another noteworthy addition is the support for emitting debug and location information. The debug level, set by passing -g <level> to the julia executable, determines how much info is emitted. The default of level 1 only enables location information, which should not impact performance. Passing -g0 disables this, while passing -g2 also enables the output of DWARF debug information and compiles kernels in debug mode.
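
In other words, the level of detail is controlled entirely by how you start Julia (script.jl standing in for your program):

$ julia -g0 script.jl    # no location or debug info
$ julia script.jl        # default (-g1): location info only
$ julia -g2 script.jl    # DWARF debug info, kernels compiled in debug mode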

Location information is useful for a variety of reasons. Many tools, like the NVIDIA profilers, use it to correlate instructions with source code:

NVIDIA Visual Profiler with source-code location information

Debug information can be used to debug compiled code using cuda-gdb:

$ cuda-gdb --args julia -g2 examples/vadd.jl
(cuda-gdb) set cuda break_on_launch all
(cuda-gdb) run
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
macro expansion () at .julia/packages/LLVM/hHQuD/src/interop/base.jl:74
74                  Base.llvmcall(($ir,$fn), $rettyp, $argtyp, $(args.args...))

(cuda-gdb) bt
#0  macro expansion () at .julia/packages/LLVM/hHQuD/src/interop/base.jl:74
#1  macro expansion () at .julia/dev/CUDA/src/device/intrinsics/indexing.jl:6
#2  _index () at .julia/dev/CUDA/src/device/intrinsics/indexing.jl:6
#3  blockIdx_x () at .julia/dev/CUDA/src/device/intrinsics/indexing.jl:56
#4  blockIdx () at .julia/dev/CUDA/src/device/intrinsics/indexing.jl:76
#5  julia_vadd<<<(1,1,1),(12,1,1)>>> (a=..., b=..., c=...) at .julia/dev/CUDA/examples/vadd.jl:6

(cuda-gdb) f 5
#5  julia_vadd<<<(1,1,1),(12,1,1)>>> (a=..., b=..., c=...) at .julia/dev/CUDA/examples/vadd.jl:6
6           i = (blockIdx().x-1) * blockDim().x + threadIdx().x

(cuda-gdb) l
1       using Test
2
3       using CUDA
4
5       function vadd(a, b, c)
6           i = (blockIdx().x-1) * blockDim().x + threadIdx().x
7           c[i] = a[i] + b[i]
8           return
9       end
10

Improved CUDA compatibility support

As always, new CUDA.jl releases come with updated support for the CUDA toolkit. CUDA.jl is now compatible with CUDA 11.3, as well as CUDA 11.3 Update 1. Users don't have to do anything to update to these versions, as CUDA.jl will automatically select and download the latest supported version.

Of course, for CUDA.jl to use the latest versions of the CUDA toolkit, a sufficiently recent version of the NVIDIA driver is required. Before CUDA 11.0, the driver's CUDA compatibility was a strict lower bound, and every minor CUDA release required a driver update. CUDA 11.0 comes with an enhanced compatibility option that follows semantic versioning, e.g., CUDA 11.3 can be used on an NVIDIA driver that only supports up to CUDA 11.0. CUDA.jl now follows semantic versioning when selecting a compatible toolkit, making it easier to use the latest version of the CUDA toolkit in Julia.

For those interested: Implementing semantic versioning required the CUDA.jl compiler to use ptxas instead of the driver's embedded JIT to generate GPU machine code. At the same time, many parts of CUDA.jl still use the CUDA driver APIs, so it's always recommended to keep your NVIDIA driver up-to-date.

High-level graph APIs

To overcome the cost of launching kernels, CUDA makes it possible to build computational graphs, and execute those graphs with less overhead than the underlying operations. In CUDA.jl we provide easy access to the APIs to record and execute these graphs:

A = CUDA.zeros(Int, 1)

# ensure the operation is compiled
A .+= 1

# capture
graph = capture() do
    A .+= 1
end
@test Array(A) == [1]   # didn't change anything

# instantiate and launch
exec = instantiate(graph)
CUDA.launch(exec)
@test Array(A) == [2]

# update and instantiate/launch again
graph′ = capture() do
    A .+= 2
end
update(exec, graph′)
CUDA.launch(exec)
@test Array(A) == [4]

This sequence of operations is common enough that we provide a high-level @captured macro that automatically records, updates, instantiates and launches the graph:

A = CUDA.zeros(Int, 1)

for i in 1:2
    @captured A .+= 1
end
@test Array(A) == [2]

Minor changes and features