
CUDA.jl 2.0

By: Blog on JuliaGPU

Re-posted from: https://juliagpu.org/2020-10-02-cuda_2.0/

Today we’re releasing CUDA.jl 2.0, a breaking release with several new features. Highlights
include initial support for Float16, a switch to CUDA’s new stream model, a much-needed
rework of the sparse array support and support for CUDA 11.1.

The release now requires Julia 1.5, and assumes a GPU with compute capability 5.0 or
higher (although most of the package will still work with an older GPU).

Low- and mixed-precision operations

With NVIDIA’s latest GPUs emphasizing support for low-precision operations, CUDA.jl
is starting to support these as well. For example,
the CUBLAS wrappers can be used with (B)Float16 inputs (running under JULIA_DEBUG=CUBLAS
to illustrate the called methods) thanks to the cublasGemmEx API call:

julia> mul!(CUDA.zeros(Float32,2,2), cu(rand(Float16,2,2)), cu(rand(Float16,2,2)))

I! cuBLAS (v11.0) function cublasStatus_t cublasGemmEx(...) called:
i!  Atype: type=cudaDataType_t; val=CUDA_R_16F(2)
i!  Btype: type=cudaDataType_t; val=CUDA_R_16F(2)
i!  Ctype: type=cudaDataType_t; val=CUDA_R_32F(0)
i!  computeType: type=cublasComputeType_t; val=CUBLAS_COMPUTE_32F(68)

2×2 CuArray{Float32,2}:
 0.481284  0.561241
 1.12923   1.04541
julia> using BFloat16s

julia> mul!(CUDA.zeros(BFloat16,2,2), cu(BFloat16.(rand(2,2))), cu(BFloat16.(rand(2,2))))

I! cuBLAS (v11.0) function cublasStatus_t cublasGemmEx(...) called:
i!  Atype: type=cudaDataType_t; val=CUDA_R_16BF(14)
i!  Btype: type=cudaDataType_t; val=CUDA_R_16BF(14)
i!  Ctype: type=cudaDataType_t; val=CUDA_R_16BF(14)
i!  computeType: type=cublasComputeType_t; val=CUBLAS_COMPUTE_32F(68)

2×2 CuArray{BFloat16,2}:
 0.300781   0.71875
 0.0163574  0.0241699

Alternatively, CUBLAS can be configured to automatically down-cast 32-bit inputs to Float16.
This is now exposed through a task-local
CUDA.jl math mode:

julia> CUDA.math_mode!(CUDA.FAST_MATH; precision=:Float16)

julia> mul!(CuArray(zeros(Float32,2,2)), CuArray(rand(Float32,2,2)), CuArray(rand(Float32,2,2)))

I! cuBLAS (v11.0) function cublasStatus_t cublasGemmEx(...) called:
i!  Atype: type=cudaDataType_t; val=CUDA_R_32F(0)
i!  Btype: type=cudaDataType_t; val=CUDA_R_32F(0)
i!  Ctype: type=cudaDataType_t; val=CUDA_R_32F(0)
i!  computeType: type=cublasComputeType_t; val=CUBLAS_COMPUTE_32F_FAST_16F(74)

2×2 CuArray{Float32,2}:
 0.175258  0.226159
 0.511893  0.331351

As part of these changes, CUDA.jl now defaults to using tensor cores. This may affect
accuracy; use the PEDANTIC math mode if you want the old behavior.
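A minimal sketch of doing so on the current task, assuming PEDANTIC_MATH is the
corresponding value of the math-mode enum:

julia> CUDA.math_mode!(CUDA.PEDANTIC_MATH)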

Work is under way to extend these
capabilities to the rest of CUDA.jl, e.g., the CUDNN wrappers, or the native kernel
programming capabilities.

New default stream semantics

In CUDA.jl 2.0 we’re switching to CUDA’s simplified stream programming model.
This simplifies working with multiple streams, and opens up more possibilities for
concurrent execution of GPU operations.

Multi-stream programming

In the old model, the default stream (used by all GPU operations unless specified otherwise)
was a special stream whose commands could not be executed concurrently with commands on
regular, explicitly-created streams. For example, if we interleaved kernels executed on a
dedicated stream with kernels on the default one, execution was serialized:

using CUDA

N = 1 << 20

function kernel(x, n)
    tid = threadIdx().x + (blockIdx().x-1) * blockDim().x
    for i = tid:blockDim().x*gridDim().x:n
        x[i] = CUDA.sqrt(CUDA.pow(3.14159f0, i))
    end
    return
end

num_streams = 8

for i in 1:num_streams
    stream = CuStream()

    data = CuArray{Float32}(undef, N)

    # launch work on a dedicated stream...
    @cuda blocks=1 threads=64 stream=stream kernel(data, N)

    # ...interleaved with a (dummy) launch on the default stream
    @cuda kernel(data, 0)
end
Multi-stream programming (old)

In the new model, default streams are regular streams and commands issued on them can
execute concurrently with those on other streams:

Multi-stream programming (new)

Multi-threading

Another consequence of the new stream model is that each thread gets its own default stream
(accessible as CuStreamPerThread()). Together with Julia’s threading capabilities, this
makes it trivial to group independent work in tasks, benefiting from concurrent execution on
the GPU where possible:

using CUDA

N = 1 << 20

function kernel(x, n)
    tid = threadIdx().x + (blockIdx().x-1) * blockDim().x
    for i = tid:blockDim().x*gridDim().x:n
        x[i] = CUDA.sqrt(CUDA.pow(3.14159f0, i))
    end
    return
end

Threads.@threads for i in 1:Threads.nthreads()
    data = CuArray{Float32}(undef, N)
    @cuda blocks=1 threads=64 kernel(data, N)
    # wait for the work on this thread's default stream to finish
    synchronize(CuDefaultStream())
end
Multi-threading (new)

With the old model, execution would have been serialized because the default stream was the
same across threads:

Multi-threading (old)

Future improvements will make this behavior configurable, such that users can use a
different default stream per task.

Sparse array clean-up

As part of CUDA.jl 2.0, the sparse array support has been refactored, bringing it in line
with the other array types and their expected behavior. For example, the custom switch2
methods have been removed in favor of calls to convert and array constructors:

julia> using SparseArrays
julia> using CUDA, CUDA.CUSPARSE

julia> CuSparseMatrixCSC(CUDA.rand(2,2))
2×2 CuSparseMatrixCSC{Float32} with 4 stored entries:
  [1, 1]  =  0.124012
  [2, 1]  =  0.791714
  [1, 2]  =  0.487905
  [2, 2]  =  0.752466

julia> CuSparseMatrixCOO(sprand(2,2, 0.5))
2×2 CuSparseMatrixCOO{Float64} with 3 stored entries:
  [1, 1]  =  0.183183
  [2, 1]  =  0.966466
  [2, 2]  =  0.064101

julia> CuSparseMatrixCSR(ans)
2×2 CuSparseMatrixCSR{Float64} with 3 stored entries:
  [1, 1]  =  0.183183
  [2, 1]  =  0.966466
  [2, 2]  =  0.064101

Initial support for the COO sparse matrix type has also been added, along with better
support for sparse matrix-vector multiplication.
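
Multiplying these sparse arrays with dense GPU vectors goes through the standard
interfaces; a minimal sketch, assuming * dispatches to the CUSPARSE matrix-vector
wrappers for this combination:

using CUDA, CUDA.CUSPARSE, SparseArrays

A = CuSparseMatrixCSR(sprand(Float32, 100, 100, 0.1))  # sparse matrix on the GPU
x = CUDA.rand(Float32, 100)                            # dense vector on the GPU
y = A * x                                              # sparse matrix-vector product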

Support for CUDA 11.1

This release also features support for the brand-new CUDA 11.1. As there is no compatible
release of CUDNN or CUTENSOR yet, CUDA.jl won’t automatically select this version, but you
can force it to by setting the JULIA_CUDA_VERSION environment variable to 11.1:

julia> ENV["JULIA_CUDA_VERSION"] = "11.1"

julia> using CUDA

julia> CUDA.versioninfo()
CUDA toolkit 11.1.0, artifact installation

Libraries:
- CUDNN: missing
- CUTENSOR: missing

Minor changes

Many other changes are part of this release:

  • Views, reshapes and array reinterpretations are now represented by the Base array
    wrappers, simplifying the CuArray type definition (see the sketch below).
  • Various optimizations to the CUFFT and CUDNN library wrappers.
  • Support for LinearAlgebra.reflect! and rotate!.
  • Initial support for calling CUDA libraries with strided inputs.
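
As an illustration of the first point, wrapped arrays now show up as the familiar Base
types around a CuArray; a minimal sketch (the exact wrapper types printed may differ):

using CUDA

a = CUDA.rand(4, 4)
v = view(a, :, 1)          # a Base.SubArray wrapping the CuArray
r = reinterpret(Int32, a)  # a Base.ReinterpretArray wrapping the CuArray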

Paper: Flexible Performant GEMM Kernels on GPUs

By: Blog on JuliaGPU

Re-posted from: https://juliagpu.org/2020-09-28-gemmkernels/

General Matrix Multiplication or GEMM kernels take center stage in high performance
computing and machine learning. Recent NVIDIA GPUs include GEMM accelerators, such as
NVIDIA’s Tensor Cores. In this paper we show how it is possible to program these
accelerators from Julia, and present abstractions and interfaces that allow doing so
efficiently without sacrificing performance.

A pre-print of the paper has been published on arXiv: arXiv:2009.12263.
The source code can be found on GitHub: thomasfaingnaert/GemmKernels.jl.

With the APIs from GemmKernels.jl, it is possible to instantiate GEMM kernels that perform
in the same ballpark as, and sometimes even outperform, state-of-the-art libraries like
CUBLAS and CUTLASS. For example, performing a mixed-precision multiplication of two 16-bit
matrices into a 32-bit accumulator (on different combinations of layouts):

Performance of mixed-precision GEMM

The APIs are also highly flexible and allow customization of each step, e.g., to apply the
activation function max(x, 0) for implementing a rectified linear unit (ReLU):

a = CuArray(rand(Float16, (M, K)))
b = CuArray(rand(Float16, (K, N)))
c = CuArray(rand(Float32, (M, N)))
d = similar(c)

# configure a WMMA-based mixed-precision GEMM for the given problem shape
conf = GemmKernels.get_config(
    gemm_shape = (M = M, N = N, K = K),
    operator = Operator.WMMAOp{16, 16, 16},
    global_a_layout = Layout.AlignedColMajor{Float16},
    global_c_layout = Layout.AlignedColMajor{Float32})

# compute d = a * b + c, applying the ReLU transform while writing the results
GemmKernels.matmul(
    a, b, c, d, conf;
    transform_regs_to_shared_d = Transform.Elementwise(x -> max(x, 0)))

The GemmKernels.jl framework is written entirely in Julia, demonstrating the
high-performance GPU programming capabilities of this language, but at the same time keeping
the research accessible and easy to modify or repurpose by other Julia developers.

CUDA.jl 1.3 – Multi-device programming

By: Blog on JuliaGPU

Re-posted from: https://juliagpu.org/2020-07-18-cuda_1.3/

Today we’re releasing CUDA.jl 1.3, with several new features. The most prominent
change is support for multiple GPUs within a single process.

Multi-GPU programming

With CUDA.jl 1.3, you can finally use multiple CUDA GPUs within a single process. To switch
devices you can call device!, query the current device with device(), or reset it using
device_reset!():

julia> collect(devices())
9-element Array{CuDevice,1}:
 CuDevice(0): Tesla V100-PCIE-32GB
 CuDevice(1): Tesla V100-PCIE-32GB
 CuDevice(2): Tesla V100-PCIE-32GB
 CuDevice(3): Tesla V100-PCIE-32GB
 CuDevice(4): Tesla V100-PCIE-16GB
 CuDevice(5): Tesla P100-PCIE-16GB
 CuDevice(6): Tesla P100-PCIE-16GB
 CuDevice(7): GeForce GTX 1080 Ti
 CuDevice(8): GeForce GTX 1080 Ti

julia> device!(5)

julia> device()
CuDevice(5): Tesla P100-PCIE-16GB

Let’s define a kernel to show this really works:

julia> function kernel()
           dev = Ref{Cint}()
           CUDA.cudaGetDevice(dev)
           @cuprintln("Running on device $(dev[])")
           return
       end

julia> @cuda kernel()
Running on device 5

julia> device!(0)

julia> device()
CuDevice(0): Tesla V100-PCIE-32GB

julia> @cuda kernel()
Running on device 0

Memory allocations, like CuArrays, are implicitly bound to the device they
were allocated on. That means you should take care to only use an array when the
owning device is active, or you will run into errors:

julia> device()
CuDevice(0): Tesla V100-PCIE-32GB

julia> a = CUDA.rand(1)
1-element CuArray{Float32,1}:
 0.6322775

julia> device!(1)

julia> a
ERROR: CUDA error: an illegal memory access was encountered

Future improvements might make the array type device-aware.

Multitasking and multithreading

Dovetailing with the support for multiple GPUs is the ability to use these GPUs
from separate Julia tasks and threads:

julia> device!(0)

julia> @sync begin
         @async begin
           device!(1)
           println("Working with $(device()) on $(current_task())")
           yield()
           println("Back to device $(device()) on $(current_task())")
         end
         @async begin
           device!(2)
           println("Working with $(device()) on $(current_task())")
         end
       end
Working with CuDevice(1) on Task @0x00007fc9e6a48010
Working with CuDevice(2) on Task @0x00007fc9e6a484f0
Back to device CuDevice(1) on Task @0x00007fc9e6a48010

julia> device()
CuDevice(0): Tesla V100-PCIE-32GB

Each task has its own local GPU state, such as the device it was bound to,
handles to libraries like CUBLAS or CUDNN (which means that each task can
configure libraries independently), etc.
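
For example, each task lazily creates its own CUBLAS handle; a minimal sketch, assuming
CUDA.CUBLAS.handle() returns the task-local handle:

using CUDA

@sync begin
    @async begin
        device!(0)
        @show CUDA.CUBLAS.handle()  # handle bound to this task and device 0
    end
    @async begin
        device!(1)
        @show CUDA.CUBLAS.handle()  # an independent handle for this task and device 1
    end
end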

Minor features

CUDA.jl 1.3 also features some minor changes:

  • Reinstated compatibility with Julia 1.3
  • Support for CUDA 11.0 Update 1
  • Support for CUDNN 8.0.2

Known issues

Several operations on sparse arrays have been broken since CUDA.jl 1.2, due to
the deprecations that were part of CUDA 11. The next version of CUDA.jl will
drop support for CUDA 10.0 or older, which will make it possible to use new
cuSPARSE APIs and add back missing functionality.