By: Tim Besard
Re-posted from: https://juliagpu.org/post/2023-02-01-cuda_4.0/index.html
CUDA.jl 4.0 is a breaking release that introduces the use of JLLs to provide the CUDA toolkit. This makes it possible to compile other binary libraries against the CUDA runtime and use them together with CUDA.jl. The release also brings CUSPARSE improvements, the ability to limit memory use, and many bug fixes and performance improvements.
JLLs for CUDA artifacts
While CUDA.jl has been using binary artifacts for a while, it was manually managing their installation and selection, i.e., not by using standardised JLL packages. This complicated the use of these artifacts by other packages, and made it difficult to build other binary packages against the CUDA runtime.
With CUDA.jl 4.0, we now use JLLs to load the CUDA driver and runtime. Specifically, there are two JLLs in play: CUDA_Driver_jll and CUDA_Runtime_jll. The former is responsible for loading the CUDA driver library (possibly upgrading it using a forward-compatible version), and for determining the CUDA version that your set-up supports:
❯ JULIA_DEBUG=CUDA_Driver_jll julia
julia> using CUDA_Driver_jll
┌ Debug: System CUDA driver found at libcuda.so.1, detected as version 12.0.0
└ @ CUDA_Driver_jll
┌ Debug: System CUDA driver is recent enough; not using forward-compatible driver
└ @ CUDA_Driver_jll
With the driver identified and loaded, CUDA_Runtime_jll can select a compatible toolkit. By default, it uses the latest supported toolkit that is compatible with the driver:
julia> using CUDA_Runtime_jll
julia> CUDA_Runtime_jll.cuda_toolkits
10-element Vector{VersionNumber}:
v"10.2.0"
v"11.0.0"
v"11.1.0"
v"11.2.0"
v"11.3.0"
v"11.4.0"
v"11.5.0"
v"11.6.0"
v"11.7.0"
v"11.8.0"julia> CUDA_Runtime_jll.host_platform
Linux x86_64 {cuda=11.8}
As you can see, the selected CUDA runtime is encoded in the host platform. This makes it possible for Julia to automatically select compatible versions of other binary packages. For example, if we install and load SuiteSparse_GPU_jll, which right now provides builds for CUDA 10.2, 11.0 and 12.0, the artifact resolution code knows to load the build for CUDA 11.0, which is compatible with the selected CUDA 11.8 runtime:
julia> using SuiteSparse_GPU_jll
julia> SuiteSparse_GPU_jll.best_wrapper
"~/.julia/packages/SuiteSparse_GPU_jll/.../x86_64-linux-gnu-cuda+11.0.jl"
The change to JLLs is a breaking one: the JULIA_CUDA_VERSION and JULIA_CUDA_USE_BINARYBUILDER environment variables have been removed, and are replaced by preferences that are set in the current environment. For convenience, you can set these preferences by calling CUDA.set_runtime_version!:
❯ julia --project
julia> using CUDA
julia> CUDA.runtime_version()
v"11.8.0"julia> CUDA.set_runtime_version!(v"11.7")
┌ Set CUDA Runtime version preference to 11.7,
└ please re-start Julia for this to take effect.
❯ julia --project
julia> using CUDA
julia> CUDA.runtime_version()
v"11.7.0"julia> using CUDA_Runtime_jll
julia> CUDA_Runtime_jll.host_platform
Linux x86_64 {cuda=11.7}
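Behind the scenes, CUDA.set_runtime_version! uses Preferences.jl to record the version in the LocalPreferences.toml file of the active environment. After the call above, that file should contain something along these lines (the exact layout shown here is an assumption):
[CUDA_Runtime_jll]
version = "11.7"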
The changed preference is reflected in the host platform, which means that you can use this mechanism to load different builds of other binary packages. For example, if you rely on a package or JLL that does not yet have a build for CUDA 12, you could set the preference to v"11.x" to load an available build.
For discovering a local runtime, you can set the version to "local", which will replace the use of CUDA_Runtime_jll by CUDA_Runtime_Discovery.jl, an API-compatible package that looks up a locally-installed CUDA toolkit instead:
❯ julia --project
julia> CUDA.set_runtime_version!("local")
┌ Set CUDA Runtime version preference to local,
└ please re-start Julia for this to take effect.
❯ JULIA_DEBUG=CUDA_Runtime_Discovery julia --project
julia> using CUDA
┌ Debug: Looking for CUDA toolkit via environment variables CUDA_PATH
└ @ CUDA_Runtime_Discovery
┌ Debug: Looking for binary ptxas in /opt/cuda
│ all_locations =
│ 2-element Vector{String}:
│ "/opt/cuda"
│ "/opt/cuda/bin"
└ @ CUDA_Runtime_Discovery
┌ Debug: Found ptxas at /opt/cuda/bin/ptxas
└ @ CUDA_Runtime_Discovery
...
Memory limits
By popular demand, support for memory limits has been reinstated. This functionality had been removed after the switch to CUDA memory pools, as the memory pool allocator does not yet support memory limits. Awaiting improvements by NVIDIA, we have added functionality to impose memory limits from the Julia side, in the form of two environment variables:
- JULIA_CUDA_SOFT_MEMORY_LIMIT: This is an advisory limit, used to configure the memory pool, which will result in the pool being shrunk down to the requested limit at every synchronization point. That means that the pool may temporarily grow beyond the limit. This limit is unavailable when disabling memory pools (with JULIA_CUDA_MEMORY_POOL=none).
- JULIA_CUDA_HARD_MEMORY_LIMIT: This is a hard limit, checked before every allocation. Doing so is relatively expensive, so it is recommended to use the soft limit instead.
The value of these variables can be formatted as a number of bytes, optionally followed by a unit, or as a percentage of the total device memory. Examples: 100M, 50%, 1.5GiB, 10000.
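As an illustration, the following sketch shows how a hard limit caps allocations on a device with more than 1 GiB of memory (the exact error message is an assumption and may differ between versions):
❯ JULIA_CUDA_HARD_MEMORY_LIMIT=1GiB julia
julia> using CUDA
julia> CUDA.zeros(UInt8, 2^31)   # a 2 GiB allocation, exceeding the 1 GiB hard limit
ERROR: Out of GPU memory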
CUSPARSE improvements
Thanks to the work of @amontoison, the CUSPARSE interface has undergone many improvements:
- Better support of the CuSparseMatrixCOO format with, in particular, the addition of CuSparseMatrixCOO * CuVector and CuSparseMatrixCOO * CuMatrix products (see the sketch after this list);
- Routines specialized for -, +, * operations between sparse matrices (CuSparseMatrixCOO, CuSparseMatrixCSC and CuSparseMatrixCSR) have been interfaced;
- New generic routines for backward and forward sweeps with sparse triangular matrices are now used by \;
- CuMatrix * CuSparseVector and CuMatrix * CuSparseMatrix products have been added;
- Conversions between sparse and dense matrices have been updated to use more recent and optimized routines;
- High-level Julia functions for the new set of sparse BLAS 1 routines, such as dot products between CuSparseVectors, have been added;
- Missing dispatches for the mul! and ldiv! functions have been added;
- Almost all new CUSPARSE routines added by the v"11.x" CUDA toolkits have been interfaced.
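Here is a brief sketch of the new COO products; the conversion path via CSR is an assumption, and the in-place mul! call relies on the newly-added dispatches:
using CUDA, CUDA.CUSPARSE, SparseArrays, LinearAlgebra

A  = sprand(Float32, 100, 100, 0.1)           # random sparse matrix on the CPU
dA = CuSparseMatrixCOO(CuSparseMatrixCSR(A))  # upload as CSR, then convert to COO
x  = CUDA.rand(Float32, 100)
y  = CUDA.zeros(Float32, 100)

dA * x           # the new CuSparseMatrixCOO * CuVector product
mul!(y, dA, x)   # in-place variant, via the newly-added mul! dispatches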
Other changes
- Removal of the CUDNN, CUTENSOR, CUTENSORNET and CUSTATEVEC submodules: these have been moved into their own packages, respectively cuDNN.jl, cuTENSOR.jl, cuTensorNet.jl and cuStateVec.jl (note the change in capitalization, now following NVIDIA's naming scheme; see the sketch after this list);
- Removal of the NVTX submodule: NVTX.jl should be used instead, which is a more complete implementation of the NVTX API;
- Support for CUDA 11.8 (support for CUDA 12.0 is being worked on);
- Support for Julia 1.9.
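For code that used these submodules, the migration boils down to installing the stand-alone package and changing the import. A minimal sketch, using cuDNN as an example (the other packages work analogously):
# with CUDA.jl 3.x:
using CUDA.CUDNN

# with CUDA.jl 4.0, after `pkg> add cuDNN`:
using cuDNN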
Backport releases
Because CUDA.jl 4.0 is a breaking release, two additional releases have been made that backport bugfixes and select features:
- CUDA.jl 3.12.1 and 3.12.2: backports of bugfixes since 3.12;
- CUDA.jl 3.13.0: additionally adds the memory limit functionality.