By: Tim Besard
Re-posted from: https://juliagpu.org/post/2021-08-13-cuda_3.4/index.html
The latest version of CUDA.jl brings several new features, from improved atomic operations to initial support for arrays with unified memory. The native random number generator introduced in CUDA.jl 3.0 is now used by default, and support for memory pools other than the CUDA stream-ordered one has been removed.
Streamlined atomic operations
In preparation for integrating with the new standard @atomic macro introduced in Julia 1.7, we have streamlined the capabilities of atomic operations in CUDA.jl. The API is now split into two levels: low-level atomic_ methods for atomic functionality that is directly supported by the hardware, and a high-level @atomic macro that tries to perform operations natively, falling back to a loop with compare-and-swap otherwise. This fall-back implementation makes it possible to use more complex operations that do not map onto a single atomic operation:
julia> a = CuArray([1]);

julia> function kernel(a)
           CUDA.@atomic a[] <<= 1
           return
       end

julia> @cuda threads=16 kernel(a)

julia> a
1-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
 65536

julia> 1<<16
65536
The only requirement is that the types being used are supported by CUDA.atomic_cas!. This includes common types like 32- and 64-bit integers and floating-point numbers, as well as 16-bit numbers on devices with compute capability 7.0 or higher.
Note that on Julia 1.7 and higher, CUDA.jl no longer exports the @atomic macro, to avoid conflicts with the version in Base. It is therefore recommended to always fully qualify uses of the macro, i.e., write CUDA.@atomic as in the example above.
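For comparison, the low-level interface can be used directly when the operation maps onto a single hardware atomic. Here is a minimal sketch (using atomic_add! on a pointer into the array; no compare-and-swap fall-back is involved):

julia> a = CuArray(Int32[0]);

julia> function kernel(a)
           # emit a hardware atomic add on the first element
           CUDA.atomic_add!(pointer(a), Int32(1))
           return
       end

julia> @cuda threads=16 kernel(a)

julia> a
1-element CuArray{Int32, 1, CUDA.Mem.DeviceBuffer}:
 16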
Arrays with unified memory
You may have noticed that the CuArray type in the example above included an additional parameter, Mem.DeviceBuffer. This has been introduced to support arrays backed by different kinds of buffers. By default, we will use an ordinary device buffer, but it's now possible to allocate arrays backed by unified buffers that can be used on multiple devices:
julia> a = cu([0]; unified=true)
1-element CuArray{Int64, 1, CUDA.Mem.UnifiedBuffer}:
 0

julia> a .+= 1
1-element CuArray{Int64, 1, CUDA.Mem.UnifiedBuffer}:
 1

julia> device!(1)

julia> a .+= 1
1-element CuArray{Int64, 1, CUDA.Mem.UnifiedBuffer}:
 2
Although all operations should work equally well with arrays backed by unified memory, they have not been optimized yet. For example, copying memory to the device could be avoided as the driver can automatically page in unified memory on-demand.
New default random number generator
CUDA.jl 3.0 introduced a new random number generator, and starting with CUDA.jl 3.2 its performance and quality had improved to the point where it could be used by applications. A couple of features were still missing, though, such as generating normally-distributed random numbers and support for complex numbers. These features have been added in CUDA.jl 3.3, and the generator is now used as the default fallback whenever CURAND does not support the requested element types.
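For example, filling an array of complex numbers with normally-distributed values is not something CURAND supports, so it now transparently goes through the native generator (a minimal sketch; the exact set of element types handled by CURAND may differ):

julia> using Random

julia> a = CUDA.zeros(ComplexF32, 2, 2);

julia> randn!(CUDA.RNG(), a);   # explicitly use the native generator

julia> randn!(a);               # CURAND has no complex generator, so this falls back to the native one as well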
Both the performance and the quality of this generator are much better than those of the previous, GPUArrays.jl-based one:
julia> using BenchmarkTools
julia> cuda_rng = CUDA.RNG();
julia> gpuarrays_rng = GPUArrays.default_rng(CuArray);
julia> a = CUDA.zeros(1024,1024);

julia> @benchmark CUDA.@sync rand!($cuda_rng, $a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  17.040 μs …  2.430 ms  ┊ GC (min … max): 0.00% … 99.04%
 Time  (median):     18.500 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.604 μs ± 34.734 μs  ┊ GC (mean ± σ):  1.17% ± 0.99%

         ▃▆█▇▇▅▄▂▁
  ▂▂▂▃▄▆███████████▇▆▆▅▅▄▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂ ▄
  17 μs           Histogram: frequency by time          24.1 μs <

julia> @benchmark CUDA.@sync rand!($gpuarrays_rng, $a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  72.489 μs …  2.790 ms  ┊ GC (min … max): 0.00% … 98.44%
 Time  (median):     74.479 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   81.211 μs ± 61.598 μs  ┊ GC (mean ± σ):  0.67% ± 1.40%

  █                                                          ▁
  █▆▃▁▃▃▅▆▅▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▄▆▁▁▁▁▁▁▁▁▄▄▃▄▃▁▁▁▁▁▁▁▁▁▃▃▄▆▄▁▄▃▆ █
  72.5 μs       Histogram: log(frequency) by time        443 μs <
julia> using RNGTest
julia> test_cuda_rng = RNGTest.wrap(cuda_rng, UInt32);
julia> test_gpuarrays_rng = RNGTest.wrap(gpuarrays_rng, UInt32);

julia> RNGTest.smallcrushTestU01(test_cuda_rng)
All tests were passed

julia> RNGTest.smallcrushTestU01(test_gpuarrays_rng)
The following tests gave p-values outside [0.001, 0.9990]:
       Test                          p-value
 ----------------------------------------------
  1  BirthdaySpacings                 eps
  2  Collision                        eps
  3  Gap                              eps
  4  SimpPoker                       1.0e-4
  5  CouponCollector                  eps
  6  MaxOft                           eps
  7  WeightDistrib                    eps
 10  RandomWalk1 M                   6.0e-4
 ----------------------------------------------
 (eps means a value < 1.0e-300)
Removal of old memory pools
With the new stream-ordered allocator caching memory allocations at the CUDA library level, much of the need for Julia-managed memory pools has disappeared. To simplify the allocation code, we have removed support for those pools (i.e., binned, split and simple). You can now only use the cuda memory pool, or disable pooling altogether by setting the JULIA_CUDA_MEMORY_POOL environment variable to none.
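For example, to run without a pool (a sketch; the variable only needs to be set before CUDA.jl initializes the GPU, e.g. before loading the package):

julia> ENV["JULIA_CUDA_MEMORY_POOL"] = "none";

julia> using CUDA

julia> CUDA.rand(1);   # allocations now bypass the pool and go straight to the driver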
Not using a memory pool degrades performance, so if you are stuck on an NVIDIA driver that does not support CUDA 11.2, it is advised to remain on CUDA.jl 3.3 until you can upgrade.
Also note that the new stream-ordered allocator has turned out to be incompatible with legacy cuIpc APIs, as used by OpenMPI. If that applies to you, consider disabling the memory pool, or revert to CUDA.jl 3.3 if your application's allocation pattern benefits from one.
Because of this, we will be maintaining CUDA.jl 3.3 longer than usual. All bug fixes in CUDA.jl 3.4 have already been backported to the previous release, which is currently at version 3.3.6.
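If you do need to stay on the previous release, pinning it is a one-liner in the package manager (shown here for the Pkg REPL):

pkg> add CUDA@3.3

pkg> pin CUDA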
Device capability-dependent kernel code
Some of the improvements in this release depend on the ability to write generic code that only uses certain hardware features when they are available. To facilitate writing such code, the compiler now embeds metadata in the generated code that can be used to branch on.
Currently, the device's compute capability and the PTX ISA version are embedded and made available through the compute_capability and ptx_isa_version functions, respectively. A simplified version number type, constructable using the sv"..." string macro, can be used to test against these properties. For example:
julia> function kernel(a)
           a[] = compute_capability() >= sv"6.0" ? 1 : 2
           return
       end
kernel (generic function with 1 method)

julia> CUDA.code_llvm(kernel, Tuple{CuDeviceVector{Float32, AS.Global}})
define void @julia_kernel_1({ i8 addrspace(1)*, i64, [1 x i64] }* %0) {
top:
%1 = bitcast { i8 addrspace(1)*, i64, [1 x i64] }* %0 to float addrspace(1)**
%2 = load float addrspace(1)*, float addrspace(1)** %1, align 8
store float 1.000000e+00, float addrspace(1)* %2, align 4
ret void
}

julia> capability(device!(1))
v"3.5.0"

julia> CUDA.code_llvm(kernel, Tuple{CuDeviceVector{Float32, AS.Global}})
define void @julia_kernel_2({ i8 addrspace(1)*, i64, [1 x i64] }* %0) {
top:
%1 = bitcast { i8 addrspace(1)*, i64, [1 x i64] }* %0 to float addrspace(1)**
%2 = load float addrspace(1)*, float addrspace(1)** %1, align 8
store float 2.000000e+00, float addrspace(1)* %2, align 4
ret void
}
The branch on the compute capability is completely optimized away. At the same time, this does not require re-inferring the function as the optimization happens at the LLVM level.