Author Archives: Tim Besard

CUDA.jl 4.0

By: Tim Besard

Re-posted from: https://juliagpu.org/post/2023-02-01-cuda_4.0/index.html

CUDA.jl 4.0 is a breaking release that introduces the use of JLLs to provide the CUDA toolkit. This makes it possible to compile other binary libraries against the CUDA runtime, and use them together with CUDA.jl. The release also brings CUSPARSE improvements, the ability to limit memory use, and many bug fixes and performance improvements.

JLLs for CUDA artifacts

While CUDA.jl has been using binary artifacts for a while, it managed their installation and selection manually, i.e., not through standardised JLL packages. This complicated the use of these artifacts by other packages, and made it difficult to build other binary packages against the CUDA runtime.

With CUDA.jl 4.0, we now use JLLs to load the CUDA driver and runtime. Specifically, there are two JLLs in play: CUDA_Driver_jll and CUDA_Runtime_jll. The former is responsible for loading the CUDA driver library (possibly upgrading it using a forward-compatible version), and determining the CUDA version that your set-up supports:

❯ JULIA_DEBUG=CUDA_Driver_jll julia
julia> using CUDA_Driver_jll
┌ System CUDA driver found at libcuda.so.1, detected as version 12.0.0
└ @ CUDA_Driver_jll
┌ System CUDA driver is recent enough; not using forward-compatible driver
└ @ CUDA_Driver_jll

With the driver identified and loaded, CUDA_Runtime_jll can select a compatible toolkit. By default, it uses the latest supported toolkit that is compatible with the driver:

julia> using CUDA_Runtime_jll

julia> CUDA_Runtime_jll.cuda_toolkits
10-element Vector{VersionNumber}:
 v"10.2.0"
 v"11.0.0"
 v"11.1.0"
 v"11.2.0"
 v"11.3.0"
 v"11.4.0"
 v"11.5.0"
 v"11.6.0"
 v"11.7.0"
 v"11.8.0"julia> CUDA_Runtime_jll.host_platform
Linux x86_64 {cuda=11.8}

As you can see, the selected CUDA runtime is encoded in the host platform. This makes it possible for Julia to automatically select compatible versions of other binary packages. For example, if we install and load SuiteSparse_GPU_jll, which right now provides builds for CUDA 10.2, 11.0 and 12.0, the artifact resolution code knows to load the build for CUDA 11.0 which is compatible with the selected CUDA 11.8 runtime:

julia> using SuiteSparse_GPU_jll

julia> SuiteSparse_GPU_jll.best_wrapper
"~/.julia/packages/SuiteSparse_GPU_jll/.../x86_64-linux-gnu-cuda+11.0.jl"

The switch to JLLs necessitated a breaking change: the JULIA_CUDA_VERSION and JULIA_CUDA_USE_BINARYBUILDER environment variables have been removed, and are replaced by preferences that are set in the current environment. For convenience, you can set these preferences by calling CUDA.set_runtime_version!:

❯ julia --project
julia> using CUDA
julia> CUDA.runtime_version()
v"11.8.0"julia> CUDA.set_runtime_version!(v"11.7")
┌ Set CUDA Runtime version preference to 11.7,
└ please re-start Julia for this to take effect.

❯ julia --project
julia> using CUDA
julia> CUDA.runtime_version()
v"11.7.0"julia> using CUDA_Runtime_jll
julia> CUDA_Runtime_jll.host_platform
Linux x86_64 {cuda=11.7}

The changed preference is reflected in the host platform, which means that you can use this mechanism to load different builds of other binary packages. For example, if you rely on a package or JLL that does not yet have a build for CUDA 12, you could set the preference to v"11.x" to load an available build.

To use a local CUDA installation instead, you can set the version to "local". This swaps CUDA_Runtime_jll for CUDA_Runtime_Discovery.jl, an API-compatible package that discovers the runtime on the local system:

❯ julia --project
julia> using CUDA

julia> CUDA.set_runtime_version!("local")
┌ Set CUDA Runtime version preference to local,
└ please re-start Julia for this to take effect.

❯ JULIA_DEBUG=CUDA_Runtime_Discovery julia --project
julia> using CUDA
┌ Looking for CUDA toolkit via environment variables CUDA_PATH
└ @ CUDA_Runtime_Discovery
┌ Looking for binary ptxas in /opt/cuda
│   all_locations =
│    2-element Vector{String}:
│     "/opt/cuda"
│     "/opt/cuda/bin"
└ @ CUDA_Runtime_Discovery
┌ Debug: Found ptxas at /opt/cuda/bin/ptxas
└ @ CUDA_Runtime_Discovery
...

Memory limits

By popular demand, support for memory limits has been reinstated. This functionality had been removed after the switch to CUDA memory pools, as the memory pool allocator does not yet support memory limits. Awaiting improvements by NVIDIA, we have added functionality to impose memory limits from the Julia side, in the form of two environment variables:

  • JULIA_CUDA_SOFT_MEMORY_LIMIT: This is an advisory limit, used to configure the memory pool, which will result in the pool being shrunk down to the requested limit at every synchronization point. That means that the pool may temporarily grow beyond the limit. This limit is unavailable when disabling memory pools (with JULIA_CUDA_MEMORY_POOL=none).

  • JULIA_CUDA_HARD_MEMORY_LIMIT: This is a hard limit, checked before every allocation. Doing so is relatively expensive, so it is recommended to use the soft limit instead.

The value of these variables can be formatted as a number of bytes, optionally followed by a unit, or as a percentage of the total device memory. Examples: 100M, 50%, 1.5GiB, 10000.
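
For example, a hypothetical session with a hard limit of 1 GiB might look as follows (illustrative only; the exact out-of-memory error message may differ):

❯ JULIA_CUDA_HARD_MEMORY_LIMIT=1GiB julia
julia> using CUDA

julia> a = CuArray{UInt8}(undef, 512 << 20);  # 512 MiB, fits within the limit

julia> b = CuArray{UInt8}(undef, 2 << 30);    # 2 GiB, rejected by the hard limit
ERROR: Out of GPU memory ...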

CUSPARSE improvements

Thanks to the work of @amontoison, the CUSPARSE interface has undergone many improvements:

  • Better support for the CuSparseMatrixCOO format, in particular the addition of CuSparseMatrixCOO * CuVector and CuSparseMatrixCOO * CuMatrix products;

  • Routines specialized for the -, + and * operations between sparse matrices (CuSparseMatrixCOO, CuSparseMatrixCSC and CuSparseMatrixCSR) have been interfaced (see the sketch after this list);

  • New generic routines for backward and forward sweeps with sparse triangular matrices are now used by \;

  • CuMatrix * CuSparseVector and CuMatrix * CuSparseMatrix products have been added;

  • Conversions between sparse and dense matrices have been updated to use more recent, optimized routines;

  • High-level Julia functions have been added for the new set of sparse BLAS 1 routines, such as dot products between CuSparseVectors;

  • Missing dispatches for the mul! and ldiv! functions have been added;

  • Almost all new CUSPARSE routines added by the CUDA 11.x toolkits have been interfaced.
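
To give a flavor of these improvements, here is a minimal sketch exercising some of the new functionality (the exact formats and conversions shown are assumptions based on the list above):

using CUDA, CUDA.CUSPARSE
using SparseArrays, LinearAlgebra

A = sprand(Float32, 100, 100, 0.1)   # random sparse matrix on the CPU
x = rand(Float32, 100)

dA = CuSparseMatrixCSR(A)            # upload in CSR format
dx = CuVector(x)

dy = dA * dx                         # sparse matrix-vector product
dB = dA + dA                         # addition of two sparse matrices
dC = CuSparseMatrixCOO(dA) * dx      # COO products are supported too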

Other changes

  • Removal of the CUDNN, CUTENSOR, CUTENSORNET and CUSTATEVEC submodules: These have been moved into their own packages, respectively cuDNN.jl, cuTENSOR.jl, cuTensorNet.jl and cuStateVec.jl (note the change in capitalization, now following NVIDIA's naming scheme);

  • Removal of the NVTX submodule: NVTX.jl should be used instead, which is a more complete implementation of the NVTX API;

  • Support for CUDA 11.8 (support for CUDA 12.0 is being worked on);

  • Support for Julia 1.9.

Backport releases

Because CUDA.jl 4.0 is a breaking release, two additional releases have been made that backport bugfixes and select features:

  • CUDA.jl 3.12.1 and 3.12.2: backports of bugfixes since 3.12

  • CUDA.jl 3.13.0: additionally adding the memory limit functionality

Technical preview: Programming Apple M1 GPUs in Julia with Metal.jl

By: Tim Besard

Re-posted from: https://juliagpu.org/post/2022-06-24-metal/index.html

Julia has gained a new GPU back-end: Metal.jl, for working with Apple's M1 GPUs. The back-end is built on the same foundations that make up existing GPU packages like CUDA.jl and AMDGPU.jl, so it should be familiar to anybody who's already programmed GPUs in Julia. In the following post I'll demonstrate some of that functionality and explain how it works.

But first, note that Metal.jl is under heavy development: The package is considered experimental for now, as we're still working on squashing bugs and adding essential functionality. We also haven't optimized for performance yet. If you're interested in using Metal.jl, please consider contributing to its development! Most of the package is written in Julia, and checking out the source code is a single Pkg.develop away :-)

Quick start

Start by getting a hold of the upcoming Julia 1.8, launch it, and enter the package manager by pressing ]:

julia> ]

pkg> add Metal
  Installed Metal

Installation is as easy as that, and we'll automatically download the necessary binary artifacts (a C wrapper for the Metal APIs, and an LLVM back-end). Then, leave the package manager by pressing backspace, import the Metal package, and e.g. call the versioninfo() method for some details on the toolchain:

julia> using Metal

julia> Metal.versioninfo()
macOS 13.0.0, Darwin 21.3.0

Toolchain:
- Julia: 1.8.0-rc1
- LLVM: 13.0.1

1 device:
- Apple M1 Pro (64.000 KiB allocated)

And there we go! You'll note here that I'm using the upcoming macOS 13 (Ventura); this is currently the only supported operating system. We also only support M-series GPUs, even though Metal does support other GPUs. These choices were made to simplify development, and aren't technical limitations. In fact, Metal.jl does work on e.g. macOS Monterey with an Intel GPU, but it's an untested combination that may suffer from bugs.

Array programming

Just like our other GPU back-ends, Metal.jl offers an array abstraction that greatly simplifies GPU programming. The abstraction centers around the MtlArray type that can be used to manage memory and perform GPU computations:

# allocate + initialize
julia> a = MtlArray(rand(Float32, 2, 2))
2×2 MtlArray{Float32, 2}:
 0.158752  0.836366
 0.535798  0.153554

# perform some GPU-accelerated operations
julia> b = a * a
2×2 MtlArray{Float32, 2}:
 0.473325  0.261202
 0.167333  0.471702

# back to the CPU
julia> Array(b)
2×2 Matrix{Float32}:
 0.473325  0.261202
 0.167333  0.471702

Beyond these simple operations, Julia's higher-order array abstractions can be used to express more complex operations without ever having to write a kernel:

julia> mapreduce(sin, +, a; dims=1)
1×2 MtlArray{Float32, 2}:
 1.15276  0.584146

julia> cos.(a .+ 2) .* 3
2×2 MtlArray{Float32, 2}:
 -2.0472   -1.25332
 -2.96594  -2.60351

Much of this functionality comes from the GPUArrays.jl package, which provides vendor-neutral implementations of common array operations. As a result, MtlArray is already pretty capable, and should be usable with realistic array-based applications.

Kernel programming

Metal.jl's array operations are implemented in Julia, using our native kernel programming capabilities and accompanying JIT-compiler. A small demonstration:

# a simple kernel that sets elements of an array to a value
function memset_kernel(array, value)
  i = thread_position_in_grid_1d()
  if i <= length(array)
    @inbounds array[i] = value
  end
  return
end

a = MtlArray{Float32}(undef, 512)
@metal threads=512 grid=2 memset_kernel(a, 42)

# verify
@assert all(isequal(42), Array(a))

As can be seen here, we've opted to deviate slightly from the Metal Shading Language, instead providing a programming experience that's similar to Julia's existing back-ends. Some key differences:

  • we use intrinsic functions instead of special kernel function arguments to access properties like the thread position, grid size, … (see the vector-add sketch after this list);

  • all types of arguments (buffers, indirect buffers, value-typed inputs) are transparently converted to a GPU-compatible structure[1];

  • global (task-bound) state is used to keep track of the active device and a queue;

  • compute pipeline set-up and command encoding are hidden behind a single macro.
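
To illustrate, here's a minimal vector-add sketch along the lines of the example above (hypothetical kernel and sizes; note that grid counts threadgroups, so 2 groups of 512 threads cover 1024 elements):

using Metal

# each thread handles the element at its global position
function vadd_kernel(c, a, b)
  i = thread_position_in_grid_1d()
  if i <= length(c)
    @inbounds c[i] = a[i] + b[i]
  end
  return
end

a = MtlArray(rand(Float32, 1024))
b = MtlArray(rand(Float32, 1024))
c = similar(a)

@metal threads=512 grid=2 vadd_kernel(c, a, b)
@assert Array(c) == Array(a) .+ Array(b)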

Behind the scenes, we compile Julia to LLVM IR and use a tiny LLVM back-end (based on @a2flo's libfloor) that (re)writes the bitcode to a Metal-compatible library containing LLVM 5 bitcode. You can inspect the generated IR using @device_code_metal:

julia> @device_code_metal @metal threads=512 grid=2 memset_kernel(a, 42)
[header]
program_count: 1
...

[program]
name: julia_memset_kernel
type: kernel
...
target datalayout = "..."
target triple = "air64-apple-macosx13.0.0"

; the (rewritten) kernel function:
;  - %value argument passed by reference
;  - %thread_position_in_grid argument added
;  - sitofp rewritten to AIR-specific intrinsic
define void @julia_memset_kernel(
    { i8 addrspace(1)*, [1 x i64] } addrspace(1)* %array,
    i64 addrspace(1)* %value,
    i32 %thread_position_in_grid) {
  ...
  %9 = tail call float @air.convert.f.f32.s.i64(i64 %7)
  ...
  ret void
}

; minimal required argument metadata
!air.kernel = !{!10}
!10 = !{void ({ i8 addrspace(1)*, [1 x i64] } addrspace(1)*,
              i64 addrspace(1)*, i32)* @julia_memset_kernel, !11, !12}
!12 = !{!13, !14, !15}
!13 = !{i32 0, !"air.buffer", !"air.location_index", i32 0, i32 1,
       !"air.read_write", !"air.address_space", i32 1,
       !"air.arg_type_size", i32 16, !"air.arg_type_align_size", i32 8}
!14 = !{i32 1, !"air.buffer", !"air.location_index", i32 1, i32 1,
       !"air.read_write", !"air.address_space", i32 1,
       !"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}
!15 = !{i32 0, !"air.thread_position_in_grid"}

; other metadata not shown, for brevity

Shout-out to @max-Hawkins for exploring Metal code generation during his internship at Julia Computing!

Metal APIs in Julia

Lacking an Objective-C or C++ FFI, we interface with the Metal libraries using a shim C library. Most users won't have to interface with Metal directly – the array abstraction is sufficient for many – but more experienced developers can make use of the high-level wrappers that we've designed for the Metal APIs:

julia> dev = MtlDevice(1)
MtlDevice:
  name:             Apple M1 Pro
  lowpower:         false
  headless:         false
  removable:        false
  unified memory:   true

julia> desc = MtlHeapDescriptor()
MtlHeapDescriptor:
  type:             MtHeapTypeAutomatic
  storageMode:      MtStorageModePrivate
  size:             0

julia> desc.size = 16384
16384

julia> heap = MtlHeap(dev, desc)
MtlHeap:
  type:                 MtHeapTypeAutomatic
  size:                 16384
  usedSize:             0
  currentAllocatedSize: 16384

# etc

These wrappers are based on @PhilipVinc's excellent work on MetalCore.jl, which formed the basis for (and has been folded into) Metal.jl.

What's next?

The current release of Metal.jl focusses on code generation capabilities, and is meant as a preview for users and developers to try out on their system or with their specific GPU application. It is not production-ready yet, and is lacking some crucial features:

  • performance optimization

  • integration with Metal Performance Shaders

  • integration / documentation for use with Xcode tools

  • fleshing out the array abstraction based on user feedback

Please consider helping out with any of these! Since Metal.jl and its dependencies are almost entirely implemented in Julia, any experience with the language is sufficient to contribute. If you're not certain, or have any questions, please drop by the #gpu channel on the JuliaLang Slack, ask questions on our Discourse, or chat to us during the GPU office hours every other Monday.

If you encounter any bugs, feel free to let us know on the Metal.jl issue tracker. For information on upcoming releases, subscribe to this website's blog where we post about significant developments in Julia's GPU ecosystem.


[1] This relies on Metal 3 from macOS 13, which introduced bindless argument buffers, as we didn't fully figure out how to reliably encode arbitrarily-nested indirect buffers in argument encoder metadata.

oneAPI.jl status update

By: Tim Besard

Re-posted from: https://juliagpu.org/post/2022-04-06-oneapi_update/index.html

It has been over a year since the last update on oneAPI.jl, the Julia package for programming Intel GPUs (and other accelerators) using the oneAPI toolkit. Since then, the package has been under steady development, and several new features have been added to improve the developer experience and usability of the package.

@atomic intrinsics

oneAPI.jl now supports atomic operations, which are required to implement a variety of parallel algorithms. Low-level atomic functions (atomic_add!, atomic_xchg!, etc) are available as unexported methods in the oneAPI module:

a = oneArray(Int32[0])

function kernel(a)
    oneAPI.atomic_add!(pointer(a), Int32(1))
    return
end

@oneapi items=256 kernel(a)
@test Array(a)[1] == 256

Note that these methods are only available for those types that are supported by the underlying OpenCL intrinsics. For example, the atomic_add! from above can only be used with Int32 and UInt32 inputs.

Most users will instead rely on the higher-level @atomic macro, which can be easily put in front of many array operations to make them behave atomically. To avoid clashing with the new @atomic macro in Julia 1.7, this macro is also unexported:

a = oneArray(Int32[0])

function kernel(a)
    oneAPI.@atomic a[1] += Int32(1)
    return
end

@oneapi items=256 kernel(a)
@test Array(a)[1] == 256

When used with operations that are supported by OpenCL, this macro will lower to calls like atomic_add!. For other operations, a compare-and-exchange loop will be used. Note that for now, this is still restricted to 32-bit operations, as we do not support the cl_khr_int64_base_atomics extension for 64-bit atomics.
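
For instance, there is no native OpenCL atomic for Float32 addition, so a kernel like the following sketch (an assumption based on the fallback described above) would take the compare-and-exchange path:

using oneAPI, Test

a = oneArray(Float32[0])

function kernel(a)
    # no native OpenCL intrinsic for this operation; lowered to a CAS loop
    oneAPI.@atomic a[1] += 1f0
    return
end

@oneapi items=256 kernel(a)
@test Array(a)[1] == 256f0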

Initial integration with vendor libraries

One significant missing feature is the integration with vendor libraries like oneMKL. These integrations are required to ensure good performance for important operations like matrix multiplication, which currently fall back to generic implementations in Julia that may not always perform as well.

To improve this situation, we are working on a wrapper library that allows us to integrate with oneMKL and other oneAPI and SYCL libraries. Currently, only matrix multiplication is supported, but once the infrastructural issues are worked out we expect to quickly support many more operations.
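
With those wrappers in place, matrix multiplication should dispatch to oneMKL behind the scenes; a minimal sketch, assuming a supported Intel GPU:

using oneAPI

A = oneArray(rand(Float32, 128, 128))
B = oneArray(rand(Float32, 128, 128))

C = A * B   # dispatches to the oneMKL gemm wrapper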

If you need support for specific libraries, please have a look at this PR. As the API surface is significant, we will need help to extend the wrapper library and integrate it with high-level Julia libraries like LinearAlgebra.jl.

Correctness issues

In porting existing Julia GPU applications to oneAPI.jl, we fixed several bugs that caused correctness issues when executing code on Intel GPUs:

  • when the garbage collector frees GPU memory, it now blocks until all outstanding commands (which may include uses of said memory) have completed;

  • the barrier function used to synchronize threads is now marked convergent to avoid LLVM miscompilations.

Note that if you are using Tiger Lake hardware, there is currently a known issue in the back-end Intel compiler that affects oneAPI.jl, causing correctness issues that can be spotted by running the oneAPI.jl test suite.

Future work

To significantly improve the usability of oneAPI.jl, we will add support for the KernelAbstractions.jl package. This library is used by many other packages to add GPU acceleration to algorithms that cannot easily be expressed using only array operations. As such, support for oneAPI.jl will make it possible to use your oneAPI GPUs with all of these packages.