GPUArrays v11: Port to KernelAbstractions.jl

By: Tim Besard

Re-posted from: https://juliagpu.org/post/2025-01-07-gpuarrays-11/index.html

The latest version of GPUArrays.jl involved a port of all vendor-neutral kernels to KernelAbstractions.jl. This should make it easier to add new functionality and improve the performance of existing kernels.

Vendor-neutral kernel DSL

Back in the day, we created GPUArrays.jl to avoid having to write separate kernels for each GPU back-end, by relying on a very simple vendor-neutral domain-specific language (DSL) that could be translated very easily to the back-end's native kernel language. As a simple example, the following kernel was used to compute the adjoint of a vector:

function LinearAlgebra.adjoint!(B::AbstractGPUMatrix, A::AbstractGPUVector)
    gpu_call(B, A) do ctx, B, A
        idx = @linearidx A
        @inbounds B[1, idx] = adjoint(A[idx])
        return
    end
    return B
end

This DSL was designed almost a decade ago, by Simon Danisch, and has served us well! Since then, KernelAbstractions.jl has been developed by Valentin Churavy, providing a more principled and powerful DSL. With many application developers switching to KernelAbstractions.jl, it was time to port GPUArrays.jl to this new DSL as well.

Thanks to the tireless work by James Schloss, GPUArrays.jl v11 now uses KernelAbstractions.jl for all vendor-neutral kernels. The aforementioned adjoint! kernel now looks like this:

function LinearAlgebra.adjoint!(B::AbstractGPUMatrix, A::AbstractGPUVector)
    @kernel function adjoint_kernel!(B, A)
        idx = @index(Global, Linear)
        @inbounds B[1, idx] = adjoint(A[idx])
    end
    adjoint_kernel!(get_backend(A))(B, A; ndrange=size(A))
    return B
end

As shown above, the KernelAbstractions.jl DSL is very similar to the old DSL, but it provides more flexibility and power (e.g., support for atomics through Atomix.jl). In addition, many more users are familiar with KernelAbstractions.jl, making it easier for them to contribute to GPUArrays.jl. A good first step here would be to port some of the vendor-specific kernels from CUDA.jl to GPUArrays.jl, making them available to all GPU back-ends. If you are interested in contributing, please reach out!

That said, the change is not without its challenges. The added flexibility offered by KernelAbstractions.jl with respect to indexing currently results in certain kernels being slower than before, specifically when there is not much computational complexity to amortise the cost of indexing (e.g., when doing very simple broadcasts). We are working on improving this, but it will take some time. Not to hold back the rest of the JuliaGPU ecosystem, we are releasing despite these performance issues. It's recommended to carefully benchmark your application after upgrading to v11, and to report any performance regressions

Back-end package versions

As GPUArrays.jl is not a direct dependency of most applications, the update will be pulled in by the following back-end package versions (some of which may not be released yet):

  • CUDA.jl v5.6

  • Metal.jl v1.5

  • oneAPI.jl v2.0

  • AMDGPU.jl v1.1