By: Tim Besard
Re-posted from: https://juliagpu.org/post/2022-06-24-metal/index.html
Julia has gained a new GPU back-end: Metal.jl, for working with Apple's M1 GPUs. The back-end is built on the same foundations that make up existing GPU packages like CUDA.jl and AMDGPU.jl, so it should be familiar to anybody who's already programmed GPUs in Julia. In the following post I'll demonstrate some of that functionality and explain how it works.
But first, note that Metal.jl is under heavy development: the package is considered experimental for now, as we're still working on squashing bugs and adding essential functionality. We also haven't optimized for performance yet. If you're interested in using Metal.jl, please consider contributing to its development! Most of the package is written in Julia, and checking out the source code is a single Pkg.develop away :-)
Quick start
Start by getting a hold of the upcoming Julia 1.8, launch it, and enter the package manager by pressing ]:
julia> ]

pkg> add Metal
Installed Metal
Installation is as easy as that, and we'll automatically download the necessary binary artifacts (a C wrapper for the Metal APIs, and an LLVM back-end). Then, leave the package manager by pressing backspace, import the Metal package, and e.g. call the versioninfo() method for some details on the toolchain:
julia> using Metal

julia> Metal.versioninfo()
macOS 13.0.0, Darwin 21.3.0

Toolchain:
- Julia: 1.8.0-rc1
- LLVM: 13.0.1

1 device:
- Apple M1 Pro (64.000 KiB allocated)
And there we go! You'll note here that I'm using the upcoming macOS 13 (Ventura); this is currently the only supported operating system. We also only support M-series GPUs, even though Metal does support other GPUs. These choices were made to simplify development, and aren't technical limitations. In fact, Metal.jl does work on e.g. macOS Monterey with an Intel GPU, but it's an untested combination that may suffer from bugs.
Array programming
Just like our other GPU back-ends, Metal.jl offers an array abstraction that greatly simplifies GPU programming. The abstraction centers around the MtlArray type that can be used to manage memory and perform GPU computations:
# allocate + initialize
julia> a = MtlArray(rand(Float32, 2, 2))
2×2 MtlArray{Float32, 2}:
0.158752 0.836366
0.535798 0.153554

# perform some GPU-accelerated operations
julia> b = a * a
2×2 MtlArray{Float32, 2}:
0.473325 0.261202
0.167333 0.471702

# back to the CPU
julia> Array(b)
2×2 Matrix{Float32}:
0.473325 0.261202
0.167333 0.471702
Beyond these simple operations, Julia's higher-order array abstractions can be used to express more complex operations without ever having to write a kernel:
julia> mapreduce(sin, +, a; dims=1)
1×2 MtlArray{Float32, 2}:
1.15276 0.584146

julia> cos.(a .+ 2) .* 3
2×2 MtlArray{Float32, 2}:
-2.0472 -1.25332
-2.96594 -2.60351
Much of this functionality comes from the GPUArrays.jl package, which provides vendor-neutral implementations of common array operations. As a result, MtlArray is already pretty capable, and should be usable with realistic array-based applications.
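As a rough illustration of the array-only style described above, the following sketch uploads two vectors, combines them with a fused broadcast, reduces the result, and checks both against the CPU. The variable names and sizes are arbitrary choices for this example, and Float32 is used on the assumption that the GPU has no Float64 support.

```julia
using Metal

# Host data (names and sizes are illustrative only).
x = rand(Float32, 1024)
y = rand(Float32, 1024)

# Upload to the GPU.
dx = MtlArray(x)
dy = MtlArray(y)

# A fused broadcast, evaluated on the GPU without writing a kernel.
dz = 2f0 .* dx .+ dy

# A reduction over the GPU array; the scalar result is returned to the host.
total = sum(dz)

# Compare against the same computation on the CPU.
# Approximate comparison, since rounding may differ between devices.
z = 2f0 .* x .+ y
@assert Array(dz) ≈ z
@assert total ≈ sum(z)
```

Keeping intermediate results such as dz on the GPU, and only calling Array at the end, avoids unnecessary copies back to the host.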
Kernel programming
Metal.jl's array operations are implemented in Julia, using our native kernel programming capabilities and accompanying JIT-compiler. A small demonstration:
# a simple kernel that sets elements of an array to a value
function memset_kernel(array, value)
    i = thread_position_in_grid_1d()
    if i <= length(array)
        @inbounds array[i] = value
    end
    return
end

a = MtlArray{Float32}(undef, 512)
@metal threads=512 grid=2 memset_kernel(a, 42)

# verify
@assert all(isequal(42), Array(a))
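The same pattern extends to kernels with several arguments. The sketch below is a hypothetical element-wise addition kernel (vadd_kernel is not part of Metal.jl) that also derives the launch configuration from the array length. It assumes that grid= counts threadgroups, as the launch above suggests, and uses the keyword names from this post; other releases of Metal.jl may spell them differently.

```julia
using Metal

# Hypothetical example kernel: each thread adds one pair of elements.
function vadd_kernel(c, a, b)
    i = thread_position_in_grid_1d()
    # The launch may contain more threads than elements, so guard the access.
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return
end

n = 1000                       # deliberately not a multiple of the group size
a = MtlArray(rand(Float32, n))
b = MtlArray(rand(Float32, n))
c = similar(a)

threads = 256                  # threads per threadgroup (arbitrary choice)
groups = cld(n, threads)       # enough threadgroups to cover all n elements

# Assumption: `grid=` is the number of threadgroups, as in the post's example.
@metal threads=threads grid=groups vadd_kernel(c, a, b)

# verify against the CPU
@assert Array(c) ≈ Array(a) .+ Array(b)
```

Rounding the group count up with cld launches a few surplus threads, which is why the bounds check inside the kernel is needed.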
As can be seen here, we've opted to deviate slightly from the Metal Shading Language, instead providing a programming experience that's similar to Julia's existing back-ends. Some key differences:
- we use intrinsic functions instead of special kernel function arguments to access properties like the thread position, grid size, …;
- all types of arguments (buffers, indirect buffers, value-typed inputs) are transparently converted to a GPU-compatible structure[1];
- global (task-bound) state is used to keep track of the active device and a queue;
- compute pipeline set-up and command encoding is hidden behind a single macro.
Behind the scenes, we compile Julia to LLVM IR and use a tiny LLVM back-end (based on @a2flo's libfloor) that (re)writes the bitcode to a Metal-compatible library containing LLVM 5 bitcode. You can inspect the generated IR using @device_code_metal:
julia> @device_code_metal @metal threads=512 grid=2 memset_kernel(a, 42)
[header]
program_count: 1
...

[program]
name: julia_memset_kernel
type: kernel
...
target datalayout = "..."
target triple = "air64-apple-macosx13.0.0"

; the (rewritten) kernel function:
; - %value argument passed by reference
; - %thread_position_in_grid argument added
; - sitofp rewritten to AIR-specific intrinsic
define void @julia_memset_kernel(
{ i8 addrspace(1)*, [1 x i64] } addrspace(1)* %array,
i64 addrspace(1)* %value,
i32 %thread_position_in_grid) {
...
%9 = tail call float @air.convert.f.f32.s.i64(i64 %7)
...
ret void
}

; minimal required argument metadata
!air.kernel = !{!10}
!10 = !{void ({ i8 addrspace(1)*, [1 x i64] } addrspace(1)*,
i64 addrspace(1)*, i32)* @julia_memset_kernel, !11, !12}
!12 = !{!13, !14, !15}
!13 = !{i32 0, !"air.buffer", !"air.location_index", i32 0, i32 1,
!"air.read_write", !"air.address_space", i32 1,
!"air.arg_type_size", i32 16, !"air.arg_type_align_size", i32 8}
!14 = !{i32 1, !"air.buffer", !"air.location_index", i32 1, i32 1,
!"air.read_write", !"air.address_space", i32 1,
!"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}
!15 = !{i32 0, !"air.thread_position_in_grid"}

; other metadata not shown, for brevity
Shout-out to @max-Hawkins for exploring Metal code generation during his internship at Julia Computing!
Metal APIs in Julia
Lacking an Objective-C or C++ FFI, we interface with the Metal libraries using a shim C library. Most users won't have to interface with Metal directly – the array abstraction is sufficient for many – but more experienced developers can make use of the high-level wrappers that we've designed for the Metal APIs:
julia> dev = MtlDevice(1)
MtlDevice:
name: Apple M1 Pro
lowpower: false
headless: false
removable: false
unified memory: true

julia> desc = MtlHeapDescriptor()
MtlHeapDescriptor:
type: MtHeapTypeAutomatic
storageMode: MtStorageModePrivate
size: 0

julia> desc.size = 16384
16384

julia> heap = MtlHeap(dev, desc)
MtlHeap:
type: MtHeapTypeAutomatic
size: 16384
usedSize: 0
currentAllocatedSize: 16384

# etc
These wrappers are based on @PhilipVinc's excellent work on MetalCore.jl, which formed the basis for (and has been folded into) Metal.jl.
What's next?
The current release of Metal.jl focusses on code generation capabilities, and is meant as a preview for users and developers to try out on their system or with their specific GPU application. It is not production-ready yet, and is lacking some crucial features:
- performance optimization
- integration with Metal Performance Shaders
- integration / documentation for use with Xcode tools
- fleshing out the array abstraction based on user feedback
Please consider helping out with any of these! Since Metal.jl and its dependencies are almost entirely implemented in Julia, any experience with the language is sufficient to contribute. If you're not certain, or have any questions, please drop by the #gpu channel on the JuliaLang Slack, ask questions on our Discourse, or chat to us during the GPU office hours every other Monday.
If you encounter any bugs, feel free to let us know on the Metal.jl issue tracker. For information on upcoming releases, subscribe to this website's blog where we post about significant developments in Julia's GPU ecosystem.
[1] This relies on Metal 3 from macOS 13, which introduced bindless argument buffers, as we didn't fully figure out how to reliably encode arbitrarily-nested indirect buffers in argument encoder metadata.