CUDA.jl 5.2 and 5.3: Maintenance releases

By: Tim Besard

Re-posted from: https://juliagpu.org/post/2024-04-26-cuda_5.2_5.3/index.html

CUDA.jl 5.2 and 5.3 are two minor releases of CUDA.jl that mostly focus on bug fixes and minor improvements, but also come with a number of interesting new features. This blog post summarizes the changes in these releases.

Profiler improvements

CUDA.jl 5.1 introduced a new native profiler, which can be used to profile Julia GPU applications without having to use Nsight Systems or other external tools. The tool has seen continued development, mostly to improve its robustness, but CUDA.jl now also provides a @bprofile equivalent that runs your application multiple times and reports on the time distribution of individual events:

julia> CUDA.@bprofile CuArray([1]) .+ 1
Profiler ran for 1.0 s, capturing 1427349 events.

Host-side activity: calling CUDA APIs took 792.95 ms (79.29% of the trace)
┌──────────┬────────────┬────────┬───────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │  Calls │ Time distribution                     │ Name                    │
├──────────┼────────────┼────────┼───────────────────────────────────────┼─────────────────────────┤
│   19.27% │  192.67 ms │ 109796 │   1.75 µs ± 10.19  (  0.95 ‥ 1279.83) │ cuMemAllocFromPoolAsync │
│   17.08% │   170.8 ms │  54898 │   3.11 µs ± 0.27   (  2.15 ‥ 23.84)   │ cuLaunchKernel          │
│   16.77% │  167.67 ms │  54898 │   3.05 µs ± 0.24   (  0.48 ‥ 16.69)   │ cuCtxSynchronize        │
│   14.11% │  141.12 ms │  54898 │   2.57 µs ± 0.79   (  1.67 ‥ 70.57)   │ cuMemcpyHtoDAsync       │
│    1.70% │   17.04 ms │  54898 │ 310.36 ns ± 132.89 (238.42 ‥ 5483.63) │ cuStreamSynchronize     │
└──────────┴────────────┴────────┴───────────────────────────────────────┴─────────────────────────┘

Device-side activity: GPU was busy for 87.38 ms (8.74% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                     │ Name               │
├──────────┼────────────┼───────┼───────────────────────────────────────┼────────────────────┤
│    6.66% │   66.61 ms │ 54898 │   1.21 µs ± 0.16   (  0.95 ‥ 1.67)    │ kernel             │
│    2.08% │   20.77 ms │ 54898 │ 378.42 ns ± 147.66 (238.42 ‥ 1192.09) │ [copy to device]   │
└──────────┴────────────┴───────┴───────────────────────────────────────┴────────────────────┘

NVTX ranges:
┌──────────┬────────────┬───────┬────────────────────────────────────────┬─────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                      │ Name                │
├──────────┼────────────┼───────┼────────────────────────────────────────┼─────────────────────┤
│   98.99% │  989.94 ms │ 54898 │  18.03 µs ± 49.88  ( 15.26 ‥ 10731.22) │ @bprofile.iteration │
└──────────┴────────────┴───────┴────────────────────────────────────────┴─────────────────────┘

By default, CUDA.@bprofile runs the application for 1 second, but this can be adjusted using the time keyword argument.

Display of the time distribution isn't limited to CUDA.@bprofile, and will also be used by CUDA.@profile when any operation is called more than once. For example, with the broadcasting example from above we allocate both the input CuArray and the broadcast result, which results in two calls to the allocator:

julia> CUDA.@profile CuArray([1]) .+ 1

Host-side activity:
┌──────────┬────────────┬───────┬─────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                   │ Name                    │
├──────────┼────────────┼───────┼─────────────────────────────────────┼─────────────────────────┤
│   99.92% │   99.42 ms │     1 │                                     │ cuMemcpyHtoDAsync       │
│    0.02% │   21.22 µs │     2 │  10.61 µs ± 6.57   (  5.96 ‥ 15.26) │ cuMemAllocFromPoolAsync │
│    0.02% │   17.88 µs │     1 │                                     │ cuLaunchKernel          │
│    0.00% │  953.67 ns │     1 │                                     │ cuStreamSynchronize     │
└──────────┴────────────┴───────┴─────────────────────────────────────┴─────────────────────────┘

Kernel launch debugging

A common issue with CUDA programming is that kernel launches may fail when exhausting certain resources, such as shared memory or registers. This typically results in a cryptic error message, but CUDA.jl will now try to diagnose launch failures and provide a more helpful error message, as suggested by @simonbyrne.

For example, when using more parameter memory than allowed by the architecture:

julia> kernel(x) = nothing
julia> @cuda kernel(ntuple(_->UInt64(1), 2^13))
ERROR: Kernel invocation uses too much parameter memory.
64.016 KiB exceeds the 31.996 KiB limit imposed by sm_89 / PTX v8.2.
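
The size in this message can be sanity-checked by hand: the kernel argument is a tuple of 2^13 UInt64 values of 8 bytes each. A minimal sketch of that arithmetic in plain Julia (no GPU required; the variable names are illustrative):

```julia
# Size of the kernel argument from the example above: 2^13 UInt64 values.
nelems = 2^13
param_bytes = nelems * sizeof(UInt64)   # 8 bytes per UInt64
param_kib = param_bytes / 1024          # convert bytes to KiB

println(param_bytes, " bytes = ", param_kib, " KiB")
```

The tuple data alone comes to 64 KiB, roughly twice the quoted limit. The error reports a slightly larger 64.016 KiB; the post does not say where the remainder comes from.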

Or when using an invalid launch configuration, violating a device limit:

julia> @cuda threads=2000 identity(nothing)
ERROR: Number of threads in x-dimension exceeds device limit (2000 > 1024).
caused by: CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)

We also diagnose launch failures that involve kernel-specific limits, such as exceeding the number of threads that are allowed in a block (e.g., because of register use):

julia> @cuda threads=1024 heavy_kernel()
ERROR: Number of threads per block exceeds kernel limit (1024 > 512).
caused by: CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)

Sorting improvements

Thanks to @xaellison, our bitonic sorting implementation now supports sorting specific dimensions, making it possible to implement sortperm for multi-dimensional arrays:

julia> A = cu([8 7; 5 6])
2×2 CuArray{Int64, 2, Mem.DeviceBuffer}:
 8  7
 5  6

julia> sortperm(A, dims = 1)
2×2 CuArray{Int64, 2, Mem.DeviceBuffer}:
 2  4
 1  3

julia> sortperm(A, dims = 2)
2×2 CuArray{Int64, 2, Mem.DeviceBuffer}:
 3  1
 2  4
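
The entries of the result are linear indices into A, not positions within each row or column. Base Julia (1.9 or later) offers the same dims keyword for CPU arrays, so this can be checked without a GPU. The sketch below assumes the GPU method mirrors the Base semantics, which the output above is consistent with:

```julia
# CPU analogue of the example above; sortperm with dims needs Julia 1.9+.
A = [8 7; 5 6]

p1 = sortperm(A, dims = 1)   # permutation that sorts each column
p2 = sortperm(A, dims = 2)   # permutation that sorts each row

# Indexing with the permutation reproduces the sorted array.
sorted_cols = A[p1]
sorted_rows = A[p2]
```

Because the indices are linear, A[p1] and A[p2] can be used directly, with no per-dimension offset arithmetic.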

The bitonic kernel is now used for all sorting operations, replacing the often slower quicksort implementation:

# before (quicksort)
julia> @btime CUDA.@sync sort($(CUDA.rand(1024, 1024)); dims=1)
  2.760 ms (30 allocations: 1.02 KiB)

# after (bitonic sort)
julia> @btime CUDA.@sync sort($(CUDA.rand(1024, 1024)); dims=1)
  246.386 μs (567 allocations: 13.66 KiB)

# reference CPU time
julia> @btime sort($(rand(Float32, 1024, 1024)); dims=1)
  4.795 ms (1030 allocations: 5.07 MiB)
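
For readers unfamiliar with bitonic sort, here is a textbook sequential version in plain Julia. It is only a sketch of the general technique, not CUDA.jl's kernel: the function name bitonic_sort! is hypothetical, and this version only handles vectors whose length is a power of two.

```julia
# Textbook bitonic sorting network, run sequentially on the CPU.
# Illustrative only; this is not the CUDA.jl implementation.
function bitonic_sort!(v::AbstractVector)
    n = length(v)
    ispow2(n) || throw(ArgumentError("length must be a power of two"))
    k = 2
    while k <= n                      # size of the bitonic sequences being merged
        j = k ÷ 2
        while j >= 1                  # distance between compared elements
            for i in 0:n-1            # zero-based position
                l = xor(i, j)         # partner position for this pass
                if l > i
                    ascending = (i & k) == 0
                    a, b = v[i+1], v[l+1]
                    # swap when the pair is out of order for its direction
                    if (ascending && a > b) || (!ascending && a < b)
                        v[i+1], v[l+1] = b, a
                    end
                end
            end
            j ÷= 2
        end
        k *= 2
    end
    return v
end

bitonic_sort!([3, 1, 4, 1, 5, 9, 2, 6])
```

Which pairs are compared depends only on the positions, never on the data, so all compare-and-swap steps of one inner pass are independent of each other. That property is what makes this family of algorithms a natural fit for GPUs.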

Unified memory fixes

CUDA.jl 5.1 greatly improved support for unified memory, and this has continued in CUDA.jl 5.2 and 5.3. Most notably, when broadcasting CuArrays we now correctly preserve the memory type of the input arrays. This means that if you broadcast a CuArray that is allocated as unified memory, the result will also be allocated as unified memory. In case of a conflict, e.g. broadcasting a unified CuArray with one backed by device memory, we will prefer unified memory:

julia> cu([1]; host=true) .+ 1
1-element CuArray{Int64, 1, Mem.HostBuffer}:
 2

julia> cu([1]; host=true) .+ cu([2]; device=true)
1-element CuArray{Int64, 1, Mem.UnifiedBuffer}:
 3

Software updates

Finally, we also did routine updates of the software stack, adding support for the latest and greatest from NVIDIA. This includes support for CUDA 12.4 (Update 1), cuDNN 9, and cuTENSOR 2.0. This latest release of cuTENSOR is noteworthy as it revamps the API in a backwards-incompatible way, and CUDA.jl has opted to follow this change. For more details, refer to the cuTENSOR 2 migration guide by NVIDIA.

Of course, cuTENSOR.jl also provides a high-level Julia API which has been mostly unaffected by these changes:

using CUDA
A = CUDA.rand(7, 8, 3, 2)
B = CUDA.rand(3, 2, 2, 8)
C = CUDA.rand(3, 3, 7, 2)

using cuTENSOR
tA = CuTensor(A, ['a', 'f', 'b', 'e'])
tB = CuTensor(B, ['c', 'e', 'd', 'f'])
tC = CuTensor(C, ['b', 'c', 'a', 'd'])

using LinearAlgebra
mul!(tC, tA, tB)
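
To make the index labels concrete, here is a CPU reference for the same contraction in plain Julia. It is a sketch that assumes mul! follows the usual Einstein-summation convention: labels missing from tC (here e and f) are summed over, and tC is overwritten. The function name contract_reference is hypothetical.

```julia
# CPU reference, assuming Einstein-summation semantics:
#   C[b, c, a, d] = sum over e and f of A[a, f, b, e] * B[c, e, d, f]
function contract_reference(A::Array{T,4}, B::Array{T,4}) where {T}
    na, nf, nb, ne = size(A)          # labels a, f, b, e
    nc, ne2, nd, nf2 = size(B)        # labels c, e, d, f
    (ne == ne2 && nf == nf2) ||
        throw(DimensionMismatch("extents of contracted labels e, f differ"))
    C = zeros(T, nb, nc, na, nd)      # labels b, c, a, d
    for d in 1:nd, a in 1:na, c in 1:nc, b in 1:nb
        acc = zero(T)
        for f in 1:nf, e in 1:ne
            acc += A[a, f, b, e] * B[c, e, d, f]
        end
        C[b, c, a, d] = acc
    end
    return C
end

A = rand(Float32, 7, 8, 3, 2)
B = rand(Float32, 3, 2, 2, 8)
C = contract_reference(A, B)
size(C)
```

The result has size (3, 3, 7, 2), matching the C array allocated in the GPU example.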

This API is still quite underdeveloped, so if you are a user of cuTENSOR.jl and have to adapt to the new API, now is a good time to consider improving the high-level interface instead!

Future releases

The next release of CUDA.jl is gearing up to be a much larger release, with significant changes to both the API and internals of the package. Although the intent is to keep these changes non-breaking, it is always possible that some code will be affected in unexpected ways, so we encourage users to test the upcoming release by simply running ] add CUDA#master and reporting any issues.

Onboarding DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/04/19/starting.html

Introduction

Working with data frames is one of the basic needs of any data scientist.
In the Julia ecosystem DataFrames.jl is a package providing support
for these operations. It was designed to be efficient and flexible.

Sometimes, however, novice users can be overwhelmed by the syntax due to its flexibility.
Therefore data scientists often find it useful to use
packages that make it easier to do transformations of data frames.

Interestingly, these packages use metaprogramming, which might sound
scary to novices, while in reality it is the opposite. Metaprogramming
is used to make them easier to use.

Today I want to do a quick review of the main
metaprogramming packages that are available in the ecosystem.
I will not go into the detailed functionality and syntax of the packages, but rather just
present them briefly and give my personal (opinionated) view of their status.

This post is written under Julia 1.10.1, DataFrames.jl 1.6.1, Chain.jl 0.5.0, DataFramesMeta.jl 0.15.2,
DataFrameMacros.jl 0.4.1, and TidierData.jl 0.15.1.

A basic example

Let us start with a basic example of DataFrames.jl syntax, which we will later rewrite using metaprogramming:

julia> using Statistics

julia> using DataFrames

julia> df = DataFrame(id=[1, 2, 1, 2], v=1:4)
4×2 DataFrame
 Row │ id     v
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     1      3
   4 │     2      4

julia> transform(groupby(df, :id), :v => (x -> x .- mean(x)) => :v100)
4×3 DataFrame
 Row │ id     v      v100
     │ Int64  Int64  Float64
─────┼───────────────────────
   1 │     1      1     -1.0
   2 │     2      2     -1.0
   3 │     1      3      1.0
   4 │     2      4      1.0

The syntax looks complex and might be scary. Let us see if we can make it simpler.
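
As a reference point for the rewrites that follow, here is what the transform call computes, spelled out in plain Julia without DataFrames.jl. This is only an illustrative sketch: the vectors id, v, and v100 stand in for the data frame columns.

```julia
using Statistics

# Stand-ins for the columns of df
id = [1, 2, 1, 2]
v = [1, 2, 3, 4]

# For each group of equal id values, subtract the group mean from v
v100 = zeros(Float64, length(v))
for g in unique(id)
    rows = findall(==(g), id)              # row numbers belonging to group g
    v100[rows] = v[rows] .- mean(v[rows])  # demean within the group
end

v100
```

The resulting v100 matches the column shown in the data frame output above.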

Chain.jl

The first functionality we might want to use is to put the operations in a pipe. This is achieved with the Chain.jl package:

julia> using Chain

julia> @chain df begin
           groupby(:id)
           transform(:v => (x -> x .- mean(x)) => :v100)
       end
4×3 DataFrame
 Row │ id     v      v100
     │ Int64  Int64  Float64
─────┼───────────────────────
   1 │     1      1     -1.0
   2 │     2      2     -1.0
   3 │     1      3      1.0
   4 │     2      4      1.0

We have achieved the benefit of a better visual separation of operations. In my opinion Chain.jl can be considered
the most widely accepted approach to piping operations in Julia at the moment (there are alternatives in the ecosystem
but as far as I can tell they have a lower adoption level).

DataFramesMeta.jl

Still, the transform(:v => (x -> x .- mean(x)) => :v100) part looks verbose. Let us start by showing
how it can be made simpler using DataFramesMeta.jl:

julia> using DataFramesMeta

julia> @chain df begin
           groupby(:id)
           @transform(:v100 = :v .- mean(:v))
       end
4×3 DataFrame
 Row │ id     v      v100
     │ Int64  Int64  Float64
─────┼───────────────────────
   1 │     1      1     -1.0
   2 │     2      2     -1.0
   3 │     1      3      1.0
   4 │     2      4      1.0

In my opinion the code is now really easy to read.

Here is the status of DataFramesMeta.jl:

  • It is actively maintained.
  • Its syntax is close to DataFrames.jl.
  • It uses : to signal that some name is a column of a data frame.

DataFrameMacros.jl

DataFrameMacros.jl is another package that is closely tied to DataFrames.jl. Let us see how we can use it.
Note that you need to restart the Julia session before running the code as the macro names are overlapping with DataFramesMeta.jl:

julia> using DataFrameMacros

julia> @chain df begin
           groupby(:id)
           @transform(:v100 = @bycol :v .- mean(:v))
       end
4×3 DataFrame
 Row │ id     v      v100
     │ Int64  Int64  Float64
─────┼───────────────────────
   1 │     1      1     -1.0
   2 │     2      2     -1.0
   3 │     1      3      1.0
   4 │     2      4      1.0

Note the difference with the @bycol expression. It is needed because in DataFrameMacros.jl @transform by default vectorizes operations.
This is often more convenient for users, but sometimes (like in this case), one wants to suppress vectorization.
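
A plain-Julia sketch of the difference, assuming that the default row-wise evaluation applies the whole expression to one row at a time, so that mean only ever sees a single number:

```julia
using Statistics

v = [1, 3]   # the v values of the id == 1 group

# Row by row: each scalar minus the mean of that same scalar, so always zero.
by_row = map(x -> x - mean(x), v)

# Whole column at once: each value minus the mean of the group.
by_col = v .- mean(v)
```

Here by_row is all zeros, while by_col holds -1.0 and 1.0, which is the result we want for this group.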

What is the status of DataFrameMacros.jl?

  • It is maintained but less actively developed than DataFramesMeta.jl.
  • Its syntax is close to DataFrames.jl, but several macros, for user convenience, vectorize operations by default (as opposed to Base Julia).
  • It uses : to signal that some text is a column of a data frame.

TidierData.jl

Now let us see the TidierData.jl package that is designed to follow dplyr from R:

julia> using TidierData

julia> @chain df begin
           @group_by(id)
           @mutate(v100 = v - mean(v))
           @ungroup
       end
4×3 DataFrame
 Row │ id     v      v100
     │ Int64  Int64  Float64
─────┼───────────────────────
   1 │     1      1     -1.0
   2 │     1      3      1.0
   3 │     2      2     -1.0
   4 │     2      4      1.0

If you know dplyr you should be at home with this syntax.

What is the status of TidierData.jl?

  • It is actively maintained.
  • It tries to guess as much as possible; the package automatically decides which functions should be vectorized (in our example the - operator was vectorized but mean was not).
  • You do not need a : prefix in column names; the package uses scoping similar to R to resolve variable names.

As you can see, the R-style syntax is designed for maximum convenience, at the expense of control (a lot of “magic” happens behind the scenes;
admittedly most of the time this magic is what novice users would want).

Conclusions

Here is a recap of what we have discussed:

  • Meta-packages are here to make life easier for users. There is no need to be afraid of them.
  • For piping I recommend using Chain.jl.
  • Use plain DataFrames.jl if you are a die-hard Julia user and want all your code to be valid Julia syntax (I prefer it when writing production stuff).
  • Use DataFramesMeta.jl if you want an experience most consistent with Base Julia (this is my personal preference for interactive sessions, but it requires the most knowledge of Julia).
  • DataFrameMacros.jl is an in-between package: it adds some more convenience (e.g. vectorization by default), but does not push it to the extreme
    (it also has a super convenient {} notation which you might find useful; I decided to skip it to keep the post simple to follow).
  • TidierData.jl goes for maximum convenience. It follows R style and tries to guess what you most likely wanted to do. Users with dplyr experience should be able to start using it immediately.