Re-posted from: http://randyzwitch.com/cudanative-jl-julia/
Hacker News discussion: link
Code as Julia Jupyter Notebook
Julia has the reputation as a “fast” language in that it’s possible to write high-performing code. However, what I appreciate most about Julia is not just that the code is fast, but rather that Julia makes high-performance concepts accessible without having to have a deep computer science or compiled language background (neither of which I possess!)
For version 0.6 of Julia, another milestone has been reached in the “accessible” high-performance category: the ability to run Julia code natively on NVIDIA GPUs through the CUDAnative.jl package. While CUDAnative.jl is still very much in its development stages, the package is already far-enough along that within a few hours, as a complete beginner to GPU programming, I was able to see in excess of 20x speedups for my toy example to calculate haversine distance.
Getting Started
The CUDAnative.jl introduction blog post and documentation cover the installation process in-depth, so I won’t repeat the details here. I’m already a regular compile-from-source Julia user and I found the installation process pretty easy on my CUDA-enabled Ubuntu workstation. If you can already do TensorFlow, Keras or other GPU tutorials on your computer, getting CUDAnative.jl to work shouldn’t take more than 10-15 minutes.
Julia CPU Implementation
To get a feel for what sort of speedup I could expect from using a GPU, I wrote a naive implementation of a distance matrix calculation in Julia:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
#https://github.com/quinnj/Rosetta-Julia/blob/master/src/Haversine.jl
haversine(lat1::Float32,lon1::Float32,lat2::Float32,lon2::Float32) = 2 * 6372.8 * asin(sqrt(sind((lat2-lat1)/2)^2 + cosd(lat1) * cosd(lat2) * sind((lon2 - lon1)/2)^2))
function pairwise_dist(lat::Vector{Float32}, lon::Vector{Float32})
#Pre-allocate, since size is known
n = length(lat)
result = Array{Float32}(n, n)
#Brute force fill in each cell, ignore that distance [i,j] = distance [j,i]
for i in 1:n
for j in 1:n
@inbounds result[i, j] = haversine(lat[i], lon[i], lat[j], lon[j])
end
end
return result
end
#Example benchmark call
lat10000 = rand(Float32, 10000) .* 45
lon10000 = rand(Float32, 10000) .* -120
@time native_julia_cellwise = pairwise_dist(lat10000, lon10000)
The above code takes a pair of lat/lon values, then calculates the haversine distance between the two points. This algorithm is naive in that a distance matrix is symmetric (i.e. the distance between A to B is the same from B to A), so I could’ve done half the work by setting result[i,j]
and result[j,i]
to the same value, but as a measure of work for a benchmark this toy example is fine. Also note that this implementation runs on a single core, no CPU-core-level parallelization has been implemented.
Or to put all that another way: if someone wanted to tackle this problem without thinking very hard, the implementation might look like this.
CUDAnative.jl Implementation
There are two parts to the CUDAnative.jl implementation: the kernel (i.e. the actual calculation) and the boilerplate code for coordinating the writing to/from the CPU and GPU.
Kernel Code
The kernel code has similarities to the CPU implementation, with a few key differences:
- Method signature is one lat/lon point vs. the lat/lon vectors, rather than a pairwise distance calculation
- Boilerplate code for thread index on the GPU (0-indexed vs. normal Julia 1-indexing)
- The trigonometric functions need to be prepended with
CUDAnative.
, to differentiate that the GPU functions aren’t the same as the functions from Base Julia - Rather than return an array as part of the function return, we use the
out
keyword argument to write directly to the GPU memory
1
2
3
4
5
6
7
8
9
10
11
12
13
14
using CUDAnative, CUDAdrv
#Calculate one point vs. all other points simultaneously
function kernel_haversine(latpoint::Float32, lonpoint::Float32, lat::AbstractVector{Float32}, lon::AbstractVector{Float32}, out::AbstractVector{Float32})
#Thread index
#Need to do the n-1 dance, since CUDA expects 0 and Julia does 1-indexing
i = (blockIdx().x-1) * blockDim().x + threadIdx().x
out[i] = 2 * 6372.8 * CUDAnative.asin(CUDAnative.sqrt(CUDAnative.sind((latpoint-lat[i])/2)^2 + CUDAnative.cosd(lat[i]) * CUDAnative.cosd(latpoint) * CUDAnative.sind((lonpoint - lon[i])/2)^2))
#Return nothing, since we're writing directly to the out array allocated on GPU
return nothing
end
Coordination Code
The coordination code is similar to what you might see in a main()
function in C or Java, where the kernel is applied to the input data. I am using the dev
keyword with the default value of CuDevice(0)
to indicate that the code should be run on the first (in my case, only) GPU device.
The remainder of the code has comments on its purpose, primarily:
- Transfer Julia CPU arrays to GPU arrays (
CuArray
) - Set number of threads/blocks
- Calculate distance between a point and all other points in the array, write back to CPU
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
#validated kernel_haversine/distmat returns same answer as CPU haversine method (not shown)
function distmat(lat::Vector{Float32}, lon::Vector{Float32}; dev::CuDevice=CuDevice(0))
#Create a context
ctx = CuContext(dev)
#Change to objects with CUDA context
n = length(lat)
d_lat = CuArray(lat)
d_lon = CuArray(lon)
d_out = CuArray(Vector{Float32}(n))
#Calculate number of calculations, threads, blocks
len = n
threads = min(len, 1024)
blocks = Int(ceil(len/threads))
#Julia side accumulation of results to relieve GPU memory pressure
accum = Array{Float32}(n, n)
# run and time the test
secs = CUDAdrv.@elapsed begin
for i in 1:n
@cuda (blocks, threads) kernel_haversine(lat[i], lon[i], d_lat, d_lon, d_out)
accum[:, i] = Vector{Float32}(d_out)
end
end
#Clean up context
destroy!(ctx)
#Return timing and bring results back to Julia
return (secs, accum)
end
#Example benchmark call
timing, result = distmat(lat10000, lon10000)
result ≈ native_julia_cellwise #validate results equivalent CPU and GPU
The code is written to process one row of the distance matrix at a time to minimize GPU memory usage. By writing out the results to the CPU after each loop iteration, I have n-1
extra CPU transfers, which is less performant than calculating all the distances first then transferring, but my consumer-grade GPU with 6GB of RAM would run out of GPU memory before completing the calculation otherwise.
Performance
The performance characteristics of the CPU and GPU calculations are below for various sizes of distance matrices. Having not done any GPU calculations before, I was surprised to see how much of a penalty there is writing back and forth to the GPU. As you can see from the navy-blue line, the execution time is fixed for matrices of size 1 to 1000, representing the fixed cost of moving the data from the CPU to the GPU.
Of course, once we get above 1000×1000 matrices, the GPU really starts to shine. Due to the log scale, it’s a bit hard to see the magnitude differences, but at 100000×100000 there is a 23x reduction in execution time (565.008s CPU vs. 24.32s GPU).
What I Learned
There are myriad things I learned from this project, but most important is that GPGPU processing can be accessible for people like myself without a CS background. Julia isn’t the first high-level language to provide CUDA functionality, but the fact that the code is so similar to native Julia makes GPU computing something I can include in my toolbox today.
Over time, I’m sure I’ll get better results as I learn more about CUDA, as CUDAnative.jl continues to smooth out the rough edges, etc. But the fact that as a beginner that I could achieve such large speedups in just an hour or two of coding and sparse CUDAnative.jl documentation bodes well for the future of GPU computing in Julia.