Triangle frenzy

By: Can Candan's Blog

Re-posted from: https://cancandan.github.io/julia/graphics/cuda/2022/05/07/triangles.html

Suppose we want to draw a batch of images, where each image is made up of randomly positioned and colored triangles, blending, so it will look like this:

triangles

and then find the euclidean distance of each such image to a given target image. Why on earth am I doing this? Well this turns into an interesting optimization problem of finding the closest triangle image and also an excuse to practise Julia. The inspiration is from this repo.

Towards framing this as an optimization problem, we will represent a triangle as a vector of size 10, made up of floating point numbers between 0 and 1. Four numbers of this vector are for the color of the triangle; r,g,b and alpha, and for the three vertices of the triangle we need 6 numbers, (x1,y1), (x2,y2), (x3,y3). Hence, if we want to draw M images, each image having N triangles, we need a matrix of size (10,N,M), which will be our parameters matrix. I want to randomly create such a matrix and min-max scale it along the triangle dimension, by which I mean, for each image, I first find the minimum and maximum of a triangle parameter among the N triangles, and then subtract from the parameter this minimum and then divide the result by the difference between the maximum and the minimum. I want to end up with an array of size (3,w,h,M) for the images, where w is width and h is height, and an array of size M for the distances. Let’s see how fast we can do this.

First order of business is setting this up, note that I am scaling the numbers to the given width and height:

using Images
using Statistics
using Cairo
using BenchmarkTools
using Random: seed!

function prepare()
    seed!(123)
    w,h=128,128
    num_params=10
    num_triangles=50
    num_images=256
    target = rand(Float32, 3, h, w)
    prms = rand(Float32, num_params, num_triangles, num_images);
    prms .= (prms .- minimum(prms, dims=2)) ./ (maximum(prms, dims=2) .- minimum(prms, dims=2))   
    prms[collect(1:2:6),:,:] .*= w
    prms[collect(2:2:6),:,:] .*= h         
    return prms, target, num_images, num_triangles, w, h
end

The first thing that comes to mind is to use a 2d graphics library for drawing, and since the Cairo lib is available, let’s try that, the function below is drawing the triangles on a blank white Cairo canvas, and copying it to the img array at the end:

@views function renderCairo(img, prms, num_triangles, w, h)                  
    # blank white canvas
    buffer = ones(UInt32, w, h) * typemax(UInt32)    
    c = Cairo.CairoImageSurface(buffer, Cairo.FORMAT_ARGB32, flipxy=false)
    cr = CairoContext(c);        
    
    for i in 1:num_triangles
        q = prms[:,i]
        set_source_rgba(cr,q[7], q[8], q[9], q[10])        
        move_to(cr, q[1],q[2]);
        line_to(cr, q[3],q[4]);
        line_to(cr, q[5],q[6]);
        close_path(cr);
        fill(cr);            
    end        
            
    resultimg = reinterpret(ARGB32, permutedims(c.data, (2, 1)))
    resultchn = Float32.(channelview(RGB.(resultimg)))                
    img .= resultchn
    Cairo.finish(c)
    Cairo.destroy(c)        
end

Now let’s draw each image in this fashion:

@views function withCairo() 
    prms, target, num_images, num_triangles, w, h = prepare() 

    imgs = Array{Float32}(undef, 3, h, w, num_images)    
    for i in 1:num_images
        img = imgs[:,:,:,i]        
        renderCairo(img, prms[:,:,i], num_triangles, w, h)        
    end
    dists = reshape(mean((imgs .- target) .^2, dims=(1,2,3)), num_images)
    return imgs, 1 .- dists
end

Benchmarking this with @btime withCairo();

I see 428.101 ms (3157 allocations: 205.09 MiB).

The cross product method

Now something really cool. Move your mouse inside and outside of the triangle below, you will see a bar chart depicting the magnitude and direction of the 3rd components of the cross products of vectors AB with AP (reds), BC with BP (greens) and CA with CP (blues). Observe that all those bars point same direction only inside the triangle!

Whats great about this is that cross products are just multiplications and subtractions, perfect job to parallelize with a GPU.
What needs to be done is clear, for each of the M images, for each of the N triangles, our operation is to update a pixel color to blend with the current triangles color if that pixel is inside the triangle. We will parallelize this operation with a CUDA kernel, shown below:

function puttri(prms, imgs, tri, ins)    
    idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x  
    idy = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    
    abx = prms[3,tri,ins] - prms[1,tri,ins]
    aby = prms[4,tri,ins] - prms[2,tri,ins]
    apx = idx - prms[1,tri,ins]
    apy = idy - prms[2,tri,ins]
    cr1 = apx * aby - apy * abx

    bcx = prms[5,tri,ins] - prms[3,tri,ins]
    bcy = prms[6,tri,ins] - prms[4,tri,ins]
    bpx = idx - prms[3,tri,ins]
    bpy = idy - prms[4,tri,ins]
    cr2 = bpx * bcy - bpy * bcx

    cax = prms[1,tri,ins] - prms[5,tri,ins]
    cay = prms[2,tri,ins] - prms[6,tri,ins]
    cpx = idx - prms[5,tri,ins]
    cpy = idy - prms[6,tri,ins]
    cr3 = cpx * cay - cpy * cax

    if ((cr1>=0) & (cr2>=0) & (cr3>=0)) | ((cr1<=0) & (cr2<=0) & (cr3<=0))
        oneMinusAlpha = (1.0f0-prms[10,tri,ins])        
        imgs[1,idx,idy,ins] = imgs[1,idx,idy,ins] * oneMinusAlpha + prms[7,tri,ins] * prms[10,tri,ins]
        imgs[2,idx,idy,ins] = imgs[2,idx,idy,ins] * oneMinusAlpha + prms[8,tri,ins] * prms[10,tri,ins]
        imgs[3,idx,idy,ins] = imgs[3,idx,idy,ins] * oneMinusAlpha + prms[9,tri,ins] * prms[10,tri,ins]
    end
    return
end

We will need to pass our parameters and target array to the GPU, and then call the kernel with @cuda. We can create white canvases with CUDA.ones here, so no need to pass it.

function withGpu()
    prms, target, num_images, num_triangles, w, h = prepare()     

    prms = CuArray(prms)    
    imgs = CUDA.ones(3, h, w, num_images)
    target = CuArray(target)
    for tri in 1:num_triangles
        for i in 1:num_images
            @cuda threads=(32,32) blocks=(8,8) puttri(prms, imgs, tri, i)        
        end                                
    end
    gpufitnesses = 1.0f0 .- reshape(mean((imgs .- target) .^ 2, dims=(1,2,3)),num_images)
    return Array(imgs), Array(gpufitnesses)
end

Benchmarking this I see 120.315 ms (38689 allocations: 52.53 MiB)

Awesome, that’s about 4x speedup. Note that I benchmarked with --check-bounds=no, which is a startup option that you pass when launching julia that disables the bounds checking. In the next post, I will talk about the very cool PGPE algorithm used in evojax to steer these images towards a target image. You can see one example of this here.

I am pretty new to Julia, please let me know if you have any comments, suggestions.