By: John Myles White
Re-posted from: http://www.johnmyleswhite.com/notebook/2013/12/22/the-relationship-between-vectorized-and-devectorized-code/
Introduction
Some people have come to believe that Julia’s vectorized code is unusably slow. To correct this misconception, I outline a naive benchmark below that suggests that Julia’s vectorized code is, in fact, noticeably faster than R’s vectorized code. When experienced Julia programmers suggest that newcomers should consider devectorizing code, we’re not trying to beat R’s speed — our vectorized code does that already. Instead, we’re trying to match C’s speed.
As the examples below indicate, a little bit of devectorization goes a long way towards this loftier goal. In the specific examples I show, I find that:
- Julia’s vectorized code is 2x faster than R’s vectorized code
- Julia’s devectorized code is 140x faster than R’s vectorized code
- Julia’s devectorized code is 1350x faster than R’s devectorized code
Examples of Vectorized and Devectorized Code in R
Let’s start by contrasting two pieces of R code: a vectorized and a devectorized implementation of a trivial snippet of code that does repeated vector addition.
First, we consider an example of idiomatic, vectorized R code:
vectorized <- function()
{
    a <- c(1, 1)
    b <- c(2, 2)
    x <- c(NaN, NaN)
    for (i in 1:1000000)
    {
        x <- a + b
    }
    return()
}
time <- function(N)
{
    timings <- rep(NA, N)
    for (itr in 1:N)
    {
        start <- Sys.time()
        vectorized()
        end <- Sys.time()
        timings[itr] <- end - start
    }
    return(timings)
}
mean(time(10))
This code takes, on average, 0.49 seconds per iteration to compute 1,000,000 vector additions.
Having considered the vectorized implementation, we can then consider an unidiomatic devectorized implementation of the same operation in R:
devectorized <- function()
{
    a <- c(1, 1)
    b <- c(2, 2)
    x <- c(NaN, NaN)
    for (i in 1:1000000)
    {
        for (index in 1:2)
        {
            x[index] <- a[index] + b[index]
        }
    }
    return()
}
time <- function(N)
{
    timings <- rep(NA, N)
    for (itr in 1:N)
    {
        start <- Sys.time()
        devectorized()
        end <- Sys.time()
        timings[itr] <- end - start
    }
    return(timings)
}
mean(time(10))
This takes, on average, 4.72 seconds per iteration to compute 1,000,000 vector additions.
Examples of Vectorized and Devectorized Code in Julia
Let’s now consider two Julia implementations of this same snippet of code. We’ll start with a vectorized implementation:
function vectorized()
    a = [1.0, 1.0]
    b = [2.0, 2.0]
    x = [NaN, NaN]
    for i in 1:1000000
        x = a + b
    end
    return
end
function time(N)
    # Array(Float64, N) preallocates an uninitialized length-N vector
    # (constructor syntax from the Julia 0.2 era, when this post was written)
    timings = Array(Float64, N)
    # Force compilation
    vectorized()
    for itr in 1:N
        timings[itr] = @elapsed vectorized()
    end
    return timings
end
mean(time(10))
This takes, on average, 0.236 seconds per iteration to compute 1,000,000 vector additions.
Next, let’s consider a devectorized implementation of this same snippet:
function devectorized()
    a = [1.0, 1.0]
    b = [2.0, 2.0]
    x = [NaN, NaN]
    for i in 1:1000000
        for index in 1:2
            x[index] = a[index] + b[index]
        end
    end
    return
end
function time(N)
    timings = Array(Float64, N)
    # Force compilation
    devectorized()
    for itr in 1:N
        timings[itr] = @elapsed devectorized()
    end
    return timings
end
mean(time(10))
This takes, on average, 0.0035 seconds per iteration to compute 1,000,000 vector additions.
Comparing Performance in R and Julia
We can summarize the results of the four examples above in a single table:
| Approach     | Language | Average Time (seconds) |
| ------------ | -------- | ---------------------- |
| Vectorized   | R        | 0.49                   |
| Devectorized | R        | 4.72                   |
| Vectorized   | Julia    | 0.24                   |
| Devectorized | Julia    | 0.0035                 |
All of these examples were timed on my 2.9 GHz Intel Core i7 MacBook Pro. The results are quite striking: Julia is uniformly faster than R. And a very small bit of devectorization produces huge performance improvements. Of course, it would be nice if Julia’s compiler could optimize vectorized code as well as it optimizes devectorized code. But doing so requires a substantial amount of work.
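To make that concrete, here is a sketch of the transformation an optimizing compiler would have to perform on the vectorized loop above: hoist the temporary array out of the loop and write results into it in place, exactly as the hand-devectorized version does. This sketch is mine, not part of the original benchmarks, and add! and vectorized_optimized are illustrative names.

# In-place addition: writes a[i] + b[i] into x's existing storage
function add!(x, a, b)
    for index in 1:length(x)
        x[index] = a[index] + b[index]
    end
    return x
end

function vectorized_optimized()
    a = [1.0, 1.0]
    b = [2.0, 2.0]
    x = [NaN, NaN]
    for i in 1:1000000
        # What `x = a + b` would need to become: no fresh array per
        # iteration, just writes into storage allocated once outside the loop
        add!(x, a, b)
    end
    return
end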
Why is Optimizing Vectorized Code Hard?
What makes automatic devectorization tricky to get right is that even minor variants of the snippet shown above call for profoundly different optimization strategies. Consider, for example, the following two snippets of code:
function vectorized2()
    a = [1.0, 1.0]
    b = [2.0, 2.0]
    # {} constructed an empty untyped (Any) array in the Julia of this post
    res = {}
    for i in 1:1000000
        x = [rand(), rand()]
        x += a + b
        push!(res, x)
    end
    return res
end
function time(N)
    timings = Array(Float64, N)
    # Force compilation
    vectorized2()
    for itr in 1:N
        timings[itr] = @elapsed vectorized2()
    end
    return timings
end
mean(time(10))
This first snippet takes 1.29 seconds on average.
function devectorized2()
    a = [1.0, 1.0]
    b = [2.0, 2.0]
    res = {}
    for i in 1:1000000
        x = [rand(), rand()]
        for dim in 1:2
            x[dim] += a[dim] + b[dim]
        end
        push!(res, x)
    end
    return res
end
function time(N)
    timings = Array(Float64, N)
    # Force compilation
    devectorized2()
    for itr in 1:N
        timings[itr] = @elapsed devectorized2()
    end
    return timings
end
mean(time(10))
This second snippet takes, on average, 0.27 seconds.
The gap between vectorized and devectorized code is much smaller here because this second set of snippets uses memory in a very different way than our original snippets did. In the first set of snippets, it was possible to avoid allocating any memory at all for storing changes to x. The devectorized code for the first set of snippets made it explicit to the compiler that no memory needed to be allocated; the vectorized code did not. Making that explicit produced a roughly 70x speedup. Telling the compiler what it can avoid spending time on goes a long way.
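One can check this allocation story directly. The sketch below is mine, not part of the original benchmarks; it assumes a Julia version that provides the @allocated macro, and vec_step and devec_step! are illustrative names.

vec_step(a, b) = a + b  # vectorized: allocates a fresh result array per call

function devec_step!(x, a, b)  # devectorized: reuses x's existing storage
    for index in 1:length(x)
        x[index] = a[index] + b[index]
    end
    return x
end

function check_allocations()
    a, b, x = [1.0, 1.0], [2.0, 2.0], [NaN, NaN]
    vec_step(a, b); devec_step!(x, a, b)      # force compilation first
    println(@allocated vec_step(a, b))        # nonzero: a new array each call
    println(@allocated devec_step!(x, a, b))  # zero: nothing allocated
end

check_allocations()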
In contrast, in the second set of snippets, a new chunk of memory has to be allocated for every x vector that gets created. As a result, even the devectorized variant of our second snippet cannot offer much of a performance boost over its vectorized analogue. The devectorized variant is slightly faster because it avoids allocating memory during the steps in which a and b are added to x, but this matters less when a lot of other work remains that devectorization cannot eliminate.
This reflects a more general point: the vectorized/devectorized contrast is correlated with, but not the cause of, the performance characteristics of code. What matters on modern computers is the efficient use of processor cycles and memory. In many real examples of vectorized code, it is memory management, rather than vectorization per se, that is the core causal factor behind performance.
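As an aside that supports this point: in Julia versions released well after this post (0.6 and later), dot-broadcasting fuses into a single in-place loop, so a vectorized-looking line can be allocation-free and recover devectorized performance. The function name below is illustrative, and this is my sketch rather than the original benchmark.

function vectorized_inplace()
    a = [1.0, 1.0]
    b = [2.0, 2.0]
    x = [NaN, NaN]
    for i in 1:1000000
        # Fused broadcast: computes a .+ b and stores the result in x's
        # existing memory, allocating nothing inside the loop
        x .= a .+ b
    end
    return
end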
The Reversed Role of Vectorization in R and Julia
Part of what makes it difficult to have a straightforward discussion about vectorization is that vectorization in R conflates issues that are logically unrelated. In R, vectorization is often done for both (a) readability and (b) performance. In Julia, vectorization is only used for readability; it is devectorization that offers superior performance.
This confuses some people who are not familiar with the internals of R, so it is worth spelling out how one actually improves the speed of R code. The process is quite simple: start with devectorized R code, replace it with vectorized R code, and finally reimplement that vectorized R code as devectorized C code. This last step is unfortunately invisible to many R users, who therefore think of vectorization per se as a mechanism for increasing performance. Vectorization per se does not make code faster. What makes vectorization in R effective is that it provides a mechanism for moving computations into C, where a hidden layer of devectorization can do its magic.
In other words, R is doing exactly what Julia is doing to get better performance. R's vectorized code is simply a thin wrapper around completely devectorized C code. If you don't believe me, go read the C code behind something like R's dist function, which calls helpers like the following:
static double R_euclidean(double *x, int nr, int nc, int i1, int i2)
{
    double dev, dist;
    int count, j;

    count = 0;
    dist = 0;
    for (j = 0; j < nc; j++) {
        if (both_non_NA(x[i1], x[i2])) {
            dev = (x[i1] - x[i2]);
            if (!ISNAN(dev)) {
                dist += dev * dev;
                count++;
            }
        }
        i1 += nr;
        i2 += nr;
    }
    if (count == 0) return NA_REAL;
    if (count != nc) dist /= ((double)count / nc);
    return sqrt(dist);
}
It is important to keep this sort of thing in mind: the term vectorization in R actually refers to a step in which you write devectorized code in C. Vectorization, per se, is a red herring when reasoning about performance.
To finish this last point, let’s summarize the performance hierarchy for R and Julia code in a simple table:
| Worst Case          | Typical Case          | Best Case               |
| ------------------- | --------------------- | ----------------------- |
|                     | Julia Vectorized Code | Julia Devectorized Code |
| R Devectorized Code | R Vectorized Code     | C Devectorized Code     |
It is the empty worst-case cell for Julia that makes it difficult to compare vectorization across the two languages: nothing in Julia is as slow as R's devectorized code. At the other end of the spectrum, the performance of Julia's devectorized code has no point of comparison in pure R; it is more comparable to the C code used to power R behind the scenes.
Conclusion
Julia aims to (and typically does) provide vectorized code that is as efficient as the vectorized code available in other high-level languages. What sets Julia apart is the possibility of writing, in pure Julia, high-performance code that uses CPU and memory resources as effectively as C does.
In particular, vectorization and devectorization stand in opposite relationships to performance in Julia and R. In R, devectorization makes code unusably slow: R code must be vectorized to perform at an acceptable level. In Julia, by contrast, programmers view vectorized code as a convenient prototype that can be selectively devectorized into production-performance code. Of course, we would like prototype code to run fast as well, but no popular language currently offers that. What Julia offers isn't a requirement to devectorize, but the possibility of doing it in Julia itself, rather than in C.