Swift as an Arrow.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2020/11/06/arrow.html

Introduction

The Julia language is renowned for being swift (yet it is clearly not Swift).

Recently its data reading and writing capabilities are ultra swift mainly
thanks to Jacob Quinn building on earlier efforts of ExpandingMan.

The game changer is Arrow.jl package which is not only fast, but also
it is an implementation of Apache Arrow format. This means that we have
a great format for in-memory and on-disk data exchange with C, C++, C#, Go, Java,
JavaScript, MATLAB, Python, R, Ruby, and Rust.

Recently a very nice blog post that presents how Arrow.jl is used
in practice was written by Jacob Zelko.

In this blog I want to do some performance benchmarking and give recommendations
for people working with DataFrames.jl.

The post is written under Linux, Julia 1.5.2 and the following package setup:

(@v1.5) pkg> status Arrow CSV DataFrames
Status `~/.julia/environments/v1.5/Project.toml`
  [69666777] Arrow v0.4.1
  [336ed68f] CSV v0.7.7
  [a93c6f00] DataFrames v0.21.8

Arrow.jl test drive

First we create some data frame we will work with:

julia> using Arrow, DataFrames

julia> df = DataFrame(["x$i" => i:10^7+i for i in 1:10])
10000001×10 DataFrame
│ Row      │ x1       │ x2       │ x3       │ x4       │ x5       │ x6       │ x7       │ x8       │ x9       │ x10      │
│          │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │
├──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│ 1        │ 1        │ 2        │ 3        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │
│ 2        │ 2        │ 3        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │
│ 3        │ 3        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │
│ 4        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │
│ 5        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │ 14       │
│ 6        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │ 14       │ 15       │
│ 7        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │ 14       │ 15       │ 16       │
⋮
│ 9999994  │ 9999994  │ 9999995  │ 9999996  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │
│ 9999995  │ 9999995  │ 9999996  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │
│ 9999996  │ 9999996  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │
│ 9999997  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │
│ 9999998  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │
│ 9999999  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │ 10000008 │
│ 10000000 │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │ 10000008 │ 10000009 │
│ 10000001 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │ 10000008 │ 10000009 │ 10000010 │

And benchmark the writing-to-disk speed (I run all timings twice to capture both
time of the first run, and time after things are compiled, as both are relevant
in practice):

julia> using Arrow, DataFrames

julia> @time Arrow.write("test.arrow", df)
  2.792842 seconds (7.61 M allocations: 386.438 MiB)
"test.arrow"

julia> @time Arrow.write("test.arrow", df)
  2.222653 seconds (436 allocations: 35.953 KiB)
"test.arrow"

julia> stat("test.arrow").size / 10^6
800.001826

The performance looks really good given the final file has 800 MB.
Also we see that compilation latency is low.

Now we read the file back:

julia> @time df2 = Arrow.Table("test.arrow") |> DataFrame;
  0.831660 seconds (2.44 M allocations: 123.515 MiB)

julia> @time df2 = Arrow.Table("test.arrow") |> DataFrame
  0.000320 seconds (649 allocations: 40.047 KiB)
10000001×10 DataFrame
│ Row      │ x1       │ x2       │ x3       │ x4       │ x5       │ x6       │ x7       │ x8       │ x9       │ x10      │
│          │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │
├──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│ 1        │ 1        │ 2        │ 3        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │
│ 2        │ 2        │ 3        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │
│ 3        │ 3        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │
│ 4        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │
│ 5        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │ 14       │
│ 6        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │ 14       │ 15       │
│ 7        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │ 14       │ 15       │ 16       │
⋮
│ 9999994  │ 9999994  │ 9999995  │ 9999996  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │
│ 9999995  │ 9999995  │ 9999996  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │
│ 9999996  │ 9999996  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │
│ 9999997  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │
│ 9999998  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │
│ 9999999  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │ 10000008 │
│ 10000000 │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │ 10000008 │ 10000009 │
│ 10000001 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │ 10000008 │ 10000009 │ 10000010 │

and here the magic happens. When Arrow.Table reads the file from the disk
it does memory mapping so reading it is almost instant (again, as when writing
the file the compilation latency is low, which is nice).

There is one cost of this speed, Arrow.jl uses its own custom vector type and it
is additionally read only when memory mapped, as we can see here:

julia> typeof(df2.x1)
Arrow.Primitive{Int64,Array{Int64,1}}

julia> df2.x1[1] = 100
ERROR: ReadOnlyMemoryError()

Fortunately, it is easily fixed by making a copy of a DataFrame:

julia> @time df3 = copy(df2);
  0.522758 seconds (287.94 k allocations: 777.914 MiB, 9.34% gc time)

julia> @time df3 = copy(df2);
  0.559576 seconds (44 allocations: 762.943 MiB, 36.20% gc time)

julia> typeof(df3.x1)
Array{Int64,1}

And it costs around 0.5 second in our case, which is not that bad I think.

If we want to avoid memory mapping, we can read a file from an IO like this:

julia> @time df2 = open("test.arrow") do io
           return Arrow.Table(io) |> DataFrame
       end;
  0.338193 seconds (42.84 k allocations: 765.156 MiB)

julia> @time df2 = open("test.arrow") do io
           return Arrow.Table(io) |> DataFrame
       end
  0.323562 seconds (8.50 k allocations: 763.373 MiB)
10000001×10 DataFrame
│ Row      │ x1       │ x2       │ x3       │ x4       │ x5       │ x6       │ x7       │ x8       │ x9       │ x10      │
│          │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │
├──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│ 1        │ 1        │ 2        │ 3        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │
│ 2        │ 2        │ 3        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │
│ 3        │ 3        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │
│ 4        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │
│ 5        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │ 14       │
│ 6        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │ 14       │ 15       │
│ 7        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │ 14       │ 15       │ 16       │
⋮
│ 9999994  │ 9999994  │ 9999995  │ 9999996  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │
│ 9999995  │ 9999995  │ 9999996  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │
│ 9999996  │ 9999996  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │
│ 9999997  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │
│ 9999998  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │
│ 9999999  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │ 10000008 │
│ 10000000 │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │ 10000008 │ 10000009 │
│ 10000001 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │ 10000008 │ 10000009 │ 10000010 │

julia> typeof(df2.x1)
Arrow.Primitive{Int64,Array{Int64,1}}

julia> df2.x1[1] = 100
100

julia> push!(df2, 1:10)
ERROR: MethodError: no method matching resize!(::Arrow.Primitive{Int64,Array{Int64,1}}, ::Int64)

As you can see this time the vectors are mutable, but not re-sizable. Having
mutability costs around 0.3 second of extra read time over memory mapping.

To get a relative feeling about these timings let us try CSV.jl
(we use a single thread):

julia> using CSV

julia> @time CSV.write("test.csv", df)
 14.644919 seconds (501.00 M allocations: 11.977 GiB, 8.92% gc time)
"test.csv"

julia> @time CSV.write("test.csv", df)
 15.076190 seconds (499.98 M allocations: 11.925 GiB, 8.93% gc time)
"test.csv"

julia> stat("test.csv").size / 10^6
788.889406

julia> @time df2 = CSV.read("test.csv");
  7.391509 seconds (6.01 M allocations: 2.909 GiB, 3.92% gc time)

julia> @time df2 = CSV.read("test.csv");
  3.292178 seconds (2.94 k allocations: 2.612 GiB, 6.21% gc time)

So we see that indeed both reading and writing is much faster (although the size
of the file on disk is comparable in both approaches).

Concluding remarks

I did not want to go into the details of Arrow.jl usage in this post,
but rather show a high level view how it works and present how to use it with
DataFrames.jl (depending on what level of mutability one wants).

Maybe as a final comment let me just highlight the three options you have to
create the Arrow.Table table object. You can use data from:

  • from a file on a disk, in which case you have an advantage of being able to use
    memory mapping;
  • from an IO (so this means you can ingest data from any external source);
  • from a Vector{UInt8} (so you can easily process data passed as a pointer,
    or e.g. byes read from a HTTP request).

I think it is really great as it covers virtually any use case one might
encounter in practice.