Re-posted from: http://alex.mellnik.net/why-im-still-a-nullable-luddite/
Much of my Julia work involves manipulating data in the form of DataFrames. DataFrames are pretty handy: it’s easy to get data into them from file formats like CSV and Feather, as well as all types of databases. I can split-apply-combine to my heart’s content with do blocks, and while only basic joins are possible using join, it’s easy to form more complex ones with just a few lines of code. It’s reasonably fast to work with them, but … it could be a whole lot faster. That’s because DataFrames, as we know them today, have columns of type DataArray, which help deal with missing values. A DataArray of some type T has elements which are either of type T or, if the element is missing, of type NAType. This type instability means that the Julia JIT can’t easily determine the type of the elements, which is a big performance killer. [1]
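To make the type instability concrete, here is a minimal sketch that does not use DataArrays at all. It assumes a values array plus a Boolean mask, uses `nothing` as a stand-in for NA, and the names `vals`, `missing_mask`, and `unstable_get` are hypothetical ones chosen for illustration.

```julia
# A values array plus a mask marking which entries are missing.
# This layout is an assumption made for illustration only.
vals = [10, 20, 30]
missing_mask = [false, true, false]

# Hypothetical accessor: returns an Int when the entry is present and
# `nothing` (a stand-in for NA) when it is missing.
unstable_get(v, mask, i) = mask[i] ? nothing : v[i]

a = unstable_get(vals, missing_mask, 1)   # present entry
b = unstable_get(vals, missing_mask, 2)   # missing entry

# The same call returns values of two different types.
println(typeof(a) == typeof(b))   # prints false
```

Because the return type depends on a runtime value (the mask entry), the compiler cannot settle on a single concrete type for the result, and any code consuming it has to handle both cases dynamically.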
This is all old news, and everyone agrees that we need to move toward a better solution: Nullables. Rather than encoding your possibly-missing values in a DataArray, you can make a normal Array with element type Nullable{T}. [2] This has the advantage that when you access elements of the Array you always get the same type, and the JIT is happy. [3]
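As a rough sketch of what this looks like in practice, the snippet below builds a plain Array of Nullable{Int} and reads it back. It assumes Julia 0.4 to 0.6, where Nullable, isnull, and get are part of Base; `sum_present` is a hypothetical helper name, not something from a package.

```julia
# Assumes Julia 0.4 to 0.6, where Nullable is part of Base.
# A plain Array whose elements are all Nullable{Int}; the second entry is missing.
xs = Nullable{Int}[Nullable(1), Nullable{Int}(), Nullable(3)]

# Hypothetical helper: sum the values that are present, skipping missing ones.
function sum_present(v::Vector{Nullable{Int}})
    total = 0
    for x in v
        # Every x is a Nullable{Int}, whether or not it holds a value.
        if !isnull(x)
            total += get(x)
        end
    end
    return total
end

println(eltype(xs) == Nullable{Int})   # prints true
println(sum_present(xs))               # prints 4
```

Each element has the same concrete type regardless of whether a value is present, which is what lets the loop body compile to a single specialized path.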
Sounds grand, right? This is the reason that many packages that work with DataFrames have started moving from DataArrays to normal arrays of Nullables. Unfortunately, it turns out that Nullables are a pain to work with, and because of this I generally take the performance hit and convert everything back to DataArrays. How much of a pain are they to work with? Here’s a list of some fairly reasonable things that you can’t do with Nullables (on 0.4.5):
# Needs to be get(Nullable(1)) + get(Nullable(2))
Nullable(1) + Nullable(2)

# You can't add arrays of Nullables to containers.
df = DataFrame()
df[:col] = NullableArray([1,2])

# This returns false, although it may be fixed soon.
Nullable(0, true) !== Nullable{Int}()
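One way to work around the missing arithmetic is to write the unwrapping by hand. The following is only a sketch, assuming Julia 0.4 to 0.6 (Nullable in Base, with the older `f{T}(...)` method syntax); `lifted_add` is a hypothetical name, and treating a missing operand as a missing result is a design choice rather than anything Base prescribes.

```julia
# Assumes Julia 0.4 to 0.6, where Nullable is part of Base.
# Hypothetical helper: add two Nullables, propagating missingness.
function lifted_add{T}(x::Nullable{T}, y::Nullable{T})
    # If either operand is missing, the result is missing too.
    if isnull(x) || isnull(y)
        return Nullable{T}()
    end
    # Both values are present, so unwrap, add, and wrap again.
    return Nullable(get(x) + get(y))
end

s = lifted_add(Nullable(1), Nullable(2))
println(get(s))                                            # prints 3
println(isnull(lifted_add(Nullable(1), Nullable{Int}())))  # prints true
```

Every operator you need has to be wrapped this way, which is a fair illustration of why the raw type feels laborious.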
Most importantly, Nullable{T} is a vastly different creature than T. You can’t take it bowling and buy it a beer: any time you want to hand it off to pretty much any package you need to convert it back to the original type to have it function as expected.
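The conversion step might look something like the sketch below, which turns an Array of Nullables back into a plain Array before handing it to other code. It assumes Julia 0.4 to 0.6 (Nullable in Base); `unwrap_all` is a hypothetical helper, and substituting a default for missing entries is an assumption that will not suit every dataset.

```julia
# Assumes Julia 0.4 to 0.6, where Nullable is part of Base.
# Hypothetical helper: replace each missing entry with `default` and
# return a plain Vector{T} that ordinary functions accept.
unwrap_all{T}(v::Vector{Nullable{T}}, default::T) = T[get(x, default) for x in v]

xs = Nullable{Float64}[Nullable(1.5), Nullable{Float64}(), Nullable(2.5)]
plain = unwrap_all(xs, 0.0)

# `plain` is an ordinary Vector{Float64}, so functions like sum work as usual.
println(sum(plain))   # prints 4.0
```

Note that filling with a default silently changes downstream statistics, so dropping the missing entries instead may be the better choice depending on what the receiving package expects.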
Because of these issues, I’m planning on sticking with DataArrays until the Nullable ecosystem is in a more mature state. I have high hopes that this will coincide with the release of 0.5.0, but others are not as optimistic. We’ll see! Despite these growing pains, it’s a very exciting time to work in Julia.
[1] John Myles White has a nice post about this here.
[2] There’s also NullableArrays.jl which cleans things up a bit, but most of the issues I discuss below hold regardless.
[3] There are still some lingering performance issues related to the DataFrame container.