Re-posted from: https://bkamins.github.io/julialang/2021/07/02/dataframes-1.2.0.html
Introduction
DataFrames.jl version 1.2.0 has just been released. In this post I want
to discuss the main new user visible features we have introduced.
The codes were run under Julia 1.6.1 and DataFrames.jl 1.2.0.
New functionalities
There are three major new functionalities introduced by the 1.2.0 release. Let
me explain them one by one.
matchmissing=:notequal
keyword argument in joins
Before 1.2.0 release missing
values in on
-columns in joins either were
considered to be equal (when matchmissing=:equal
was passed) or produced an
error (when matchmissing=:error
, this is a default behavior). Now you can
also pass matchmissing=:notequal
in which case missing
values are considered
as not matching. Here is a simple example comparing the three options:
julia> using DataFrames
julia> df1 = DataFrame(id=[1, missing, 3], left=1:3)
3×2 DataFrame
Row │ id left
│ Int64? Int64
─────┼────────────────
1 │ 1 1
2 │ missing 2
3 │ 3 3
julia> df2 = DataFrame(id=[1, missing, missing], right=1:3)
3×2 DataFrame
Row │ id right
│ Int64? Int64
─────┼────────────────
1 │ 1 1
2 │ missing 2
3 │ missing 3
julia> innerjoin(df1, df2, on=:id)
ERROR: ArgumentError: missing values in key columns are not allowed when matchmissing == :error
julia> innerjoin(df1, df2, on=:id, matchmissing=:equal)
3×3 DataFrame
Row │ id left right
│ Int64? Int64 Int64
─────┼───────────────────────
1 │ 1 1 1
2 │ missing 2 2
3 │ missing 2 3
julia> innerjoin(df1, df2, on=:id, matchmissing=:notequal)
1×3 DataFrame
Row │ id left right
│ Int64? Int64 Int64
─────┼──────────────────────
1 │ 1 1 1
A new syntax for column expansion in transformation functions
Users often store nested data structures in columns of a data frame.
In such cases, a frequent request is to unnest such a column.
Before 1.2.0 release one had to perform this operation like this:
julia> df = DataFrame(col=[Dict("a"=>1, "b"=>2), Dict("a"=>3, "b"=>4)])
2×1 DataFrame
Row │ col
│ Dict…
─────┼──────────────────────
1 │ Dict("b"=>2, "a"=>1)
2 │ Dict("b"=>4, "a"=>3)
julia> transform(df, :col => identity => AsTable)
2×3 DataFrame
Row │ col b a
│ Dict… Int64 Int64
─────┼────────────────────────────────────
1 │ Dict("b"=>2, "a"=>1) 2 1
2 │ Dict("b"=>4, "a"=>3) 4 3
Now, a simpler syntax is allowed, that does not require the user to write
identity
part of the transformation specification (just like in column
renaming syntax), so the following code works
julia> transform(df, :col => AsTable)
2×3 DataFrame
Row │ col b a
│ Dict… Int64 Int64
─────┼────────────────────────────────────
1 │ Dict("b"=>2, "a"=>1) 2 1
2 │ Dict("b"=>4, "a"=>3) 4 3
and produces the same result.
subset!
now correctly updates passed GroupedDataFrame
The subset!
function was a new addition in 1.0.0 release. Therefore,
given the user feedback, we are adding some polishing touches to it.
Before 1.2.0 passing a GroupedDataFrame
to subset!
produced a correct
result, but could potentially corrupt the passed GroupedDataFrame
(a proper
information about this was given in the documentation; such a design
was chosen to improve performance). However, such a behavior was found to be
error prone. Therefore in 1.2.0 release an efficient algorithm updating not
only the parent data frame but also GroupedDataFrame
itself was implemented.
Here is an example of the current behavior:
julia> using Statistics
julia> df = DataFrame(id=repeat([1, 2], 4), x=1:8)
8×2 DataFrame
Row │ id x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 1 3
4 │ 2 4
5 │ 1 5
6 │ 2 6
7 │ 1 7
8 │ 2 8
julia> gd = groupby(df, :id)
GroupedDataFrame with 2 groups based on key: id
First Group (4 rows): id = 1
Row │ id x
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 1 3
3 │ 1 5
4 │ 1 7
⋮
Last Group (4 rows): id = 2
Row │ id x
│ Int64 Int64
─────┼──────────────
1 │ 2 2
2 │ 2 4
3 │ 2 6
4 │ 2 8
julia> subset!(gd, :x => x -> x .> mean(x)) # pick rows with :x above group mean
4×2 DataFrame
Row │ id x
│ Int64 Int64
─────┼──────────────
1 │ 1 5
2 │ 2 6
3 │ 1 7
4 │ 2 8
julia> gd
GroupedDataFrame with 2 groups based on key: id
First Group (2 rows): id = 1
Row │ id x
│ Int64 Int64
─────┼──────────────
1 │ 1 5
2 │ 1 7
⋮
Last Group (2 rows): id = 2
Row │ id x
│ Int64 Int64
─────┼──────────────
1 │ 2 6
2 │ 2 8
In the above operation both df
and gd
get properly updated in-place
(previously only df
was changed, but gd
was left unchanged, and thus
it was corrupted).
Deprecated functionality
In the beginning of development of DataFrames.jl the design of DataFrame
was very close to a matrix. Over the years the consensus was reached that
we should rather treat it as Tables.jl table. However, the legacy thinking
was still reflected in the design of hcat
function, which allowed horizontal
concatenation of a data frame with a vector, just like it is allowed for
matrices. Unfortunately this approach conflicts with the fact that currently
many vectors are supporting Tables.jl table interface and when doing hcat
users would prefer them to be treated as such.
I think the issue is easiest explained with an example. The Julia session
shown below was started with --depwarn=yes
flag:
julia> using DataFrames
julia> df = DataFrame(col1='a':'c')
3×1 DataFrame
Row │ col1
│ Char
─────┼──────
1 │ a
2 │ b
3 │ c
julia> hcat(df, [(x=i, y=10+i) for i in 1:3])
┌ Warning: horizontal concatenation of data frame with a vector is deprecated. Pass DataFrame(x1=x) instead.
│ caller = ip:0x0
└ @ Core :-1
3×2 DataFrame
Row │ col1 x1
│ Char NamedTup…
─────┼───────────────────────
1 │ a (x = 1, y = 11)
2 │ b (x = 2, y = 12)
3 │ c (x = 3, y = 13)
As you can see a vector of NamedTuples
, although it is a Tables.jl table,
is just horizontally concatenated to df
as a new column with auto generated
name :x1
.
However, most likely the user expected the following result (but without having
to use DataFrame
constructor):
julia> hcat(df, DataFrame([(x=i, y=10+i) for i in 1:3]))
3×3 DataFrame
Row │ col1 x y
│ Char Int64 Int64
─────┼────────────────────
1 │ a 1 11
2 │ b 2 12
3 │ c 3 13
In order to allow this behavior in the future, as you see above, passing a
vector to hcat
when the other argument is a data frame is currently deprecated.
Conclusions
I hope you will enjoy the new features we have shipped in the 1.2.0 release of
the DataFrames.jl.
Apart from the changes discussed above several minor ones, mostly in the areas
of performance, display, and documentation have been made. You can find a more
detailed list of things changed in the 1.2.0 release notes.
Also remember that the NEWS.md file in the project repository is maintained
to give synthetic information of the most important changes introduced in
the releases of DataFrames.jl.