Advanced reshaping in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/05/28/pivot.html

In DataFrames.jl currently you have two functions that you can use
for reshaping your data: stack and unstack. Their design goals are very
simple:

  • stack allows you to go from wide to long data format;
  • unstck works the other way around and takes data in long format producing
    a wide table.

In this post I want to focus on the unstack function. Starting from explaining
its basic usage and then covering three common more complex scenarios.

This post was written under Julia 1.6.1 and DataFrames.jl 1.1.1.

Preparing the data

Consider the following data
(and practice a bit basic data transformation skills):

julia> using DataFrames

julia> sales = DataFrame(year=repeat(2001:2003, inner=4),
                         quarter=repeat(1:4, outer=3),
                         north=1:12, south=21:32)
12×4 DataFrame
 Row │ year   quarter  north  south
     │ Int64  Int64    Int64  Int64
─────┼──────────────────────────────
   1 │  2001        1      1     21
   2 │  2001        2      2     22
   3 │  2001        3      3     23
   4 │  2001        4      4     24
   5 │  2002        1      5     25
   6 │  2002        2      6     26
   7 │  2002        3      7     27
   8 │  2002        4      8     28
   9 │  2003        1      9     29
  10 │  2003        2     10     30
  11 │  2003        3     11     31
  12 │  2003        4     12     32

julia> costs = select(sales, :year, :quarter, [:north, :south] .=> x -> x/2,
                      renamecols=false)
12×4 DataFrame
 Row │ year   quarter  north    south
     │ Int64  Int64    Float64  Float64
─────┼──────────────────────────────────
   1 │  2001        1      0.5     10.5
   2 │  2001        2      1.0     11.0
   3 │  2001        3      1.5     11.5
   4 │  2001        4      2.0     12.0
   5 │  2002        1      2.5     12.5
   6 │  2002        2      3.0     13.0
   7 │  2002        3      3.5     13.5
   8 │  2002        4      4.0     14.0
   9 │  2003        1      4.5     14.5
  10 │  2003        2      5.0     15.0
  11 │  2003        3      5.5     15.5
  12 │  2003        4      6.0     16.0

julia> long_sales = stack(sales, [:north, :south], [:year, :quarter],
                          variable_name=:region, value_name=:sales)
24×4 DataFrame
 Row │ year   quarter  region  sales
     │ Int64  Int64    String  Int64
─────┼───────────────────────────────
   1 │  2001        1  north       1
   2 │  2001        2  north       2
   3 │  2001        3  north       3
   4 │  2001        4  north       4
   5 │  2002        1  north       5
   6 │  2002        2  north       6
   7 │  2002        3  north       7
   8 │  2002        4  north       8
   9 │  2003        1  north       9
  10 │  2003        2  north      10
  11 │  2003        3  north      11
  12 │  2003        4  north      12
  13 │  2001        1  south      21
  14 │  2001        2  south      22
  15 │  2001        3  south      23
  16 │  2001        4  south      24
  17 │  2002        1  south      25
  18 │  2002        2  south      26
  19 │  2002        3  south      27
  20 │  2002        4  south      28
  21 │  2003        1  south      29
  22 │  2003        2  south      30
  23 │  2003        3  south      31
  24 │  2003        4  south      32

julia> long_costs = stack(costs, [:north, :south], [:year, :quarter],
                          variable_name=:region, value_name=:costs)
24×4 DataFrame
 Row │ year   quarter  region  costs
     │ Int64  Int64    String  Float64
─────┼─────────────────────────────────
   1 │  2001        1  north       0.5
   2 │  2001        2  north       1.0
   3 │  2001        3  north       1.5
   4 │  2001        4  north       2.0
   5 │  2002        1  north       2.5
   6 │  2002        2  north       3.0
   7 │  2002        3  north       3.5
   8 │  2002        4  north       4.0
   9 │  2003        1  north       4.5
  10 │  2003        2  north       5.0
  11 │  2003        3  north       5.5
  12 │  2003        4  north       6.0
  13 │  2001        1  south      10.5
  14 │  2001        2  south      11.0
  15 │  2001        3  south      11.5
  16 │  2001        4  south      12.0
  17 │  2002        1  south      12.5
  18 │  2002        2  south      13.0
  19 │  2002        3  south      13.5
  20 │  2002        4  south      14.0
  21 │  2003        1  south      14.5
  22 │  2003        2  south      15.0
  23 │  2003        3  south      15.5
  24 │  2003        4  south      16.0

julia> long = innerjoin(long_sales, long_costs, on=[:year, :quarter, :region])
24×5 DataFrame
 Row │ year   quarter  region  sales  costs
     │ Int64  Int64    String  Int64  Float64
─────┼────────────────────────────────────────
   1 │  2001        1  north       1      0.5
   2 │  2001        2  north       2      1.0
   3 │  2001        3  north       3      1.5
   4 │  2001        4  north       4      2.0
   5 │  2002        1  north       5      2.5
   6 │  2002        2  north       6      3.0
   7 │  2002        3  north       7      3.5
   8 │  2002        4  north       8      4.0
   9 │  2003        1  north       9      4.5
  10 │  2003        2  north      10      5.0
  11 │  2003        3  north      11      5.5
  12 │  2003        4  north      12      6.0
  13 │  2001        1  south      21     10.5
  14 │  2001        2  south      22     11.0
  15 │  2001        3  south      23     11.5
  16 │  2001        4  south      24     12.0
  17 │  2002        1  south      25     12.5
  18 │  2002        2  south      26     13.0
  19 │  2002        3  south      27     13.5
  20 │  2002        4  south      28     14.0
  21 │  2003        1  south      29     14.5
  22 │  2003        2  south      30     15.0
  23 │  2003        3  south      31     15.5
  24 │  2003        4  south      32     16.0

The basics of unstack

Assume we want to get the sales table back. We need to unstack our long
table putting :year and :quarter in rows and :region in columns, while
taking :sales as values:

julia> unstack(long, [:year, :quarter], :region, :sales)
12×4 DataFrame
 Row │ year   quarter  north   south
     │ Int64  Int64    Int64?  Int64?
─────┼────────────────────────────────
   1 │  2001        1       1      21
   2 │  2001        2       2      22
   3 │  2001        3       3      23
   4 │  2001        4       4      24
   5 │  2002        1       5      25
   6 │  2002        2       6      26
   7 │  2002        3       7      27
   8 │  2002        4       8      28
   9 │  2003        1       9      29
  10 │  2003        2      10      30
  11 │  2003        3      11      31
  12 │  2003        4      12      32

We also check that we have recovered what we wanted:

julia> unstack(long, [:year, :quarter], :region, :sales) == sales
true

However, now try to put only :year in rows. If we want to drop :quarter then
we get:

julia> unstack(long, :year, :region, :sales)
ERROR: ArgumentError: Duplicate entries in unstack at row 2 for key (2001,) and variable north. Pass allowduplicates=true to allow them.

julia> unstack(long, :year, :region, :sales, allowduplicates=true)
3×3 DataFrame
 Row │ year   north   south
     │ Int64  Int64?  Int64?
─────┼───────────────────────
   1 │  2001       4      24
   2 │  2002       8      28
   3 │  2003      12      32

Clearly even if we pass allowduplicates=true we do not get what we most likely
wanted. This leads us to the first case.

Pivot tables with unstack

Most likely we want to aggregate sales per year using the sum function. This
is a classic pivot table task. In DataFrames.jl currently one does it in two
steps: first aggregate, then reshape. Here is how you can do it (I am showing
two separate steps, but you could use e.g. Chain.jl to streamline
the processing):

julia> tmp1 = combine(groupby(long, [:year, :region]), :sales => sum => :sales)
6×3 DataFrame
 Row │ year   region  sales
     │ Int64  String  Int64
─────┼──────────────────────
   1 │  2001  north      10
   2 │  2001  south      90
   3 │  2002  north      26
   4 │  2002  south     106
   5 │  2003  north      42
   6 │  2003  south     122

julia> unstack(tmp1, :year, :region, :sales)
3×3 DataFrame
 Row │ year   north   south
     │ Int64  Int64?  Int64?
─────┼───────────────────────
   1 │  2001      10      90
   2 │  2002      26     106
   3 │  2003      42     122

Multiple variables put in columns

What if we wanted to put only :year in rows, but both :quarter and :region
in columns?

In this case we need to create a temporary column which we combine the
:quarter and :region. Here is a simple example:

julia> tmp2 = transform(long, [:region, :quarter] => ByRow(string) => :rq)
24×6 DataFrame
 Row │ year   quarter  region  sales  costs    rq
     │ Int64  Int64    String  Int64  Float64  String
─────┼────────────────────────────────────────────────
   1 │  2001        1  north       1      0.5  north1
   2 │  2001        2  north       2      1.0  north2
   3 │  2001        3  north       3      1.5  north3
   4 │  2001        4  north       4      2.0  north4
   5 │  2002        1  north       5      2.5  north1
   6 │  2002        2  north       6      3.0  north2
   7 │  2002        3  north       7      3.5  north3
   8 │  2002        4  north       8      4.0  north4
   9 │  2003        1  north       9      4.5  north1
  10 │  2003        2  north      10      5.0  north2
  11 │  2003        3  north      11      5.5  north3
  12 │  2003        4  north      12      6.0  north4
  13 │  2001        1  south      21     10.5  south1
  14 │  2001        2  south      22     11.0  south2
  15 │  2001        3  south      23     11.5  south3
  16 │  2001        4  south      24     12.0  south4
  17 │  2002        1  south      25     12.5  south1
  18 │  2002        2  south      26     13.0  south2
  19 │  2002        3  south      27     13.5  south3
  20 │  2002        4  south      28     14.0  south4
  21 │  2003        1  south      29     14.5  south1
  22 │  2003        2  south      30     15.0  south2
  23 │  2003        3  south      31     15.5  south3
  24 │  2003        4  south      32     16.0  south4

julia> unstack(tmp2, :year, :rq, :sales)
3×9 DataFrame
 Row │ year   north1  north2  north3  north4  south1  south2  south3  south4
     │ Int64  Int64?  Int64?  Int64?  Int64?  Int64?  Int64?  Int64?  Int64?
─────┼───────────────────────────────────────────────────────────────────────
   1 │  2001       1       2       3       4      21      22      23      24
   2 │  2002       5       6       7       8      25      26      27      28
   3 │  2003       9      10      11      12      29      30      31      32

Note that this additional step is only required for columns as for rows
unstack accepts multiple columns as shown above.

Multiple value variables

Now we get to my favorite Chekhov’s gun element of our story. Why do we
have a :costs column in our long table? The reason is that now we will
discuss how one can unstack a data frame on multiple value columns.

Here you have three options how you want to store the values:

  • use a nested field;
  • stack them vertically;
  • merge them horizontally.

Let me now discuss the three options. Nesting the field can be done e.g.
in the following way:

julia> tmp3 = transform(long, AsTable([:sales, :costs]) =>
                              ByRow(identity) =>
                              :indicators)
24×6 DataFrame
 Row │ year   quarter  region  sales  costs    indicators
     │ Int64  Int64    String  Int64  Float64  NamedTupl…
─────┼────────────────────────────────────────────────────────────────────
   1 │  2001        1  north       1      0.5  (sales = 1, costs = 0.5)
   2 │  2001        2  north       2      1.0  (sales = 2, costs = 1.0)
   3 │  2001        3  north       3      1.5  (sales = 3, costs = 1.5)
   4 │  2001        4  north       4      2.0  (sales = 4, costs = 2.0)
   5 │  2002        1  north       5      2.5  (sales = 5, costs = 2.5)
   6 │  2002        2  north       6      3.0  (sales = 6, costs = 3.0)
   7 │  2002        3  north       7      3.5  (sales = 7, costs = 3.5)
   8 │  2002        4  north       8      4.0  (sales = 8, costs = 4.0)
   9 │  2003        1  north       9      4.5  (sales = 9, costs = 4.5)
  10 │  2003        2  north      10      5.0  (sales = 10, costs = 5.0)
  11 │  2003        3  north      11      5.5  (sales = 11, costs = 5.5)
  12 │  2003        4  north      12      6.0  (sales = 12, costs = 6.0)
  13 │  2001        1  south      21     10.5  (sales = 21, costs = 10.5)
  14 │  2001        2  south      22     11.0  (sales = 22, costs = 11.0)
  15 │  2001        3  south      23     11.5  (sales = 23, costs = 11.5)
  16 │  2001        4  south      24     12.0  (sales = 24, costs = 12.0)
  17 │  2002        1  south      25     12.5  (sales = 25, costs = 12.5)
  18 │  2002        2  south      26     13.0  (sales = 26, costs = 13.0)
  19 │  2002        3  south      27     13.5  (sales = 27, costs = 13.5)
  20 │  2002        4  south      28     14.0  (sales = 28, costs = 14.0)
  21 │  2003        1  south      29     14.5  (sales = 29, costs = 14.5)
  22 │  2003        2  south      30     15.0  (sales = 30, costs = 15.0)
  23 │  2003        3  south      31     15.5  (sales = 31, costs = 15.5)
  24 │  2003        4  south      32     16.0  (sales = 32, costs = 16.0)

julia> unstack(tmp3, [:year, :quarter], :region, :indicators)
12×4 DataFrame
 Row │ year   quarter  north                      south
     │ Int64  Int64    NamedTup…?                 NamedTup…?
─────┼───────────────────────────────────────────────────────────────────────
   1 │  2001        1  (sales = 1, costs = 0.5)   (sales = 21, costs = 10.5)
   2 │  2001        2  (sales = 2, costs = 1.0)   (sales = 22, costs = 11.0)
   3 │  2001        3  (sales = 3, costs = 1.5)   (sales = 23, costs = 11.5)
   4 │  2001        4  (sales = 4, costs = 2.0)   (sales = 24, costs = 12.0)
   5 │  2002        1  (sales = 5, costs = 2.5)   (sales = 25, costs = 12.5)
   6 │  2002        2  (sales = 6, costs = 3.0)   (sales = 26, costs = 13.0)
   7 │  2002        3  (sales = 7, costs = 3.5)   (sales = 27, costs = 13.5)
   8 │  2002        4  (sales = 8, costs = 4.0)   (sales = 28, costs = 14.0)
   9 │  2003        1  (sales = 9, costs = 4.5)   (sales = 29, costs = 14.5)
  10 │  2003        2  (sales = 10, costs = 5.0)  (sales = 30, costs = 15.0)
  11 │  2003        3  (sales = 11, costs = 5.5)  (sales = 31, costs = 15.5)
  12 │  2003        4  (sales = 12, costs = 6.0)  (sales = 32, costs = 16.0)

The second option is vertical stacking:

julia> vcat(unstack(long, [:year, :quarter], :region, :sales),
            unstack(long, [:year, :quarter], :region, :costs),
            source=:indicator=>["sales", "costs"])
24×5 DataFrame
 Row │ year   quarter  north     south     indicator
     │ Int64  Int64    Float64?  Float64?  String
─────┼───────────────────────────────────────────────
   1 │  2001        1       1.0      21.0  sales
   2 │  2001        2       2.0      22.0  sales
   3 │  2001        3       3.0      23.0  sales
   4 │  2001        4       4.0      24.0  sales
   5 │  2002        1       5.0      25.0  sales
   6 │  2002        2       6.0      26.0  sales
   7 │  2002        3       7.0      27.0  sales
   8 │  2002        4       8.0      28.0  sales
   9 │  2003        1       9.0      29.0  sales
  10 │  2003        2      10.0      30.0  sales
  11 │  2003        3      11.0      31.0  sales
  12 │  2003        4      12.0      32.0  sales
  13 │  2001        1       0.5      10.5  costs
  14 │  2001        2       1.0      11.0  costs
  15 │  2001        3       1.5      11.5  costs
  16 │  2001        4       2.0      12.0  costs
  17 │  2002        1       2.5      12.5  costs
  18 │  2002        2       3.0      13.0  costs
  19 │  2002        3       3.5      13.5  costs
  20 │  2002        4       4.0      14.0  costs
  21 │  2003        1       4.5      14.5  costs
  22 │  2003        2       5.0      15.0  costs
  23 │  2003        3       5.5      15.5  costs
  24 │  2003        4       6.0      16.0  costs

or

julia> unstack(stack(long, [:sales, :costs], [:year, :quarter, :region],
                     variable_name=:indicator),
               [:year, :quarter, :indicator], :region, :value)
24×5 DataFrame
 Row │ year   quarter  indicator  north     south
     │ Int64  Int64    String     Float64?  Float64?
─────┼───────────────────────────────────────────────
   1 │  2001        1  sales           1.0      21.0
   2 │  2001        2  sales           2.0      22.0
   3 │  2001        3  sales           3.0      23.0
   4 │  2001        4  sales           4.0      24.0
   5 │  2002        1  sales           5.0      25.0
   6 │  2002        2  sales           6.0      26.0
   7 │  2002        3  sales           7.0      27.0
   8 │  2002        4  sales           8.0      28.0
   9 │  2003        1  sales           9.0      29.0
  10 │  2003        2  sales          10.0      30.0
  11 │  2003        3  sales          11.0      31.0
  12 │  2003        4  sales          12.0      32.0
  13 │  2001        1  costs           0.5      10.5
  14 │  2001        2  costs           1.0      11.0
  15 │  2001        3  costs           1.5      11.5
  16 │  2001        4  costs           2.0      12.0
  17 │  2002        1  costs           2.5      12.5
  18 │  2002        2  costs           3.0      13.0
  19 │  2002        3  costs           3.5      13.5
  20 │  2002        4  costs           4.0      14.0
  21 │  2003        1  costs           4.5      14.5
  22 │  2003        2  costs           5.0      15.0
  23 │  2003        3  costs           5.5      15.5
  24 │  2003        4  costs           6.0      16.0

Finally we might want to perform horizontal merging which can be done e.g. like
this:

julia> outerjoin(unstack(long, [:year, :quarter], :region, :sales),
                 unstack(long, [:year, :quarter], :region, :costs),
                 on=[:year, :quarter], renamecols="_sales" => "_costs")
12×6 DataFrame
 Row │ year   quarter  north_sales  south_sales  north_costs  south_costs
     │ Int64  Int64    Int64?       Int64?       Float64?     Float64?
─────┼────────────────────────────────────────────────────────────────────
   1 │  2001        1            1           21          0.5         10.5
   2 │  2001        2            2           22          1.0         11.0
   3 │  2001        3            3           23          1.5         11.5
   4 │  2001        4            4           24          2.0         12.0
   5 │  2002        1            5           25          2.5         12.5
   6 │  2002        2            6           26          3.0         13.0
   7 │  2002        3            7           27          3.5         13.5
   8 │  2002        4            8           28          4.0         14.0
   9 │  2003        1            9           29          4.5         14.5
  10 │  2003        2           10           30          5.0         15.0
  11 │  2003        3           11           31          5.5         15.5
  12 │  2003        4           12           32          6.0         16.0

Concluding remmarks

Today I have focused mostly on the unstack function, and only mentioned
stack in a few places.

However, it is also worth to know that there are two other functions that are
very often handy and easy to forget about. One is good old permutedims
(transposing a data frame) and the other is flatten(flattening nested
columns). If you want to widen your DataFrames.jl related arsenal of tricks I
recommend you to check out their documentation.