Re-posted from: https://bkamins.github.io/julialang/2021/05/28/pivot.html

In DataFrames.jl currently you have two functions that you can use
for reshaping your data: stack and unstack. Their design goals are very
simple:

stack allows you to go from wide to long data format;
unstck works the other way around and takes data in long format producing
a wide table.

In this post I want to focus on the unstack function. Starting from explaining
its basic usage and then covering three common more complex scenarios.

This post was written under Julia 1.6.1 and DataFrames.jl 1.1.1.

Preparing the data

Consider the following data
(and practice a bit basic data transformation skills):

julia> using DataFrames

julia> sales = DataFrame(year=repeat(2001:2003, inner=4),
                         quarter=repeat(1:4, outer=3),
                         north=1:12, south=21:32)
12×4 DataFrame
 Row │ year   quarter  north  south
     │ Int64  Int64    Int64  Int64
─────┼──────────────────────────────
│  2001        1      1     21
│  2001        2      2     22
│  2001        3      3     23
│  2001        4      4     24
│  2002        1      5     25
│  2002        2      6     26
│  2002        3      7     27
│  2002        4      8     28
│  2003        1      9     29
│  2003        2     10     30
│  2003        3     11     31
│  2003        4     12     32

julia> costs = select(sales, :year, :quarter, [:north, :south] .=> x -> x/2,
                      renamecols=false)
12×4 DataFrame
 Row │ year   quarter  north    south
     │ Int64  Int64    Float64  Float64
─────┼──────────────────────────────────
│  2001        1      0.5     10.5
│  2001        2      1.0     11.0
│  2001        3      1.5     11.5
│  2001        4      2.0     12.0
│  2002        1      2.5     12.5
│  2002        2      3.0     13.0
│  2002        3      3.5     13.5
│  2002        4      4.0     14.0
│  2003        1      4.5     14.5
│  2003        2      5.0     15.0
│  2003        3      5.5     15.5
│  2003        4      6.0     16.0

julia> long_sales = stack(sales, [:north, :south], [:year, :quarter],
                          variable_name=:region, value_name=:sales)
24×4 DataFrame
 Row │ year   quarter  region  sales
     │ Int64  Int64    String  Int64
─────┼───────────────────────────────
│  2001        1  north       1
│  2001        2  north       2
│  2001        3  north       3
│  2001        4  north       4
│  2002        1  north       5
│  2002        2  north       6
│  2002        3  north       7
│  2002        4  north       8
│  2003        1  north       9
│  2003        2  north      10
│  2003        3  north      11
│  2003        4  north      12
│  2001        1  south      21
│  2001        2  south      22
│  2001        3  south      23
│  2001        4  south      24
│  2002        1  south      25
│  2002        2  south      26
│  2002        3  south      27
│  2002        4  south      28
│  2003        1  south      29
│  2003        2  south      30
│  2003        3  south      31
│  2003        4  south      32

julia> long_costs = stack(costs, [:north, :south], [:year, :quarter],
                          variable_name=:region, value_name=:costs)
24×4 DataFrame
 Row │ year   quarter  region  costs
     │ Int64  Int64    String  Float64
─────┼─────────────────────────────────
│  2001        1  north       0.5
│  2001        2  north       1.0
│  2001        3  north       1.5
│  2001        4  north       2.0
│  2002        1  north       2.5
│  2002        2  north       3.0
│  2002        3  north       3.5
│  2002        4  north       4.0
│  2003        1  north       4.5
│  2003        2  north       5.0
│  2003        3  north       5.5
│  2003        4  north       6.0
│  2001        1  south      10.5
│  2001        2  south      11.0
│  2001        3  south      11.5
│  2001        4  south      12.0
│  2002        1  south      12.5
│  2002        2  south      13.0
│  2002        3  south      13.5
│  2002        4  south      14.0
│  2003        1  south      14.5
│  2003        2  south      15.0
│  2003        3  south      15.5
│  2003        4  south      16.0

julia> long = innerjoin(long_sales, long_costs, on=[:year, :quarter, :region])
24×5 DataFrame
 Row │ year   quarter  region  sales  costs
     │ Int64  Int64    String  Int64  Float64
─────┼────────────────────────────────────────
│  2001        1  north       1      0.5
│  2001        2  north       2      1.0
│  2001        3  north       3      1.5
│  2001        4  north       4      2.0
│  2002        1  north       5      2.5
│  2002        2  north       6      3.0
│  2002        3  north       7      3.5
│  2002        4  north       8      4.0
│  2003        1  north       9      4.5
│  2003        2  north      10      5.0
│  2003        3  north      11      5.5
│  2003        4  north      12      6.0
│  2001        1  south      21     10.5
│  2001        2  south      22     11.0
│  2001        3  south      23     11.5
│  2001        4  south      24     12.0
│  2002        1  south      25     12.5
│  2002        2  south      26     13.0
│  2002        3  south      27     13.5
│  2002        4  south      28     14.0
│  2003        1  south      29     14.5
│  2003        2  south      30     15.0
│  2003        3  south      31     15.5
│  2003        4  south      32     16.0

The basics of `unstack`

Assume we want to get the sales table back. We need to unstack our long
table putting :year and :quarter in rows and :region in columns, while
taking :sales as values:

julia> unstack(long, [:year, :quarter], :region, :sales)
12×4 DataFrame
 Row │ year   quarter  north   south
     │ Int64  Int64    Int64?  Int64?
─────┼────────────────────────────────
│  2001        1       1      21
│  2001        2       2      22
│  2001        3       3      23
│  2001        4       4      24
│  2002        1       5      25
│  2002        2       6      26
│  2002        3       7      27
│  2002        4       8      28
│  2003        1       9      29
│  2003        2      10      30
│  2003        3      11      31
│  2003        4      12      32

We also check that we have recovered what we wanted:

julia> unstack(long, [:year, :quarter], :region, :sales) == sales
true

However, now try to put only :year in rows. If we want to drop :quarter then
we get:

julia> unstack(long, :year, :region, :sales)
ERROR: ArgumentError: Duplicate entries in unstack at row 2 for key (2001,) and variable north. Pass allowduplicates=true to allow them.

julia> unstack(long, :year, :region, :sales, allowduplicates=true)
3×3 DataFrame
 Row │ year   north   south
     │ Int64  Int64?  Int64?
─────┼───────────────────────
   1 │  2001       4      24
   2 │  2002       8      28
   3 │  2003      12      32

Clearly even if we pass allowduplicates=true we do not get what we most likely
wanted. This leads us to the first case.

Pivot tables with `unstack`

Most likely we want to aggregate sales per year using the sum function. This
is a classic pivot table task. In DataFrames.jl currently one does it in two
steps: first aggregate, then reshape. Here is how you can do it (I am showing
two separate steps, but you could use e.g. Chain.jl to streamline
the processing):

julia> tmp1 = combine(groupby(long, [:year, :region]), :sales => sum => :sales)
6×3 DataFrame
 Row │ year   region  sales
     │ Int64  String  Int64
─────┼──────────────────────
   1 │  2001  north      10
   2 │  2001  south      90
   3 │  2002  north      26
   4 │  2002  south     106
   5 │  2003  north      42
   6 │  2003  south     122

julia> unstack(tmp1, :year, :region, :sales)
3×3 DataFrame
 Row │ year   north   south
     │ Int64  Int64?  Int64?
─────┼───────────────────────
   1 │  2001      10      90
   2 │  2002      26     106
   3 │  2003      42     122

Multiple variables put in columns

What if we wanted to put only :year in rows, but both :quarter and :region
in columns?

In this case we need to create a temporary column which we combine the
:quarter and :region. Here is a simple example:

julia> tmp2 = transform(long, [:region, :quarter] => ByRow(string) => :rq)
24×6 DataFrame
 Row │ year   quarter  region  sales  costs    rq
     │ Int64  Int64    String  Int64  Float64  String
─────┼────────────────────────────────────────────────
│  2001        1  north       1      0.5  north1
│  2001        2  north       2      1.0  north2
│  2001        3  north       3      1.5  north3
│  2001        4  north       4      2.0  north4
│  2002        1  north       5      2.5  north1
│  2002        2  north       6      3.0  north2
│  2002        3  north       7      3.5  north3
│  2002        4  north       8      4.0  north4
│  2003        1  north       9      4.5  north1
│  2003        2  north      10      5.0  north2
│  2003        3  north      11      5.5  north3
│  2003        4  north      12      6.0  north4
│  2001        1  south      21     10.5  south1
│  2001        2  south      22     11.0  south2
│  2001        3  south      23     11.5  south3
│  2001        4  south      24     12.0  south4
│  2002        1  south      25     12.5  south1
│  2002        2  south      26     13.0  south2
│  2002        3  south      27     13.5  south3
│  2002        4  south      28     14.0  south4
│  2003        1  south      29     14.5  south1
│  2003        2  south      30     15.0  south2
│  2003        3  south      31     15.5  south3
│  2003        4  south      32     16.0  south4

julia> unstack(tmp2, :year, :rq, :sales)
3×9 DataFrame
 Row │ year   north1  north2  north3  north4  south1  south2  south3  south4
     │ Int64  Int64?  Int64?  Int64?  Int64?  Int64?  Int64?  Int64?  Int64?
─────┼───────────────────────────────────────────────────────────────────────
│  2001       1       2       3       4      21      22      23      24
│  2002       5       6       7       8      25      26      27      28
│  2003       9      10      11      12      29      30      31      32

Note that this additional step is only required for columns as for rows
unstack accepts multiple columns as shown above.

Multiple value variables

Now we get to my favorite Chekhov’s gun element of our story. Why do we
have a :costs column in our long table? The reason is that now we will
discuss how one can unstack a data frame on multiple value columns.

Here you have three options how you want to store the values:

use a nested field;
stack them vertically;
merge them horizontally.

Let me now discuss the three options. Nesting the field can be done e.g.
in the following way:

julia> tmp3 = transform(long, AsTable([:sales, :costs]) =>
                              ByRow(identity) =>
                              :indicators)
24×6 DataFrame
 Row │ year   quarter  region  sales  costs    indicators
     │ Int64  Int64    String  Int64  Float64  NamedTupl…
─────┼────────────────────────────────────────────────────────────────────
│  2001        1  north       1      0.5  (sales = 1, costs = 0.5)
│  2001        2  north       2      1.0  (sales = 2, costs = 1.0)
│  2001        3  north       3      1.5  (sales = 3, costs = 1.5)
│  2001        4  north       4      2.0  (sales = 4, costs = 2.0)
│  2002        1  north       5      2.5  (sales = 5, costs = 2.5)
│  2002        2  north       6      3.0  (sales = 6, costs = 3.0)
│  2002        3  north       7      3.5  (sales = 7, costs = 3.5)
│  2002        4  north       8      4.0  (sales = 8, costs = 4.0)
│  2003        1  north       9      4.5  (sales = 9, costs = 4.5)
│  2003        2  north      10      5.0  (sales = 10, costs = 5.0)
│  2003        3  north      11      5.5  (sales = 11, costs = 5.5)
│  2003        4  north      12      6.0  (sales = 12, costs = 6.0)
│  2001        1  south      21     10.5  (sales = 21, costs = 10.5)
│  2001        2  south      22     11.0  (sales = 22, costs = 11.0)
│  2001        3  south      23     11.5  (sales = 23, costs = 11.5)
│  2001        4  south      24     12.0  (sales = 24, costs = 12.0)
│  2002        1  south      25     12.5  (sales = 25, costs = 12.5)
│  2002        2  south      26     13.0  (sales = 26, costs = 13.0)
│  2002        3  south      27     13.5  (sales = 27, costs = 13.5)
│  2002        4  south      28     14.0  (sales = 28, costs = 14.0)
│  2003        1  south      29     14.5  (sales = 29, costs = 14.5)
│  2003        2  south      30     15.0  (sales = 30, costs = 15.0)
│  2003        3  south      31     15.5  (sales = 31, costs = 15.5)
│  2003        4  south      32     16.0  (sales = 32, costs = 16.0)

julia> unstack(tmp3, [:year, :quarter], :region, :indicators)
12×4 DataFrame
 Row │ year   quarter  north                      south
     │ Int64  Int64    NamedTup…?                 NamedTup…?
─────┼───────────────────────────────────────────────────────────────────────
│  2001        1  (sales = 1, costs = 0.5)   (sales = 21, costs = 10.5)
│  2001        2  (sales = 2, costs = 1.0)   (sales = 22, costs = 11.0)
│  2001        3  (sales = 3, costs = 1.5)   (sales = 23, costs = 11.5)
│  2001        4  (sales = 4, costs = 2.0)   (sales = 24, costs = 12.0)
│  2002        1  (sales = 5, costs = 2.5)   (sales = 25, costs = 12.5)
│  2002        2  (sales = 6, costs = 3.0)   (sales = 26, costs = 13.0)
│  2002        3  (sales = 7, costs = 3.5)   (sales = 27, costs = 13.5)
│  2002        4  (sales = 8, costs = 4.0)   (sales = 28, costs = 14.0)
│  2003        1  (sales = 9, costs = 4.5)   (sales = 29, costs = 14.5)
│  2003        2  (sales = 10, costs = 5.0)  (sales = 30, costs = 15.0)
│  2003        3  (sales = 11, costs = 5.5)  (sales = 31, costs = 15.5)
│  2003        4  (sales = 12, costs = 6.0)  (sales = 32, costs = 16.0)

The second option is vertical stacking:

julia> vcat(unstack(long, [:year, :quarter], :region, :sales),
            unstack(long, [:year, :quarter], :region, :costs),
            source=:indicator=>["sales", "costs"])
24×5 DataFrame
 Row │ year   quarter  north     south     indicator
     │ Int64  Int64    Float64?  Float64?  String
─────┼───────────────────────────────────────────────
│  2001        1       1.0      21.0  sales
│  2001        2       2.0      22.0  sales
│  2001        3       3.0      23.0  sales
│  2001        4       4.0      24.0  sales
│  2002        1       5.0      25.0  sales
│  2002        2       6.0      26.0  sales
│  2002        3       7.0      27.0  sales
│  2002        4       8.0      28.0  sales
│  2003        1       9.0      29.0  sales
│  2003        2      10.0      30.0  sales
│  2003        3      11.0      31.0  sales
│  2003        4      12.0      32.0  sales
│  2001        1       0.5      10.5  costs
│  2001        2       1.0      11.0  costs
│  2001        3       1.5      11.5  costs
│  2001        4       2.0      12.0  costs
│  2002        1       2.5      12.5  costs
│  2002        2       3.0      13.0  costs
│  2002        3       3.5      13.5  costs
│  2002        4       4.0      14.0  costs
│  2003        1       4.5      14.5  costs
│  2003        2       5.0      15.0  costs
│  2003        3       5.5      15.5  costs
│  2003        4       6.0      16.0  costs

julia> unstack(stack(long, [:sales, :costs], [:year, :quarter, :region],
                     variable_name=:indicator),
               [:year, :quarter, :indicator], :region, :value)
24×5 DataFrame
 Row │ year   quarter  indicator  north     south
     │ Int64  Int64    String     Float64?  Float64?
─────┼───────────────────────────────────────────────
│  2001        1  sales           1.0      21.0
│  2001        2  sales           2.0      22.0
│  2001        3  sales           3.0      23.0
│  2001        4  sales           4.0      24.0
│  2002        1  sales           5.0      25.0
│  2002        2  sales           6.0      26.0
│  2002        3  sales           7.0      27.0
│  2002        4  sales           8.0      28.0
│  2003        1  sales           9.0      29.0
│  2003        2  sales          10.0      30.0
│  2003        3  sales          11.0      31.0
│  2003        4  sales          12.0      32.0
│  2001        1  costs           0.5      10.5
│  2001        2  costs           1.0      11.0
│  2001        3  costs           1.5      11.5
│  2001        4  costs           2.0      12.0
│  2002        1  costs           2.5      12.5
│  2002        2  costs           3.0      13.0
│  2002        3  costs           3.5      13.5
│  2002        4  costs           4.0      14.0
│  2003        1  costs           4.5      14.5
│  2003        2  costs           5.0      15.0
│  2003        3  costs           5.5      15.5
│  2003        4  costs           6.0      16.0

Finally we might want to perform horizontal merging which can be done e.g. like
this:

julia> outerjoin(unstack(long, [:year, :quarter], :region, :sales),
                 unstack(long, [:year, :quarter], :region, :costs),
                 on=[:year, :quarter], renamecols="_sales" => "_costs")
12×6 DataFrame
 Row │ year   quarter  north_sales  south_sales  north_costs  south_costs
     │ Int64  Int64    Int64?       Int64?       Float64?     Float64?
─────┼────────────────────────────────────────────────────────────────────
   1 │  2001        1            1           21          0.5         10.5
   2 │  2001        2            2           22          1.0         11.0
   3 │  2001        3            3           23          1.5         11.5
   4 │  2001        4            4           24          2.0         12.0
   5 │  2002        1            5           25          2.5         12.5
   6 │  2002        2            6           26          3.0         13.0
   7 │  2002        3            7           27          3.5         13.5
   8 │  2002        4            8           28          4.0         14.0
   9 │  2003        1            9           29          4.5         14.5
  10 │  2003        2           10           30          5.0         15.0
  11 │  2003        3           11           31          5.5         15.5
  12 │  2003        4           12           32          6.0         16.0

Concluding remmarks

Today I have focused mostly on the unstack function, and only mentioned
stack in a few places.

However, it is also worth to know that there are two other functions that are
very often handy and easy to forget about. One is good old permutedims
(transposing a data frame) and the other is flatten(flattening nested
columns). If you want to widen your DataFrames.jl related arsenal of tricks I
recommend you to check out their documentation.

juliabloggers.com

A Julia Language Blog Aggregator

Advanced reshaping in DataFrames.jl

Preparing the data

The basics of `unstack`

Pivot tables with `unstack`

Multiple variables put in columns

Multiple value variables

Concluding remmarks

Related

Preparing the data

The basics of unstack

Pivot tables with unstack

Multiple variables put in columns

Multiple value variables

Concluding remmarks

Related

The basics of `unstack`

Pivot tables with `unstack`