Re-posted from: https://bkamins.github.io/julialang/2020/09/04/separate.html
Introduction
Recently several users have asked how one can split one column into several
columns in DataFrames.jl.
A functionality of this kind is provided e.g. in dplyr by the spread
function, but we are currently missing it in DataFrames.jl.
In this post I comment how to obtain the expected behavior in the current version
of DataFrames.jl (i.e. v21) with a little help of the SplitApplyCombine.jl
pakage.
The codes were tested under Julia 1.5, DataFrames.jl 0.21,
and SplitApplyCombine.jl 1.1.
Spreading a column
For our example I use the data I have found in this post about most
popular first and last name combinations in the United States:
julia> using DataFrames
julia> df = DataFrame(name=["James Smith", "Michael Smith", "Robert Smith",
"Maria Garcia", "David Smith", "Maria Rodriguez",
"Mary Smith", "Maria Hernandez", "Maria Martinez",
"James Johnson"],
freq=[38_313, 34_810, 34_269, 32_092, 31_294,
30_507, 28_692, 27_836, 26_956, 26_850])
10×2 DataFrame
│ Row │ name │ freq │
│ │ String │ Int64 │
├─────┼─────────────────┼───────┤
│ 1 │ James Smith │ 38313 │
│ 2 │ Michael Smith │ 34810 │
│ 3 │ Robert Smith │ 34269 │
│ 4 │ Maria Garcia │ 32092 │
│ 5 │ David Smith │ 31294 │
│ 6 │ Maria Rodriguez │ 30507 │
│ 7 │ Mary Smith │ 28692 │
│ 8 │ Maria Hernandez │ 27836 │
│ 9 │ Maria Martinez │ 26956 │
│ 10 │ James Johnson │ 26850 │
We want to spread :name
column into :first
and :last
columns holding first
and last name respectively.
Here is how one can do it using just Julia Base:
julia> tmp = split.(df.name)
10-element Array{Array{SubString{String},1},1}:
["James", "Smith"]
["Michael", "Smith"]
["Robert", "Smith"]
["Maria", "Garcia"]
["David", "Smith"]
["Maria", "Rodriguez"]
["Mary", "Smith"]
["Maria", "Hernandez"]
["Maria", "Martinez"]
["James", "Johnson"]
julia> insertcols!(df, [n => getindex.(tmp, i) for (i, n) in
enumerate([:first, :last])]...)
10×4 DataFrame
│ Row │ name │ freq │ first │ last │
│ │ String │ Int64 │ SubStri… │ SubStrin… │
├─────┼─────────────────┼───────┼──────────┼───────────┤
│ 1 │ James Smith │ 38313 │ James │ Smith │
│ 2 │ Michael Smith │ 34810 │ Michael │ Smith │
│ 3 │ Robert Smith │ 34269 │ Robert │ Smith │
│ 4 │ Maria Garcia │ 32092 │ Maria │ Garcia │
│ 5 │ David Smith │ 31294 │ David │ Smith │
│ 6 │ Maria Rodriguez │ 30507 │ Maria │ Rodriguez │
│ 7 │ Mary Smith │ 28692 │ Mary │ Smith │
│ 8 │ Maria Hernandez │ 27836 │ Maria │ Hernandez │
│ 9 │ Maria Martinez │ 26956 │ Maria │ Martinez │
│ 10 │ James Johnson │ 26850 │ James │ Johnson │
The code is a bit verbose and uses a temporaty variable. We could have written the
second step also e.g. like this:
for (i, n) in enumerate([:first, :last])
df[!, n] = getindex.(tmp, i)
end
but it is still quite verbose.
We can have a shorter code and avoid a temporary variable using the invert
function from SplitApplyCombine.jl:
julia> using SplitApplyCombine
julia> insertcols!(df, ([:first, :last] .=> invert(split.(df.name)))...,
makeunique=true)
10×6 DataFrame
│ Row │ name │ freq │ first │ last │ first_1 │ last_1 │
│ │ String │ Int64 │ SubStri… │ SubStrin… │ SubStri… │ SubStrin… │
├─────┼─────────────────┼───────┼──────────┼───────────┼──────────┼───────────┤
│ 1 │ James Smith │ 38313 │ James │ Smith │ James │ Smith │
│ 2 │ Michael Smith │ 34810 │ Michael │ Smith │ Michael │ Smith │
│ 3 │ Robert Smith │ 34269 │ Robert │ Smith │ Robert │ Smith │
│ 4 │ Maria Garcia │ 32092 │ Maria │ Garcia │ Maria │ Garcia │
│ 5 │ David Smith │ 31294 │ David │ Smith │ David │ Smith │
│ 6 │ Maria Rodriguez │ 30507 │ Maria │ Rodriguez │ Maria │ Rodriguez │
│ 7 │ Mary Smith │ 28692 │ Mary │ Smith │ Mary │ Smith │
│ 8 │ Maria Hernandez │ 27836 │ Maria │ Hernandez │ Maria │ Hernandez │
│ 9 │ Maria Martinez │ 26956 │ Maria │ Martinez │ Maria │ Martinez │
│ 10 │ James Johnson │ 26850 │ James │ Johnson │ James │ Johnson │
In this call I have used makeunique=true
as we update the df
data frame in
place and it already contains :first
and :last
columns.
So the code is not that long, but admittedly spread
in dplyr is shorter.
Before we finish let us see what the invert
function produces when applied
to the tmp
variable we have created above:
julia> invert(tmp)
2-element Array{Array{SubString{String},1},1}:
["James", "Michael", "Robert", "Maria", "David", "Maria", "Mary", "Maria",
"Maria", "James"]
["Smith", "Smith", "Smith", "Garcia", "Smith", "Rodriguez", "Smith",
"Hernandez", "Martinez", "Johnson"]
As you can see it takes a container of containers and reverses the order
of nesting.
Conclusion
If you feel that it would be good to have an in-built function in DataFrames.jl
that would do splitting of columns in a data frame please leave a comment in
this issue.