Re-posted from: https://bkamins.github.io/julialang/2021/02/06/colsel.html
Introduction
In my last post I have discussed row selector rules for data frames
in DataFrames.jl.
This time I will cover available column selector options.
This post was written under Julia 1.5.3 and DataFrames 0.22.5.
First we set up the environment:
julia> using DataFrames
julia> df = DataFrame(a=1, b=2, x1=3, x2=4, y1=5, y2=6)
1×6 DataFrame
Row │ a b x1 x2 y1 y2
│ Int64 Int64 Int64 Int64 Int64 Int64
─────┼──────────────────────────────────────────
1 │ 1 2 3 4 5 6
Column selection in indexing syntax
A starting point of column selection syntax is indexing. I will go through
all available options. In later sections of this post I discuss other functions
that support column indexing as they build upon the indexing syntax.
Selecting a single column
You can use either: an integer, a Symbol
or a string. Also begin
and end
is supported just like for arrays. Here are some examples:
julia> df[:, 1]
1-element Array{Int64,1}:
1
julia> df[:, :a]
1-element Array{Int64,1}:
1
julia> df[:, "a"]
1-element Array{Int64,1}:
1
julia> df[:, begin]
1-element Array{Int64,1}:
1
julia> df[:, end]
1-element Array{Int64,1}:
6
Note that if in indexing a single column is selected then it is extracted out
from a data frame as a vector. In all other column selector options described
below you always get a data frame as a result of the operation.
Selecting multiple columns using vectors
You can use the vector selectors: all integers, all Bool
, all Symbol
,
or all string (mixing styles is not allowed). Note that you can also use begin
and end
to define a vector (e.g. in ranges). Here are some examples:
julia> df[:, [1, 2]]
1×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 2
julia> df[:, [trues(2); falses(4)]]
1×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 2
julia> df[:, [:a, :b]]
1×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 2
julia> df[:, ["a", "b"]]
1×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 2
julia> df[:, begin:2]
1×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 2
julia> df[:, 2:end]
1×5 DataFrame
Row │ b x1 x2 y1 y2
│ Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────
1 │ 2 3 4 5 6
Special selectors
There are the following special selectors available:
:
andAll()
allowing you to select all columns:
julia> df[:, :]
1×6 DataFrame
Row │ a b x1 x2 y1 y2
│ Int64 Int64 Int64 Int64 Int64 Int64
─────┼──────────────────────────────────────────
1 │ 1 2 3 4 5 6
julia> df[:, All()]
1×6 DataFrame
Row │ a b x1 x2 y1 y2
│ Int64 Int64 Int64 Int64 Int64 Int64
─────┼──────────────────────────────────────────
1 │ 1 2 3 4 5 6
- a regular expression picking columns whose name matches it:
julia> df[:, r"1"]
1×2 DataFrame
Row │ x1 y1
│ Int64 Int64
─────┼──────────────
1 │ 3 5
julia> df[:, r"x"]
1×2 DataFrame
Row │ x1 x2
│ Int64 Int64
─────┼──────────────
1 │ 3 4
Between
selector allowing you to specify a range of columns (you can specify
the start and stop column using any of the single column selector syntaxes):
julia> df[:, Between(3, :y1)]
1×3 DataFrame
Row │ x1 x2 y1
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 3 4 5
julia> df[:, Between(begin, "x1")]
1×3 DataFrame
Row │ a b x1
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
Not
selector allowing you to specify the columns you want to exclude from
the resulting data frames. You can put any valid other column selector
insideNot
:
julia> df[:, Not("a")]
1×5 DataFrame
Row │ b x1 x2 y1 y2
│ Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────
1 │ 2 3 4 5 6
julia> df[:, Not(r"x")]
1×4 DataFrame
Row │ a b y1 y2
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 2 5 6
julia> df[:, Not(Between(:x1, end))]
1×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 2
Cols
selector picking a union of other selectors passed as its arguments:
julia> df[:, Cols()]
0×0 DataFrame
julia> df[:, Cols(:x1, 1)]
1×2 DataFrame
Row │ x1 a
│ Int64 Int64
─────┼──────────────
1 │ 3 1
julia> df[:, Cols(r"1", r"x")]
1×3 DataFrame
Row │ x1 y1 x2
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 3 5 4
Note that the order of columns in the produced data frame reflects the order
in which ther are selected by consecutive selectors passed inside Cols
.
Selecting just column names
Sometimes one only wants to pick names of columns meeting some condition
without actually creating a new data frame. This can be achieved using the
names
function.
If you just write names(df)
you get a list of all column names of a data frame
as a vector of strings:
julia> names(df)
6-element Array{String,1}:
"a"
"b"
"x1"
"x2"
"y1"
"y2"
You can also pass any column selector expression we described above as a second
argument to names
to get a subset of the column names, e.g.:
julia> names(df, r"x")
2-element Array{String,1}:
"x1"
"x2"
julia> names(df, Not(Between("b", "x2")))
3-element Array{String,1}:
"a"
"y1"
"y2"
Additionally you can pass a type as a second argument to the names
function,
and in this case you will get a vector of column names whose element type is
a subtype of passed argument. Here is an example:
julia> df2 = DataFrame([[1, 2], [1.0, 2.0], ["1", "2"], [1, missing]], :auto)
2×4 DataFrame
Row │ x1 x2 x3 x4
│ Int64 Float64 String Int64?
─────┼─────────────────────────────────
1 │ 1 1.0 1 1
2 │ 2 2.0 2 missing
julia> names(df2, Int)
1-element Array{String,1}:
"x1"
julia> names(df2, Union{Int, Missing})
2-element Array{String,1}:
"x1"
"x4"
julia> names(df2, Real)
2-element Array{String,1}:
"x1"
"x2"
There is also an upcoming feature that will be soon available in DataFrames.jl 1.0
release. You are going to be able to pass as a second argument to the names
function a predicate taking the column name as a string and returning true
for:
columns that should be kept. Here are some examples (note that this code snippet
will only work if you ceck out main
branch of DataFrames.jl from its GitHub
repository):
julia> df3 = DataFrame(x1=1, x2=2, y=3)
1×3 DataFrame
Row │ x1 x2 y
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> names(df3, startswith('x'))
2-element Array{String,1}:
"x1"
"x2"
julia> names(df3, x -> length(x) == 1)
1-element Array{String,1}:
"y"
The names
function is quite often used if you want to perform multiple
transformations of columns using select
, transform
or combine
, e.g.:
julia> select(df, r"x", names(df, r"x") .=> ByRow(sqrt))
1×4 DataFrame
Row │ x1 x2 x1_sqrt x2_sqrt
│ Int64 Int64 Float64 Float64
─────┼────────────────────────────────
1 │ 3 4 1.73205 2.0
Other functions that support passing column selectors
Here is a list of functions that accept the column selectors we discussed in the
indexing section above:
describe
takescols
keyword argument allowing you to choose for which
columns descriptive statistics should be computed;combine
,select
,select!
,transform
,transform!
, allowing you to
just keep the passed columns;groupby
that takes a second positional argument a column selector;flatten
taking a column selector as a second positional argument giving
informations which columns should be flattened;stack
allowing you to specify measure variables and id variables using
the column selectors;unstack
where you can pass columns that specify row keys for unstacking;issorted
,sortperm
,sort
andsort!
where the second positional argument
is a column selector (they also allow some additional syntax to specify how
sorting should be performed but this is a separate topic);unique
,unique!
, andnonunique
where second positional argument is a
column selector signaling which column should be checked for uniqueness;allowmissing
,allowmissing!
,disallowmissing
,disallowmissing!
,
completecases
,dropmissing
,dropmissing!
which all take as a second
argument a column selector indicating which columns should be taken into
account in transformation performed.
Conclusions
As you can see in this post DataFrames.jl provides a very flexible system of
allowed column selectors. Also, good understanding column selector semantics
will boost your efficiency when working with DataFrames.jl as we consistently
support the same patterns across whole package.