Re-posted from: https://bkamins.github.io/julialang/2023/09/08/dropcols.html
Introduction
One of the common tasks when working with a data frame is dropping some of its columns.
There are two ways to do it. You can either specify
which columns you want to keep or which columns you want to drop.
One of the frequent questions I get is how to do these operations with
[DataFrames.jl][tables] in case the list of columns to keep or drop might not
be a subset of columns of the data frame. This is the topic I want to cover in my today’s post.
The post was tested using Julia 1.9.2 and DataFrames.jl 1.6.1.
Standard column selection
First, create an example data frame:
julia> using DataFrames
julia> df = DataFrame(a=1, b=2, c=3, d=4)
1×4 DataFrame
Row │ a b c d
│ Int64 Int64 Int64 Int64
─────┼────────────────────────────
1 │ 1 2 3 4
Now assume I want to keep columns :a
and :c
from it. You can do it by writing, for example:
julia> select(df, :a, :c)
1×2 DataFrame
Row │ a c
│ Int64 Int64
─────┼──────────────
1 │ 1 3
You could also pass the columns as a variable using e.g., a vector:
julia> keep1 = [:a, :c]
2-element Vector{Symbol}:
:a
:c
julia> select(df, keep1)
1×2 DataFrame
Row │ a c
│ Int64 Int64
─────┼──────────────
1 │ 1 3
Now, let us discuss dropping columns. Assume we want to keep all columns except columns :b
and :d
.
We can achieve this by using the Not
command:
julia> select(df, Not(:b, :d))
1×2 DataFrame
Row │ a c
│ Int64 Int64
─────┼──────────────
1 │ 1 3
Also in this case we can use a helper variable:
julia> drop1 = [:b, :d]
2-element Vector{Symbol}:
:b
:d
julia> select(df, Not(drop1))
1×2 DataFrame
Row │ a c
│ Int64 Int64
─────┼──────────────
1 │ 1 3
The problematic case: the selected column is not present in a data frame
In some scenarios, we might want to provide a list of columns of which not
all are present in the data frame. For example, assume we want to keep
columns :a
and :x
. We see that the :x
column is not present in our
df
data frame.
Before we move forward, let me comment when such a situation occurs most often.
Assume you have 100 data frames that describe your data. Each data frame is similar,
but not identical. For example, a single data frame might represent data from one country
and the list of information for the countries does not have to be identical (for some
countries we might have more information, which results in more columns in a data frame).
When processing such data we might want to write one general condition on which columns
we want to keep or drop, and some of these columns might be present in only a subset of
all data frames.
Now let us go back to our example. Let us try keeping columns :a
and :x
:
julia> select(df, :a, :x)
ERROR: ArgumentError: column name "x" not found in the data frame; existing most similar names are: "a", "b", "c" and "d"
julia> keep2 = [:a, :x]
2-element Vector{Symbol}:
:a
:x
julia> select(df, keep2)
ERROR: ArgumentError: column name "x" not found in the data frame; existing most similar names are: "a", "b", "c" and "d"
We get an error. DataFrames.jl is designed to check, by default, that the operation you want to perform
on your data frame is valid. This is a conscious design decision. The reason is that in production application
settings most often when you say that you want to keep columns :a
and :x
you assume that they are present in df
.
Thus you want to get an error if they would not be all present in it.
The same behavior can be observed for dropping columns. Assume we want to drop columns :b
and :x
:
julia> select(df, Not(:b, :x))
ERROR: ArgumentError: column name "x" not found in the data frame; existing most similar names are: "a", "b", "c" and "d"
julia> drop2 = [:b, :x]
2-element Vector{Symbol}:
:b
:x
julia> select(df, drop2)
ERROR: ArgumentError: column name "x" not found in the data frame; existing most similar names are: "a", "b", "c" and "d"
So what we saw here is a default behavior that was designed to be safe.
In what follows let me discuss how to perform a flexible selection.
Performing column selection when some of them are not present in a data frame
There are several solutions for column selection when some of them are not present in a data frame.
Let me present the one that I find the most convenient.
For this operation I typically use the Cols
selector. The reason is that you can pass
a condition function (a predicate) as an argument to Cols
that will select columns
whose names meet a passed condition.
Therefore the following operation:
julia> select(df, keep1)
1×2 DataFrame
Row │ a c
│ Int64 Int64
─────┼──────────────
1 │ 1 3
Is the same as:
julia> keep1s = string.(keep1)
2-element Vector{String}:
"a"
"c"
julia> select(df, Cols(in(keep1s)))
1×2 DataFrame
Row │ a c
│ Int64 Int64
─────┼──────────────
1 │ 1 3
Note what we did here. The in(keep1s)
expression produces a function that checks if a value passed to it is in the keep1s
vector.
It is important to note that although column selection in DataFrames.jl accepts both Symbol
(like :a
) and strings (like "a"
)
as column names the Cols
-based selector will perform the check against strings. Therefore I had to convert the keep1
vector
of symbols to a keep1s
vector of strings.
So far the select(df, Cols(in(keep1s)))
is more verbose than just writing select(df, keep1)
. However, the benefit of Cols
is that when the in(keep1s)
check is done we can have in keep1s
vector whatever values we like, in particular,
they do not have be valid column names of our df
.
Therefore to keep columns :a
and :x
, if we are unsure if these columns are present in df
we can write:
julia> select(df, Cols(in(["a", "x"])))
1×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
Note that this time the operation works without an error. And again, please keep in mind that in this selector we need to pass
column names as strings.
Now you can probably already tell how to pass a list of columns to drop, without requiring that they are present in df
.
The only thing to do is to use the !in
function (not-in) instead of in
. Let us drop columns :b
and :x
from our
data frame (keeping in mind that :x
is not present in it):
julia> select(df, Cols(!in(["b", "x"])))
1×3 DataFrame
Row │ a c d
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 3 4
All worked as expected – the :b
column was dropped and the :x
column was ignored in the column dropping operation.
Conclusions
I hope you find the examples I gave today useful.
In general, the whole design of DataFrames.jl is similar to
what was discussed in this post. The default behavior is picked to be safe (as in our example: by default,
select
checks if the columns you pass are present in a data frame), but it is possible to switch to an unsafe
mode relatively easily (in our example: using Cols
with a predicate function).