Re-posted from: https://bkamins.github.io/julialang/2022/10/28/indexing.html
Introduction
In the DataFrames.jl 1.4 release we reached the target state of data frame
indexing that we aimed for when we designed the 1.0 release.
In this post I want to present the mental model you should have when
thinking about indexing into a data frame. Learning this is important because
we needed to extend the standard indexing rules that are defined
for arrays in Julia to cover all scenarios that users need when working with
data frames.
I will focus on working with a single column of a data frame as this is the most
common indexing scenario.
The code was tested under Julia 1.8.2 and DataFrames.jl 1.4.2.
Rule 1: indexing always requires passing both row and column selector
When you index into a data frame you must pass exactly two dimensions: a row
selector and a column selector, like df[row_selector, column_selector].
When indexing you can think of a data frame as a matrix.
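Here is a minimal sketch of this rule. The name df_demo is hypothetical and used only for illustration, and the code deliberately does not rely on the exact type of the exception that is thrown:

```julia
using DataFrames

# Hypothetical small data frame, used only for illustration
df_demo = DataFrame(a=1:3, b=4:6)

# Two selectors: row 2, column 1
value = df_demo[2, 1]

# A single selector is not accepted; we only record that an error is thrown
single_selector_fails = try
    df_demo[1]
    false
catch
    true
end

println(value)
println(single_selector_fails)
```

The two-selector form returns a value, while the single-selector form is rejected.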
Rule 2: all indexing that works on matrices works the same way for data frames
The benefit of the rule that a data frame follows matrix indexing is that it is
easy to remember. If you know how matrix indexing works then it all translates
directly to a data frame.
Here are some examples of extracting a column or its fragment from a data frame.
julia> using DataFrames
julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 4
2 │ 2 5
3 │ 3 6
julia> df[1, 1]
1
julia> df[1:2, 1]
2-element Vector{Int64}:
1
2
julia> df[:, 2]
3-element Vector{Int64}:
4
5
6
(As I noted in the Introduction, in this post I concentrate on getting a
single column from a data frame, so the second index was always an integer.)
One important rule of Julia Base indexing is that when you get a column or its
part from a matrix a copy is made, except if you extract a single element. This
is exactly what happens for a data frame: df[:, 2] is a copy of the second
column stored in df. You can check it by writing e.g.:
julia> df[:, 2] === eachcol(df)[2]
false
julia> df[:, 2] == eachcol(df)[2]
true
We see that we get the same data, but not the same object.
If you instead want a view of a column without copying the data, you can use
@view exactly like you would with matrices:
julia> @view df[1, 1]
0-dimensional view(::Vector{Int64}, 1) with eltype Int64:
1
julia> @view df[1:2, 1]
2-element view(::Vector{Int64}, 1:2) with eltype Int64:
1
2
julia> @view df[:, 2]
3-element view(::Vector{Int64}, :) with eltype Int64:
4
5
6
As you can see, again everything worked as if df were a matrix.
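To make the analogy concrete, here is a small sketch that runs the same index expressions on a matrix and on a data frame built from it. The names m and df_m are hypothetical and chosen only for this illustration:

```julia
using DataFrames

# Hypothetical matrix and a data frame built from it (names are illustrative)
m = [1 4; 2 5; 3 6]
df_m = DataFrame(m, [:a, :b])

# The same index expressions return the same values for both objects
same_scalar = m[1, 1] == df_m[1, 1]
same_slice = m[1:2, 1] == df_m[1:2, 1]

# A slice is a copy: changing it leaves the data frame untouched
col_copy = df_m[:, 2]
col_copy[1] = -1
copy_is_independent = df_m[1, 2] == 4

# A view shares memory: changing it is visible in the data frame
col_view = @view df_m[:, 2]
col_view[1] = -1
view_is_shared = df_m[1, 2] == -1
```

In this sketch all four checks evaluate to true: slices are independent copies, while views write through to the data frame.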
The same rules that worked for getting data from a data frame work for setting
data in a data frame. You have two options here: normal assignment and
broadcasted assignment.
julia> df[1:2, 1] = [11, 12]
2-element Vector{Int64}:
11
12
julia> df[:, 2] .= 100
3-element view(::Vector{Int64}, :) with eltype Int64:
100
100
100
julia> df
3×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 11 100
2 │ 12 100
3 │ 3 100
These operations, again, work exactly like for matrices. In particular they
are in-place, that is, no memory is allocated when performing them. The data
is written into the already allocated column. This rule is important as it means
that with such an assignment you cannot change the element type of a column:
julia> df[:, 1] = ["a", "b", "c"]
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Int64
julia> df[:, 2] .= "x"
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Int64
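The flip side of this rule is that values which can be converted to the existing element type are converted silently. Below is a minimal sketch of both outcomes. The name df_conv is hypothetical, and the code only records that the failing assignment throws, without assuming the exception type:

```julia
using DataFrames

df_conv = DataFrame(a=1:3)  # hypothetical example data

# Float64 values that are whole numbers are converted to Int64 on assignment
df_conv[:, :a] = [10.0, 20.0, 30.0]
eltype_kept = eltype(df_conv[:, :a]) == Int64

# A value that cannot be represented as Int64 makes the assignment fail
fractional_fails = try
    df_conv[:, :a] = [1.5, 2.5, 3.5]
    false
catch
    true
end
```

The column keeps its Int64 element type in both cases. Only the values change, and only when the conversion is possible.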
In summary, standard array indexing works the same way for matrices and for data
frames.
In what follows I describe the extensions to the indexing rules that are
DataFrames.jl specific.
Rule 3: you can use strings or symbols to pass column names
The first extension is related to the fact that standard matrices have to be
indexed by integers. In DataFrames.jl you can alternatively use a string or
a symbol to select a column when indexing. Here are some basic examples:
julia> df[1:2, "a"]
2-element Vector{Int64}:
11
12
julia> df[:, :b] .= 1000
3-element view(::Vector{Int64}, :) with eltype Int64:
1000
1000
1000
This is intuitive so far. However, this rule leads to one extension. It is
related to the fact that you can pass a column name that does not exist yet
in a data frame. In this case, if you pass : as the row selector, a new column
will be allocated in the data frame (that is, a copy of the source data will be
made) and the new column will be added at the end of the data frame:
julia> df[:, "d"] = [-1, -2, -3]
3-element Vector{Int64}:
-1
-2
-3
julia> df
3×3 DataFrame
Row │ a b d
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 11 1000 -1
2 │ 12 1000 -2
3 │ 3 1000 -3
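The copy mentioned above can be observed directly. Here is a short sketch, in which df_new and src are hypothetical names used only for illustration:

```julia
using DataFrames

df_new = DataFrame(a=1:3)   # hypothetical example data
src = [-1, -2, -3]

# Column "d" does not exist yet, so it is created from a copy of `src`
df_new[:, "d"] = src
stored_is_copy = eachcol(df_new)[2] !== src

# Changing the source vector afterwards does not affect the data frame
src[1] = 0
df_unaffected = df_new[1, "d"] == -1
```

Because the stored column is a separate object, later changes to src do not leak into the data frame.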
This rule has two additional special cases. The first is when a data frame
has no columns yet. In such a situation you can add any vector to a data frame:
julia> df2 = DataFrame()
0×0 DataFrame
julia> df2[:, :x] = [1, 2]
2-element Vector{Int64}:
1
2
julia> df2
2×1 DataFrame
Row │ x
│ Int64
─────┼───────
1 │ 1
2 │ 2
What is non-standard with this rule? We allow changing the number of rows
in a data frame from zero to whatever is needed. For a data frame having some
columns changing their number of rows is not allowed with indexing.
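A minimal sketch of the difference between the two situations follows. The name df_empty is hypothetical, and the code only records that the second assignment throws, without relying on a specific exception type:

```julia
using DataFrames

df_empty = DataFrame()          # no columns, so any length is accepted
df_empty[:, :x] = [1, 2]
rows_after_first = nrow(df_empty)

# The data frame now has 2 rows, so a 3-element vector does not fit
wrong_length_fails = try
    df_empty[:, :y] = [1, 2, 3]
    false
catch
    true
end
```

After the first column is added the number of rows is fixed, and a vector of a different length is rejected.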
The second special case is when you try to create a new column in a view of
a data frame. It is not allowed in general, but if the view was created with
: as a column selector we accept it (the reason is that in this case the view
subsets only rows, but does not change columns; this is a common use case
in practice). Here is an example:
julia> df
3×3 DataFrame
Row │ a b d
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 11 1000 -1
2 │ 12 1000 -2
3 │ 3 1000 -3
julia> dfv = @view df[[3, 1], :]
2×3 SubDataFrame
Row │ a b d
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 3 1000 -3
2 │ 11 1000 -1
julia> dfv[:, :e] = [1, 2]
2-element Vector{Int64}:
1
2
julia> dfv
2×4 SubDataFrame
Row │ a b d e
│ Int64 Int64 Int64 Int64?
─────┼─────────────────────────────
1 │ 3 1000 -3 1
2 │ 11 1000 -1 2
julia> df
3×4 DataFrame
Row │ a b d e
│ Int64 Int64 Int64 Int64?
─────┼──────────────────────────────
1 │ 11 1000 -1 2
2 │ 12 1000 -2 missing
3 │ 3 1000 -3 1
As you can see, in this case the new column gets missing for rows that are not
present in the dfv data frame.
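For contrast, here is a sketch of the disallowed case next to the allowed one. The names df_parent, rows_only and cols_subset are hypothetical, and the code only records that an error is thrown for the view that subsets columns:

```julia
using DataFrames

df_parent = DataFrame(a=1:3, b=4:6)     # hypothetical example data

# View created with `:` as column selector: adding a column is accepted
rows_only = @view df_parent[[3, 1], :]
rows_only[:, :c] = [10, 20]
parent_has_c = "c" in names(df_parent)
skipped_row_is_missing = ismissing(df_parent[2, :c])

# View that also subsets columns: adding a column is rejected
cols_subset = @view df_parent[[3, 1], [:a]]
subset_add_fails = try
    cols_subset[:, :z] = [1, 2]
    false
catch
    true
end
```

Only the view that keeps all columns of its parent can be used to add a column.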
Rule 4: You can use Not to negate row selection
This is a pretty simple rule. Instead of selecting rows, you can use Not
to say which rows you want to drop when indexing:
julia> df
3×4 DataFrame
Row │ a b d e
│ Int64 Int64 Int64 Int64?
─────┼──────────────────────────────
1 │ 11 1000 -1 2
2 │ 12 1000 -2 missing
3 │ 3 1000 -3 1
julia> df[Not(2), :d]
2-element Vector{Int64}:
-1
-3
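Not also accepts ranges and vectors of row numbers, not only a single integer. A brief sketch, with df_not as a hypothetical data frame used only for illustration:

```julia
using DataFrames

df_not = DataFrame(a=[11, 12, 3], d=[-1, -2, -3])  # hypothetical example data

drop_one = df_not[Not(2), :d]         # drop row 2
drop_range = df_not[Not(1:2), :d]     # drop rows 1 and 2
drop_some = df_not[Not([1, 3]), :a]   # drop rows 1 and 3
```

In each case the rows listed inside Not are dropped and the remaining rows keep their original order.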
Rule 5: There is a special ! row selector
The special ! row selector selects all rows similarly to :, but it has
a different behavior in relation to column allocation.
When extracting data from a data frame, if you use ! no copy of the data is
made. Instead, you just get a column as it is stored in the source data frame
(or an appropriate view if you work with a view of a data frame):
julia> df[!, 1]
3-element Vector{Int64}:
11
12
3
julia> df[!, 1] === eachcol(df)[1]
true
julia> dfv[!, 1]
2-element view(::Vector{Int64}, [3, 1]) with eltype Int64:
3
11
The reason why ! is allowed when extracting a column from a data frame
is performance. In general it is not safe to use !. You should prefer :
as copying data (in R-like style) is safer. However, in some cases, when your
operations are memory-bound or you need to maximize performance, you have !
at hand to avoid unnecessary operations.
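The safety concern comes from aliasing. Here is a minimal sketch, with df_alias as a hypothetical data frame used only for illustration:

```julia
using DataFrames

df_alias = DataFrame(a=[11, 12, 3])   # hypothetical example data

# `:` returns a copy, so modifying it does not touch the data frame
safe = df_alias[:, :a]
safe[1] = 0
unchanged = df_alias[1, :a] == 11

# `!` returns the stored vector itself, so the change is visible in the data frame
direct = df_alias[!, :a]
direct[1] = 0
changed = df_alias[1, :a] == 0
```

Any code that holds on to a vector obtained with ! can therefore modify the data frame without indexing into it again.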
When writing data into a data frame, the ! selector also behaves
differently than :. Recall that : is always in-place for an existing column.
Instead, ! stores a fresh column in a data frame.
julia> df
3×4 DataFrame
Row │ a b d e
│ Int64 Int64 Int64 Int64?
─────┼──────────────────────────────
1 │ 11 1000 -1 2
2 │ 12 1000 -2 missing
3 │ 3 1000 -3 1
julia> df[!, "a"] = ["a", "b", "c"]
3-element Vector{String}:
"a"
"b"
"c"
julia> dfv[!, "e"] .= "x"
2-element Vector{String}:
"x"
"x"
julia> df
3×4 DataFrame
Row │ a b d e
│ String Int64 Int64 Any
─────┼───────────────────────────────
1 │ a 1000 -1 x
2 │ b 1000 -2 missing
3 │ c 1000 -3 x
How the fresh column is stored depends on the type of data frame you use and the
type of operation:
- if you use standard assignment on a data frame (df[!, "a"] = ["a", "b", "c"])
then the source vector is not copied, but is stored as-is (so this kind of
storage is the fastest way to store a column in a data frame).
- if you use broadcasted assignment (df[!, "a"] .= ["a", "b", "c"]) or assign
into a view of a data frame (dfv[!, "a"] = ["x", "x"]) then a new column is
freshly allocated (this kind of behavior is needed if you want to change
the element type of a column already present in a data frame, but want a copy
of the source vector to be made for safety reasons).
So for assignment with ! the basic rule to remember is that it replaces the
existing column in a data frame. Then the additional rule is that normal
assignment on a data frame is non-copying, and broadcasted assignment or
using a view of a data frame allocates a copy.
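The difference between the non-copying and the copying variant can be checked with object identity. A short sketch, where df_store and v are hypothetical names used only for illustration:

```julia
using DataFrames

df_store = DataFrame(a=1:3, b=4:6)   # hypothetical example data
v = ["a", "b", "c"]

# Normal assignment with `!`: the vector itself becomes the column
df_store[!, "a"] = v
stored_as_is = df_store[!, "a"] === v

# Broadcasted assignment with `!`: a new vector is allocated
df_store[!, "b"] .= v
freshly_allocated = df_store[!, "b"] !== v && df_store[!, "b"] == v
```

In both cases the old Int64 column is replaced by a String column. What differs is only whether the data frame shares memory with v.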
These rules were designed to cover all possible scenarios that users might need
when working with columns of a data frame.
Rule 6: property access works the same as if you used ! as row selector
With this rule we are getting to the point that changed in DataFrames.jl 1.4
(but was planned and announced much earlier).
When you write df.a it is always treated the same as df[!, "a"], for
extracting a column from a data frame, for assignment, and for broadcasted
assignment.
Since DataFrames.jl 1.4 this is all you need to remember about property access
to a data frame. We decided on this rule as it is easy to remember and it does
not add any new concepts or exceptions for users to learn.
However, unfortunately, this is the place where we had to deviate from how
property access works in Base Julia. Let me explain the issue step by step.
If you write df.a = ["a", "b", "c"] then you expect that column a in the df
data frame is replaced by ["a", "b", "c"]. And this is what happens. Recall
that this is what df[!, "a"] = ["a", "b", "c"] also does.
Now this means that, if we want df.a .= ["a", "b", "c"] to work the same as
df[!, "a"] .= ["a", "b", "c"], then both operations must replace column a in df.
And here we have a slight inconsistency, as in Base Julia users would expect
that df.a .= ["a", "b", "c"] would work in-place, like
df[:, "a"] .= ["a", "b", "c"]. This is not the case.
Therefore you need to remember that in DataFrames.jl if you use property access
syntax for setting data it always replaces the column: both when doing assignment
and broadcasted assignment.
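Here is a small sketch of this equivalence. It assumes Julia 1.7 or later and DataFrames.jl 1.4 or later, as discussed in this post, and df_prop is a hypothetical name used only for illustration:

```julia
using DataFrames

df_prop = DataFrame(a=1:3)   # hypothetical example data

# Reading: both forms return the very same stored vector
same_object = df_prop.a === df_prop[!, "a"]

# Broadcasted assignment via property access replaces the column,
# so the element type can change (here from Int64 to String)
old_col = df_prop.a
df_prop.a .= "x"
column_replaced = df_prop.a !== old_col
new_eltype = eltype(df_prop.a)
```

The vector bound to old_col is left untouched. The data frame simply stops referring to it.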
The reason why we decided on this design is twofold. The first is learning. We
have a simple rule: the ! selector and property access always work the same way.
The second reason is convenience. Most users preferred the following operation:
julia> df3 = DataFrame(c=1:3)
3×1 DataFrame
Row │ c
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
julia> df3.c .= 'x'
3-element Vector{Char}:
'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
julia> df3
3×1 DataFrame
Row │ c
│ Char
─────┼──────
1 │ x
2 │ x
3 │ x
to replace column c with a vector of 'x' values. Note that if you instead
use : as the row selector you get a somewhat surprising result:
julia> df3 = DataFrame(c=1:3)
3×1 DataFrame
Row │ c
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
julia> df3[:, "c"] .= 'x'
3-element view(::Vector{Int64}, :) with eltype Int64:
120
120
120
julia> df3
3×1 DataFrame
Row │ c
│ Int64
─────┼───────
1 │ 120
2 │ 120
3 │ 120
The reason for this behavior is that Char is silently converted to Int:
julia> convert(Int, 'x')
120
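The same conversion happens for an ordinary vector in Base Julia, which is why the in-place variant behaves this way. A minimal sketch, with v as an illustrative name:

```julia
# Plain Base Julia, no packages needed
v = [1, 2, 3]
v .= 'x'                # each element becomes convert(Int, 'x')
all_converted = v == [120, 120, 120]
```

So the data frame result with : as the row selector is not a DataFrames.jl peculiarity. It follows from writing into an existing Vector{Int64}.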
What is the general benefit of the behavior we adopted? When you write
df[!, "a"] .= 'x' or df.a .= 'x' you are sure that as a result you will
have a freshly allocated column "a" in df containing 'x' values as all
its elements, regardless of whether column "a" was already present
in the data frame or not. So this is like a git push --force operation.
It is guaranteed to succeed no matter what the starting state of the data
frame was (of course assuming that the object on the right hand side supports
broadcasting and has proper dimensions).
Why has it taken us so long to reach this state?
The reason why we landed with these rules only in DataFrames.jl 1.4 is that
before Julia 1.7 it was not possible to support the behavior we wanted.
Therefore we had to wait until Julia 1.7 was released to achieve consistency
between property access and the ! row selector behavior in DataFrames.jl in all
cases.
So people porting code from DataFrames.jl earlier than 1.4 or from Julia earlier
than 1.7 will notice a change: previously df.col .= value was in-place,
and currently it allocates a new column. We considered the risk that this
change would be problematic for users, and the conclusion was that:
- In the vast majority of code this change will not affect the result and users
will not even notice it.
- It might affect code that is performance critical (as an extra copy is now
made). However, if this is the case it is easy to fix the issue (and it will
not cause the code to break).
- It might affect code that relied on the fact that previously a conversion
was made during in-place assignment (and e.g. relied on the fact that such
assignment does not change the element type of the column). In this case such
code would indeed be broken. Here the conclusion was that such cases,
although possible in practice, are most likely quite rare and if present would
affect only experienced Julia users who have a good grasp of the conversion upon
assignment rules in Base Julia. We concluded that such users would be able
to identify the problematic cases and change df.col .= value to
df[:, :col] .= value in their code to get back the old behavior.
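As a sketch of that fix, the following checks that the : row selector reuses the stored vector and keeps its element type. The names df_old and stored are hypothetical and used only for illustration:

```julia
using DataFrames

df_old = DataFrame(col=1:3)   # hypothetical example data
stored = df_old.col

# In-place broadcasted assignment: the stored vector is reused,
# values are converted, and the element type stays Int64
df_old[:, :col] .= 2.0
still_same_vector = df_old.col === stored
still_int = eltype(df_old.col) == Int64
```

Code that depends on the conversion or on avoiding the extra allocation can use this form explicitly.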
Conclusions
I hope you found this post useful in building your understanding of how
the details of indexing in DataFrames.jl work and what their intended
use cases are.
Additionally, I wanted to present the mental process we went through when we
were making hard design decisions in the DataFrames.jl development team. In this
process we had to balance three things:
- user convenience (especially taking into account the fact that the target
audience of DataFrames.jl are data scientists, who sometimes are not computer
science experts);
- internal (within DataFrames.jl) consistency of rules that one needs to learn;
- minimization of cases when we diverge from what Base Julia defines for similar
objects (the challenge was that a data frame is neither a matrix nor a struct,
and has different design requirements, but we borrow the syntax from both).
In this post I have not covered all the details of DataFrames.jl indexing rules.
If you want to learn about the indexing behavior in every supported scenario
please check the Indexing section of the DataFrames.jl manual.