Re-posted from: https://bkamins.github.io/julialang/2021/01/30/bang.html
Introduction
I have recently noticed that DataFrames.jl users use ! as a row selector for a data frame a lot.
Over a year ago, when we took data frame indexing seriously, there was a very big debate whether ! should be allowed in expressions like df[!, :a] to get the :a column without copying. The conclusion was that we need to have it, but our intention was that it would be reserved for advanced uses only, while in normal circumstances a user would not need to even know that it exists.
In this post let me review the use-cases of ! and comment on its alternatives.
This post was written under Julia 1.5.3 and DataFrames 0.22.4.
First we set up the environment:
julia> using DataFrames
julia> df = DataFrame(col1=1:3, col2='a':'c')
3×2 DataFrame
Row │ col1 col2
│ Int64 Char
─────┼─────────────
1 │ 1 a
2 │ 2 b
3 │ 3 c
Reading a single column from a data frame
If you want to get a single column :col1 from a data frame df you have the following options:
- df[!, :col1], df[!, "col1"], df.col1, and df."col1": get you the column without copying;
- df[:, :col1] and df[:, "col1"]: get you a copy of the column.
As you see, to get a single column without copying it is usually much easier to write df.col1 than e.g. df[!, :col1], and the operation has exactly the same result.
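The copying rules listed above can be checked with the === operator, which tests whether two names refer to the same object. This is a minimal sketch; the variable names same_bang, same_string and same_colon are only illustrative.

```julia
using DataFrames

df = DataFrame(col1=1:3, col2='a':'c')

# `===` checks object identity, so `true` means that no copy was made.
same_bang = df[!, :col1] === df.col1     # true: the stored vector is returned
same_string = df[!, "col1"] === df.col1  # true: string names behave the same way
same_colon = df[:, :col1] === df.col1    # false: `:` allocates a fresh copy

# The copy still holds equal values.
equal_values = df[:, :col1] == df.col1   # true
```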
The only case when df[!, :col1] is more convenient is when you have a column name stored in a variable. Then the following are equivalent:
julia> v = :col1
:col1
julia> df[!, v]
3-element Array{Int64,1}:
1
2
3
julia> getproperty(df, v)
3-element Array{Int64,1}:
1
2
3
and indeed using ! is a bit more convenient in this case, as you cannot pass the variable v to an expression like df.col1.
Reading multiple columns from a data frame
If you want to get two columns [:col1, :col2] from a data frame df you have the following options (I am leaving out the string version and other column selectors we support for simplicity):
- df[!, [:col1, :col2]] and select(df, [:col1, :col2], copycols=false): create a new data frame (a fresh wrapper object is allocated) but the columns of the new data frame are taken from df;
- df[:, [:col1, :col2]] and select(df, [:col1, :col2]): get you a new data frame with columns copied.
Note that for multiple column selection you can alternatively use the select function. The difference between select and indexing is that select returns a data frame even if a single column is selected, e.g. like this:
julia> select(df, 1)
3×1 DataFrame
Row │ col1
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 3
while as we have explained above we have:
julia> df[!, 1]
3-element Array{Int64,1}:
1
2
3
Note that because the df[!, [:col1, :col2]] syntax does not copy the columns, this operation is generally not recommended. Using such a data frame often leads to very hard-to-find bugs, because when you modify the contents of the columns of the newly created data frame the source is mutated as well.
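To make the aliasing hazard concrete, here is a small sketch of how a write to the non-copying selection leaks back into the source, while a write to the copying selection does not. The names shared and copied are only illustrative.

```julia
using DataFrames

df = DataFrame(col1=1:3, col2='a':'c')

shared = df[!, [:col1, :col2]]  # new wrapper, the same column vectors as df
copied = df[:, [:col1, :col2]]  # new wrapper, freshly copied column vectors

shared.col1[1] = 100  # also changes df.col1, as the vector is shared
copied.col1[2] = 200  # leaves df untouched

df.col1               # now holds 100, 2, 3
```

If you need an independent data frame, the : row selector or select with the default copycols=true avoids this problem.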
Making a view of a data frame
In this case we have:
julia> view(df, !, :col1)
3-element view(::Array{Int64,1}, :) with eltype Int64:
1
2
3
julia> view(df, !, [:col1, :col2])
3×2 SubDataFrame
Row │ col1 col2
│ Int64 Char
─────┼─────────────
1 │ 1 a
2 │ 2 b
3 │ 3 c
and the views are exactly the same as if we used view(df, :, :col1) and view(df, :, [:col1, :col2]) respectively.
In this case ! is supported mainly to allow an easy annotation of whole expressions using data frame indexing with @views, e.g. imagine you have the following code:
julia> x = [1, 2, 3, 4]
4-element Array{Int64,1}:
1
2
3
4
julia> df[!, 1] + x[1:3]
3-element Array{Int64,1}:
2
4
6
and in order to avoid copying the x[1:3] slice you want to annotate the whole expression with @views. Thanks to the fact that ! is supported with view you can just write:
julia> @views df[!, 1] + x[1:3]
3-element Array{Int64,1}:
2
4
6
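As a quick sanity check of the statement that views created with ! and : are the same, the sketch below writes through one view and reads the value back through the other view and through the parent column. The names v_bang and v_colon are only illustrative.

```julia
using DataFrames

df = DataFrame(col1=1:3, col2='a':'c')

v_bang = view(df, !, :col1)
v_colon = view(df, :, :col1)

v_bang[1] = 100            # a write through the view reaches the parent column
from_parent = df.col1[1]   # 100
from_colon = v_colon[1]    # 100, both views look at the same memory

x = [1, 2, 3, 4]
# Under @views both indexing expressions become views, so x[1:3] is not copied.
result = @views df[!, 1] + x[1:3]
```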
Assigning to a single column
The difference between df[!, :col1] = 11:13 and df[:, :col1] = 11:13 is that using ! puts a new column passed on the right hand side to the data frame without copying it (no matter if the column exists or not in the data frame), while : assigns to an existing column in-place.
Therefore df[!, :col1] = 11:13 is equivalent to df.col1 = 11:13. On the other hand df[:, :col1] = 11:13 is equivalent to df.col1[:] = 11:13, if the column :col1 is present in the data frame.
Here is an example:
julia> df2 = copy(df)
3×2 DataFrame
Row │ col1 col2
│ Int64 Char
─────┼─────────────
1 │ 1 a
2 │ 2 b
3 │ 3 c
julia> col1 = df2.col1
3-element Array{Int64,1}:
1
2
3
julia> df2[!, :col1] = 11:13
11:13
julia> col1
3-element Array{Int64,1}:
1
2
3
vs.
julia> df2 = copy(df)
3×2 DataFrame
Row │ col1 col2
│ Int64 Char
─────┼─────────────
1 │ 1 a
2 │ 2 b
3 │ 3 c
julia> col1 = df2.col1
3-element Array{Int64,1}:
1
2
3
julia> df2[:, :col1] = 11:13
11:13
julia> col1
3-element Array{Int64,1}:
11
12
13
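The examples above use a range on the right hand side. With a Vector on the right hand side the "without copying" part of the rule can be observed directly with ===. This is a minimal sketch; the names newcol and other are only illustrative.

```julia
using DataFrames

df2 = DataFrame(col1=1:3, col2='a':'c')
newcol = [11, 12, 13]

df2[!, :col1] = newcol               # stores newcol itself as the column
is_aliased = df2.col1 === newcol     # true: no copy was made

other = [21, 22, 23]
df2[:, :col1] = other                # writes the values into the existing column
is_other = df2.col1 === other        # false: the column is still the newcol object
updated = newcol == [21, 22, 23]     # true: newcol was updated in-place
```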
You might have noticed that when I described : I added a condition that it is equivalent to the getproperty syntax only when the column is present in the data frame. The reason is that if the column is not present in a data frame then we have:
julia> df
3×2 DataFrame
Row │ col1 col2
│ Int64 Char
─────┼─────────────
1 │ 1 a
2 │ 2 b
3 │ 3 c
julia> newcol = [11, 12, 13]
3-element Array{Int64,1}:
11
12
13
julia> df[:, :newcol] = newcol
3-element Array{Int64,1}:
11
12
13
julia> df
3×3 DataFrame
Row │ col1 col2 newcol
│ Int64 Char Int64
─────┼─────────────────────
1 │ 1 a 11
2 │ 2 b 12
3 │ 3 c 13
julia> df.newcol === newcol
false
So instead of an in-place operation (which is not possible as the column is not
present in the data frame), we get a copy operation.
On the other hand:
julia> df.newcol2[:] = newcol
ERROR: ArgumentError: column name :newcol2 not found in the data frame; existing most similar names are: :newcol
just fails as there is no column to index into.
The other special case is SubDataFrame, where using ! for assignment is not allowed, just like for getproperty syntax:
julia> dfv = view(df, :, :)
3×3 SubDataFrame
Row │ col1 col2 newcol
│ Int64 Char Int64
─────┼─────────────────────
1 │ 1 a 11
2 │ 2 b 12
3 │ 3 c 13
julia> dfv[!, :col1] = 1:3
ERROR: ArgumentError: setting index of SubDataFrame using ! as row selector is not allowed
julia> dfv.col1 = 1:3
ERROR: ArgumentError: Replacing or adding of columns of a SubDataFrame is not allowed. Instead use `df[:, col_ind] = v` or `df[:, col_ind] .= v` to perform an in-place assignment.
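As the second error message suggests, the : row selector is the way to update a column through a SubDataFrame. The sketch below, which assumes a view over rows 2 and 3 only, shows that such an assignment is done in-place and is visible in the parent data frame.

```julia
using DataFrames

df = DataFrame(col1=1:3, col2='a':'c')
dfv = view(df, 2:3, :)      # a SubDataFrame over rows 2 and 3

dfv[:, :col1] = [20, 30]    # in-place assignment through the view
dfv[:, :col2] .= 'z'        # in-place broadcasting assignment through the view

df.col1                     # now holds 1, 20, 30
df.col2                     # now holds 'a', 'z', 'z'
```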
Assigning to multiple columns
This case is a bit simpler than the single column assignment case above. The reason is that we do not allow creating new columns when multiple columns are selected. Therefore the rule is: df[!, [:col1, :col2]] = new_values replaces columns :col1 and :col2 in df, while df[:, [:col1, :col2]] = new_values updates them in-place.
Note that new_values must be either a data frame or a matrix, and for ! the columns in df will always be freshly allocated.
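A short sketch of the rule above, using a matrix as new_values. The data frame with two integer columns a and b is made up for this illustration, so that a single matrix can feed both columns.

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6)
old_a = df.a

# `:` updates the existing columns in-place, so old_a sees the new values.
df[:, [:a, :b]] = [10 40; 20 50; 30 60]
inplace_same = df.a === old_a       # true

# `!` replaces the columns with freshly allocated ones,
# so the element type is allowed to change.
df[!, [:a, :b]] = [1.5 4.5; 2.5 5.5; 3.5 6.5]
replaced_same = df.a === old_a      # false
new_eltype = eltype(df.a)           # Float64
```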
Broadcasting assignment to a single column
This is the point where a bit of complexity is introduced, as now getproperty syntax (i.e. df.col) behaves similarly to : indexing and not to ! indexing.
The rules are the following:
- df[!, :col] .= v allocates a new column and replaces the old one or, if :col is not present in df, allocates and adds it;
- df[:, :col] .= v updates the column in-place or, if :col is not present in df, allocates and adds it;
- df.col .= v is only allowed if col is present in df and operates in-place.
Note that if :col is not present in df then using ! and : are equivalent.
Also note that in SubDataFrame it is not allowed to add new columns and ! syntax is not allowed.
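The rules for the ! and : row selectors can be sketched as follows. The column names :x and :y are made up for this illustration, and df.col .= v is left out because its behavior is described above for the versions this post was written under.

```julia
using DataFrames

df = DataFrame(col=1:3)
old = df.col

df[:, :col] .= 0                  # in-place: zeros are written into the existing vector
colon_same = df.col === old       # true

df[!, :col] .= 1.5                # a new Float64 column is allocated and replaces the old one
bang_same = df.col === old        # false
old_kept = old == [0, 0, 0]       # true: the old vector was not touched by `!`

df[:, :x] .= 'a'                  # :x is missing, so it is allocated and added
df[!, :y] .= 'b'                  # the same effect for a missing column
```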
Broadcasting assignment to multiple columns
Again this case is simpler than the single column broadcasting assignment case above. The reason is that we do not allow creating new columns when multiple columns are selected. Therefore the rule is: df[!, [:col1, :col2]] .= new_values replaces columns :col1 and :col2 in df, while df[:, [:col1, :col2]] .= new_values updates them in-place.
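The same distinction can be sketched for several columns at once. The data frame with integer columns a and b is made up for this illustration.

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6)
old_a = df.a

df[:, [:a, :b]] .= 0               # in-place update of both columns
inplace_ok = old_a == [0, 0, 0]    # true: the original vector was overwritten

df[!, [:a, :b]] .= 0.5             # both columns are replaced by new Float64 vectors
replaced = df.a !== old_a          # true
b_eltype = eltype(df.b)            # Float64
```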
Summary of the cases
Wrapping up the cases we see that ! means the following:
- in selection context: get me a column or a data frame without copying columns;
- in views: make me a view (the same as : row selector);
- in assignment to a single column: replace or add the column to a data frame without copying;
- in assignment to multiple columns: replace the columns in a data frame with copying;
- in broadcasting assignment: allocate a new column and store it (and in the case of a single column selector optionally add it if it is missing).
And : means the following:
- in selection context: get me a column or data frame with copying of columns;
- in views: make me a view (the same as ! row selector);
- in assignment to a single column: change the column in-place or add the column to a data frame with copying;
- in assignment to multiple columns: change the columns in-place in a data frame;
- in broadcasting assignment: perform in-place update of columns (and in the case of a single column selector optionally allocate and add it if it is missing).
Finally getproperty (the df.col style) means the following:
- in selection context: get me a column without copying;
- in assignment: replace or add the column to a data frame without copying;
- in broadcasting assignment: update an existing column in-place.
In short (simplifying a bit):
- ! gets you columns without copying and when setting columns it replaces them;
- : gets you columns with copying and when setting columns it does this in-place;
- getproperty gets you columns without copying and when setting columns it replaces them, except for broadcasting assignment, when it updates them in-place.
From a practical perspective the major difference between in-place and replace
operations is that replacing columns is needed if new values have a different
type than the old ones.
For instance here ! works and : fails:
julia> df
3×2 DataFrame
Row │ col1 col2
│ Int64 Char
─────┼─────────────
1 │ 1 a
2 │ 2 b
3 │ 3 c
julia> df[:, :col1] .= "a"
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Int64
julia> df[!, :col1] .= "a"
3-element Array{String,1}:
"a"
"a"
"a"
julia> df
3×2 DataFrame
Row │ col1 col2
│ String Char
─────┼──────────────
1 │ a a
2 │ a b
3 │ a c
Another practical limitation is that broadcasting assignment like df.col .= v is not allowed when :col is not present in a data frame (there is a chance that in the future it will be allowed, see here).
Conclusions
As you can see there are cases when the ! row selector is needed to cover all potential use-cases. However, most common operations are done on a single column and in this case:
- for getting a column or assigning to a column, instead of df[!, :col] and df[!, :col] = v it is usually better to just write df.col and df.col = v respectively, as it is the same and simpler to type and read;
- currently the case where ! is really needed is the broadcasting assignment context, where df[!, :col] .= v is the only relatively nice way to freshly allocate a column with v broadcasted into it (but when I look at the code of DataFrames.jl users this pattern is used much less frequently than we expected when we designed the ecosystem).
I hope this post was helpful. If you are interested in a definitive
specification of all the indexing rules in DataFrames.jl you can find them
here.