Re-posted from: https://bkamins.github.io/julialang/2023/02/24/dfrows.html
Introduction
In my recent post I discussed what interfaces Julia defines for
working with containers. Today I want to take a closer look at data frame
objects that are defined in DataFrames.jl.
Before I move forward I want to make a small announcement. On my blog I have
recently added a Learning section, where I collect a list of learning
materials that I find useful for doing data science with Julia. If you would
like to have an item added to this list please contact me.
This post was written under Julia 1.9.0-beta4 and DataFrames.jl 1.5.0.
Interfaces refresher
Let me start with recalling the discussion we had in this post about
data frame design:
- A data frame is not iterable. Instead, if you want to iterate its rows
  use the eachrow wrapper, and if you want to iterate its columns use the
  eachcol wrapper.
- You can index a data frame like a matrix, but you are always required to pass
  both row and column indices (in other words: linear indexing is not
  supported).
- In broadcasting a data frame behaves as a matrix (two dimensional container).
- You can get columns of a data frame by their name using property access.
For this reason users are often surprised when they read that a data frame
is considered to be a collection of rows. From the way it supports the
standard interfaces it does not seem that this is the case.
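The four points listed above can be illustrated with a minimal sketch. It assumes DataFrames.jl 1.5 or later; the data frame and its column names a and b are made up for illustration.

```julia
using DataFrames

# A small example data frame (hypothetical data)
df = DataFrame(a=[1, 2, 3], b=[10, 20, 30])

# Iteration needs an explicit wrapper: eachrow for rows, eachcol for columns
row_sums = [row.a + row.b for row in eachrow(df)]
col_sums = [sum(col) for col in eachcol(df)]

# Matrix-like indexing: both a row and a column index are required
value = df[2, :b]

# Broadcasting treats the data frame as a two dimensional container
doubled = df .* 2

# Property access returns a column by its name
col_a = df.a
```

Calling iterate(df) or using linear indexing such as df[1] is expected to throw an error instead of guessing whether rows or columns were meant.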
What we are missing from the whole picture is that the four above interfaces are
related to only a limited number of functions:
- Iteration: iterate.
- Indexing: getindex, setindex!, firstindex, lastindex, size.
- Broadcasting: axes, broadcastable, BroadcastStyle, similar, copy, copyto!.
- Property access: getproperty and setproperty!.
In general, DataFrames.jl exposes over 120 functions that work on data frame
objects, and 38 of them are methods extending functions that are defined in
Base Julia and work on collections. All these functions consider a data frame
to be a collection of rows.
In what follows I will go through all of them so that DataFrames.jl users have
an easy reference to them in one place.
Row sorting and reordering
We support sort, sort!, sortperm, issorted, permute!, invpermute!, reverse!,
reverse, shuffle!, and shuffle functions that work on data frame rows.
Here let me remark that in particular shuffling functions are often quite handy
when preparing data to be passed to various statistical models.
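As a minimal sketch of how some of these functions treat rows as the elements of the collection, consider the following. It assumes DataFrames.jl 1.5 or later; the data, the column names, and the seed are made up for illustration.

```julia
using DataFrames
using Random

# Hypothetical data with rows out of order
df = DataFrame(id=[3, 1, 2], name=["c", "a", "b"])

# Sort rows by the id column; sort returns a new data frame
sorted = sort(df, :id)

# Permutation of row indices that would sort df by id
perm = sortperm(df, :id)

# Check whether rows are sorted by id
was_sorted = issorted(df, :id)
is_sorted = issorted(sorted, :id)

# Reverse the order of rows
reversed = reverse(sorted)

# Shuffle rows; a seeded generator makes the result reproducible
shuffled = shuffle(MersenneTwister(1234), df)
```

Whole rows are moved together, so each id stays paired with its name whatever the final order is.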
Dropping rows
We support deleteat!, keepat!, empty!, empty, filter!, filter, first, last,
and resize!.
Let me mention that resize! allows you not only to drop rows from a data frame
but also to add them (although it is not often used).
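Here is a minimal sketch of a few of the row-dropping functions, assuming DataFrames.jl 1.5 or later. The data and column names are made up for illustration, and only the shrinking use of resize! is shown.

```julia
using DataFrames

# Hypothetical data: six rows
df = DataFrame(id=1:6, x=[10, 20, 30, 40, 50, 60])

# Drop the first row in place
deleteat!(df, 1)

# Keep only rows 1 to 4 of what remains, in place
keepat!(df, 1:4)

# filter returns a new data frame with rows where x is greater than 30
large = filter(:x => >(30), df)

# first and last with a count return data frames with that many rows
top2 = first(df, 2)
bottom1 = last(df, 1)

# Shrink the data frame in place to its first two rows
resize!(df, 2)
```

The functions ending with ! modify df in place, while filter, first, and last leave it unchanged and return a new object.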
In addition there is an isempty function that checks if a data frame has zero
rows. It is important to remember that an empty data frame is not required
to have no columns:
julia> using DataFrames
julia> df = DataFrame(a=1, b=2)
1×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 2
julia> isempty(df)
false
julia> empty!(df)
0×2 DataFrame
Row │ a b
│ Int64 Int64
─────┴──────────────
julia> isempty(df)
true
Remember that if the DataFrames.jl documentation says that a data frame is empty
it means that it has zero rows (but it does not say anything about the number
of columns).
Adding rows
You can add single rows to a data frame using push!, pushfirst!, and insert!
or collections of rows (in general Tables.jl tables) using append! and
prepend!.
Related functions are repeat! and repeat to repeat rows in a data frame.
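The functions for adding rows can be sketched as follows, assuming DataFrames.jl 1.5 or later. The data and column names are made up for illustration.

```julia
using DataFrames

# Hypothetical starting point: a single row
df = DataFrame(a=[2], b=["y"])

# Add a single row at the end (here passed as a NamedTuple)
push!(df, (a=3, b="z"))

# Add a single row at the beginning (here passed as a Tuple)
pushfirst!(df, (1, "x"))

# Insert a single row so that it becomes row number 2
insert!(df, 2, (a=10, b="w"))

# Add a whole table at the end and at the beginning
append!(df, DataFrame(a=[4], b=["q"]))
prepend!(df, DataFrame(a=[0], b=["p"]))

# repeat returns a new data frame with all rows repeated twice
twice = repeat(df, 2)
```

All functions except repeat modify df in place; repeat! is the in-place counterpart of repeat.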
Row extraction
There are four functions that allow you to extract one row from a data frame:
only, pop!, popat!, and popfirst!.
I often find the only function useful, when I want to explicitly verify
the contract that some operation returned a data frame with only one row.
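A minimal sketch of these four functions follows, assuming DataFrames.jl 1.5 or later. The data and column names are made up for illustration.

```julia
using DataFrames

# Hypothetical data: four rows
df = DataFrame(a=[1, 2, 3, 4], b=["w", "x", "y", "z"])

# Remove and return the last row
last_row = pop!(df)

# Remove and return the first row
first_row = popfirst!(df)

# Remove and return the row at position 1 of what remains
middle_row = popat!(df, 1)

# df now has exactly one row, so only succeeds and returns it;
# with zero or several rows it would throw an error instead
single_row = only(df)
```

Unlike the pop functions, only does not modify the data frame; it just checks that there is exactly one row and returns it.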
Identification of missing values in rows
You can use completecases to find which rows do not contain missing values
and dropmissing! and dropmissing to drop the rows that do contain them.
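A minimal sketch of these functions, assuming DataFrames.jl 1.5 or later, could look as follows. The data and column names are made up for illustration.

```julia
using DataFrames

# Hypothetical data with missing values in both columns
df = DataFrame(a=[1, missing, 3], b=["x", "y", missing])

# Indicator of rows that contain no missing values in any column
complete = completecases(df)

# New data frame without rows that contain a missing value in any column
no_missing = dropmissing(df)

# The check can be restricted to selected columns, here only column a
no_missing_a = dropmissing(df, :a)
```

The dropmissing! function performs the same operation in place.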
Identification of unique rows
There are unique and unique! functions that return unique rows in a
source data frame. To get an indicator vector showing which rows are non-unique
use the nonunique function, and allunique checks if all rows in a data frame
are unique.
Here let me show an example of nonunique functionality that allows you to
choose which duplicates are highlighted (it was added in the 1.5 release):
julia> df = DataFrame(a=[1, 2, 1, 3, 1])
5×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 2
3 │ 1
4 │ 3
5 │ 1
julia> nonunique(df) # by default first duplicate is kept
5-element Vector{Bool}:
0
0
1
0
1
julia> nonunique(df, keep=:last) # keep last duplicate
5-element Vector{Bool}:
1
0
1
0
0
julia> nonunique(df, keep=:noduplicates) # do not keep any duplicates
5-element Vector{Bool}:
1
0
1
0
1
The keep keyword argument name is used because often false in the returned
vector is meant to indicate which rows should be later kept in a data frame
(the same keyword argument name is consistently used in unique and unique!).
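To show how the keep keyword argument carries over to unique, here is a minimal sketch that reuses the data from the nonunique example. It assumes DataFrames.jl 1.5 or later.

```julia
using DataFrames

df = DataFrame(a=[1, 2, 1, 3, 1])

# By default the first occurrence of each duplicated row is kept
keep_first = unique(df)

# Keep the last occurrence of each duplicated row
keep_last = unique(df, keep=:last)

# Keep only rows that have no duplicates at all
no_dups = unique(df, keep=:noduplicates)

# Check whether all rows are unique
all_rows_unique = allunique(df)
```

In each case the rows that are kept are the ones for which nonunique with the same keep value returns false.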
Conclusions
As you could see in this post there are many functions in Base Julia that
support working with collections. In DataFrames.jl we wanted users to be able
to reuse these functions when working with data frames. Therefore all of them
are supported and they consider data frames to be collections of rows.
Sometimes it is useful to have an iterable and indexable collection of,
respectively, rows and columns of a data frame. For this reason we provide the
eachrow and eachcol wrappers that provide this functionality. As a
consequence, for clarity and to minimize the risk of error on the user's side,
a data frame that is not wrapped is not iterable and behaves like a matrix in
indexing and broadcasting.