Author Archives: Blog by Bogumił Kamiński

Working with rows of Tables.jl tables

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/09/22/tables.html

Introduction

Three weeks ago I wrote a post about getting a schema of Tables.jl tables.
Today, to complement it, I want to discuss how one can get rows of such tables.

The post was written using Julia 1.9.2, Tables.jl 1.11.0, DataAPI.jl 1.15.0, and DataFrames.jl 1.6.1.

Why is getting rows of a table needed?

Many Julia users are happy with using DataFrames.jl to work with their tables.
However, this is only one of the available options.
This means that users, especially package creators, often prefer not to hardcode DataFrame
as the specific type that their package supports, but to allow for generic Tables.jl tables.

An example of such a need is a function that takes a generic table and
splits it into train-validation-test subsets. To achieve this you need to be able
to take a subset of its rows.

How is row sub-setting supported in Tables.jl?

There are two functions that, in combination, can be used to generically subset a Tables.jl table:

  • the DataAPI.nrow function that returns a number of rows in a table;
  • the Tables.subset function that allows you to get a subset of rows of a table.

Before I show you how they work, let me highlight one issue. Most Tables.jl tables
support these functions. However, their support is not guaranteed. The reason is that some tables
are never materialized in memory, e.g. they are only a stream of rows that can be read only once.
In such a case we do not know the number of rows in the table (as it is dynamic) and, similarly,
to get a subset of its rows you would need to scan the whole stream anyway.

Using the row sub-setting interface of Tables.jl

The DataAPI.nrow function is easy to understand. You pass it a table and in return you get the number of its rows.
Let us see it in practice:

julia> using DataAPI

julia> using Tables

julia> table = (a=1:10, b=11:20, c=21:30)
(a = 1:10, b = 11:20, c = 21:30)

julia> DataAPI.nrow(table)
10

The Tables.subset function accepts two positional arguments. The first is a table, and the second
is the 1-based row indices that should be picked. You have two options for passing indices.
You can pass a single integer index like this:

julia> Tables.subset(table, 2)
(a = 2, b = 12, c = 22)

In this case you get a single row of a table.
The other option is to pass a collection of indices, in which case you get a table (not a single row):

julia> Tables.subset(table, 2:3)
(a = 2:3, b = 12:13, c = 22:23)

To see that indeed it works for other tables, let us check a DataFrame from DataFrames.jl:

julia> using DataFrames

julia> df = DataFrame(table)
10×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1     11     21
   2 │     2     12     22
   3 │     3     13     23
   4 │     4     14     24
   5 │     5     15     25
   6 │     6     16     26
   7 │     7     17     27
   8 │     8     18     28
   9 │     9     19     29
  10 │    10     20     30

julia> nrow(df)
10

julia> Tables.subset(df, 2)
DataFrameRow
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   2 │     2     12     22

julia> Tables.subset(df, 2:3)
2×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     2     12     22
   2 │     3     13     23

Again, note that Tables.subset(df, 2) returned DataFrameRow (a single row of a table),
while Tables.subset(df, 2:3) returned a DataFrame (a table).
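As a sketch of the train-validation-test use case mentioned earlier, the two functions can be combined as follows. The split_rows helper, its keyword arguments, and the default proportions are hypothetical names chosen for illustration (they are not part of Tables.jl), and the code assumes that the passed table supports both DataAPI.nrow and Tables.subset.

```julia
using DataAPI, Tables, Random

# Hypothetical helper (not part of Tables.jl): randomly split the rows of
# a generic table into three parts. It assumes that the table supports
# both DataAPI.nrow and Tables.subset.
function split_rows(table; train=0.6, valid=0.2, rng=Random.default_rng())
    n = DataAPI.nrow(table)
    idx = randperm(rng, n)              # shuffled 1-based row indices
    ntrain = floor(Int, train * n)
    nvalid = floor(Int, valid * n)
    return (train = Tables.subset(table, idx[1:ntrain]),
            valid = Tables.subset(table, idx[ntrain+1:ntrain+nvalid]),
            test  = Tables.subset(table, idx[ntrain+nvalid+1:end]))
end

table = (a=1:10, b=11:20, c=21:30)
parts = split_rows(table; rng=MersenneTwister(1))
```

Every row of the source table ends up in exactly one of the three parts. Because only the generic interface is used, the same function should also accept a DataFrame, in which case the parts are data frames.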

Advanced sub-setting options

If you work with large tables, performance and memory consumption often become a concern.
For Tables.subset this comes down to the question of whether the function copies data
or just makes a view of the source table. This is controlled by the viewhint keyword argument.

Let us first see how it works:

julia> Tables.subset(df, 2:3, viewhint=true)
2×3 SubDataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     2     12     22
   2 │     3     13     23

julia> Tables.subset(df, 2:3, viewhint=false)
2×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     2     12     22
   2 │     3     13     23

As you can see viewhint=true returned a view (a SubDataFrame), while viewhint=false produced a copy.

Let us see another example:

julia> table2 = Tables.rowtable(df)
10-element Vector{NamedTuple{(:a, :b, :c), Tuple{Int64, Int64, Int64}}}:
 (a = 1, b = 11, c = 21)
 (a = 2, b = 12, c = 22)
 (a = 3, b = 13, c = 23)
 (a = 4, b = 14, c = 24)
 (a = 5, b = 15, c = 25)
 (a = 6, b = 16, c = 26)
 (a = 7, b = 17, c = 27)
 (a = 8, b = 18, c = 28)
 (a = 9, b = 19, c = 29)
 (a = 10, b = 20, c = 30)

julia> Tables.subset(table2, 2:3, viewhint=true)
2-element view(::Vector{NamedTuple{(:a, :b, :c), Tuple{Int64, Int64, Int64}}}, 2:3) with eltype NamedTuple{(:a, :b, :c), Tuple{Int64, Int64, Int64}}:
 (a = 2, b = 12, c = 22)
 (a = 3, b = 13, c = 23)

julia> Tables.subset(table2, 2:3, viewhint=false)
2-element Vector{NamedTuple{(:a, :b, :c), Tuple{Int64, Int64, Int64}}}:
 (a = 2, b = 12, c = 22)
 (a = 3, b = 13, c = 23)

As you can see viewhint=true produced a view of a vector, while viewhint=false made a copy of source data.

Now you might ask why the keyword argument is called viewhint. The reason is that not all Tables.jl tables allow
for the flexibility of making either a view or a copy. Therefore the rules are as follows:

  • if viewhint is not passed then the table decides on its own whether it returns a copy or a view (depending on what is possible);
  • if viewhint=true then the table should return a view, but if this is not possible it can return a copy;
  • if viewhint=false then the table should return a copy, but if this is not possible it can return a view.

In other words viewhint should be considered as a performance hint only.
It does not guarantee to produce what you ask for (as for some tables satisfying this request might be impossible).
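The practical difference between a view and a copy shows up when the source table is mutated after sub-setting. Here is a minimal sketch using a vector of named tuples (the same kind of table as table2 above), for which the hint is honored; the variable names are mine, and other table types are free to ignore the hint.

```julia
using Tables

# A row table: a vector of named tuples.
rows = [(a = 1, b = 11), (a = 2, b = 12), (a = 3, b = 13)]

v = Tables.subset(rows, 2:3, viewhint=true)   # a view for this table type
c = Tables.subset(rows, 2:3, viewhint=false)  # a copy for this table type

# Mutate the source after sub-setting.
rows[2] = (a = 100, b = 12)

view_sees_change = v[1].a == 100  # the view reflects the new row
copy_unchanged   = c[1].a == 2    # the copy keeps the old row
```

Since viewhint is only a hint, generic code should not rely on either outcome unless it knows the concrete table type it is working with.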

Conclusions

To summarize: if you want to write a generic function that subsets a Tables.jl table then you can use:

  • the DataAPI.nrow function to learn how many rows it has;
  • the Tables.subset function to get a subset of its rows using 1-based indexing.

I hope these examples are useful for your work.

Does DataFrames.jl copy or not copy, that is the question

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/09/15/copying.html

Introduction

Some time ago I wrote a post about my thoughts on copying of data when working with it in Julia.

Today I want to focus on a related, but narrower topic concerning DataFrames.jl.
People starting to work with this package are sometimes confused about when columns
get copied and when they do not. I want to discuss the most common cases in this post.

Spoiler! The post is a bit long. If you want simple advice, you can skip to the section with conclusions.

The post was written using Julia 1.9.2 and DataFrames.jl 1.6.1.

Getting a column from a data frame

Let us start with a simpler case. When does copying happen if we get a column from a data frame?

First we set up some initial data:

julia> using DataFrames

julia> df = DataFrame(a=1:10^6)
1000000×1 DataFrame
     Row │ a
         │ Int64
─────────┼─────────
       1 │       1
       2 │       2
       3 │       3
       4 │       4
       5 │       5
    ⋮    │    ⋮
  999997 │  999997
  999998 │  999998
  999999 │  999999
 1000000 │ 1000000
999991 rows omitted

There are three ways to get the :a column from this data frame: df.a, df[:, :a] and df[!, :a].
Let us check them one by one. Start with df.a:

julia> df.a
1000000-element Vector{Int64}:
       1
       2
       3
       4
       5
       6
       7
       ⋮
  999995
  999996
  999997
  999998
  999999
 1000000

julia> @allocated df.a
0

df.a extracts the column without copying data. You can see it by the fact that there are no allocations performed in this operation.

Now check df[:, :a], which uses a standard row index : that is also used in arrays:

julia> df[:, :a]
1000000-element Vector{Int64}:
       1
       2
       3
       4
       5
       6
       7
       ⋮
  999995
  999996
  999997
  999998
  999999
 1000000

julia> @allocated df[:, :a]
8000048

df[:, :a] copies data: we see a lot of memory allocated this time. This is identical to how : works for arrays.

Finally check df[!, :a], which uses a non-standard ! row index:

julia> df[!, :a]
1000000-element Vector{Int64}:
       1
       2
       3
       4
       5
       6
       7
       ⋮
  999995
  999996
  999997
  999998
  999999
 1000000

julia> @allocated df[!, :a]
0

We can see that df[!, :a] does not allocate. It is equivalent to df.a, just with slightly different syntax
(the indexing syntax with ! is handy if we want to select multiple columns from a data frame, which is not possible with the df.a syntax).
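Besides looking at allocations, the same conclusions can be checked with the === operator, which tests whether two names refer to the very same vector. This is a small sketch on a tiny data frame; the variable names are only for illustration.

```julia
using DataFrames

df = DataFrame(a=1:5)

# df.a and df[!, :a] give the very same vector stored in the data frame.
same_object = df.a === df[!, :a]

# df[:, :a] gives a new vector with equal contents.
fresh = df[:, :a]
is_copy = fresh !== df.a && fresh == df.a

# Mutating the non-copied column changes the data frame ...
col = df.a
col[1] = 100
df_changed = df.a[1] == 100

# ... while mutating the copy leaves the data frame untouched.
fresh[2] = -1
df_untouched = df.a[2] == 2
```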

This part was relatively easy. Now let us turn to a harder case of setting a column of a data frame.

Case 1: setting a column in a data frame using assignment

First store the :a column in a temporary variable a (without copying it):

julia> a = df.a
1000000-element Vector{Int64}:
       1
       2
       3
       4
       5
       6
       7
       ⋮
  999995
  999996
  999997
  999998
  999999
 1000000

Now let us check various ways of creating a column that will store a.
Begin with creating a new column:

julia> df.b = a
1000000-element Vector{Int64}:
       1
       2
       3
       4
       5
       6
       7
       ⋮
  999995
  999996
  999997
  999998
  999999
 1000000

julia> df.b === a
true

We can see that if we put df.b on the left hand side the operation does not copy the passed data.
You can probably already guess that the same happens with df[!, :c] on the left hand side. Indeed
this is the case:

julia> df[!, :c] = a
1000000-element Vector{Int64}:
       1
       2
       3
       4
       5
       6
       7
       ⋮
  999995
  999996
  999997
  999998
  999999
 1000000

julia> df.c === a
true

What about df[:, :d]? Let us see:

julia> df[:, :d] = a
1000000-element Vector{Int64}:
       1
       2
       3
       4
       5
       6
       7
       ⋮
  999995
  999996
  999997
  999998
  999999
 1000000

julia> df.d === a
false

So we see the first difference: when a new column is created with df[:, :d] = a, the data is copied.
But what would happen if the column already existed in the data frame?

Well, for the df.b and df[!, :c] syntaxes nothing would change, as they just put
the right hand side vector into a data frame without copying it.
But for df[:, :d] the situation is different. Let us check:

julia> d = df.d;

julia> df[:, :d] = a;

julia> df.d === a
false

julia> df.d === d
true

We can see that if we use the df[:, :d] syntax on the left hand side the operation is in-place,
that is, the vector already present in df is reused and the data is written into it.
This means that we cannot use df[:, :d] = ... to change
the element type of column :d. Let us see:

julia> df[:, :d] = a .+ 0.5;
ERROR: InexactError: Int64(1.5)

Indeed a .+ 0.5 contains floating point values, and the :d column allows only integers.
Note that with df.b = ... or df[!, :c] = ... we would not have this issue as they
replace columns with what is passed on the right hand side:

julia> df.b = a .+ 0.5
1000000-element Vector{Float64}:
      1.5
      2.5
      3.5
      4.5
      5.5
      6.5
      7.5
      ⋮
 999995.5
 999996.5
 999997.5
 999998.5
 999999.5
      1.0000005e6

There is one more twist to this story. It is related to ranges.
The issue is that a DataFrame object always materializes ranges
stored in it.
Therefore the following operation allocates data:

julia> df.b = 1:10^6
1:1000000

julia> df.b
1000000-element Vector{Int64}:
       1
       2
       3
       4
       5
       6
       7
       ⋮
  999995
  999996
  999997
  999998
  999999
 1000000

The issue is that generally df.b = ... does not allocate, but since we disallow storing
ranges as columns of a data frame (in our case the 1:10^6 range) the allocation still takes place.
You would have the same behavior with df[!, :c] = 1:10^6.
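The special treatment of ranges can be verified directly by comparing what is stored in the data frame with what was passed. This is a short sketch on small data; the names r and v are only for illustration.

```julia
using DataFrames

df = DataFrame(a=1:3)
r = 1:3

# A range on the right hand side is materialized into a Vector.
df.b = r
b_is_vector = df.b isa Vector{Int}

df[!, :c] = r
c_is_vector = df.c isa Vector{Int}

# A Vector on the right hand side is stored as-is, without copying.
v = collect(r)
df.d = v
d_is_same = df.d === v
```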

Case 2: setting a column in a data frame using broadcasted assignment

Julia is famous for its powerful broadcasting capabilities. Let us thus investigate what happens when we
replace = with .= in our experiments. We will reproduce all the examples we gave above from scratch.

Start with df.b .= a:

julia> df = DataFrame(a=1:10^6);

julia> a = df.a;

julia> df.b .= a;

julia> df.b === a
false

We now see a difference. The :b column is freshly allocated.

Let us check the two other options of creation of a new column:

julia> df[!, :c] .= a;

julia> df.c === a
false

julia> df[:, :d] .= a;

julia> df.d === a
false

They have the same effect: a new column gets allocated.

In the case of an existing column df.b .= ... and df[!, :c] .= ...
would again create a new copied column:

julia> df.b .= a .+ 0.5
1000000-element Vector{Float64}:
      1.5
      2.5
      3.5
      4.5
      5.5
      6.5
      7.5
      ⋮
 999995.5
 999996.5
 999997.5
 999998.5
 999999.5
      1.0000005e6

The difference is with df[:, :d] .= ...:

julia> d = df.d;

julia> df[:, :d] .= a;

julia> df.d === a
false

julia> df.d === d
true

julia> df[:, :d] .= a .+ 0.5
ERROR: InexactError: Int64(1.5)

So we see that we have here an in-place operation just like with df[:, :d] = ....
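One practical consequence of this difference is what happens to a vector that was earlier extracted from the data frame without copying. The sketch below uses small data and assumes the package versions listed at the top of this post.

```julia
using DataFrames

df = DataFrame(d=[1, 2, 3])
d = df.d                        # extracted without copying

# In-place: values are written into the existing vector, so d sees them.
df[:, :d] .= [10, 20, 30]
inplace_visible = d == [10, 20, 30] && df.d === d

# Replacement: the column is swapped for a freshly allocated vector,
# so d is no longer connected to df and the element type may change.
df[!, :d] .= [1.5, 2.5, 3.5]
replaced = df.d !== d && eltype(df.d) == Float64
old_vector_kept = d == [10, 20, 30]
```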

Conclusions

As a summary let me discuss a common anti-pattern:

df.a = df.b

Given the examples I presented we know that after this operation the :a and :b columns
of the df data frame are aliased, i.e. df.a === df.b produces true. Usually this is not
a desired situation as many operations assume that columns of a data frame do not share memory.

Fortunately, we also already learnt an easy fix to the aliasing problem. You can just write:

df.a .= df.b

This stores a copy of :b in column :a.
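To make the aliasing problem concrete, here is a small sketch of the anti-pattern and of two ways around it. The explicit copy call is an alternative that I add for comparison; the broadcasting behavior is the one described in this post for the listed package versions.

```julia
using DataFrames

df = DataFrame(b=[1, 2, 3])

# Anti-pattern: both columns are now the same vector.
df.a = df.b
aliased = df.a === df.b

# Changing one column silently changes the other.
df.a[1] = 100
b_changed = df.b[1] == 100

# Fix 1: broadcasted assignment stores a freshly allocated column.
df.a .= df.b
fixed_broadcast = df.a !== df.b && df.a == df.b

# Fix 2: an explicit copy on the right hand side.
df.a = copy(df.b)
fixed_copy = df.a !== df.b && df.a == df.b
```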

I hope the examples I gave in my post today will be useful for your work with DataFrames.jl.

Dropping columns from a data frame

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/09/08/dropcols.html

Introduction

One of the common tasks when working with a data frame is dropping some of its columns.
There are two ways to do it. You can either specify
which columns you want to keep or which columns you want to drop.

One of the frequent questions I get is how to do these operations with
DataFrames.jl in case the list of columns to keep or drop might not
be a subset of columns of the data frame. This is the topic I want to cover in today’s post.

The post was tested using Julia 1.9.2 and DataFrames.jl 1.6.1.

Standard column selection

First, create an example data frame:

julia> using DataFrames

julia> df = DataFrame(a=1, b=2, c=3, d=4)
1×4 DataFrame
 Row │ a      b      c      d
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      2      3      4

Now assume we want to keep columns :a and :c from it. We can do it by writing, for example:

julia> select(df, :a, :c)
1×2 DataFrame
 Row │ a      c
     │ Int64  Int64
─────┼──────────────
   1 │     1      3

You could also pass the columns through a variable, e.g. a vector:

julia> keep1 = [:a, :c]
2-element Vector{Symbol}:
 :a
 :c

julia> select(df, keep1)
1×2 DataFrame
 Row │ a      c
     │ Int64  Int64
─────┼──────────────
   1 │     1      3

Now, let us discuss dropping columns. Assume we want to keep all columns except columns :b and :d.
We can achieve this by using the Not selector:

julia> select(df, Not(:b, :d))
1×2 DataFrame
 Row │ a      c
     │ Int64  Int64
─────┼──────────────
   1 │     1      3

Also in this case we can use a helper variable:

julia> drop1 = [:b, :d]
2-element Vector{Symbol}:
 :b
 :d

julia> select(df, Not(drop1))
1×2 DataFrame
 Row │ a      c
     │ Int64  Int64
─────┼──────────────
   1 │     1      3

The problematic case: the selected column is not present in a data frame

In some scenarios, we might want to provide a list of columns of which not
all are present in the data frame. For example, assume we want to keep
columns :a and :x. We see that the :x column is not present in our
df data frame.

Before we move forward, let me comment on when such a situation occurs most often.
Assume you have 100 data frames that describe your data. Each data frame is similar,
but not identical. For example, a single data frame might represent data from one country
and the list of information for the countries does not have to be identical (for some
countries we might have more information, which results in more columns in a data frame).
When processing such data we might want to write one general condition on which columns
we want to keep or drop, and some of these columns might be present in only a subset of
all data frames.

Now let us go back to our example. Let us try keeping columns :a and :x:

julia> select(df, :a, :x)
ERROR: ArgumentError: column name "x" not found in the data frame; existing most similar names are: "a", "b", "c" and "d"

julia> keep2 = [:a, :x]
2-element Vector{Symbol}:
 :a
 :x

julia> select(df, keep2)
ERROR: ArgumentError: column name "x" not found in the data frame; existing most similar names are: "a", "b", "c" and "d"

We get an error. DataFrames.jl is designed to check, by default, that the operation you want to perform
on your data frame is valid. This is a conscious design decision. The reason is that in production
settings, when you say that you want to keep columns :a and :x, you most often assume that they are present in df.
Thus you want to get an error if they are not all present in it.

The same behavior can be observed for dropping columns. Assume we want to drop columns :b and :x:

julia> select(df, Not(:b, :x))
ERROR: ArgumentError: column name "x" not found in the data frame; existing most similar names are: "a", "b", "c" and "d"

julia> drop2 = [:b, :x]
2-element Vector{Symbol}:
 :b
 :x

julia> select(df, Not(drop2))
ERROR: ArgumentError: column name "x" not found in the data frame; existing most similar names are: "a", "b", "c" and "d"

So what we saw here is a default behavior that was designed to be safe.
In what follows let me discuss how to perform a flexible selection.

Performing column selection when some of them are not present in a data frame

There are several solutions for column selection when some of them are not present in a data frame.
Let me present the one that I find the most convenient.
For this operation I typically use the Cols selector. The reason is that you can pass
a condition function (a predicate) as an argument to Cols that will select columns
whose names meet a passed condition.

Therefore the following operation:

julia> select(df, keep1)
1×2 DataFrame
 Row │ a      c
     │ Int64  Int64
─────┼──────────────
   1 │     1      3

Is the same as:

julia> keep1s = string.(keep1)
2-element Vector{String}:
 "a"
 "c"

julia> select(df, Cols(in(keep1s)))
1×2 DataFrame
 Row │ a      c
     │ Int64  Int64
─────┼──────────────
   1 │     1      3

Note what we did here. The in(keep1s) expression produces a function that checks if a value passed to it is in the keep1s vector.
It is important to note that although column selection in DataFrames.jl accepts both symbols (like :a) and strings (like "a")
as column names, the Cols-based selector performs the check against strings. Therefore I had to convert the keep1 vector
of symbols to a keep1s vector of strings.

So far select(df, Cols(in(keep1s))) is more verbose than just writing select(df, keep1). However, the benefit of Cols
is that when the in(keep1s) check is done the keep1s vector can contain whatever values we like; in particular,
they do not have to be valid column names of our df.

Therefore to keep columns :a and :x, if we are unsure if these columns are present in df we can write:

julia> select(df, Cols(in(["a", "x"])))
1×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1

Note that this time the operation works without an error. And again, please keep in mind that in this selector we need to pass
column names as strings.

Now you can probably already tell how to pass a list of columns to drop, without requiring that they are present in df.
The only thing to do is to use the !in function (not-in) instead of in. Let us drop columns :b and :x from our
data frame (keeping in mind that :x is not present in it):

julia> select(df, Cols(!in(["b", "x"])))
1×3 DataFrame
 Row │ a      c      d
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      3      4

All worked as expected – the :b column was dropped and the :x column was ignored in the column dropping operation.
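If you need this pattern often, for instance when processing many similar data frames as in the scenario described earlier, it can be wrapped in small helper functions. The names keep_existing and drop_existing, as well as the df1 and df2 data frames, are hypothetical and chosen only for this sketch; they are not part of DataFrames.jl.

```julia
using DataFrames

# Hypothetical helpers (not part of DataFrames.jl). Column names can be
# passed as symbols or strings; they are converted to strings because
# the predicate passed to Cols is checked against strings.
keep_existing(df, cols) = select(df, Cols(in(string.(cols))))
drop_existing(df, cols) = select(df, Cols(!in(string.(cols))))

df1 = DataFrame(a=1, b=2, c=3, d=4)
df2 = DataFrame(a=5, x=6)

# The same list of columns works for data frames with different columns.
kept = [keep_existing(d, [:a, :x]) for d in (df1, df2)]
dropped = drop_existing(df1, [:b, :x])
```

Here names(kept[1]) contains only "a", while names(kept[2]) contains both "a" and "x". Keep in mind that such helpers silently ignore misspelled column names, which is exactly the check that the default behavior of select gives you.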

Conclusions

I hope you find the examples I gave today useful.

In general, the whole design of DataFrames.jl is similar to
what was discussed in this post. The default behavior is picked to be safe (as in our example: by default,
select checks if the columns you pass are present in a data frame), but it is possible to switch to an unsafe
mode relatively easily (in our example: using Cols with a predicate function).