How to get head or tail of a data frame in DataFrames.jl?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/06/18/first.html

Introduction

This time I thought to make a post that will help people who know R or Python
and are starting to use DataFrames.jl.

The point is that DataFrames.jl does not define head and tail functions,
but rather first and last. In other words I want to serve those
of you who have googled up "DataFrames.jl head" and landed on this page.

All codes were tested under Julia 1.6.1 and DataFrames.jl 1.1.1.

What Julia Base offers

In Julia Base if you have some collection, e.g. a vector, you can use the
first and last functions to get their head/tail as follows:

julia> x = [1, 2, 3, 4, 5]
5-element Vector{Int64}:
 1
 2
 3
 4
 5

julia> first(x)
1

julia> last(x)
5

julia> first(x, 2)
2-element Vector{Int64}:
 1
 2

julia> last(x, 2)
2-element Vector{Int64}:
 4
 5

As you can see there are two modes how these functions work:

  • if you pass just a collection then its first/last element is extracted out
    and returned;
  • if you pass a collection and a non-negative integer, then you get a collection
    holding the requested number of elements from the head/tail of the source
    collection.

Additionally the first and last functions have a nice feature that if you
pass them a non negative integer, then they do not fail, e.g.:

julia> first(x, 10)
5-element Vector{Int64}:
 1
 2
 3
 4
 5

This is a very convenient feature, especially when working with data, where you
do not know for sure that you will have enough observations. Consider a simple
scenario: we want to pick top three products per product category, and some
product categories might have less than tree entries (we will soon go back to
this example when we switch to DataFrames.jl).

The point is that with ordinary indexing if you request more elements than the
collection can hold you get:

julia> x[1:10]
ERROR: BoundsError: attempt to access 5-element Vector{Int64} at index [1:10]

so you have to write something like e.g.:

julia> x[1:min(10, end)]
5-element Vector{Int64}:
 1
 2
 3
 4
 5

which, while still nice, is not so easy to reason about. A more advanced issue
with indexing that first and last resolve is that not all collections are
1-based, so you have to be careful if you write generic code.

How DataFrames.jl works

Since getting a head/tail of a data frame is essentially the operation that
first and last functions from Julia Base offer these functions have
special methods defined in DataFrames.jl (and thus there is no need to introduce
new head and tail functions).

Let us see this at work:

julia> using DataFrames

julia> df = DataFrame(group=[1, 1, 1, 1, 2, 2], id=1:6)
6×2 DataFrame
 Row │ group  id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     1      3
   4 │     1      4
   5 │     2      5
   6 │     2      6

julia> first(df)
DataFrameRow
 Row │ group  id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1

julia> last(df)
DataFrameRow
 Row │ group  id
     │ Int64  Int64
─────┼──────────────
   6 │     2      6

julia> first(df, 2)
2×2 DataFrame
 Row │ group  id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2

julia> last(df, 2)
2×2 DataFrame
 Row │ group  id
     │ Int64  Int64
─────┼──────────────
   1 │     2      5
   2 │     2      6

As you can see all is consistent with Julia Base. If you just pass a single
data frame as an argument to first/last you get a DataFrameRow. If you
additionally pass an integer you get a DataFrame.

Let us now show why the behavior “take at most n elements”(instead of “take
exactly n elements”
) is useful. Consider we want to pick top 3 ids
per group. This is easily achievable as follows (we assume that the table was
already sorted in some meaningful way):

julia> combine(groupby(df, :group), sdf -> first(sdf, 3))
5×2 DataFrame
 Row │ group  id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     1      3
   4 │     2      5
   5 │     2      6

As you can see the operation worked smoothly, picking top three rows for group
1, but only two for group 2 (as this group held only two rows). This is the
behavior that I think is most convenient when doing split-apply-combine
strategy.

Conclusions

This post is aimed mostly at people starting to work with DataFrames.jl.
There are two messages I wanted to convey:

  • a direct one: use first and last functions if you want to get a head/tail
    of your data frame;
  • a more general one: DataFrames.jl is designed to follow the API provided by
    Julia Base. So if some functions exist there (like first and last) you can
    in general expect that these functions will have methods defined also in
    DataFrames.jl that work consistently (of course provided that some operation
    makes sense in this context).