Re-posted from: https://bkamins.github.io/julialang/2021/10/29/top10.html

Introduction

Today I will write about a common question asked when working with data that has
a general and nice, but not very well known solution in Julia.

The problem users often want to solve is how to get top-n values of some vector
or top-n rows in a data frame subject some condition (as usual we will use
DataFrames.jl data frame implementation). Here is an example of a recent
Stack Overflow post that posed exactly this question.

Before I move on let me make a small comment about the post from last week. Due
to my error when pushing things to GitHub its incorrect version was shared on
juliabloggers.com. The proper post about doing model selection using
cross validation can be found here.

In this the post I use Julia 1.6.3, BenchmarkTools 1.2.0, and DataFrames.jl
1.2.2.

Getting top-n values of a vector

In order to find top-n values in a vector use the partialsort function that is
shipped with Julia Base. Here is an example:

julia> using Random

julia> Random.seed!(1234)
MersenneTwister(1234)

julia> x = rand(10)
10-element Vector{Float64}:
 0.5908446386657102
 0.7667970365022592
 0.5662374165061859
 0.4600853424625171
 0.7940257103317943
 0.8541465903790502
 0.20058603493384108
 0.2986142783434118
 0.24683718661000897
 0.5796722333690416

julia> partialsort(x, 1:3, rev=true)
3-element view(::Vector{Float64}, 1:3) with eltype Float64:
 0.8541465903790502
 0.7940257103317943
 0.7667970365022592

Above we have selected three top values, but in general we could have passed any
range instead of 1:3. The key benefit of this approach, apart from directly
producing the requested result, is that it is very fast in comparison to e.g.
sorting the whole vector and getting the top-n values. Here is a simple
benchmark:

julia> using BenchmarkTools

julia> Random.seed!(1234)
MersenneTwister(1234)

julia> for n in [10^i for i in 3:7]
           @show n
           x = rand(n)
           @btime partialsort(x, 1:10)
           @btime sort(x)[1:10]
       end
n = 1000
  3.683 μs (2 allocations: 7.98 KiB)
  13.606 μs (2 allocations: 8.09 KiB)
n = 10000
  82.545 μs (3 allocations: 78.25 KiB)
  427.913 μs (3 allocations: 78.36 KiB)
n = 100000
  1.168 ms (3 allocations: 781.38 KiB)
  5.277 ms (3 allocations: 781.48 KiB)
n = 1000000
  11.676 ms (3 allocations: 7.63 MiB)
  67.431 ms (3 allocations: 7.63 MiB)
n = 10000000
  141.493 ms (3 allocations: 76.29 MiB)
  868.380 ms (3 allocations: 76.29 MiB)

As you can see there is a significant difference in timing and it grows with
the size of the input vector.

Getting top-n values in a data frame

In order to do the similar operation on a data frame we need to use another
function, i.e. partialsortperm. It returns the indices of the requested
values. Here is an example:

julia> Random.seed!(1234)
MersenneTwister(1234)

julia> x = rand(10)
10-element Vector{Float64}:
 0.5908446386657102
 0.7667970365022592
 0.5662374165061859
 0.4600853424625171
 0.7940257103317943
 0.8541465903790502
 0.20058603493384108
 0.2986142783434118
 0.24683718661000897
 0.5796722333690416

julia> partialsortperm(x, 1:3, rev=true)
3-element view(::Vector{Int64}, 1:3) with eltype Int64:
 6
 5
 2

Now these indices can be used to subset a data frame as follows. Assume we have
a large data frame having columns :x and :y and want to get top 10 rows
with respect to column :x. You can achieve it efficiently like this:

julia> using DataFrames

julia> Random.seed!(1234)
MersenneTwister(1234)

julia> df = DataFrame(x=rand(10^6), y=rand(10^6))
1000000×2 DataFrame
     Row │ x          y
         │ Float64    Float64
─────────┼──────────────────────
       1 │ 0.590845   0.901919
       2 │ 0.766797   0.491144
       3 │ 0.566237   0.0750416
       4 │ 0.460085   0.476479
       5 │ 0.794026   0.296354
       6 │ 0.854147   0.399129
       7 │ 0.200586   0.0894096
    ⋮    │     ⋮          ⋮
  999994 │ 0.798217   0.645084
  999995 │ 0.587953   0.978664
  999996 │ 0.47863    0.105387
  999997 │ 0.444533   0.0169648
  999998 │ 0.0250811  0.921271
  999999 │ 0.527551   0.731145
 1000000 │ 0.713854   0.0222126
             999986 rows omitted

julia> df[partialsortperm(df.x, 1:10, rev=true), :]
10×2 DataFrame
 Row │ x         y
     │ Float64   Float64
─────┼─────────────────────
   1 │ 0.999999  0.586541
   2 │ 0.999999  0.538707
   3 │ 0.999999  0.375452
   4 │ 0.999998  0.0909446
   5 │ 0.999998  0.426267
   6 │ 0.999997  0.747272
   7 │ 0.999995  0.0559812
   8 │ 0.999995  0.390885
   9 │ 0.999994  0.956622
  10 │ 0.999993  0.0909472

Conclusions

I use the partialsort and partialsortperm functions surprisingly often.
As you have seen they are not only easy to learn but also fast.

Finally, if you would need even more efficiency you can use the partialsort!
and partialsortperm! functions that are in-place variants of the discussed
functions.

juliabloggers.com

A Julia Language Blog Aggregator

Top 10 in the Julia language

Introduction

Getting top-n values of a vector

Getting top-n values in a data frame

Conclusions

Related