Re-posted from: https://bkamins.github.io/julialang/2021/10/29/top10.html
Introduction
Today I will write about a common question asked when working with data that has
a general and nice, but not very well known solution in Julia.
The problem users often want to solve is how to get top-n values of some vector
or top-n rows in a data frame subject some condition (as usual we will use
DataFrames.jl data frame implementation). Here is an example of a recent
Stack Overflow post that posed exactly this question.
Before I move on let me make a small comment about the post from last week. Due
to my error when pushing things to GitHub its incorrect version was shared on
juliabloggers.com. The proper post about doing model selection using
cross validation can be found here.
In this the post I use Julia 1.6.3, BenchmarkTools 1.2.0, and DataFrames.jl
1.2.2.
Getting top-n values of a vector
In order to find top-n values in a vector use the partialsort
function that is
shipped with Julia Base. Here is an example:
julia> using Random
julia> Random.seed!(1234)
MersenneTwister(1234)
julia> x = rand(10)
10-element Vector{Float64}:
0.5908446386657102
0.7667970365022592
0.5662374165061859
0.4600853424625171
0.7940257103317943
0.8541465903790502
0.20058603493384108
0.2986142783434118
0.24683718661000897
0.5796722333690416
julia> partialsort(x, 1:3, rev=true)
3-element view(::Vector{Float64}, 1:3) with eltype Float64:
0.8541465903790502
0.7940257103317943
0.7667970365022592
Above we have selected three top values, but in general we could have passed any
range instead of 1:3
. The key benefit of this approach, apart from directly
producing the requested result, is that it is very fast in comparison to e.g.
sorting the whole vector and getting the top-n values. Here is a simple
benchmark:
julia> using BenchmarkTools
julia> Random.seed!(1234)
MersenneTwister(1234)
julia> for n in [10^i for i in 3:7]
@show n
x = rand(n)
@btime partialsort(x, 1:10)
@btime sort(x)[1:10]
end
n = 1000
3.683 μs (2 allocations: 7.98 KiB)
13.606 μs (2 allocations: 8.09 KiB)
n = 10000
82.545 μs (3 allocations: 78.25 KiB)
427.913 μs (3 allocations: 78.36 KiB)
n = 100000
1.168 ms (3 allocations: 781.38 KiB)
5.277 ms (3 allocations: 781.48 KiB)
n = 1000000
11.676 ms (3 allocations: 7.63 MiB)
67.431 ms (3 allocations: 7.63 MiB)
n = 10000000
141.493 ms (3 allocations: 76.29 MiB)
868.380 ms (3 allocations: 76.29 MiB)
As you can see there is a significant difference in timing and it grows with
the size of the input vector.
Getting top-n values in a data frame
In order to do the similar operation on a data frame we need to use another
function, i.e. partialsortperm
. It returns the indices of the requested
values. Here is an example:
julia> Random.seed!(1234)
MersenneTwister(1234)
julia> x = rand(10)
10-element Vector{Float64}:
0.5908446386657102
0.7667970365022592
0.5662374165061859
0.4600853424625171
0.7940257103317943
0.8541465903790502
0.20058603493384108
0.2986142783434118
0.24683718661000897
0.5796722333690416
julia> partialsortperm(x, 1:3, rev=true)
3-element view(::Vector{Int64}, 1:3) with eltype Int64:
6
5
2
Now these indices can be used to subset a data frame as follows. Assume we have
a large data frame having columns :x
and :y
and want to get top 10 rows
with respect to column :x
. You can achieve it efficiently like this:
julia> using DataFrames
julia> Random.seed!(1234)
MersenneTwister(1234)
julia> df = DataFrame(x=rand(10^6), y=rand(10^6))
1000000×2 DataFrame
Row │ x y
│ Float64 Float64
─────────┼──────────────────────
1 │ 0.590845 0.901919
2 │ 0.766797 0.491144
3 │ 0.566237 0.0750416
4 │ 0.460085 0.476479
5 │ 0.794026 0.296354
6 │ 0.854147 0.399129
7 │ 0.200586 0.0894096
⋮ │ ⋮ ⋮
999994 │ 0.798217 0.645084
999995 │ 0.587953 0.978664
999996 │ 0.47863 0.105387
999997 │ 0.444533 0.0169648
999998 │ 0.0250811 0.921271
999999 │ 0.527551 0.731145
1000000 │ 0.713854 0.0222126
999986 rows omitted
julia> df[partialsortperm(df.x, 1:10, rev=true), :]
10×2 DataFrame
Row │ x y
│ Float64 Float64
─────┼─────────────────────
1 │ 0.999999 0.586541
2 │ 0.999999 0.538707
3 │ 0.999999 0.375452
4 │ 0.999998 0.0909446
5 │ 0.999998 0.426267
6 │ 0.999997 0.747272
7 │ 0.999995 0.0559812
8 │ 0.999995 0.390885
9 │ 0.999994 0.956622
10 │ 0.999993 0.0909472
Conclusions
I use the partialsort
and partialsortperm
functions surprisingly often.
As you have seen they are not only easy to learn but also fast.
Finally, if you would need even more efficiency you can use the partialsort!
and partialsortperm!
functions that are in-place variants of the discussed
functions.