Introduction

Today I will write about a common question asked when working with data that has a general and nice, but not very well known solution in Julia.

The problem users often want to solve is how to get top-n values of some vector or top-n rows in a data frame subject some condition (as usual we will use DataFrames.jl data frame implementation). Here is an example of a recent Stack Overflow post that posed exactly this question.

Before I move on let me make a small comment about the post from last week. Due to my error when pushing things to GitHub its incorrect version was shared on juliabloggers.com. The proper post about doing model selection using cross validation can be found here.

In this the post I use Julia 1.6.3, BenchmarkTools 1.2.0, and DataFrames.jl 1.2.2.

Getting top-n values of a vector

In order to find top-n values in a vector use the partialsort function that is shipped with Julia Base. Here is an example:

julia> using Random

julia> Random.seed!(1234)
MersenneTwister(1234)

julia> x = rand(10)
10-element Vector{Float64}:
 0.5908446386657102
 0.7667970365022592
 0.5662374165061859
 0.4600853424625171
 0.7940257103317943
 0.8541465903790502
 0.20058603493384108
 0.2986142783434118
 0.24683718661000897
 0.5796722333690416

julia> partialsort(x, 1:3, rev=true)
3-element view(::Vector{Float64}, 1:3) with eltype Float64:
 0.8541465903790502
 0.7940257103317943
 0.7667970365022592

Above we have selected three top values, but in general we could have passed any range instead of 1:3. The key benefit of this approach, apart from directly producing the requested result, is that it is very fast in comparison to e.g. sorting the whole vector and getting the top-n values. Here is a simple benchmark:

julia> using BenchmarkTools

julia> Random.seed!(1234)
MersenneTwister(1234)

julia> for n in [10^i for i in 3:7]
           @show n
           x = rand(n)
           @btime partialsort(x, 1:10)
           @btime sort(x)[1:10]
       end
n = 1000
  3.683 μs (2 allocations: 7.98 KiB)
  13.606 μs (2 allocations: 8.09 KiB)
n = 10000
  82.545 μs (3 allocations: 78.25 KiB)
  427.913 μs (3 allocations: 78.36 KiB)
n = 100000
  1.168 ms (3 allocations: 781.38 KiB)
  5.277 ms (3 allocations: 781.48 KiB)
n = 1000000
  11.676 ms (3 allocations: 7.63 MiB)
  67.431 ms (3 allocations: 7.63 MiB)
n = 10000000
  141.493 ms (3 allocations: 76.29 MiB)
  868.380 ms (3 allocations: 76.29 MiB)

As you can see there is a significant difference in timing and it grows with the size of the input vector.

Getting top-n values in a data frame

In order to do the similar operation on a data frame we need to use another function, i.e. partialsortperm. It returns the indices of the requested values. Here is an example:

julia> Random.seed!(1234)
MersenneTwister(1234)

julia> x = rand(10)
10-element Vector{Float64}:
 0.5908446386657102
 0.7667970365022592
 0.5662374165061859
 0.4600853424625171
 0.7940257103317943
 0.8541465903790502
 0.20058603493384108
 0.2986142783434118
 0.24683718661000897
 0.5796722333690416

julia> partialsortperm(x, 1:3, rev=true)
3-element view(::Vector{Int64}, 1:3) with eltype Int64:
 6
 5
 2

Now these indices can be used to subset a data frame as follows. Assume we have a large data frame having columns :x and :y and want to get top 10 rows with respect to column :x. You can achieve it efficiently like this:

julia> using DataFrames

julia> Random.seed!(1234)
MersenneTwister(1234)

julia> df = DataFrame(x=rand(10^6), y=rand(10^6))
1000000×2 DataFrame
     Row │ x          y
         │ Float64    Float64
─────────┼──────────────────────
       1 │ 0.590845   0.901919
       2 │ 0.766797   0.491144
       3 │ 0.566237   0.0750416
       4 │ 0.460085   0.476479
       5 │ 0.794026   0.296354
       6 │ 0.854147   0.399129
       7 │ 0.200586   0.0894096
    ⋮    │     ⋮          ⋮
  999994 │ 0.798217   0.645084
  999995 │ 0.587953   0.978664
  999996 │ 0.47863    0.105387
  999997 │ 0.444533   0.0169648
  999998 │ 0.0250811  0.921271
  999999 │ 0.527551   0.731145
 1000000 │ 0.713854   0.0222126
             999986 rows omitted

julia> df[partialsortperm(df.x, 1:10, rev=true), :]
10×2 DataFrame
 Row │ x         y
     │ Float64   Float64
─────┼─────────────────────
   1 │ 0.999999  0.586541
   2 │ 0.999999  0.538707
   3 │ 0.999999  0.375452
   4 │ 0.999998  0.0909446
   5 │ 0.999998  0.426267
   6 │ 0.999997  0.747272
   7 │ 0.999995  0.0559812
   8 │ 0.999995  0.390885
   9 │ 0.999994  0.956622
  10 │ 0.999993  0.0909472

Conclusions

I use the partialsort and partialsortperm functions surprisingly often. As you have seen they are not only easy to learn but also fast.

Finally, if you would need even more efficiency you can use the partialsort! and partialsortperm! functions that are in-place variants of the discussed functions.