Introduction

Recently I received a very interesting question about the performance of operations on GroupedDataFrame in DataFrames.jl.

In this post I want to share some comments related to the performance of DataFrames.jl.

The TLDR summary is: if you want top performance fall back to low-level Julia code (and DataFrames.jl exposes API that allows you to follow that path).

In this post I am using Julia 1.5.2 and DataFrames.jl 0.21.7.

The question and a direct answer

Assume you have a large data frame df that has two columns: :id and :v (in practice also many more columns that are irrelevant so I skip them). For each unique value of :id you want to keep only one row that corresponds to the smallest value of :v in the group defined by column :id.

We start with generating the example DataFrame and grouping it by :id column:

julia> using DataFrames, Random

julia> Random.seed!(1234);

julia> df = DataFrame(id=rand(1:10^6, 10^7), v=rand(10^7))
10000000×2 DataFrame
│ Row      │ id     │ v        │
│          │ Int64  │ Float64  │
├──────────┼────────┼──────────┤
│ 1        │ 99353  │ 0.954945 │
│ 2        │ 371388 │ 0.103541 │
⋮
│ 9999998  │ 71163  │ 0.37311  │
│ 9999999  │ 529940 │ 0.856262 │
│ 10000000 │ 845766 │ 0.401057 │

julia> gdf = groupby(df, :id);

julia> @time gdf = groupby(df, :id)
  1.326534 seconds (30 allocations: 280.590 MiB)
GroupedDataFrame with 999963 groups based on key: id
First Group (13 rows): id = 99353
│ Row │ id    │ v         │
│     │ Int64 │ Float64   │
├─────┼───────┼───────────┤
│ 1   │ 99353 │ 0.954945  │
│ 2   │ 99353 │ 0.584677  │
⋮
│ 11  │ 99353 │ 0.838052  │
│ 12  │ 99353 │ 0.0773429 │
│ 13  │ 99353 │ 0.774979  │
⋮
Last Group (1 row): id = 502811
│ Row │ id     │ v        │
│     │ Int64  │ Float64  │
├─────┼────────┼──────────┤
│ 1   │ 502811 │ 0.953105 │

(note that I run here and below all commands twice to exclude compilation time and allocations from what is reported by @time macro)

A direct approach to solve this problem would be the following (here I store the result in res1 variable to compare it to other solutions at the end of this post):

julia> res1 = combine(sdf -> first(sort(sdf)), gdf);

julia> @time combine(sdf -> first(sort(sdf)), gdf)
 13.587720 seconds (296.78 M allocations: 12.672 GiB, 9.73% gc time)
999963×2 DataFrame
│ Row    │ id     │ v         │
│        │ Int64  │ Float64   │
├────────┼────────┼───────────┤
│ 1      │ 99353  │ 0.0436469 │
│ 2      │ 371388 │ 0.0203823 │
⋮
│ 999961 │ 167495 │ 0.228266  │
│ 999962 │ 527369 │ 0.0525908 │
│ 999963 │ 502811 │ 0.953105  │

We see that the operation is quite slow and massively allocates. The problem is that sort(sdf) allocates a fresh small DataFrame for each group. This is inefficient.

Let us think then if we can improve the performance of this code.

Pre-sorting

The first option to consider is pre-sorting of df like this:

julia> res2 = combine(first, groupby(sort(df, :v), :id));

julia> @time combine(first, groupby(sort(df, :v), :id))
  7.166366 seconds (14.03 M allocations: 1.811 GiB, 2.85% gc time)
999963×2 DataFrame
│ Row    │ id     │ v          │
│        │ Int64  │ Float64    │
├────────┼────────┼────────────┤
│ 1      │ 533731 │ 1.27094e-7 │
│ 2      │ 709049 │ 1.29805e-7 │
⋮
│ 999961 │ 803454 │ 0.991274   │
│ 999962 │ 107403 │ 0.99218    │
│ 999963 │ 925877 │ 0.994074   │

We already see much improvement. However, this solution has a problem that it still requires sorting, which in the chain sort, groupby, combine is the slowest step.

Using `argmin`

Actually we see that we do not need to sort our data to get the result. It is enough to find the minimum row with respect to column :v. Here is how one can do it:

julia> res3 = combine(sdf -> sdf[argmin(sdf.v), :], gdf);

julia> @time combine(sdf -> sdf[argmin(sdf.v), :], gdf)
  1.581478 seconds (15.48 M allocations: 593.183 MiB, 4.29% gc time)
999963×2 DataFrame
│ Row    │ id     │ v         │
│        │ Int64  │ Float64   │
├────────┼────────┼───────────┤
│ 1      │ 99353  │ 0.0436469 │
│ 2      │ 371388 │ 0.0203823 │
⋮
│ 999961 │ 167495 │ 0.228266  │
│ 999962 │ 527369 │ 0.0525908 │
│ 999963 │ 502811 │ 0.953105  │

Now we are much faster. We can do better if we start using a low-level API of DataFrames.jl and use Julia Base functionality.

Going for Julia Base

I will present two solutions here that do the job faster than the last example of combine. The first one uses the fact that we can iterate GroupedDataFrame and parentindices function returns us a tuple of indices into the parent of the passed object. Here is the approach that can be used:

julia> res4 = df[[parentindices(sdf)[1][argmin(sdf.v)] for sdf in gdf], :];

julia> @time df[[parentindices(sdf)[1][argmin(sdf.v)] for sdf in gdf], :]
  0.990385 seconds (5.19 M allocations: 325.875 MiB, 4.73% gc time)
999963×2 DataFrame
│ Row    │ id     │ v         │
│        │ Int64  │ Float64   │
├────────┼────────┼───────────┤
│ 1      │ 99353  │ 0.0436469 │
│ 2      │ 371388 │ 0.0203823 │
⋮
│ 999961 │ 167495 │ 0.228266  │
│ 999962 │ 527369 │ 0.0525908 │
│ 999963 │ 502811 │ 0.953105  │

And we see that we are somewhat faster. We can go much faster by using the groupindices function that gives us a group number for each row of the df data frame. The cost is that the solution is a bit more verbose:

julia> function getfirst(gdf, v)
           loc = zeros(Int, length(gdf))
           best = Vector{eltype(v)}(undef, length(gdf))
           for (i, (ind, vi)) in enumerate(zip(groupindices(gdf), v))
               if !ismissing(ind) && (loc[ind] == 0 || vi < best[ind])
                   best[ind] = vi
                   loc[ind] = i
               end
           end
           return parent(gdf)[loc, :]
       end
getfirst (generic function with 1 method)

julia> res5 = getfirst(gdf, df.v);

julia> @time getfirst(gdf, df.v)
  0.359210 seconds (20 allocations: 116.349 MiB)
999963×2 DataFrame
│ Row    │ id     │ v         │
│        │ Int64  │ Float64   │
├────────┼────────┼───────────┤
│ 1      │ 99353  │ 0.0436469 │
│ 2      │ 371388 │ 0.0203823 │
⋮
│ 999961 │ 167495 │ 0.228266  │
│ 999962 │ 527369 │ 0.0525908 │
│ 999963 │ 502811 │ 0.953105  │

We could go even faster (at least twice in my tests), but let me stop here and leave the potential improvements as an exercise.

Additionally, it is worth to mention that the vector groupindices(gdf) has a value equal to missing if some row from df is not mapped to any group in gdf (note that GroupedDataFrame can be indexed-into).

Conclusions

Before we conclude let us check if the five solutions give us the same results:

julia> length(unique(sort.([res1, res2, res3, res4, res5])))
1

Indeed they do. I had to use sort here, as res2 has a different row order than the rest of data frames holding the results.

What are the take aways from these examples:

if you think about avoiding allocations and doing unnecessary work (sorting in our case) you can get a reasonably fast result using high-level API of DataFrames.jl;
if you want to go really fast there is always an option to fall-back to custom code written in Julia Base and DataFrames.jl has a low-level API that allows you to efficiently do this.

The reason why high-level API is slower is that it handles dozens of possible use cases under one function and going low-level allows you do do only the thing that needs to be done in your use case.

In general, it is worth to highlight here that, as opposed to e.g. R, DataFrame in Julia is just one of the many data structures you can use. The high-level API DataFrames.jl provides will be fast enough in most of the cases (at the benefit of being very flexible). However, if things are too slow just go low-level, which is a natural thing to do in Julia.