Introduction

Today I want to focus on an issue that is often not noticed by users when working DataFrames.jl, but in some cases it it might be relevant.

The subject is the order of join and grouping operations result in DataFrames.jl. The key point of the post is that this order depends on several factors, so it is simplest to assume that it is undefined. I am not going to list all cases in my examples, but just focus on showing the fact as in the future the order might change.

In this post I am using Julia 1.6.1 and DataFrames.jl 1.0.1.

Joins

Consider the following example of innerjoin:

julia> using DataFrames

julia> df1 = DataFrame(x=[2, 3, 1, 4], id1=1:4)
4×2 DataFrame
 Row │ x      id1
     │ Int64  Int64
─────┼──────────────
   1 │     2      1
   2 │     3      2
   3 │     1      3
   4 │     4      4

julia> df2 = DataFrame(x=[1, 3, 2, 5, 6], id2=1:5)
5×2 DataFrame
 Row │ x      id2
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     3      2
   3 │     2      3
   4 │     5      4
   5 │     6      5

julia> innerjoin(df1, df2, on=:x)
3×3 DataFrame
 Row │ x      id1    id2
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      3      1
   2 │     3      2      2
   3 │     2      1      3

julia> innerjoin(df2, df1, on=:x)
3×3 DataFrame
 Row │ x      id2    id1
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1      3
   2 │     3      2      2
   3 │     2      3      1

As you can see currently the row order in the result of innerjoin is taken from the longer table.

Now consider outerjoin (similar results are for leftjoin and rightjoin):

julia> outerjoin(df1, df2, on=:x)
6×3 DataFrame
 Row │ x      id1      id2
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     1        3        1
   2 │     3        2        2
   3 │     2        1        3
   4 │     4        4  missing
   5 │     5  missing        4
   6 │     6  missing        5

julia> outerjoin(df2, df1, on=:x)
6×3 DataFrame
 Row │ x      id2      id1
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     1        1        3
   2 │     3        2        2
   3 │     2        3        1
   4 │     5        4  missing
   5 │     6        5  missing
   6 │     4  missing        4

Now we have the following parts of the table: first comes the chunk matching what innerjoin produces, then we have a non-matching part from the left table, and finally we have a non-matching part from the right table.

While before 1.0 release we did not guarantee the row order in joins, the actual order has changed in DataFrames.jl 1.0. The reason were performance considerations. Consider the following examples of joins and their timing:

julia> df1 = DataFrame(x=string.(1:10^7));

julia> df2 = DataFrame(x=string.(1:10));

julia> @time innerjoin(df1, df2, on=:x);
  0.246627 seconds (176 allocations: 13.797 KiB)

julia> @time innerjoin(df2, df1, on=:x);
  0.237981 seconds (175 allocations: 13.781 KiB)

(I am showing you the timings after compilationp; I use Vector{String} to join on as this case is the slowest scenario under DataFrames.jl 1.0).

Now switch to DataFrames.jl 0.22.7 for a while (you need a fresh session and a fresh project environment to test this; timings are again after compilation):

julia> df1 = DataFrame(x=string.(1:10^7));

julia> df2 = DataFrame(x=string.(1:10));

julia> @time innerjoin(df1, df2, on=:x);
  0.350317 seconds (177 allocations: 152.602 MiB)

julia> @time innerjoin(df2, df1, on=:x);
  1.140921 seconds (183 allocations: 662.071 MiB)

As you can see the current algorithm not only uses much less memory, but also it is faster in general and not affected by the argument order (the last thing was a major bane of joins before 1.0 release of DataFrames.jl).

For a reference check what data.table in R offers in this case in terms of performance (I am adding it as the performance against data.table is a hot topic recently):

> library(data.table)
data.table 1.14.0 using 4 threads (see ?getDTthreads).  Latest news: r-datatable.com
> dt1 <- data.table(x=as.character(1:10^7))
> dt2 <- data.table(x=as.character(1:10))
> system.time(merge(dt1, dt2, all=FALSE))
   user  system elapsed
  7.445   0.153   3.544
> system.time(merge(dt1, dt2, all=FALSE, sort=FALSE))
   user  system elapsed
  6.735   0.128   2.827

(note that I have used non-pooled vectors in both cases, as this was the scenario that allowed me to compare DataFrames.jl 1.0 and 0.22.7 best; clearly if we joined on pooled vectors the timings would be much better)

Grouping

In groupby operation the rules of ordering of the GroupedDataFrame object depend on the type of the column you join on (I am assuming you are not passing sort=true keyword argument, as then groups are sorted). The two cases are:

if you join on columns that are pooled (like PooledVector or CategoricalVector) and the number of possible groups is not huge then you get your result in the order of levels in the pool;
otherwise the group ordering is their order of appearance in the source vector.

Here a particular corner case are integer columns, which are treated to be pooled (so the groups are sorted), unless the range of the integers is huge (as then we fall back to the order of appearance). Here is an example:

julia> df = DataFrame(x=[3, 1, 2], y=[300, 1, 2])
3×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     3    300
   2 │     1      1
   3 │     2      2

julia> keys(groupby(df, :x))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (x = 1,)
 GroupKey: (x = 2,)
 GroupKey: (x = 3,)

julia> keys(groupby(df, :y))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (y = 300,)
 GroupKey: (y = 1,)
 GroupKey: (y = 2,)

What is considered to be huge is left undefined (as it might change in the future), but in general if there is less levels than the number of rows in of the data frame (this is a typical case in practice) then we do not consider it as a huge number.

Again - let us do some benchmarking against the 0.22.7 release of DataFrames.jl. First the results for DataFrames.jl 1.0:

julia> df = DataFrame(x=1:10^7+1, y=[1:10^7; 10^10]);

julia> @time groupby(df, :x);
  0.055407 seconds (64 allocations: 85.834 MiB)

julia> @time groupby(df, :y);
  0.895716 seconds (50 allocations: 280.591 MiB)

and now under 0.22.7 release:

julia> df = DataFrame(x=1:10^7+1, y=[1:10^7; 10^10]);

julia> @time groupby(df, :x);
  0.890674 seconds (31 allocations: 280.590 MiB)

julia> @time groupby(df, :y);
  0.884177 seconds (31 allocations: 280.590 MiB)

As you can see, in the case of grouping integer columns we are much faster than before if the integer range is not huge.

Let us have a comparison with data.table again (we need to also perform some aggregation to match apples to apples in terms of timing).

First DataFrames.jl 1.0:

julia> df = DataFrame(x=1:10^7+1, y=[1:10^7; 10^10]);

julia> @time combine(groupby(df, :x), nrow);
  0.180592 seconds (261 allocations: 324.266 MiB)

julia> @time combine(groupby(df, :y), nrow);
  1.006619 seconds (247 allocations: 519.023 MiB)

> df <- data.table(x=1:(10^7+1), y=c(1:10^7, 10^10))
> system.time(df[, .N, by = x])
   user  system elapsed
  0.644   0.088   0.266
> system.time(df[, .N, by = y])
   user  system elapsed
  0.991   0.096   0.404

This time for the huge range DataFrames.jl is slower. (note that data.table is using four threads - which is great - and I tested my code on a single thread in Julia, as in DataFrames.jl we do not support multi-threading in this particular case yet)

Conclusions

In summary: although there are precise rules that determine the order of join and grouping results is simplest to assume that it is undefined (like in data bases). The reason is operation performance considerations (so the rules are complex and might change in the future).

However, based on the user feedback, we might in the future consider adding keyword arguments that would guarantee some particular order. Therefore if you have any thoughts on it please open an issue in the DataFrames.jl repository on GitHub.