The order of join and grouping result in DataFrames.jl
Introduction
Today I want to focus on an issue that is often not noticed by users when working DataFrames.jl, but in some cases it it might be relevant.
The subject is the order of join and grouping operations result in DataFrames.jl. The key point of the post is that this order depends on several factors, so it is simplest to assume that it is undefined. I am not going to list all cases in my examples, but just focus on showing the fact as in the future the order might change.
In this post I am using Julia 1.6.1 and DataFrames.jl 1.0.1.
Joins
Consider the following example of innerjoin
:
julia> using DataFrames
julia> df1 = DataFrame(x=[2, 3, 1, 4], id1=1:4)
4×2 DataFrame
Row │ x id1
│ Int64 Int64
─────┼──────────────
1 │ 2 1
2 │ 3 2
3 │ 1 3
4 │ 4 4
julia> df2 = DataFrame(x=[1, 3, 2, 5, 6], id2=1:5)
5×2 DataFrame
Row │ x id2
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 3 2
3 │ 2 3
4 │ 5 4
5 │ 6 5
julia> innerjoin(df1, df2, on=:x)
3×3 DataFrame
Row │ x id1 id2
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 3 1
2 │ 3 2 2
3 │ 2 1 3
julia> innerjoin(df2, df1, on=:x)
3×3 DataFrame
Row │ x id2 id1
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 3
2 │ 3 2 2
3 │ 2 3 1
As you can see currently the row order in the result of innerjoin
is taken
from the longer table.
Now consider outerjoin
(similar results are for leftjoin
and rightjoin
):
julia> outerjoin(df1, df2, on=:x)
6×3 DataFrame
Row │ x id1 id2
│ Int64 Int64? Int64?
─────┼─────────────────────────
1 │ 1 3 1
2 │ 3 2 2
3 │ 2 1 3
4 │ 4 4 missing
5 │ 5 missing 4
6 │ 6 missing 5
julia> outerjoin(df2, df1, on=:x)
6×3 DataFrame
Row │ x id2 id1
│ Int64 Int64? Int64?
─────┼─────────────────────────
1 │ 1 1 3
2 │ 3 2 2
3 │ 2 3 1
4 │ 5 4 missing
5 │ 6 5 missing
6 │ 4 missing 4
Now we have the following parts of the table: first comes the chunk
matching what innerjoin
produces, then we have a non-matching part from the
left table, and finally we have a non-matching part from the right table.
While before 1.0 release we did not guarantee the row order in joins, the actual order has changed in DataFrames.jl 1.0. The reason were performance considerations. Consider the following examples of joins and their timing:
julia> df1 = DataFrame(x=string.(1:10^7));
julia> df2 = DataFrame(x=string.(1:10));
julia> @time innerjoin(df1, df2, on=:x);
0.246627 seconds (176 allocations: 13.797 KiB)
julia> @time innerjoin(df2, df1, on=:x);
0.237981 seconds (175 allocations: 13.781 KiB)
(I am showing you the timings after compilationp; I use Vector{String}
to join
on as this case is the slowest scenario under DataFrames.jl 1.0).
Now switch to DataFrames.jl 0.22.7 for a while (you need a fresh session and a fresh project environment to test this; timings are again after compilation):
julia> df1 = DataFrame(x=string.(1:10^7));
julia> df2 = DataFrame(x=string.(1:10));
julia> @time innerjoin(df1, df2, on=:x);
0.350317 seconds (177 allocations: 152.602 MiB)
julia> @time innerjoin(df2, df1, on=:x);
1.140921 seconds (183 allocations: 662.071 MiB)
As you can see the current algorithm not only uses much less memory, but also it is faster in general and not affected by the argument order (the last thing was a major bane of joins before 1.0 release of DataFrames.jl).
For a reference check what data.table
in R offers in this case in terms of
performance (I am adding it as the performance against data.table
is a hot
topic recently):
> library(data.table)
data.table 1.14.0 using 4 threads (see ?getDTthreads). Latest news: r-datatable.com
> dt1 <- data.table(x=as.character(1:10^7))
> dt2 <- data.table(x=as.character(1:10))
> system.time(merge(dt1, dt2, all=FALSE))
user system elapsed
7.445 0.153 3.544
> system.time(merge(dt1, dt2, all=FALSE, sort=FALSE))
user system elapsed
6.735 0.128 2.827
(note that I have used non-pooled vectors in both cases, as this was the scenario that allowed me to compare DataFrames.jl 1.0 and 0.22.7 best; clearly if we joined on pooled vectors the timings would be much better)
Grouping
In groupby
operation the rules of ordering of the GroupedDataFrame
object
depend on the type of the column you join on (I am assuming you are not passing
sort=true
keyword argument, as then groups are sorted). The two cases are:
- if you join on columns that are pooled (like
PooledVector
orCategoricalVector
) and the number of possible groups is not huge then you get your result in the order of levels in the pool; - otherwise the group ordering is their order of appearance in the source vector.
Here a particular corner case are integer columns, which are treated to be pooled (so the groups are sorted), unless the range of the integers is huge (as then we fall back to the order of appearance). Here is an example:
julia> df = DataFrame(x=[3, 1, 2], y=[300, 1, 2])
3×2 DataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 3 300
2 │ 1 1
3 │ 2 2
julia> keys(groupby(df, :x))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
GroupKey: (x = 1,)
GroupKey: (x = 2,)
GroupKey: (x = 3,)
julia> keys(groupby(df, :y))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
GroupKey: (y = 300,)
GroupKey: (y = 1,)
GroupKey: (y = 2,)
What is considered to be huge is left undefined (as it might change in the future), but in general if there is less levels than the number of rows in of the data frame (this is a typical case in practice) then we do not consider it as a huge number.
Again - let us do some benchmarking against the 0.22.7 release of DataFrames.jl. First the results for DataFrames.jl 1.0:
julia> df = DataFrame(x=1:10^7+1, y=[1:10^7; 10^10]);
julia> @time groupby(df, :x);
0.055407 seconds (64 allocations: 85.834 MiB)
julia> @time groupby(df, :y);
0.895716 seconds (50 allocations: 280.591 MiB)
and now under 0.22.7 release:
julia> df = DataFrame(x=1:10^7+1, y=[1:10^7; 10^10]);
julia> @time groupby(df, :x);
0.890674 seconds (31 allocations: 280.590 MiB)
julia> @time groupby(df, :y);
0.884177 seconds (31 allocations: 280.590 MiB)
As you can see, in the case of grouping integer columns we are much faster than before if the integer range is not huge.
Let us have a comparison with data.table
again (we need to also perform some
aggregation to match apples to apples in terms of timing).
First DataFrames.jl 1.0:
julia> df = DataFrame(x=1:10^7+1, y=[1:10^7; 10^10]);
julia> @time combine(groupby(df, :x), nrow);
0.180592 seconds (261 allocations: 324.266 MiB)
julia> @time combine(groupby(df, :y), nrow);
1.006619 seconds (247 allocations: 519.023 MiB)
> df <- data.table(x=1:(10^7+1), y=c(1:10^7, 10^10))
> system.time(df[, .N, by = x])
user system elapsed
0.644 0.088 0.266
> system.time(df[, .N, by = y])
user system elapsed
0.991 0.096 0.404
This time for the huge range DataFrames.jl is slower. (note that data.table
is using four threads - which is great - and I tested my code on a single thread
in Julia, as in DataFrames.jl we do not support multi-threading in this
particular case yet)
Conclusions
In summary: although there are precise rules that determine the order of join and grouping results is simplest to assume that it is undefined (like in data bases). The reason is operation performance considerations (so the rules are complex and might change in the future).
However, based on the user feedback, we might in the future consider adding keyword arguments that would guarantee some particular order. Therefore if you have any thoughts on it please open an issue in the DataFrames.jl repository on GitHub.