DataFrames.jl joins: matchmissing=:notequal
Introduction
In DataFrames.jl we have recently added in this PR a new
option for matchmissing keyword argument in joins. This functionality will be
made available in 1.2 release. In this post I want to discuss this new feature
before we release it.
The post is tested under Julia 1.6.1 and on DataFrames.jl main branch
(that includes the relevant PR).
How matchmissing keyword argument works
The matchmissing keyword argument allows the user to decide how missing
value is handled in on columns in joins. After this PR you have
three options to choose from:
:error(the default): throw an error ifmissingvalue is present in any of theoncolumns; the rationale is thatmissingindicates unknown value so if we knew it it could match to any of the non-missing values in theoncolumns in the other data frame we join;:equal:missingvalues are allowed and they are matched tomissingvalues only; in this scenario we treatmissingas any other value without giving it a special treatment;:notequal(a new option): in this casemissingis considered to be not equal to any other value (includingmissing).
Let me comment a bit more on the consequences of the :notequal rule. In
innerjoin this means that rows with missing values will be dropped both in
left and right table. In leftjoin, semijoin and antijoin they are dropped
from the right table only (which means that if missing is present in the left
table it is retained in processing but considered not to match any row in right
table). Similarly in rightjon rows with missing are dropped from left table
only. The case that is most difficult to handle is outerjoin. The reason is
that if missing would be present in both left and right table they would be
considered not equal and produce separate rows in the output table. We
considered this behavior as potentially confusing and therefore decided not to
allow :notequal in outerjoin.
Let me move to the examples showing the matchmissing=:notequal at work.
Examples
Here is a simple example code showing how the new option works:
julia> using DataFrames
julia> df1 = DataFrame(id=[1, missing, 3, 4], x=1:4)
4×2 DataFrame
Row │ id x
│ Int64? Int64
─────┼────────────────
1 │ 1 1
2 │ missing 2
3 │ 3 3
4 │ 4 4
julia> df2 = DataFrame(id=[1, 2, missing, 4], y=1:4)
4×2 DataFrame
Row │ id y
│ Int64? Int64
─────┼────────────────
1 │ 1 1
2 │ 2 2
3 │ missing 3
4 │ 4 4
Now we investigate all the possible join operations:
julia> innerjoin(df1, df2, on=:id, matchmissing=:notequal)
2×3 DataFrame
Row │ id x y
│ Int64? Int64 Int64
─────┼──────────────────────
1 │ 1 1 1
2 │ 4 4 4
As you can see for innerjoin only rows with :id equal to 1 and 4 were
retained. Let us move forward:
julia> leftjoin(df1, df2, on=:id, matchmissing=:notequal, source=:source)
4×4 DataFrame
Row │ id x y source
│ Int64? Int64 Int64? String
─────┼────────────────────────────────────
1 │ 1 1 1 both
2 │ 4 4 4 both
3 │ missing 2 missing left_only
4 │ 3 3 missing left_only
julia> rightjoin(df1, df2, on=:id, matchmissing=:notequal, source=:source)
4×4 DataFrame
Row │ id x y source
│ Int64? Int64? Int64 String
─────┼─────────────────────────────────────
1 │ 1 1 1 both
2 │ 4 4 4 both
3 │ 2 missing 2 right_only
4 │ missing missing 3 right_only
For leftjoin and rightjoin we retain missing but only in the table for
which all rows must be retained. Therefore in leftjoin for :id equal to
missing we have :x equal to 2, but :y equal to missing (signaling that
there was no match which we can also see in :source column). The same
happens for :id equal to missing in rightjoin, but then :x is set to
missing.
The same rules work with semijoin and antijoin as you can see here:
julia> semijoin(df1, df2, on=:id, matchmissing=:notequal)
2×2 DataFrame
Row │ id x
│ Int64? Int64
─────┼───────────────
1 │ 1 1
2 │ 4 4
julia> antijoin(df1, df2, on=:id, matchmissing=:notequal)
2×2 DataFrame
Row │ id x
│ Int64? Int64
─────┼────────────────
1 │ missing 2
2 │ 3 3
Finally outerjoin just throws an error:
julia> outerjoin(df1, df2, on=:id, matchmissing=:notequal)
ERROR: ArgumentError: matchmissing == :notequal for `outerjoin` is not allowed
Conclusions
I hope this post helped you to learn the rationale and design of the new option
for the matchmissing keyword argument in joins. If you have any comments on
the functionality or its documentation please open an issue on DataFrames.jl GitHub repository.
Finally I would like to thank pstorozenko, nilshg, and nalimilan for working on this functionality.