DataFrames.jl joins: matchmissing=:notequal
Introduction
In DataFrames.jl we have recently added in this PR a new
option for matchmissing
keyword argument in joins. This functionality will be
made available in 1.2 release. In this post I want to discuss this new feature
before we release it.
The post is tested under Julia 1.6.1 and on DataFrames.jl main
branch
(that includes the relevant PR).
How matchmissing
keyword argument works
The matchmissing
keyword argument allows the user to decide how missing
value is handled in on
columns in joins. After this PR you have
three options to choose from:
:error
(the default): throw an error ifmissing
value is present in any of theon
columns; the rationale is thatmissing
indicates unknown value so if we knew it it could match to any of the non-missing values in theon
columns in the other data frame we join;:equal
:missing
values are allowed and they are matched tomissing
values only; in this scenario we treatmissing
as any other value without giving it a special treatment;:notequal
(a new option): in this casemissing
is considered to be not equal to any other value (includingmissing
).
Let me comment a bit more on the consequences of the :notequal
rule. In
innerjoin
this means that rows with missing
values will be dropped both in
left and right table. In leftjoin
, semijoin
and antijoin
they are dropped
from the right table only (which means that if missing
is present in the left
table it is retained in processing but considered not to match any row in right
table). Similarly in rightjon
rows with missing
are dropped from left table
only. The case that is most difficult to handle is outerjoin
. The reason is
that if missing
would be present in both left and right table they would be
considered not equal and produce separate rows in the output table. We
considered this behavior as potentially confusing and therefore decided not to
allow :notequal
in outerjoin
.
Let me move to the examples showing the matchmissing=:notequal
at work.
Examples
Here is a simple example code showing how the new option works:
julia> using DataFrames
julia> df1 = DataFrame(id=[1, missing, 3, 4], x=1:4)
4×2 DataFrame
Row │ id x
│ Int64? Int64
─────┼────────────────
1 │ 1 1
2 │ missing 2
3 │ 3 3
4 │ 4 4
julia> df2 = DataFrame(id=[1, 2, missing, 4], y=1:4)
4×2 DataFrame
Row │ id y
│ Int64? Int64
─────┼────────────────
1 │ 1 1
2 │ 2 2
3 │ missing 3
4 │ 4 4
Now we investigate all the possible join operations:
julia> innerjoin(df1, df2, on=:id, matchmissing=:notequal)
2×3 DataFrame
Row │ id x y
│ Int64? Int64 Int64
─────┼──────────────────────
1 │ 1 1 1
2 │ 4 4 4
As you can see for innerjoin
only rows with :id
equal to 1
and 4
were
retained. Let us move forward:
julia> leftjoin(df1, df2, on=:id, matchmissing=:notequal, source=:source)
4×4 DataFrame
Row │ id x y source
│ Int64? Int64 Int64? String
─────┼────────────────────────────────────
1 │ 1 1 1 both
2 │ 4 4 4 both
3 │ missing 2 missing left_only
4 │ 3 3 missing left_only
julia> rightjoin(df1, df2, on=:id, matchmissing=:notequal, source=:source)
4×4 DataFrame
Row │ id x y source
│ Int64? Int64? Int64 String
─────┼─────────────────────────────────────
1 │ 1 1 1 both
2 │ 4 4 4 both
3 │ 2 missing 2 right_only
4 │ missing missing 3 right_only
For leftjoin
and rightjoin
we retain missing
but only in the table for
which all rows must be retained. Therefore in leftjoin
for :id
equal to
missing
we have :x
equal to 2
, but :y
equal to missing
(signaling that
there was no match which we can also see in :source
column). The same
happens for :id
equal to missing
in rightjoin
, but then :x
is set to
missing
.
The same rules work with semijoin
and antijoin
as you can see here:
julia> semijoin(df1, df2, on=:id, matchmissing=:notequal)
2×2 DataFrame
Row │ id x
│ Int64? Int64
─────┼───────────────
1 │ 1 1
2 │ 4 4
julia> antijoin(df1, df2, on=:id, matchmissing=:notequal)
2×2 DataFrame
Row │ id x
│ Int64? Int64
─────┼────────────────
1 │ missing 2
2 │ 3 3
Finally outerjoin
just throws an error:
julia> outerjoin(df1, df2, on=:id, matchmissing=:notequal)
ERROR: ArgumentError: matchmissing == :notequal for `outerjoin` is not allowed
Conclusions
I hope this post helped you to learn the rationale and design of the new option
for the matchmissing
keyword argument in joins. If you have any comments on
the functionality or its documentation please open an issue on DataFrames.jl GitHub repository.
Finally I would like to thank pstorozenko, nilshg, and nalimilan for working on this functionality.