Introduction

In DataFrames.jl we have recently added in this PR a new option for matchmissing keyword argument in joins. This functionality will be made available in 1.2 release. In this post I want to discuss this new feature before we release it.

The post is tested under Julia 1.6.1 and on DataFrames.jl main branch (that includes the relevant PR).

How matchmissing keyword argument works

The matchmissing keyword argument allows the user to decide how missing value is handled in on columns in joins. After this PR you have three options to choose from:

  • :error (the default): throw an error if missing value is present in any of the on columns; the rationale is that missing indicates unknown value so if we knew it it could match to any of the non-missing values in the on columns in the other data frame we join;
  • :equal: missing values are allowed and they are matched to missing values only; in this scenario we treat missing as any other value without giving it a special treatment;
  • :notequal (a new option): in this case missing is considered to be not equal to any other value (including missing).

Let me comment a bit more on the consequences of the :notequal rule. In innerjoin this means that rows with missing values will be dropped both in left and right table. In leftjoin, semijoin and antijoin they are dropped from the right table only (which means that if missing is present in the left table it is retained in processing but considered not to match any row in right table). Similarly in rightjon rows with missing are dropped from left table only. The case that is most difficult to handle is outerjoin. The reason is that if missing would be present in both left and right table they would be considered not equal and produce separate rows in the output table. We considered this behavior as potentially confusing and therefore decided not to allow :notequal in outerjoin.

Let me move to the examples showing the matchmissing=:notequal at work.

Examples

Here is a simple example code showing how the new option works:

julia> using DataFrames

julia> df1 = DataFrame(id=[1, missing, 3, 4], x=1:4)
4×2 DataFrame
 Row │ id       x
     │ Int64?   Int64
─────┼────────────────
   1 │       1      1
   2 │ missing      2
   3 │       3      3
   4 │       4      4

julia> df2 = DataFrame(id=[1, 2, missing, 4], y=1:4)
4×2 DataFrame
 Row │ id       y
     │ Int64?   Int64
─────┼────────────────
   1 │       1      1
   2 │       2      2
   3 │ missing      3
   4 │       4      4

Now we investigate all the possible join operations:

julia> innerjoin(df1, df2, on=:id, matchmissing=:notequal)
2×3 DataFrame
 Row │ id      x      y
     │ Int64?  Int64  Int64
─────┼──────────────────────
   1 │      1      1      1
   2 │      4      4      4

As you can see for innerjoin only rows with :id equal to 1 and 4 were retained. Let us move forward:

julia> leftjoin(df1, df2, on=:id, matchmissing=:notequal, source=:source)
4×4 DataFrame
 Row │ id       x      y        source
     │ Int64?   Int64  Int64?   String
─────┼────────────────────────────────────
   1 │       1      1        1  both
   2 │       4      4        4  both
   3 │ missing      2  missing  left_only
   4 │       3      3  missing  left_only

julia> rightjoin(df1, df2, on=:id, matchmissing=:notequal, source=:source)
4×4 DataFrame
 Row │ id       x        y      source
     │ Int64?   Int64?   Int64  String
─────┼─────────────────────────────────────
   1 │       1        1      1  both
   2 │       4        4      4  both
   3 │       2  missing      2  right_only
   4 │ missing  missing      3  right_only

For leftjoin and rightjoin we retain missing but only in the table for which all rows must be retained. Therefore in leftjoin for :id equal to missing we have :x equal to 2, but :y equal to missing (signaling that there was no match which we can also see in :source column). The same happens for :id equal to missing in rightjoin, but then :x is set to missing.

The same rules work with semijoin and antijoin as you can see here:

julia> semijoin(df1, df2, on=:id, matchmissing=:notequal)
2×2 DataFrame
 Row │ id      x
     │ Int64?  Int64
─────┼───────────────
   1 │      1      1
   2 │      4      4

julia> antijoin(df1, df2, on=:id, matchmissing=:notequal)
2×2 DataFrame
 Row │ id       x
     │ Int64?   Int64
─────┼────────────────
   1 │ missing      2
   2 │       3      3

Finally outerjoin just throws an error:

julia> outerjoin(df1, df2, on=:id, matchmissing=:notequal)
ERROR: ArgumentError: matchmissing == :notequal for `outerjoin` is not allowed

Conclusions

I hope this post helped you to learn the rationale and design of the new option for the matchmissing keyword argument in joins. If you have any comments on the functionality or its documentation please open an issue on DataFrames.jl GitHub repository.

Finally I would like to thank pstorozenko, nilshg, and nalimilan for working on this functionality.