How is equality checked in DataFrames.jl?
Introduction
Today I want to discuss how values are tested for being equal in functions provided by DataFrames.jl.
I have already discussed equality testing in this post and this post, and I explain it extensively in chapter 7 of my Julia for Data Analysis book. However, users still raise the issue often, so I thought it would be useful to go back to it one more time.
The post was written under Julia 1.9.0-rc1, CategoricalArrays.jl 0.10.7 and DataFrames.jl 1.5.0.
Why is equality testing hard?
When users learn Julia, they are typically taught that `==` is the operator that should be used for testing for equality. Here is a basic example:
julia> 1 == 2
false
julia> 1 == 1
true
However, some aspects of `==` make it unintuitive in certain scenarios.
The first is that `==` is not guaranteed to return a `Bool` value. If one of its arguments is `missing`, the result is also `missing`:
julia> 1 == missing
missing
julia> missing == missing
missing
Clearly, this is not desirable when we expect the operation to return a `Bool` (e.g. when filtering data). The same propagation of `missing` applies to other comparison operators, such as `>`:
julia> x = [1, 2, missing, 4, 5]
5-element Vector{Union{Missing, Int64}}:
1
2
missing
4
5
julia> x[x .> 2.5]
ERROR: ArgumentError: unable to check bounds for indices of type Missing
In such cases you should use the `coalesce` function to decide whether you want to keep or drop `missing` values:
julia> x[coalesce.(x .> 2.5, false)]
2-element Vector{Union{Missing, Int64}}:
4
5
julia> x[coalesce.(x .> 2.5, true)]
3-element Vector{Union{Missing, Int64}}:
missing
4
5
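The pattern above can be wrapped in a small helper. This is only a minimal sketch: `keep_where` and its `missing_as` keyword are hypothetical names introduced here for illustration, and are not part of Base Julia or DataFrames.jl.

```julia
# Hypothetical helper (not part of Base Julia): keep the elements of `v`
# for which `pred` holds. `missing_as` decides what to do when the
# predicate returns `missing` for an element.
keep_where(pred, v; missing_as::Bool=false) = v[coalesce.(pred.(v), missing_as)]

x = [1, 2, missing, 4, 5]

dropped = keep_where(>(2.5), x)                   # `missing` treated as false
kept    = keep_where(>(2.5), x; missing_as=true)  # `missing` treated as true
```

Making the choice an explicit keyword argument forces the caller to decide how `missing` should be handled, rather than getting an error at indexing time.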
The second issue is that `==` follows IEEE semantics for floating-point numbers:
julia> NaN == NaN
false
julia> 0.0 == -0.0
true
First, you see that `NaN` is not considered to be equal to `NaN`. This can be quite surprising:
julia> x = [1, NaN, 2]
3-element Vector{Float64}:
1.0
NaN
2.0
julia> x[x .== NaN]
Float64[]
Instead, the `isnan` function can be used:
julia> x[isnan.(x)]
1-element Vector{Float64}:
NaN
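As a short sketch of how `isnan` fits into common tasks, the snippet below counts and removes `NaN` entries using only Base Julia functions. It also shows that a vector containing `NaN` is not `==` to itself, because the comparison is done element by element.

```julia
x = [1, NaN, 2]

n_nan = count(isnan, x)       # number of NaN entries in x
clean = filter(!isnan, x)     # copy of x without the NaN entries
self_equal = (x == x)         # elementwise ==, so the NaN entry makes this false
```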
The `0.0` and `-0.0` case is even trickier. These are technically two distinct values; however, in some applications users might want them to be treated as equal, while in others as not equal. The IEEE standard specifies that they are considered equal when compared using `==`.
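To see that the two zeros really are distinct values even though `==` reports them as equal, one can inspect the sign bit or divide by them. The following sketch uses only Base Julia functions:

```julia
pos, neg = 0.0, -0.0

equal_by_eq = (pos == neg)    # true: IEEE comparison ignores the sign of zero
pos_sign    = signbit(pos)    # false
neg_sign    = signbit(neg)    # true: the sign bit is set
inv_pos     = 1 / pos         # Inf
inv_neg     = 1 / neg         # -Inf
```

This is why treating the two zeros as interchangeable is not always safe: computations that follow can give different results depending on the sign.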
In summary, the major problem with `==` is that it does not define a proper equivalence relation. First, some values are not comparable (`missing` is returned); second, it is not reflexive (for `NaN`).
An alternative way to compare values
In many cases you need an equality operator that defines an equivalence relation. In Julia this is provided by the `isequal` function. As you can read in its documentation:
`isequal` treats all floating-point `NaN` values as equal to each other, treats `-0.0` as unequal to `0.0`, and `missing` as equal to `missing`. Always returns a `Bool` value.
Let us check this:
julia> isequal(1, missing)
false
julia> isequal(missing, missing)
true
julia> isequal(NaN, NaN)
true
julia> isequal(0.0, -0.0)
false
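The difference between the two operators can also be checked programmatically. The sketch below tests reflexivity (is every value equal to itself?) for a handful of sample values, and then compares a whole vector with itself. The sample values are chosen here purely for illustration.

```julia
vals = [0.0, -0.0, NaN, missing, 1.0]

# `==` may return `missing`, so map that to `false` to obtain a Bool.
reflexive_eq      = [coalesce(v == v, false) for v in vals]
reflexive_isequal = [isequal(v, v) for v in vals]

# The same difference shows up when whole vectors are compared.
a = [1, missing]
eq_result      = (a == a)         # `missing`: the result cannot be decided
isequal_result = isequal(a, a)    # always a Bool
```

Under `==` reflexivity fails for `NaN` and `missing`, while under `isequal` it holds for every value.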
In Julia, functions that create equivalence classes over sets of values use `isequal` to test for equality. In Base Julia, examples include `Dict` and `Set` operations and the `unique` function:
julia> Set([0.0, 0.0, -0.0, -0.0, NaN, NaN, missing, missing])
Set{Union{Missing, Float64}} with 4 elements:
0.0
NaN
-0.0
missing
julia> unique([0.0, 0.0, -0.0, -0.0, NaN, NaN, missing, missing])
4-element Vector{Union{Missing, Float64}}:
0.0
-0.0
NaN
missing
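A `Dict` behaves in the same way for its keys. The sketch below also contrasts membership tests: `in` on a vector compares with `==`, while `in` on a `Set` compares with `isequal`. The string values are arbitrary labels used only for this illustration.

```julia
d = Dict{Union{Missing, Float64}, String}()
d[0.0]     = "zero"
d[-0.0]    = "negative zero"   # separate key: isequal(0.0, -0.0) is false
d[NaN]     = "nan"
d[NaN]     = "nan again"       # overwrites: isequal(NaN, NaN) is true
d[missing] = "missing"

n_keys = length(d)

# Membership tests: a vector uses `==`, a Set uses `isequal`.
nan_in_vector = NaN in [NaN]
nan_in_set    = NaN in Set([NaN])
```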
The same rules carry over to DataFrames.jl.
Testing for equality in DataFrames.jl
The following functionalities of DataFrames.jl rely on the `isequal` equality test:
- deduplication with `unique` and related functions;
- grouping with `groupby`;
- joins (`innerjoin` etc.).
Let us see them in action one by one. We start with the deduplication:
julia> using DataFrames
julia> df = DataFrame(id=1:8, x=[0.0, 0.0, -0.0, -0.0, NaN, NaN, missing, missing])
8×2 DataFrame
 Row │ id     x
     │ Int64  Float64?
─────┼──────────────────
   1 │     1        0.0
   2 │     2        0.0
   3 │     3       -0.0
   4 │     4       -0.0
   5 │     5      NaN
   6 │     6      NaN
   7 │     7  missing
   8 │     8  missing
julia> unique(df, :x)
4×2 DataFrame
 Row │ id     x
     │ Int64  Float64?
─────┼──────────────────
   1 │     1        0.0
   2 │     3       -0.0
   3 │     5      NaN
   4 │     7  missing
Indeed, we see that `0.0` and `-0.0` are considered not equal, while the repeated `NaN` and `missing` values are each reduced to a single row.
Now let us turn to grouping:
julia> show(groupby(df, :x), allgroups=true)
GroupedDataFrame with 4 groups based on key: x
Group 1 (2 rows): x = 0.0
 Row │ id     x
     │ Int64  Float64?
─────┼─────────────────
   1 │     1       0.0
   2 │     2       0.0
Group 2 (2 rows): x = -0.0
 Row │ id     x
     │ Int64  Float64?
─────┼─────────────────
   1 │     3      -0.0
   2 │     4      -0.0
Group 3 (2 rows): x = NaN
 Row │ id     x
     │ Int64  Float64?
─────┼─────────────────
   1 │     5       NaN
   2 │     6       NaN
Group 4 (2 rows): x = missing
 Row │ id     x
     │ Int64  Float64?
─────┼─────────────────
   1 │     7   missing
   2 │     8   missing
As you can see, we get the same result again. As a side note, `unique` internally uses the same mechanism as `groupby` to identify duplicates.
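Both operations can also be summarized more compactly. The sketch below rebuilds the same `df` and uses `nonunique` to flag rows whose `:x` value already appeared earlier, and `combine` with `nrow` to count the rows in each group. The output column name `:n` is an arbitrary choice made for this example.

```julia
using DataFrames

df = DataFrame(id=1:8, x=[0.0, 0.0, -0.0, -0.0, NaN, NaN, missing, missing])

# `true` marks a row whose `:x` value is a duplicate of an earlier row.
dup_mask = nonunique(df, :x)

# One row per group, with the group size stored in column `:n`.
group_sizes = combine(groupby(df, :x), nrow => :n)
```

Every second row is flagged as a duplicate, and each of the four groups contains two rows, which is consistent with `isequal` semantics.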
Finally let us check joins:
julia> df_ref = DataFrame(x=[0.0, missing], val=1:2)
2×2 DataFrame
 Row │ x          val
     │ Float64?   Int64
─────┼──────────────────
   1 │       0.0      1
   2 │ missing        2
julia> outerjoin(df, df_ref, on=:x)
ERROR: ArgumentError: missing values in key columns are not allowed when matchmissing == :error
We hit the first problem. Joins detect that a `missing` value is present in the key column, and by default they throw an error in such a case. We can change this behavior using the `matchmissing` keyword argument.
Let us assume that we want `missing` values to be treated as equal and try the following join:
julia> outerjoin(df, df_ref, on=:x, matchmissing=:equal)
ERROR: ArgumentError: currently for numeric values NaN and `-0.0` in their real or imaginary components are not allowed. Use CategoricalArrays.jl to wrap these values in a CategoricalVector to perform the requested join.
We still get an error. In joins, for safety, an error is thrown if `NaN` or `-0.0` is encountered in a key column. As the error message suggests, this can be fixed by converting the `:x` column to categorical, in which case `-0.0` and `0.0` are considered to be different:
julia> using CategoricalArrays
julia> outerjoin(transform(df, :x => categorical => :x), df_ref, on=:x, matchmissing=:equal)
8×3 DataFrame
 Row │ id      x          val
     │ Int64?  Float64?   Int64?
─────┼────────────────────────────
   1 │      1        0.0        1
   2 │      2        0.0        1
   3 │      7  missing          2
   4 │      8  missing          2
   5 │      3       -0.0  missing
   6 │      4       -0.0  missing
   7 │      5      NaN    missing
   8 │      6      NaN    missing
Let us check the categorical vector in more detail:
julia> levels(categorical(df.x))
3-element Vector{Float64}:
-0.0
0.0
NaN
As you can see, `0.0` and `-0.0` are considered to be separate levels in a categorical vector.
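If, on the other hand, you want `-0.0` and `0.0` to match in a join, one possible workaround is to normalize the key column yourself before joining. The following is only a sketch under that assumption: `normalize_zero` is a hypothetical helper written for this example (it is not a DataFrames.jl function), the small data frames are made up, and the approach does nothing about `NaN` keys.

```julia
using DataFrames

# Hypothetical helper: map -0.0 to 0.0 and leave every other value,
# including `missing` and `NaN`, unchanged.
normalize_zero(v) = ismissing(v) ? missing : (iszero(v) ? zero(v) : v)

df_small = DataFrame(id=1:4, x=[0.0, -0.0, 1.0, missing])
df_ref   = DataFrame(x=[0.0, missing], val=1:2)

# Normalize the key column, then join with `missing` keys treated as equal.
df_norm = transform(df_small, :x => ByRow(normalize_zero) => :x)
res = innerjoin(df_norm, df_ref, on=:x, matchmissing=:equal)
```

After normalization the rows with `id` equal to 1 and 2 both match the `0.0` key, and the row with the `missing` key matches as well. Whether collapsing the two zeros is appropriate depends on the application, which is exactly why the library does not do it silently.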
Conclusions
I hope the examples given today were useful for understanding how `==` and `isequal` work in Julia.
As a final comment, let me add that throwing an error for joins on `-0.0` was a decision made for safety reasons. However, if users give us feedback that other options for handling `-0.0` would be useful (e.g. treating the two zeros as equal or as not equal), then we could consider adding this feature in future releases of DataFrames.jl.