# How is equality checked in DataFrames.jl?

# Introduction

Today I want to discuss how values are tested for being equal in functions provided by DataFrames.jl.

I already discussed the topic of equality testing in the past in this post and this post and explain it extensively in chapter 7 of my Julia for Data Analysis book. However, the issue is still often raised by users, so I thought it is useful to go back to it one more time.

The post was written under Julia 1.9.0-rc1, CategoricalArrays.jl 0.10.7 and DataFrames.jl 1.5.0.

# Why equality testing is hard?

When users learn Julia they are typically taught that `==`

is the operator
that should be used for testing for equality. Here is a basic example:

```
julia> 1 == 2
false
julia> 1 == 1
true
```

However, there are the following aspects of `==`

that make it not intuitive
in some scenarios.

First is that `==`

does not guarantee to return `Bool`

value. The problem
is that if one of its arguments is `missing`

then the result will be `missing`

:

```
julia> 1 == missing
missing
julia> missing == missing
missing
```

Clearly this is not desirable in cases when we expect the operation to return `Bool`

(e.g. when filtering data):

```
julia> x = [1, 2, missing, 4, 5]
5-element Vector{Union{Missing, Int64}}:
1
2
missing
4
5
julia> x[x .> 2.5]
ERROR: ArgumentError: unable to check bounds for indices of type Missing
```

In such cases you should use the `coalesce`

function to decide if you want to keep or drop `missing`

values:

```
julia> x[coalesce.(x .> 2.5, false)]
2-element Vector{Union{Missing, Int64}}:
4
5
julia> x[coalesce.(x .> 2.5, true)]
3-element Vector{Union{Missing, Int64}}:
missing
4
5
```

The second issue is that `==`

follows IEEE semantics for floating-point numbers:

```
julia> NaN == NaN
false
julia> 0.0 == -0.0
true
```

First, you see that `NaN`

is not considered to be equal to `NaN`

. This can be quite surprising:

```
julia> x = [1, NaN, 2]
3-element Vector{Float64}:
1.0
NaN
2.0
julia> x[x .== NaN]
Float64[]
```

instead the `isnan`

function can be used:

```
julia> x[isnan.(x)]
1-element Vector{Float64}:
NaN
```

The `0.0`

and `-0.0`

case is even more tricky. These are two technically distinct values,
however, in some applications user might want them to be treated as equal, while in other
as not equal. IEEE standard determines that they are considered to be equal when compared
using `==`

.

In summary, the major problem with `==`

is that it does not define a proper equivalence
relation. First, some values are not comparable (`missing`

is returned); second
it is not reflexive (for `NaN`

).

# An alternative way to compare values

In many cases you need an equality operator that defines an equivalence relation.
In Julia this is provided by the `isequal`

function. As you can read in its documentation:

`isequal`

treats all floating-point`NaN`

values as equal to each other, treats`-0.0`

as unequal to`0.0`

, and`missing`

as equal to`missing`

. Always returns a`Bool`

value.

Let us check this:

```
julia> isequal(1, missing)
false
julia> isequal(missing, missing)
true
julia> isequal(NaN, NaN)
true
julia> isequal(0.0, -0.0)
false
```

In Julia functions that create equivalence classes over sets of some values use
`isequal`

to test for equality. In Base Julia such are for example `Dict`

and `Set`

operations or the `unique`

function:

```
julia> Set([0.0, 0.0, -0.0, -0.0, NaN, NaN, missing, missing])
Set{Union{Missing, Float64}} with 4 elements:
0.0
NaN
-0.0
missing
julia> unique([0.0, 0.0, -0.0, -0.0, NaN, NaN, missing, missing])
4-element Vector{Union{Missing, Float64}}:
0.0
-0.0
NaN
missing
```

The same rules carry over to DataFrames.jl.

# Testing for equality in DataFrames.jl

There are the following functionalities of DataFrames.jl that rely on the `isequal`

equality test:

- deduplication with
`unique`

and related functions; - grouping with
`groupby`

; - joins (
`innerjoin`

etc.).

Let us see them in action one by one. We start with the deduplication:

```
julia> using DataFrames
julia> df = DataFrame(id=1:8, x=[0.0, 0.0, -0.0, -0.0, NaN, NaN, missing, missing])
8×2 DataFrame
Row │ id x
│ Int64 Float64?
─────┼──────────────────
1 │ 1 0.0
2 │ 2 0.0
3 │ 3 -0.0
4 │ 4 -0.0
5 │ 5 NaN
6 │ 6 NaN
7 │ 7 missing
8 │ 8 missing
julia> unique(df, :x)
4×2 DataFrame
Row │ id x
│ Int64 Float64?
─────┼──────────────────
1 │ 1 0.0
2 │ 3 -0.0
3 │ 5 NaN
4 │ 7 missing
```

Indeed, we see that `0.0`

and `-0.0`

are considered as not equal,
while `NaN`

and `missing`

are deduplicated.

Now let us turn to grouping:

```
julia> show(groupby(df, :x), allgroups=true)
GroupedDataFrame with 4 groups based on key: x
Group 1 (2 rows): x = 0.0
Row │ id x
│ Int64 Float64?
─────┼─────────────────
1 │ 1 0.0
2 │ 2 0.0
Group 2 (2 rows): x = -0.0
Row │ id x
│ Int64 Float64?
─────┼─────────────────
1 │ 3 -0.0
2 │ 4 -0.0
Group 3 (2 rows): x = NaN
Row │ id x
│ Int64 Float64?
─────┼─────────────────
1 │ 5 NaN
2 │ 6 NaN
Group 4 (2 rows): x = missing
Row │ id x
│ Int64 Float64?
─────┼─────────────────
1 │ 7 missing
2 │ 8 missing
```

As you can see we get the same result again. As a side note let me
comment that `unique`

internally uses the same mechanism as `groupby`

to identify duplicates.

Finally let us check joins:

```
julia> df_ref = DataFrame(x=[0.0, missing], val=1:2)
2×2 DataFrame
Row │ x val
│ Float64? Int64
─────┼──────────────────
1 │ 0.0 1
2 │ missing 2
julia> outerjoin(df, df_ref, on=:x)
ERROR: ArgumentError: missing values in key columns are not allowed when matchmissing == :error
```

We get a first problem. Joins detect that `missing`

value is present in key column.
By default it errors in such a case. We can change it using the `matchmissing`

keyword argument.
Let us assume that we want `missing`

values to be treated as equal and try the following join:

```
julia> outerjoin(df, df_ref, on=:x, matchmissing=:equal)
ERROR: ArgumentError: currently for numeric values NaN and `-0.0` in their real or imaginary components are not allowed. Use CategoricalArrays.jl to wrap these values in a CategoricalVector to perform the requested join.
```

We still get an error. In joins, for safety, if `-0.0`

is encountered in key then an error is thrown. This can be fixed by transforming the `:x`

column to categorical,
in which case `-0.0`

and `0.0`

are considered to be different:

```
julia> using CategoricalArrays
julia> outerjoin(transform(df, :x => categorical => :x), df_ref, on=:x, matchmissing=:equal)
8×3 DataFrame
Row │ id x val
│ Int64? Float64? Int64?
─────┼────────────────────────────
1 │ 1 0.0 1
2 │ 2 0.0 1
3 │ 7 missing 2
4 │ 8 missing 2
5 │ 3 -0.0 missing
6 │ 4 -0.0 missing
7 │ 5 NaN missing
8 │ 6 NaN missing
```

Let us check categorical vector in more detail:

```
julia> levels(categorical(df.x))
3-element Vector{Float64}:
-0.0
0.0
NaN
```

As you can see `0.0`

and `-0.0`

are considered to be separate levels in a categorical vector.

# Conclusions

I hope the examples given today were useful for understanding how `==`

and `isequal`

work in Julia.

As a final comment let me add that throwing an error for joins on `-0.0`

was a decision that was made
for safety reasons. However, if users give us a feedback that adding other options of handling `-0.0`

would be useful (e.g. treating them as equal or not-equal) then we could consider adding this feature
in the future releases of DataFrames.jl.