Sorting data with missing values
Introduction
Sorting is one of the most common operations performed on collections.
In this post I discuss how to sort data that contain missing values.
The post was written under Julia 1.10.1 and Missings.jl 1.2.0.
General rules of comparison with missing values
By default, missing is considered greater than any other value it is compared with:
julia> isless(Inf, missing)
true
julia> isless("abc", missing)
true
julia> isless(r"abc", missing)
true
Note, in particular, the last case. Although Regex does not support comparisons, it can be compared to missing.
The reason is that isless has generic catch-all methods for the case when one of the arguments is missing. Let us see them:
isless(::Missing, ::Missing) = false
isless(::Missing, ::Any) = false
isless(::Any, ::Missing) = true
The rule that missing is greater than everything else has an important consequence for sorting.
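To make the rule concrete, here is a small sketch that uses only Base functions. It also contrasts isless with the < operator, which is not discussed above: < follows three-valued logic and returns missing, while sort relies on isless by default. The variable names are just for illustration.

```julia
# isless defines a total order in which missing is the largest value
a = isless(1, missing)         # 1 is less than missing
b = isless(missing, 1)         # missing is never less than a non-missing value
c = isless(missing, missing)   # missing is not less than itself

# `<` propagates missing instead of returning a Bool
d = 1 < missing

println((a, b, c, d))
```

Because isless always returns a Bool, it is safe to use as a sorting comparison even when the data contain missing.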
Default sorting with missing values
Let us create a simple vector containing missing values:
julia> x = [missing, 3, 1, missing, 2, 4, missing]
7-element Vector{Union{Missing, Int64}}:
missing
3
1
missing
2
4
missing
If we sort it, missing values end up at the end of the produced vector because, by default, sorting is done in ascending order:
julia> sort(x)
7-element Vector{Union{Missing, Int64}}:
1
2
3
4
missing
missing
missing
If we want to get values in descending order, missing values come first:
julia> sort(x, rev=true)
7-element Vector{Union{Missing, Int64}}:
missing
missing
missing
4
3
2
1
But what if we wanted to have values sorted in descending order, but with missing values put at the end?
Supplementary sorting order
Users have often asked for functionality that would allow them to sort values but treat missing as the smallest value. This means that if you sort your data in descending order, missing is put at the end.
Similarly, if you sort your data in ascending order, missing is put at the beginning.
Since Missings.jl release 1.2, this functionality is provided by the missingsmallest function:
julia> using Missings
julia> sort(x, lt=missingsmallest)
7-element Vector{Union{Missing, Int64}}:
missing
missing
missing
1
2
3
4
julia> sort(x, lt=missingsmallest, rev=true)
7-element Vector{Union{Missing, Int64}}:
4
3
2
1
missing
missing
missing
By default, missingsmallest uses the isless comparison.
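To build some intuition for what such a comparison has to do, here is a minimal sketch written with Base only. The missing_first helper is a hypothetical name I made up for illustration; it is not the Missings.jl implementation, only a function that behaves in a similar way for the examples in this post.

```julia
# Hypothetical helper (not the Missings.jl source): wraps a "less than"
# function so that missing compares as smaller than any other value.
function missing_first(lt)
    return function (a, b)
        ismissing(a) && return !ismissing(b)  # missing < b unless b is also missing
        ismissing(b) && return false          # a non-missing value is never < missing
        return lt(a, b)                       # both values present: defer to lt
    end
end

x = [missing, 3, 1, missing, 2, 4, missing]
asc = sort(x, lt=missing_first(isless))             # missing values first
desc = sort(x, lt=missing_first(isless), rev=true)  # missing values last
```

Note that the wrapped lt function is only called when both values are present, so it never has to handle missing itself.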
More advanced cases of treating missing as smallest
Assume that you have the following vector that you want to sort by the length of the string:
julia> s = [missing, "abc", "x", missing, "bcde", "pq", missing]
7-element Vector{Union{Missing, String}}:
missing
"abc"
"x"
missing
"bcde"
"pq"
missing
If you try to do it in the simplest way, you get an error:
julia> sort(s, by=length)
ERROR: MethodError: no method matching length(::Missing)
We need to wrap length in passmissing to get what we want:
julia> sort(s, by=passmissing(length))
7-element Vector{Union{Missing, String}}:
"x"
"pq"
"abc"
"bcde"
missing
missing
missing
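To see why the wrapping helps, here is a rough Base-only sketch of the idea. The lift_missing helper is a hypothetical stand-in that I restrict to one-argument functions; it is not how passmissing is implemented, it only mimics the behavior relevant here: return missing when the input is missing, and call the wrapped function otherwise.

```julia
# Hypothetical stand-in for passmissing, limited to one-argument functions
lift_missing(f) = x -> ismissing(x) ? missing : f(x)

safe_length = lift_missing(length)

s = [missing, "abc", "x", missing, "bcde", "pq", missing]
lens = map(safe_length, s)        # sorting keys; missing stays missing
by_len = sort(s, by=safe_length)  # keys are compared with isless
```

The keys produced for missing entries are missing themselves, so the default isless rule places them last in ascending order.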
julia> sort(s, by=passmissing(length), rev=true)
7-element Vector{Union{Missing, String}}:
missing
missing
missing
"bcde"
"abc"
"pq"
"x"
But what if we wanted to treat missing values as smallest?
The first approach is the one we already know:
julia> sort(s, by=passmissing(length), lt=missingsmallest)
7-element Vector{Union{Missing, String}}:
missing
missing
missing
"x"
"pq"
"abc"
"bcde"
julia> sort(s, by=passmissing(length), lt=missingsmallest, rev=true)
7-element Vector{Union{Missing, String}}:
"bcde"
"abc"
"pq"
"x"
missing
missing
missing
However, there is an alternative. You can define a comparison function that works on strings:
julia> isshorter(s1::AbstractString, s2::AbstractString) = length(s1) < length(s2)
isshorter (generic function with 1 method)
Then you can pass the isshorter function to missingsmallest as a single argument to generate a comparison function that automatically treats missing values as smallest:
julia> sort(s, lt=missingsmallest(isshorter))
7-element Vector{Union{Missing, String}}:
missing
missing
missing
"x"
"pq"
"abc"
"bcde"
julia> sort(s, lt=missingsmallest(isshorter), rev=true)
7-element Vector{Union{Missing, String}}:
"bcde"
"abc"
"pq"
"x"
missing
missing
missing
Conclusions
The missingsmallest functionality was added in Missings.jl 1.2.
I hope you will find it useful when working with your data!