ABC of handling missing values in Julia
Introduction
When working with real data one often encounters missing values. This is an introductory level post aiming to explain the corner cases of working with such data in the Julia language. It is intended to complement the section on Missing Values of the Julia Manual. I highly recommend to read it to everyone interested in the subject and therefore I will skip many topics that are covered in detail there.
The post was written under Julia 1.6.1 and Missings.jl 1.0.1.
Introducing missing
Missing values are represented in Julia using missing
that has type Missing
.
As is explained in the section on Missing Values of the Julia Manual:
Julia provides support for representing missing values in the statistical sense, that is for situations where no value is available for a variable in an observation, but a valid value theoretically exists.
It is useful to contrast this contract with the intended use of nothing
value that has type Nothing
, which should be used when some value objectively
does not exist.
For example findfirst(==(1), 2:3)
returns nothing
value as there does not
exist an index in the 2:3
range for which the value is equal to 1
. On the
other hand if we have some empirical data collected, e.g. about patients in the
clinical trial, and for one of such patients we have not recorded subjects age
then it should be represented as missing
(the patient objectively has some
age but we just do not know it).
If we work with data it is convenient to check if some value is missing
using
the ismissing
function. For example here is a way to drop missing
values
from a vector:
julia> x = [1, missing, 3, missing]
4-element Vector{Union{Missing, Int64}}:
1
missing
3
missing
julia> filter(!ismissing, x)
2-element Vector{Union{Missing, Int64}}:
1
3
In this example we have used the !ismissing
expression which produces a
function opposite to ismissing
, i.e. returning true
if the value is not
missing.
Typical problems with missing values
Since missing
value follows a three-valued logic the following fails:
julia> findall(==(1), [1, missing, 1, 2])
ERROR: TypeError: non-boolean (Missing) used in boolean context
The reason is that:
julia> 1 == missing
missing
and we can see that the comparison does not produce a valid Bool
value.
There are two ways to work around this problem. The first one is to use the
isequal
function:
julia> findall(isequal(1), [1, missing, 1, 2])
2-element Vector{Int64}:
1
3
The other is to use the coalesce
function:
julia> findall(x -> coalesce(x == 1, false), [1, missing, 1, 2])
2-element Vector{Int64}:
1
3
It is important to remember that these are not equivalent approaches. They can differ most notably when working with floating point numbers. Here is an example:
julia> findall(isequal(NaN), [NaN, missing, -0.0, 0.0, 1.0])
1-element Vector{Int64}:
1
julia> findall(x -> coalesce(x == NaN, false), [NaN, missing, -0.0, 0.0, 1.0])
Int64[]
julia> findall(isequal(0.0), [NaN, missing, -0.0, 0.0, 1.0])
1-element Vector{Int64}:
4
julia> findall(x -> coalesce(x == 0.0, false), [NaN, missing, -0.0, 0.0, 1.0])
2-element Vector{Int64}:
3
4
Of course one should use the method that is appropriate in the application area.
Corner cases of skipmissings
Typically aggregation functions produce missing
when they are passed a collection
holding missing
values:
julia> sum([1, missing, 2])
missing
A work-around this issue is to use the skipmissing
wrapper that is a lazy
iterator skipping missing values in the passed collection, so the following
works:
julia> sum(skipmissing([1, missing, 2]))
3
It is important to know a corner case of skipmissing
when the collection after
skipping missing values is empty. In such a case skipmissing
tries to strip
the Missing
part from the eltype
of the collection, and if it is specific
enough it can be used to produce a proper result of the aggregation. However,
if the type is not specific enough an error is raised, as you can see here:
julia> sum(skipmissing(Union{Int, Missing}[missing, missing, missing]))
0
julia> sum(skipmissing([missing, missing, missing]))
ERROR: ArgumentError: reducing over an empty collection is not allowed
julia> sum(skipmissing(Any[missing, missing, missing]))
ERROR: MethodError: no method matching zero(::Type{Any})
The conclusion is that one should try to use collections of Union{Missing, T}
,
where T
is a concrete type.
In the example above we were affected by one important design decision behind
Missing
type. It is a singleton type that is not parametric. The missing
value does not carry information what is the type of the missing value, it
could be any type. Here it is worth to contrast this with e.g. R, where we have
NA
, NA_integer_
, NA_real_
, NA_character_
, and NA_complex_
constants
that cover selected, most common R types, so you have the following (run under
R 4.1.1):
> sin(NA)
[1] NA
> sin(NA_character_)
Error in sin(NA_character_) :
non-numeric argument to mathematical function
> sum(NA, na.rm=T)
[1] 0
> sum(NA_complex_, na.rm=T)
[1] 0+0i
> sum(NA_character_, na.rm=T)
Error in sum(NA_character_, na.rm = T) :
invalid 'type' (character) of argument
The decision to avoid such differences was deliberate and was meant to simplify the design of functions working with missing values (at the cost of not carrying the type information, which has to be managed on user’s side).
You might ask how one can extract T
from Union{Missing, T}
type. It is
easy using the nonmissingtype
function:
ulia> nonmissingtype(Float64)
Float64
julia> nonmissingtype(Union{Missing, Float64})
Float64
julia> nonmissingtype(Any)
Any
julia> nonmissingtype(Union{Missing, Any})
Any
Functions not supporting missing values
As we have seen above many functions produce missing
when they are passed
missing
as an argument. The rationale is as follows: passed argument is
objectively present but just unknown, so the result of the operation should also
be present, but it is just unknown. Here are some examples of this behavior:
julia> 1 < missing
missing
julia> sin(missing)
missing
However, not all functions follow this rule. Take e.g. an Int
constructor:
julia> Int(missing)
ERROR: MethodError: no method matching Int64(::Missing)
So what should we do if we want to convert the vector of integer float or
missing values into a vector of Union{Int, Missing}
element type?
Note that the following fails:
julia> Int.([1.0, 2.0, missing])
ERROR: MethodError: no method matching Int64(::Missing)
You can do one of the two things. Either handle the case of missing
manually
like this:
julia> [ismissing(x) ? missing : Int(x) for x in [1.0, 2.0, missing]]
3-element Vector{Union{Missing, Int64}}:
1
2
missing
or use the passmissing
wrapper that is defined in the Missings.jl package:
julia> using Missings
julia> passmissing(Int).([1.0, 2.0, missing])
3-element Vector{Union{Missing, Int64}}:
1
2
missing
Changing element type of the collections
Very often we have a collection of data whose element type is not like we would
want it to be and we need to perform an appropriate transformation that keeps
the data but just changes the element type. In the context of missing
values
there are two such operations.
The first is when we have a collection that does not allow missing values, but
we want another collection that holds the same data, but allows them (e.g.
because later we might want to store missing
in such a collection). In such
a case use allowmissing
from Missings.jl:
julia> x = [1, 2, 3]
3-element Vector{Int64}:
1
2
3
julia> x[1] = missing
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Int64
Closest candidates are:
convert(::Type{T}, ::Ptr) where T<:Integer at pointer.jl:23
convert(::Type{T}, ::T) where T<:Number at number.jl:6
convert(::Type{T}, ::Number) where T<:Number at number.jl:7
...
Stacktrace:
[1] setindex!(A::Vector{Int64}, x::Missing, i1::Int64)
@ Base ./array.jl:839
[2] top-level scope
@ REPL[63]:1
julia> y = allowmissing(x)
3-element Vector{Union{Missing, Int64}}:
1
2
3
julia> y[1] = missing
missing
julia> y
3-element Vector{Union{Missing, Int64}}:
missing
2
3
An opposite scenario is when we started with a collection allowing missing
values, which were removed from it and now we want it to have a narrower element
type. Here disallowmissing
comes to our aid:
julia> x = [1, missing, 2]
3-element Vector{Union{Missing, Int64}}:
1
missing
2
julia> filter!(!ismissing, x)
2-element Vector{Union{Missing, Int64}}:
1
2
julia> disallowmissing(x)
2-element Vector{Int64}:
1
2
Finally sometimes we might want to create a collection initially filled with
missing
values, but allowing additionally some specific type of values.
Unfortunately the fill
function will not help us here:
julia> fill(missing, 2)
2-element Vector{Missing}:
missing
missing
and we can see that the element type of the produced collection is Missing
and
it is too narrow for practical use.
We should use the missings
function from Missings.jl instead. For instance:
julia> z = missings(Int, 3)
3-element Vector{Union{Missing, Int64}}:
missing
missing
missing
julia> z[1] = 100
100
julia> z
3-element Vector{Union{Missing, Int64}}:
100
missing
missing
The distinction between collections allowing and not allowing missing
values
is quite important in practice, so it is worth remembering the allowmissing
,
disallowmissing
, and missings
functions are available. To wrap up let us
contrast this design decision with R, where storing missing values in typical
scenarios (not in all scenarios though) is supported by default and cannot be
opted-out from.
Conclusions
Most of the topics I have discussed here are standard. However, hopefully for people starting to work with missing values in the Julia language these examples can serve as a good additional information on top of what is written in the section on Missing Values of the Julia Manual.