Introduction

When working with real data one often encounters missing values. This is an introductory-level post that aims to explain the corner cases of working with such data in the Julia language. It is intended to complement the section on Missing Values of the Julia Manual. I highly recommend that everyone interested in the subject read that section; I will therefore skip many topics that are covered in detail there.

The post was written under Julia 1.6.1 and Missings.jl 1.0.1.

Introducing missing

Missing values are represented in Julia by the missing value, which has type Missing.

As is explained in the section on Missing Values of the Julia Manual:

Julia provides support for representing missing values in the statistical sense, that is for situations where no value is available for a variable in an observation, but a valid value theoretically exists.

It is useful to contrast this contract with the intended use of the nothing value, which has type Nothing and should be used when some value objectively does not exist.

For example, findfirst(==(1), 2:3) returns nothing, as there is no index in the 2:3 range for which the value is equal to 1. On the other hand, if we have collected some empirical data, e.g. about patients in a clinical trial, and for one of the patients we have not recorded the subject's age, then it should be represented as missing (the patient objectively has some age, but we just do not know it).
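The distinction can be illustrated with a short sketch. The ages vector is made-up data used only for illustration:

```julia
# `nothing` signals that no matching index exists in the range
idx = findfirst(==(1), 2:3)
isnothing(idx)        # true

# Hypothetical patient ages; the second age exists but was not recorded
ages = [34, missing, 51]
ismissing(ages[2])    # true

# An unknown age plus one year is still unknown
ages[2] + 1           # missing
```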

If we work with data it is convenient to check if some value is missing using the ismissing function. For example here is a way to drop missing values from a vector:

julia> x = [1, missing, 3, missing]
4-element Vector{Union{Missing, Int64}}:
 1
  missing
 3
  missing

julia> filter(!ismissing, x)
2-element Vector{Union{Missing, Int64}}:
 1
 3

In this example we have used the !ismissing expression which produces a function opposite to ismissing, i.e. returning true if the value is not missing.
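As a small sketch of how the negated predicate behaves, the snippet below binds it to a name (notmissing is a hypothetical name chosen here, not a Base function) and passes it to another higher-order function:

```julia
# Hypothetical name for the negated predicate; `!ismissing` is itself a function
notmissing = !ismissing

notmissing(1)          # true
notmissing(missing)    # false

# The negated predicate works with other higher-order functions, e.g. `count`
n_present = count(!ismissing, [1, missing, 3, missing])   # 2
```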

Typical problems with missing values

Since the missing value follows three-valued logic, the following fails:

julia> findall(==(1), [1, missing, 1, 2])
ERROR: TypeError: non-boolean (Missing) used in boolean context

The reason is that:

julia> 1 == missing
missing

and we can see that the comparison does not produce a valid Bool value.
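The same three-valued logic applies to the logical operators. The following sketch shows when the unknown operand does and does not affect the result:

```julia
# The result is known regardless of the unknown operand
true | missing     # true
false & missing    # false

# The result depends on the unknown operand, so it is unknown as well
true & missing     # missing
false | missing    # missing
!missing           # missing
```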

There are two ways to work around this problem. The first one is to use the isequal function:

julia> findall(isequal(1), [1, missing, 1, 2])
2-element Vector{Int64}:
 1
 3

The other is to use the coalesce function:

julia> findall(x -> coalesce(x == 1, false), [1, missing, 1, 2])
2-element Vector{Int64}:
 1
 3

It is important to remember that these are not equivalent approaches. They can differ most notably when working with floating point numbers. Here is an example:

julia> findall(isequal(NaN), [NaN, missing, -0.0, 0.0, 1.0])
1-element Vector{Int64}:
 1

julia> findall(x -> coalesce(x == NaN, false), [NaN, missing, -0.0, 0.0, 1.0])
Int64[]

julia> findall(isequal(0.0), [NaN, missing, -0.0, 0.0, 1.0])
1-element Vector{Int64}:
 4

julia> findall(x -> coalesce(x == 0.0, false), [NaN, missing, -0.0, 0.0, 1.0])
2-element Vector{Int64}:
 3
 4

Of course one should use the method that is appropriate in the application area.
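The two approaches also differ when the value being searched for is missing itself. A minimal sketch, reusing the vector from the examples above:

```julia
v = [NaN, missing, -0.0, 0.0, 1.0]

# `==` propagates missing, while `isequal` always returns a Bool
missing == missing            # missing
isequal(missing, missing)     # true

# Both calls locate the missing entry
pos_a = findall(ismissing, v)          # [2]
pos_b = findall(isequal(missing), v)   # [2]
```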

Corner cases of skipmissing

Typically aggregation functions produce missing when they are passed a collection holding missing values:

julia> sum([1, missing, 2])
missing

A workaround for this issue is to use the skipmissing wrapper, which is a lazy iterator that skips missing values in the passed collection, so the following works:

julia> sum(skipmissing([1, missing, 2]))
3
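The wrapper is not specific to sum. As a brief sketch, here it is used with two other reductions from Base:

```julia
v = [1, missing, 2]

largest = maximum(skipmissing(v))      # 2
sum_sq  = sum(abs2, skipmissing(v))    # 1^2 + 2^2 = 5
```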

It is important to know a corner case of skipmissing when the collection after skipping missing values is empty. In such a case skipmissing tries to strip the Missing part from the eltype of the collection, and if it is specific enough it can be used to produce a proper result of the aggregation. However, if the type is not specific enough an error is raised, as you can see here:

julia> sum(skipmissing(Union{Int, Missing}[missing, missing, missing]))
0

julia> sum(skipmissing([missing, missing, missing]))
ERROR: ArgumentError: reducing over an empty collection is not allowed

julia> sum(skipmissing(Any[missing, missing, missing]))
ERROR: MethodError: no method matching zero(::Type{Any})

The conclusion is that one should try to use collections of Union{Missing, T}, where T is a concrete type.
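To see where the three results come from, one can inspect the element type that a reduction sees after Missing has been stripped. The last line sketches one possible workaround for wide element types, which is to pass an explicit starting value to foldl; this is only one option and assumes that 0 is the right neutral element for the data at hand:

```julia
# Element types seen by a reduction once Missing has been stripped
t_int   = eltype(skipmissing(Union{Int, Missing}[missing]))   # Int64
t_empty = eltype(skipmissing([missing]))                      # Union{}
t_any   = eltype(skipmissing(Any[missing]))                   # Any

# One possible workaround: give the fold an explicit starting value
total = foldl(+, skipmissing(Any[missing, missing]); init = 0)   # 0
```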

In the example above we were affected by one important design decision behind the Missing type. It is a singleton type that is not parametric. The missing value does not carry any information about the type of the value that is missing; it could be of any type. It is worth contrasting this with e.g. R, where we have the NA, NA_integer_, NA_real_, NA_character_, and NA_complex_ constants that cover selected, most common R types, so you have the following (run under R 4.1.1):

> sin(NA)
[1] NA
> sin(NA_character_)
Error in sin(NA_character_) :
  non-numeric argument to mathematical function
> sum(NA, na.rm=T)
[1] 0
> sum(NA_complex_, na.rm=T)
[1] 0+0i
> sum(NA_character_, na.rm=T)
Error in sum(NA_character_, na.rm = T) :
  invalid 'type' (character) of argument

The decision to avoid such differences was deliberate and was meant to simplify the design of functions working with missing values (at the cost of not carrying the type information, which has to be managed on the user’s side).

You might ask how one can extract T from Union{Missing, T} type. It is easy using the nonmissingtype function:

julia> nonmissingtype(Float64)
Float64

julia> nonmissingtype(Union{Missing, Float64})
Float64

julia> nonmissingtype(Any)
Any

julia> nonmissingtype(Union{Missing, Any})
Any
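A typical use is to derive the type of a container or of a neutral element from an existing collection. The following is a minimal sketch; the variable names are chosen here for illustration:

```julia
x = Union{Missing, Float64}[missing, 2.5]

T = nonmissingtype(eltype(x))   # Float64
z = zero(T)                     # 0.0

# An empty container for the non-missing values of `x`
buffer = T[]
append!(buffer, skipmissing(x)) # buffer now holds 2.5
```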

Functions not supporting missing values

As we have seen above, many functions produce missing when they are passed missing as an argument. The rationale is as follows: the passed argument is objectively present but just unknown, so the result of the operation should also be present, but it is just unknown. Here are some examples of this behavior:

julia> 1 < missing
missing

julia> sin(missing)
missing

However, not all functions follow this rule. Take e.g. the Int constructor:

julia> Int(missing)
ERROR: MethodError: no method matching Int64(::Missing)

So what should we do if we want to convert a vector of integer-valued floats or missing values into a vector with the Union{Int, Missing} element type? Note that the following fails:

julia> Int.([1.0, 2.0, missing])
ERROR: MethodError: no method matching Int64(::Missing)

You can do one of two things. Either handle the case of missing manually like this:

julia> [ismissing(x) ? missing : Int(x) for x in [1.0, 2.0, missing]]
3-element Vector{Union{Missing, Int64}}:
 1
 2
  missing

or use the passmissing wrapper that is defined in the Missings.jl package:

julia> using Missings

julia> passmissing(Int).([1.0, 2.0, missing])
3-element Vector{Union{Missing, Int64}}:
 1
 2
  missing
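To make the idea behind such a wrapper concrete, here is a hand-written sketch for single-argument functions. The name lift is hypothetical, and this is not the Missings.jl implementation of passmissing, only an illustration of the pattern:

```julia
# Hypothetical helper: return missing for a missing input, otherwise apply `f`
lift(f) = x -> ismissing(x) ? missing : f(x)

to_int = lift(Int)
result = to_int.([1.0, 2.0, missing])   # elements: 1, 2, missing
```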

Changing element type of the collections

Very often we have a collection of data whose element type is not what we want it to be, and we need to perform a transformation that keeps the data but changes the element type. In the context of missing values there are two such operations.

The first is when we have a collection that does not allow missing values, but we want another collection that holds the same data, but allows them (e.g. because later we might want to store missing in such a collection). In such a case use allowmissing from Missings.jl:

julia> x = [1, 2, 3]
3-element Vector{Int64}:
 1
 2
 3

julia> x[1] = missing
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Int64
Closest candidates are:
  convert(::Type{T}, ::Ptr) where T<:Integer at pointer.jl:23
  convert(::Type{T}, ::T) where T<:Number at number.jl:6
  convert(::Type{T}, ::Number) where T<:Number at number.jl:7
  ...
Stacktrace:
 [1] setindex!(A::Vector{Int64}, x::Missing, i1::Int64)
   @ Base ./array.jl:839
 [2] top-level scope
   @ REPL[63]:1

julia> y = allowmissing(x)
3-element Vector{Union{Missing, Int64}}:
 1
 2
 3

julia> y[1] = missing
missing

julia> y
3-element Vector{Union{Missing, Int64}}:
  missing
 2
 3
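If one prefers not to load Missings.jl, a similar widening can be sketched with Base alone using convert. This is an alternative shown under the assumption that the input is a Vector{Int}; it is not a description of how allowmissing is implemented:

```julia
x = [1, 2, 3]

# Converting to a wider element type allocates a new vector
y = convert(Vector{Union{Missing, Int}}, x)
y[1] = missing

eltype(y)   # Union{Missing, Int64}
x           # unchanged: [1, 2, 3]
```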

The opposite scenario is when we started with a collection allowing missing values, the missing values were then removed from it, and now we want it to have a narrower element type. Here disallowmissing comes to our aid:

julia> x = [1, missing, 2]
3-element Vector{Union{Missing, Int64}}:
 1
  missing
 2

julia> filter!(!ismissing, x)
2-element Vector{Union{Missing, Int64}}:
 1
 2

julia> disallowmissing(x)
2-element Vector{Int64}:
 1
 2
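When dropping the missing values and narrowing the element type are needed together, the two steps can be combined using Base only. A minimal sketch of this alternative:

```julia
x = [1, missing, 2]

# `collect` materializes the lazy iterator into a vector without Missing in its eltype
y = collect(skipmissing(x))   # [1, 2]
eltype(y)                     # Int64
```

Note that, unlike filter!, this leaves x itself unchanged and allocates a new vector.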

Finally, sometimes we might want to create a collection that is initially filled with missing values but also allows values of some specific type. Unfortunately the fill function will not help us here:

julia> fill(missing, 2)
2-element Vector{Missing}:
 missing
 missing

and we can see that the element type of the produced collection is Missing, which is too narrow for practical use.

We should use the missings function from Missings.jl instead. For instance:

julia> z = missings(Int, 3)
3-element Vector{Union{Missing, Int64}}:
 missing
 missing
 missing

julia> z[1] = 100
100

julia> z
3-element Vector{Union{Missing, Int64}}:
 100
    missing
    missing
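For completeness, Base also offers an array constructor that takes missing as the initializer. The following sketch shows this Base-only alternative and is not meant to replace the missings function:

```julia
# The element type must include Missing for this constructor to apply
z = Vector{Union{Missing, Int}}(missing, 3)
z[1] = 100

z   # elements: 100, missing, missing
```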

The distinction between collections allowing and not allowing missing values is quite important in practice, so it is worth remembering that the allowmissing, disallowmissing, and missings functions are available. To wrap up, let us contrast this design decision with R, where storing missing values in typical scenarios (not in all scenarios though) is supported by default and cannot be opted out of.

Conclusions

Most of the topics I have discussed here are standard. However, I hope that for people starting to work with missing values in the Julia language these examples can serve as useful additional information on top of what is written in the section on Missing Values of the Julia Manual.