Safety-efficiency trade-offs in the map function
Introduction
The map
function is borrowed from functional programming languages in Julia.
In many such languages you can safely assume that the functions you pass to
map
are pure, meaning that they have no side effects and always return the
same output for the same passed arguments.
In practice, however, you sometimes pass a non-pure function to map
. In this
post I present selected cases when this can lead to surprising results.
The post was written under Julia 1.7.0 and PooledArrays.jl 1.4.0.
Example scenario of non-pure function used with map
Assume you have a vector of data and you want to add some noise to values stored
in it. Here is an example how you can do it using map
:
julia> using Random
julia> Random.seed!(1234);
julia> x = zeros(5)
5-element Vector{Float64}:
0.0
0.0
0.0
0.0
0.0
julia> map(v -> v + randn(), x)
5-element Vector{Float64}:
-0.3597289068234817
1.0872084924285859
-0.4195896169388487
0.7189099374659392
0.4202471777937789
As expected we have added a different random number to every element of the passed vector.
Problematic vector types
Let us now move to a case when things get problematic:
julia> using SparseArrays
julia> y = sparse(x)
5-element SparseVector{Float64, Int64} with 0 stored entries
julia> Random.seed!(1234);
julia> map(v -> v + randn(), y)
5-element SparseVector{Float64, Int64} with 5 stored entries:
[1] = -0.359729
[2] = -0.359729
[3] = -0.359729
[4] = -0.359729
[5] = -0.359729
To our surprise we have added the same noise to every element of y
. The reason
is that the map
function assumes for SparseVector
that the function passed
to it is pure. Therefore it is called only once for 0.0
element of sparse
vector (note that technically this element is not stored in the vector).
Let us try another sparse vector:
julia> z = sparse(ones(5)) .- 1.0
5-element SparseVector{Float64, Int64} with 5 stored entries:
[1] = 0.0
[2] = 0.0
[3] = 0.0
[4] = 0.0
[5] = 0.0
julia> Random.seed!(1234);
julia> map(v -> v + randn(), z)
5-element SparseVector{Float64, Int64} with 5 stored entries:
[1] = 1.08721
[2] = -0.41959
[3] = 0.71891
[4] = 0.420247
[5] = -0.685671
What is going on? There are two things to observe:
- This time I have stored
0.0
values in the vector so for each entry in the sparse vector I get a new random number generated. - The stream of random numbers is shifted by one, as
map
internally calledrandn()
one more time for the default0.0
value (which is not stored in this vector in our case).
Things look complex. The short lesson is: do not use the default map
with
sparse vectors when you want to use a non-pure function.
As a side note, the same problem is present with broadcasting:
julia> Random.seed!(1234);
julia> (v -> v + randn()).(y)
5-element SparseVector{Float64, Int64} with 5 stored entries:
[1] = -0.359729
[2] = -0.359729
[3] = -0.359729
[4] = -0.359729
[5] = -0.359729
How to use map
with non-pure function and sparse vector
Fortunately there is a relatively simple way to resolve our problems. We need to
use the Base.@invoke
macro. In this way we can force Julia to use the default
map
method (not the optimized one that the SparseArrays
module defines).
Here is how you can do it:
julia> Random.seed!(1234);
julia> Base.@invoke map(v -> v + randn(), y)
5-element Vector{Float64}:
-0.3597289068234817
1.0872084924285859
-0.4195896169388487
0.7189099374659392
0.4202471777937789
julia> Random.seed!(1234);
julia> Base.@invoke map(v -> v + randn(), z)
5-element Vector{Float64}:
-0.3597289068234817
1.0872084924285859
-0.4195896169388487
0.7189099374659392
0.4202471777937789
This time we get the same results as for the x
vector. Note that the returned
object has Vector
type (previously we were getting SparseVector
).
The PooledArrays.jl case
The same problem as for SparseVector
is present in PooledArrays.jl. However,
here, by default, the map
function assumes that the function you pass to it
is not-pure for safety:
julia> using PooledArrays
julia> p = PooledArray(x)
5-element PooledVector{Float64, UInt32, Vector{UInt32}}:
0.0
0.0
0.0
0.0
0.0
julia> Random.seed!(1234);
julia> map(v -> v + randn(), p)
5-element PooledVector{Float64, UInt32, Vector{UInt32}}:
-0.3597289068234817
1.0872084924285859
-0.4195896169388487
0.7189099374659392
0.4202471777937789
If you are sure that you are passing a pure function to map
in this case
you can pass pure=true
keyword argument to speed things up:
julia> Random.seed!(1234);
julia> map(v -> v + randn(), p, pure=true)
5-element PooledVector{Float64, UInt32, Vector{UInt32}}:
-0.3597289068234817
-0.3597289068234817
-0.3597289068234817
-0.3597289068234817
-0.3597289068234817
(or to break things, as in this case, since we passed a non-pure function)
Conclusions
In summary let me highlight that the situation we have discussed in this post is
a hard design decision. The reason is that most of the time you indeed pass
pure functions to the map
function. Therefore having methods that are
optimized for performance makes sense. However, sometimes, you want to perform
mapping using a non-pure function, in which case you can get surprising results.
Hopefully, after reading this post you now know how to handle such cases.
Happy hacking!