Learning to zip stuff in Julia
Introduction
Today I want to discuss the basics of the design of the zip function in Julia.
This is a relatively introductory topic but, in my experience, wrong usage of this function can lead to hard-to-catch bugs.
The post was written under Julia 1.9.2.
What does zip do?
Let us start with the description of the zip function from its documentation:
Run multiple iterators at the same time, until any of them is exhausted. The value type of the zip iterator is a tuple of values of its subiterators.
If it is not fully clear what it does, let me give two examples:
julia> [v for v in zip(1:3, 11:13)]
3-element Vector{Tuple{Int64, Int64}}:
(1, 11)
(2, 12)
(3, 13)
julia> [v for v in zip("abcdef", Iterators.cycle([1, 2]))]
6-element Vector{Tuple{Char, Int64}}:
('a', 1)
('b', 2)
('c', 1)
('d', 2)
('e', 1)
('f', 2)
The examples show the following nice features of the zip function:
- It can work with any iterator, not just e.g. vectors. In the second example we used a string and a special cyclic iterator from the Iterators module. Importantly, we can work with iterators that are not indexable.
- The zip function works until the shortest of the iterators passed to it is exhausted. This is often useful when some of the iterators are infinite. Again, in our second example Iterators.cycle([1, 2]) is infinite, and it allowed us to clearly mark the odd and even positions of the characters in the string.
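Both points can be seen together in a small sketch. The names squares and pairs_seen are made up for this illustration; Iterators.countfrom is the standard infinite counter from the Iterators module.

```julia
# A generator is an iterator that cannot be indexed.
squares = (i^2 for i in 1:3)

# Iterators.countfrom(1) yields 1, 2, 3, ... without end.
# zip stops after three pairs because squares is exhausted first.
pairs_seen = collect(zip(Iterators.countfrom(1), squares))
# pairs_seen == [(1, 1), (2, 4), (3, 9)]
```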
The last nice feature of zip is that it is lazy. To understand what it means check this:
julia> zip(1:3, 11:13)
zip(1:3, 11:13)
julia> typeof(zip(1:3, 11:13))
Base.Iterators.Zip{Tuple{UnitRange{Int64}, UnitRange{Int64}}}
We can see that the zip function by itself does not do any work. Instead, it returns an iterator that can be queried.
That is why in the first examples I used comprehensions to show the values produced by the iterator returned by zip.
Why is it useful? Being lazy avoids excessive memory allocation, which is often important, especially if you work with large data.
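A minimal sketch of what laziness buys us, assuming a 64-bit machine (the variable names z and small are only illustrative). Zipping two huge ranges is cheap, because no pairs are produced until we ask for them:

```julia
# No billion-element array is built here; z only wraps the two ranges.
z = zip(1:10^9, 1:10^9)

first(z)    # (1, 1), computed on demand
length(z)   # 1000000000, derived from the ranges without iterating

# Only collect (or a comprehension) materializes the pairs.
small = collect(zip(1:3, 11:13))   # [(1, 11), (2, 12), (3, 13)]
```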
Why is zip risky?
So what are the potential problems with zip in practice? The issue is that we often want to use it in scenarios where we want to be sure that all iterators passed to zip have the same length. Unfortunately, zip does not check this.
Here is a simple example of a problematic definition of a function computing a dot product:
julia> dot_bad(x, y) = sum(a * b for (a, b) in zip(x, y))
dot_bad (generic function with 1 method)
julia> dot_bad(1:3, 4:6)
32
julia> dot_bad(1:3, 4:7)
32
julia> dot_bad(1:4, 4:6)
32
Typically with such a function we would want it to error if it gets input iterators of unequal length.
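To see why the mismatched calls above still return 32, it can help to look at what zip actually yields. A small sketch (the names pairs_used and n are only for illustration):

```julia
# zip pairs elements only while both iterators have values left.
pairs_used = collect(zip(1:3, 4:7))   # [(1, 4), (2, 5), (3, 6)]

# The trailing 7 is silently dropped, so the zipped length is the minimum.
n = length(zip(1:3, 4:7))             # 3
```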
In practice, if the passed iterators are required to have equal length, they usually support the length function.
Therefore, a fix of the definition similar to the following is often enough:
julia> function dot_safer(x, y)
           @assert length(x) == length(y)
           return sum(a * b for (a, b) in zip(x, y))
       end
dot_safer (generic function with 1 method)
julia> dot_safer(1:3, 4:6)
32
julia> dot_safer(1:3, 4:7)
ERROR: AssertionError: length(x) == length(y)
julia> dot_safer(1:4, 4:6)
ERROR: AssertionError: length(x) == length(y)
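One caveat: the Julia documentation notes that @assert might be disabled at some optimization levels, so for input validation an explicit check can be a more robust choice. Below is one possible variant, not the only way to do it. The name dot_checked and the error message are my own choices for this sketch; DimensionMismatch is a standard exception type from Base.

```julia
function dot_checked(x, y)
    # Explicit check: unlike @assert, this is never compiled away.
    length(x) == length(y) ||
        throw(DimensionMismatch("x has length $(length(x)), y has length $(length(y))"))
    return sum(a * b for (a, b) in zip(x, y))
end

dot_checked(1:3, 4:6)   # 32
```

Calling dot_checked(1:3, 4:7) throws a DimensionMismatch instead of silently returning 32.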
Conclusions
In summary: whenever you use zip, ask yourself whether you assume that the passed iterators have the same length. If you do, make sure to check it explicitly in your code.