Hunting for bugs in Julia for Data Analysis
A few months ago my Julia for Data Analysis book was released. I tried very hard to make it correct. Regrettably, I failed.
Fortunately my readers are more careful than me and they found several issues in my book. I keep their log in the Errata section of the GitHub repository of the book.
In this post I want to share you my experience about the kinds of bugs slipped into the book and comment how I think they could be avoided.
My experience is that there are three important classes of issues:
- various misprints and typographical errors;
- consistency of code execution across Julia versions;
- factual errors.
Let me go through these classes in sequence.
Misprints and typographical errors
Such errors sneak in for many reasons. Let me highlight some that have bitten me and could be avoided.
First is printing of non-standard UTF-8 characters. These are nasty. In my
version of the book things look nice, but when it went through printing
preparation process, somehow on the way some characters got messed up.
For example in Chapter 6, section 6.4.1, page 132
codeunits("ε") is printed as
Takeaway: before your book goes to press always carefully check these parts of the text where you use non-standard UTF-8 characters (a common case in Julia).
Second is the side effect of using multi-selection or auto-replace in your
editor. For example in Chapter 3, section 3.2.3, page 59 I have
an issue that I write
sort(v::AbstractVector; kwthe.) instead of
sort(v::AbstractVector; kws...). It is clear that I must have used some
pattern matching and replaced
the. I do not even remember why.
Takeaway: multi-selection and auto-replace are nice, but never do them globally; it is best to review every change before it is applied (do not make them automatically in one shot; it is hart to resist, but this is what you really should not do).
Consistency across versions of Julia
There are many flavors of this issue. It can mostly be resolved by strict use of Project.toml and Manifest.toml files, but sometimes unexpected things happen.
For example in Chapter 1, section 1.2.1, page 7 I show the following snippet:
julia> function sum_n(n) s = 0 for i in 1:n s += i end return s end sum_n (generic function with 1 method) julia> @time sum_n(1_000_000_000) 0.000001 seconds 500000000500000000
This timing is surprisingly fast (and the reason is explained in the book). The issue is that this is the situation under Julia 1.7 as this is the version of the language I used in the book.
Under Julia 1.8 and Julia 1.9 running the same code takes longer (tested under Julia 1.9-beta4):
julia> @time sum_n(1_000_000_000) 2.265569 seconds 500000000500000000
The reason for this inconsistency is a bug in the
@time macro introduced in
Julia 1.8 release. The
sum_n(1_000_000_000) call (without
@time) is executed
fast. Here is a simplified benchmark (run under Julia 1.9-beta4) showing this:
julia> let start = time_ns() v = sum_n(1_000_000_000) stop=time_ns() v, Int(stop - start) end (500000000500000000, 1000)
Unfortunately there is an issue with the
@time macro used in global scope
that causes the timing to be inaccurate. This bug needs to be resolved in
Base Julia. See this issue for details.
Takeaway: things can change as software evolves; always make sure to explicitly tell your readers which version and configuration of software they should use to run your code.
This one is most problematic, as I really feel bad, when I find that I have written something that is not fully correct. Let me give you an example from Chapter 2, section 2.3.1, page 30.
The Julia Manual in the Short Circuit Evaluation section states:
if <cond> <statement> end, one can write
<cond> && <statement>(which could be read as:
<statement>). Similarly, instead of if
if ! <cond> <statement> end, one can write
<cond> || <statement>(which could be read as:
Similarly, in my book I have considered the following expressions:
x > 0 && println(x)
if x > 0 println(x) end
x = -7.
I write in the book that Julia interprets them both in the same way.
Indeed it is true that the same expressions get evaluated in both cases.
However, in general
if statement and doing short-circuit evaluation are
What is the difference? If the condition would be
false then the value of
nothing and the value of the expression using
false. Here is an example:
julia> x = -7 -7 julia> show(x > 0 && println(x)) false julia> show(if x > 0 println(x) end) nothing
The difference is subtle, but in cases when you would use the produced value later in your code it could become important.
Takeaway: always carefully consider all aspects of code that you present.
As a conclusion I would like to ask all readers of my book to share with me your feedback. I will try to incorporate it in the next release of the book in the best way I can.