Introduction

A few months ago my Julia for Data Analysis book was released. I tried very hard to make it correct. Regrettably, I failed.

Fortunately, my readers are more careful than I am, and they have found several issues in the book. I keep a log of them in the Errata section of the book's GitHub repository.

In this post I want to share my experience of the kinds of bugs that slipped into the book and comment on how I think they could have been avoided.

My experience is that there are three important classes of issues:

various misprints and typographical errors;
consistency of code execution across Julia versions;
factual errors.

Let me go through these classes in sequence.

Misprints and typographical errors

Such errors sneak in for many reasons. Let me highlight some that have bitten me and could be avoided.

The first is the printing of non-ASCII Unicode characters. These are nasty. In my version of the book things look fine, but somewhere in the print preparation process some characters got garbled. For example, in Chapter 6, section 6.4.1, page 132, codeunits("ε") is printed as codeunits("?").

Takeaway: before your book goes to press, always carefully check those parts of the text where you use non-ASCII Unicode characters (a common case in Julia).
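One way to make this check less error-prone is to list every line of the manuscript that contains a non-ASCII character, so that each of them can be compared against the printed proofs. Below is a minimal sketch; the function name non_ascii_lines and the sample text are made up for illustration and are not part of the book.

```julia
# Return (line number, line) pairs for all lines containing non-ASCII characters.
# `non_ascii_lines` is a hypothetical helper, not something from the book.
function non_ascii_lines(text::AbstractString)
    return [(i, String(line))
            for (i, line) in enumerate(split(text, '\n'))
            if !isascii(line)]
end

# Made-up sample: only the second line uses a non-ASCII character.
sample = "x = 1\ncodeunits(\"ε\")\ny = 2"
non_ascii_lines(sample)
```

Running such a helper over the source files gives a short checklist of places to inspect in the typeset version.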

The second is a side effect of using multi-selection or auto-replace in your editor. For example, in Chapter 3, section 3.2.3, page 59, I wrote sort(v::AbstractVector; kwthe.) instead of sort(v::AbstractVector; kws...). Clearly I must have used some pattern matching that replaced s.. with the. I do not even remember why.

Takeaway: multi-selection and auto-replace are nice, but never apply them globally. It is best to review every change before it is applied; applying them all in one shot is hard to resist, but it is exactly what you should not do.
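The effect is easy to reproduce in Julia itself. The sketch below first applies the replacement blindly and then shows one possible way to list the affected lines for review instead; lines_to_review is a hypothetical helper name, and I am only assuming that the original replacement looked like this.

```julia
text = "sort(v::AbstractVector; kws...)"

# A blind global replacement of the pattern reproduces the misprint.
damaged = replace(text, "s.." => "the")

# Safer: list the lines a pattern would touch and review them first.
# `lines_to_review` is a hypothetical helper name.
function lines_to_review(text::AbstractString, pattern::AbstractString)
    return [(i, String(line))
            for (i, line) in enumerate(split(text, '\n'))
            if occursin(pattern, line)]
end

lines_to_review(text, "s..")
```

Here damaged holds exactly the string that ended up in print, which is why each match deserves a look before it is changed.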

Consistency across versions of Julia

There are many flavors of this issue. It can mostly be resolved by strict use of Project.toml and Manifest.toml files, but sometimes unexpected things happen.

For example in Chapter 1, section 1.2.1, page 7 I show the following snippet:

julia> function sum_n(n)
           s = 0
           for i in 1:n
               s += i
           end
           return s
       end
sum_n (generic function with 1 method)

julia> @time sum_n(1_000_000_000)
  0.000001 seconds
500000000500000000

This timing is surprisingly fast (the reason is explained in the book). The catch is that this is the result under Julia 1.7, which is the version of the language I used in the book.

Under Julia 1.8 and Julia 1.9 running the same code takes longer (tested under Julia 1.9-beta4):

julia> @time sum_n(1_000_000_000)
  2.265569 seconds
500000000500000000

The reason for this inconsistency is a bug in the @time macro introduced in the Julia 1.8 release. The sum_n(1_000_000_000) call itself (without @time) still executes fast. Here is a simplified benchmark (run under Julia 1.9-beta4) showing this:

julia> let
           start = time_ns()
           v = sum_n(1_000_000_000)
           stop = time_ns()
           v, Int(stop - start)
       end
(500000000500000000, 1000)

Unfortunately, the @time macro reports inaccurate timings when used in global scope. This bug needs to be resolved in Base Julia. See this issue for details.
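The manual time_ns measurement shown above can be wrapped in a small function so that it is reusable. This is only a sketch: timed_call is a hypothetical name, and the reported number includes any compilation triggered by the call, so it should be treated as a rough indication rather than a careful benchmark.

```julia
# Same function as in the book snippet above.
function sum_n(n)
    s = 0
    for i in 1:n
        s += i
    end
    return s
end

# Hypothetical helper: run `f` and return its value with the elapsed nanoseconds.
function timed_call(f)
    start = time_ns()
    value = f()
    stop = time_ns()
    return value, Int(stop - start)
end

timed_call(() -> sum_n(1_000_000_000))
```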

Takeaway: things can change as software evolves; always make sure to explicitly tell your readers which version and configuration of software they should use to run your code.
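One lightweight way to act on this is to let the code itself warn readers when they run it under a different Julia version. The sketch below is an illustration only; TESTED_VERSION and same_minor_version are names I made up, and the value v"1.7" is the version mentioned above.

```julia
# Version the examples were written for; v"1.7" is taken from the text above.
const TESTED_VERSION = v"1.7"

# Hypothetical helper: compare only the major and minor components.
same_minor_version(a::VersionNumber, b::VersionNumber) =
    a.major == b.major && a.minor == b.minor

# VERSION is the constant in Base holding the running Julia version.
if !same_minor_version(VERSION, TESTED_VERSION)
    @warn "Examples were tested under Julia $(TESTED_VERSION.major).$(TESTED_VERSION.minor); you are running $VERSION. Results may differ."
end
```

Such a check complements the Project.toml and Manifest.toml files, which pin package versions but not the version of Julia the reader happens to start.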

Factual errors

This class is the most problematic, as I feel really bad when I find that I have written something that is not fully correct. Let me give you an example from Chapter 2, section 2.3.1, page 30.

The Julia Manual in the Short Circuit Evaluation section states:

Instead of if <cond> <statement> end, one can write <cond> && <statement> (which could be read as: <cond> and then <statement>). Similarly, instead of if ! <cond> <statement> end, one can write <cond> || <statement> (which could be read as: <cond> or else <statement>).

Similarly, in my book I have considered the following expressions:

x > 0 && println(x)

and

if x > 0
    println(x)
end

where x = -7.

I write in the book that Julia interprets them both in the same way.

Indeed, the same expressions get evaluated in both cases. However, in general an if statement and short-circuit evaluation are not equivalent.

What is the difference? If the condition is false, the value of the if statement (without else) is nothing, while the value of the short-circuiting expression is false. Here is an example:

julia> x = -7
-7

julia> show(x > 0 && println(x))
false
julia> show(if x > 0
           println(x)
       end)
nothing

The difference is subtle, but it can become important when you use the produced value later in your code.
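To see where this could matter, consider two functions that return the value of each form. This is a made-up illustration (check_short and check_if are hypothetical names); code that tests the result with isnothing behaves differently for the two.

```julia
# Short-circuit form: returns false when the condition does not hold.
check_short(x) = x > 0 && "positive"

# if form without else: returns nothing when the condition does not hold.
function check_if(x)
    if x > 0
        "positive"
    end
end

check_short(-7)             # false
check_if(-7)                # nothing

isnothing(check_short(-7))  # false
isnothing(check_if(-7))     # true
```

For a positive argument both functions return "positive", so the discrepancy only shows up on the branch where the condition fails.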

Takeaway: always carefully consider all aspects of code that you present.

Conclusions

In conclusion, I would like to ask all readers of my book to share their feedback with me. I will try to incorporate it in the next release of the book as best I can.