Introduction

Toshiro Kageyama was an amateur Go player who became a successful professional. His story had a profound influence on my thinking about data science and programming.

Let me start with a quote about his view of how amateur and professional Go players differ (taken from his book Lessons in the Fundamentals of Go):

Once it was thought (…) that amateurs could not even approach the professional level. Nowadays, however, the great surge in the size of the Go-playing public has narrowed the distance between the tracks.

Does this sound familiar? In the past, the people who created complex, large-scale machine learning solutions were mostly PhDs in computer science or similar fields. Now the situation has changed: thousands of developers, with various backgrounds, successfully create amazing projects.

Toshiro Kageyama notes that every Go player wants to get stronger. Likewise, every data scientist, amateur or professional, wants to be able to create better models, faster, and more reliably.

So what is the secret to becoming stronger? Toshiro Kageyama's major message is that you need to learn and focus on the fundamentals. In the book he gives numerous examples from Go, baseball, sumo wrestling, and even cooking, highlighting how important it is to follow the basic principles of your trade, no matter whether you are an amateur or a professional. He stresses that in order to progress you cannot take shortcuts, hoping to bypass the fundamentals. In the short term this might yield some results, but it will not work in the long run.

What does all this have to do with Julia, you probably wonder by now?

I have been writing the Julia for Data Analysis book for a year, and it was published this week. In this post I want to discuss an important fundamental principle that I re-learned many times during this process.

Always create a project environment for your code

One of the basic things that any data scientist learns is that every project should be accompanied by a complete specification of the environment in which it should be run.

In Julia this is achieved by sharing the Project.toml and Manifest.toml files, as explained in the Pkg.jl manual.

So far everything seems easy. If you want your work to be reproducible, create the Project.toml and Manifest.toml files for your code, and Julia will automatically instantiate the required environment.
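For completeness, here is a minimal sketch of how a reader would reproduce such an environment, assuming the Project.toml and Manifest.toml files are stored in the current working directory:

julia> using Pkg

julia> Pkg.activate(".")   # use the Project.toml/Manifest.toml stored here

julia> Pkg.instantiate()   # install exactly the recorded package versions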

This is indeed a golden rule that works like magic. The fact that Julia has such an excellent package manager built into the language is one of the things that makes me use it.

However, when I was writing the book I faced several challenges that are similar to the problems faced by large projects developed over a long period of time.

Changes in versions of Julia

So what was my major problem? I wrote the book under Julia 1.7. After I finished it, Julia 1.8 was released.

Why is this an issue? Well, the version of Julia is part of the specification of the project environment. Therefore, technically, I should expect my readers to install Julia 1.7 when running the examples from my book.

However, this is exactly what I wanted to avoid. It is natural to expect that users will want to use the latest version of Julia when learning it. Therefore I had to make sure that my code works both under Julia 1.7 and Julia 1.8.

And here I hit the first problem: not everything that works under Julia 1.7 works under Julia 1.8. You might be surprised by this, as Julia 1.8 should not introduce any breaking changes. However, the major problem was binary dependencies, that is, external code dependencies that are not written in Julia. Such dependencies sometimes have versions that are specific to a given Julia version. This is an observation that I recommend you keep in mind when working on a project that is expected to be maintained over a longer period of time.

Binary dependencies, round two

Julia has excellent integration with Python. You can easily and conveniently execute Python code as part of your Julia programs. I wanted to show this functionality in the book. As an example I used t-SNE from scikit-learn (as a side note: there are very nice implementations of this algorithm in pure Julia in TSne.jl or Manifolds.jl).
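To give you a flavor of what this looks like, here is a minimal sketch (not the exact code from the book) of calling t-SNE from scikit-learn via PyCall.jl; it assumes that scikit-learn is available in the Python distribution that PyCall.jl is configured to use:

julia> using PyCall

julia> manifold = pyimport("sklearn.manifold");

julia> X = rand(100, 5);  # 100 observations with 5 features

julia> Y = manifold.TSNE(n_components=2).fit_transform(X);  # 100×2 embedding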

Guess what happened next? I got many requests from readers indicating that they had problems with properly setting up Python on their machines. Although there are packages in Julia that help to perform the Python setup automatically (PyCall.jl and Conda.jl), the diverse range of end-user environment configurations means that ensuring that the proper Python executable, along with all the required Python packages, is available is not always trivial.
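One way to sidestep many of these issues (a sketch, following the standard PyCall.jl setup procedure) is to tell PyCall.jl to ignore any system Python and use a private distribution managed by Conda.jl instead:

julia> ENV["PYTHON"] = "";  # an empty value tells PyCall.jl to use Conda.jl

julia> using Pkg

julia> Pkg.build("PyCall")  # rebuild PyCall.jl so that the setting takes effect

After such a rebuild, Python packages such as scikit-learn can be added with Conda.add from Conda.jl.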

Handling of bug fixes

Another problem I had was the handling of bug fixes in packages. The (real) scenario is as follows: I have finished writing the book and released all the code, and after some time a serious bug in some package is revealed (by serious I mostly mean things like security threats or severe memory leaks; guess what, they mostly happen when a package has binary dependencies).

Now the problem is as follows: I have some package, say in version 4.3.1, defined in my Project.toml, but the bug is found and fixed only after the package has reached, say, version 5.1.2.

You most likely already see where the problem is. Typically the bug fix is not backported to the 4.x branch of the package. I need to update the package to the 5.x branch (and then a cascade of other packages gets updated). In the end, I need to re-run all the source code for the book to check whether it still works both under Julia 1.7 and 1.8.
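In practice this means relaxing the [compat] entry for the affected package and performing a targeted update. A sketch of the relevant Pkg.jl calls could look as follows (the name SomePackage is hypothetical; under Julia 1.7 you would instead edit the [compat] section of Project.toml by hand, as Pkg.compat was added in Pkg 1.8):

julia> using Pkg

julia> Pkg.compat("SomePackage", "5")  # allow the 5.x branch in [compat]

julia> Pkg.update("SomePackage")       # pull in the 5.x release with the fix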

The reality is that, after such an update, my code did not work. Let me comment on two reasons why:

  • I had to make a small adjustment because version 5.x of the package introduced a breaking change in its API. This is a problem, but not a big one, since I got an error indicating that the API had changed.
  • The second problem is more subtle. Among the packages that got recursively updated, some changed only their minor or patch versions. You would assume that nothing should break then. And here we hit a problem that is really serious (and it is my call to package developers, if they are still reading this text):

Do not use internal APIs of packages on which your package depends!

(another fundamental rule to re-learn)

In my case this is exactly what happened. Some package changed its version and modified its internal API. This is a non-breaking change, so everything should have been fine. Unfortunately, it was not, as some other package relied on this internal API and stopped working.

Why I still think the Julia package manager is great

Someone might say that all these problems are easily solved by containerization. Indeed, I could have done this, but I did not want to, for two reasons:

  • First, I want readers of my book to be exposed to the real experience of using Julia, in all its aspects. And in practice you can expect that you will use Julia installed on your machine and will need to install packages.
  • Second, the Julia package manager is excellent. Although I had the problems I listed above, the fact is that they are minor and solvable.

Let me comment on several features of the Julia package manager that are especially useful:

  • When you add or update a package, you can choose a resolution tier that precisely describes which changes to the versions of the package's dependencies are allowed (ranging from preserving everything to updating everything to the latest and greatest); see the sketch after this list. This is important because, when I only want to make a bug-fix update, I want to allow as few changes to the other packages as possible.
  • Package management is lightweight. Although each of your projects can depend on a slightly different set of packages and versions, the Julia package manager keeps a federated repository of packages. This means that a specific version of a package needs to be installed only once per machine, which saves a lot of time and disk space.
  • It has excellent support for artifacts, which can be any binary dependencies of a package; in particular, it allows shipping pre-built, platform-specific dependencies. So although the binaries can be Julia-version dependent, you do not need any extra external tools to build such packages. And the beauty of this solution is that it works on Linux, Mac, and Windows (think whether you have ever had a situation where some package did not want to build on Linux when you used, e.g., R).
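Here is the sketch of resolution tiers mentioned above, using the functional API of Pkg.jl (Example is just a small registered demo package):

julia> using Pkg

julia> Pkg.add("Example"; preserve=Pkg.PRESERVE_ALL)  # keep currently installed versions if at all possible

julia> Pkg.update(; level=Pkg.UPLEVEL_PATCH)  # allow only patch-level updates of dependencies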

Final thoughts

Today we have talked about fundamentals. I also tried to write my Julia for Data Analysis book in a way that covers the fundamentals of Julia for data analysis.

For this reason I decided to skip discussing advanced machine learning algorithms or integration options with various data sources. Instead, in the book I cover the essential topics about the language that a data scientist will need relatively soon (while skipping topics that are most likely less important). The fundamentals are divided into two parts in the book:

  • Part 1: the Julia language (e.g. various data container types, multiple dispatch, various ways to work with strings).
  • Part 2: the data analysis ecosystem (e.g. working with data frames, plotting, representing relationships in data using graphs).

With this book I hope to convince readers that the Julia language was designed so that you can easily pick it up and still go very far, all the way to becoming a top data science developer. So, in essence, it follows what Toshiro Kageyama said: it allows a wide community of analysts to do what in the past was possible only for experts.

Let me give a single example of this kind. When starting to work with Julia, users are often confused by -> (the syntax for defining an anonymous function), => (the notation for creating key-value pairs), ==(1) (a way to create a function that compares its argument to 1 using ==, a.k.a. partial function application), or . (the symbol used when broadcasting a function).
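Here is a quick illustration of each of these elements in isolation (a minimal sketch, independent of the examples from the book):

julia> map(x -> x + 1, [1, 2, 3])  # -> defines an anonymous function
3-element Vector{Int64}:
 2
 3
 4

julia> :a => 1  # => creates a key-value Pair
:a => 1

julia> ==(1)(3)  # ==(1) is a function that checks equality to 1
false

julia> iseven.([1, 2, 3])  # . applies (broadcasts) iseven element by element
3-element BitVector:
 0
 1
 0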

Often, when reading some advanced code, such new users see expressions like the following (tested under Julia 1.8.2, DataFrames.jl 1.4.4, and DataFramesMeta.jl 0.12.0):

julia> using DataFrames

julia> df = DataFrame(a = [1, 3, 2, 1, 4])
5×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     3
   3 │     2
   4 │     1
   5 │     4

julia> transform(df, :a => ByRow(==(1)) => :a_eq_1)
5×2 DataFrame
 Row │ a      a_eq_1
     │ Int64  Bool
─────┼───────────────
   1 │     1    true
   2 │     3   false
   3 │     2   false
   4 │     1    true
   5 │     4   false

They might feel intimidated by the complexity of this syntax, while in fact they could just write:

julia> using DataFramesMeta

julia> @rtransform(df, :a_eq_1 = :a == 1)
5×2 DataFrame
 Row │ a      a_eq_1
     │ Int64  Bool
─────┼───────────────
   1 │     1    true
   2 │     3   false
   3 │     2   false
   4 │     1    true
   5 │     4   false

The first style (using transform) is the underlying operation specification syntax of DataFrames.jl: it is designed to be flexible and to allow expressing very complex operations. The second style (using @rtransform) is a domain-specific language designed to be easy to pick up (dplyr users can mostly just start using it without learning much).

In the book I cover both. Most likely you will want to start with the easy syntax that just gets you the result you desire. However, I also explain the fundamentals of both Julia and its data analysis ecosystem, so that when you need to perform more complex tasks you will know all the required features and be ready to use them.

For example, in the last chapter of the book, I show how to create and use a web service that performs valuation of Asian options, using multi-threading to ensure a fast response time. As you can imagine, it is relatively complex (but fortunately short) code, and you need to have a good grasp of the fundamentals to easily understand it. If you would like to learn why similar solutions are written in Julia in industry, I recommend you check out the Timeline case study.
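To give you a flavor of the computational core of that example, here is a minimal multi-threaded Monte Carlo sketch of valuing an arithmetic-average Asian call option (this is not the code from the book, and all parameter values are purely illustrative):

using Statistics

function asian_call_value(; s0=100.0, k=100.0, r=0.05, sigma=0.2,
                          t=1.0, steps=250, paths=100_000)
    dt = t / steps
    payoffs = zeros(paths)
    Threads.@threads for i in 1:paths  # each path is simulated independently
        s, total = s0, 0.0
        for _ in 1:steps  # simulate a geometric Brownian motion path
            s *= exp((r - sigma^2 / 2) * dt + sigma * sqrt(dt) * randn())
            total += s
        end
        payoffs[i] = max(total / steps - k, 0.0)  # arithmetic-average payoff
    end
    return exp(-r * t) * mean(payoffs)  # discounted expected payoff
end

Start Julia with, for example, julia -t 4 so that Threads.@threads actually uses multiple threads.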