The hardest part of the DataFrames.jl development process
Introduction
I have spent several years now helping to develop DataFrames.jl. There are many issues to consider when working on such a big package:
- providing new functionalities;
- avoiding and fixing bugs;
- performance;
- integration of the functionality with the rest of the data ecosystem;
- handling of conflicting expectations of the users;
- getting the reviews done (super hard for complex PRs);
- managing the release process and synchronization with dependencies;
- working consistently on different versions of Julia that should be supported;
- fixing bugs uncovered in other packages/Julia;
- ensuring proper documentation and tutorials;
- managing deprecated functionality;
- …
These are the issues that instantly come to my mind, and there are many more. A natural question is then: what is the hardest task?
From my experience it is deciding what API to provide (function names, their positional and keyword arguments, their return values), and the starting point in this area is deciding which functions should be made available to the user. The discussions about what the names of the functions we export should be were among the longest and hardest, because this is a social process that is hugely affected by the past experience of the contributors.
As this topic is very wide I decided to comment on three selected decisions in this scope:
- why we decided not to provide `head` and `tail` functions but use `first` and `last` instead;
- why we decided to provide `nrow` and `ncol` functions, while the `size` function gives the same information;
- why we provide both `filter` and `subset` functions that serve the same purpose.
I hope this will shed some light on the thought process we go through when making such decisions.
This post refers to the state of the DataFrames.jl package in its 1.1.1 release.
Why `head` and `tail` are not defined
`head` and `tail` are commonly used in other ecosystems (e.g. in R) to get the first or last few rows of a data frame. This gives us a first criterion:
Criterion 1: try to use function names that are natural for users to guess without having to learn them.
However, there are `first` and `last` functions in Julia Base that serve the same purpose. This gives us the following new criteria:
Criterion 2: stay consistent with Julia Base and try to add methods for functions already defined there (as users are likely to know them).
Criterion 3: minimize the number of verbs (function names) that are introduced by the package, as this makes the functionality easier to learn and maintain.
Criterion 4: avoid defining common and short names. Such names are very likely to conflict with names defined in users' code, leading to problems.
Criterion 5: if we want to add a method to a function defined in Julia Base, it must do the same thing (we do not want to change the contract established in Julia Base) and must not cause type piracy.
In this case we have the `first` and `last` functions in Julia Base, which are already defined to pick the first/last elements of a collection. Additionally, Julia Base defines `Base.tail`, which is currently not exported, but there is always a risk that this will change in the future (and it does a slightly different thing). Finally, `head` and `tail` are pretty common names that were likely to be already in use in users' code. A crucial consideration here is that if we had claimed such a common name many years ago, it would be less of a problem. However, some users now have thousands of lines of code using DataFrames.jl. In such a case, introducing a common name might cause a code base that worked previously to start failing.
All in all, we stick with the `first`/`last` combo, although it does not conform to Criterion 1 for some of the users (this is subjective, though).
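To make the comparison concrete, here is a minimal sketch that uses only Julia Base (no DataFrames.jl), assuming a Julia version where `first` and `last` accept a count on a vector (1.6 or later). The variable names are purely illustrative. DataFrames.jl adds methods to these same two functions for data frames, so `first(df, 2)` reads the same way.

```julia
# `first` and `last` in Base already mean "take the first/last n elements".
v = [10, 20, 30, 40, 50]

first_two = first(v, 2)   # first two elements of the vector
last_two = last(v, 2)     # last two elements of the vector

# Base.tail is unexported and works on tuples: it drops the first element,
# which is not the "last few rows" meaning that `tail` has in R.
rest = Base.tail((1, 2, 3))

println(first_two)
println(last_two)
println(rest)
```

The mismatch between `Base.tail` and the R-style `tail` is one reason why claiming that name in DataFrames.jl would have been risky.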
Why `nrow` and `ncol` are defined
In this case clearly we followed Criterion 1. Let us analyze why other criteria did not get that much weight. We are clearly breaking Criteria 2 and 3. Fortunately most likely we are not breaking Criterion 4 (names are short, but not likely to be commonly used). Criterion 5 is not applicable.
Let us dig into Criteria 2 and 3 a bit. Instead of writing `nrow(df)` you can alternatively write `size(df, 1)` or `size(df)[1]`. There are three reasons why this is not optimal:
- it is a bit more to type;
- you actually have two styles to get the number of rows (and I know from Stack Overflow that choosing between them was confusing; we do not want such a common operation to have two similar, but slightly different, styles);
- `nrow` does not require you to define an anonymous function if you want to pass it to some higher-order function; compare `combine(groupby(df, :col), nrow)` vs `combine(groupby(df, :col), x -> size(x, 1))`. The former is not only much easier to read, but it also has to be compiled only once, while the latter is recompiled every time if you are in global scope.
For these reasons defining `nrow` and `ncol` was accepted.
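The following short sketch puts the alternatives side by side. It assumes DataFrames.jl is installed; the data frame and its column names (`col`, `x`) are made up for illustration.

```julia
using DataFrames

# Hypothetical example data: three rows in group "a", one in group "b".
df = DataFrame(col = ["a", "b", "a", "a"], x = 1:4)

# Three equivalent ways to get the number of rows.
rows_a = nrow(df)
rows_b = size(df, 1)
rows_c = size(df)[1]

# Passing `nrow` directly: no anonymous function is needed,
# and the count lands in a column named :nrow.
by_name = combine(groupby(df, :col), nrow)

# The same count via an anonymous function; the output column
# gets an automatically generated name.
by_lambda = combine(groupby(df, :col), x -> size(x, 1))

println(by_name)
println(by_lambda)
```

Both `combine` calls compute the same group counts; the difference lies in readability, in the name of the output column, and in the recompilation cost of the anonymous function in global scope.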
Why both `filter` and `subset` are provided
Clearly there is a `filter` function in Julia Base, so why do we need a `subset` function? As I discussed in my last post, the two do not do exactly the same thing. Here a crucial consideration was following Criterion 5. Methods for the `filter` function defined in DataFrames.jl should follow the contract for `filter` defined in Julia Base. However, users wanted a function doing a similar thing, but with a different contract (e.g. a different order of arguments, the whole column passed to the predicate function, an option to skip `missing` values). Therefore we decided to keep `filter` consistent with Julia Base and add a new function `subset` that would follow what users wanted.
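The differences in contract listed above can be seen in a small sketch. It assumes DataFrames.jl 1.0 or later (where `subset` is available); the data frame and column names are hypothetical.

```julia
using DataFrames

# Hypothetical example data with one missing value in :x.
df = DataFrame(id = 1:4, x = [1, missing, 3, 4])

# filter: predicate first (as in Base), called once per row with a scalar.
# It must return true or false, so `missing` is handled by hand.
f = filter(:x => v -> !ismissing(v) && v > 2, df)

# subset: data frame first, the function receives the whole column
# (hence the broadcasted comparison), and skipmissing=true drops rows
# for which the result is missing.
s = subset(df, :x => v -> v .> 2, skipmissing = true)

println(f)
println(s)
```

Both calls keep the same rows here, but only `filter` mirrors the argument order and row-wise predicate of `filter` in Julia Base.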
Conclusions
Before I finish, let me add one more comment. What if we have a function name, like `describe`, that is not defined in Julia Base, but it is likely that several packages might want to add methods to it? In this case we need to have some umbrella package that only defines this function (possibly with a default implementation). In the data science ecosystem in Julia we have two such packages: DataAPI.jl and StatsAPI.jl.
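The umbrella-package pattern can be sketched with plain modules. Everything below is hypothetical: `MiniAPI`, `TablePkg`, and `ModelPkg` are stand-ins invented for illustration and do not reproduce the actual contents of DataAPI.jl or StatsAPI.jl.

```julia
# Hypothetical stand-in for an umbrella package:
# it owns the function name and nothing else.
module MiniAPI
    # Declares the generic function without adding any methods.
    function describe end
end

# Two hypothetical packages that each own a type and extend the shared function.
module TablePkg
    import ..MiniAPI
    struct Table
        nrows::Int
    end
    MiniAPI.describe(t::Table) = "Table with $(t.nrows) rows"
end

module ModelPkg
    import ..MiniAPI
    struct Model
        name::String
    end
    MiniAPI.describe(m::Model) = "Model $(m.name)"
end

table_summary = MiniAPI.describe(TablePkg.Table(3))
model_summary = MiniAPI.describe(ModelPkg.Model("ols"))

println(table_summary)
println(model_summary)
```

Because each package adds a method only for a type it owns, neither depends on the other and neither commits type piracy; both share one `describe` function instead of exporting two conflicting ones.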