I have spent several years now helping to develop DataFrames.jl. There are many issues to consider when working on such a big package:
- providing new functionalities;
- avoiding and fixing bugs;
- integration of the functionality with the rest of the data ecosystem;
- handling of conflicting expectations of the users;
- getting the reviews done (super hard for complex PRs);
- managing release process and synchronization with dependencies;
- working consistently on different versions of Julia that should be supported;
- fixing bugs uncovered in other packages/Julia;
- ensuring proper documentation and tutorials;
- managing deprecated functionality.
These are the issues that instantly come to my mind, and there are many more. A natural question is then: what is the hardest task?
From my experience, it is deciding what API to provide (function names, their positional and keyword arguments, their return values), and the starting point in this area is deciding which functions should be made available to the user. The discussions about the names of the functions we export were among the longest and hardest, because this is a social process that is hugely affected by the past experience of the contributors.
As this topic is very wide, I decided to comment on three selected decisions in this scope:
- why we decided not to provide head and tail functions but use first and last instead;
- why we decided to provide nrow and ncol although the size function gives the same information;
- why we provide both filter and subset functions that serve the same purpose.
I hope this will shed some light on the mental process we go through when making such decisions.
This post refers to the state of the DataFrames.jl package in its 1.1.1 release.
head and tail are not defined
head and tail are commonly used in other ecosystems (e.g. in R) to get the first/last few rows of a data frame. This gives us a first criterion:
Criterion 1: try to use function names that are natural for users to guess without having to learn them.
However, there are first and last functions in Julia Base that serve the same purpose. This gives us the following new criteria:
Criterion 2: stay consistent with Julia Base and try to add methods for functions already defined there (as users are likely to know them).
Criterion 3: minimize number of verbs (function names) that are introduced by the package, as this makes the functionality easier to learn and maintain.
Criterion 4: avoid defining common and short names. Such names are very likely to conflict with names defined in users' code, leading to problems.
Criterion 5: if we want to add a method to a function defined in Julia Base, it must do the same thing (we do not want to change the contract established in Julia Base) and must not cause type piracy.
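To make Criteria 2 and 5 more concrete, here is a minimal sketch of adding methods to Base functions for a type that a package owns. TinyTable is a hypothetical type invented for this illustration; it is not part of DataFrames.jl.

```julia
# TinyTable is a hypothetical type that "our" package owns.
struct TinyTable
    ids::Vector{Int}
end

# We own TinyTable, so adding methods to Base.first/Base.last is not type piracy.
# Both keep the Base contract: return the first/last n elements, in order.
Base.first(t::TinyTable, n::Integer) = TinyTable(t.ids[1:min(n, end)])
Base.last(t::TinyTable, n::Integer) = TinyTable(t.ids[max(1, end - n + 1):end])

t = TinyTable([10, 20, 30, 40])
first_two = first(t, 2).ids   # [10, 20]
last_two = last(t, 2).ids     # [30, 40]

# By contrast, a package defining e.g. Base.first(v::Vector{Int}, n::Integer)
# would commit type piracy: neither the function nor the type would be its own.
```

Users who already know first and last from Base do not have to learn anything new here, which is the point of Criteria 2 and 3.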
In this case we have first and last functions in Julia Base that are already defined to pick the first/last elements of a collection. Additionally, Julia Base defines Base.tail, which is not exported currently, but there is always a risk that this would change in the future (and it does a slightly different thing). Also, head and tail are pretty common names that were likely to be already in use in users' code. Here a crucial consideration is that if we had claimed some common name many years ago, it would be less of a problem. However, some users have thousands of lines of code using DataFrames.jl. In such a case, introducing a common name might cause a code base that worked previously to start failing.
All in all, we stick to the first/last combo, although it does not conform to Criterion 1 for some users (this is subjective, though).
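As a short illustration (the data frame below is made up for this example), this is roughly how the first/last combo covers the head/tail use case, and how Base.tail does something different:

```julia
using DataFrames

df = DataFrame(id = 1:5, name = ["a", "b", "c", "d", "e"])

top = first(df, 2)       # first two rows, like head(df, 2) in R
bottom = last(df, 2)     # last two rows, like tail(df, 2) in R

# Base.tail is unrelated to data frames: it drops the first element of a tuple.
rest = Base.tail((1, 2, 3))   # (2, 3)
```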
nrow and ncol are defined
In this case clearly we followed Criterion 1. Let us analyze why other criteria did not get that much weight. We are clearly breaking Criteria 2 and 3. Fortunately most likely we are not breaking Criterion 4 (names are short, but not likely to be commonly used). Criterion 5 is not applicable.
Let us dig into Criteria 2 and 3 a bit. Instead of writing nrow(df) you can alternatively write size(df, 1) or take the first element of size(df). There are three reasons why this is not optimal:
- it is a bit more to type;
- you actually have two styles to get the number of rows (and I know from Stack Overflow that choosing between them was confusing; we do not want such a common operation to have two similar but slightly different styles);
- nrow does not require you to define an anonymous function if you want to pass it to some higher-order function; compare:
combine(groupby(df, :col), nrow)
combine(groupby(df, :col), x -> size(x, 1))
The former is not only much easier to read; it also has to be compiled only once, while the latter is recompiled every time you run it in global scope, because each newly defined anonymous function has its own type.
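The comparison above can be tried on a small example. The data frame and the column names col and val are assumptions made only for this sketch:

```julia
using DataFrames

df = DataFrame(col = ["a", "a", "b"], val = [1, 2, 3])

n_short = nrow(df)       # 3
n_size = size(df, 1)     # 3, the same information through the Base function
dims = size(df)          # (3, 2)

gdf = groupby(df, :col)

# Passing nrow directly: the count column is named :nrow.
counts_named = combine(gdf, nrow)

# Passing an anonymous function: same counts, but the column gets an
# auto-generated name and the function is a fresh object each time.
counts_anon = combine(gdf, x -> size(x, 1))
```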
For these reasons defining nrow and ncol was accepted.
filter and subset are provided
Clearly there is a filter function in Julia Base, so why do we need a subset function? I have discussed the differences between them in my last post, so it is clear that they do not do the same thing. Here a crucial consideration was following Criterion 5. Methods for the filter function defined in DataFrames.jl should follow the contract for filter defined in Julia Base.
However, users wanted a function doing a similar thing, but with a different contract (e.g. different order of arguments, whole column passed to the predicate function, option to skip missing values). Therefore we decided to keep filter consistent with Julia Base and add a new function, subset, that would follow what users wanted.
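A minimal sketch of the two contracts side by side, on a made-up data frame with a missing value (the column name x is an assumption for this example):

```julia
using DataFrames

df = DataFrame(x = [1, 2, missing, 4])

# filter follows Base.filter: the predicate comes first and sees one value at
# a time. It must return true or false, so missing is handled explicitly.
kept_filter = filter(:x => v -> !ismissing(v) && v > 1, df)

# subset takes the data frame first and passes the whole column, so the
# comparison is broadcast; skipmissing = true treats missing results as false.
kept_subset = subset(df, :x => col -> col .> 1, skipmissing = true)
```

Both calls keep the rows where x is 2 and 4; the difference is in the argument order, in what the function receives, and in how missing values are handled.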
Before I finish let me add one more comment. What if we have a function name, like describe, that is not defined in Julia Base, but it is likely that several packages might want to add methods to it? In this case we need to have some umbrella package that only defines this function (possibly with a default implementation). In the data science ecosystem in Julia we have two such packages: DataAPI.jl and StatsAPI.jl.
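The umbrella pattern can be sketched with plain modules. All names below (TinyAPI, TablesPkg, ModelsPkg and their types) are hypothetical stand-ins invented for this illustration; they are not the actual contents of DataAPI.jl or StatsAPI.jl.

```julia
# Hypothetical umbrella package: it only declares the generic function.
module TinyAPI
    function describe end   # no methods yet; each package adds its own
end

# Two hypothetical packages that depend on TinyAPI but not on each other.
module TablesPkg
    import ..TinyAPI
    struct Table
        ncols::Int
    end
    TinyAPI.describe(t::Table) = "table with $(t.ncols) columns"
end

module ModelsPkg
    import ..TinyAPI
    struct Model
        nparams::Int
    end
    TinyAPI.describe(m::Model) = "model with $(m.nparams) parameters"
end

# User code calls a single function and dispatch picks the right method.
table_info = TinyAPI.describe(TablesPkg.Table(3))
model_info = TinyAPI.describe(ModelsPkg.Model(2))
```

Because both packages extend the same function object, they do not clash when loaded together, and neither has to depend on the other.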