Introduction

I have spent several years now helping to develop DataFrames.jl. There are many issues to consider when working on such a big package:

  • providing new functionalities;
  • avoiding and fixing bugs;
  • performance;
  • integration of the functionality with the rest of data ecosystem;
  • handling of conflicting expectations of the users;
  • getting the reviews done (super hard for complex PR’s);
  • managing release process and synchronization with dependencies;
  • working consistently on different versions of Julia that should be supported;
  • fixing bugs uncovered in other packages/Julia;
  • ensuring proper documentation and tutorials;
  • managing deprecated functionality;

These are the issues that instantly come to my mind and there are many more. A natural question is then — what is the hardest task?

From my experience it is deciding what API to provide (function names, their positional and keyword arguments, their return values), and the starting point in this area is deciding which functions should be made available to the user. The discussions about what should be the names of functions we export were one of the longest and hardest because this is a social process, that is hugely affected by past experience of the contributors.

As this topic is very wide I decided to comment on three selected decisions in this scope:

  • why we decided not to provide head and tail functions but use first and last instead;
  • why we decided to provide nrow and ncol functions, while size function gives the same information;
  • why we provide both filter and subset functions that serve the same purpose.

I hope this will shed some light in the mental process we go through when making such decisions.

This post refers to the state of the DataFrames.jl package in its 1.1.1 release.

Why head and tail are not defined

head and tail are commonly used in other ecosystems (e.g. in R) to get few first/last rows of a data frame. This gives us a first criterion:

Criterion 1: try to use function names that are natural for users to guess without having to learn them.

However, there are first and last functions in Julia Base that serve the same purpose. This gives us the following new criteria:

Criterion 2: stay consistent with Julia Base and try to add methods for functions already defined there (as users are likely to know them).

Criterion 3: minimize number of verbs (function names) that are introduced by the package, as this makes the functionality easier to learn and maintain.

Criterion 4: avoid defining common and short names. Such names are very likely to conflict with names defined user’s code leading to problems.

Criterion 5: if we want to add a method to a function defined in Julia Base will it do the same thing (we do not want to change the contract established in Julia Base) and not cause type piracy.

In this case we have first and last functions in the Julia Base that already are defined to allow to pick first/last elements of the collection. Additionally Julia Base defines Base.tail, which is not exported currently, but there is always a risk that this would change in the future (and it does a bit different thing). Finally head and tail are pretty common names, that were likely to be already in use in user’s code. Here a crucial consideration is that if we claimed some common name many years ago it would be less a problem. However, some users have thousands of code using DataFrames.jl. In such a case introducing a common name might cause code base that worked previously to start failing.

All in all - we stick to first/last combo although it does not conform to Criterion 1 for some of the users (this is subjective though).

Why nrow and ncol are defined

In this case clearly we followed Criterion 1. Let us analyze why other criteria did not get that much weight. We are clearly breaking Criteria 2 and 3. Fortunately most likely we are not breaking Criterion 4 (names are short, but not likely to be commonly used). Criterion 5 is not applicable.

Let us dig into Criteria 2 and 3 a bit. Instead of writing nrow(df) you can alternatively write size(df, 1) or size(df)[1]. There are three reasons why this is not optimal:

  • it is a bit more to type;
  • you actually have two styles to get number of rows (and I know from StactOverflow that which one to choose was confusing — we do not want for such a common operation to have two similar, but a bit different styles);
  • nrow does not require you do define an anonymous function if you want to pass it to some higher order function; compare:
      combine(groupby(df, :col), nrow)
    

    vs

      combine(groupby(df, :col), x -> size(x, 1))
    

It is not only much easier to read but also the former has to be compiled only once while the latter is recompiled every time if you are in global scope.

For these reasons defining nrow and ncol was accepted.

Why both filter and subset are provided

Clearly there is a filter function in Julia Base, so why do we need a subset function? I have discussed the differences between them in my last post so they do not do the same. Here a crucial consideration was following Critetion 5. Methods for the filter function defined in DataFrames.jl should follow the contract for filter defined in Julia Base. However, users wanted a function doing a similar thing, but with a different contract (e.g. different order of arguments, whole column passed to the predicate function, option to skip missing values). Therefore we decided to keep filter consistent with Julia Base and add a new function subset that would follow what users wanted.

Conclusions

Before I finish let me add one more comment. What if we have a function name, like describe, that is not defined in Julia Base, but it is likely that several packages might want add methods to it? In this case we need to have some package umbrella that only defines this function (possibly with a default implementation). In data science related ecosystem in Julia we have two such packages: DataAPI.jl and StatsAPI.jl.