Introduction

This time I thought to make a post that will help people who know R or Python and are starting to use DataFrames.jl.

The point is that DataFrames.jl does not define head and tail functions, but rather first and last. In other words I want to serve those of you who have googled up "DataFrames.jl head" and landed on this page.

All codes were tested under Julia 1.6.1 and DataFrames.jl 1.1.1.

What Julia Base offers

In Julia Base if you have some collection, e.g. a vector, you can use the first and last functions to get their head/tail as follows:

julia> x = [1, 2, 3, 4, 5]
5-element Vector{Int64}:
 1
 2
 3
 4
 5

julia> first(x)
1

julia> last(x)
5

julia> first(x, 2)
2-element Vector{Int64}:
 1
 2

julia> last(x, 2)
2-element Vector{Int64}:
 4
 5

As you can see there are two modes how these functions work:

  • if you pass just a collection then its first/last element is extracted out and returned;
  • if you pass a collection and a non-negative integer, then you get a collection holding the requested number of elements from the head/tail of the source collection.

Additionally the first and last functions have a nice feature that if you pass them a non negative integer, then they do not fail, e.g.:

julia> first(x, 10)
5-element Vector{Int64}:
 1
 2
 3
 4
 5

This is a very convenient feature, especially when working with data, where you do not know for sure that you will have enough observations. Consider a simple scenario: we want to pick top three products per product category, and some product categories might have less than tree entries (we will soon go back to this example when we switch to DataFrames.jl).

The point is that with ordinary indexing if you request more elements than the collection can hold you get:

julia> x[1:10]
ERROR: BoundsError: attempt to access 5-element Vector{Int64} at index [1:10]

so you have to write something like e.g.:

julia> x[1:min(10, end)]
5-element Vector{Int64}:
 1
 2
 3
 4
 5

which, while still nice, is not so easy to reason about. A more advanced issue with indexing that first and last resolve is that not all collections are 1-based, so you have to be careful if you write generic code.

How DataFrames.jl works

Since getting a head/tail of a data frame is essentially the operation that first and last functions from Julia Base offer these functions have special methods defined in DataFrames.jl (and thus there is no need to introduce new head and tail functions).

Let us see this at work:

julia> using DataFrames

julia> df = DataFrame(group=[1, 1, 1, 1, 2, 2], id=1:6)
6×2 DataFrame
 Row │ group  id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     1      3
   4 │     1      4
   5 │     2      5
   6 │     2      6

julia> first(df)
DataFrameRow
 Row │ group  id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1

julia> last(df)
DataFrameRow
 Row │ group  id
     │ Int64  Int64
─────┼──────────────
   6 │     2      6

julia> first(df, 2)
2×2 DataFrame
 Row │ group  id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2

julia> last(df, 2)
2×2 DataFrame
 Row │ group  id
     │ Int64  Int64
─────┼──────────────
   1 │     2      5
   2 │     2      6

As you can see all is consistent with Julia Base. If you just pass a single data frame as an argument to first/last you get a DataFrameRow. If you additionally pass an integer you get a DataFrame.

Let us now show why the behavior “take at most n elements”(instead of “take exactly n elements”) is useful. Consider we want to pick top 3 ids per group. This is easily achievable as follows (we assume that the table was already sorted in some meaningful way):

julia> combine(groupby(df, :group), sdf -> first(sdf, 3))
5×2 DataFrame
 Row │ group  id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     1      3
   4 │     2      5
   5 │     2      6

As you can see the operation worked smoothly, picking top three rows for group 1, but only two for group 2 (as this group held only two rows). This is the behavior that I think is most convenient when doing split-apply-combine strategy.

Conclusions

This post is aimed mostly at people starting to work with DataFrames.jl. There are two messages I wanted to convey:

  • a direct one: use first and last functions if you want to get a head/tail of your data frame;
  • a more general one: DataFrames.jl is designed to follow the API provided by Julia Base. So if some functions exist there (like first and last) you can in general expect that these functions will have methods defined also in DataFrames.jl that work consistently (of course provided that some operation makes sense in this context).