How to get head or tail of a data frame in DataFrames.jl?
Introduction
This time I thought to make a post that will help people who know R or Python and are starting to use DataFrames.jl.
The point is that DataFrames.jl does not define head
and tail
functions,
but rather first
and last
. In other words I want to serve those
of you who have googled up "DataFrames.jl head"
and landed on this page.
All codes were tested under Julia 1.6.1 and DataFrames.jl 1.1.1.
What Julia Base offers
In Julia Base if you have some collection, e.g. a vector, you can use the
first
and last
functions to get their head/tail as follows:
julia> x = [1, 2, 3, 4, 5]
5-element Vector{Int64}:
1
2
3
4
5
julia> first(x)
1
julia> last(x)
5
julia> first(x, 2)
2-element Vector{Int64}:
1
2
julia> last(x, 2)
2-element Vector{Int64}:
4
5
As you can see there are two modes how these functions work:
- if you pass just a collection then its first/last element is extracted out and returned;
- if you pass a collection and a non-negative integer, then you get a collection holding the requested number of elements from the head/tail of the source collection.
Additionally the first
and last
functions have a nice feature that if you
pass them a non negative integer, then they do not fail, e.g.:
julia> first(x, 10)
5-element Vector{Int64}:
1
2
3
4
5
This is a very convenient feature, especially when working with data, where you do not know for sure that you will have enough observations. Consider a simple scenario: we want to pick top three products per product category, and some product categories might have less than tree entries (we will soon go back to this example when we switch to DataFrames.jl).
The point is that with ordinary indexing if you request more elements than the collection can hold you get:
julia> x[1:10]
ERROR: BoundsError: attempt to access 5-element Vector{Int64} at index [1:10]
so you have to write something like e.g.:
julia> x[1:min(10, end)]
5-element Vector{Int64}:
1
2
3
4
5
which, while still nice, is not so easy to reason about. A more advanced issue
with indexing that first
and last
resolve is that not all collections are
1-based, so you have to be careful if you write generic code.
How DataFrames.jl works
Since getting a head/tail of a data frame is essentially the operation that
first
and last
functions from Julia Base offer these functions have
special methods defined in DataFrames.jl (and thus there is no need to introduce
new head
and tail
functions).
Let us see this at work:
julia> using DataFrames
julia> df = DataFrame(group=[1, 1, 1, 1, 2, 2], id=1:6)
6×2 DataFrame
Row │ group id
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 1 2
3 │ 1 3
4 │ 1 4
5 │ 2 5
6 │ 2 6
julia> first(df)
DataFrameRow
Row │ group id
│ Int64 Int64
─────┼──────────────
1 │ 1 1
julia> last(df)
DataFrameRow
Row │ group id
│ Int64 Int64
─────┼──────────────
6 │ 2 6
julia> first(df, 2)
2×2 DataFrame
Row │ group id
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 1 2
julia> last(df, 2)
2×2 DataFrame
Row │ group id
│ Int64 Int64
─────┼──────────────
1 │ 2 5
2 │ 2 6
As you can see all is consistent with Julia Base. If you just pass a single
data frame as an argument to first
/last
you get a DataFrameRow
. If you
additionally pass an integer you get a DataFrame
.
Let us now show why the behavior “take at most n elements”(instead of “take exactly n elements”) is useful. Consider we want to pick top 3 ids per group. This is easily achievable as follows (we assume that the table was already sorted in some meaningful way):
julia> combine(groupby(df, :group), sdf -> first(sdf, 3))
5×2 DataFrame
Row │ group id
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 1 2
3 │ 1 3
4 │ 2 5
5 │ 2 6
As you can see the operation worked smoothly, picking top three rows for group
1
, but only two for group 2
(as this group held only two rows). This is the
behavior that I think is most convenient when doing split-apply-combine
strategy.
Conclusions
This post is aimed mostly at people starting to work with DataFrames.jl. There are two messages I wanted to convey:
- a direct one: use
first
andlast
functions if you want to get a head/tail of your data frame; - a more general one: DataFrames.jl is designed to follow the API provided by
Julia Base. So if some functions exist there (like
first
andlast
) you can in general expect that these functions will have methods defined also in DataFrames.jl that work consistently (of course provided that some operation makes sense in this context).