New features in DataFrames.jl 1.3: conclusion
Introduction
This is the last post from the series introducing features added in DataFrames.jl 1.3. There are many changes I have not covered yet. I have selected some of them that I think are most relevant in typical data wrangling workflows.
The topics I plan to discuss are:
- ordering of groups in groupby;
- unstack now supports the fill keyword argument;
- deprecations in deleting rows and sorting API.
The post was written under Julia 1.7.0, DataFrames.jl 1.3.1, Chain.jl 0.4.10, and FreqTables.jl 0.4.5.
Ordering of groups in groupby
Let me start by highlighting that GroupedDataFrame objects produced by the groupby function are indexable. This means that you can flexibly subset groups or re-order them. Here is an example:
julia> using DataFrames
julia> df = DataFrame(a=[1,1,2,2,2,3])
6×1 DataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 1
3 │ 2
4 │ 2
5 │ 2
6 │ 3
julia> gdf = groupby(df, :a, sort=true)
GroupedDataFrame with 3 groups based on key: a
First Group (2 rows): a = 1
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 1
⋮
Last Group (1 row): a = 3
Row │ a
│ Int64
─────┼───────
1 │ 3
julia> gdf[[3, 1]]
GroupedDataFrame with 2 groups based on key: a
First Group (1 row): a = 3
Row │ a
│ Int64
─────┼───────
1 │ 3
⋮
Last Group (2 rows): a = 1
Row │ a
│ Int64
─────┼───────
1 │ 1
2 │ 1
Here the gdf[[3, 1]] operation picked two groups from gdf, putting the group with original index 3 first and the group with original index 1 next.
This feature is often useful and gives a lot of flexibility to the users. Here is an example showing how you can sort groups based on non-key column values:
julia> df = DataFrame(a=[1,1,2,2,2,3], x=6:-1:1)
6×2 DataFrame
Row │ a x
│ Int64 Int64
─────┼──────────────
1 │ 1 6
2 │ 1 5
3 │ 2 4
4 │ 2 3
5 │ 2 2
6 │ 3 1
julia> gdf = groupby(df, :a, sort=true)
GroupedDataFrame with 3 groups based on key: a
First Group (2 rows): a = 1
Row │ a x
│ Int64 Int64
─────┼──────────────
1 │ 1 6
2 │ 1 5
⋮
Last Group (1 row): a = 3
Row │ a x
│ Int64 Int64
─────┼──────────────
1 │ 3 1
julia> gdf[sortperm([sum(sdf.x) for sdf in gdf])]
GroupedDataFrame with 3 groups based on key: a
First Group (1 row): a = 3
Row │ a x
│ Int64 Int64
─────┼──────────────
1 │ 3 1
⋮
Last Group (2 rows): a = 1
Row │ a x
│ Int64 Int64
─────┼──────────────
1 │ 1 6
2 │ 1 5
However, this means that one should be careful when relying on the ordering of groups in a GroupedDataFrame. For this reason, apart from integer indexing, GroupedDataFrame also supports indexing using values of grouping columns (in the example I show Tuple indexing, but NamedTuple and dictionary indexing are also supported):
julia> df = DataFrame(name=["Alice", "Bob"])
2×1 DataFrame
Row │ name
│ String
─────┼────────
1 │ Alice
2 │ Bob
julia> gdf = groupby(df, :name, sort=true)
GroupedDataFrame with 2 groups based on key: name
First Group (1 row): name = "Alice"
Row │ name
│ String
─────┼────────
1 │ Alice
⋮
Last Group (1 row): name = "Bob"
Row │ name
│ String
─────┼────────
1 │ Bob
julia> gdf[("Bob",)]
1×1 SubDataFrame
Row │ name
│ String
─────┼────────
1 │ Bob
Alternatively, you can use a special GroupKey object that is produced by the keys function (this option is the fastest):
julia> keys(gdf)
2-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
GroupKey: (name = "Alice",)
GroupKey: (name = "Bob",)
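To make the key-based lookup concrete, here is a minimal sketch that retrieves the "Bob" group in each of the ways mentioned above. It rebuilds the same gdf; the variable names are mine, and all four lookups should return the same one-row SubDataFrame.

```julia
using DataFrames

df = DataFrame(name=["Alice", "Bob"])
gdf = groupby(df, :name, sort=true)

# Tuple, NamedTuple and dictionary indexing use values of the grouping columns
by_tuple = gdf[("Bob",)]
by_namedtuple = gdf[(name="Bob",)]
by_dict = gdf[Dict(:name => "Bob")]

# GroupKey indexing: take a key object from keys(gdf) and index with it
key = keys(gdf)[2]   # with sort=true "Bob" is the second group
by_key = gdf[key]

# the grouping value stored in a GroupKey can be read as a property
key.name
```

Value-based indexing does not depend on the position of a group, so it keeps working after the groups are subset or re-ordered.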
So what is new in DataFrames.jl 1.3? Previously the user was not able to fully control the initial ordering of groups produced by groupby in all cases. Now this can be controlled by the sort keyword argument, and the API has been established with the following rules:
- if you pass sort=true, the groups will be sorted by values of grouping columns;
- if you pass sort=false, the groups will be produced in order of their first appearance in the source data frame;
- if you omit the sort keyword argument, the ordering of groups is undefined and will depend on the grouping algorithm used (DataFrames.jl has several grouping algorithms and tries to choose the fastest available).
To see that these options matter let me show two examples of grouping on an integer column:
julia> df = DataFrame(id=[2, 3, 1])
3×1 DataFrame
Row │ id
│ Int64
─────┼───────
1 │ 2
2 │ 3
3 │ 1
julia> keys(groupby(df, :id))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
GroupKey: (id = 1,)
GroupKey: (id = 2,)
GroupKey: (id = 3,)
julia> keys(groupby(df, :id, sort=true))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
GroupKey: (id = 1,)
GroupKey: (id = 2,)
GroupKey: (id = 3,)
julia> keys(groupby(df, :id, sort=false))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
GroupKey: (id = 2,)
GroupKey: (id = 3,)
GroupKey: (id = 1,)
julia> df = DataFrame(id=[2, 30, 1])
3×1 DataFrame
Row │ id
│ Int64
─────┼───────
1 │ 2
2 │ 30
3 │ 1
julia> keys(groupby(df, :id))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
GroupKey: (id = 2,)
GroupKey: (id = 30,)
GroupKey: (id = 1,)
julia> keys(groupby(df, :id, sort=true))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
GroupKey: (id = 1,)
GroupKey: (id = 2,)
GroupKey: (id = 30,)
julia> keys(groupby(df, :id, sort=false))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
GroupKey: (id = 2,)
GroupKey: (id = 30,)
GroupKey: (id = 1,)
As you can see, passing the sort keyword argument produces a consistent ordering. However, when it is not passed, the two examples produce different orderings: in the first the groups come out sorted, and in the second they come out in order of first appearance.
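If later code relies on group positions, it seems safest to always pass sort explicitly. The sketch below reuses the two data frames from above; group_ids is a hypothetical helper of mine (not part of DataFrames.jl) that collects the key values in group order.

```julia
using DataFrames

# Hypothetical helper: values of the :id key in the order groups are stored
group_ids(df; kwargs...) = [key.id for key in keys(groupby(df, :id; kwargs...))]

df1 = DataFrame(id=[2, 3, 1])
df2 = DataFrame(id=[2, 30, 1])

# sort=true: groups ordered by the values of the grouping column
group_ids(df1, sort=true)    # [1, 2, 3]
group_ids(df2, sort=true)    # [1, 2, 30]

# sort=false: groups in order of first appearance in the source data frame
group_ids(df1, sort=false)   # [2, 3, 1]
group_ids(df2, sort=false)   # [2, 30, 1]
```

The call without sort is deliberately left out, since its result is undefined and may differ between data sets.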
unstack now supports fill keyword argument
The change in unstack is pretty simple, but I think it will be useful in many common scenarios. Now you can specify what value should be used to fill missing combinations of data.
Let me give a practical example. Assume you have a data frame with several observations of people's hair color and eye color:
julia> df = DataFrame(hair=["brown", "yellow", "brown", "brown"],
                      eyes=["blue", "blue", "green", "blue"])
4×2 DataFrame
Row │ hair eyes
│ String String
─────┼────────────────
1 │ brown blue
2 │ yellow blue
3 │ brown green
4 │ brown blue
You can create a frequency table of this data with the FreqTables.jl package:
julia> using FreqTables
julia> freqtable(df, :hair, :eyes)
2×2 Named Matrix{Int64}
hair ╲ eyes │ blue green
────────────┼─────────────
brown │ 2 1
yellow │ 1 0
You got a matrix with the desired result. However, what if you wanted to get a DataFrame instead? In the past you would do:
julia> using Chain
julia> @chain df begin
           groupby([:hair, :eyes], sort=true)
           combine(nrow)
           unstack(:hair, :eyes, :nrow)
       end
2×3 DataFrame
Row │ hair blue green
│ String Int64? Int64?
─────┼─────────────────────────
1 │ brown 2 1
2 │ yellow 1 missing
The only problem is that you get missing instead of 0 in the cell where there were no observations. To get 0 you would write:
julia> @chain df begin
           groupby([:hair, :eyes], sort=true)
           combine(nrow)
           unstack(:hair, :eyes, :nrow)
           coalesce.(0)
       end
2×3 DataFrame
Row │ hair blue green
│ String Int64 Int64
─────┼──────────────────────
1 │ brown 2 1
2 │ yellow 1 0
Since DataFrames.jl 1.3 the pipeline is easier, as you can pass the fill=0 keyword argument to unstack:
julia> @chain df begin
           groupby([:hair, :eyes], sort=true)
           combine(nrow)
           unstack(:hair, :eyes, :nrow, fill=0)
       end
2×3 DataFrame
Row │ hair blue green
│ String Int64 Int64
─────┼──────────────────────
1 │ brown 2 1
2 │ yellow 1 0
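As a quick check of the difference, the sketch below runs the same pipeline without Chain.jl and compares the green column in both variants. The variable names counts, wide_missing and wide_filled are mine.

```julia
using DataFrames

df = DataFrame(hair=["brown", "yellow", "brown", "brown"],
               eyes=["blue", "blue", "green", "blue"])

# long table with one row per observed (hair, eyes) combination
counts = combine(groupby(df, [:hair, :eyes], sort=true), nrow)

# without fill the unobserved combination is stored as missing
wide_missing = unstack(counts, :hair, :eyes, :nrow)

# with fill=0 the unobserved combination is stored as 0
wide_filled = unstack(counts, :hair, :eyes, :nrow, fill=0)

eltype(wide_missing.green)   # allows missing
eltype(wide_filled.green)    # plain integer type, no missing
```

Besides saving the coalesce step, the filled variant gives columns whose element type does not allow missing, which matches the Int64 column headers in the output above.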
Deprecations in deleting rows and sorting
The deprecation in row deletion is simple. The delete! function is deprecated in favor of the deleteat! function. This change was made to make the DataFrames.jl API consistent with the Julia Base API (where delete! is defined to remove a mapping for the given key in a collection, while deleteat! removes items at the given indices).
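A minimal sketch of the distinction, using a small made-up data frame and dictionary:

```julia
using DataFrames

df = DataFrame(id=[1, 2, 3, 4, 5])

deleteat!(df, 2)        # drop the row at index 2; ids are now 1, 3, 4, 5
deleteat!(df, [1, 3])   # drop rows 1 and 3 of what remains
df.id                   # [3, 5]

# Base delete! works on keys, not on positions
d = Dict(:a => 1, :b => 2)
delete!(d, :a)          # removes the mapping for key :a
```

Both functions mutate their argument in place, so the indices passed to the second deleteat! call refer to the data frame as it is after the first call.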
The deprecation in sorting API is more subtle. Consider the following data frame:
julia> df = DataFrame(x=[1, 2, 2, 1], y =[2, 2, 1, 1], z=1:4)
4×3 DataFrame
Row │ x y z
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 1
2 │ 2 2 2
3 │ 2 1 3
4 │ 1 1 4
If you sort it without passing the list of columns on which it should be sorted, by default a lexicographic sort on all columns is performed:
julia> sort(df)
4×3 DataFrame
Row │ x y z
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 4
2 │ 1 2 1
3 │ 2 1 3
4 │ 2 2 2
This is the same as:
julia> sort(df, All())
4×3 DataFrame
Row │ x y z
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 4
2 │ 1 2 1
3 │ 2 1 3
4 │ 2 2 2
However, to our surprise, currently when you ask for sorting on no columns you also get a data frame sorted on all columns:
julia> sort(df, Cols())
┌ Warning: When empty column selector is passed ordering is done on all colums. This behavior is deprecated and will change in the future.
│ caller = sortperm(df::DataFrame, cols::Cols{Tuple{}}; alg::Nothing, lt::typeof(isless), by::typeof(identity), rev::Bool, order::Base.Order.ForwardOrdering) at sort.jl:579
└ @ DataFrames ~/.julia/packages/DataFrames/BM4OQ/src/abstractdataframe/sort.jl:579
4×3 DataFrame
Row │ x y z
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 1 4
2 │ 1 2 1
3 │ 2 1 3
4 │ 2 2 2
We think that this behavior is incorrect, and in the future sorting on no columns will produce a result identical to the input data frame (no sorting will be performed).
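Until the behavior changes, code that builds the list of sorting columns programmatically could guard against the empty case. The helper sort_on below is hypothetical (it is not part of DataFrames.jl) and only sketches what the planned behavior would look like from the caller's side.

```julia
using DataFrames

# Hypothetical helper: sort on the given columns, but treat an empty
# list as "no sorting" and return a copy in the original row order.
function sort_on(df::AbstractDataFrame, cols::AbstractVector{Symbol})
    isempty(cols) && return copy(df)
    return sort(df, cols)
end

df = DataFrame(x=[1, 2, 2, 1], y=[2, 2, 1, 1], z=1:4)

unsorted = sort_on(df, Symbol[])   # rows stay in the order z = 1, 2, 3, 4
sorted = sort_on(df, [:x, :y])     # lexicographic sort on x, then y
```

Because the empty selector is never passed to sort, this variant also avoids the deprecation warning shown above.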
Conclusions
This post concludes a series of reviews of new features in DataFrames.jl release 1.3. I have not covered everything that was introduced; a complete list of changes can be found in the NEWS.md file.
I hope you will enjoy using the package! Happy data wrangling in year 2022!