Sorting data by a transformation of columns in DataFrames.jl
Introduction
Recently I have several times received a question how to sort a data frame based on some transformation of its columns. In this post I will show you the way it can be currently done.
The post was written under Julia 1.6.0-rc1 and DataFrames 0.22.5.
Before we start create a data frame we will work with. It contains coordinates of some points in a three-dimensional space.
julia> using DataFrames
julia> using LinearAlgebra
julia> using Random
julia> Random.seed!(1234)
MersenneTwister(1234)
julia> df = DataFrame(x=rand(10), y=rand(10), z=rand(10)) .- 0.5
10×3 DataFrame
Row │ x y z
│ Float64 Float64 Float64
─────┼──────────────────────────────────
1 │ 0.0908446 0.148882 0.450498
2 │ 0.266797 -0.489094 0.46467
3 │ 0.0662374 -0.433577 0.445775
4 │ -0.0399147 0.456753 0.289904
5 │ 0.294026 0.146691 0.32116
6 │ 0.354147 -0.387514 -0.46584
7 │ -0.299414 -0.223979 -0.405456
8 │ -0.201386 0.151664 -0.185074
9 │ -0.253163 -0.443358 -0.37219
10 │ 0.0796722 0.342714 -0.125813
Basic sorting
You can sort this data frame by column :y
like this:
julia> sort(df, :y)
10×3 DataFrame
Row │ x y z
│ Float64 Float64 Float64
─────┼──────────────────────────────────
1 │ 0.266797 -0.489094 0.46467
2 │ -0.253163 -0.443358 -0.37219
3 │ 0.0662374 -0.433577 0.445775
4 │ 0.354147 -0.387514 -0.46584
5 │ -0.299414 -0.223979 -0.405456
6 │ 0.294026 0.146691 0.32116
7 │ 0.0908446 0.148882 0.450498
8 │ -0.201386 0.151664 -0.185074
9 │ 0.0796722 0.342714 -0.125813
10 │ -0.0399147 0.456753 0.289904
If you want to sort it in reverse just do:
julia> sort(df, :y, rev=true)
10×3 DataFrame
Row │ x y z
│ Float64 Float64 Float64
─────┼──────────────────────────────────
1 │ -0.0399147 0.456753 0.289904
2 │ 0.0796722 0.342714 -0.125813
3 │ -0.201386 0.151664 -0.185074
4 │ 0.0908446 0.148882 0.450498
5 │ 0.294026 0.146691 0.32116
6 │ -0.299414 -0.223979 -0.405456
7 │ 0.354147 -0.387514 -0.46584
8 │ 0.0662374 -0.433577 0.445775
9 │ -0.253163 -0.443358 -0.37219
10 │ 0.266797 -0.489094 0.46467
or
julia> sort(df, order(:y, rev=true))
10×3 DataFrame
Row │ x y z
│ Float64 Float64 Float64
─────┼──────────────────────────────────
1 │ -0.0399147 0.456753 0.289904
2 │ 0.0796722 0.342714 -0.125813
3 │ -0.201386 0.151664 -0.185074
4 │ 0.0908446 0.148882 0.450498
5 │ 0.294026 0.146691 0.32116
6 │ -0.299414 -0.223979 -0.405456
7 │ 0.354147 -0.387514 -0.46584
8 │ 0.0662374 -0.433577 0.445775
9 │ -0.253163 -0.443358 -0.37219
10 │ 0.266797 -0.489094 0.46467
Using order
is useful if you would want to sort a data frame by several columns
and apply different ordering rules to them.
If you want to apply a transformation to a single column and sort it based on
the transformed values use the by
option:
julia> sort(df, :y, by=abs)
10×3 DataFrame
Row │ x y z
│ Float64 Float64 Float64
─────┼──────────────────────────────────
1 │ 0.294026 0.146691 0.32116
2 │ 0.0908446 0.148882 0.450498
3 │ -0.201386 0.151664 -0.185074
4 │ -0.299414 -0.223979 -0.405456
5 │ 0.0796722 0.342714 -0.125813
6 │ 0.354147 -0.387514 -0.46584
7 │ 0.0662374 -0.433577 0.445775
8 │ -0.253163 -0.443358 -0.37219
9 │ -0.0399147 0.456753 0.289904
10 │ 0.266797 -0.489094 0.46467
or equivalently
julia> sort(df, order(:y, by=abs))
10×3 DataFrame
Row │ x y z
│ Float64 Float64 Float64
─────┼──────────────────────────────────
1 │ 0.294026 0.146691 0.32116
2 │ 0.0908446 0.148882 0.450498
3 │ -0.201386 0.151664 -0.185074
4 │ -0.299414 -0.223979 -0.405456
5 │ 0.0796722 0.342714 -0.125813
6 │ 0.354147 -0.387514 -0.46584
7 │ 0.0662374 -0.433577 0.445775
8 │ -0.253163 -0.443358 -0.37219
9 │ -0.0399147 0.456753 0.289904
10 │ 0.266797 -0.489094 0.46467
These patterns naturally extend to multiple columns, and sorting is performed lexicographically. Here is an example:
julia> df2 = DataFrame(x=rand(Bool, 16), y=rand(Bool, 16), z=rand(Bool, 16))
16×3 DataFrame
Row │ x y z
│ Bool Bool Bool
─────┼─────────────────────
1 │ false true false
2 │ true true true
3 │ true false true
4 │ false false true
5 │ true false true
6 │ true false false
7 │ true false false
8 │ false false false
9 │ false false true
10 │ false true true
11 │ true false true
12 │ false false false
13 │ true false false
14 │ true false false
15 │ false false false
16 │ false true false
julia> sort(df2, [:y, order(:z, rev=true), :x])
16×3 DataFrame
Row │ x y z
│ Bool Bool Bool
─────┼─────────────────────
1 │ false false true
2 │ false false true
3 │ true false true
4 │ true false true
5 │ true false true
6 │ false false false
7 │ false false false
8 │ false false false
9 │ true false false
10 │ true false false
11 │ true false false
12 │ true false false
13 │ false true true
14 │ true true true
15 │ false true false
16 │ false true false
However, a question is what if I want to sort a data frame on a function of multiple columns taken together?
Sorting on multiple columns considered jointly
Going back to our df
data frame what if we wanted to sort it by the distance
from the origin?
In this case the sortperm
function is useful. What you need to
do is to create a temporary object, get its sortperm
, and apply it to a
source data frame. Here is how it is done in practice:
julia> df[sortperm(norm.(eachrow(df))), :]
10×3 DataFrame
Row │ x y z
│ Float64 Float64 Float64
─────┼──────────────────────────────────
1 │ -0.201386 0.151664 -0.185074
2 │ 0.0796722 0.342714 -0.125813
3 │ 0.294026 0.146691 0.32116
4 │ 0.0908446 0.148882 0.450498
5 │ -0.0399147 0.456753 0.289904
6 │ -0.299414 -0.223979 -0.405456
7 │ 0.0662374 -0.433577 0.445775
8 │ -0.253163 -0.443358 -0.37219
9 │ 0.354147 -0.387514 -0.46584
10 │ 0.266797 -0.489094 0.46467
A nice thing is that sortperm
also works for data frames, so if you wanted to
sort the data frame by the sign of :x
and then by the sum of :y
and :z
columns you could write:
julia> df[sortperm(select(df, :x => ByRow(sign), [:y, :z] => +)), :]
10×3 DataFrame
Row │ x y z
│ Float64 Float64 Float64
─────┼──────────────────────────────────
1 │ -0.253163 -0.443358 -0.37219
2 │ -0.299414 -0.223979 -0.405456
3 │ -0.201386 0.151664 -0.185074
4 │ -0.0399147 0.456753 0.289904
5 │ 0.354147 -0.387514 -0.46584
6 │ 0.266797 -0.489094 0.46467
7 │ 0.0662374 -0.433577 0.445775
8 │ 0.0796722 0.342714 -0.125813
9 │ 0.294026 0.146691 0.32116
10 │ 0.0908446 0.148882 0.450498
Conclusion
The thing to remember is that because data frame fully supports standard
indexing like a matrix you can easily reorder it using the sortperm
function
applied to an object different than the original data frame.
However, since this feature request is raised quite often we are currently
discussing how to add a support to it in a standard sort
syntax. If
you are interested in the details you can check this issue.