Today I want to comment on a recurring topic that DataFrames.jl users raise. The question is how one should transform multiple columns of a data frame using operation specification syntax.
The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.
In DataFrames.jl the combine, select, and transform functions allow users to pass requests for data transformation using operation specification syntax. This syntax is feature-rich, and you can find its description, for example, here. Today I want to focus on its principal concept.
In its general form, each request for an operation on data follows the (E)xtract-(T)ransform-(L)oad pattern. That means that we need to specify: (1) the source columns to extract, (2) the transformation function to apply, and (3) the target columns in which the result is stored.
These three parts are syntactically expressed using the following form:
[source columns specification] => [transformation function] => [target columns specification]
Let me give an example. Assume you have the following data:
julia> using DataFrames
julia> df = DataFrame(reshape(1:15, 5, 3), :auto)
5×3 DataFrame
Row │ x1 x2 x3
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 6 11
2 │ 2 7 12
3 │ 3 8 13
4 │ 4 9 14
5 │ 5 10 15
We want to compute the sum of column "x1" and store it in a column named "x1_sum". Since the sum function performs the required aggregation, the operation specification should be:
"x1" => sum => "x1_sum"
Let us check it with the combine function:
julia> combine(df, "x1" => sum => "x1_sum")
1×1 DataFrame
Row │ x1_sum
│ Int64
─────┼────────
1 │ 15
In this syntax it is important to note two things: first, the "x1" column as a whole was passed to the sum function (as we want to compute the sum of its elements); second, the "x1" column was passed as a single positional argument to the sum function.

Two natural questions arise: (1) what if we want the function to be applied to each element of a column separately, and (2) what if we want to pass multiple columns to the transformation function? We will now investigate these two dimensions.
Vectorization in DataFrames.jl is easy. Just wrap the function you use in the ByRow object. Here is an example:
julia> combine(df, "x1" => string => "x1_str")
1×1 DataFrame
Row │ x1_str
│ String
─────┼─────────────────
1 │ [1, 2, 3, 4, 5]
julia> combine(df, "x1" => ByRow(string) => "x1_strs")
5×1 DataFrame
Row │ x1_strs
│ String
─────┼─────────
1 │ 1
2 │ 2
3 │ 3
4 │ 4
5 │ 5
Note that "x1" => string => "x1_str" passed the whole "x1" column to the string function, so we got a single "[1, 2, 3, 4, 5]" string in the output. In contrast, "x1" => ByRow(string) => "x1_strs" passed each element of the "x1" column to the string function individually, so in the result we got a vector of five string representations of the numbers from the source.
Now let us have a look at passing multiple columns. There are two ways you can do it.
The first is when your function accepts multiple positional arguments. An example of such a function is string; see:
julia> string(df.x1, df.x2)
"[1, 2, 3, 4, 5][6, 7, 8, 9, 10]"
If we pass a collection of columns as a source in operation specification syntax we get this behavior:
julia> combine(df, ["x1", "x2"] => string => "x1_x2_str")
1×1 DataFrame
Row │ x1_x2_str
│ String
─────┼─────────────────────────────────
1 │ [1, 2, 3, 4, 5][6, 7, 8, 9, 10]
Naturally, the above combines with vectorization. Therefore since:
julia> string.(df.x1, df.x2)
5-element Vector{String}:
"16"
"27"
"38"
"49"
"510"
we also have:
julia> combine(df, ["x1", "x2"] => ByRow(string) => "x1_x2_strs")
5×1 DataFrame
Row │ x1_x2_strs
│ String
─────┼────────────
1 │ 16
2 │ 27
3 │ 38
4 │ 49
5 │ 510
However, there are cases when we have a function that expects multiple columns to be passed as a single positional argument. This is handled in DataFrames.jl with the AsTable wrapper, which you can apply to the source columns. If you use it, then instead of getting multiple positional arguments the function will get a single positional argument that is a NamedTuple holding the source columns.
To convince ourselves that this is indeed what happens let us create a helper function:
julia> function helper(x)
@show x
return string(x.x1, x.x2)
end
helper (generic function with 1 method)
This helper function first prints its only argument x, then assumes that x has x1 and x2 fields and applies the string function to them. Let us first check it in practice:
julia> helper((x1=[1, 2, 3, 4, 5], x2=[6, 7, 8, 9, 10]))
x = (x1 = [1, 2, 3, 4, 5], x2 = [6, 7, 8, 9, 10])
"[1, 2, 3, 4, 5][6, 7, 8, 9, 10]"
Now let us use the helper function with combine:
julia> combine(df, AsTable(["x1", "x2"]) => helper => "x1_x2_str")
x = (x1 = [1, 2, 3, 4, 5], x2 = [6, 7, 8, 9, 10])
1×1 DataFrame
Row │ x1_x2_str
│ String
─────┼─────────────────────────────────
1 │ [1, 2, 3, 4, 5][6, 7, 8, 9, 10]
Indeed, we see that helper got a named tuple holding two columns of the source data frame. Again, this syntax plays well with ByRow:
julia> combine(df, AsTable(["x1", "x2"]) => ByRow(helper) => "x1_x2_strs")
x = (x1 = 1, x2 = 6)
x = (x1 = 2, x2 = 7)
x = (x1 = 3, x2 = 8)
x = (x1 = 4, x2 = 9)
x = (x1 = 5, x2 = 10)
5×1 DataFrame
Row │ x1_x2_strs
│ String
─────┼────────────
1 │ 16
2 │ 27
3 │ 38
4 │ 49
5 │ 510
We see that this time helper got a separate named tuple for each row of the source data frame.
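A common practical use of combining AsTable with ByRow is computing row-wise aggregates across many columns. Here is a minimal sketch using the df defined above (the "row_sum" target column name is just an example choice):

```julia
using DataFrames

df = DataFrame(reshape(1:15, 5, 3), :auto)

# pass all columns as a named tuple to `sum`, row by row
combine(df, AsTable(:) => ByRow(sum) => "row_sum")
```

The result is a 5×1 data frame with row sums 18, 21, 24, 27, and 30.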
In summary, today we discussed two special operations in DataFrames.jl operation specification syntax: ByRow, which vectorizes the function passed to it, and AsTable, which allows us to pass source columns as a single named tuple to the transformation function (instead of passing them as consecutive positional arguments, which is the default).

I hope these examples were useful in helping you understand the design of operation specification syntax.
This is a follow-up to last week's post. We will continue discussing how one can work with GroupedDataFrame objects in DataFrames.jl. Today we focus on indexing of grouped data frames.
The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.
First create some grouped data frame:
julia> using DataFrames
julia> df = DataFrame(int=[1, 3, 2, 1, 3, 2],
str=["a", "a", "c", "c", "b", "b"])
6×2 DataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
3 │ 2 c
4 │ 1 c
5 │ 3 b
6 │ 2 b
julia> gdf = groupby(df, :str, sort=true)
GroupedDataFrame with 3 groups based on key: str
First Group (2 rows): str = "a"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
⋮
Last Group (2 rows): str = "c"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 1 c
It is sometimes useful to learn the group number, within a grouped data frame gdf, of each row of the source data frame df. You can easily get this information with groupindices:
julia> groupindices(gdf)
6-element Vector{Union{Missing, Int64}}:
1
1
3
3
2
2
A basic operation when indexing a GroupedDataFrame is to pick a group by its number. Here is an example:
julia> gdf[1]
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
julia> gdf[2]
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 b
julia> gdf[3]
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 1 c
Note that gdf behaves similarly to a vector. You can even use begin and end in indexing:
julia> gdf[begin]
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
julia> gdf[end]
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 1 c
Often you might want to extract a group not by its position in gdf, but by the value of the grouping variable or variables. In this case you can use a GroupKey, a dictionary, a tuple, or a named tuple to achieve this. Let us check how it works. Start with a dictionary, a tuple, and a named tuple:
julia> gdf[Dict("str" => "b")] # dictionary
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 b
julia> gdf[("b",)] # tuple
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 b
julia> gdf[(; str="b")] # named tuple
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 b
With GroupKey we first need to get it from keys, but everything else works the same:
julia> key = keys(gdf)[1]
GroupKey: (str = "a",)
julia> gdf[key]
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
You might ask why we require passing the grouping value in a container (a dictionary, tuple, named tuple, or GroupKey) instead of passing the required value directly when indexing. The reason is that if you grouped your data by an integer column, the result would be ambiguous. Here is an example showing that under these rules there is no such ambiguity:
julia> gdf2 = groupby(df, :int, sort=false)
GroupedDataFrame with 3 groups based on key: int
First Group (2 rows): int = 1
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
⋮
Last Group (2 rows): int = 2
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 2 b
julia> gdf2[3] # third group
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 2 b
julia> gdf2[(3, )] # group with value of the grouping variable equal to 3
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 a
2 │ 3 b
You now know how to pick a single group, so selecting multiple groups is a natural next step. You can use a collection of any of the selectors we have already discussed. Here are some examples:
julia> gdf[[3, 1]] # selection by group number
GroupedDataFrame with 2 groups based on key: str
First Group (2 rows): str = "c"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 1 c
⋮
Last Group (2 rows): str = "a"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
julia> gdf[[("c",), ("a",)]] # selection by grouping variable value
GroupedDataFrame with 2 groups based on key: str
First Group (2 rows): str = "c"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 1 c
⋮
Last Group (2 rows): str = "a"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
Note that indexing allows both reordering and dropping groups, which often comes in handy when analyzing data. Also note that groupindices is aware of such changes:
julia> groupindices(gdf[[3, 1]])
6-element Vector{Union{Missing, Int64}}:
2
2
1
1
missing
missing
Here the group with "c" is first, the group with "a" is second, and the group with "b" is dropped, so missing is returned for its rows in the produced vector.
It is also worth remembering that subset and filter can be used with GroupedDataFrames. This topic is discussed in this post.
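As a quick reminder of that functionality, here is a minimal sketch (reusing the df and gdf defined above) that keeps only the groups containing the value 1 in the int column; the predicate is an example of my own choosing:

```julia
using DataFrames

df = DataFrame(int=[1, 3, 2, 1, 3, 2],
               str=["a", "a", "c", "c", "b", "b"])
gdf = groupby(df, :str, sort=true)

# `filter` with a `cols => predicate` pair keeps the whole groups
# for which the predicate returns `true`
filter(:int => v -> 1 in v, gdf)
```

Here the groups for "a" and "c" are kept, while the group for "b" is dropped; the result is again a GroupedDataFrame.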
Sometimes we do not want to index into a grouped data frame, but just to check if it contains some key. This is easily achievable with the haskey function:
julia> haskey(gdf, ("a",))
true
julia> haskey(gdf, ("z",))
false
In this post we discussed indexing of GroupedDataFrames. This concludes the basic tutorial on working with these data structures. I hope you will find the functionalities I have covered useful in your work.
One of the features of DataFrames.jl that I often find useful is that when you group a data frame by some of its columns, the resulting GroupedDataFrame is an object that gains new and useful functionalities. Some time ago I discussed how a GroupedDataFrame can be filtered; you can find that post here. In this post and the following one, which I plan to write next week, I review other key functionalities of a GroupedDataFrame.
The post was written under Julia 1.10.1 and DataFrames.jl 1.6.1.
You can create a GroupedDataFrame using the groupby function. Here are some examples:
julia> using DataFrames
julia> df = DataFrame(int=[1, 3, 2, 1, 3, 2],
str=["a", "a", "c", "c", "b", "b"])
6×2 DataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
3 │ 2 c
4 │ 1 c
5 │ 3 b
6 │ 2 b
julia> show(groupby(df, :int), allgroups=true)
GroupedDataFrame with 3 groups based on key: int
Group 1 (2 rows): int = 1
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
Group 2 (2 rows): int = 2
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 2 b
Group 3 (2 rows): int = 3
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 a
2 │ 3 b
julia> show(groupby(df, :int; sort=true), allgroups=true)
GroupedDataFrame with 3 groups based on key: int
Group 1 (2 rows): int = 1
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
Group 2 (2 rows): int = 2
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 2 b
Group 3 (2 rows): int = 3
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 a
2 │ 3 b
julia> show(groupby(df, :int; sort=false), allgroups=true)
GroupedDataFrame with 3 groups based on key: int
Group 1 (2 rows): int = 1
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
Group 2 (2 rows): int = 3
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 a
2 │ 3 b
Group 3 (2 rows): int = 2
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 2 b
julia> show(groupby(df, :str), allgroups=true)
GroupedDataFrame with 3 groups based on key: str
Group 1 (2 rows): str = "a"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
Group 2 (2 rows): str = "c"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 1 c
Group 3 (2 rows): str = "b"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 b
julia> show(groupby(df, :str; sort=true), allgroups=true)
GroupedDataFrame with 3 groups based on key: str
Group 1 (2 rows): str = "a"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
Group 2 (2 rows): str = "b"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 b
Group 3 (2 rows): str = "c"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 1 c
julia> show(groupby(df, :str; sort=false), allgroups=true)
GroupedDataFrame with 3 groups based on key: str
Group 1 (2 rows): str = "a"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
Group 2 (2 rows): str = "c"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 1 c
Group 3 (2 rows): str = "b"
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 b
2 │ 2 b
What this example shows is that the key thing you need to decide about a grouped data frame is the order of its groups. There are two options here: the groups can be sorted by the value of the grouping columns if you pass sort=true, or they can appear in their order of first appearance in the source data frame if you pass sort=false.
You might ask what happens if you do not pass the sort keyword argument. In this case either of the options may be used, depending on which one is faster. Therefore, omitting sort can be thought of as a signal that the user does not care about the order of groups but wants the grouping operation to be as fast as possible.
In some cases the order of groups is irrelevant (so you can safely skip passing it). The most important scenario of this kind is when you use the select or transform function with a GroupedDataFrame. The reason is that these functions always keep the order of rows from the source data frame (no matter how the groups are arranged in the GroupedDataFrame). However, this is not the case with combine, which respects the order of groups in the GroupedDataFrame.
Let us see an example highlighting the difference between these cases:
julia> select(groupby(df, :int, sort=true), :str)
6×2 DataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
3 │ 2 c
4 │ 1 c
5 │ 3 b
6 │ 2 b
julia> combine(groupby(df, :int, sort=true), :str)
6×2 DataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
3 │ 2 c
4 │ 2 b
5 │ 3 a
6 │ 3 b
julia> select(groupby(df, :int, sort=false), :str)
6×2 DataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 3 a
3 │ 2 c
4 │ 1 c
5 │ 3 b
6 │ 2 b
julia> combine(groupby(df, :int, sort=false), :str)
6×2 DataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
3 │ 3 a
4 │ 3 b
5 │ 2 c
6 │ 2 b
As you can see, select kept the rows in the order in which they are present in df, no matter whether we passed sort=true or sort=false. On the other hand, combine returns rows grouped by the groups, and the order of groups corresponds to their order in the GroupedDataFrame, so passing sort=true or sort=false in general changes the result.
When discussing select or combine in conjunction with a GroupedDataFrame, it is important to mention that there are four special cases of operation specification syntax designed specifically for working with them. They are: nrow, which computes the number of rows in each group; proprow, which computes the proportion of rows in each group; eachindex, which returns a vector holding the number of each row within its group; and groupindices, which returns the group number. Each of them optionally allows you to specify the name of the target column using the => syntax.
Here are some examples:
julia> combine(groupby(df, :int, sort=false), nrow)
3×2 DataFrame
Row │ int nrow
│ Int64 Int64
─────┼──────────────
1 │ 1 2
2 │ 3 2
3 │ 2 2
julia> combine(groupby(df, :int, sort=false), proprow => "row %")
3×2 DataFrame
Row │ int row %
│ Int64 Float64
─────┼─────────────────
1 │ 1 0.333333
2 │ 3 0.333333
3 │ 2 0.333333
julia> combine(groupby(df, :int, sort=false), eachindex)
6×2 DataFrame
Row │ int eachindex
│ Int64 Int64
─────┼──────────────────
1 │ 1 1
2 │ 1 2
3 │ 3 1
4 │ 3 2
5 │ 2 1
6 │ 2 2
julia> combine(groupby(df, :int, sort=false), groupindices => "group #")
3×2 DataFrame
Row │ int group #
│ Int64 Int64
─────┼────────────────
1 │ 1 1
2 │ 3 2
3 │ 2 3
Apart from using functions such as select or combine on a GroupedDataFrame, it is useful to know that it supports iteration. Therefore you can use a GroupedDataFrame in a loop or in a comprehension. When iterated, a GroupedDataFrame returns data frames corresponding to the groups. Let us see:
julia> for v in groupby(df, :int, sort=false)
println(v)
end
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 a
2 │ 3 b
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 2 b
julia> [v for v in groupby(df, :int, sort=false)]
3-element Vector{SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}}}:
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 a
2 │ 3 b
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 2 b
julia> collect(groupby(df, :int, sort=false))
3-element Vector{Any}:
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 a
2 │ 3 b
2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 2 b
The last example has shown that you can pass a GroupedDataFrame to a function expecting an iterable, in this case the collect function. The one exception to this rule is that you cannot use a GroupedDataFrame with the map function directly:
julia> map(identity, groupby(df, :int, sort=false))
ERROR: ArgumentError: using map over `GroupedDataFrame`s is reserved
The reason is that it is not clear whether such an operation should produce a vector or a data frame, and it is easy enough to achieve both results by other means. If you want a vector, use e.g. a comprehension. If you want a data frame, use e.g. combine or select.
Sometimes, when iterating a GroupedDataFrame, we might be interested not only in a data frame per group, but also in the value of the grouping variable. This is easily achieved with the keys and pairs functions (depending on whether you want only the grouping values or both the grouping values and the data frames):
julia> map(identity, keys(groupby(df, :int, sort=false)))
3-element Vector{DataFrames.GroupKey{GroupedDataFrame{DataFrame}}}:
GroupKey: (int = 1,)
GroupKey: (int = 3,)
GroupKey: (int = 2,)
julia> map(identity, pairs(groupby(df, :int, sort=false)))
3-element Vector{Pair{DataFrames.GroupKey{GroupedDataFrame{DataFrame}}, SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}}}}:
GroupKey: (int = 1,) => 2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 1 c
GroupKey: (int = 3,) => 2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 3 a
2 │ 3 b
GroupKey: (int = 2,) => 2×2 SubDataFrame
Row │ int str
│ Int64 String
─────┼───────────────
1 │ 2 c
2 │ 2 b
I used the map function here to show that only its use with a plain GroupedDataFrame is reserved; it works with keys and pairs.
As you can see in this example, each group in a GroupedDataFrame is associated with a GroupKey. To get all keys use the keys function:
julia> keys(groupby(df, :int, sort=false))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
GroupKey: (int = 1,)
GroupKey: (int = 3,)
GroupKey: (int = 2,)
Let us, as an example, extract the last key to see how one can work with it:
julia> key = last(keys(groupby(df, :int, sort=false)))
GroupKey: (int = 2,)
You can get a value of the key by property access or indexing:
julia> key.int
2
julia> key[1]
2
julia> key["int"]
2
julia> key[:int]
2
It is also easy to convert a GroupKey to a dictionary, vector, Tuple, or NamedTuple if you need it:
julia> Dict(key)
Dict{Symbol, Int64} with 1 entry:
:int => 2
julia> collect(key)
1-element Vector{Int64}:
2
julia> Tuple(key)
(2,)
julia> NamedTuple(key)
(int = 2,)
Note that, in general, you can group a data frame by multiple columns, so you could query the value of any grouping column in the examples above. If you need to get a list of the grouping columns, use the groupcols function:
julia> groupcols(groupby(df, :int, sort=false))
1-element Vector{Symbol}:
:int
In this post we have learned how one can create a grouped data frame and how to choose the order of groups in it. As a follow-up we have shown how a GroupedDataFrame interacts with functions like select or combine. Next we discussed the iterator interface supported by GroupedDataFrame and how to get and use information about the values of grouping columns for each group. I hope you found these examples useful. In next week's post we will discuss how GroupedDataFrame supports the indexing interface.
Some functions provided in Base Julia support partial application. I often find this functionality useful, so in this post I want to explain it and summarize which functions have this property.
The post was tested with Julia Version 1.12.0-DEV.53.
We will focus on partial application of functions that take two positional arguments. Let us work by example. Consider the in function. You can call it to check if some item is in a collection. Here is an example:
julia> in('a', "Abracadabra")
true
julia> in('x', "Abracadabra")
false
A common pattern you might need is a repeated check of whether various items are contained in the same collection. For example, assume you have a vector of characters and you want to filter it to keep only the elements contained in a reference collection. You can do it like this:
julia> v = 'a':'z'
'a':1:'z'
julia> filter(x -> in(x, "Abracadabra"), v)
5-element Vector{Char}:
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)
'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)
This pattern is so commonly needed that there is a shorthand for x -> in(x, "Abracadabra"). Instead of creating this anonymous function you can just write in("Abracadabra"). The value returned by this function call behaves in the same way as x -> in(x, "Abracadabra"). Let us check:
julia> filter(in("Abracadabra"), v)
5-element Vector{Char}:
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)
'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)
You can think of this operation as partially applying the in function: we fix its second argument (the collection) and leave the first (the item we check) to be specified later. In other words, the following two operations are equivalent:
julia> in('a', "Abracadabra")
true
julia> in("Abracadabra")('a')
true
Fixing the second argument is most common. However, sometimes it is useful to fix the first argument. This is exactly the case for the filter function we have just used. What if you wanted to perform the filter(in("Abracadabra"), v) operation for multiple different values of v but with a fixed predicate function? Here is an example:
julia> vv = ['a'+i:'z' for i in 0:4]
5-element Vector{StepRange{Char, Int64}}:
'a':1:'z'
'b':1:'z'
'c':1:'z'
'd':1:'z'
'e':1:'z'
julia> map(v -> filter(in("Abracadabra"), v), vv)
5-element Vector{Vector{Char}}:
['a', 'b', 'c', 'd', 'r']
['b', 'c', 'd', 'r']
['c', 'd', 'r']
['d', 'r']
['r']
You probably see where this is going. Instead of v -> filter(in("Abracadabra"), v) we can write filter(in("Abracadabra")), fixing the first positional argument of filter and leaving the second to be specified later. Let us check that this works:
julia> map(filter(in("Abracadabra")), vv)
5-element Vector{Vector{Char}}:
['a', 'b', 'c', 'd', 'r']
['b', 'c', 'd', 'r']
['c', 'd', 'r']
['d', 'r']
['r']
Indeed, we get what we expected. Again, for a reference note that the following two operations are equivalent:
julia> filter(in("Abracadabra"), v)
5-element Vector{Char}:
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)
'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)
julia> filter(in("Abracadabra"))(v)
5-element Vector{Char}:
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)
'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)
'r': ASCII/Unicode U+0072 (category Ll: Letter, lowercase)
Before I finish this section, let me note that if you do not like writing that many parentheses, you could use the |> operator. In our example we could write:
julia> map("Abracadabra" |> in |> filter, vv)
5-element Vector{Vector{Char}}:
['a', 'b', 'c', 'd', 'r']
['b', 'c', 'd', 'r']
['c', 'd', 'r']
['d', 'r']
['r']
Which style you use is a matter of preference.
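It is also worth knowing that this mechanism is exposed through the Base.Fix1 and Base.Fix2 wrapper types (for example, in(x) is defined as Base.Fix2(in, x)), so you can build a partially applied form of any two-argument function yourself. A small sketch:

```julia
# Base.Fix1 fixes the first positional argument, Base.Fix2 the second.
# Subtraction is used so that the argument order is visible.
f1 = Base.Fix1(-, 10)  # behaves like y -> 10 - y
f2 = Base.Fix2(-, 10)  # behaves like x -> x - 10

f1(3)  # 7
f2(3)  # -7
```

This is handy when the function you need is not among the ones with a built-in partial-application method.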
We saw that some functions taking two arguments support partial application. Below I give a list of all of them that are currently supported (this is the reason why the post was written under a Julia nightly build, as there were recent changes to this list).
There is only one function in Base Julia that supports fixing its first argument, and this function is filter. However, there are many functions supporting fixing of their second argument. Here is their list: the comparisons isequal, ==, !=, >=, <=, >, and <; the membership tests in, ∈, ∋, ∉, and ∌; the string predicates contains, occursin, endswith, and startswith; and the set predicates issubset, ⊆, ⊇, ⊈, ⊉, ⊊, ⊋, isdisjoint, and issetequal.

After reading this post you know how to use partial function application in Julia and which functions from Base support it. I hope you will find this functionality useful in your code.
Today I want to present a small benchmark of random number generation performance improvements between the current Julia release 1.10.1 and the current LTS version 1.6.7. The idea for the benchmark follows a discussion with a friend who needed to run some compute-intensive Julia code on the LTS version.
The post was written under Julia 1.10.1 and Julia 1.6.7.
Let us start by presenting the benchmark functions:
function test_rand1()
s = 0
for i in 1:10^9
s += rand(1:1_000_000)
end
return s
end
function test_rand2()
s = 0.0
for i in 1:10^9
s += rand()
end
return s
end
They are relatively simple. I wanted to compare the performance of: (1) integer generation from a range and (2) generation of floating point numbers from the [0, 1) interval, as these are the two most common scenarios in practice.
Let us see the results. First comes Julia 1.6.7:
julia> @time test_rand1()
4.949335 seconds (13 allocations: 35.406 KiB)
499993991047124
julia> @time test_rand1()
4.663646 seconds
499998112691460
julia> @time test_rand2()
2.175424 seconds
5.000141761909688e8
julia> @time test_rand2()
2.238839 seconds
4.9999424544883996e8
And now we have Julia 1.10.1:
julia> @time test_rand1()
2.355028 seconds
500001818410630
julia> @time test_rand1()
2.287840 seconds
499998082399284
julia> @time test_rand2()
1.123886 seconds
5.000026226340503e8
julia> @time test_rand2()
1.117811 seconds
4.9999201274214923e8
So we see that things run roughly two times faster. What is the reason for this difference? The major point is that between Julia 1.6.7 and Julia 1.10.1 the default random number generator was changed. Let us see (below I use copy to ensure explicit instantiation of the random number generator object under Julia 1.10.1). Again, first we test Julia 1.6.7:
julia> using Random
julia> copy(Random.default_rng())
MersenneTwister(0x2fe644ceb724000ca5e5b4409dc3c6ea, (0, 4502994048, 4502993046, 986, 2502992778, 986))
and next we check Julia 1.10.1:
julia> using Random
julia> copy(Random.default_rng())
Xoshiro(0x1273707731737276, 0x187b3d2e82fb1d48, 0x13f9fd1a82642acb, 0xa7dcba727da742e6, 0x3ed2b4d410aa4b31)
So indeed, we see that the MersenneTwister generator was replaced by the Xoshiro generator (to be exact, Xoshiro256++). This has one important consequence, apart from random number generation speed, related to seeding of the generator. Let us check. First Julia 1.6.7:
julia> Random.seed!(1)
MersenneTwister(1)
julia> rand()
0.23603334566204692
vs Julia 1.10.1:
julia> Random.seed!(1)
TaskLocalRNG()
julia> rand()
0.07336635446929285
This means that when you use the default random number generator you should not expect reproducibility of results between these two Julia versions. This is documented behavior: stream reproducibility is not ensured across Julia versions. If you need such reproducibility you can use e.g. the StableRNGs.jl package.
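A minimal sketch of how this could look (it assumes the StableRNGs.jl package is installed; StableRNG is the generator type it exports):

```julia
using StableRNGs

# a generator whose stream is intended to be stable
# across Julia and package versions
rng = StableRNG(1)

# pass the rng explicitly to rand to draw from it
x = rand(rng)
```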
The topic of changes in random number generation in Julia is probably well known to people doing compute-intensive simulations. However, I thought it was worth presenting these results for new users, who might be using different versions of Julia to execute the same code and wonder why the performance, or the results themselves, differ across them.
Today I wanted to switch back to a lighter subject, so I decided to have a look at my favorite Project Euler website. I picked problem 116, as I had not tried to solve it yet. Interestingly, it turned out that there are two ways to approach this puzzle, so I thought I would share them here.
The post was written under Julia 1.10.0.
Project Euler puzzle 116 can be briefly stated as follows: a row of 50 grey squares is to have a number of its tiles replaced with coloured oblong tiles chosen from red (length two), green (length three), or blue (length four). How many different ways can the grey tiles be replaced if colours cannot be mixed and at least one coloured tile must be used? (If you want to see some visual examples of valid tilings, I encourage you to visit the puzzle 116 page.)
When we think of this problem, it is natural to generalize it. By C(n, d) we denote the number of ways that n grey squares can be replaced with tiles of length d. Then the solution to our problem is C(n, 2) + C(n, 3) + C(n, 4) with n = 50. So let us focus on computing C(n, d) (assuming d is positive).
The first approach is to ask how many tiles of length d can be placed. There must be at least 1, and we cannot place more than n ÷ d (here I use the ÷ notation from Julia, which denotes integer division; in other words, the integer part of n / d).
So now assume that we want to place i blocks of length d (assuming i is valid). In how many ways can we do it? Well, we place i long blocks and we are left with n - d*i grey blocks. In total we have i + (n - d*i) blocks. You can think of it as having that many slots, from which you need to pick i slots for the long blocks. The number of ways you can do this is given by the binomial coefficient. In Julia notation it is: binomial(BigInt(i + (n - d*i)), BigInt(i)).
Now you might ask why I put the BigInt wrapper around the passed numbers. The reason is that the binomial coefficient gets large pretty quickly, so I want to make sure I will not have issues with integer overflow.
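To see the overflow risk concretely, here is a small sketch (not from the original post): with plain Int64 arguments Julia's binomial throws an OverflowError once the result no longer fits, while BigInt arguments are safe.

```julia
# C(100, 50) ≈ 1.0e29 vastly exceeds typemax(Int64) ≈ 9.2e18.
@assert binomial(big(100), big(50)) > typemax(Int64)
# The Int64 version detects this and throws instead of silently wrapping.
@assert try
    binomial(100, 50)
    false
catch e
    e isa OverflowError
end
```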
Given these considerations, the first function producing C(n, d) can be defined as:
function C1(n::Integer, d::Integer)
@assert d > 0 && n >= 0
return sum(i -> binomial(BigInt(i + (n - d*i)), BigInt(i)), 1:n ÷ d; init=big"0")
end
Note that I use the init=big"0" initialization statement in sum to ensure correct handling of the case n < d, in which we are given an empty collection to sum over.
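A minimal sketch of why init is needed (assuming n < d, so the iteration range is empty):

```julia
# Without `init`, `sum` of a mapped function over an empty collection
# cannot infer a neutral element for an arbitrary function and errors.
# With `init=big"0"` we get a well-typed zero instead.
s = sum(i -> binomial(big(5), big(i)), 1:0; init=big"0")
@assert s == 0
@assert s isa BigInt
```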
However, there is a different way to think about computing C(n, d).
Assume we know the values of C(n, d) for values of n smaller than the requested one. We look at the last tile in our row. If it is empty, then we are down to n-1 tiles to be filled. This can be done in C(n-1, d) ways (remember that this value takes care of the fact that at least one block of length d has to be used).
But what if the last tile in our row is covered by a block of length d? Then we have two options: either all other tiles are left grey (which gives us 1 combination), or the remaining n-d tiles are filled with at least one block of length d. The second value is exactly C(n-d, d). In summary, we get that C(n, d) = C(n-1, d) + C(n-d, d) + 1.
This formula assumes n is at least d. Clearly, for n < d we have 0 ways to arrange the blocks.
Let us write down the code that performs the required computation:
function C2(n::Integer, d::Integer)
@assert d > 0 && n >= 0
npos = Dict{Int,BigInt}(i => 0 for i in 0:d-1)
for j in d:n
npos[j] = npos[j-1] + npos[j-d] + 1
end
return npos[n]
end
Note that in the code I used the npos dictionary to flexibly allow for any potential integer values of n. The dictionary has type Dict{Int,BigInt}, again to ensure that the results of the computations are stored correctly even if they are large.
Now we have two functions, C1 and C2, that look completely different. Do they produce the same results? Let us check:
julia> using Test
julia> @testset "test C1 and C2 equality" begin
for n in 0:200, d in 1:20
@test C1(n, d) == C2(n, d)
end
end;
Test Summary: | Pass Total Time
test C1 and C2 equality | 4020 4020 0.9s
Indeed, we see that both the C1 and C2 functions produce the same results.
To convince ourselves that using arbitrary precision integers was indeed needed let us check some example values of the functions:
julia> C1(200, 2)
453973694165307953197296969697410619233825
julia> C2(200, 2)
453973694165307953197296969697410619233825
julia> typemax(Int)
9223372036854775807
Indeed, if we were not careful, we would have an integer overflow issue.
As usual, I will not show the value of the solution to the problem, to encourage you to run the code yourself. You can get it by executing either sum(d -> C1(50, d), 2:4) or sum(d -> C2(50, d), 2:4). (We have just checked that the value produced in both cases is the same.)
I have written in the past about DataFrames.jl operation specification syntax (also called minilanguage), see for example this post or this post.
Today I want to discuss one design decision made in this minilanguage and its consequences. It is related to how vectors are handled when they are returned by a transformation function.
The post was written under Julia 1.10.0 and DataFrames.jl 1.6.1.
Consider the following example, where we want to compute a profit from some sales data:
julia> using DataFrames
julia> df = DataFrame(name=["A", "B", "C"],
revenue=[10, 20, 30],
cost=[5, 12, 18])
3×3 DataFrame
Row │ name revenue cost
│ String Int64 Int64
─────┼────────────────────────
1 │ A 10 5
2 │ B 20 12
3 │ C 30 18
julia> combine(df, All(), ["revenue", "cost"] => (-) => "profit")
3×4 DataFrame
Row │ name revenue cost profit
│ String Int64 Int64 Int64
─────┼────────────────────────────────
1 │ A 10 5 5
2 │ B 20 12 8
3 │ C 30 18 12
The crucial point to understand here is that the - function takes two columns, "revenue" and "cost", and returns a vector. Users typically expect, as in this example, that this vector will be spread across several rows.
However, there are cases when we might not want to spread a vector into multiple rows. Consider, for example, a transformation in which we want to put the "revenue" and "cost" values in a 2-element vector per product. Intuitively we could write something like:
julia> combine(groupby(df, :name),
All(),
["revenue", "cost"] => ((x,y) -> [only(x), only(y)]) => "vec")
ERROR: ArgumentError: all functions must return vectors of the same length
Unfortunately, we get an error. We will soon understand why, but before I proceed let me comment on the [only(x), only(y)] part of the definition. The only function makes sure that we have exactly one row per product.
To diagnose the issue, let us drop the All() part of our call:
julia> combine(groupby(df, :name),
["revenue", "cost"] => ((x,y) -> [only(x), only(y)]) => "vec")
6×2 DataFrame
Row │ name vec
│ String Int64
─────┼───────────────
1 │ A 10
2 │ A 5
3 │ B 20
4 │ B 12
5 │ C 30
6 │ C 18
Now we understand the problem. Because our function returns a vector, it gets spread over several rows (which leads to an error, as the other columns of df have a different length).
As I have said above, most of the time vector spreading is a desired feature, but in the example we have just studied it is not wanted. For such cases DataFrames.jl allows you to protect vectors from being spread. What you need to do is call the Ref function on the returned value. This will protect the result from being spread:
julia> combine(groupby(df, :name),
All(),
["revenue", "cost"] => ((x,y) -> Ref([only(x), only(y)])) => "vec")
3×4 DataFrame
Row │ name revenue cost vec
│ String Int64 Int64 Array…
─────┼──────────────────────────────────
1 │ A 10 5 [10, 5]
2 │ B 20 12 [20, 12]
3 │ C 30 18 [30, 18]
Now, as we wanted, the entries of the "vec" column are vectors. Wrapping the return value of our function with Ref protected the vectors from being spread. An alternative function that you could use to get the same effect is fill:
julia> combine(groupby(df, :name),
All(),
["revenue", "cost"] => ((x,y) -> fill([only(x), only(y)])) => "vec")
3×4 DataFrame
Row │ name revenue cost vec
│ String Int64 Int64 Array…
─────┼──────────────────────────────────
1 │ A 10 5 [10, 5]
2 │ B 20 12 [20, 12]
3 │ C 30 18 [30, 18]
or you could wrap the return value in another pair of brackets [...]:
julia> combine(groupby(df, :name),
All(),
["revenue", "cost"] => ((x,y) -> [[only(x), only(y)]]) => "vec")
3×4 DataFrame
Row │ name revenue cost vec
│ String Int64 Int64 Array…
─────┼──────────────────────────────────
1 │ A 10 5 [10, 5]
2 │ B 20 12 [20, 12]
3 │ C 30 18 [30, 18]
What is going on here? In all three cases (Ref, fill, and [...]) we are wrapping a vector in another object that works like an outer vector. In the case of [...] it is just a vector, fill produces a 0-dimensional array, and Ref creates a wrapper that behaves like a 0-dimensional array. In all cases DataFrames.jl treats this outer wrapper as a 1-element collection and just stores its contents in a single row (because there is one element to store).
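To make the wrapper behavior concrete, here is a small REPL sketch (not part of the original post):

```julia
v = [10, 5]
@assert Ref(v)[] === v        # Ref unwraps with [] back to the same vector
z = fill(v)                   # fill with no dimensions: 0-dimensional Array
@assert ndims(z) == 0 && z[] === v
@assert length(z) == 1        # hence treated as a 1-element collection
@assert [v][1] === v          # extra brackets give a 1-element Vector
```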
I hope that you will find the example I gave today useful when transforming vectors using DataFrames.jl.
Today I wanted to discuss a conceptual aspect of Julia programming: how you should query an object for its properties. The topic is especially relevant if you want to write code that is expected to be stable in the long term, that is, easy to maintain as the versions of its dependencies change.
The post was written under Julia 1.10.0 and DataFrames.jl 1.6.1.
A fundamental element of Julia's design is composite types. An object of such a type is a collection of named fields, each of which can hold some value.
To make things non-abstract, let us have a look at the SubDataFrame type from DataFrames.jl. First create an instance of such an object:
julia> using DataFrames
julia> df = DataFrame(x=1:3, y=11:13, z=111:113)
3×3 DataFrame
Row │ x y z
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 11 111
2 │ 2 12 112
3 │ 3 13 113
julia> sdf = @view df[1:2, 1:2]
2×2 SubDataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1 11
2 │ 2 12
To check what fields SubDataFrame contains you can use the fieldnames function:
julia> fieldnames(SubDataFrame)
(:parent, :colindex, :rows)
Note that we pass a type to fieldnames. This is important: the list of fields is fixed for every instance of an object of a given type.
In this case we learned that SubDataFrame has three fields. Three functions associated with fieldnames are: fieldcount, returning the number of fields of a type; fieldtypes, returning their declared types; and hasfield, allowing you to query whether a specific field is present. Here is an example:
julia> fieldcount(SubDataFrame)
3
julia> fieldtypes(SubDataFrame)
(AbstractDataFrame, DataFrames.AbstractIndex, AbstractVector{Int64})
julia> hasfield(SubDataFrame, :parent)
true
julia> hasfield(SubDataFrame, :parentx)
false
For a given instance of a type you can query a field with getfield and set it with setfield!. For example, let us get the :parent field of our sdf object (the source data frame in this case):
julia> getfield(sdf, :parent)
3×3 DataFrame
Row │ x y z
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 11 111
2 │ 2 12 112
3 │ 3 13 113
Having learned all these methods, you might ask yourself when to use them. The short answer is:
Never directly access fields of a type. They might be changed between versions of the code you use without warning.
The longer answer is that direct field access is typically considered internal. The list of fields and their types is an implementation detail, and as a user of a type you should not rely on them. Field access is reserved for the designers of a type, to allow them to manipulate its inner physical representation.
So how should we work with composite types then?
Julia introduces the concept of a property, which is a logical representation of the data stored in a given object. You can query the properties of an object with the propertynames function. You also have the hasproperty, getproperty, and setproperty! functions, similar to those for fields.
In the case of our sdf SubDataFrame we have the following logical representation:
julia> propertynames(sdf)
2-element Vector{Symbol}:
:x
:y
julia> hasproperty(sdf, :x)
true
julia> getproperty(sdf, :x)
2-element view(::Vector{Int64}, 1:2) with eltype Int64:
1
2
julia> setproperty!(sdf, :x, [1001, 1002])
2-element Vector{Int64}:
1001
1002
julia> sdf
2×2 SubDataFrame
Row │ x y
│ Int64 Int64
─────┼──────────────
1 │ 1001 11
2 │ 1002 12
We immediately see a significant difference. The properties of sdf are in this case the columns of our data frame. We do not care how they are mapped to the physical representation of SubDataFrame; this is taken care of by the designers of the DataFrames.jl package.
There are the following important aspects of properties.
The first is that property access is typically considered a public API. Designers of a type should make sure that the way you access properties of an object remains stable, and a change in this area is considered breaking. So:
You should access properties of objects in your code (not fields).
The second is that properties are bound to an object, not to a type. This means that different objects of the same type may have different sets of properties. This is quite useful; e.g., each data frame can have a different set of columns.
The third, practical, information is that by default properties fall back to fields, as you can read here in the Julia Manual.
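A minimal sketch of this default fallback for a plain struct (the type name is mine, used only for illustration):

```julia
struct PointDemo  # hypothetical type, not from the post
    x::Int
    y::Int
end

p = PointDemo(1, 2)
# By default `getproperty` forwards to `getfield`, so for plain structs
# fields and properties coincide.
@assert p.x == getfield(p, :x) == 1
@assert propertynames(p) == fieldnames(PointDemo) == (:x, :y)
```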
The next aspect is convenient syntax. You do not need to call the getproperty and setproperty! functions explicitly. getproperty(a, :b) is equivalent to a.b, and setproperty!(a, :b, v) is the same as a.b = v.
Finally, note that the propertynames function optionally takes a second positional argument of type Bool. If it is passed and set to true, you get a list of all properties of an object. By default the second argument is false and you get a list of only the public properties (and in practice you should use the default mode).
Today I have a short conclusion.
Fields represent physical layout of a type. Properties represent a logical view of an object.
In your code, use object properties and not their fields. Field access is considered internal and should typically be done only by the developers of the package providing a given object.
This week I have discussed with my colleague the Lichess puzzle dataset that I use in my Julia for Data Analysis book.
The dataset contains a list of puzzles along with information about them, such as puzzle difficulty, puzzle solution, and tags describing puzzle type.
We were discussing whether the tags assigned to puzzles in this dataset are accurate. In this post I give you an example of how one can check this (and practice a bit of CSV.jl and DataFrames.jl).
The post was written under Julia 1.10.0, CSV.jl 0.10.12, and DataFrames.jl 1.6.1.
In this post I show you relatively brief code. Therefore I assume that you first download the file with the puzzle dataset and unpack it manually. (In the book I show how to do it using Julia. You can find the source code in the GitHub repository of the book.)
Assuming you downloaded and unpacked the dataset into the puzzles.csv file, we read it in. We are interested only in columns 3 and 8 of this file, so I use the following commands:
julia> using CSV
julia> using DataFrames
julia> df = CSV.read("puzzles.csv", DataFrame; select=[3, 8], header=false)
2132989×2 DataFrame
Row │ Column3 Column8
│ String String
─────────┼──────────────────────────────────────────────────────────────────────
1 │ f2g3 e6e7 b2b1 b3c1 b1c1 h6c1 crushing hangingPiece long middl…
2 │ d3d6 f8d8 d6d8 f6d8 advantage endgame short
3 │ b6c5 e2g4 h3g4 d1g4 advantage middlegame short
4 │ g5e7 a5c3 b2c3 c6e7 advantage master middlegame short
5 │ e8f7 e2e6 f7f8 e6f7 mate mateIn2 middlegame short
6 │ a6a5 e5c7 a5b4 c7d8 crushing endgame fork short
7 │ d4b6 f6e4 h1g1 e4f2 crushing endgame short trappedPi…
8 │ d8f6 d1h5 h7h6 h5c5 advantage middlegame short
⋮ │ ⋮ ⋮
2132982 │ d2c2 c5d3 c2d3 c4d3 crushing fork middlegame short
2132983 │ b8d7 c3b5 d6b8 a1c1 e8g8 b5c7 crushing long middlegame quietMo…
2132984 │ g7g6 d5c6 c5c4 b3c4 b4c4 c6d6 crushing defensiveMove endgame l…
2132985 │ g1h1 e3e1 f7f1 e1f1 endgame mate mateIn2 short
2132986 │ g5c1 d5d6 d7f6 h7h8 advantage middlegame short
2132987 │ d2f3 d8a5 c1d2 a5b5 advantage fork opening short
2132988 │ f7f2 b2c2 c1b1 e2d1 endgame mate mateIn2 queensideAt…
2132989 │ c6d4 f1e1 e8d8 b1c3 d4f3 g2f3 advantage long opening
2132973 rows omitted
julia> rename!(df, ["moves", "tags"])
2132989×2 DataFrame
Row │ moves tags
│ String String
─────────┼──────────────────────────────────────────────────────────────────────
1 │ f2g3 e6e7 b2b1 b3c1 b1c1 h6c1 crushing hangingPiece long middl…
2 │ d3d6 f8d8 d6d8 f6d8 advantage endgame short
3 │ b6c5 e2g4 h3g4 d1g4 advantage middlegame short
4 │ g5e7 a5c3 b2c3 c6e7 advantage master middlegame short
5 │ e8f7 e2e6 f7f8 e6f7 mate mateIn2 middlegame short
6 │ a6a5 e5c7 a5b4 c7d8 crushing endgame fork short
7 │ d4b6 f6e4 h1g1 e4f2 crushing endgame short trappedPi…
8 │ d8f6 d1h5 h7h6 h5c5 advantage middlegame short
⋮ │ ⋮ ⋮
2132982 │ d2c2 c5d3 c2d3 c4d3 crushing fork middlegame short
2132983 │ b8d7 c3b5 d6b8 a1c1 e8g8 b5c7 crushing long middlegame quietMo…
2132984 │ g7g6 d5c6 c5c4 b3c4 b4c4 c6d6 crushing defensiveMove endgame l…
2132985 │ g1h1 e3e1 f7f1 e1f1 endgame mate mateIn2 short
2132986 │ g5c1 d5d6 d7f6 h7h8 advantage middlegame short
2132987 │ d2f3 d8a5 c1d2 a5b5 advantage fork opening short
2132988 │ f7f2 b2c2 c1b1 e2d1 endgame mate mateIn2 queensideAt…
2132989 │ c6d4 f1e1 e8d8 b1c3 d4f3 g2f3 advantage long opening
2132973 rows omitted
Note that the file does not have a header, so when reading it we passed header=false and then manually named the columns using rename!.
I wanted only these two columns since today I want to check whether the tags related to mating are accurate. You can notice in the printout above that in the "tags" column we have a tag "mateIn2". It indicates that the puzzle is a mate in two moves. This is the case, for example, for rows 5, 2132985, and 2132988. In the matching "moves" column we see that we have 4 corresponding moves. The reason is that both players make moves (and 2 + 2 = 4).
What we want to check is whether these "mateInX" tags are correct. I will check the values of X from 1 to 5 (as only these five options are present in the tags; I leave verifying this to you as an exercise).
When should we call the tags correct? There are two conditions:
each puzzle should have at most one "mateInX" tag (e.g. it cannot be tagged "mateIn1" and "mateIn2" at the same time);
a puzzle tagged "mateInX" should have 2X moves recorded (as both players move).
Let us check it.
As a first step we transform (in place, i.e. modifying our df data frame) the original columns into a more convenient form. Instead of the raw "moves" I want an "nmoves" column that gives me the number of moves in the puzzle. Similarly, instead of "tags" I want indicator columns "mateInX", for X ranging from 1 to 5, showing the puzzle type. Here is how you can achieve this:
julia> select!(df,
"moves" => ByRow(length∘split) => "nmoves",
["tags" => ByRow(contains("mateIn$i")) => "mateIn$i" for i in 1:5])
2132989×6 DataFrame
Row │ nmoves mateIn1 mateIn2 mateIn3 mateIn4 mateIn5
│ Int64 Bool Bool Bool Bool Bool
─────────┼─────────────────────────────────────────────────────
1 │ 6 false false false false false
2 │ 4 false false false false false
3 │ 4 false false false false false
4 │ 4 false false false false false
5 │ 4 false true false false false
6 │ 4 false false false false false
7 │ 4 false false false false false
8 │ 4 false false false false false
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
2132982 │ 4 false false false false false
2132983 │ 6 false false false false false
2132984 │ 6 false false false false false
2132985 │ 4 false true false false false
2132986 │ 4 false false false false false
2132987 │ 4 false false false false false
2132988 │ 4 false true false false false
2132989 │ 6 false false false false false
2132973 rows omitted
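As a side note, the two building blocks used in this select! call can be checked in isolation (a small sketch with made-up inputs):

```julia
# `contains(needle)` is curried: it returns a predicate function,
# which ByRow then applies to each entry of the "tags" column.
is_mate_in_2 = contains("mateIn2")
@assert is_mate_in_2("mate mateIn2 middlegame short")
@assert !is_mate_in_2("advantage endgame short")

# `length∘split` counts the whitespace-separated moves in a "moves" entry.
@assert (length∘split)("e8f7 e2e6 f7f8 e6f7") == 4
```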
Now we see that some of the rows are not tagged as "mateInX". Let us filter them out, to have only the tagged rows left (again, we do the operation in place):
julia> filter!(row -> any(row[Not("nmoves")]), df)
491743×6 DataFrame
Row │ nmoves mateIn1 mateIn2 mateIn3 mateIn4 mateIn5
│ Int64 Bool Bool Bool Bool Bool
────────┼─────────────────────────────────────────────────────
1 │ 4 false true false false false
2 │ 4 false true false false false
3 │ 2 true false false false false
4 │ 4 false true false false false
5 │ 2 true false false false false
6 │ 4 false true false false false
7 │ 4 false true false false false
8 │ 2 true false false false false
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
491736 │ 6 false false true false false
491737 │ 4 false true false false false
491738 │ 2 true false false false false
491739 │ 4 false true false false false
491740 │ 2 true false false false false
491741 │ 2 true false false false false
491742 │ 4 false true false false false
491743 │ 4 false true false false false
491727 rows omitted
Note that in the condition I used the row[Not("nmoves")] selector, as I wanted to check all columns except "nmoves".
Now we are ready to check the correctness of tags:
julia> combine(groupby(df, "nmoves"), Not("nmoves") .=> sum)
10×6 DataFrame
Row │ nmoves mateIn1_sum mateIn2_sum mateIn3_sum mateIn4_sum mateIn5_sum
│ Int64 Int64 Int64 Int64 Int64 Int64
─────┼─────────────────────────────────────────────────────────────────────────
1 │ 2 136843 0 0 0 0
2 │ 4 0 274135 0 0 0
3 │ 6 0 0 68623 0 0
4 │ 8 0 0 0 9924 0
5 │ 10 0 0 0 0 1691
6 │ 12 0 0 0 0 367
7 │ 14 0 0 0 0 127
8 │ 16 0 0 0 0 25
9 │ 18 0 0 0 0 7
10 │ 20 0 0 0 0 1
The table reads as follows:
the tags "mateInX" for X in the 1 to 4 range are correct (a puzzle tagged "mateInX" always has exactly 2X moves);
the "mateIn5" tag actually means a situation where there are five or more moves.
So the verdict is that the tagging is correct, but we need to know the interpretation of the "mateIn5" column, as it actually means five or more moves. We could rename the column to e.g. "mateIn5+" to reflect that, or add metadata to our df table where we would store this information (I leave this to you as an exercise).
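For the metadata variant, DataFrames.jl supports table-level metadata; a possible sketch (the key name and note text are my invention):

```julia
using DataFrames

df = DataFrame(mateIn5=[true, false])
# Attach a note explaining the column's actual meaning; :note-style
# metadata is propagated through transformations of the data frame.
metadata!(df, "mateIn5_meaning", "mate in five or more moves"; style=:note)
@assert metadata(df, "mateIn5_meaning") == "mate in five or more moves"
```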
I hope that CSV.jl and DataFrames.jl users found the examples that I gave today useful and interesting. Enjoy!
Today I decided to write about code refactoring in Julia. This is a topic where, in my experience, Julia has quite a big advantage.
A common situation you face when writing code is the following. You need some functionality in your program and it is available in some library. However, what the library provides does not meet your expectations. Since in Julia most packages are written in Julia under the MIT license, it is easy to solve this issue: you just take the source code and modify it.
Today I want to show you a practical example of such a situation that I had this week when working with the Graphs.jl package.
The post was written using Julia 1.10.0, BenchmarkTools.jl 1.4.0, and Graphs.jl 1.9.0.
In my work I needed to generate random geometric graphs. This is a simple random graph model that works as follows (here I describe the general idea; for details please check the Wikipedia entry on random geometric graphs). To generate a graph on N vertices you first drop N random points in some metric space. Next, you connect two points with an edge if their distance is less than some pre-specified cutoff.
The Graphs.jl library provides the euclidean_graph function that generates such graphs. Here is a summary of its docstring:
euclidean_graph(N, d; rng=nothing, seed=nothing, L=1., p=2., cutoff=-1., bc=:open)
Generate N uniformly distributed points in the box [0,L]^{d}
and return a Euclidean graph, a map containing the distance on each
edge and a matrix with the points' positions.
An edge between vertices x[i] and x[j] is inserted if norm(x[i]-x[j], p) < cutoff.
In case of negative cutoff instead every edge is inserted.
Set bc=:periodic to impose periodic boundary conditions in the box [0,L]^d.
So what is the problem with this function? Unfortunately, it is slow. Let us, for example, check how long it takes to compute the average degree of a node in such a graph with n nodes and cutoff=sqrt(10/n), when setting bc=:periodic (periodic boundary conditions, i.e. distance measured on a torus) in two-dimensional space.
julia> using Graphs
julia> for n in 1_000:1_000:10_000
println(@time ne(euclidean_graph(n, 2; cutoff=sqrt(10/n), bc=:periodic, seed=1)[1])/n)
end
0.091657 seconds (2.50 M allocations: 193.170 MiB, 14.06% gc time)
15.604
0.300285 seconds (10.00 M allocations: 765.801 MiB, 12.59% gc time)
15.661
0.686230 seconds (22.50 M allocations: 1.686 GiB, 12.47% gc time)
15.744
1.175881 seconds (40.00 M allocations: 2.990 GiB, 10.97% gc time)
15.7065
1.800561 seconds (62.50 M allocations: 4.666 GiB, 10.76% gc time)
15.6568
2.697535 seconds (90.00 M allocations: 6.716 GiB, 14.76% gc time)
15.641333333333334
3.690599 seconds (122.50 M allocations: 9.138 GiB, 13.28% gc time)
15.743571428571428
4.745701 seconds (160.00 M allocations: 11.932 GiB, 13.53% gc time)
15.714
5.962431 seconds (202.51 M allocations: 15.099 GiB, 12.70% gc time)
15.722222222222221
7.195257 seconds (250.01 M allocations: 18.638 GiB, 11.42% gc time)
15.7086
We can see that the euclidean_graph function scales badly with n. Note that by choosing cutoff=sqrt(10/n) we get a roughly constant average degree, so the number of edges generated scales linearly, but the generation time seems to grow much faster.
To find the source of the problem we can investigate the source of euclidean_graph, which consists of two methods:
function euclidean_graph(
N::Int,
d::Int;
L=1.0,
rng::Union{Nothing,AbstractRNG}=nothing,
seed::Union{Nothing,Integer}=nothing,
kws...,
)
rng = rng_from_rng_or_seed(rng, seed)
points = rmul!(rand(rng, d, N), L)
return (euclidean_graph(points; L=L, kws...)..., points)
end
function euclidean_graph(points::Matrix; L=1.0, p=2.0, cutoff=-1.0, bc=:open)
d, N = size(points)
weights = Dict{SimpleEdge{Int},Float64}()
cutoff < 0.0 && (cutoff = typemax(Float64))
if bc == :periodic
maximum(points) > L && throw(
DomainError(maximum(points), "Some points are outside the box of size $L")
)
end
for i in 1:N
for j in (i + 1):N
if bc == :open
Δ = points[:, i] - points[:, j]
elseif bc == :periodic
Δ = abs.(points[:, i] - points[:, j])
Δ = min.(L .- Δ, Δ)
else
throw(ArgumentError("$bc is not a valid boundary condition"))
end
dist = norm(Δ, p)
if dist < cutoff
e = SimpleEdge(i, j)
weights[e] = dist
end
end
end
g = Graphs.SimpleGraphs._SimpleGraphFromIterator(keys(weights), Int)
if nv(g) < N
add_vertices!(g, N - nv(g))
end
return g, weights
end
The beauty of Julia is that this source is written in Julia and is pretty short. It immediately allows us to pinpoint the source of our problems. The core of the work is done in a double loop iterating over the i and j indices, so the complexity of this algorithm is quadratic in the number of vertices.
The second beauty of Julia is that we can easily fix this. The idea can be found in the Wikipedia entry on random geometric graphs in the algorithms section here.
A simple way to improve the performance of the algorithm is to notice that if you know L and cutoff you can partition the space into a grid of equal-sized cells, with floor(Int, L / cutoff) cells along each dimension (so each cell has side length at least cutoff). Now you see that if you have a vertex in some cell, then it can be connected only to nodes in the same cell or in cells directly adjacent to it (cells farther away contain only points that must be farther than cutoff from our point). This means that we will have a much lower number of points to consider. Below I show code that is a modification of the original source adding this feature. The key added function is to_buckets, which computes the bucket identifier for each vertex and creates a dictionary mapping bucket identifiers to vectors of the node numbers that fall into them:
using LinearAlgebra
using Random
function euclidean_graph2(
N::Int,
d::Int;
L=1.0,
rng::Union{Nothing,AbstractRNG}=nothing,
seed::Union{Nothing,Integer}=nothing,
kws...,
)
rng = Graphs.rng_from_rng_or_seed(rng, seed)
points = rmul!(rand(rng, d, N), L)
return (euclidean_graph2(points; L=L, kws...)..., points)
end
function to_buckets(points::Matrix, L, cutoff)
d, N = size(points)
dimlen = max(floor(Int, L / max(cutoff, eps())), 1)
buckets = Dict{Vector{Int}, Vector{Int}}()
for (i, point) in enumerate(eachcol(points))
bucket = floor.(Int, point .* dimlen ./ L)
push!(get!(() -> Int[], buckets, bucket), i)
end
return buckets, dimlen
end
function euclidean_graph2(points::Matrix; L=1.0, p=2.0, cutoff=-1.0, bc=:open)
d, N = size(points)
weights = Dict{Graphs.SimpleEdge{Int},Float64}()
cutoff < 0.0 && (cutoff = typemax(Float64))
if bc == :periodic
maximum(points) > L && throw(
DomainError(maximum(points), "Some points are outside the box of size $L")
)
end
buckets, dimlen = to_buckets(points, L, cutoff)
deltas = collect(Iterators.product((-1:1 for _ in 1:size(points, 1))...))
void = Int[]
for (k1, v1) in pairs(buckets)
for i in v1
for d in deltas
k2 = bc == :periodic ? mod.(k1 .+ d, dimlen) : k1 .+ d
v2 = get(buckets, k2, void)
for j in v2
i < j || continue
if bc == :open
Δ = points[:, i] - points[:, j]
elseif bc == :periodic
Δ = abs.(points[:, i] - points[:, j])
Δ = min.(L .- Δ, Δ)
else
throw(ArgumentError("$bc is not a valid boundary condition"))
end
dist = norm(Δ, p)
if dist < cutoff
e = Graphs.SimpleEdge(i, j)
weights[e] = dist
end
end
end
end
end
g = Graphs.SimpleGraphs._SimpleGraphFromIterator(keys(weights), Int)
if nv(g) < N
add_vertices!(g, N - nv(g))
end
return g, weights
end
Note that it took fewer than 30 additional lines of code to add the requested feature. Let us test the new code:
julia> for n in 1_000:1_000:10_000
println(@time ne(euclidean_graph2(n, 2; cutoff=sqrt(10/n), bc=:periodic, seed=1)[1])/n)
end
0.017221 seconds (274.10 k allocations: 21.751 MiB, 22.15% gc time)
15.604
0.022855 seconds (558.52 k allocations: 42.289 MiB, 10.43% gc time)
15.661
0.032693 seconds (852.46 k allocations: 69.684 MiB, 8.52% gc time)
15.744
0.043141 seconds (1.10 M allocations: 87.196 MiB, 14.73% gc time)
15.7065
0.071273 seconds (1.41 M allocations: 109.725 MiB, 7.67% gc time)
15.6568
0.068194 seconds (1.70 M allocations: 130.828 MiB, 12.54% gc time)
15.641333333333334
0.071277 seconds (1.98 M allocations: 150.712 MiB, 11.85% gc time)
15.743571428571428
0.081463 seconds (2.24 M allocations: 169.153 MiB, 10.67% gc time)
15.714
0.099957 seconds (2.48 M allocations: 186.492 MiB, 8.08% gc time)
15.722222222222221
0.148573 seconds (2.84 M allocations: 213.214 MiB, 18.37% gc time)
15.7086
We seem to get what we wanted. The computation time now scales quite well with the graph size. Also, the obtained average degree numbers are identical to the original ones.
Let us compare the performance on an even larger graph:
julia> n = 100_000;
julia> @time ne(euclidean_graph(n, 2; cutoff=sqrt(10/n), bc=:periodic, seed=1)[1])/n
908.976252 seconds (25.00 G allocations: 1.819 TiB, 11.64% gc time, 0.00% compilation time)
15.70797
julia> @time ne(euclidean_graph2(n, 2; cutoff=sqrt(10/n), bc=:periodic, seed=1)[1])/n
1.640495 seconds (27.53 M allocations: 2.121 GiB, 19.83% gc time)
15.70797
Indeed we see that the timing of the original implementation becomes prohibitive for larger graphs.
Before we finish, there is one important task left. We should check that the euclidean_graph2 function indeed produces the same results as euclidean_graph. This is easy to do with the following randomized test:
julia> using Test
julia> Random.seed!(1234);
julia> @time for i in 1:1000
N = rand(10:500)
d = rand(1:5)
L = rand()
p = 3*rand()
cutoff = rand() * L / 4
bc = rand([:open, :periodic])
seed = rand(UInt32)
@test euclidean_graph(N, d; L, p, cutoff, bc, seed) ==
euclidean_graph2(N, d; L, p, cutoff, bc, seed)
end
16.955773 seconds (275.09 M allocations: 20.342 GiB, 12.27% gc time)
We have tested 1000 random setups of the experiments. In each of them both functions returned the same results.
In this post I have shown you an example of how one can easily tweak package code to your needs. In this case the motivation was performance, but it could equally well be functionality.
I did not comment much on the code itself, as it was a bit longer than usual, but let me discuss one performance aspect of the code as a closing remark. In my to_buckets function I used the get! function to populate the dictionary with a mutable default value (Int[] in this case). You might wonder why I preferred to use an anonymous function instead of passing the default as a third argument. The reason is the number of allocations. Check this code:
julia> using BenchmarkTools
julia> function f1()
d = Dict(1 => Int[])
for i in 1:10^6
get!(d, 1, Int[])
end
return d
end
f1 (generic function with 1 method)
julia> function f2()
d = Dict(1 => Int[])
for i in 1:10^6
get!(() -> Int[], d, 1)
end
return d
end
f2 (generic function with 1 method)
julia> @benchmark f1()
BenchmarkTools.Trial: 195 samples with 1 evaluation.
Range (min … max): 19.961 ms … 45.328 ms ┊ GC (min … max): 10.26% … 13.19%
Time (median): 23.962 ms ┊ GC (median): 10.45%
Time (mean ± σ): 25.660 ms ± 5.050 ms ┊ GC (mean ± σ): 12.27% ± 3.76%
▃▂▂█▃▃ ▃▂▃
▆▄██████▇███▇▆▄▄▇▄▄▁▃▆▄▃▅▄▅▄▄▄▃▄▃▁▃▃▁▁▁▃▃▃▃▃▃▁▃▁▁▃▁▁▁▁▁▃▁▁▃ ▃
20 ms Histogram: frequency by time 44.3 ms <
Memory estimate: 61.04 MiB, allocs estimate: 1000005.
julia> @benchmark f2()
BenchmarkTools.Trial: 902 samples with 1 evaluation.
Range (min … max): 4.564 ms … 11.178 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 5.149 ms ┊ GC (median): 0.00%
Time (mean ± σ): 5.526 ms ± 1.396 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█▅▄▄▃▃▄▅▆▄▁▁ ▃
██████████████▆▆▆▆▆▅▇▆▄▆▆▅▅▁▄▅▁▅▁▇▅▆▄▄▇▁▅▅▅▄▁▄▄▆▅▁▅▁▁▅▆██▅ █
4.56 ms Histogram: log(frequency) by time 10 ms <
Memory estimate: 592 bytes, allocs estimate: 5.
As you can see, f2 is much faster than f1 and does far fewer allocations. The issue is that f1 allocates a fresh Int[] object in every iteration of the loop, while f2 allocates one only if get! does not hit a key that already exists in d (and in our experiment I always queried for 1, which was present in d).