A little exercise in CSV.jl and DataFrames.jl
Introduction
This week I have discussed with my colleague the Lichess puzzle dataset that I use in my Julia for Data Analysis book.
The dataset contains a list of puzzles along with information about them, such as puzzle difficulty, puzzle solution, and tags describing puzzle type.
We were discussing if tags assigned to puzzles in this dataset are accurate. In this post I give you an example how one can check it (and practice a bit CSV.jl and DataFrames.jl).
The post was written under Julia 1.10.0, CSV.jl 0.10.12, and DataFrames.jl 1.6.1.
Getting the data
In this post I show you a relatively brief code. Therefore I assume that first you download the file with the puzzle dataset and unpack it manually. (In the book I show how to do it using Julia. You can find the source code on GitHub repository of the book.)
Assuming you downloaded and unpacked the dataset into the puzzles.csv
file
we read it in. We are interested only in columns 3 and 8 of this file,
so I use the following commands:
julia> using CSV
julia> using DataFrames
julia> df = CSV.read("puzzles.csv", DataFrame; select=[3, 8], header=false)
2132989×2 DataFrame
Row │ Column3 Column8
│ String String
─────────┼──────────────────────────────────────────────────────────────────────
1 │ f2g3 e6e7 b2b1 b3c1 b1c1 h6c1 crushing hangingPiece long middl…
2 │ d3d6 f8d8 d6d8 f6d8 advantage endgame short
3 │ b6c5 e2g4 h3g4 d1g4 advantage middlegame short
4 │ g5e7 a5c3 b2c3 c6e7 advantage master middlegame short
5 │ e8f7 e2e6 f7f8 e6f7 mate mateIn2 middlegame short
6 │ a6a5 e5c7 a5b4 c7d8 crushing endgame fork short
7 │ d4b6 f6e4 h1g1 e4f2 crushing endgame short trappedPi…
8 │ d8f6 d1h5 h7h6 h5c5 advantage middlegame short
⋮ │ ⋮ ⋮
2132982 │ d2c2 c5d3 c2d3 c4d3 crushing fork middlegame short
2132983 │ b8d7 c3b5 d6b8 a1c1 e8g8 b5c7 crushing long middlegame quietMo…
2132984 │ g7g6 d5c6 c5c4 b3c4 b4c4 c6d6 crushing defensiveMove endgame l…
2132985 │ g1h1 e3e1 f7f1 e1f1 endgame mate mateIn2 short
2132986 │ g5c1 d5d6 d7f6 h7h8 advantage middlegame short
2132987 │ d2f3 d8a5 c1d2 a5b5 advantage fork opening short
2132988 │ f7f2 b2c2 c1b1 e2d1 endgame mate mateIn2 queensideAt…
2132989 │ c6d4 f1e1 e8d8 b1c3 d4f3 g2f3 advantage long opening
2132973 rows omitted
julia> rename!(df, ["moves", "tags"])
2132989×2 DataFrame
Row │ moves tags
│ String String
─────────┼──────────────────────────────────────────────────────────────────────
1 │ f2g3 e6e7 b2b1 b3c1 b1c1 h6c1 crushing hangingPiece long middl…
2 │ d3d6 f8d8 d6d8 f6d8 advantage endgame short
3 │ b6c5 e2g4 h3g4 d1g4 advantage middlegame short
4 │ g5e7 a5c3 b2c3 c6e7 advantage master middlegame short
5 │ e8f7 e2e6 f7f8 e6f7 mate mateIn2 middlegame short
6 │ a6a5 e5c7 a5b4 c7d8 crushing endgame fork short
7 │ d4b6 f6e4 h1g1 e4f2 crushing endgame short trappedPi…
8 │ d8f6 d1h5 h7h6 h5c5 advantage middlegame short
⋮ │ ⋮ ⋮
2132982 │ d2c2 c5d3 c2d3 c4d3 crushing fork middlegame short
2132983 │ b8d7 c3b5 d6b8 a1c1 e8g8 b5c7 crushing long middlegame quietMo…
2132984 │ g7g6 d5c6 c5c4 b3c4 b4c4 c6d6 crushing defensiveMove endgame l…
2132985 │ g1h1 e3e1 f7f1 e1f1 endgame mate mateIn2 short
2132986 │ g5c1 d5d6 d7f6 h7h8 advantage middlegame short
2132987 │ d2f3 d8a5 c1d2 a5b5 advantage fork opening short
2132988 │ f7f2 b2c2 c1b1 e2d1 endgame mate mateIn2 queensideAt…
2132989 │ c6d4 f1e1 e8d8 b1c3 d4f3 g2f3 advantage long opening
2132973 rows omitted
Note that the file does not have a header so when reading it we passed header=false
and then manually named the columns using rename!
.
The task
I wanted only these two columns since today I want to check if the tags related
to mating are accurate. You can notice in the above printout that in the "tags"
column we have a tag "mateIn2"
. It indicates that the puzzle is mate in two moves.
This is the case for example for rows 5, 2132985, and 2132988.
In the matching "moves"
column we see that we have 4 corresponding moves.
The reason is that we have two players making the move (and 2 + 2 = 4).
What we want to check if these "mateInX"
tags are correct. I will check the
values of X
from 1 to 5 (as only these five options are present in tags,
I leave it to you as an exercise to verify).
When should we call the tags correct. There are two conditions:
- there is no duplicate tagging (e.g. a puzzle cannot be
"mateIn1"
and"mateIn2"
at the same time); - the number of moves in a puzzle matches the tag.
Let us check it.
The solution
As a first step we (in place, i.e. modifying our df
data frame) transform the original columns
into more convenient form. Instead of the raw "moves"
I want the "nmoves"
column that gives me
a number of moves in the puzzle. Similarly instead of "tags"
I want indicator columns "mateInX"
for X
ranging from 1 to 5 showing me the puzzle type. Here is how you can achieve this:
julia> select!(df,
"moves" => ByRow(length∘split) => "nmoves",
["tags" => ByRow(contains("mateIn$i")) => "mateIn$i" for i in 1:5])
2132989×6 DataFrame
Row │ nmoves mateIn1 mateIn2 mateIn3 mateIn4 mateIn5
│ Int64 Bool Bool Bool Bool Bool
─────────┼─────────────────────────────────────────────────────
1 │ 6 false false false false false
2 │ 4 false false false false false
3 │ 4 false false false false false
4 │ 4 false false false false false
5 │ 4 false true false false false
6 │ 4 false false false false false
7 │ 4 false false false false false
8 │ 4 false false false false false
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
2132982 │ 4 false false false false false
2132983 │ 6 false false false false false
2132984 │ 6 false false false false false
2132985 │ 4 false true false false false
2132986 │ 4 false false false false false
2132987 │ 4 false false false false false
2132988 │ 4 false true false false false
2132989 │ 6 false false false false false
2132973 rows omitted
Now we see that some of the rows are not tagged as "mateInX"
. Let us filter them out,
to have only tagged rows left (again, we do the operation in-place):
julia> filter!(row -> any(row[Not("nmoves")]), df)
491743×6 DataFrame
Row │ nmoves mateIn1 mateIn2 mateIn3 mateIn4 mateIn5
│ Int64 Bool Bool Bool Bool Bool
────────┼─────────────────────────────────────────────────────
1 │ 4 false true false false false
2 │ 4 false true false false false
3 │ 2 true false false false false
4 │ 4 false true false false false
5 │ 2 true false false false false
6 │ 4 false true false false false
7 │ 4 false true false false false
8 │ 2 true false false false false
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
491736 │ 6 false false true false false
491737 │ 4 false true false false false
491738 │ 2 true false false false false
491739 │ 4 false true false false false
491740 │ 2 true false false false false
491741 │ 2 true false false false false
491742 │ 4 false true false false false
491743 │ 4 false true false false false
491727 rows omitted
Note that in the condition I used the row[Not("nmoves")]
selector, as I wanted to check all columns except the "nmoves"
.
Now we are ready to check the correctness of tags:
julia> combine(groupby(df, "nmoves"), Not("nmoves") .=> sum)
10×6 DataFrame
Row │ nmoves mateIn1_sum mateIn2_sum mateIn3_sum mateIn4_sum mateIn5_sum
│ Int64 Int64 Int64 Int64 Int64 Int64
─────┼─────────────────────────────────────────────────────────────────────────
1 │ 2 136843 0 0 0 0
2 │ 4 0 274135 0 0 0
3 │ 6 0 0 68623 0 0
4 │ 8 0 0 0 9924 0
5 │ 10 0 0 0 0 1691
6 │ 12 0 0 0 0 367
7 │ 14 0 0 0 0 127
8 │ 16 0 0 0 0 25
9 │ 18 0 0 0 0 7
10 │ 20 0 0 0 0 1
The table reads as follows:
- There are no duplicates in tags.
- Tags
"mateInX"
forX
in 1 to 4 range are correct. The"mateIn5"
tag actually means a situation where there are five or more moves.
So the verdict is that tagging is correct, but we need to know the interpretation of
"mateIn5"
column as it is actually five or more moves. We could rename the column to
e.g. "mateIn5+"
to reflect that or add a metadata to our df
table where we would store
this information (I leave this to you as an exercise).
Conclusions
I hope that CSV.jl and DataFrames.jl users found the examples that I gave today useful and interesting. Enjoy!