Introduction

In 24 hours I will hold my first Twitch live streaming session. It begins on Friday, May 13, at 7 PM EDT on the ManningPublications channel.

In this post I want to share the source material I am going to present so that everyone interested can easily follow it.

The code is a shortened version of the contents of chapters 8 and 9 of my upcoming book, Julia for Data Analysis.

Environment setup

I will run the code under Julia 1.7.2. You will need to install the following packages (I list the versions I use):

CSV.jl 0.10.4
CodecBzip2.jl 0.7.2
DataFrames.jl 1.3.4
Loess.jl 0.5.4
Plots.jl 1.28.1
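
If you want to reproduce exactly these versions, one possible approach is to install them into a separate project environment with the standard Pkg API. The sketch below is only an illustration: the helper name install_stream_packages and the directory you pass to it are my assumptions, not something taken from the book.

```julia
import Pkg

# Versions listed in this post; registered package names drop the ".jl" suffix.
stream_packages = [
    "CSV" => "0.10.4",
    "CodecBzip2" => "0.7.2",
    "DataFrames" => "1.3.4",
    "Loess" => "0.5.4",
    "Plots" => "1.28.1",
]

# Hypothetical helper: activates a project in `dir` and installs
# the pinned versions there. It needs network access when called.
function install_stream_packages(dir::AbstractString)
    Pkg.activate(dir)
    specs = [Pkg.PackageSpec(name=name, version=version)
             for (name, version) in stream_packages]
    Pkg.add(specs)
end
```

Calling install_stream_packages("twitch-session") would then set everything up in one step. Downloads and Statistics, which the code also uses, ship with Julia 1.7 and do not need to be installed.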

The problem

In the session I will analyze the Lichess puzzles database. It contains information about over 2,000,000 puzzles, including the number of times a given puzzle was played, how hard the puzzle is, how much Lichess users like it, and what chess themes it features. My goal is to check the relationship between puzzle difficulty and how much users like the puzzle.

Source code

Here is the source code that I am going to present and explain during the session.

I will start by fetching the data from the internet, unpacking it, and reading it into a data frame:

import Downloads
Downloads.download("https://github.com/bkamins/JuliaForDataAnalysis/" *
                   "raw/main/puzzles.csv.bz2",
                   "puzzles.csv.bz2")

using CodecBzip2
compressed = read("puzzles.csv.bz2")
plain = transcode(Bzip2Decompressor, compressed)

using CSV
using DataFrames
puzzles = CSV.read(plain, DataFrame;
                   header=["PuzzleId", "FEN", "Moves", "Rating", "RatingDeviation",
                           "Popularity", "NbPlays", "Themes", "GameUrl"])

describe(puzzles)
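
The decompress-and-parse step above can be tried without downloading anything. The sketch below uses two made-up rows with only two columns (not the real nine-column file), compresses them in memory, and then reads them back the same way. Note that when header is given as a vector of names, CSV.jl treats the first row of the source as data.

```julia
using CodecBzip2
using CSV
using DataFrames

# Made-up two-row sample without a header line (an assumption for illustration).
csv_text = "0000D,1128\n0009B,1102\n"

# Compress in memory to imitate the contents of a .bz2 file.
compressed = transcode(Bzip2Compressor, Vector{UInt8}(csv_text))

# Same two steps as in the session code: decompress, then parse.
plain = transcode(Bzip2Decompressor, compressed)
toy = CSV.read(plain, DataFrame; header=["PuzzleId", "Rating"])
```

Both rows end up in the data frame, which confirms that the supplied names did not consume the first line.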

Next, I will perform exploratory analysis of the data and subset it to keep only the puzzles that I want to analyze later:

using Plots
plot([histogram(puzzles[!, col]; label=col) for
      col in ["Rating", "RatingDeviation", "Popularity", "NbPlays"]]...)

using Statistics
plays_lo = median(puzzles.NbPlays)
rating_lo = 1500
rating_hi = quantile(puzzles.Rating, 0.99)
row_selector = (puzzles.NbPlays .> plays_lo) .&&
               (rating_lo .< puzzles.Rating .< rating_hi)

sum(row_selector)
count(row_selector)

good = puzzles[row_selector, ["Rating", "Popularity"]]

plot(histogram(good.Rating; label="Rating"),
     histogram(good.Popularity; label="Popularity"))

describe(good)
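
To see how the row selector works, here is a small sketch on made-up vectors that stand in for the NbPlays and Rating columns. The upper rating bound is fixed at 2500 only to keep the toy example easy to follow (the session code uses the 99th percentile instead).

```julia
using Statistics

# Made-up toy data standing in for puzzles.NbPlays and puzzles.Rating.
nplays = [10, 50, 200, 400, 1000, 700]
rating = [1400, 1600, 1800, 2900, 2000, 1550]

plays_lo = median(nplays)   # 300.0
rating_lo = 1500
rating_hi = 2500            # fixed here; an assumption for the toy example

# Broadcasted comparisons combined element-wise with .&& (requires Julia 1.7).
selector = (nplays .> plays_lo) .&& (rating_lo .< rating .< rating_hi)

# For a Boolean vector, sum and count both give the number of true entries.
n_selected = count(selector)
```

Only the last two entries pass both conditions, so selector picks two rows, and sum(selector) returns the same number as count(selector).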

Finally, I will aggregate the data and analyze the relationship between puzzle difficulty and popularity:

grouped_good = groupby(good, :Rating, sort=true)
agg_good = combine(grouped_good, :Popularity => mean)
scatter(agg_good.Rating, agg_good.Popularity_mean;
        xlabel="rating", ylabel="mean popularity", legend=false)

using Loess
model = loess(agg_good.Rating, agg_good.Popularity_mean)
agg_good.pred = predict(model, float.(agg_good.Rating))
plot!(agg_good.Rating, agg_good.pred; width=5)
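
The aggregation step above can be checked by hand on a tiny data frame. The values below are made up for illustration. The point of the sketch is that combine with :Popularity => mean produces one row per rating and automatically names the new column Popularity_mean.

```julia
using DataFrames
using Statistics

# Made-up toy data: five puzzles with three distinct ratings.
toy = DataFrame(Rating=[1600, 1500, 1600, 1500, 1700],
                Popularity=[90, 80, 70, 100, 50])

# Group by rating (sorted) and compute the mean popularity in each group.
toy_agg = combine(groupby(toy, :Rating, sort=true), :Popularity => mean)
```

Here the puzzles rated 1500 have mean popularity 90.0, those rated 1600 have 80.0, and the single puzzle rated 1700 keeps its value of 50.0.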

Conclusions

I invite everyone to join me during the Twitch live streaming session. If you have any questions, please do not hesitate to ask them in the chat, and I will try to answer them live. I hope you will enjoy it!