In 24 hours I will have my first Twitch live streaming session. It will begin on Friday, May 13, at 7 PM EDT on the ManningPublications channel.
In this post I want to share the source material I am going to present so that everyone interested can easily follow it.
The code is a shortened version of the contents of chapters 8 and 9 of my upcoming Julia for Data Analysis book.
I will run the code under Julia 1.7.2. You will need to install the following packages (I list the versions I use):
- CSV.jl 0.10.4
- CodecBzip2.jl 0.7.2
- DataFrames.jl 1.3.4
- Loess.jl 0.5.4
- Plots.jl 1.28.1
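If you want to match these versions exactly, one option is to pin them with the Pkg standard library. This is only a minimal sketch: the project folder name `twitch_session` is a hypothetical name chosen for illustration, and installing the latest releases may work just as well.

```julia
using Pkg

# Hypothetical project folder (an assumption, not part of the session material).
Pkg.activate("twitch_session")

# Pin each package to the version listed above.
Pkg.add([
    Pkg.PackageSpec(name="CSV", version="0.10.4"),
    Pkg.PackageSpec(name="CodecBzip2", version="0.7.2"),
    Pkg.PackageSpec(name="DataFrames", version="1.3.4"),
    Pkg.PackageSpec(name="Loess", version="0.5.4"),
    Pkg.PackageSpec(name="Plots", version="1.28.1"),
])
```

Downloads and Statistics, which the code also uses, ship with Julia 1.7 and do not need to be installed.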
In the session I will analyze the Lichess puzzles database. It contains information about over 2,000,000 puzzles, including the number of times a given puzzle was played, how hard the puzzle is, how much Lichess users like the puzzle, and what chess themes the puzzle features. My goal is to check the relationship between puzzle difficulty and how much users like the puzzle.
Here is the source code that I am going to present and explain during the session.
I will start by fetching the data from the internet, unpacking it, and reading it into a data frame:

    import Downloads
    # Fetch the compressed puzzle database.
    Downloads.download("https://github.com/bkamins/JuliaForDataAnalysis/" *
                       "raw/main/puzzles.csv.bz2",
                       "puzzles.csv.bz2")

    using CodecBzip2
    # Read the raw bytes and decompress them in memory.
    compressed = read("puzzles.csv.bz2")
    plain = transcode(Bzip2Decompressor, compressed)

    using CSV
    using DataFrames
    # Column names are passed explicitly, so CSV.jl treats every row as data.
    puzzles = CSV.read(plain, DataFrame;
                       header=["PuzzleId", "FEN", "Moves",
                               "Rating","RatingDeviation",
                               "Popularity", "NbPlays",
                               "Themes","GameUrl"])
    describe(puzzles)
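To try the decompress-and-parse step without downloading anything, here is a small sketch on made-up data. The two rows and the three column names are invented for illustration and are not taken from the Lichess file; the compression step is there only so that the example has something to decompress.

```julia
using CodecBzip2
using CSV
using DataFrames

# Two made-up rows with no header line (illustrative data only).
csv_text = "p1,1500,90\np2,1800,75\n"

# Compress in memory so the sketch needs no file or network access.
compressed = transcode(Bzip2Compressor, Vector{UInt8}(csv_text))

# Same decompression call as in the session code.
plain = transcode(Bzip2Decompressor, compressed)

# Names are given explicitly, so both rows are read as data.
toy = CSV.read(plain, DataFrame; header=["PuzzleId", "Rating", "Popularity"])
```

`CSV.read` accepts the decompressed byte vector directly, so no temporary uncompressed file is written to disk.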
Next, I will perform exploratory data analysis of the database and subset it to keep only the puzzles that I want to analyze later:

    using Plots
    # One histogram per numeric column.
    plot([histogram(puzzles[!, col]; label=col) for
          col in ["Rating", "RatingDeviation",
                  "Popularity", "NbPlays"]]...)

    using Statistics
    # Thresholds used to filter the puzzles.
    plays_lo = median(puzzles.NbPlays)
    rating_lo = 1500
    rating_hi = quantile(puzzles.Rating, 0.99)

    # Boolean mask of the rows to keep.
    row_selector = (puzzles.NbPlays .> plays_lo) .&&
                   (rating_lo .< puzzles.Rating .< rating_hi)
    # Two ways to count the selected rows.
    sum(row_selector)
    count(row_selector)

    good = puzzles[row_selector, ["Rating", "Popularity"]]
    plot(histogram(good.Rating; label="Rating"),
         histogram(good.Popularity; label="Popularity"))
    describe(good)
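The row selection relies on broadcasting, which is easier to see on a handful of values. The sketch below uses two made-up vectors (`nb_plays` and `rating` are hypothetical stand-ins for the data frame columns) and needs only the Statistics standard library. The broadcasted `.&&` requires Julia 1.7 or later.

```julia
using Statistics

# Made-up values standing in for the NbPlays and Rating columns.
nb_plays = [10, 50, 200, 1000, 5]
rating = [1400, 1600, 1800, 2900, 2000]

plays_lo = median(nb_plays)
rating_lo = 1500
rating_hi = quantile(rating, 0.99)

# Element-wise conditions combined into a single Boolean mask.
selector = (nb_plays .> plays_lo) .&& (rating_lo .< rating .< rating_hi)

# Both calls give the number of `true` entries in the mask.
n_sum = sum(selector)
n_count = count(selector)
```

Here only the third element passes both conditions. `sum` and `count` agree on a Boolean mask; `count` states the intent more directly.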
Finally, I will aggregate the data stored in the Lichess database and analyze the relationship between puzzle difficulty and popularity:

    # Mean popularity for each distinct rating value.
    grouped_good = groupby(good, :Rating, sort=true)
    agg_good = combine(grouped_good, :Popularity => mean)
    scatter(agg_good.Rating, agg_good.Popularity_mean;
            xlabel="rating", ylabel="mean popularity", legend=false)

    using Loess
    # Fit a LOESS smoother and overlay its predictions on the scatter plot.
    model = loess(agg_good.Rating, agg_good.Popularity_mean)
    agg_good.pred = predict(model, float.(agg_good.Rating))
    plot!(agg_good.Rating, agg_good.pred; width=5)
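The split-apply-combine step can be checked by hand on a tiny table. This is a sketch with five made-up rows; the values are invented for illustration, and only the column names mirror the real data.

```julia
using DataFrames
using Statistics

# Five made-up puzzles (illustrative values only).
toy = DataFrame(Rating=[1600, 1500, 1600, 1500, 1700],
                Popularity=[80, 90, 100, 70, 60])

# Split the rows by rating, keeping the groups in ascending order.
grouped = groupby(toy, :Rating, sort=true)

# One output row per group; the new column is named Popularity_mean automatically.
agg = combine(grouped, :Popularity => mean)
```

The result has one row per distinct rating (1500, 1600, 1700) with mean popularities 80.0, 90.0, and 60.0. This is the same shape of table that is passed to `scatter` and `loess` in the session code.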
I invite everyone to join me during the Twitch live streaming session. If you have any questions, please do not hesitate to ask them in the chat and I will try to answer them live. I hope you will enjoy it!