The mighty trio: Project Euler, Simpson's paradox, and DataFrames.jl
Introduction
Recently, following users’ feedback, I started solving more puzzles in my blog. To choose a problem for today I went to Project Euler page, which provides hundreds of very nice puzzles. I wanted to avoid solving some very easy problem, so I decided to pick the first puzzle from the list that has less than 1000 people who solved it. This lead me to problem 236.
To my surprise the problem is an example of the Simpson’s paradox, so hopefully if you are a data scientist, you will find it interesting.
In general Project Euler does not encourage posting solutions to the problems given in that page. For this reason, on purpose, I have not optimized my code and the solution is brute force (to encourage you to find an elegant solution; the presented solution takes several minutes to produce the result). Also, I decided to show the code, but not the value of the final solution (to encourage you to try solving it).
So, why I think it is interesting to read this post? I solved the puzzle mostly using DataFrames.jl, so I hope you can find some useful data wrangling patterns when reading it.
The presented codes were run under Julia 1.8.2, DataFrames.jl 1.4.3, and DataFramesMeta.jl 0.12.0. The solution requires 32GB of RAM (this is, again, on purpose; if you have less RAM I encourage you to optimize my code as a part of the Project Euler challenge).
Simpson’s paradox
Simpson’s paradox is a phenomenon in which data analyzed in groups presents a different direction of relationships than if you analyze the data after aggregating it.
To show you the paradox in practice let me introduce the Project Euler problem 236. The description is copied from Project Euler page:
Suppliers ‘A’ and ‘B’ provided the following numbers of products for the luxury hamper market:
Product 'A' 'B'
Beluga Caviar 5248 640
Christmas Cake 1312 1888
Gammon Joint 2624 3776
Vintage Port 5760 3776
Champagne Truffles 3936 5664
Although the suppliers try very hard to ship their goods in perfect condition, there is inevitably some spoilage - i.e. products gone bad.
The suppliers compare their performance using two types of statistic:
- The five per-product spoilage rates for each supplier are equal to the number of products gone bad divided by the number of products supplied, for each of the five products in turn.
- The overall spoilage rate for each supplier is equal to the total number of products gone bad divided by the total number of products provided by that supplier.
To their surprise, the suppliers found that each of the five per-product spoilage rates was worse (higher) for ‘B’ than for ‘A’ by the same factor (ratio of spoilage rates), m>1; and yet, paradoxically, the overall spoilage rate was worse for ‘A’ than for ‘B’, also by a factor of m.
What’s the largest possible value of m?
This situation is an example of Simpson’s paradox. Within-groups supplier ‘B’ has more spoilage than supplier ‘A’, however in aggregate the situation is reversed. Before we go and try solving the puzzle let us check if such a situation is possible.
Assume that the amount of spoilage per product and supplier is as follows:
Product 'A' 'B'
Beluga Caviar 2478 630
Christmas Cake 16 48
Gammon Joint 33 99
Vintage Port 180 246
Champagne Truffles 23 69
We can check that the spoilage percentages are approximately:
Product 'A' 'B'
Beluga Caviar 47.22% 98.44%
Christmas Cake 1.22% 2.54%
Gammon Joint 1.26% 2.62%
Vintage Port 3.13% 6.51%
Champagne Truffles 0.58% 1.22%
As you can see supplier ‘B’ has over 2 times higher spoilage percentage on every product (you can check that the spoilage percentage ratio is the same for all products and is approximately equal to 2.085).
Now let us check aggregates. Supplier ‘A’ delivered 18880 products of which 2730 were spoiled. Supplier ‘B’ delivered 15744 products of which 1092 were spoiled. So the aggregate spoilage percentage for supplier ‘A’ is approximately 14.46%, and for supplier ‘B’ 6.94%. Interestingly, supplier ‘A’ has over two times higher spoilage percentage than supplier ‘B’ (astonishingly the ratio of spoilage percentage is the same as above, but in reverse direction).
Solving the puzzle
Let us now solve the puzzle and find the largest possible ratio m of spoilage percentages between suppliers ‘A’ and ‘B’ that meets the conditions of the puzzle.
Start by loading the required DataFramesMeta.jl package and the data:
julia> using DataFramesMeta
julia> const pv = [(a=5248, b=640),
(a=1312, b=1888),
(a=2624, b=3776),
(a=5760, b=3776),
(a=3936, b=5664)]
5-element Vector{NamedTuple{(:a, :b), Tuple{Int64, Int64}}}:
(a = 5248, b = 640)
(a = 1312, b = 1888)
(a = 2624, b = 3776)
(a = 5760, b = 3776)
(a = 3936, b = 5664)
julia> const pt = (a=sum(x -> x.a, pv), b=sum(x -> x.b, pv))
(a = 18880, b = 15744)
The pv
vector holds the amounts of products delivered by suppliers ‘A’ and
‘B’ and the pt
named tuple stores their totals.
We know that for each product-supplier combination spoilage is at least 1.
Using the allcombinations
function we create data frames containing all
possible compositions of spoilage amounts for each product and for total, and
store them in dfs
vector:
julia> dfs = [allcombinations(DataFrame, a=1:p.a, b=1:p.b) for p in pv];
julia> push!(dfs, allcombinations(DataFrame, a=1:pt.a, b=1:pt.b));
julia> nrow.(dfs)
6-element Vector{Int64}:
3358720
2477056
9908224
21749760
22293504
297246720
We can see that the number of options is large, but seems within range what current computers can handle.
Before we proceed let us peek at the contents of one of the data frames to be sure that its structure is as expected:
julia> dfs[1]
3358720×2 DataFrame
Row │ a b
│ Int64 Int64
─────────┼──────────────
1 │ 1 1
2 │ 2 1
3 │ 3 1
⋮ │ ⋮ ⋮
3358718 │ 5246 640
3358719 │ 5247 640
3358720 │ 5248 640
3358714 rows omitted
All looks good.
Now for each combination we need to compute the m ratio, remembering that for
totals it should be reversed. Additionally, we want to collect columns :a
and
:b
into one column holding a tuple of values stored in them, as it will be
easier to work with such data later and only keep options for which m is
greater than 1. We can easily do both tasks using macros provided by
DataFramesMeta.jl:
julia> ms = Function[(x, y) -> (y * p.a) // (x * p.b) for p in pv];
julia> push!(ms, (x, y) -> (x * pt.b) // (y * pt.a));
julia> foreach(dfs, ms) do df, m
@rselect!(df, :m = m(:a, :b), :ab = (:a, :b))
@rsubset!(df, :m > 1)
end
Note that for each data frame I defined its own function computing the m
ratio and stored it in ms
vector.
Again, let us check how the data frames look after the transformation:
julia> nrow.(dfs)
6-element Vector{Int64}:
1681600
1238224
4953504
10875840
11145840
148621760
julia> dfs[1]
1681600×2 DataFrame
Row │ m ab
│ Rational… Tuple…
─────────┼─────────────────────────
1 │ 41//5 (1, 1)
2 │ 41//10 (2, 1)
3 │ 41//15 (3, 1)
⋮ │ ⋮ ⋮
1681598 │ 5248//5245 (5245, 640)
1681599 │ 2624//2623 (5246, 640)
1681600 │ 5248//5247 (5247, 640)
1681594 rows omitted
We notice that we dropped around 50% of rows (this is expected). Observe that
using the //
operator in the functions calculating m we get rational number
as a result, so our computations are exact, without any rounding.
Let us now group the data frames by the m ratio value (remember we want a solution where this ratio is the same for all products and for total):
julia> gdfs = groupby.(dfs, :m);
julia> length.(gdfs)
6-element Vector{Int64}:
1021210
753076
3012264
6612288
6777256
90353608
We see that we have quite a lot groups. However we are interested only in cases where m matches in all the tables, so let us now find such candidate m values:
julia> key_tuples = [@combine(gdf, :mt = (first(:m),)).mt for gdf in gdfs];
julia> match_keys = intersect(key_tuples...);
julia> sort!(match_keys, rev=true)
7040-element Vector{Tuple{Rational{Int64}}}:
(1230//1,)
(738//1,)
(615//1,)
⋮
(1230//1229,)
(1476//1475,)
(3895//3894,)
Fortunately we see that only 7040 possible values of m are available. I have sorted the matching keys in a descending order to have highest m values first.
You might wonder why I used @combine
to convert the ratios into tuples
holding one-element with this ratio. The reason is that GroupedDataFrame
supports a fast group lookup by the value of the grouping variable and one of
the options of such lookup is to pass a tuple storing it. Therefore we can
create groups
vector that stores 6-element vectors of data frames
representing the five products and the total that have matching m:
julia> groups = [[gdfs[i][key] for i in 1:6] for key in match_keys];
Note that this operation in DataFrames.jl is memory efficient. Indexing into
a GroupedDataFrame
to get a single group returns a SubDataFrame
, which
is a view of the source data frame.
As before, let us peek at the result:
julia> groups[1]
6-element Vector{SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}}}:
4×2 SubDataFrame
Row │ m ab
│ Rational… Tuple…
─────┼─────────────────────
1 │ 1230//1 (1, 150)
2 │ 1230//1 (2, 300)
3 │ 1230//1 (3, 450)
4 │ 1230//1 (4, 600)
1×2 SubDataFrame
Row │ m ab
│ Rational… Tuple…
─────┼──────────────────────
1 │ 1230//1 (1, 1770)
2×2 SubDataFrame
Row │ m ab
│ Rational… Tuple…
─────┼──────────────────────
1 │ 1230//1 (1, 1770)
2 │ 1230//1 (2, 3540)
1×2 SubDataFrame
Row │ m ab
│ Rational… Tuple…
─────┼──────────────────────
1 │ 1230//1 (3, 2419)
3×2 SubDataFrame
Row │ m ab
│ Rational… Tuple…
─────┼──────────────────────
1 │ 1230//1 (1, 1770)
2 │ 1230//1 (2, 3540)
3 │ 1230//1 (3, 5310)
12×2 SubDataFrame
Row │ m ab
│ Rational… Tuple…
─────┼────────────────────────
1 │ 1230//1 (1475, 1)
2 │ 1230//1 (2950, 2)
3 │ 1230//1 (4425, 3)
4 │ 1230//1 (5900, 4)
5 │ 1230//1 (7375, 5)
6 │ 1230//1 (8850, 6)
7 │ 1230//1 (10325, 7)
8 │ 1230//1 (11800, 8)
9 │ 1230//1 (13275, 9)
10 │ 1230//1 (14750, 10)
11 │ 1230//1 (16225, 11)
12 │ 1230//1 (17700, 12)
So the highest potentially possible m ratio is 1230. But is it valid? Note that we have not checked one condition yet. The total spoilage (one of the options in the last data frame displayed above) must be a sum of five per-product spoilages.
So we need to drop these m ratios for which this is not possible. To achieve
this we define the test_options
helper function:
julia> function test_options(inv, outs)
v = copy(inv[1])
for i in 2:4
s = Set{Tuple{Int, Int}}()
for a in v, b in inv[i]
push!(s, a .+ b)
end
v = collect(s)
end
for a in v, b in inv[5]
a .+ b in outs && return true
end
return false
end
test_options (generic function with 1 method)
This function takes two arguments inv
which is a vector of :ab
spoilages
of the five products and outs
which is a Set
of total spoilages.
The function works as follows. We check what possible total sums of five
product spoilages we can get (we do it iteratively, starting from product one,
then adding options for consecutive products). Then we check if the totals of
spoilages for both suppliers can be found in the outs
collection (and that is
the reason this is a Set
, as sets support fast lookup). If this is the case
we return true
. If it is not possible we return false
.
Note that the trick in this function, to make it feasible computationally
(again good enough - the solution is not optimal, this is a challenge for you),
is the incremental computation of possible (a, b)
product total sums and
reduction of possible options after each iteration in a Set
. From the Julia
programming perspective two lines of code are relevant. The first is
v = copy(inv[1])
and the second is v = collect(s)
. They make the function
type-stable, as they ensure that the v
variable over which we iterate a lot
within a function is always bound to the object that has
Vector{Tuple{Int, Int}}
type that can be inferred by the compiler.
Let use use the test_options
function to filter out the m values that
meet the condition of our puzzle:
julia> answer = [g[1].m[1] for g in groups
if test_options([g[i].ab for i in 1:5], Set(g[6].ab))];
julia> first(answer);
Since we have sorted m in descending order above first(answer)
is a solution
to our puzzle. I have put ;
at the end of all expressions to hide it.
Though, let me show you the data that was disclosed by the puzzle authors on the Project Euler problem 236 page:
julia> length(answer)
35
julia> last(answer)
1476//1475
Indeed we see that there are 35 unique values of m that are feasible and that the smallest of them is 1476/1575, so it is quite likely that the code works correctly.
Conclusions
I hope you enjoyed the puzzle, the Simpson’s paradox example, and the solution using DataFrames.jl. If you would like to check your programming skills I encourage you to improve over my solution (an efficient code solving the puzzle runs in under 1 second).