Cross validation: a second take
Introduction
After my last post I got a comment suggesting that I should concentrate more on statistical conclusions than on how to implement things. Therefore, today I continue that post and show how cross validation works for model selection.
In this post I use Julia 1.6.3, DataFrames.jl 1.2.2, FreqTables.jl 0.4.5, and GLM.jl 1.5.1.
Code for experiment setup
We will use the same settings as in my last post.
As before we will collect three mean squared errors:
- mse_whole: calculated on the training data set;
- mse_cv: calculated using 10-fold cross validation;
- mse_t: expected prediction squared error.
The difference is that this time we will make the following changes:
- we will have nine features x1, x2, …, x9, but only the first five of them will affect the target, each with weight 1; the remaining four will be unimportant;
- I will test nine nested models, incrementally adding features x1, x2, up to x9 to the model.
We will be interested to see how well cross validation works for model selection.
Here is the code that sets up the experiments (except for the changes described above it is identical to what we have discussed in my last post):
using DataFrames
using FreqTables
using GLM
using Random
using Statistics

# mean squared error of the model predictions on data frame df
mse(model, df) = sum(x -> x^2, predict(model, df) - df.y) / nrow(df);

# expected prediction squared error of a fitted model,
# given the known data generation process
function mset(model)
    c = zeros(10)
    c[1:length(coef(model))] .= coef(model)
    ref = [ones(6); zeros(4)]
    return 1 + sum((c .- ref) .^ 2)
end

# split df into training and test parts for a given fold
get_fold_data(df, fold) =
    (train=view(df, df.fold .!= fold, :),
     test=view(df, df.fold .== fold, :))

function runtest(id)
    # nine standard normal features; only x1, …, x5 (plus an intercept) drive y
    df = DataFrame(randn(50, 9), :auto)
    df.y = sum(eachcol(df)[1:5]) .+ 1.0 .+ randn(50)
    # nested formulas y ~ x1 + … + xn for n = 1, …, 9
    formulas = [Term(:y) ~ sum([Term(Symbol(:x, i)) for i in 1:n]) for n in 1:9]
    models = [lm(f, df) for f in formulas]
    mse_wholes = [mse(m, df) for m in models]
    mse_ts = [mset(m) for m in models]
    # assign each observation to one of 10 folds
    df.fold = shuffle!((1:nrow(df)) .% 10)
    mse_cvs = map(formulas) do f
        return mean(0:9) do fold
            train, test = get_fold_data(df, fold)
            model_cv = lm(f, train)
            return mse(model_cv, test)
        end
    end
    return DataFrame(id=id, vars=1:9, mse_whole=mse_wholes,
                     mse_cv=mse_cvs, mse_t=mse_ts)
end
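As an aside, here is a short sketch of why mset gives the expected prediction squared error (under the setup visible in the code: the features of a new observation and the noise are independent standard normal, and the fitted coefficients are treated as fixed). Write b0 for the fitted intercept and b1, …, b9 for the fitted slopes, with bi = 0 for variables not included in the model. Then for a new observation:

$$
y - \hat{y} = (1 - b_0) + \sum_{i=1}^{5} (1 - b_i) x_i - \sum_{i=6}^{9} b_i x_i + \varepsilon,
$$

and, since the x_i and the noise are independent with mean 0 and variance 1, all cross terms vanish in expectation:

$$
\mathbb{E}\big[(y - \hat{y})^2\big] = 1 + (1 - b_0)^2 + \sum_{i=1}^{5} (1 - b_i)^2 + \sum_{i=6}^{9} b_i^2,
$$

which is exactly the value 1 + sum((c .- ref) .^ 2) returned by mset.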
The experiment
Since we know the data generation process, we know that the true model uses five explanatory variables.
Let us now run 1000 independent repetitions of our experiment and analyze the results:
julia> Random.seed!(12);
julia> res = reduce(vcat, [runtest(i) for i in 1:1000])
9000×5 DataFrame
Row │ id vars mse_whole mse_cv mse_t
│ Int64 Int64 Float64 Float64 Float64
──────┼───────────────────────────────────────────
1 │ 1 1 5.44188 5.84075 5.15904
2 │ 1 2 2.8385 3.16917 4.4904
3 │ 1 3 2.29577 2.71023 3.48133
4 │ 1 4 1.71476 2.05224 2.31772
5 │ 1 5 0.859515 1.04158 1.14846
6 │ 1 6 0.853335 1.11315 1.17749
7 │ 1 7 0.845039 1.12457 1.18426
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
8995 │ 1000 4 1.41656 1.64706 2.42478
8996 │ 1000 5 0.886208 1.11717 1.2702
8997 │ 1000 6 0.885057 1.18246 1.26373
8998 │ 1000 7 0.884199 1.26832 1.25502
8999 │ 1000 8 0.849425 1.24977 1.33918
9000 │ 1000 9 0.829926 1.24833 1.35149
8987 rows omitted
First we check what the mean MSE across the experiments is (this is not a super useful statistic, but let us start with it):
julia> combine(groupby(res, :vars), names(res, r"mse") .=> mean)
9×4 DataFrame
Row │ vars mse_whole_mean mse_cv_mean mse_t_mean
│ Int64 Float64 Float64 Float64
─────┼────────────────────────────────────────────────
1 │ 1 4.81822 5.25666 5.21646
2 │ 2 3.7604 4.28551 4.2631
3 │ 3 2.74739 3.27702 3.26673
4 │ 4 1.79483 2.24701 2.23122
5 │ 5 0.872812 1.1506 1.13955
6 │ 6 0.853597 1.18245 1.16617
7 │ 7 0.834749 1.21654 1.19383
8 │ 8 0.815655 1.25266 1.2222
9 │ 9 0.796079 1.2905 1.25287
We can see that on average mse_whole decreases with the number of variables, while mse_cv and mse_t have a minimum at five variables. Also, mse_cv is slightly biased upwards in comparison to mse_t, while mse_whole is significantly biased downwards. All this was expected.
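To put numbers on these biases, one could average the gap between each estimate and mse_t for every model size. Here is a minimal sketch reusing the res data frame from above (the output column names cv_bias and whole_bias are just illustrative):

# average gap between each MSE estimate and the true expected error, per model size
combine(groupby(res, :vars),
        [:mse_cv, :mse_t] => ((cv, t) -> mean(cv .- t)) => :cv_bias,
        [:mse_whole, :mse_t] => ((w, t) -> mean(w .- t)) => :whole_bias)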
However, we are interested in model selection. So let us check which models would get selected by the three criteria:
julia> best = combine(groupby(res, :id), names(res, r"mse") .=> argmin)
1000×4 DataFrame
Row │ id mse_whole_argmin mse_cv_argmin mse_t_argmin
│ Int64 Int64 Int64 Int64
──────┼──────────────────────────────────────────────────────
1 │ 1 9 5 5
2 │ 2 9 5 5
3 │ 3 9 6 5
4 │ 4 9 5 5
5 │ 5 9 5 5
6 │ 6 9 5 5
7 │ 7 9 5 6
⋮ │ ⋮ ⋮ ⋮ ⋮
995 │ 995 9 5 5
996 │ 996 9 5 8
997 │ 997 9 5 5
998 │ 998 9 5 6
999 │ 999 9 5 5
1000 │ 1000 9 5 7
987 rows omitted
julia> freqtable(best, :mse_whole_argmin)
1-element Named Vector{Int64}
mse_whole_argmin │
──────────────────┼─────
9 │ 1000
julia> freqtable(best, :mse_cv_argmin, :mse_t_argmin)
5×5 Named Matrix{Int64}
mse_cv_argmin ╲ mse_t_argmin │ 5 6 7 8 9
─────────────────────────────┼────────────────────────
5 │ 480 134 49 24 8
6 │ 122 2 0 1 0
7 │ 64 15 0 0 0
8 │ 42 3 2 0 0
9 │ 41 7 4 2 0
For mse_whole the full model is always selected - not something that we would want, but this is expected.
mse_cv selects the same model as mse_t in a bit less than 50% of cases. We can also see that in general it has a slight tendency to pick models with too many variables. However, it should be noted that also under mse_t it would sometimes be better to pick a more complex model than the true one, given the concrete sampled training data.
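Both observations can be checked directly from the best data frame; a minimal sketch using only objects defined above:

# fraction of experiments in which cross validation picks the same model as mse_t
mean(best.mse_cv_argmin .== best.mse_t_argmin)

# distribution of the number of variables selected by cross validation
freqtable(best, :mse_cv_argmin)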
Finally, let us compare the relative MSE loss of using the full model and of using the model picked by cross validation:
julia> loss = combine(groupby(res, :id)) do sdf
(loss_whole = sdf.mse_t[argmin(sdf.mse_whole)] / minimum(sdf.mse_t) - 1,
loss_cv = sdf.mse_t[argmin(sdf.mse_cv)] / minimum(sdf.mse_t) - 1)
end
1000×3 DataFrame
Row │ id loss_whole loss_cv
│ Int64 Float64 Float64
──────┼────────────────────────────────
1 │ 1 0.246816 0.0
2 │ 2 0.0582093 0.0
3 │ 3 0.068783 0.0510148
4 │ 4 0.00661844 0.0
5 │ 5 0.00934061 0.0
⋮ │ ⋮ ⋮ ⋮
997 │ 997 0.0243234 0.0
998 │ 998 0.0965239 0.00218267
999 │ 999 0.0966297 0.0
1000 │ 1000 0.0768627 0.0120936
991 rows omitted
julia> describe(loss, :all, cols=r"loss")
2×13 DataFrame
Row │ variable mean std min q25 median q75 max nunique nmissing first last eltype
│ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64 Nothing Int64 Float64 Float64 DataType
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ loss_whole 0.10089 0.0835492 0.0 0.0403303 0.0795957 0.13795 0.550086 0 0.246816 0.0768627 Float64
2 │ loss_cv 0.0416662 0.0827592 0.0 0.0 0.00010347 0.047352 0.550086 0 0.0 0.0120936 Float64
We can see that indeed using cross validation gives us a significantly lower loss in comparison to using all the available features.
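As a more direct check of this claim, one can also look at how often the model picked by cross validation is at least as good as the full model, and at the average gap between them; a minimal sketch reusing the loss data frame:

# fraction of experiments in which the CV-selected model is at least as good as the full model
mean(loss.loss_cv .<= loss.loss_whole)

# average reduction in relative loss from using cross validation instead of the full model
mean(loss.loss_whole .- loss.loss_cv)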
Conclusions
In conclusion we can say that cross validation worked relatively well for model selection in our case (given that we were working with a small sample).
It is also interesting to note that, for a concrete set of observations, the optimal set of variables to pick for linear model estimation was not always the set of variables that we knew truly took part in the data generation process. Sometimes, given the estimation procedure used, it would be better to pick a different set of variables.