Introduction

After my last post I got a comment that I should concentrate more on statistical conclusions than on how to implement things. Therefore, today I continue that topic and show how cross validation works for model selection.

In this post I use Julia 1.6.3, DataFrames.jl 1.2.2, FreqTables.jl 0.4.5, and GLM.jl 1.5.1.

Code for experiment setup

We will use the same settings as in my last post.

As before we will collect three mean squared errors:

  • mse_whole: calculated on the whole training data set;
  • mse_cv: calculated using 10-fold cross validation;
  • mse_t: the expected squared prediction error on new data, computed analytically from the model coefficients.

The difference is that this time we will make the following changes:

  • we will have nine features x1, x2, …, x9, but only the first five of them will affect the target with weight 1; the remaining four will be unimportant (so the data is generated as y = 1 + x1 + x2 + x3 + x4 + x5 + ε, where ε is standard normal noise);
  • I will test nine nested models, incrementally adding features x1, x2, up to x9 to the model.
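
As a side note, the nested formulas will be built programmatically from Term objects rather than with the @formula macro (Term comes from StatsModels.jl, which is loaded by GLM.jl). Here is a small sketch of this trick; the name f3 is used only for illustration:

using GLM

# summing Term objects builds the same formula as @formula(y ~ x1 + x2 + x3)
f3 = Term(:y) ~ sum([Term(Symbol(:x, i)) for i in 1:3])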

We will be interested in seeing how well cross validation works for model selection.

Here is the code that sets up the experiments (except for the changes described above, it is identical to what we discussed in my last post):

using DataFrames
using FreqTables
using GLM
using Random
using Statistics

# mean squared error of model predictions on the data stored in df
mse(model, df) = sum(x->x^2, predict(model, df) - df.y) / nrow(df)

# expected squared prediction error of a fitted model on new data:
# noise variance (1) plus the squared estimation errors of all coefficients
# (the true coefficients are 1 for the intercept and x1-x5, and 0 for x6-x9)
function mset(model)
    c = zeros(10)
    c[1:length(coef(model))] .= coef(model)
    ref = [ones(6); zeros(4)]
    return 1 + sum((c .- ref) .^ 2)
end

get_fold_data(df, fold) =
    (train=view(df, df.fold .!= fold, :),
     test=view(df, df.fold .== fold, :))

function runtest(id)
    # 50 observations of 9 standard normal features; y depends only on x1-x5,
    # with intercept 1 and standard normal noise
    df = DataFrame(randn(50, 9), :auto)
    df.y = sum(eachcol(df)[1:5]) .+ 1.0 .+ randn(50)
    # nine nested formulas: y ~ x1, y ~ x1 + x2, ..., y ~ x1 + ... + x9
    formulas = [Term(:y) ~ sum([Term(Symbol(:x, i)) for i in 1:n]) for n in 1:9]
    models = [lm(f, df) for f in formulas]

    mse_wholes = [mse(m, df) for m in models]
    mse_ts = [mset(m) for m in models]

    # randomly assign each observation to one of 10 folds (5 observations per fold)
    df.fold = shuffle!((1:nrow(df)) .% 10)
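    # 10-fold cross validation: for each formula fit the model on nine folds,
    # compute the MSE on the held-out fold, and average over the ten folds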
    mse_cvs = map(formulas) do f
        return mean(0:9) do fold
            train, test = get_fold_data(df, fold)
            model_cv = lm(f, train)
            return mse(model_cv, test)
        end
    end

    return DataFrame(id=id, vars=1:9, mse_whole=mse_wholes, mse_cv=mse_cvs, mse_t=mse_ts)
end
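
As a side check (not part of the experiment below), if you want to convince yourself that mset indeed returns the expected squared prediction error of a model - the noise variance 1 plus the squared estimation errors of all coefficients, which holds because the features are independent standard normal - you can compare it against the empirical MSE on a large fresh sample, using the definitions above. This is only a rough sketch; the variable names and the test sample size are arbitrary:

test_n = 100_000
train = DataFrame(randn(50, 9), :auto)
train.y = train.x1 .+ train.x2 .+ train.x3 .+ train.x4 .+ train.x5 .+ 1.0 .+ randn(50)
test = DataFrame(randn(test_n, 9), :auto)
test.y = test.x1 .+ test.x2 .+ test.x3 .+ test.x4 .+ test.x5 .+ 1.0 .+ randn(test_n)
model = lm(Term(:y) ~ sum([Term(Symbol(:x, i)) for i in 1:9]), train)
# the two numbers below should be close to each other
(empirical=mse(model, test), analytical=mset(model))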

The experiment

Knowing the data generation process, we know that the true model uses five explanatory variables.

Let us now run 1000 independent repetitions of our experiment and analyze the results:

julia> Random.seed!(12);

julia> res = reduce(vcat, [runtest(i) for i in 1:1000])
9000×5 DataFrame
  Row │ id     vars   mse_whole  mse_cv   mse_t
      │ Int64  Int64  Float64    Float64  Float64
──────┼───────────────────────────────────────────
    1 │     1      1   5.44188   5.84075  5.15904
    2 │     1      2   2.8385    3.16917  4.4904
    3 │     1      3   2.29577   2.71023  3.48133
    4 │     1      4   1.71476   2.05224  2.31772
    5 │     1      5   0.859515  1.04158  1.14846
    6 │     1      6   0.853335  1.11315  1.17749
    7 │     1      7   0.845039  1.12457  1.18426
  ⋮   │   ⋮      ⋮        ⋮         ⋮        ⋮
 8995 │  1000      4   1.41656   1.64706  2.42478
 8996 │  1000      5   0.886208  1.11717  1.2702
 8997 │  1000      6   0.885057  1.18246  1.26373
 8998 │  1000      7   0.884199  1.26832  1.25502
 8999 │  1000      8   0.849425  1.24977  1.33918
 9000 │  1000      9   0.829926  1.24833  1.35149
                                 8987 rows omitted

First we check what the mean MSE is across the experiments (this is not a super useful statistic, but let us start with it):

julia> combine(groupby(res, :vars), names(res, r"mse") .=> mean)
9×4 DataFrame
 Row │ vars   mse_whole_mean  mse_cv_mean  mse_t_mean
     │ Int64  Float64         Float64      Float64
─────┼────────────────────────────────────────────────
   1 │     1        4.81822       5.25666     5.21646
   2 │     2        3.7604        4.28551     4.2631
   3 │     3        2.74739       3.27702     3.26673
   4 │     4        1.79483       2.24701     2.23122
   5 │     5        0.872812      1.1506      1.13955
   6 │     6        0.853597      1.18245     1.16617
   7 │     7        0.834749      1.21654     1.19383
   8 │     8        0.815655      1.25266     1.2222
   9 │     9        0.796079      1.2905      1.25287

We can see that on average mse_whole decreases with the number of variables, while mse_cv and mse_t have a minimum at five variables. Also, mse_cv is slightly biased upwards in comparison to mse_t, while mse_whole is significantly biased downwards. All of this was expected.
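
If you want to quantify these biases rather than eyeball them, one possible extra check (not present in the rest of the post; the column names cv_bias and whole_bias are only illustrative) is to average the per-experiment differences:

combine(groupby(res, :vars),
        [:mse_cv, :mse_t] => ((cv, t) -> mean(cv .- t)) => :cv_bias,
        [:mse_whole, :mse_t] => ((w, t) -> mean(w .- t)) => :whole_bias)

By linearity, these values are the same as the differences of the respective column means shown in the table above.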

However, we are interested in model selection. So let us check which models would get selected by the three criteria:

julia> best = combine(groupby(res, :id), names(res, r"mse") .=> argmin)
1000×4 DataFrame
  Row │ id     mse_whole_argmin  mse_cv_argmin  mse_t_argmin
      │ Int64  Int64             Int64          Int64
──────┼──────────────────────────────────────────────────────
    1 │     1                 9              5             5
    2 │     2                 9              5             5
    3 │     3                 9              6             5
    4 │     4                 9              5             5
    5 │     5                 9              5             5
    6 │     6                 9              5             5
    7 │     7                 9              5             6
  ⋮   │   ⋮           ⋮                ⋮             ⋮
  995 │   995                 9              5             5
  996 │   996                 9              5             8
  997 │   997                 9              5             5
  998 │   998                 9              5             6
  999 │   999                 9              5             5
 1000 │  1000                 9              5             7
                                             987 rows omitted

julia> freqtable(best, :mse_whole_argmin)
1-element Named Vector{Int64}
mse_whole_argmin  │
──────────────────┼─────
9                 │ 1000

julia> freqtable(best, :mse_cv_argmin, :mse_t_argmin)
5×5 Named Matrix{Int64}
mse_cv_argmin ╲ mse_t_argmin │   5    6    7    8    9
─────────────────────────────┼────────────────────────
5                            │ 480  134   49   24    8
6                            │ 122    2    0    1    0
7                            │  64   15    0    0    0
8                            │  42    3    2    0    0
9                            │  41    7    4    2    0

For mse_whole the full model is always selected - not something that we would want, but this is expected.

For mse_cv, the model that would be selected by mse_t is picked in a bit less than 50% of the cases. We can also see that in general cross validation has a slight tendency to pick models with too many variables. However, it should be noted that also under mse_t it is sometimes better to pick a model with more than five variables, given the concrete training data that was sampled.
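
The exact agreement rate can be read off the diagonal of the cross tabulation above or computed directly from the best data frame; a small sketch:

# the diagonal of the table above: (480 + 2) / 1000 = 0.482
mean(best.mse_cv_argmin .== best.mse_t_argmin)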

Finally, let us compare the relative mse_t loss (relative to the best model under mse_t) incurred by using the model selected by mse_whole, i.e. the full model, and by using the model selected by cross validation:

julia> loss = combine(groupby(res, :id)) do sdf
           (loss_whole = sdf.mse_t[argmin(sdf.mse_whole)] / minimum(sdf.mse_t) - 1,
            loss_cv = sdf.mse_t[argmin(sdf.mse_cv)] / minimum(sdf.mse_t) - 1)
       end
1000×3 DataFrame
  Row │ id     loss_whole  loss_cv
      │ Int64  Float64     Float64
──────┼────────────────────────────────
    1 │     1  0.246816    0.0
    2 │     2  0.0582093   0.0
    3 │     3  0.068783    0.0510148
    4 │     4  0.00661844  0.0
    5 │     5  0.00934061  0.0
  ⋮   │   ⋮        ⋮            ⋮
  997 │   997  0.0243234   0.0
  998 │   998  0.0965239   0.00218267
  999 │   999  0.0966297   0.0
 1000 │  1000  0.0768627   0.0120936
                       991 rows omitted

julia> describe(loss, :all, cols=r"loss")
2×13 DataFrame
 Row │ variable    mean       std        min      q25        median      q75       max       nunique  nmissing  first     last       eltype
     │ Symbol      Float64    Float64    Float64  Float64    Float64     Float64   Float64   Nothing  Int64     Float64   Float64    DataType
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ loss_whole  0.10089    0.0835492      0.0  0.0403303  0.0795957   0.13795   0.550086                  0  0.246816  0.0768627  Float64
   2 │ loss_cv     0.0416662  0.0827592      0.0  0.0        0.00010347  0.047352  0.550086                  0  0.0       0.0120936  Float64

We can see that indeed using cross validation gives us a significantly lower loss in comparison to using all the available features.
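
To push this comparison a bit further (an extra check, not part of the analysis above), we can also look at how often the model picked by cross validation is strictly better than, tied with, or worse than the full model in terms of mse_t:

(cv_better = mean(loss.loss_cv .< loss.loss_whole),
 tied = mean(loss.loss_cv .== loss.loss_whole),
 cv_worse = mean(loss.loss_cv .> loss.loss_whole))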

Conclusions

In conclusion we can say that cross validation worked relatively well for model selection in our case (given that we were working with a small sample).

It is also interesting to note that, for concrete observed data, the optimal set of variables to pick for linear model estimation was not always the set of variables that we knew truly took part in the data generation process. Sometimes, given the estimation procedure used, it would be better to pick a different set of variables.