# DataFrames.jl training: implementing cross validation

# Introduction

Today I decided to show how cross validation can be implemented using the functions provided by DataFrames.jl. The objective is to discuss common patterns used when working with data frames.

In this the post I use Julia 1.6.3, DataFrames.jl 1.2.2, and GLM.jl 1.5.1.

# Preparing the data and initial analysis

Let us start by creating some synthetic data set we will work with.

```
julia> using DataFrames
julia> using GLM
julia> using Random
julia> using Statistics
julia> Random.seed!(1234);
julia> df = DataFrame(randn(50, 9), :auto);
julia> df.y = sum(eachcol(df)) .+ 1.0 .+ randn(50);
julia> df
50×10 DataFrame
Row │ x1 x2 x3 x4 x5 x6 x7 x8 x9 y
│ Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0.867347 -1.22672 0.183976 -1.87215 -0.205782 0.295222 0.183203 0.358659 -0.833507 -2.20367
2 │ -0.901744 -0.541716 -1.27635 -0.668331 -1.22338 -0.338215 0.975083 0.488578 0.0827196 -3.31337
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
49 │ -1.00978 -1.66323 0.797165 -0.644069 0.127747 0.742054 0.196089 -1.05575 0.771387 -2.364
50 │ -0.543805 -0.521229 0.103145 -1.37931 0.143105 -0.0665944 -2.34887 -1.37548 0.590872 -4.76852
46 rows omitted
```

As you can see we have created data so that the target variable `:y`

is a sum
of features `:x1`

to `:x9`

and an intercept equal to `1`

. The error term in the
model has a normal distribution with mean `0`

and variance `1`

. Similarly all
features also have a normal distribution with mean `0`

and variance `1`

.

Now we create a linear model on the whole data set to have a baseline:

```
julia> formula = Term(:y) ~ sum([Term(Symbol(:x, i)) for i in 1:9])
FormulaTerm
Response:
y(unknown)
Predictors:
x1(unknown)
x2(unknown)
x3(unknown)
x4(unknown)
x5(unknown)
x6(unknown)
x7(unknown)
x8(unknown)
x9(unknown)
julia> model = lm(formula, df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}
y ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
Coefficients:
────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────
(Intercept) 1.01641 0.140685 7.22 <1e-08 0.732078 1.30075
x1 0.879802 0.141336 6.22 <1e-06 0.594151 1.16545
x2 1.14157 0.160897 7.10 <1e-07 0.81639 1.46676
x3 0.820527 0.137644 5.96 <1e-06 0.542338 1.09872
x4 1.28829 0.125645 10.25 <1e-12 1.03435 1.54223
x5 1.02118 0.156195 6.54 <1e-07 0.705503 1.33687
x6 0.844008 0.146142 5.78 <1e-06 0.548643 1.13937
x7 0.70256 0.15646 4.49 <1e-04 0.386343 1.01878
x8 0.925448 0.15797 5.86 <1e-06 0.60618 1.24472
x9 0.889749 0.167878 5.30 <1e-05 0.550456 1.22904
────────────────────────────────────────────────────────────────────────
```

We calculate mean square error of the model on the training data set:

```
julia> deviance(model) / nrow(df)
0.6689313350544223
julia> mse(model, df) = sum(x->x^2, predict(model, df) - df.y) / nrow(df);
julia> mse(model, df)
0.6689313350544223
```

It seems (and indeed is) a bit low given we know that the standard deviation of
the error term in our data generation process was `1`

. We will come back to
it in a second. However, first let us perfom 10-fold cross validation.

# Doing cross validation

We start with generating a column holding the fold number for each row.
In the example we assume we do a 10-fold cross validation and will number folds
by integers starting with `0`

and ending with `9`

. We need to make sure that
folds should be assigned to rows in a random order so we need to `shuffle!`

the initially generated fold numbers:

```
julia> df.fold = shuffle!((1:nrow(df)) .% 10)
50-element Vector{Int64}:
4
6
5
7
⋮
8
6
7
```

Now we can quite easily extract out training and test data sets for each individual fold using the following function:

```
get_fold_data(df, fold) =
(train=view(df, df.fold .!= fold, :),
test=view(df, df.fold .== fold, :))
```

Note that in this code we use `view`

s to avoid excessive copying of data.

Now we are ready to calculate the cross validation mean squared error (in the code we take advantage of the fact that each fold has exactly the same size in our example):

```
julia> mean(0:9) do fold
train, test = get_fold_data(df, fold)
model_cv = lm(formula, train)
return mse(model_cv, test)
end
0.9500293257502437
```

Indeed it is higher than the MSE we have calculated above as expected.

Let us calculate the expected square error for the new observation fed to our model (I am using here an analytic formula that is suitable for our specific way of input data generation):

```
julia> mset(model) = 1 + sum(x -> (1 - x) ^ 2, coef(model));
julia> mset(model)
1.2810475573651952
```

We can see it is greater than `1`

which manes sense as this error is a sum of
two errors: random term in our data generation process plus the estimation error
of parameters of our model.

If you are not convinced that the proper formula is used to calculate this expectation then it is easy to check it using simulation:

```
julia> dfn = DataFrame(randn(10^6, 9), :auto);
julia> dfn.y = sum(eachcol(dfn)) .+ 1.0 .+ randn(10^6);
julia> mse(model, dfn)
1.2817533442813103
```

And we get a roughly similar result.

It is crucial to note that we can calculate `mse_t`

here accurately because we
know the data generation process. In practice we would not know it but would
want get some measure that would as accurately as possible order the competing
models in terms of their true, unobserved `mse_t`

.

Now you might wonder if the cross validation mean squared error should be
expected to be lower than the actual prediction expected squared error as in
our single run and if it is a better predictor of `mse_t`

than mean squared
error calculated on a training data set. We check this in the next section.

# Testing the procedure using simulation

Let us wrap our calculations in the function. We will collect three mean squared errors:

`mse_whole`

: calculated on training data set;`mse_cv`

: calculated using cross validation;`mse_t`

: expected prediction squared error.

Here is the function definition:

```
function runtest()
df = DataFrame(randn(50, 9), :auto)
df.y = sum(eachcol(df)) .+ 1.0 .+ randn(50)
model = lm(formula, df)
mse_whole = mse(model, df)
mse_t = mset(model)
df.fold = shuffle!((1:nrow(df)) .% 10)
mse_cv = mean(0:9) do fold
train, test = get_fold_data(df, fold)
model_cv = lm(formula, train)
return mse(model_cv, test)
end
return (mse_whole=mse_whole, mse_cv=mse_cv, mse_t=mse_t)
end
```

Let us now run 10,000 independent repetitions of our experiment and analyze the results:

```
julia> Random.seed!(12);
julia> res = DataFrame([runtest() for _ in 1:10_000])
10000×3 DataFrame
Row │ mse_whole mse_cv mse_t
│ Float64 Float64 Float64
───────┼──────────────────────────────
1 │ 0.767821 1.20796 1.43191
2 │ 0.595431 1.11465 1.2233
3 │ 0.801025 1.41942 1.25682
⋮ │ ⋮ ⋮ ⋮
9998 │ 0.589998 0.930069 1.24527
9999 │ 0.444815 0.685732 1.36184
10000 │ 1.05747 1.68496 1.23118
9994 rows omitted
julia> describe(res, :all)
3×13 DataFrame
Row │ variable mean std min q25 median q75 max nunique nmissing first last eltype
│ Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64 Nothing Int64 Float64 Float64 DataType
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ mse_whole 0.799776 0.177712 0.286417 0.672772 0.785994 0.912926 1.71692 0 0.767821 1.05747 Float64
2 │ mse_cv 1.29295 0.305371 0.419517 1.07337 1.26767 1.48104 3.0517 0 1.20796 1.68496 Float64
3 │ mse_t 1.25542 0.132023 1.01737 1.16157 1.22982 1.32126 2.07322 0 1.43191 1.23118 Float64
julia> cor(Matrix(res))
3×3 Matrix{Float64}:
1.0 0.947 -0.00555373
0.947 1.0 -0.00844714
-0.00555373 -0.00844714 1.0
```

What have we learned from our specific experiment setup:

`mse_whole`

is significantly biased downwards;`mse_cv`

is slightly biased upwards on the average when multiple experiments are considered, but its variance is much higher than the one of`mse_t`

;`mse_cv`

and`mse_whole`

are significantly correlated and they both do not exhibit correlation with`mse_t`

.

# Conclusions

I hope the examples I gave of using DataFrames.jl in the context of some typical data science workflow will be useful. A next step - that I leave as an exercise - would be to e.g. check how model selection using cross validation works.