W#11: Final Project, Collaborative Git, Bootstrapping, Cross validation, Bias-Variance Tradeoff

Jan Lorenz

Final Project

A project report in a nutshell

  • You pick a dataset,
  • do some interesting question-driven data analysis with it,
  • write up a well structured and nicely formatted report about the analysis, and
  • present it at the end of the semester.

Six types of questions

  1. Descriptive: summarize a characteristic of data
  2. Exploratory: analyze to see if there are patterns, trends, or relationships between variables (hypothesis generating)
  3. Inferential: analyze patterns, trends, or relationships in representative data from a population
  4. Predictive: make predictions for individuals or groups of individuals
  5. Causal: whether changing one factor will change another factor, on average, in a population
  6. Mechanistic: explore “how” as opposed to whether

What are good questions?

A good question

  • is not too broad (= complicated to answer)
  • is not too specific (= trivial to answer)
  • can be answered or at least approached with data
  • calls for one of the different types of data analysis:
    • descriptive
    • exploratory
    • inferential
    • predictive
    • (causal or mechanistic – not typical choices for a short-term data analysis project. They should not be central but may be touched upon.)

Tools and advice for questions

Descriptive

  • What is an observation? Answer this in any data analysis!
  • What are the variables?
  • Summary statistics and basic visualizations
  • Category frequencies and numerical distributions

Exploratory

  • Questions specific to the topic
  • Compute/summarize relevant variables: percentages, ratios, differences, …
  • Scatter plots
  • Correlations
  • Cluster Analysis
  • Crafted computations and visualizations for your particular questions

Tools and advice for questions

Inferential

  • Describe the population the data represents
  • Can it be considered a random sample?
  • Think theoretically about variable selection and data transformations
  • Interpret coefficients of models
  • Be careful interpreting coefficients when variables are highly correlated
  • Are effects statistically significant?
  • Hypothesis test: Describe null distribution

Predictive

  • Regression or classification?
  • Do a train/test split to evaluate model performance (optionally, do cross-validation)
  • Select sensible performance metrics for the problem (from confusion matrix or regression metrics)
  • Maybe compare different models and tune hyperparameters
  • Do feature selection and feature engineering to improve the performance!

Modeling Mindsets

Some methods:

  • Linear Model: Inferential and Predictive!
  • Logistic Regression: Inferential and Predictive!
  • k-means Clustering: Exploratory

Let’s check your questions in the Course Organization Repo!

Collaborative Work with Git

Learning goal: First experiences with collaborative data science work with git

Step 1: git clone project-Y-USERNAMES

Team formation is mostly complete. Repositories project-1, FinalProject_2, … are created. You are to deliver your project reports in your repository.

  • Find your project repository
  • Copy the URL
  • Go to RStudio
    • New Project > From Version Control > Git > Paste your URL
    • The project is created

Step 2: First Team member commits and pushes

One team member does the following:

  • Open the file report.qmd
  • Write your name at author: in the YAML header (the first lines of the file)
  • git add the file report.qmd
  • git commit (enter “Added name” as commit message)
  • git push

Step 3: Other team members pull

  • Do git pull in the RStudio interface.

What does git pull do?

  • git pull does two things: fetching new stuff and merging it with existing stuff
    1. First it git fetches the new commits from the remote repository (GitHub) to the local machine.
    2. Then it git merges them with the latest commit on your local machine.
  • When we are lucky this works with no problems. (Should be the case with new files.)

Step 4: Merge independent changes

Git can merge changes in the same file when there are no conflicts. Let’s try.

  • The second team member:
    • Add your name in the author section of the YAML.
    • Save, add the file in the Git pane, commit with message “Next author name”, push.
  • The third (or first) team member:
    • Change the title in the YAML to something meaningful (and also add your name if you are the third team member), save the file, add the file in the Git pane, and make a commit.
    • Try to push. You should receive an error. Read it carefully; it often tells you what to do. Here: do git pull first. You cannot push because there is a newer commit on the remote (the one your colleague just made).
    • Pull. This should result in a message about a successful auto-merge. Check that both lines are there: yours and your colleague’s. If you receive several hints instead, first read the next slide!

Git configuration for divergent branches

If you pull for the first time in a local git repository, git may complain like this:

Read it carefully. It advises you to configure a default behavior, for example with git config pull.rebase false.

How to do the configuration?

  • Copy the line git config pull.rebase false and close the window.
  • Go to the Terminal pane (not the Console, the one beside it). This is a terminal not for R but for talking to the computer in general. Paste the command and press Enter. Now you are done, and your next git pull should work.

Step 5: Push and pull the other way round

  • The third/first member:
    • The successful merge creates a new commit, which you can directly push.
    • Push.
  • The second team member (and all others):
    • Pull the changes of your colleague.

Practice a bit more pulling and pushing commits and check the merging.

Step 6: Create a merge conflict

  • First and second team members:
    • Each of you writes a different sentence after “Executive Summary.” in the YAML abstract: field.
    • Each git add and git commit on local machines.
  • First member: git push
  • Second member:
    • git pull. That should result in a conflict. If you receive several hints instead, first read the slide two slides before!
    • The conflict should show directly in the file with markings like this

<<<<<<<
one option of the text,
======= (a separator),
the other option of the text, and
>>>>>>>.

Step 7: Solve the conflict

  • The second member:
    • You have to solve this conflict now!
    • Solve it by editing the text.
    • Decide on one of the options or write a new text.
    • In the process, remove the conflict markers <<<<<<<, =======, >>>>>>>.
    • When you are done: git add, git commit, and git push.

Now you know how to solve merge conflicts. Practice a bit in your team.

Working in VSCode: The workflow is the same because it relies on git, not on the editor of choice.

Advice: Collaborative work with git

  • Whenever you start a work session: First pull to see if there is anything new. That way you reduce the need for merges.
  • Inform your colleagues when you pushed new commits.
  • Coordinate the work, e.g. discuss who works on what part and when. However, git also allows you to work in parallel without full coordination.
  • When you finish your work session, end by pushing a nice commit. That means: the file should render, and you have left comments where there are loose ends and to-dos.
  • You can also use the issues section of the GitHub repository for things to do.
  • When you work on different parts of the file, be aware that even a successful merge can create problems. Example: your colleague changed the data import while you worked on graphics. Maybe after the merge the imported data is not what you need for your chunk. Then coordinate.
  • Commit and push often. This keeps potential merge conflicts small.

Bootstrapping

Purpose: Quantify uncertainty

  • Uncertainty is a central concern in inferential data analysis.
  • We want to estimate a parameter of a population from a sample.

Central inferential assumption:

  • Our data is a random sample from the population we are interested in.
    • No selection biases.

Examples:

  • We want to know the average height of all people, but we only have a sample of people.
  • We want to know the mean estimate of all possible participants of the ox-meat weight-guessing competition from the sample of ballots we have.
  • We want to learn something about penguins from the data of the penguins in palmerpenguins::penguins

Practical example: Galton’s Data

Competition: Guess the weight of the meat of an ox (Galton, 1907)

What is the mean of all possible participants \(\mu\)?

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
galton <- read_csv("data/galton.csv")
Rows: 787 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): Estimate, id

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
galton |> ggplot(aes(Estimate)) + geom_histogram(binwidth = 5)

The (arithmetic) mean of the 787 estimates is \(\hat\mu =\) 1196.71.

How sure can we be that \(\hat\mu\) is close to \(\mu\), and how can we quantify this?

Practical example: Palmer Penguins

A linear model for body mass of penguins: \(y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_m x_{im} + \varepsilon_i\).

library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
✔ broom        1.0.5     ✔ rsample      1.2.1
✔ dials        1.3.0     ✔ tune         1.2.1
✔ infer        1.0.7     ✔ workflows    1.1.4
✔ modeldata    1.3.0     ✔ workflowsets 1.1.0
✔ parsnip      1.2.1     ✔ yardstick    1.3.1
✔ recipes      1.1.0     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Search for functions across packages at https://www.tidymodels.org/find/
library(palmerpenguins)

Attaching package: 'palmerpenguins'
The following object is masked from 'package:modeldata':

    penguins
peng_workflow <- workflow() |> add_recipe(recipe(body_mass_g ~ ., data = penguins) |> step_rm(year)) |> 
  add_model(linear_reg() |> set_engine("lm"))
peng_fit <- peng_workflow |> fit(data = penguins)
peng_fit |> tidy() |> mutate_if(is.numeric, round, 7)
# A tibble: 9 × 5
  term              estimate std.error statistic   p.value
  <chr>                <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)        -1500.     576.      -2.61  0.00961  
2 speciesChinstrap    -260.      88.6     -2.94  0.00352  
3 speciesGentoo        988.     137.       7.20  0        
4 islandDream          -13.1     58.5     -0.224 0.823    
5 islandTorgersen      -48.1     60.9     -0.789 0.431    
6 bill_length_mm        18.2      7.14     2.55  0.0113   
7 bill_depth_mm         67.6     19.8      3.41  0.000734 
8 flipper_length_mm     16.2      2.94     5.52  0.0000001
9 sexmale              387.      48.1      8.04  0        

The column estimate shows the coefficients \(\hat\beta_0, \hat\beta_1, \dots, \hat\beta_m\) of the linear model for the variables named in term.

How sure can we be that the \(\hat\beta\)’s are close to the \(\beta\)’s we are interested in?

Bootstrapping idea

  1. Take a bootstrap sample: a random sample taken with replacement from the original sample with the same size.
  2. Calculate the statistics for the bootstrap sample.
  3. Repeat steps 1. and 2. many times to create a bootstrap distribution - a distribution of bootstrap statistics
  4. Use the bootstrap distribution to quantify the uncertainty.

What is the statistic in the Galton example? The mean
What are the statistics in the penguins example? The coefficients

Bootstrapping in practice

  • You can do the bootstrapping procedure in various ways, including programming it yourself in base R or Python (see the sketch below).
  • We will use convenient functions from rsample from the tidymodels packages.
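For illustration, here is a minimal base-R sketch of such a hand-rolled bootstrap of the mean. It assumes the galton data loaded above; the seed, the number of resamples, and the object names are arbitrary, illustrative choices.

set.seed(7)
# Draw 1000 bootstrap samples of the estimates and collect their means
boot_means <- replicate(
  1000,
  mean(sample(galton$Estimate, size = nrow(galton), replace = TRUE))
)
sd(boot_means)                          # bootstrap standard error of the mean
quantile(boot_means, c(0.025, 0.975))   # 95% bootstrap confidence interval

The rsample functions below do the same resampling but keep the bootstrap samples in a tidy data frame.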

Resamplings: Galton’s Data

The function sample selects size random elements from a vector x, with or without replacement (replace).

set.seed(224)
sample(galton$Estimate, size = 3, replace = TRUE)
[1] 1177 1232 1242
sample(galton$Estimate, size = 3, replace = TRUE)
[1] 1270 1440 1232

For tibbles we can use slice_sample.

galton |> slice_sample(n = 3, replace = TRUE)
# A tibble: 3 × 2
  Estimate    id
     <dbl> <dbl>
1     1189   320
2     1203   367
3     1181   278

A bootstrap sample is drawn with replacement (replace = TRUE) and has the same size as the original sample:

nrow(galton)
[1] 787
galton |> slice_sample(n = 787, replace = TRUE)
# A tibble: 787 × 2
   Estimate    id
      <dbl> <dbl>
 1     1347   774
 2     1221   489
 3     1112    85
 4     1320   766
 5     1266   697
 6     1145   149
 7     1271   711
 8     1215   442
 9     1238   596
10     1163   198
# ℹ 777 more rows

Do some mean bootstrapping manually:

galton |> slice_sample(n = 787, replace = TRUE) |> 
 summarize(mean(Estimate))
# A tibble: 1 × 1
  `mean(Estimate)`
             <dbl>
1            1194.
galton |> slice_sample(n = 787, replace = TRUE) |> 
 summarize(mean(Estimate))
# A tibble: 1 × 1
  `mean(Estimate)`
             <dbl>
1            1197.
galton |> slice_sample(n = 787, replace = TRUE) |> 
 summarize(mean(Estimate))
# A tibble: 1 × 1
  `mean(Estimate)`
             <dbl>
1            1193.

Bootstrap set: Galton’s Data

The function bootstraps from rsample does this type of bootstrapping for us.

1,000 resamples:

boots_galton <- galton |> bootstraps(times = 1000)
boots_galton
# Bootstrap sampling 
# A tibble: 1,000 × 2
   splits            id           
   <list>            <chr>        
 1 <split [787/295]> Bootstrap0001
 2 <split [787/293]> Bootstrap0002
 3 <split [787/296]> Bootstrap0003
 4 <split [787/303]> Bootstrap0004
 5 <split [787/283]> Bootstrap0005
 6 <split [787/295]> Bootstrap0006
 7 <split [787/295]> Bootstrap0007
 8 <split [787/285]> Bootstrap0008
 9 <split [787/291]> Bootstrap0009
10 <split [787/275]> Bootstrap0010
# ℹ 990 more rows

The column splits contains a list of split objects.

boots_galton$splits[[1]]
<Analysis/Assess/Total>
<787/295/787>

From each split we can extract the bootstrapped resample with analysis.

boots_galton$splits[[1]] |> analysis()
# A tibble: 787 × 2
   Estimate    id
      <dbl> <dbl>
 1     1160   192
 2     1260   678
 3     1147   153
 4     1221   484
 5     1204   373
 6     1165   211
 7     1112    85
 8     1415   780
 9     1081    43
10     1168   220
# ℹ 777 more rows

(The assessment set contains the observations that did not end up in the resample. We do not use it in the following.)

Bootstrap distribution: Galton’s Data

We iterate with a map function over the splits, extract the analysis set, and calculate the mean.

boots_galton_mean <- boots_galton |> 
  mutate(
   mean = map_dbl(splits, 
                  \(x) analysis(x)$Estimate |> mean()))
boots_galton_mean
# Bootstrap sampling 
# A tibble: 1,000 × 3
   splits            id             mean
   <list>            <chr>         <dbl>
 1 <split [787/295]> Bootstrap0001 1194.
 2 <split [787/293]> Bootstrap0002 1196.
 3 <split [787/296]> Bootstrap0003 1199.
 4 <split [787/303]> Bootstrap0004 1196.
 5 <split [787/283]> Bootstrap0005 1194.
 6 <split [787/295]> Bootstrap0006 1197.
 7 <split [787/295]> Bootstrap0007 1195.
 8 <split [787/285]> Bootstrap0008 1194.
 9 <split [787/291]> Bootstrap0009 1197.
10 <split [787/275]> Bootstrap0010 1193.
# ℹ 990 more rows

Plot the bootstrap distribution of the mean.

boots_galton_mean |> 
 ggplot(aes(x = mean)) + geom_histogram() +
 theme_minimal(base_size = 24)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Compare distribution of estimates

galton |> ggplot(aes(Estimate)) + geom_histogram(binwidth = 5) +
 theme_minimal(base_size = 24)

Quantify uncertainty? Standard error

Compute the standard deviation of the bootstrap distribution.

sd(boots_galton_mean$mean)
[1] 2.732243

The standard deviation of the bootstrap distribution is called the standard error of the statistic.

Do not confuse standard error (of a statistic) and standard deviation (of a variable).

The standard error of the mean can also be estimated directly from the original sample \(x\) as \(\frac{\text{sd}(x)}{\sqrt{n}}\) where \(n\) is the sample size.

sd(galton$Estimate) / sqrt(nrow(galton))
[1] 2.623085

Insight: The standard error decreases with sample size! However, it decreases only with the square root of the sample size.

Question: You want to shrink the standard error by half. How much larger does the sample size need to be? 4 times
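A quick check of the arithmetic, plugging a sample of size \(4n\) into the formula above:

\[\frac{\text{sd}(x)}{\sqrt{4n}} = \frac{1}{2}\,\frac{\text{sd}(x)}{\sqrt{n}}\]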

Confidence interval

Another way to quantify uncertainty is to compute a confidence interval (CI) for a certain confidence level:
The true value of the statistic is expected to lie within the CI with a certain probability (the confidence level).

Compute the 95% confidence interval of the bootstrap distribution with quantile-functions:

boots_galton_mean |> 
  summarize(
    lower = quantile(mean, 0.025),
    upper = quantile(mean, 0.975))
# A tibble: 1 × 2
  lower upper
  <dbl> <dbl>
1 1191. 1202.
boots_galton_mean |> 
 ggplot(aes(x = mean)) + geom_histogram() +
 geom_vline(xintercept = quantile(boots_galton_mean$mean, 0.025), color = "red") +
 geom_vline(xintercept = quantile(boots_galton_mean$mean, 0.975), color = "red") +
 annotate("text", x = 1197, y = 10, label = "95% of estimates", color = "white", size = unit(12, "pt")) +
 theme_minimal(base_size = 24)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Confidence Interval: Meaning

boots_galton_mean |> 
  summarize(
    lower = quantile(mean, 0.025),
    upper = quantile(mean, 0.975))
# A tibble: 1 × 2
  lower upper
  <dbl> <dbl>
1 1191. 1202.

What is the correct interpretation?

  1. 95% of the estimates in this sample lie between 1191 and 1202.
  2. We are 95% confident that the mean estimate of all potential participants is between 1191 and 1202.
  3. We are 95% confident that the mean estimate of this sample is between 1191 and 1202.

Correct: 2.
1. The confidence is about a parameter (here the population mean) not about sample values!
3. We know the mean of the sample precisely!
The confidence interval assesses where 95% of the sample means would lie if we drew new samples.

CI: Precision vs. Confidence

  • Can’t we increase the confidence level to 99% to get a more precise estimate with a narrower confidence interval?
boots_galton_mean |> 
  summarize(
    lower = quantile(mean, 0.005),
    upper = quantile(mean, 0.995))
# A tibble: 1 × 2
  lower upper
  <dbl> <dbl>
1 1190. 1204.

No, it is the other way round: the smaller and more precise the confidence interval, the lower the confidence level must be.

Bootstrap linear model coefficients

peng_workflow <- workflow() |> add_recipe(recipe(body_mass_g ~ ., data = penguins) |> step_rm(year)) |> 
  add_model(linear_reg() |> set_engine("lm"))
peng_fit <- peng_workflow |> fit(data = penguins)
peng_fit |> tidy() |> mutate_if(is.numeric, round, 7)
# A tibble: 9 × 5
  term              estimate std.error statistic   p.value
  <chr>                <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)        -1500.     576.      -2.61  0.00961  
2 speciesChinstrap    -260.      88.6     -2.94  0.00352  
3 speciesGentoo        988.     137.       7.20  0        
4 islandDream          -13.1     58.5     -0.224 0.823    
5 islandTorgersen      -48.1     60.9     -0.789 0.431    
6 bill_length_mm        18.2      7.14     2.55  0.0113   
7 bill_depth_mm         67.6     19.8      3.41  0.000734 
8 flipper_length_mm     16.2      2.94     5.52  0.0000001
9 sexmale              387.      48.1      8.04  0        

Let us now bootstrap whole models instead of just mean values!

# A function to fit a model to a bootstrap sample and tidy the coefficients
fit_split <- function(split) peng_workflow |> fit(data = analysis(split)) |> tidy() 
# Make 1000 bootstrap samples and fit a model to each
boots_peng <- bootstraps(penguins, times = 1000) |> 
  mutate(coefs = map(splits, fit_split))

Bootstrap data frame

Now we have a column coefs with a list of data frames, each containing the fitted coefficients.

boots_peng
# Bootstrap sampling 
# A tibble: 1,000 × 3
   splits            id            coefs           
   <list>            <chr>         <list>          
 1 <split [344/125]> Bootstrap0001 <tibble [9 × 5]>
 2 <split [344/131]> Bootstrap0002 <tibble [9 × 5]>
 3 <split [344/123]> Bootstrap0003 <tibble [9 × 5]>
 4 <split [344/116]> Bootstrap0004 <tibble [9 × 5]>
 5 <split [344/130]> Bootstrap0005 <tibble [9 × 5]>
 6 <split [344/126]> Bootstrap0006 <tibble [9 × 5]>
 7 <split [344/126]> Bootstrap0007 <tibble [9 × 5]>
 8 <split [344/129]> Bootstrap0008 <tibble [9 × 5]>
 9 <split [344/129]> Bootstrap0009 <tibble [9 × 5]>
10 <split [344/129]> Bootstrap0010 <tibble [9 × 5]>
# ℹ 990 more rows

We can unnest the list column to make a much longer but unnested data frame:

boots_peng |> unnest(coefs)
# A tibble: 9,000 × 7
   splits            id            term    estimate std.error statistic  p.value
   <list>            <chr>         <chr>      <dbl>     <dbl>     <dbl>    <dbl>
 1 <split [344/125]> Bootstrap0001 (Inter… -1972.      630.     -3.13   1.91e- 3
 2 <split [344/125]> Bootstrap0001 specie…  -294.       94.9    -3.10   2.09e- 3
 3 <split [344/125]> Bootstrap0001 specie…   958.      137.      6.97   1.84e-11
 4 <split [344/125]> Bootstrap0001 island…    -1.14     57.7    -0.0197 9.84e- 1
 5 <split [344/125]> Bootstrap0001 island…   -71.9      61.8    -1.16   2.45e- 1
 6 <split [344/125]> Bootstrap0001 bill_l…    22.3       8.00    2.78   5.68e- 3
 7 <split [344/125]> Bootstrap0001 bill_d…    80.1      20.6     3.88   1.26e- 4
 8 <split [344/125]> Bootstrap0001 flippe…    16.8       3.00    5.61   4.40e- 8
 9 <split [344/125]> Bootstrap0001 sexmale   353.       51.2     6.89   2.94e-11
10 <split [344/131]> Bootstrap0002 (Inter… -1945.      585.     -3.33   9.82e- 4
# ℹ 8,990 more rows

Some coefficient distributions

library(patchwork)
g1 <- boots_peng |> unnest(coefs) |> filter(term == "islandDream") |> 
  ggplot(aes(x = estimate)) + geom_histogram() + xlab("coefficients islandDream") +
  geom_vline(xintercept = 0, color = "gray") + theme_minimal(base_size = 24)
g2 <- boots_peng |> unnest(coefs) |> filter(term == "bill_length_mm") |> 
  ggplot(aes(x = estimate)) + geom_histogram() + xlab("coefficients bill_length_mm") +
  geom_vline(xintercept = 0, color = "gray") + theme_minimal(base_size = 24)
g3 <- boots_peng |> unnest(coefs) |> filter(term == "flipper_length_mm") |> 
  ggplot(aes(x = estimate)) + geom_histogram() + xlab("coefficients flipper_length_mm") +
  geom_vline(xintercept = 0, color = "gray") + theme_minimal(base_size = 24)
g1 + g2 + g3
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
  • We could derive bootstrapped p-values from these distributions.

Computed p-values from each fit

  • Each bootstrap fit also delivers computed p-values for each coefficient.
g1 <- boots_peng |> unnest(coefs) |> filter(term == "islandDream") |> ggplot(aes(x = p.value)) + geom_histogram() + xlab("coefficients islandDream") + theme_minimal(base_size = 24)
g2 <- boots_peng |> unnest(coefs) |> filter(term == "bill_length_mm") |> ggplot(aes(x = p.value)) + geom_histogram() + xlab("coefficients bill_length_mm") + theme_minimal(base_size = 24)
g3 <- boots_peng |> unnest(coefs) |> filter(term == "flipper_length_mm") |> ggplot(aes(x = p.value)) + geom_histogram() + xlab("coefficients flipper_length_mm") + theme_minimal(base_size = 24)
g1 + g2 + g3
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  • The computed p-values vary widely across the resamples (see islandDream) when the bootstrapped coefficient distribution indicates insignificance.

Cross validation: Another resampling method

How to evaluate performance on training data only?

  • Model performance changes with the random selection of the training data. How can we then reliably compare models?
  • In any case, the training data is not a good basis for assessing model performance. It is not an independent piece of information: predicting the training data only reveals what the model already “knows”.
  • Also, we should save the testing data for the final validation only, so we should not use it systematically to compare models.

A solution: Cross validation

Cross validation

More specifically, \(v\)-fold cross validation:

  • Shuffle your data and make a partition with \(v\) parts
    • Recall from set theory: A partition is a division of a set into mutually disjoint parts whose union covers the whole set. Here it is applied to the observations (rows) of a data frame.
  • Use 1 part for validation, and the remaining \(v-1\) parts for training
  • Repeat \(v\) times, so that each part is used for validation once (see the sketch below)
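As a minimal sketch of how this can look with the rsample and tune functions from tidymodels (assuming the peng_workflow defined earlier; rows with missing values are dropped first so that every fold can be fitted, and the object names are just illustrative):

peng_folds <- penguins |> drop_na() |> vfold_cv(v = 10)
# Fit the workflow on the 9 training parts of each fold and evaluate on the held-out part
peng_cv_results <- peng_workflow |> fit_resamples(resamples = peng_folds)
peng_cv_results |> collect_metrics()  # RMSE and R-squared averaged over the 10 validation folds

This way models are compared on validation folds only, and the testing data stays untouched until the final validation.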

Cross validation

Wisdom of the crowd, Bias and Diversity

Galton’s data

What is the weight of the meat of this ox?

galton <- read_csv("data/galton.csv")
Rows: 787 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): Estimate, id

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
galton |> ggplot(aes(Estimate)) + geom_histogram(binwidth = 5) + geom_vline(xintercept = 1198, color = "green") + 
 geom_vline(xintercept = mean(galton$Estimate), color = "red")

787 estimates, true value 1198, mean 1196.7

RMSE Galton’s data

Describe the estimation game as a predictive model:

  • All estimates are made to predict the same value: the truth.
    • In contrast to the regression model, the estimates come from people and not from a regression formula.
  • The truth is the same for all.
    • In contrast to the regression model, the truth is one value and not a value for each prediction
rmse_galton <- galton |> 
 mutate(true_value = 1198) |>
 rmse(truth = true_value, Estimate)
rmse_galton
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        73.6

MSE, Variance, and Bias of estimates

In a crowd estimation, \(n\) estimators delivered the estimates \(\hat{y}_1,\dots,\hat{y}_n\). Let us look at the following measures

  • \(\bar{y} = \frac{1}{n}\sum_{i = 1}^n \hat{y}_i\) is the mean estimate; it is the aggregated estimate of the crowd

  • \(\text{MSE} = \text{RMSE}^2 = \frac{1}{n}\sum_{i = 1}^n (\text{truth} - \hat{y}_i)^2\)

  • \(\text{Variance} = \frac{1}{n}\sum_{i = 1}^n (\hat{y}_i - \bar{y})^2\)

  • \(\text{Bias-squared} = (\bar{y} - \text{truth})^2\), which is the squared difference between the mean estimate and the truth.

There is a mathematical relation (a math exercise to check):

\[\text{MSE} = \text{Bias-squared} + \text{Variance}\]
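A sketch of that exercise: write each error as the deviation of the truth from the mean estimate plus the deviation of the mean estimate from the individual estimate; the cross term vanishes because the deviations from \(\bar{y}\) sum to zero.

\[\begin{aligned}
\text{MSE} &= \frac{1}{n}\sum_{i=1}^n (\text{truth} - \hat{y}_i)^2 = \frac{1}{n}\sum_{i=1}^n \big((\text{truth} - \bar{y}) + (\bar{y} - \hat{y}_i)\big)^2 \\
&= (\text{truth} - \bar{y})^2 + \frac{2(\text{truth} - \bar{y})}{n}\sum_{i=1}^n (\bar{y} - \hat{y}_i) + \frac{1}{n}\sum_{i=1}^n (\bar{y} - \hat{y}_i)^2 \\
&= \text{Bias-squared} + 0 + \text{Variance}
\end{aligned}\]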

Testing for Galton’s data

\[\text{MSE} = \text{Bias-squared} + \text{Variance}\]

MSE <- (rmse_galton$.estimate)^2 
MSE
[1] 5409.795
Variance <- var(galton$Estimate)*(nrow(galton)-1)/nrow(galton)
# Note: var() uses the divisor (n-1), i.e. the estimator for the population variance;
# we rescale by (n-1)/n to get the variance of the sample itself
Variance
[1] 5408.132
Bias_squared <- (mean(galton$Estimate) - 1198)^2
Bias_squared
[1] 1.663346
Bias_squared + Variance
[1] 5409.795

The diversity prediction theorem

  • MSE is a measure for the average individual error
  • Bias-squared is a measure for the collective error
  • Variance is a measure for the diversity of estimates around the mean estimate

The mathematical relation \[\text{MSE} = \text{Bias-squared} + \text{Variance}\] can be formulated as

Collective error = Individual error - Diversity

Interpretation: The higher the diversity the lower the collective error!

Why is this message a bit suggestive?

The mathematical relation \[\text{MSE} = \text{Bias-squared} + \text{Variance}\] can be formulated as

Collective error = Individual error - Diversity

Interpretation: The higher the diversity the lower the collective error!

  • \(\text{MSE}\) and \(\text{Variance}\) are not independent!
  • Activities to increase diversity (Variance) typically also increase the average individual error (MSE).
  • For example, if we just add more random estimates with the same mean but wild variance to our sample, we increase both and do not gain any decrease of the collective error.

Accuracy for numerical estimate

  • For binary classifiers accuracy has a simple definition: Fraction of correct classifications.
    • It can be further informed by other more specific measures taken from the confusion matrix (sensitivity, specificity)

How about numerical estimators?
For example outcomes of estimation games, or linear regression models?

  • Accuracy is for example measured by (R)MSE
  • \(\text{MSE} = \text{Bias-squared} + \text{Variance}\) shows us that we can make a
    bias-variance decomposition
  • That means some part of the error is systematic (the bias) and another part is due to random variation (the variance).
  • The bias-variance tradeoff is also an important concept in statistical learning!

2-d Accuracy: Trueness and Precision

According to the ISO 5725-1 standard (Accuracy (trueness and precision) of measurement methods and results – Part 1: General principles and definitions), there are two dimensions of accuracy of numerical measurements: trueness and precision.

What is a wise crowd?

Assume the dots are estimates. Which is a wise crowd?

  • Of course, high trueness and high precision! But, …
  • Focusing on the crowd being wise instead of its individuals: High trueness, low precision.

Bias-variance trade-off in a script

penguins_bias_variance.R