A Quick Tip for Cluster Bootstrap with the 'boot' Package

Publish date: Jan 22, 2021
Last updated: Jan 22, 2021

Tags:

Motivation

Non-parametric bootstrap is not a massive pain in the neck if each unit has a single observation. Cluster boostrap, however, requires some extra steps when we are going to manipulate the bootstrapped data further based on an ID variable. Once bootstrapped, a unit should be uniquely identified even if it was sampled multiple times. For example, we might want to take the last observation of each unit, but it is challenging to do it if there are many last observations from the same unit.

Solution

My solution is pretty simple.

Supply the ID vector to the boot::boot function, instead of the whole data.
Define a function to calculate your statistic in which a new ID variable is generated that uniquely identify each unit.

Let’s make a fake data set to demonstrate the solution.

library(tidyr)
library(purrr)
library(dplyr)

n_id <- 1000

n_t_fu <- 5

probs <- c(0.2, 0.5, 0.7, 0.8, 0.9)

set.seed(626)

fake_tbl <-
  crossing(
    patient_id = 1:n_id,
    t_fu = 1:n_t_fu
  ) %>%
  split(
    .,
    .$t_fu
  ) %>%
  map2_dfr(
    1:n_t_fu,
    ~ mutate(
      .x,
      event = rbinom(n_id, 1, prob = probs[[.y]])
    )
  )

print(fake_tbl, n = 30)

## # A tibble: 5,000 x 3
##    patient_id  t_fu event
##         <int> <int> <int>
##  1          1     1     0
##  2          2     1     0
##  3          3     1     0
##  4          4     1     1
##  5          5     1     0
##  6          6     1     0
##  7          7     1     0
##  8          8     1     0
##  9          9     1     0
## 10         10     1     0
## 11         11     1     1
## 12         12     1     0
## 13         13     1     0
## 14         14     1     0
## 15         15     1     0
## 16         16     1     0
## 17         17     1     1
## 18         18     1     0
## 19         19     1     0
## 20         20     1     0
## 21         21     1     0
## 22         22     1     0
## 23         23     1     0
## 24         24     1     1
## 25         25     1     1
## 26         26     1     1
## 27         27     1     1
## 28         28     1     0
## 29         29     1     0
## 30         30     1     0
## # … with 4,970 more rows

Then, let’s define a fucntion to calculate incidence at each time (t_fu).

calculate_stat <- function(.id, .index, .data, .id_name, .t_name, .event_name) {

  id_new <- seq_along(.id)

  id_boot <- .id[.index]

  data_boot <-
    map2_dfr(
      id_boot,
      id_new,
      function(arg1, arg2) {
        .data %>%
          filter(!!ensym(.id_name) == arg1) %>%
          mutate({{ .id_name }} := arg2)
      }
    )

  data_boot %>%
    group_by({{ .t_name }}) %>%
    summarize(incidence = mean({{ .event_name }})) %>%
    pull(incidence)

}

It can be used with the boot package as follows:

RNGkind("L'Ecuyer-CMRG")
set.seed(626304)
boot_result <-
  boot::boot(
  unique(fake_tbl$patient_id),
  statistic = calculate_stat,
  R = 1000,
  parallel = "multicore",
  ncpus = 8,
  .data = fake_tbl,
  .id_name = patient_id,
  .t_name = t_fu,
  .event_name = event
)