A Quick Tip for Cluster Bootstrap with the 'boot' Package
Publish date: Jan 22, 2021
Last updated: Jan 22, 2021
Last updated: Jan 22, 2021
Motivation
Non-parametric bootstrap is not a massive pain in the neck if each unit has a single observation. Cluster boostrap, however, requires some extra steps when we are going to manipulate the bootstrapped data further based on an ID variable. Once bootstrapped, a unit should be uniquely identified even if it was sampled multiple times. For example, we might want to take the last observation of each unit, but it is challenging to do it if there are many last observations from the same unit.
Solution
My solution is pretty simple.
- Supply the ID vector to the
boot::boot
function, instead of the whole data. - Define a function to calculate your statistic in which a new ID variable is generated that uniquely identify each unit.
Let’s make a fake data set to demonstrate the solution.
library(tidyr)
library(purrr)
library(dplyr)
n_id <- 1000
n_t_fu <- 5
probs <- c(0.2, 0.5, 0.7, 0.8, 0.9)
set.seed(626)
fake_tbl <-
crossing(
patient_id = 1:n_id,
t_fu = 1:n_t_fu
) %>%
split(
.,
.$t_fu
) %>%
map2_dfr(
1:n_t_fu,
~ mutate(
.x,
event = rbinom(n_id, 1, prob = probs[[.y]])
)
)
print(fake_tbl, n = 30)
## # A tibble: 5,000 x 3
## patient_id t_fu event
## <int> <int> <int>
## 1 1 1 0
## 2 2 1 0
## 3 3 1 0
## 4 4 1 1
## 5 5 1 0
## 6 6 1 0
## 7 7 1 0
## 8 8 1 0
## 9 9 1 0
## 10 10 1 0
## 11 11 1 1
## 12 12 1 0
## 13 13 1 0
## 14 14 1 0
## 15 15 1 0
## 16 16 1 0
## 17 17 1 1
## 18 18 1 0
## 19 19 1 0
## 20 20 1 0
## 21 21 1 0
## 22 22 1 0
## 23 23 1 0
## 24 24 1 1
## 25 25 1 1
## 26 26 1 1
## 27 27 1 1
## 28 28 1 0
## 29 29 1 0
## 30 30 1 0
## # … with 4,970 more rows
Then, let’s define a fucntion to calculate incidence at each time (t_fu
).
calculate_stat <- function(.id, .index, .data, .id_name, .t_name, .event_name) {
id_new <- seq_along(.id)
id_boot <- .id[.index]
data_boot <-
map2_dfr(
id_boot,
id_new,
function(arg1, arg2) {
.data %>%
filter(!!ensym(.id_name) == arg1) %>%
mutate({{ .id_name }} := arg2)
}
)
data_boot %>%
group_by({{ .t_name }}) %>%
summarize(incidence = mean({{ .event_name }})) %>%
pull(incidence)
}
It can be used with the boot
package as follows:
RNGkind("L'Ecuyer-CMRG")
set.seed(626304)
boot_result <-
boot::boot(
unique(fake_tbl$patient_id),
statistic = calculate_stat,
R = 1000,
parallel = "multicore",
ncpus = 8,
.data = fake_tbl,
.id_name = patient_id,
.t_name = t_fu,
.event_name = event
)