The purpose of this practical is to use some of R’s packages to solve a missing data problem by imputing missing values in a ‘real’ dataset. This can either be a dataset you’ve brought to this course yourself or the selfreport data from the mice package. The ‘selfreport’ data is described in Section 7.3 of the first edition of the book Flexible Imputation of Missing Data (van Buuren, S. (2012). Flexible Imputation of Missing Data. CRC/Chapman & Hall, FL: Boca Raton).
Suggested packages:
library(mice) # for visualisation & pmm
library(VIM) # visualisation & several hotdeck imputation methods (fast implementation)
library(simputation) # common interface for several imputation methods
library(magrittr) # for pipe operator
library(dplyr)
library(naniar) # additional package for practical
library(visdat) # visualisation of datasets
library(UpSetR) # Upset plots
library(ggplot2) # plots
The selfreport data contains data from two studies, one from the mgg study, the survey dataset, where height and weight are selfreported measures (hr and wr) and one from the krul study, the calibration dataset, where the height and weight measures are both selfreported and measured (variables hm and wm). The goal for the self-report data is to obtain correct estimates of the prevalence of obesity (Body Mass Index > 32) in the survey dataset. Note that the Body Mass Index (bmi) is computed as follows:
\[\text{bmi} = \frac{\text{weight in kg}}{(\text{height in m})^2} \]
First, check if there is a description of the selfreport data in the mice package documentation:
?selfreport
## starting httpd help server ... done
The selfreport data contains data from two studies, one from the mgg study where height and weight are selfreported measures (hr and wr) and one from the krul study where the measures are both selfreported and measured (variables hm and wm).
Then, investigate the missing data patterns
head(selfreport, n = 10)
## src id pop age sex hm wm hr wr prg edu etn
## 1 mgg 10001 NL 27 Male NA NA 190 85 <NA> Middle Autochtone
## 2 mgg 10002 NL 38 Male NA NA 189 93 <NA> Low Autochtone
## 3 mgg 10003 NL 21 Male NA NA 200 110 <NA> Low Autochtone
## 4 mgg 10004 NL 52 Male NA NA 176 80 <NA> Low Autochtone
## 5 mgg 10005 NL 44 Male NA NA 183 79 <NA> Middle Autochtone
## 6 mgg 10006 NL 31 Female NA NA 167 80 Not pregnant Middle Autochtone
## 7 mgg 10007 NL 42 Female NA NA 187 73 Not pregnant High Autochtone
## 8 mgg 10008 NL 39 Male NA NA 180 85 <NA> High Allochtone
## 9 mgg 10009 NL 36 Female NA NA 178 70 Not pregnant Middle Autochtone
## 10 mgg 10010 NL 38 Female NA NA 174 55 Not pregnant High Allochtone
## web bm br
## 1 No NA 23.54571
## 2 No NA 26.03511
## 3 No NA 27.50000
## 4 No NA 25.82645
## 5 No NA 23.58984
## 6 No NA 28.68514
## 7 No NA 20.87563
## 8 No NA 26.23457
## 9 No NA 22.09317
## 10 No NA 18.16620
summary(selfreport)
## src id pop age sex
## krul:1257 Min. :10001 NL:2060 Min. :18.00 Female:1098
## mgg : 803 1st Qu.:10612 IT: 0 1st Qu.:29.00 Male : 962
## Median :12302 US: 0 Median :41.00
## Mean :13299 Mean :42.06
## 3rd Qu.:15958 3rd Qu.:54.00
## Max. :17079 Max. :75.00
##
## hm wm hr wr
## Min. :143.6 Min. : 44.0 Min. :143.0 Min. : 30.00
## 1st Qu.:166.5 1st Qu.: 66.0 1st Qu.:168.0 1st Qu.: 66.00
## Median :173.3 Median : 75.8 Median :175.0 Median : 75.20
## Mean :174.0 Mean : 77.8 Mean :174.9 Mean : 77.79
## 3rd Qu.:181.1 3rd Qu.: 86.6 3rd Qu.:182.0 3rd Qu.: 87.00
## Max. :207.3 Max. :149.7 Max. :214.0 Max. :165.00
## NA's :803 NA's :803
## prg edu etn web
## Pregnant : 0 Low : 286 Autochtone: 657 No :1858
## Delivery last 6 months: 0 Middle: 323 Allochtone: 146 Yes: 202
## Not pregnant : 403 High : 194 Unknown : 0
## NA's :1657 NA's :1257 NA's :1257
##
##
##
## bm br
## Min. :16.83 Min. :10.76
## 1st Qu.:21.98 1st Qu.:22.08
## Median :24.64 Median :24.71
## Mean :25.68 Mean :25.39
## 3rd Qu.:28.12 3rd Qu.:27.67
## Max. :55.17 Max. :52.08
## NA's :803
The measures weight and height variables are indeed missing from the mgg study
VIM::aggr(selfreport, col=c('blue','red'), numbers=TRUE, sortVars=TRUE,
labels=names(selfreport), cex.axis=.8, gap=3,
ylab=c("Histogram of missing data", "Red is Missing Data"))
##
## Variables sorted by number of missings:
## Variable Count
## prg 0.8043689
## edu 0.6101942
## etn 0.6101942
## hm 0.3898058
## wm 0.3898058
## bm 0.3898058
## src 0.0000000
## id 0.0000000
## pop 0.0000000
## age 0.0000000
## sex 0.0000000
## hr 0.0000000
## wr 0.0000000
## web 0.0000000
## br 0.0000000
Apart from measured height and weight also the measured bmi, the indicators for pregnancy, educational level and ethnicity contain missing values.
There are more individuals that have a lower selfreported BMI than measured BMI.
calibration <- selfreport %>%
filter(as.numeric(src) == 1) %>%
mutate(difference = case_when(br < bm ~ "lower",
bm < br ~ "higher",
bm == br ~ "same"))
ggplot(calibration, aes(br, bm, color = difference)) +
geom_point() +
theme_minimal() +
geom_abline(intercept = 0, slope = 1) +
labs(x = "Selfreported BMI", y = "Measured BMI", fill = "")
First define a deterministic function to compute bmi:
bmi <- function(h, w) {
return(w/(h/100)^2)
}
Define the imputation methods to use (we only impute wm, hm and bm). Make sure to define the deterministic relation between height, weight and bmi for the computation (not imputation) of bmi.
meth <- c(rep("", 5), "pmm", "pmm", rep("", 6), "~bmi(hm,wm)", "")
Get predictor matrix:
init <- mice(selfreport, maxit = 0) #get initial predictor matrix
pred <- init$pred #note that pregnancy is not imputed by default because it contains too many missings
pred[, c("src", "id", "pop", "prg", "edu", "etn", "web", "bm", "br")] <- 0 #set variables not used as predictors to 0
Impute:
imp <- mice(selfreport, pred = pred, meth = meth, seed = 66573, maxit = 20, m = 10)
age and sex in the survey dataset?Estimated prevalence in the selfreported data
reported <- as.tbl(selfreport)
reported %>%
filter(as.numeric(src) == 2) %>%
mutate(br = bmi(hr, wr)) %>%
mutate(obese = case_when(br > 32 ~ 1,
TRUE ~ 0)) %>%
mutate(age_cat = cut(age, c(17, 29, 39, 49, 59, 75))) %>%
group_by(sex, age_cat) %>%
summarise(N = n(), prevalence = sum(obese)/N*100)
## # A tibble: 10 x 4
## # Groups: sex [2]
## sex age_cat N prevalence
## <fct> <fct> <int> <dbl>
## 1 Female (17,29] 68 10.3
## 2 Female (29,39] 69 14.5
## 3 Female (39,49] 68 13.2
## 4 Female (49,59] 81 17.3
## 5 Female (59,75] 117 4.27
## 6 Male (17,29] 69 1.45
## 7 Male (29,39] 73 5.48
## 8 Male (39,49] 66 4.55
## 9 Male (49,59] 91 12.1
## 10 Male (59,75] 101 1.98
Estimated prevalence in the imputed data
imputed <- as.tbl(complete(imp, action = "long"))
imputed %>%
filter(as.numeric(src) == 2) %>%
mutate(obese = case_when(bm > 32 ~ 1,
TRUE ~ 0)) %>%
mutate(age_cat = cut(age, c(17, 29, 39, 49, 59, 75))) %>%
group_by(sex, age_cat) %>%
summarise(N = n()/10, prevalence = sum(obese)/(N*10)*100)
## # A tibble: 10 x 4
## # Groups: sex [2]
## sex age_cat N prevalence
## <fct> <fct> <dbl> <dbl>
## 1 Female (17,29] 68 10.6
## 2 Female (29,39] 69 19.9
## 3 Female (39,49] 68 16.3
## 4 Female (49,59] 81 22.7
## 5 Female (59,75] 117 7.61
## 6 Male (17,29] 69 2.03
## 7 Male (29,39] 73 7.67
## 8 Male (39,49] 66 6.21
## 9 Male (49,59] 91 14.8
## 10 Male (59,75] 101 5.54
End of practical.