The purpose of this practical is to use some of R’s packages to solve a missing data problem by imputing missing values in a ‘real’ dataset. This can either be a dataset you’ve brought to this course yourself or the selfreport data from the mice package. The ‘selfreport’ data is described in Section 7.3 of the first edition of the book Flexible Imputation of Missing Data (van Buuren, S. (2012). Flexible Imputation of Missing Data. CRC/Chapman & Hall, FL: Boca Raton).


Suggested packages:

library(mice) # for visualisation & pmm
library(VIM) # visualisation & several hotdeck imputation methods (fast implementation)
library(simputation) # common interface for several imputation methods
library(magrittr) # for pipe operator
library(dplyr)
library(naniar) # additional package for practical
library(visdat) # visualisation of datasets
library(UpSetR) # Upset plots
library(ggplot2) # plots

Selfreport data


The selfreport data contains data from two studies, one from the mgg study, the survey dataset, where height and weight are selfreported measures (hr and wr) and one from the krul study, the calibration dataset, where the height and weight measures are both selfreported and measured (variables hm and wm). The goal for the self-report data is to obtain correct estimates of the prevalence of obesity (Body Mass Index > 32) in the survey dataset. Note that the Body Mass Index (bmi) is computed as follows:

\[\text{bmi} = \frac{\text{weight in kg}}{(\text{height in m})^2} \]


  1. Check missing data patterns.

First, check if there is a description of the selfreport data in the mice package documentation:

?selfreport
## starting httpd help server ... done

The selfreport data contains data from two studies, one from the mgg study where height and weight are selfreported measures (hr and wr) and one from the krul study where the measures are both selfreported and measured (variables hm and wm).

Then, investigate the missing data patterns

head(selfreport, n = 10)
##    src    id pop age    sex hm wm  hr  wr          prg    edu        etn
## 1  mgg 10001  NL  27   Male NA NA 190  85         <NA> Middle Autochtone
## 2  mgg 10002  NL  38   Male NA NA 189  93         <NA>    Low Autochtone
## 3  mgg 10003  NL  21   Male NA NA 200 110         <NA>    Low Autochtone
## 4  mgg 10004  NL  52   Male NA NA 176  80         <NA>    Low Autochtone
## 5  mgg 10005  NL  44   Male NA NA 183  79         <NA> Middle Autochtone
## 6  mgg 10006  NL  31 Female NA NA 167  80 Not pregnant Middle Autochtone
## 7  mgg 10007  NL  42 Female NA NA 187  73 Not pregnant   High Autochtone
## 8  mgg 10008  NL  39   Male NA NA 180  85         <NA>   High Allochtone
## 9  mgg 10009  NL  36 Female NA NA 178  70 Not pregnant Middle Autochtone
## 10 mgg 10010  NL  38 Female NA NA 174  55 Not pregnant   High Allochtone
##    web bm       br
## 1   No NA 23.54571
## 2   No NA 26.03511
## 3   No NA 27.50000
## 4   No NA 25.82645
## 5   No NA 23.58984
## 6   No NA 28.68514
## 7   No NA 20.87563
## 8   No NA 26.23457
## 9   No NA 22.09317
## 10  No NA 18.16620
summary(selfreport)
##    src             id        pop            age            sex      
##  krul:1257   Min.   :10001   NL:2060   Min.   :18.00   Female:1098  
##  mgg : 803   1st Qu.:10612   IT:   0   1st Qu.:29.00   Male  : 962  
##              Median :12302   US:   0   Median :41.00                
##              Mean   :13299             Mean   :42.06                
##              3rd Qu.:15958             3rd Qu.:54.00                
##              Max.   :17079             Max.   :75.00                
##                                                                     
##        hm              wm              hr              wr        
##  Min.   :143.6   Min.   : 44.0   Min.   :143.0   Min.   : 30.00  
##  1st Qu.:166.5   1st Qu.: 66.0   1st Qu.:168.0   1st Qu.: 66.00  
##  Median :173.3   Median : 75.8   Median :175.0   Median : 75.20  
##  Mean   :174.0   Mean   : 77.8   Mean   :174.9   Mean   : 77.79  
##  3rd Qu.:181.1   3rd Qu.: 86.6   3rd Qu.:182.0   3rd Qu.: 87.00  
##  Max.   :207.3   Max.   :149.7   Max.   :214.0   Max.   :165.00  
##  NA's   :803     NA's   :803                                     
##                      prg           edu               etn        web      
##  Pregnant              :   0   Low   : 286   Autochtone: 657   No :1858  
##  Delivery last 6 months:   0   Middle: 323   Allochtone: 146   Yes: 202  
##  Not pregnant          : 403   High  : 194   Unknown   :   0             
##  NA's                  :1657   NA's  :1257   NA's      :1257             
##                                                                          
##                                                                          
##                                                                          
##        bm              br       
##  Min.   :16.83   Min.   :10.76  
##  1st Qu.:21.98   1st Qu.:22.08  
##  Median :24.64   Median :24.71  
##  Mean   :25.68   Mean   :25.39  
##  3rd Qu.:28.12   3rd Qu.:27.67  
##  Max.   :55.17   Max.   :52.08  
##  NA's   :803

The measures weight and height variables are indeed missing from the mgg study

VIM::aggr(selfreport, col=c('blue','red'), numbers=TRUE, sortVars=TRUE,
                  labels=names(selfreport), cex.axis=.8, gap=3, 
                  ylab=c("Histogram of missing data", "Red is Missing Data"))

## 
##  Variables sorted by number of missings: 
##  Variable     Count
##       prg 0.8043689
##       edu 0.6101942
##       etn 0.6101942
##        hm 0.3898058
##        wm 0.3898058
##        bm 0.3898058
##       src 0.0000000
##        id 0.0000000
##       pop 0.0000000
##       age 0.0000000
##       sex 0.0000000
##        hr 0.0000000
##        wr 0.0000000
##       web 0.0000000
##        br 0.0000000

Apart from measured height and weight also the measured bmi, the indicators for pregnancy, educational level and ethnicity contain missing values.


  1. Check the relation between measured and self reported bmi in the calibration dataset.

There are more individuals that have a lower selfreported BMI than measured BMI.

calibration <- selfreport %>% 
  filter(as.numeric(src) == 1) %>% 
  mutate(difference = case_when(br < bm ~ "lower",
                                bm < br ~ "higher",
                                bm == br ~ "same"))

ggplot(calibration, aes(br, bm, color = difference)) +
  geom_point() +
  theme_minimal() +
  geom_abline(intercept = 0, slope = 1) + 
  labs(x = "Selfreported BMI", y = "Measured BMI", fill = "")


  1. Impute missing data.

First define a deterministic function to compute bmi:

bmi <- function(h, w) {
  return(w/(h/100)^2)
}

Define the imputation methods to use (we only impute wm, hm and bm). Make sure to define the deterministic relation between height, weight and bmi for the computation (not imputation) of bmi.

meth <- c(rep("", 5), "pmm", "pmm", rep("", 6), "~bmi(hm,wm)", "")

Get predictor matrix:

init <- mice(selfreport, maxit = 0) #get initial predictor matrix
pred <- init$pred #note that pregnancy is not imputed by default because it contains too many missings
pred[, c("src", "id", "pop", "prg", "edu", "etn", "web", "bm", "br")] <- 0 #set variables not used as predictors to 0

Impute:

imp <- mice(selfreport, pred = pred, meth = meth, seed = 66573, maxit = 20, m = 10)

  1. What is the estimated prevalence (selfreported and imputed data) of obesity for different age and sex in the survey dataset?

Estimated prevalence in the selfreported data

reported <- as.tbl(selfreport)

reported %>% 
  filter(as.numeric(src) == 2) %>% 
  mutate(br = bmi(hr, wr)) %>%
  mutate(obese = case_when(br > 32 ~ 1,
                           TRUE ~ 0)) %>% 
  mutate(age_cat = cut(age, c(17, 29, 39, 49, 59, 75))) %>% 
  group_by(sex, age_cat) %>% 
  summarise(N = n(), prevalence = sum(obese)/N*100)
## # A tibble: 10 x 4
## # Groups:   sex [2]
##    sex    age_cat     N prevalence
##    <fct>  <fct>   <int>      <dbl>
##  1 Female (17,29]    68      10.3 
##  2 Female (29,39]    69      14.5 
##  3 Female (39,49]    68      13.2 
##  4 Female (49,59]    81      17.3 
##  5 Female (59,75]   117       4.27
##  6 Male   (17,29]    69       1.45
##  7 Male   (29,39]    73       5.48
##  8 Male   (39,49]    66       4.55
##  9 Male   (49,59]    91      12.1 
## 10 Male   (59,75]   101       1.98

Estimated prevalence in the imputed data

imputed <- as.tbl(complete(imp, action = "long"))

imputed %>% 
  filter(as.numeric(src) == 2) %>% 
  mutate(obese = case_when(bm > 32 ~ 1,
                           TRUE ~ 0)) %>% 
  mutate(age_cat = cut(age, c(17, 29, 39, 49, 59, 75))) %>% 
  group_by(sex, age_cat) %>% 
  summarise(N = n()/10, prevalence = sum(obese)/(N*10)*100)
## # A tibble: 10 x 4
## # Groups:   sex [2]
##    sex    age_cat     N prevalence
##    <fct>  <fct>   <dbl>      <dbl>
##  1 Female (17,29]    68      10.6 
##  2 Female (29,39]    69      19.9 
##  3 Female (39,49]    68      16.3 
##  4 Female (49,59]    81      22.7 
##  5 Female (59,75]   117       7.61
##  6 Male   (17,29]    69       2.03
##  7 Male   (29,39]    73       7.67
##  8 Male   (39,49]    66       6.21
##  9 Male   (49,59]    91      14.8 
## 10 Male   (59,75]   101       5.54

End of practical.