Statistical Programming with R

Packages and functions used

library(magrittr) # pipes
library(DAAG) # data sets and functions
  • glm() Generalized linear models
  • predict() Obtain predictions based on a model
  • confint() Obtain confidence intervals for model parameters
  • coef() Obtain a model’s coefficients
  • DAAG::CVbinary() Cross-validation for regression with a binary response

Generalized linear models

GLMs

The (general) linear model has the form \[Y_i=\alpha+\beta X_i+\epsilon_i,\] where the \(\epsilon_i\) are IID: \(\epsilon_i \sim N(0,\sigma^2)\)

The generalized linear model relaxes the assumptions of linearity and normality:

\[f(\mathbb{E}[Y]) = \alpha + \beta X\]

Generalized linear models allow for a range of distributions (the exponential family) for the residuals and a variety of link functions \(f\).

GLMs in R

We will work with the glm() function in the stats package. If your needs grow more exotic, there are many alternative GLM-related packages.

glm(formula, family = gaussian, data, weights, subset,
    na.action, start = NULL, etastart, mustart, offset,
    control = list(...), model = TRUE, method = "glm.fit",
    x = FALSE, y = TRUE, singular.ok = TRUE, contrasts = NULL, ...)
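Most of these arguments can be left at their defaults. As a minimal, runnable sketch (using the built-in mtcars data rather than a course dataset, purely for illustration), a logistic fit needs only a formula, a family, and a data frame:

```r
# Logistic regression: transmission type (am, 0/1) as a function of weight (wt)
fit <- glm(am ~ wt, family = binomial(link = "logit"), data = mtcars)
coef(fit)  # intercept and slope, on the log-odds scale
```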

Distributions and link functions supported by glm

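The families accepted by glm are the standard stats::family constructors; each family object carries its link and can be inspected directly. A small sketch:

```r
# Each family constructor returns an object describing its (default) link
binomial()$link    # "logit"; "probit" and "cloglog" are also accepted
poisson()$link     # "log"
Gamma()$link       # "inverse"
gaussian()$link    # "identity"

# The link and its inverse are available as functions on the family object
binomial()$linkfun(0.5)  # logit(0.5) = 0
binomial()$linkinv(0)    # inverse logit of 0 = 0.5
```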
Further extensions of the general linear model

  • The nlme package for Gaussian (non)linear mixed-effects models
  • The mgcv package for generalized additive (mixed) models
  • The lcmm package for latent class linear mixed models and joint analysis of longitudinal and time-to-event data

Logistic regression

Logistic regression

Logistic regression models the log-odds of a binary outcome as a linear function of the predictors:

\[ \text{logit}(p_i) = \log\frac{p_i}{1-p_i} = \alpha + \beta X_i \]

where the outcome \(Y_i\) follows a binomial distribution with success probability \(p_i\).
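Base R already provides the logit and its inverse as qlogis() and plogis(), which is a handy way to move between the probability scale and the linear-predictor scale:

```r
p   <- 0.8
eta <- qlogis(p)                  # logit: log(p / (1 - p))
all.equal(eta, log(p / (1 - p)))  # TRUE
plogis(eta)                       # inverse logit recovers p = 0.8
```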

Logistic regression

We explore the anesthetic dataset: thirty patients were given an anesthetic agent, maintained at a predetermined concentration (conc) for 15 minutes, before an incision was made. It was then noted whether the patient moved, i.e. jerked or twisted.

anesthetic %>% head()
##   move conc   logconc nomove
## 1    0  1.0 0.0000000      1
## 2    1  1.2 0.1823216      0
## 3    0  1.4 0.3364722      1
## 4    1  1.4 0.3364722      0
## 5    1  1.2 0.1823216      0
## 6    0  2.5 0.9162907      1

Logistic regression

glm(nomove ~ conc, family = binomial(link="logit"), data=anesthetic) %>% summary()
## 
## Call:
## glm(formula = nomove ~ conc, family = binomial(link = "logit"), 
##     data = anesthetic)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.76666  -0.74407   0.03413   0.68666   2.06900  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   -6.469      2.418  -2.675  0.00748 **
## conc           5.567      2.044   2.724  0.00645 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 41.455  on 29  degrees of freedom
## Residual deviance: 27.754  on 28  degrees of freedom
## AIC: 31.754
## 
## Number of Fisher Scoring iterations: 5
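One way to read these coefficients: the fitted log-odds of not moving is \(-6.469 + 5.567 \times \text{conc}\), so the concentration at which the model predicts a 50% chance of no movement is where the linear predictor crosses zero. A quick check using the rounded estimates printed above:

```r
alpha  <- -6.469; beta <- 5.567    # estimates from the summary above
conc50 <- -alpha / beta            # linear predictor = 0  <=>  p = 0.5
conc50                             # about 1.16
plogis(alpha + beta * conc50)      # 0.5 by construction
```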

A different approach

Say we want to analyse counts in the aggregated table:

anestot <- aggregate(anesthetic[, c("move","nomove")],  
                     by = list(conc = anesthetic$conc), FUN = sum) 
anestot
##   conc move nomove
## 1  0.8    6      1
## 2  1.0    4      1
## 3  1.2    2      4
## 4  1.4    2      4
## 5  1.6    0      4
## 6  2.5    0      2

A different approach

anestot$total <- apply(anestot[, c("move","nomove")], 1 , sum) 
anestot$prop <- anestot$nomove / anestot$total 
anestot
##   conc move nomove total      prop
## 1  0.8    6      1     7 0.1428571
## 2  1.0    4      1     5 0.2000000
## 3  1.2    2      4     6 0.6666667
## 4  1.4    2      4     6 0.6666667
## 5  1.6    0      4     4 1.0000000
## 6  2.5    0      2     2 1.0000000

A different approach

glm(prop ~ conc, family = binomial(link="logit"), weights = total, data=anestot) %>% 
  summary()
## 
## Call:
## glm(formula = prop ~ conc, family = binomial(link = "logit"), 
##     data = anestot, weights = total)
## 
## Deviance Residuals: 
##        1         2         3         4         5         6  
##  0.20147  -0.45367   0.56890  -0.70000   0.81838   0.04826  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   -6.469      2.419  -2.675  0.00748 **
## conc           5.567      2.044   2.724  0.00645 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 15.4334  on 5  degrees of freedom
## Residual deviance:  1.7321  on 4  degrees of freedom
## AIC: 13.811
## 
## Number of Fisher Scoring iterations: 5
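An equivalent (and common) specification for aggregated binomial data passes a two-column matrix of successes and failures as the response, with no weights argument. Reconstructing the small anestot table from above makes the sketch self-contained:

```r
# Aggregated counts, as shown in the anestot table above
anestot <- data.frame(conc   = c(0.8, 1.0, 1.2, 1.4, 1.6, 2.5),
                      move   = c(6, 4, 2, 2, 0, 0),
                      nomove = c(1, 1, 4, 4, 4, 2))
# Response = matrix of (successes, failures); no weights needed
fit <- glm(cbind(nomove, move) ~ conc, family = binomial, data = anestot)
coef(fit)  # same estimates as the proportion/weights fit: -6.469, 5.567
```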

Logistic multiple regression

We analyse the frogs dataset, which records the presence or absence of the Southern Corroboree frog at a number of reference points in the Snowy Mountains area of New South Wales, Australia.

Logistic multiple regression

Code for the scatterplot matrix on the previous slide:

with(frogs, pairs(cbind(distance, NoOfPools, NoOfSites, avrain, altitude, 
                        meanmax+meanmin, meanmax-meanmin), 
                  col = "gray", panel = panel.smooth, 
                  labels = c("Distance", "NoOfPools", 
                             "NoOfSites", "AvRainfall", "Altitude", 
                             "meanmax + meanmin", "meanmax - meanmin"))) 

Logistic multiple regression

## 
## Call:
## glm(formula = pres.abs ~ log(distance) + log(NoOfPools) + NoOfSites + 
##     avrain + I(meanmax + meanmin) + I(meanmax - meanmin), family = binomial, 
##     data = frogs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9763  -0.7189  -0.2786   0.7970   2.5745  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          18.2688999 16.1381912   1.132  0.25762    
## log(distance)        -0.7583198  0.2558117  -2.964  0.00303 ** 
## log(NoOfPools)        0.5708953  0.2153335   2.651  0.00802 ** 
## NoOfSites            -0.0036201  0.1061469  -0.034  0.97279    
## avrain                0.0007003  0.0411710   0.017  0.98643    
## I(meanmax + meanmin)  1.4958055  0.3153152   4.744  2.1e-06 ***
## I(meanmax - meanmin) -3.8582668  1.2783921  -3.018  0.00254 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 279.99  on 211  degrees of freedom
## Residual deviance: 197.65  on 205  degrees of freedom
## AIC: 211.65
## 
## Number of Fisher Scoring iterations: 5

Logistic multiple regression

## 
## Call:
## glm(formula = pres.abs ~ log(distance) + log(NoOfPools) + I(meanmax + 
##     meanmin) + I(meanmax - meanmin), family = binomial, data = frogs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9753  -0.7224  -0.2780   0.7974   2.5736  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           18.5268     5.2673   3.517 0.000436 ***
## log(distance)         -0.7547     0.2261  -3.338 0.000844 ***
## log(NoOfPools)         0.5707     0.2152   2.652 0.007999 ** 
## I(meanmax + meanmin)   1.4985     0.3088   4.853 1.22e-06 ***
## I(meanmax - meanmin)  -3.8806     0.9002  -4.311 1.63e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 279.99  on 211  degrees of freedom
## Residual deviance: 197.66  on 207  degrees of freedom
## AIC: 207.66
## 
## Number of Fisher Scoring iterations: 5

Fitted values

frogs.glm <- glm(pres.abs ~ log(distance) + log(NoOfPools) 
                 + I(meanmax + meanmin) + I(meanmax - meanmin),  
                 family = binomial,
                 data = frogs)
frogs.glm %>% fitted() %>% head()
##         2         3         4         5         6         7 
## 0.9416691 0.9259228 0.9029415 0.8119619 0.9314070 0.7278038
frogs.glm %>% predict(type = "response") %>% head()
##         2         3         4         5         6         7 
## 0.9416691 0.9259228 0.9029415 0.8119619 0.9314070 0.7278038
frogs.glm %>% predict(type = "link") %>% head() # Scale of linear predictor 
##         2         3         4         5         6         7 
## 2.7815212 2.5256832 2.2303441 1.4628085 2.6085055 0.9835086
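The two prediction scales are connected by the inverse link: applying plogis() to the link-scale predictions reproduces the response-scale ones. A self-contained sketch on the built-in mtcars data (swapped in here because frogs lives in DAAG):

```r
fit <- glm(am ~ wt, family = binomial, data = mtcars)
eta <- predict(fit, type = "link")      # linear predictor
p   <- predict(fit, type = "response")  # fitted probabilities
all.equal(unname(plogis(eta)), unname(p))  # TRUE
```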

Fitted values with approximate SE’s

pred <- frogs.glm %>% 
  predict(type = "link", se.fit = TRUE)
data.frame(fit = pred$fit, se = pred$se.fit) %>% head()
##         fit        se
## 2 2.7815212 0.6859612
## 3 2.5256832 0.4851882
## 4 2.2303441 0.4381720
## 5 1.4628085 0.4805175
## 6 2.6085055 0.5290599
## 7 0.9835086 0.3900549
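A standard use of these link-scale SEs is a Wald-type 95% interval built on the linear-predictor scale and then mapped through the inverse link, which keeps the interval inside (0, 1). A sketch (again on mtcars rather than frogs, to stay self-contained):

```r
fit  <- glm(am ~ wt, family = binomial, data = mtcars)
pred <- predict(fit, type = "link", se.fit = TRUE)
# 95% Wald interval on the linear-predictor scale, back-transformed
lower <- plogis(pred$fit - 1.96 * pred$se.fit)
upper <- plogis(pred$fit + 1.96 * pred$se.fit)
head(data.frame(lower, fit = plogis(pred$fit), upper))
```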

Confidence intervals for the \(\beta\)

frogs.glm %>% confint()
## Waiting for profiling to be done...
##                           2.5 %     97.5 %
## (Intercept)           8.5807639 29.3392400
## log(distance)        -1.2157643 -0.3247928
## log(NoOfPools)        0.1628001  1.0106909
## I(meanmax + meanmin)  0.9219804  2.1385678
## I(meanmax - meanmin) -5.7299615 -2.1868793
frogs.glm %>% coef()
##          (Intercept)        log(distance)       log(NoOfPools) 
##           18.5268418           -0.7546587            0.5707174 
## I(meanmax + meanmin) I(meanmax - meanmin) 
##            1.4984905           -3.8805856
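Because the coefficients live on the log-odds scale, exponentiating them (and their confidence limits) gives odds ratios per unit increase in each predictor. A self-contained sketch on mtcars:

```r
fit <- glm(am ~ wt, family = binomial, data = mtcars)
exp(coef(fit))     # odds ratios per unit change in each predictor
exp(confint(fit))  # profile-likelihood limits, odds-ratio scale
```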

Cross-validating predictive power

set.seed(123)
frogs.glm <- glm(pres.abs ~ log(distance) + log(NoOfPools), 
                 family = binomial, data = frogs)
cv <- CVbinary(frogs.glm)
## 
## Fold:  3 10 2 6 5 4 9 8 7 1
## Internal estimate of accuracy = 0.759
## Cross-validation estimate of accuracy = 0.745

The cross-validation estimate is the proportion of held-out predictions, across the folds, that are correct.
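CVbinary() hides the mechanics; a hand-rolled version of the same idea, classifying with a 0.5 probability threshold (an assumption for this sketch) and using mtcars to stay self-contained, looks roughly like this:

```r
set.seed(123)
dat  <- mtcars
k    <- 5
fold <- sample(rep(1:k, length.out = nrow(dat)))  # random fold labels
correct <- logical(nrow(dat))
for (i in 1:k) {
  train <- dat[fold != i, ]
  test  <- dat[fold == i, ]
  fit   <- glm(am ~ wt, family = binomial, data = train)
  p     <- predict(fit, newdata = test, type = "response")
  correct[fold == i] <- (p > 0.5) == (test$am == 1)
}
mean(correct)  # cross-validated accuracy
```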

Cross-validating predictive power

frogs.glm2 <- glm(pres.abs ~ log(distance) + log(NoOfPools) 
                 + I(meanmax + meanmin) + I(meanmax - meanmin),
                 family = binomial, data = frogs)
cv <- CVbinary(frogs.glm2)
## 
## Fold:  1 5 9 8 7 2 10 6 4 3
## Internal estimate of accuracy = 0.778
## Cross-validation estimate of accuracy = 0.778

Cross-validating predictive power

frogs.glm2 <- glm(pres.abs ~ log(distance) + log(NoOfPools) 
                 + I(meanmax + meanmin) + I(meanmax - meanmin),
                 family = binomial, data = frogs)
cv <- CVbinary(frogs.glm2, nfolds = 5)
## 
## Fold:  3 2 4 5 1
## Internal estimate of accuracy = 0.778
## Cross-validation estimate of accuracy = 0.783