library(magrittr) # pipes
library(DAAG) # data sets and functions
| Function | Purpose |
|---|---|
| glm() | Generalized linear models |
| predict() | Obtain predictions based on a model |
| confint() | Obtain confidence intervals for model parameters |
| coef() | Obtain a model's coefficients |
| DAAG::CVbinary() | Cross-validation for regression with a binary response |

The (general) linear model has \[Y_i=\alpha+\beta X_i+\epsilon_i\] where the \(\epsilon_i\) are IID, \(\epsilon_i \sim N(0,\sigma^2)\).
The generalized linear model relaxes the assumptions of linearity and normality:
\[f(\mathbb{E}[Y]) = \alpha + \beta X\]
Generalized linear models allow for a range of distributions for the residuals and a variety of link functions \(f\).
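With the Gaussian family and the identity link, the GLM reduces to the ordinary linear model. A quick sketch with simulated data (variable names are illustrative):

```r
# With family = gaussian(link = "identity"), glm() reproduces lm()
set.seed(1)
x <- 1:20
y <- 2 + 3 * x + rnorm(20)

fit_lm  <- lm(y ~ x)
fit_glm <- glm(y ~ x, family = gaussian(link = "identity"))

all.equal(coef(fit_lm), coef(fit_glm))  # TRUE
```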
We will work with the glm function in the stats package. If your needs grow more exotic, there is a plethora of alternative GLM-related packages.
glm(formula, family = gaussian, data, weights, subset,
na.action, start = NULL, etastart, mustart, offset,
control = list(...), model = TRUE, method = "glm.fit",
x = FALSE, y = TRUE, singular.ok = TRUE, contrasts = NULL, ...)
The family argument gives the distribution of the residuals and the link function. For instance:
family = gaussian(link = "identity") # This is the default for glm
family = poisson(link = "log")
Here, the link functions are the defaults and need not be specified. The help(family) page describes the link functions accepted by glm, or you may specify your own.
| Family | Default Link Function |
|---|---|
| binomial | (link = "logit") |
| gaussian | (link = "identity") |
| Gamma | (link = "inverse") |
| inverse.gaussian | (link = "1/mu^2") |
| poisson | (link = "log") |
| quasi | (link = "identity", variance = "constant") |
| quasibinomial | (link = "logit") |
| quasipoisson | (link = "log") |
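A non-default link is requested simply by naming it in the family call. As a sketch with simulated data, here is the same binary response fitted with the default logit link and with a probit link:

```r
# Simulated binary response; logit is binomial's default link,
# probit is an accepted alternative
set.seed(42)
x <- seq(-2, 2, length.out = 200)
y <- rbinom(200, size = 1, prob = plogis(0.5 + 1.5 * x))

fit_logit  <- glm(y ~ x, family = binomial)                   # logit by default
fit_probit <- glm(y ~ x, family = binomial(link = "probit"))

c(logit = unname(coef(fit_logit)["x"]),
  probit = unname(coef(fit_probit)["x"]))
```

The two slopes are not directly comparable: they live on different link scales.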
- nlme package for Gaussian (non)linear mixed-effects models
- mgcv package for generalized additive (mixed) models
- lcmm package for latent class linear mixed models and joint analysis of longitudinal and time-to-event data

Logistic regression models the log(odds) of a binary outcome as a linear function of the predictors:
\[ \text{logit}(p_i) = \log\frac{p_i}{1-p_i} = \alpha + \beta X_i \]
Where the outcome \(Y_i\) follows a binomial distribution with probability \(p_i\) of success.
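The logit and its inverse are available in base R as qlogis() and plogis(). For example:

```r
p <- 0.75
qlogis(p)          # logit: log(0.75 / 0.25) = log(3)
plogis(qlogis(p))  # inverse logit recovers p = 0.75
```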
We explore the anesthetic dataset: thirty patients were given an anesthetic agent maintained at a predetermined level (conc) for 15 minutes before an incision was made. It was then noted whether the patient moved, i.e. jerked or twisted.
anesthetic %>% head
## move conc logconc nomove
## 1 0 1.0 0.0000000 1
## 2 1 1.2 0.1823216 0
## 3 0 1.4 0.3364722 1
## 4 1 1.4 0.3364722 0
## 5 1 1.2 0.1823216 0
## 6 0 2.5 0.9162907 1
glm(nomove ~ conc, family = binomial(link="logit"), data=anesthetic) %>% summary()
##
## Call:
## glm(formula = nomove ~ conc, family = binomial(link = "logit"),
## data = anesthetic)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.76666 -0.74407 0.03413 0.68666 2.06900
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.469 2.418 -2.675 0.00748 **
## conc 5.567 2.044 2.724 0.00645 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 41.455 on 29 degrees of freedom
## Residual deviance: 27.754 on 28 degrees of freedom
## AIC: 31.754
##
## Number of Fisher Scoring iterations: 5
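The coefficients live on the log-odds scale, so exponentiating them gives odds ratios. Using the conc estimate (5.567) and its standard error (2.044) reported above:

```r
# Odds ratio for not moving, per one-unit increase in conc
exp(5.567)

# Approximate 95% Wald interval on the odds-ratio scale
exp(5.567 + c(-1.96, 1.96) * 2.044)
```

The point estimate is large but the interval is very wide, reflecting the small sample.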
Say we want to analyse counts in the aggregated table:
anestot <- aggregate(anesthetic[, c("move","nomove")],
by = list(conc = anesthetic$conc), FUN = sum)
anestot
## conc move nomove
## 1 0.8 6 1
## 2 1.0 4 1
## 3 1.2 2 4
## 4 1.4 2 4
## 5 1.6 0 4
## 6 2.5 0 2
anestot$total <- apply(anestot[, c("move","nomove")], 1, sum)
anestot$prop <- anestot$nomove / anestot$total
anestot
## conc move nomove total prop
## 1 0.8 6 1 7 0.1428571
## 2 1.0 4 1 5 0.2000000
## 3 1.2 2 4 6 0.6666667
## 4 1.4 2 4 6 0.6666667
## 5 1.6 0 4 4 1.0000000
## 6 2.5 0 2 2 1.0000000
glm(prop ~ conc, family = binomial(link="logit"), weights = total, data=anestot) %>%
summary()
##
## Call:
## glm(formula = prop ~ conc, family = binomial(link = "logit"),
## data = anestot, weights = total)
##
## Deviance Residuals:
## 1 2 3 4 5 6
## 0.20147 -0.45367 0.56890 -0.70000 0.81838 0.04826
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.469 2.419 -2.675 0.00748 **
## conc 5.567 2.044 2.724 0.00645 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 15.4334 on 5 degrees of freedom
## Residual deviance: 1.7321 on 4 degrees of freedom
## AIC: 13.811
##
## Number of Fisher Scoring iterations: 5
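Note that the estimates match the ungrouped fit above: a Bernoulli fit (one row per trial) and a weighted-proportion fit of the same data maximize the same likelihood up to a constant, so the coefficients are identical. A sketch with simulated data:

```r
set.seed(7)
conc <- rep(c(0.8, 1.0, 1.2, 1.4, 1.6), each = 10)
resp <- rbinom(length(conc), size = 1, prob = plogis(-4 + 3 * conc))

# One row per trial
fit_bern <- glm(resp ~ conc, family = binomial)

# Aggregated: proportion of successes, weighted by group size
d <- aggregate(resp, by = list(conc = conc), FUN = mean)
names(d)[2] <- "prop"
d$n <- as.vector(table(conc))
fit_prop <- glm(prop ~ conc, family = binomial, weights = n, data = d)

all.equal(unname(coef(fit_bern)), unname(coef(fit_prop)))
```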
We analyse the frogs dataset, which details the presence or absence of the Southern Corroboree frog at a number of reference points in the Snowy Mountains area of New South Wales, Australia.
Code for the previous slide:
with(frogs, pairs(cbind(distance, NoOfPools, NoOfSites, avrain, altitude,
meanmax+meanmin, meanmax-meanmin),
col = "gray", panel = panel.smooth,
labels = c("Distance", "NoOfPools",
"NoOfSites", "AvRainfall", "Altitude",
"meanmax + meanmin", "meanmax - meanmin")))
##
## Call:
## glm(formula = pres.abs ~ log(distance) + log(NoOfPools) + NoOfSites +
## avrain + I(meanmax + meanmin) + I(meanmax - meanmin), family = binomial,
## data = frogs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9763 -0.7189 -0.2786 0.7970 2.5745
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 18.2688999 16.1381912 1.132 0.25762
## log(distance) -0.7583198 0.2558117 -2.964 0.00303 **
## log(NoOfPools) 0.5708953 0.2153335 2.651 0.00802 **
## NoOfSites -0.0036201 0.1061469 -0.034 0.97279
## avrain 0.0007003 0.0411710 0.017 0.98643
## I(meanmax + meanmin) 1.4958055 0.3153152 4.744 2.1e-06 ***
## I(meanmax - meanmin) -3.8582668 1.2783921 -3.018 0.00254 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 279.99 on 211 degrees of freedom
## Residual deviance: 197.65 on 205 degrees of freedom
## AIC: 211.65
##
## Number of Fisher Scoring iterations: 5
##
## Call:
## glm(formula = pres.abs ~ log(distance) + log(NoOfPools) + I(meanmax +
## meanmin) + I(meanmax - meanmin), family = binomial, data = frogs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9753 -0.7224 -0.2780 0.7974 2.5736
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 18.5268 5.2673 3.517 0.000436 ***
## log(distance) -0.7547 0.2261 -3.338 0.000844 ***
## log(NoOfPools) 0.5707 0.2152 2.652 0.007999 **
## I(meanmax + meanmin) 1.4985 0.3088 4.853 1.22e-06 ***
## I(meanmax - meanmin) -3.8806 0.9002 -4.311 1.63e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 279.99 on 211 degrees of freedom
## Residual deviance: 197.66 on 207 degrees of freedom
## AIC: 207.66
##
## Number of Fisher Scoring iterations: 5
frogs.glm <- glm(pres.abs ~ log(distance) + log(NoOfPools)
+ I(meanmax + meanmin) + I(meanmax - meanmin),
family = binomial,
data = frogs)
frogs.glm %>% fitted() %>% head()
## 2 3 4 5 6 7
## 0.9416691 0.9259228 0.9029415 0.8119619 0.9314070 0.7278038
frogs.glm %>% predict(type = "response") %>% head()
## 2 3 4 5 6 7
## 0.9416691 0.9259228 0.9029415 0.8119619 0.9314070 0.7278038
frogs.glm %>% predict(type = "link") %>% head() # Scale of linear predictor
## 2 3 4 5 6 7
## 2.7815212 2.5256832 2.2303441 1.4628085 2.6085055 0.9835086
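For a logit-link model the two prediction scales are connected by the inverse link: applying plogis() to the link-scale predictions reproduces the response-scale predictions. A sketch with simulated data (the frogs model behaves the same way):

```r
set.seed(11)
x <- rnorm(50)
y <- rbinom(50, size = 1, prob = plogis(x))
fit <- glm(y ~ x, family = binomial)

all.equal(plogis(predict(fit, type = "link")),
          predict(fit, type = "response"))  # TRUE
```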
pred <- frogs.glm %>%
predict(type = "link", se.fit = TRUE)
data.frame(fit = pred$fit, se = pred$se.fit) %>% head()
## fit se
## 2 2.7815212 0.6859612
## 3 2.5256832 0.4851882
## 4 2.2303441 0.4381720
## 5 1.4628085 0.4805175
## 6 2.6085055 0.5290599
## 7 0.9835086 0.3900549
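Standard errors on the link scale are useful because an approximate 95% pointwise interval built there and then mapped through plogis() is guaranteed to stay inside [0, 1]. A sketch with simulated data:

```r
set.seed(11)
x <- rnorm(50)
y <- rbinom(50, size = 1, prob = plogis(x))
fit <- glm(y ~ x, family = binomial)

pr    <- predict(fit, type = "link", se.fit = TRUE)
lower <- plogis(pr$fit - 1.96 * pr$se.fit)  # interval on link scale,
upper <- plogis(pr$fit + 1.96 * pr$se.fit)  # mapped to probabilities
head(data.frame(lower, upper))
```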
frogs.glm %>% confint()
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 8.5807639 29.3392400
## log(distance) -1.2157643 -0.3247928
## log(NoOfPools) 0.1628001 1.0106909
## I(meanmax + meanmin) 0.9219804 2.1385678
## I(meanmax - meanmin) -5.7299615 -2.1868793
frogs.glm %>% coef()
## (Intercept) log(distance) log(NoOfPools)
## 18.5268418 -0.7546587 0.5707174
## I(meanmax + meanmin) I(meanmax - meanmin)
## 1.4984905 -3.8805856
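As with the anesthetic model, exponentiating the coefficient estimates and the confidence limits moves them onto the odds-ratio scale. A self-contained sketch with simulated data (the same calls apply to frogs.glm):

```r
set.seed(3)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = plogis(1 + x))
fit <- glm(y ~ x, family = binomial)

exp(coef(fit))     # odds ratios
exp(confint(fit))  # profile-likelihood 95% CIs, odds-ratio scale
```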
set.seed(123)
frogs.glm <- glm(pres.abs ~ log(distance) + log(NoOfPools),
family = binomial, data = frogs)
cv <- CVbinary(frogs.glm)
##
## Fold: 3 10 2 6 5 4 9 8 7 1
## Internal estimate of accuracy = 0.759
## Cross-validation estimate of accuracy = 0.745
The cross-validation measure is the proportion of held-out predictions, pooled over the folds, that are correct.
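In spirit, this is what CVbinary computes: fit on all folds but one, predict the held-out fold, threshold at 0.5, and pool the accuracy. A base-R sketch with simulated data (CVbinary itself differs in details such as fold construction):

```r
set.seed(123)
d <- data.frame(x = rnorm(100))
d$y <- rbinom(100, size = 1, prob = plogis(2 * d$x))

folds <- sample(rep(1:10, length.out = nrow(d)))  # random 10-fold split
acc <- sapply(1:10, function(k) {
  fit <- glm(y ~ x, family = binomial, data = d[folds != k, ])
  p   <- predict(fit, newdata = d[folds == k, ], type = "response")
  mean((p > 0.5) == d$y[folds == k])               # held-out accuracy
})
mean(acc)  # cross-validation estimate of accuracy
```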
frogs.glm2 <- glm(pres.abs ~ log(distance) + log(NoOfPools)
+ I(meanmax + meanmin) + I(meanmax - meanmin),
family = binomial, data = frogs)
cv <- CVbinary(frogs.glm2)
##
## Fold: 1 5 9 8 7 2 10 6 4 3
## Internal estimate of accuracy = 0.778
## Cross-validation estimate of accuracy = 0.778
frogs.glm2 <- glm(pres.abs ~ log(distance) + log(NoOfPools)
+ I(meanmax + meanmin) + I(meanmax - meanmin),
family = binomial, data = frogs)
cv <- CVbinary(frogs.glm2, nfolds = 5)
##
## Fold: 3 2 4 5 1
## Internal estimate of accuracy = 0.778
## Cross-validation estimate of accuracy = 0.783