library(magrittr) # pipes
library(DAAG) # data sets and functions
| Function | Purpose |
|---|---|
| glm() | Generalized linear models |
| predict() | Obtain predictions based on a model |
| confint() | Obtain confidence intervals for model parameters |
| coef() | Obtain a model's coefficients |
| DAAG::CVbinary() | Cross-validation for regression with a binary response |

The (general) linear model has \[Y_i=\alpha+\beta X_i+\epsilon_i\] where the \(\epsilon_i\) are IID, \(\epsilon_i \sim N(0,\sigma^2)\).
The generalized linear model relaxes the assumptions of linearity and normality:
\[f(\mathbb{E}[Y]) = \alpha + \beta X\]
Generalized linear models allow for a range of distributions for the residuals and a variety of link functions \(f\).
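With the Gaussian family and the identity link, the GLM reduces to the ordinary linear model. A quick sketch with simulated data (variable names are illustrative):

```r
# With family = gaussian(link = "identity"), glm() reproduces lm()
set.seed(1)
x <- 1:20
y <- 2 + 3 * x + rnorm(20)

fit_lm  <- lm(y ~ x)
fit_glm <- glm(y ~ x, family = gaussian(link = "identity"))

all.equal(coef(fit_lm), coef(fit_glm))  # TRUE
```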
We will work with the glm function in the stats package. If your needs grow more exotic, there is a plethora of alternative GLM-related packages.
glm(formula, family = gaussian, data, weights, subset,
na.action, start = NULL, etastart, mustart, offset,
control = list(...), model = TRUE, method = "glm.fit",
x = FALSE, y = TRUE, singular.ok = TRUE, contrasts = NULL, ...)
The family argument gives the distribution of the residuals and the link function. For instance:
family = gaussian(link = "identity") # This is the default for glm
family = poisson(link = "log")
Here, the link functions are the defaults and need not be specified. The help(family) page describes the link functions accepted by glm, or you may specify your own.
| Family | Default Link Function |
|---|---|
| binomial | (link = "logit") |
| gaussian | (link = "identity") |
| Gamma | (link = "inverse") |
| inverse.gaussian | (link = "1/mu^2") |
| poisson | (link = "log") |
| quasi | (link = "identity", variance = "constant") |
| quasibinomial | (link = "logit") |
| quasipoisson | (link = "log") |
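A non-default link is requested simply by naming it in the family call. As a sketch with simulated data, here is the same binary response fitted with the default logit link and with a probit link:

```r
# Simulated binary response; logit is binomial's default link,
# probit is an accepted alternative
set.seed(42)
x <- seq(-2, 2, length.out = 200)
y <- rbinom(200, size = 1, prob = plogis(0.5 + 1.5 * x))

fit_logit  <- glm(y ~ x, family = binomial)                   # logit by default
fit_probit <- glm(y ~ x, family = binomial(link = "probit"))

c(logit = unname(coef(fit_logit)["x"]),
  probit = unname(coef(fit_probit)["x"]))
```

The two slopes are not directly comparable: they live on different link scales.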
- nlme package for Gaussian (non)linear mixed-effects models
- mgcv package for generalized additive (mixed) models
- lcmm package for latent class linear mixed models and joint analysis of longitudinal and time-to-event data

Logistic regression models the log(odds) of a binary outcome as a linear function of the predictors:
\[ \text{logit}(p_i) = \log\frac{p_i}{1-p_i} = \alpha + \beta X_i \]
Where the outcome \(Y_i\) follows a binomial distribution with probability \(p_i\) of success.
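The logit and its inverse are available in base R as qlogis() and plogis(). For example:

```r
p <- 0.75
qlogis(p)          # logit: log(0.75 / 0.25) = log(3)
plogis(qlogis(p))  # inverse logit recovers p = 0.75
```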
We explore the anesthetic dataset: thirty patients were given an anesthetic agent maintained at a predetermined level (conc) for 15 minutes before an incision was made. It was then noted whether the patient moved, i.e. jerked or twisted.
anesthetic %>% head
## move conc logconc nomove
## 1 0 1.0 0.0000000 1
## 2 1 1.2 0.1823216 0
## 3 0 1.4 0.3364722 1
## 4 1 1.4 0.3364722 0
## 5 1 1.2 0.1823216 0
## 6 0 2.5 0.9162907 1
glm(nomove ~ conc, family = binomial(link="logit"), data=anesthetic) %>% summary()
##
## Call:
## glm(formula = nomove ~ conc, family = binomial(link = "logit"),
## data = anesthetic)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.76666 -0.74407 0.03413 0.68666 2.06900
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.469 2.418 -2.675 0.00748 **
## conc 5.567 2.044 2.724 0.00645 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 41.455 on 29 degrees of freedom
## Residual deviance: 27.754 on 28 degrees of freedom
## AIC: 31.754
##
## Number of Fisher Scoring iterations: 5
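The coefficients live on the log-odds scale, so exponentiating them gives odds ratios. Using the conc estimate (5.567) and its standard error (2.044) reported above:

```r
# Odds ratio for not moving, per one-unit increase in conc
exp(5.567)

# Approximate 95% Wald interval on the odds-ratio scale
exp(5.567 + c(-1.96, 1.96) * 2.044)
```

The point estimate is large but the interval is very wide, reflecting the small sample.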
Say we want to analyse counts in the aggregated table:
anestot <- aggregate(anesthetic[, c("move","nomove")],
by = list(conc = anesthetic$conc), FUN = sum)
anestot
## conc move nomove
## 1 0.8 6 1
## 2 1.0 4 1
## 3 1.2 2 4
## 4 1.4 2 4
## 5 1.6 0 4
## 6 2.5 0 2
anestot$total <- apply(anestot[, c("move","nomove")], 1, sum)
anestot$prop <- anestot$nomove / anestot$total
anestot
## conc move nomove total prop
## 1 0.8 6 1 7 0.1428571
## 2 1.0 4 1 5 0.2000000
## 3 1.2 2 4 6 0.6666667
## 4 1.4 2 4 6 0.6666667
## 5 1.6 0 4 4 1.0000000
## 6 2.5 0 2 2 1.0000000
glm(prop ~ conc, family = binomial(link="logit"), weights = total, data=anestot) %>%
summary()
##
## Call:
## glm(formula = prop ~ conc, family = binomial(link = "logit"),
## data = anestot, weights = total)
##
## Deviance Residuals:
## 1 2 3 4 5 6
## 0.20147 -0.45367 0.56890 -0.70000 0.81838 0.04826
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.469 2.419 -2.675 0.00748 **
## conc 5.567 2.044 2.724 0.00645 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 15.4334 on 5 degrees of freedom
## Residual deviance: 1.7321 on 4 degrees of freedom
## AIC: 13.811
##
## Number of Fisher Scoring iterations: 5
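Note that the estimates match the ungrouped fit above: a Bernoulli fit (one row per trial) and a weighted-proportion fit of the same data maximize the same likelihood up to a constant, so the coefficients are identical. A sketch with simulated data:

```r
set.seed(7)
conc <- rep(c(0.8, 1.0, 1.2, 1.4, 1.6), each = 10)
resp <- rbinom(length(conc), size = 1, prob = plogis(-4 + 3 * conc))

# One row per trial
fit_bern <- glm(resp ~ conc, family = binomial)

# Aggregated: proportion of successes, weighted by group size
d <- aggregate(resp, by = list(conc = conc), FUN = mean)
names(d)[2] <- "prop"
d$n <- as.vector(table(conc))
fit_prop <- glm(prop ~ conc, family = binomial, weights = n, data = d)

all.equal(unname(coef(fit_bern)), unname(coef(fit_prop)))
```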
We analyse the frogs dataset, which details the presence or absence of the Southern Corroboree frog at a number of reference points in the Snowy Mountains area of New South Wales, Australia.
Code for the previous slide:
with(frogs, pairs(cbind(distance, NoOfPools, NoOfSites, avrain, altitude,
meanmax+meanmin, meanmax-meanmin),
col = "gray", panel = panel.smooth,
labels = c("Distance", "NoOfPools",
"NoOfSites", "AvRainfall", "Altitude",
"meanmax + meanmin", "meanmax - meanmin")))
##
## Call:
## glm(formula = pres.abs ~ log(distance) + log(NoOfPools) + NoOfSites +
## avrain + I(meanmax + meanmin) + I(meanmax - meanmin), family = binomial,
## data = frogs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9763 -0.7189 -0.2786 0.7970 2.5745
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 18.2688999 16.1381912 1.132 0.25762
## log(distance) -0.7583198 0.2558117 -2.964 0.00303 **
## log(NoOfPools) 0.5708953 0.2153335 2.651 0.00802 **
## NoOfSites -0.0036201 0.1061469 -0.034 0.97279
## avrain 0.0007003 0.0411710 0.017 0.98643
## I(meanmax + meanmin) 1.4958055 0.3153152 4.744 2.1e-06 ***
## I(meanmax - meanmin) -3.8582668 1.2783921 -3.018 0.00254 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 279.99 on 211 degrees of freedom
## Residual deviance: 197.65 on 205 degrees of freedom
## AIC: 211.65
##
## Number of Fisher Scoring iterations: 5
##
## Call:
## glm(formula = pres.abs ~ log(distance) + log(NoOfPools) + I(meanmax +
## meanmin) + I(meanmax - meanmin), family = binomial, data = frogs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9753 -0.7224 -0.2780 0.7974 2.5736
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 18.5268 5.2673 3.517 0.000436 ***
## log(distance) -0.7547 0.2261 -3.338 0.000844 ***
## log(NoOfPools) 0.5707 0.2152 2.652 0.007999 **
## I(meanmax + meanmin) 1.4985 0.3088 4.853 1.22e-06 ***
## I(meanmax - meanmin) -3.8806 0.9002 -4.311 1.63e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 279.99 on 211 degrees of freedom
## Residual deviance: 197.66 on 207 degrees of freedom
## AIC: 207.66
##
## Number of Fisher Scoring iterations: 5
frogs.glm <- glm(pres.abs ~ log(distance) + log(NoOfPools)
+ I(meanmax + meanmin) + I(meanmax - meanmin),
family = binomial,
data = frogs)
frogs.glm %>% fitted() %>% head()
## 2 3 4 5 6 7
## 0.9416691 0.9259228 0.9029415 0.8119619 0.9314070 0.7278038
frogs.glm %>% predict(type = "response") %>% head()
## 2 3 4 5 6 7
## 0.9416691 0.9259228 0.9029415 0.8119619 0.9314070 0.7278038
frogs.glm %>% predict(type = "link") %>% head() # Scale of linear predictor
## 2 3 4 5 6 7
## 2.7815212 2.5256832 2.2303441 1.4628085 2.6085055 0.9835086
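For a logit-link model the two prediction scales are connected by the inverse link: applying plogis() to the link-scale predictions reproduces the response-scale predictions. A sketch with simulated data (the frogs model behaves the same way):

```r
set.seed(11)
x <- rnorm(50)
y <- rbinom(50, size = 1, prob = plogis(x))
fit <- glm(y ~ x, family = binomial)

all.equal(plogis(predict(fit, type = "link")),
          predict(fit, type = "response"))  # TRUE
```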
pred <- frogs.glm %>%
predict(type = "link", se.fit = TRUE)
data.frame(fit = pred$fit, se = pred$se.fit) %>% head()
## fit se
## 2 2.7815212 0.6859612
## 3 2.5256832 0.4851882
## 4 2.2303441 0.4381720
## 5 1.4628085 0.4805175
## 6 2.6085055 0.5290599
## 7 0.9835086 0.3900549
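Standard errors on the link scale are useful because an approximate 95% pointwise interval built there and then mapped through plogis() is guaranteed to stay inside [0, 1]. A sketch with simulated data:

```r
set.seed(11)
x <- rnorm(50)
y <- rbinom(50, size = 1, prob = plogis(x))
fit <- glm(y ~ x, family = binomial)

pr    <- predict(fit, type = "link", se.fit = TRUE)
lower <- plogis(pr$fit - 1.96 * pr$se.fit)  # interval on link scale,
upper <- plogis(pr$fit + 1.96 * pr$se.fit)  # mapped to probabilities
head(data.frame(lower, upper))
```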
frogs.glm %>% confint()
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 8.5807639 29.3392400
## log(distance) -1.2157643 -0.3247928
## log(NoOfPools) 0.1628001 1.0106909
## I(meanmax + meanmin) 0.9219804 2.1385678
## I(meanmax - meanmin) -5.7299615 -2.1868793
frogs.glm %>% coef()
## (Intercept) log(distance) log(NoOfPools)
## 18.5268418 -0.7546587 0.5707174
## I(meanmax + meanmin) I(meanmax - meanmin)
## 1.4984905 -3.8805856
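As with the anesthetic model, exponentiating the coefficient estimates and the confidence limits moves them onto the odds-ratio scale. A self-contained sketch with simulated data (the same calls apply to frogs.glm):

```r
set.seed(3)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = plogis(1 + x))
fit <- glm(y ~ x, family = binomial)

exp(coef(fit))     # odds ratios
exp(confint(fit))  # profile-likelihood 95% CIs, odds-ratio scale
```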
set.seed(123)
frogs.glm <- glm(pres.abs ~ log(distance) + log(NoOfPools),
family = binomial, data = frogs)
cv <- CVbinary(frogs.glm)
##
## Fold: 3 10 2 6 5 4 9 8 7 1
## Internal estimate of accuracy = 0.759
## Cross-validation estimate of accuracy = 0.745
The cross-validation measure is the proportion of held-out predictions, pooled over the folds, that are correct.
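In spirit, this is what CVbinary computes: fit on all folds but one, predict the held-out fold, threshold at 0.5, and pool the accuracy. A base-R sketch with simulated data (CVbinary itself differs in details such as fold construction):

```r
set.seed(123)
d <- data.frame(x = rnorm(100))
d$y <- rbinom(100, size = 1, prob = plogis(2 * d$x))

folds <- sample(rep(1:10, length.out = nrow(d)))  # random 10-fold split
acc <- sapply(1:10, function(k) {
  fit <- glm(y ~ x, family = binomial, data = d[folds != k, ])
  p   <- predict(fit, newdata = d[folds == k, ], type = "response")
  mean((p > 0.5) == d$y[folds == k])               # held-out accuracy
})
mean(acc)  # cross-validation estimate of accuracy
```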
frogs.glm2 <- glm(pres.abs ~ log(distance) + log(NoOfPools)
+ I(meanmax + meanmin) + I(meanmax - meanmin),
family = binomial, data = frogs)
cv <- CVbinary(frogs.glm2)
##
## Fold: 1 5 9 8 7 2 10 6 4 3
## Internal estimate of accuracy = 0.778
## Cross-validation estimate of accuracy = 0.778
frogs.glm2 <- glm(pres.abs ~ log(distance) + log(NoOfPools)
+ I(meanmax + meanmin) + I(meanmax - meanmin),
family = binomial, data = frogs)
cv <- CVbinary(frogs.glm2, nfolds = 5)
##
## Fold: 3 2 4 5 1
## Internal estimate of accuracy = 0.778
## Cross-validation estimate of accuracy = 0.783