The linear model

This Lecture:

  • Linear models in R
  • Checking assumptions
  • Model fit & comparison
  • Predictions

I use the following packages in this lecture:

library(magrittr) #for pipes
library(mice)     #for the boys data
library(Ecdat)    #for the Benefits data
library(ggplot2)

The linear model

The function lm() is a base R function that allows you to fit a variety of linear models.

args(lm)
## function (formula, data, subset, weights, na.action, method = "qr", 
##     model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
##     contrasts = NULL, offset, ...) 
## NULL

If we want to know what these arguments do we can ask R:

?lm

This will open a help page on the lm() function.

Benefits data

Unemployment of blue-collar workers:

Column stateur:

  • state unemployment rate (in %)

Column tenure:

  • years of tenure in the lost job

Column age:

  • age in years

Column joblost:

  • reason for loss of job (slack work, position abolished, seasonal job ended, other)

Linear regression

Linear regression model \[ y_i=\alpha+\beta{x}_i+\varepsilon_i \]

Assumptions:

  • \(y_i\) conditionally normal with mean \(\mu_i=\alpha+\beta{x}_i\)
  • \(\varepsilon_i\) are \(i.i.d.\) with mean 0 and (constant) variance \(\sigma^2\)
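
A minimal sketch of these assumptions (using simulated data, not the Benefits data): if we generate \(y\) from the model, lm() should recover \(\alpha\) and \(\beta\).

```r
# Simulate data that satisfies the linear-model assumptions and check
# that lm() recovers the true parameters (illustrative, simulated data)
set.seed(123)
n     <- 1000
alpha <- 2    # true intercept
beta  <- 0.5  # true slope
x     <- runif(n, 0, 10)
y     <- alpha + beta * x + rnorm(n, mean = 0, sd = 1)  # i.i.d. N(0, 1) errors

fit_sim <- lm(y ~ x)
coef(fit_sim)  # estimates should be close to alpha = 2 and beta = 0.5
```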

Continuous predictors

To obtain a linear regression model with a main effect for tenure, we formulate \(stateur\sim tenure\)

fit <- Benefits %$%
  lm(stateur~tenure)
fit
## 
## Call:
## lm(formula = stateur ~ tenure)
## 
## Coefficients:
## (Intercept)       tenure  
##    7.460299     0.008957

Continuous predictors: more detail

fit %>% summary
## 
## Call:
## lm(formula = stateur ~ tenure)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2693 -1.8051 -0.3693  1.4770 10.5307 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.460299   0.048214 154.732   <2e-16 ***
## tenure      0.008957   0.005701   1.571    0.116    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.5 on 4875 degrees of freedom
## Multiple R-squared:  0.000506,   Adjusted R-squared:  0.000301 
## F-statistic: 2.468 on 1 and 4875 DF,  p-value: 0.1163

Continuous predictors

To obtain a linear model with just main effects for tenure and age, we formulate \(stateur\sim tenure + age\)

fit <- Benefits %$%
  lm(stateur ~ tenure + age)
fit
## 
## Call:
## lm(formula = stateur ~ tenure + age)
## 
## Coefficients:
## (Intercept)       tenure          age  
##     7.91128      0.02072     -0.01433

Continuous predictors: more detail

fit %>% summary
## 
## Call:
## lm(formula = stateur ~ tenure + age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3165 -1.8217 -0.3982  1.4691 10.7142 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.911283   0.130724  60.519  < 2e-16 ***
## tenure       0.020721   0.006517   3.179 0.001485 ** 
## age         -0.014327   0.003861  -3.711 0.000209 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.497 on 4874 degrees of freedom
## Multiple R-squared:  0.003322,   Adjusted R-squared:  0.002913 
## F-statistic: 8.122 on 2 and 4874 DF,  p-value: 0.0003009

Continuous predictors: interaction effects

To obtain a linear model with an interaction effect for tenure and age, we formulate \(stateur\sim tenure * age\)

fit <- Benefits %$% lm(stateur ~ tenure * age)

Equivalently:

fit <- Benefits %$% lm(stateur ~ tenure + age + tenure:age)
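
Under the hood, the interaction term is simply the element-wise product of the two predictor columns in the design matrix. A sketch with the built-in mtcars data (the Benefits data is not assumed here):

```r
# The design matrix of an interaction model contains a product column
fit_int <- lm(mpg ~ wt * hp, data = mtcars)
X <- model.matrix(fit_int)
colnames(X)  # "(Intercept)" "wt" "hp" "wt:hp"

# the wt:hp column equals wt * hp, element-wise
all(X[, "wt:hp"] == X[, "wt"] * X[, "hp"])
```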

Continuous predictors: interaction effects

fit
## 
## Call:
## lm(formula = stateur ~ tenure * age)
## 
## Coefficients:
## (Intercept)       tenure          age   tenure:age  
##    7.686119     0.076781    -0.009180    -0.001174

and with more detail

fit %>% summary
## 
## Call:
## lm(formula = stateur ~ tenure * age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4060 -1.8214 -0.3735  1.4856 10.6933 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.6861194  0.1708825  44.979  < 2e-16 ***
## tenure       0.0767806  0.0281768   2.725  0.00645 ** 
## age         -0.0091803  0.0046074  -1.992  0.04637 *  
## tenure:age  -0.0011741  0.0005742  -2.045  0.04091 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.496 on 4873 degrees of freedom
## Multiple R-squared:  0.004176,   Adjusted R-squared:  0.003563 
## F-statistic: 6.812 on 3 and 4873 DF,  p-value: 0.0001405

Categorical predictors in the linear model

If a categorical variable is passed to lm(), R automatically converts it to a set of dummy variables, taking the first factor level as the reference category. If we want another reference category, we can change it with the function relevel().

fit_c <- Benefits %$% lm(stateur ~ joblost)

and again with more detail

fit_c %>% summary
## 
## Call:
## lm(formula = stateur ~ joblost)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4848 -1.8062 -0.4062  1.5152 10.6806 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                7.36412    0.05615 131.161  < 2e-16 ***
## joblostposition_abolished -0.04472    0.13656  -0.327    0.743    
## joblostseasonal_job_ended -0.05790    0.19582  -0.296    0.767    
## joblostslack_work          0.32072    0.07639   4.199 2.73e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.496 on 4873 degrees of freedom
## Multiple R-squared:  0.004427,   Adjusted R-squared:  0.003814 
## F-statistic: 7.223 on 3 and 4873 DF,  p-value: 7.81e-05
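
The note above mentions relevel(); a quick sketch with the built-in iris data (not Benefits) of how it changes the reference category:

```r
# Default: the first factor level ("setosa") is the reference category
fit1 <- lm(Sepal.Length ~ Species, data = iris)
names(coef(fit1))  # "(Intercept)" "Speciesversicolor" "Speciesvirginica"

# After relevel(), "versicolor" becomes the reference category
iris2 <- iris
iris2$Species <- relevel(iris2$Species, ref = "versicolor")
fit2 <- lm(Sepal.Length ~ Species, data = iris2)
names(coef(fit2))  # "(Intercept)" "Speciessetosa" "Speciesvirginica"
```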

Components of the linear model

names(fit_c) # the names of the list with output objects
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "contrasts"     "xlevels"       "call"          "terms"        
## [13] "model"
fit_c$coef  # show the estimated coefficients
##               (Intercept) joblostposition_abolished 
##                7.36411943               -0.04471645 
## joblostseasonal_job_ended         joblostslack_work 
##               -0.05790474                0.32072122
coef(fit_c) # alternative

Checking assumptions

Linearity

Benefits %>%
  ggplot(aes(tenure, stateur)) + 
  geom_point() + 
  geom_smooth(method = "loess", col = "blue") + 
  geom_smooth(method = "lm", col = "orange")

Linearity

The loess curve suggests slight non-linearity

Adding a squared term

Benefits %$%
  lm(stateur ~ tenure + I(tenure^2)) %>%
  summary()
## 
## Call:
## lm(formula = stateur ~ tenure + I(tenure^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2209 -1.8143 -0.3809  1.4551 10.5938 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.3648387  0.0619654 118.854  < 2e-16 ***
## tenure       0.0427128  0.0149071   2.865  0.00418 ** 
## I(tenure^2) -0.0013388  0.0005463  -2.451  0.01430 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.499 on 4874 degrees of freedom
## Multiple R-squared:  0.001736,   Adjusted R-squared:  0.001326 
## F-statistic: 4.238 on 2 and 4874 DF,  p-value: 0.01449

Constant error variance?

par(mfrow = c(1, 2))
fit %>%
  plot(which = c(1, 3), cex = .6)

No constant error variance!

par(mfrow = c(1, 2))
boys %$%
  lm(bmi ~ age) %>%
  plot(which = c(1, 3), cex = .6)

Normality of errors

fit %>%
  plot(which = 2, cex = .6)

The QQ-plot shows divergence from normality in the tails

Histograms
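
A histogram of the residuals is another quick visual check of normality; a sketch using the simple anscombe fit that appears later in this lecture:

```r
# Histogram of residuals as an informal normality check
fit_a <- lm(y1 ~ x1, data = anscombe)
hist(residuals(fit_a), breaks = 10,
     main = "Histogram of residuals", xlab = "Residual")
```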

Outliers and influential cases

par(mfrow = c(1, 2), cex = .6)
fit %>% plot(which = c(4, 5))

There are quite a few cases with standardized residuals \(|e_z|>2\) (right plot). There are no cases with Cook’s Distance \(>1\), but some cases stand out.

Fine

High leverage, low residual

Low leverage, high residual

High leverage, high residual

Model fit & comparison

R-squared

summary(fit)$r.squared
## [1] 0.004176409
summary(fit)$adj.r.squared
## [1] 0.003563343

AIC and BIC

Akaike’s Information Criterion (AIC)

fit %>% 
  AIC()
## [1] 22768.67

and Bayesian Information Criterion

fit %>%
  BIC()
## [1] 22801.13
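
Nested models can also be compared with an F-test via anova(), and a lower AIC or BIC indicates a better trade-off between fit and complexity. A sketch with the built-in mtcars data (not Benefits):

```r
# Compare a smaller model against a larger nested model
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_large <- lm(mpg ~ wt + hp, data = mtcars)

anova(fit_small, fit_large)  # F-test for the added term hp
AIC(fit_small, fit_large)    # here, the larger model has the lower AIC
```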

Influence of cases

DfBeta calculates the change in each estimated coefficient when a case is deleted; the scaled version dfbetas() expresses this change in standard errors.

fit %>%
  dfbeta() %>%
  head(n = 7)
##     (Intercept)        tenure           age    tenure:age
## 1  0.0007342916 -2.224502e-04  3.640706e-07 -5.582413e-07
## 2  0.0035452252 -2.717886e-04 -7.398838e-05  5.377732e-06
## 3  0.0006738846 -2.839800e-04 -9.192347e-06  4.750169e-06
## 4  0.0005261632  1.964583e-05 -1.336573e-05 -2.128578e-06
## 5 -0.0005698390  9.901239e-05  5.378984e-06 -1.579094e-06
## 6 -0.0002833433  1.905711e-06  1.062632e-05 -2.695563e-07
## 7 -0.0006480895 -1.443784e-04  1.228765e-05  2.737813e-06

Prediction

Fitted values:

Let’s use the simpler built-in anscombe data as an example

fit <- anscombe %$% lm(y1 ~ x1)
y_hat <- 
  fit %>%
  fitted.values()

The residual (observed minus fitted) is then calculated as

anscombe$y1 - y_hat
##           1           2           3           4           5           6 
##  0.03900000 -0.05081818 -1.92127273  1.30909091 -0.17109091 -0.04136364 
##           7           8           9          10          11 
##  1.23936364 -0.74045455  1.83881818 -1.68072727  0.17945455
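
The same quantity, with the conventional sign (observed minus fitted), is returned directly by the residuals() extractor:

```r
# residuals() returns observed minus fitted values (y - y_hat)
fit <- lm(y1 ~ x1, data = anscombe)
res <- residuals(fit)
all.equal(unname(res), anscombe$y1 - unname(fitted.values(fit)))  # TRUE
```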

Predict new values

If we introduce new values for the predictor x1, we can generate predicted values from the model

new.x1 <- data.frame(x1 = 1:20)
predict.lm(fit, newdata = new.x1)
##         1         2         3         4         5         6         7 
##  3.500182  4.000273  4.500364  5.000455  5.500545  6.000636  6.500727 
##         8         9        10        11        12        13        14 
##  7.000818  7.500909  8.001000  8.501091  9.001182  9.501273 10.001364 
##        15        16        17        18        19        20 
## 10.501455 11.001545 11.501636 12.001727 12.501818 13.001909

Predicted values lie exactly on the regression line

pred <- predict.lm(fit, newdata = new.x1)
lm(pred ~ new.x1$x1)$coefficients
## (Intercept)   new.x1$x1 
##   3.0000909   0.5000909
fit$coefficients
## (Intercept)          x1 
##   3.0000909   0.5000909

Prediction intervals

predict(fit, interval = "prediction")
##          fit      lwr       upr
## 1   8.001000 5.067072 10.934928
## 2   7.000818 4.066890  9.934747
## 3   9.501273 6.390801 12.611745
## 4   7.500909 4.579129 10.422689
## 5   8.501091 5.531014 11.471168
## 6  10.001364 6.789620 13.213107
## 7   6.000636 2.971271  9.030002
## 8   5.000455 1.788711  8.212198
## 9   9.001182 5.971816 12.030547
## 10  6.500727 3.530650  9.470804
## 11  5.500545 2.390073  8.611017

Some other modeling devices in R

  • lm(): linear modeling
  • glm(): generalized linear modeling
  • gamlss::gamlss(): generalized additive models (for location, scale and shape)
  • lme4::lmer(): linear mixed effect models
  • lme4::glmer(): generalized linear mixed effect models
  • lme4::nlmer(): non-linear mixed effect models
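
As a taster of glm(): the formula interface is identical to lm(), with a family argument added. A sketch of a logistic regression with the built-in mtcars data (a hypothetical example, not part of this lecture's Benefits analysis):

```r
# Logistic regression: binary outcome am (0 = automatic, 1 = manual)
fit_g <- glm(am ~ wt, data = mtcars, family = binomial)
coef(fit_g)  # coefficients are on the log-odds scale; heavier cars are
             # less likely to have a manual transmission (negative wt slope)
```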