Statistical Programming with R
library(magrittr) # for pipes
library(mice)     # for the boys data
library(Ecdat)    # for the Benefits data
library(ggplot2)
The function lm() is a base function in R and allows you to fit a wide variety of linear models.
args(lm)
## function (formula, data, subset, weights, na.action, method = "qr", 
##     model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
##     contrasts = NULL, offset, ...) 
## NULL
If we want to know what these arguments do we can ask R:
?lm
This will open a help page on the lm() function.
The Benefits data (from package Ecdat) concern unemployment of blue collar workers. We use the following columns:
Column stateur: state unemployment rate (in %)
Column tenure: years of tenure in the lost job
Column age: age in years
Column joblost: reason the job was lost (a factor with levels other, position_abolished, seasonal_job_ended and slack_work)
Linear regression model \[ y_i = \alpha + \beta x_i + \varepsilon_i \]
Assumptions: the relation between the outcome and the predictor is linear, and the errors \(\varepsilon_i\) are independent and normally distributed with mean 0 and constant variance \(\sigma^2\).
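As a small sketch (not part of the original material), we can simulate data from exactly this model with a chosen \(\alpha\) and \(\beta\) and check that lm() recovers them:

```r
set.seed(123)

# simulate y_i = alpha + beta * x_i + e_i with known alpha and beta
n     <- 1000
alpha <- 7.5
beta  <- 0.9
x     <- runif(n, 0, 10)
y     <- alpha + beta * x + rnorm(n, mean = 0, sd = 2)

coef(lm(y ~ x))  # estimates should lie close to 7.5 and 0.9
```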
To obtain a linear regression model with a main effect for tenure, we formulate \(stateur\sim tenure\)
fit <- Benefits %$% lm(stateur ~ tenure)
fit
## 
## Call:
## lm(formula = stateur ~ tenure)
## 
## Coefficients:
## (Intercept)       tenure  
##    7.460299     0.008957
fit %>% summary
## 
## Call:
## lm(formula = stateur ~ tenure)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2693 -1.8051 -0.3693  1.4770 10.5307 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.460299   0.048214 154.732   <2e-16 ***
## tenure      0.008957   0.005701   1.571    0.116    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.5 on 4875 degrees of freedom
## Multiple R-squared:  0.000506,  Adjusted R-squared:  0.000301 
## F-statistic: 2.468 on 1 and 4875 DF,  p-value: 0.1163
To obtain a linear model with just main effects for tenure and age, we formulate \(stateur\sim tenure + age\)
require(mice)
fit <- Benefits %$% lm(stateur ~ tenure + age)
fit
## 
## Call:
## lm(formula = stateur ~ tenure + age)
## 
## Coefficients:
## (Intercept)       tenure          age  
##     7.91128      0.02072     -0.01433
fit %>% summary
## 
## Call:
## lm(formula = stateur ~ tenure + age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3165 -1.8217 -0.3982  1.4691 10.7142 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.911283   0.130724  60.519  < 2e-16 ***
## tenure       0.020721   0.006517   3.179 0.001485 ** 
## age         -0.014327   0.003861  -3.711 0.000209 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.497 on 4874 degrees of freedom
## Multiple R-squared:  0.003322,  Adjusted R-squared:  0.002913 
## F-statistic: 8.122 on 2 and 4874 DF,  p-value: 0.0003009
To obtain a linear model with an interaction effect for tenure and age, we formulate \(stateur\sim tenure * age\)
fit <- Benefits %$% lm(stateur ~ tenure * age)
This is equivalent to:
fit <- Benefits %$% lm(stateur ~ tenure + age + tenure:age)
fit
## 
## Call:
## lm(formula = stateur ~ tenure * age)
## 
## Coefficients:
## (Intercept)       tenure          age   tenure:age  
##    7.686119     0.076781    -0.009180    -0.001174
fit %>% summary
## 
## Call:
## lm(formula = stateur ~ tenure * age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4060 -1.8214 -0.3735  1.4856 10.6933 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.6861194  0.1708825  44.979  < 2e-16 ***
## tenure       0.0767806  0.0281768   2.725  0.00645 ** 
## age         -0.0091803  0.0046074  -1.992  0.04637 *  
## tenure:age  -0.0011741  0.0005742  -2.045  0.04091 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.496 on 4873 degrees of freedom
## Multiple R-squared:  0.004176,  Adjusted R-squared:  0.003563 
## F-statistic: 6.812 on 3 and 4873 DF,  p-value: 0.0001405
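Under the hood, the interaction term is simply an extra column in the design matrix: the elementwise product of the two predictors. A small sketch with made-up data (not the Benefits data) to verify this:

```r
# toy data, purely for illustration
d  <- data.frame(a = c(1, 2, 3, 4), b = c(4, 5, 6, 7), y = c(1, 0, 1, 0))
mm <- model.matrix(y ~ a * b, data = d)

colnames(mm)                    # "(Intercept)" "a" "b" "a:b"
all(mm[, "a:b"] == d$a * d$b)   # TRUE: the interaction column is a * b
```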
If a categorical variable is entered into the function lm(), it is automatically converted to a set of dummy variables in R. The first factor level is always taken as the reference category. If we want another reference category, we can change it with the function relevel().
fit_c <- Benefits %$% lm(stateur ~ joblost)
fit_c %>% summary
## 
## Call:
## lm(formula = stateur ~ joblost)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4848 -1.8062 -0.4062  1.5152 10.6806 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                7.36412    0.05615 131.161  < 2e-16 ***
## joblostposition_abolished -0.04472    0.13656  -0.327    0.743    
## joblostseasonal_job_ended -0.05790    0.19582  -0.296    0.767    
## joblostslack_work          0.32072    0.07639   4.199 2.73e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.496 on 4873 degrees of freedom
## Multiple R-squared:  0.004427,  Adjusted R-squared:  0.003814 
## F-statistic: 7.223 on 3 and 4873 DF,  p-value: 7.81e-05
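A sketch of how relevel() changes the reference category, using a made-up factor (the same idea applies to Benefits$joblost):

```r
# toy data with factor g; the first level "a" is the default reference
d <- data.frame(y = c(1, 2, 3, 4, 5, 6),
                g = factor(c("a", "b", "c", "a", "b", "c")))

coef(lm(y ~ g, data = d))        # dummies gb and gc, contrasted against "a"

d$g <- relevel(d$g, ref = "c")   # make "c" the reference category instead
coef(lm(y ~ g, data = d))        # dummies ga and gb, contrasted against "c"
```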
names(fit_c) # the names of the list with output objects
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "contrasts"     "xlevels"       "call"          "terms"        
## [13] "model"
fit_c$coef # show the estimated coefficients
##               (Intercept) joblostposition_abolished 
##                7.36411943               -0.04471645 
## joblostseasonal_job_ended         joblostslack_work 
##               -0.05790474                0.32072122
coef(fit_c) # alternative
Benefits %>% 
  ggplot(aes(tenure, stateur)) + 
  geom_point() + 
  geom_smooth(method = "loess", col = "blue") + 
  geom_smooth(method = "lm", col = "orange")
The loess curve (blue) suggests slight non-linearity compared with the linear fit (orange).
Benefits %$% lm(stateur ~ tenure + I(tenure^2)) %>% summary()
## 
## Call:
## lm(formula = stateur ~ tenure + I(tenure^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2209 -1.8143 -0.3809  1.4551 10.5938 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.3648387  0.0619654 118.854  < 2e-16 ***
## tenure       0.0427128  0.0149071   2.865  0.00418 ** 
## I(tenure^2) -0.0013388  0.0005463  -2.451  0.01430 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.499 on 4874 degrees of freedom
## Multiple R-squared:  0.001736,  Adjusted R-squared:  0.001326 
## F-statistic: 4.238 on 2 and 4874 DF,  p-value: 0.01449
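The wrapper I() is needed because ^ has a special meaning inside a formula (crossing of terms); without it, x^2 collapses back to x. A minimal sketch with made-up data:

```r
d <- data.frame(x = 1:10, y = (1:10)^2)

# I(x^2) protects the arithmetic: intercept, x and x^2
length(coef(lm(y ~ x + I(x^2), data = d)))  # 3

# without I(), x^2 is read as x crossed with itself, i.e. just x
length(coef(lm(y ~ x + x^2, data = d)))     # 2
```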
par(mfrow = c(1, 2))
fit %>% plot(which = c(1, 3), cex = .6)
par(mfrow = c(1, 2))
boys %$% lm(bmi ~ age) %>% plot(which = c(1, 3), cex = .6)
fit %>% plot(which = 2, cex = .6)
The Q-Q plot shows divergence from normality in the tails.
par(mfrow = c(1, 2), cex = .6)
fit %>% plot(which = c(4, 5))
There are quite a few cases with standardized residuals \(|e_z|>2\) (right plot). No cases have a Cook's distance \(>1\), but some cases stand out.
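The quantities behind these plots can also be extracted directly, e.g. with rstandard() and cooks.distance(). A sketch on the built-in anscombe data:

```r
fit1 <- lm(y1 ~ x1, data = anscombe)

sum(abs(rstandard(fit1)) > 2)    # number of cases with |e_z| > 2
which(cooks.distance(fit1) > 1)  # cases exceeding the Cook's distance rule of thumb
```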
summary(fit)$r.squared
## [1] 0.004176409
summary(fit)$adj.r.squared
## [1] 0.003563343
Akaike’s Information Criterion (AIC)
fit %>% AIC()
## [1] 22768.67
and the Bayesian Information Criterion (BIC)
fit %>% BIC()
## [1] 22801.13
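Both criteria are mainly useful for comparing models fitted to the same data: lower is better, and BIC penalizes extra parameters more heavily than AIC. A sketch on the built-in mtcars data:

```r
f1 <- lm(mpg ~ wt,      data = mtcars)
f2 <- lm(mpg ~ wt + hp, data = mtcars)

AIC(f1, f2)  # the two-predictor model has the lower AIC here
BIC(f1, f2)
```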
The function dfbeta() calculates, for every case, the change in the estimated coefficients that results from deleting that case. (Its scaled counterpart dfbetas() expresses these changes in standard-error units.)
fit %>% dfbeta() %>% head(n = 7)
##     (Intercept)        tenure           age    tenure:age
## 1  0.0007342916 -2.224502e-04  3.640706e-07 -5.582413e-07
## 2  0.0035452252 -2.717886e-04 -7.398838e-05  5.377732e-06
## 3  0.0006738846 -2.839800e-04 -9.192347e-06  4.750169e-06
## 4  0.0005261632  1.964583e-05 -1.336573e-05 -2.128578e-06
## 5 -0.0005698390  9.901239e-05  5.378984e-06 -1.579094e-06
## 6 -0.0002833433  1.905711e-06  1.062632e-05 -2.695563e-07
## 7 -0.0006480895 -1.443784e-04  1.228765e-05  2.737813e-06
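We can verify what dfbeta() returns by refitting without a case: row i equals the full-data coefficients minus the coefficients obtained after deleting case i. A sketch on the built-in anscombe data:

```r
fit1   <- lm(y1 ~ x1, data = anscombe)
drop3  <- lm(y1 ~ x1, data = anscombe[-3, ])  # refit without case 3
manual <- coef(fit1) - coef(drop3)

all.equal(unname(manual), unname(dfbeta(fit1)[3, ]))  # TRUE
```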
Let’s use the simpler anscombe data as an example.
fit <- anscombe %$% lm(y1 ~ x1)
y_hat <- fit %>% fitted.values()
The residuals are then calculated as the observed values minus the fitted values:
anscombe$y1 - y_hat
##           1           2           3           4           5           6 
##  0.03900000 -0.05081818 -1.92127273  1.30909091 -0.17109091 -0.04136364 
##           7           8           9          10          11 
##  1.23936364 -0.74045455  1.83881818 -1.68072727  0.17945455
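We rarely need to compute this by hand: residuals() (and its shorthand resid()) returns exactly the observed minus the fitted values:

```r
fit1 <- lm(y1 ~ x1, data = anscombe)

# residuals() returns y - y_hat, matching the manual computation
all.equal(unname(residuals(fit1)),
          anscombe$y1 - unname(fitted(fit1)))  # TRUE
```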
If we introduce new values for the predictor x1, we can generate predicted values from the model
new.x1 <- data.frame(x1 = 1:20)
predict.lm(fit, newdata = new.x1)
##         1         2         3         4         5         6         7 
##  3.500182  4.000273  4.500364  5.000455  5.500545  6.000636  6.500727 
##         8         9        10        11        12        13        14 
##  7.000818  7.500909  8.001000  8.501091  9.001182  9.501273 10.001364 
##        15        16        17        18        19        20 
## 10.501455 11.001545 11.501636 12.001727 12.501818 13.001909
pred <- predict.lm(fit, newdata = new.x1)
lm(pred ~ new.x1$x1)$coefficients
## (Intercept)   new.x1$x1 
##   3.0000909   0.5000909
fit$coefficients
## (Intercept)          x1 
##   3.0000909   0.5000909
To obtain prediction intervals for the observed data, we ask for interval = "prediction":
predict(fit, interval = "prediction")
##          fit      lwr       upr
## 1   8.001000 5.067072 10.934928
## 2   7.000818 4.066890  9.934747
## 3   9.501273 6.390801 12.611745
## 4   7.500909 4.579129 10.422689
## 5   8.501091 5.531014 11.471168
## 6  10.001364 6.789620 13.213107
## 7   6.000636 2.971271  9.030002
## 8   5.000455 1.788711  8.212198
## 9   9.001182 5.971816 12.030547
## 10  6.500727 3.530650  9.470804
## 11  5.500545 2.390073  8.611017
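A prediction interval covers a new observation and is therefore wider than a confidence interval for the mean response, because it also includes the residual variance. A sketch on the anscombe fit:

```r
fit1 <- lm(y1 ~ x1, data = anscombe)
new  <- data.frame(x1 = 10)

ci <- predict(fit1, newdata = new, interval = "confidence")
pi <- predict(fit1, newdata = new, interval = "prediction")

# same point estimate, but the prediction interval is wider
(pi[, "upr"] - pi[, "lwr"]) > (ci[, "upr"] - ci[, "lwr"])  # TRUE
```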
Modeling functions in R:
- lm(): linear modeling
- glm(): generalized linear modeling
- gamlss::gamlss(): generalized additive models (for location, scale and shape)
- lme4::lmer(): linear mixed effect models
- lme4::glmer(): generalized linear mixed effect models
- lme4::nlmer(): non-linear mixed effect models