library(magrittr) #for pipes
library(mice) #for the boys data
library(Ecdat) #for the Benefits data
library(ggplot2)
The function lm() is a base function in R that allows you to fit a wide variety of linear models.
args(lm)
## function (formula, data, subset, weights, na.action, method = "qr",
## model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
## contrasts = NULL, offset, ...)
## NULL
If we want to know what these arguments do, we can ask R:
?lm
This will open a help page on the lm() function.
The Benefits data from package Ecdat concern unemployment of blue collar workers. We use the following columns:
- Column stateur: state unemployment rate (in %)
- Column tenure: years of tenure in the job that was lost
- Column age: age in years
- Column joblost: reason the job was lost (a factor with levels other, position_abolished, seasonal_job_ended and slack_work)
Linear regression model \[ y_i=\alpha+\beta x_i+\varepsilon_i \]
Assumptions:
- the relation between \(y\) and \(x\) is linear
- the errors \(\varepsilon_i\) are independent of each other
- the errors have constant variance (homoscedasticity)
- the errors are normally distributed: \(\varepsilon_i \sim N(0, \sigma^2)\)
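As an illustration (not from the Benefits data), we can simulate data that satisfy these assumptions and check that lm() recovers the parameters we put in; the values alpha = 2 and beta = 0.5 are arbitrary choices.

```r
set.seed(123)
n <- 1000
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)  # alpha = 2, beta = 0.5, normal errors
coef(lm(y ~ x))                      # estimates should lie close to 2 and 0.5
```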
To obtain a linear regression model with a main effect for tenure, we formulate \(stateur\sim tenure\)
fit <- Benefits %$%
lm(stateur~tenure)
fit
##
## Call:
## lm(formula = stateur ~ tenure)
##
## Coefficients:
## (Intercept) tenure
## 7.460299 0.008957
fit %>% summary
##
## Call:
## lm(formula = stateur ~ tenure)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2693 -1.8051 -0.3693 1.4770 10.5307
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.460299 0.048214 154.732 <2e-16 ***
## tenure 0.008957 0.005701 1.571 0.116
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.5 on 4875 degrees of freedom
## Multiple R-squared: 0.000506, Adjusted R-squared: 0.000301
## F-statistic: 2.468 on 1 and 4875 DF, p-value: 0.1163
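Besides standard errors and t-tests, base R's confint() returns confidence intervals for the estimated coefficients. A sketch on the built-in cars data (the Benefits fit works the same way):

```r
fit0 <- lm(dist ~ speed, data = cars)  # cars ships with base R
confint(fit0, level = 0.95)            # 95% CI for intercept and slope
```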
To obtain a linear model with just main effects for tenure and age, we formulate \(stateur\sim tenure + age\)
fit <- Benefits %$%
lm(stateur ~ tenure + age)
fit
##
## Call:
## lm(formula = stateur ~ tenure + age)
##
## Coefficients:
## (Intercept) tenure age
## 7.91128 0.02072 -0.01433
fit %>% summary
##
## Call:
## lm(formula = stateur ~ tenure + age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3165 -1.8217 -0.3982 1.4691 10.7142
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.911283 0.130724 60.519 < 2e-16 ***
## tenure 0.020721 0.006517 3.179 0.001485 **
## age -0.014327 0.003861 -3.711 0.000209 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.497 on 4874 degrees of freedom
## Multiple R-squared: 0.003322, Adjusted R-squared: 0.002913
## F-statistic: 8.122 on 2 and 4874 DF, p-value: 0.0003009
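When one model nests inside another, anova() compares them with an F-test for the extra term(s). A sketch on the built-in mtcars data; the same call applies to the two Benefits fits above.

```r
fit1 <- lm(mpg ~ wt, data = mtcars)       # smaller model
fit2 <- lm(mpg ~ wt + hp, data = mtcars)  # adds one predictor
anova(fit1, fit2)                         # F-test: does hp improve the fit?
```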
To obtain a linear model with an interaction effect for tenure and age, we formulate \(stateur\sim tenure * age\)
fit <- Benefits %$% lm(stateur ~ tenure * age)
Equivalently:
fit <- Benefits %$% lm(stateur ~ tenure + age + tenure:age)
fit
##
## Call:
## lm(formula = stateur ~ tenure * age)
##
## Coefficients:
## (Intercept) tenure age tenure:age
## 7.686119 0.076781 -0.009180 -0.001174
fit %>% summary
##
## Call:
## lm(formula = stateur ~ tenure * age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4060 -1.8214 -0.3735 1.4856 10.6933
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6861194 0.1708825 44.979 < 2e-16 ***
## tenure 0.0767806 0.0281768 2.725 0.00645 **
## age -0.0091803 0.0046074 -1.992 0.04637 *
## tenure:age -0.0011741 0.0005742 -2.045 0.04091 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.496 on 4873 degrees of freedom
## Multiple R-squared: 0.004176, Adjusted R-squared: 0.003563
## F-statistic: 6.812 on 3 and 4873 DF, p-value: 0.0001405
If a categorical variable is entered into function lm(), R automatically converts it to a set of dummy variables. The first factor level is always taken as the reference category. If we want another reference category, we can change it with the function relevel().
fit_c <- Benefits %$% lm(stateur ~ joblost)
fit_c %>% summary
##
## Call:
## lm(formula = stateur ~ joblost)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4848 -1.8062 -0.4062 1.5152 10.6806
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.36412 0.05615 131.161 < 2e-16 ***
## joblostposition_abolished -0.04472 0.13656 -0.327 0.743
## joblostseasonal_job_ended -0.05790 0.19582 -0.296 0.767
## joblostslack_work 0.32072 0.07639 4.199 2.73e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.496 on 4873 degrees of freedom
## Multiple R-squared: 0.004427, Adjusted R-squared: 0.003814
## F-statistic: 7.223 on 3 and 4873 DF, p-value: 7.81e-05
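To see relevel() in action on a self-contained example, the built-in chickwts data has a factor feed; the same pattern applies to joblost in Benefits. The choice of "soybean" as the new reference is purely illustrative.

```r
fit_a <- lm(weight ~ feed, data = chickwts)  # reference category: "casein"
fit_b <- lm(weight ~ relevel(feed, ref = "soybean"), data = chickwts)
coef(fit_a)[1]  # intercept = mean weight of the casein group
coef(fit_b)[1]  # intercept = mean weight of the soybean group
```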
names(fit_c) # the names of the list with output objects
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "contrasts" "xlevels" "call" "terms"
## [13] "model"
fit_c$coef # show the estimated coefficients
## (Intercept) joblostposition_abolished
## 7.36411943 -0.04471645
## joblostseasonal_job_ended joblostslack_work
## -0.05790474 0.32072122
coef(fit_c) # alternative
Benefits %>%
ggplot(aes(tenure, stateur)) +
geom_point() +
geom_smooth(method = "loess", col = "blue") +
geom_smooth(method = "lm", col = "orange")
The loess curve (blue) suggests slight non-linearity.
Benefits %$%
lm(stateur ~ tenure + I(tenure^2)) %>%
summary()
##
## Call:
## lm(formula = stateur ~ tenure + I(tenure^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2209 -1.8143 -0.3809 1.4551 10.5938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.3648387 0.0619654 118.854 < 2e-16 ***
## tenure 0.0427128 0.0149071 2.865 0.00418 **
## I(tenure^2) -0.0013388 0.0005463 -2.451 0.01430 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.499 on 4874 degrees of freedom
## Multiple R-squared: 0.001736, Adjusted R-squared: 0.001326
## F-statistic: 4.238 on 2 and 4874 DF, p-value: 0.01449
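Instead of I(tenure^2), the quadratic term can also be entered with poly(), whose orthogonal polynomials avoid the collinearity between a predictor and its square. A sketch on the built-in cars data; the individual coefficients differ from the raw parameterisation, but the fitted values are identical.

```r
fit_p <- lm(dist ~ poly(speed, 2), data = cars)  # orthogonal linear + quadratic terms
summary(fit_p)$coefficients
```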
par(mfrow = c(1, 2))
fit %>%
plot(which = c(1, 3), cex = .6)
par(mfrow = c(1, 2))
boys %$%
lm(bmi ~ age) %>%
plot(which = c(1, 3), cex = .6)
fit %>%
plot(which = 2, cex = .6)
The Q-Q plot shows divergence from normality in the tails.
par(mfrow = c(1, 2), cex = .6)
fit %>% plot(which = c(4, 5))
There are quite some cases with \(|e_z|>2\) (right plot). There are no cases with Cook’s Distance \(>1\), but some cases stand out.
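The Cook's distances can also be inspected numerically with cooks.distance(). A self-contained sketch on the built-in cars data, standing in for the Benefits fit:

```r
fit0 <- lm(dist ~ speed, data = cars)
cd <- cooks.distance(fit0)            # one value of Cook's D per case
head(sort(cd, decreasing = TRUE), 3)  # the most influential cases
any(cd > 1)                           # rule of thumb: D > 1 flags a case
```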
summary(fit)$r.squared
## [1] 0.004176409
summary(fit)$adj.r.squared
## [1] 0.003563343
Akaike’s Information Criterion (AIC)
fit %>%
AIC()
## [1] 22768.67
and the Bayesian Information Criterion (BIC)
fit %>%
BIC()
## [1] 22801.13
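Both criteria penalise model complexity; lower values indicate a better trade-off between fit and parsimony, and both functions accept several fits at once for direct comparison. A sketch on the built-in mtcars data:

```r
fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ wt + hp, data = mtcars)
AIC(fit1, fit2)  # lower AIC = preferred model
BIC(fit1, fit2)  # BIC penalises extra parameters more heavily
```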
dfbeta() calculates the change in each estimated coefficient when a case is deleted from the model; the related dfbetas() expresses this change in units of the coefficient’s standard error.
fit %>%
dfbeta() %>%
head(n = 7)
## (Intercept) tenure age tenure:age
## 1 0.0007342916 -2.224502e-04 3.640706e-07 -5.582413e-07
## 2 0.0035452252 -2.717886e-04 -7.398838e-05 5.377732e-06
## 3 0.0006738846 -2.839800e-04 -9.192347e-06 4.750169e-06
## 4 0.0005261632 1.964583e-05 -1.336573e-05 -2.128578e-06
## 5 -0.0005698390 9.901239e-05 5.378984e-06 -1.579094e-06
## 6 -0.0002833433 1.905711e-06 1.062632e-05 -2.695563e-07
## 7 -0.0006480895 -1.443784e-04 1.228765e-05 2.737813e-06
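Because the raw dfbeta values live on the scale of each coefficient, the standardised dfbetas() values are easier to compare across coefficients. A self-contained sketch on the built-in cars data:

```r
fit0 <- lm(dist ~ speed, data = cars)
head(dfbetas(fit0), 3)  # per-case coefficient changes, in standard-error units
```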
Let’s use the simpler built-in anscombe data as an example
fit <- anscombe %$% lm(y1 ~ x1)
y_hat <-
fit %>%
fitted.values()
The residuals are then calculated as the observed values minus the fitted values
anscombe$y1 - y_hat
## 1 2 3 4 5 6
## 0.03900000 -0.05081818 -1.92127273 1.30909091 -0.17109091 -0.04136364
## 7 8 9 10 11
## 1.23936364 -0.74045455 1.83881818 -1.68072727 0.17945455
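R also computes these directly: residuals() returns exactly the observed-minus-fitted values.

```r
fit <- lm(y1 ~ x1, data = anscombe)
all.equal(unname(residuals(fit)),
          anscombe$y1 - unname(fitted(fit)))  # TRUE
```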
If we introduce new values for the predictor x1, we can generate predicted values from the model
new.x1 <- data.frame(x1 = 1:20)
predict.lm(fit, newdata = new.x1)
## 1 2 3 4 5 6 7
## 3.500182 4.000273 4.500364 5.000455 5.500545 6.000636 6.500727
## 8 9 10 11 12 13 14
## 7.000818 7.500909 8.001000 8.501091 9.001182 9.501273 10.001364
## 15 16 17 18 19 20
## 10.501455 11.001545 11.501636 12.001727 12.501818 13.001909
pred <- predict.lm(fit, newdata = new.x1)
lm(pred ~ new.x1$x1)$coefficients
## (Intercept) new.x1$x1
## 3.0000909 0.5000909
fit$coefficients
## (Intercept) x1
## 3.0000909 0.5000909
predict(fit, interval = "prediction")
## fit lwr upr
## 1 8.001000 5.067072 10.934928
## 2 7.000818 4.066890 9.934747
## 3 9.501273 6.390801 12.611745
## 4 7.500909 4.579129 10.422689
## 5 8.501091 5.531014 11.471168
## 6 10.001364 6.789620 13.213107
## 7 6.000636 2.971271 9.030002
## 8 5.000455 1.788711 8.212198
## 9 9.001182 5.971816 12.030547
## 10 6.500727 3.530650 9.470804
## 11 5.500545 2.390073 8.611017
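For contrast, interval = "confidence" gives the narrower interval for the mean response rather than for a new observation. A sketch at the (arbitrarily chosen) value x1 = 10:

```r
fit <- lm(y1 ~ x1, data = anscombe)
new <- data.frame(x1 = 10)
predict(fit, newdata = new, interval = "confidence")  # interval for the mean response
predict(fit, newdata = new, interval = "prediction")  # wider: for a new observation
```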
In R:
- lm(): linear modeling
- glm(): generalized linear modeling
- gamlss::gamlss(): generalized additive models (for location, scale and shape)
- lme4::lmer(): linear mixed effect models
- lme4::glmer(): generalized linear mixed effect models
- lme4::nlmer(): non-linear mixed effect models
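As a small taste of glm(), a logistic regression on the built-in mtcars data (transmission type regressed on weight; the choice of variables is illustrative):

```r
fit_g <- glm(am ~ wt, family = binomial, data = mtcars)
coef(fit_g)  # negative wt coefficient: heavier cars are less often manual
```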