Statistical Programming with R
library(magrittr) # for pipes
library(mice)     # for the boys data
library(Ecdat)    # for the Benefits data
library(ggplot2)
The function lm() is a base function in R and allows you to fit a wide variety of linear models.
args(lm)
## function (formula, data, subset, weights, na.action, method = "qr", 
##     model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
##     contrasts = NULL, offset, ...) 
## NULL
If we want to know what these arguments do we can ask R:
?lm
This will open a help page on the lm() function.
The Benefits data (from package Ecdat) concern unemployment of blue collar workers. We use the following columns:
Column stateur: state unemployment rate (in %)
Column tenure: years of tenure in the lost job
Column age: age in years
Column joblost: reason the job was lost (a factor with levels other, position_abolished, seasonal_job_ended and slack_work)
Linear regression model \[ y_i = \alpha + \beta x_i + \varepsilon_i \]
Assumptions: the relation between the outcome and the predictor is linear, and the errors \(\varepsilon_i\) are independent and normally distributed with mean 0 and constant variance \(\sigma^2\).
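As a small sketch (not part of the original material), we can simulate data from exactly this model with a chosen \(\alpha\) and \(\beta\) and check that lm() recovers them:

```r
set.seed(123)

# simulate y_i = alpha + beta * x_i + e_i with known alpha and beta
n     <- 1000
alpha <- 7.5
beta  <- 0.9
x     <- runif(n, 0, 10)
y     <- alpha + beta * x + rnorm(n, mean = 0, sd = 2)

coef(lm(y ~ x))  # estimates should lie close to 7.5 and 0.9
```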
To obtain a linear regression model with a main effect for tenure, we formulate \(stateur\sim tenure\)
fit <- Benefits %$% lm(stateur ~ tenure)
fit
## 
## Call:
## lm(formula = stateur ~ tenure)
## 
## Coefficients:
## (Intercept)       tenure  
##    7.460299     0.008957
fit %>% summary
## 
## Call:
## lm(formula = stateur ~ tenure)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2693 -1.8051 -0.3693  1.4770 10.5307 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.460299   0.048214 154.732   <2e-16 ***
## tenure      0.008957   0.005701   1.571    0.116    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.5 on 4875 degrees of freedom
## Multiple R-squared:  0.000506,  Adjusted R-squared:  0.000301 
## F-statistic: 2.468 on 1 and 4875 DF,  p-value: 0.1163
To obtain a linear model with just main effects for tenure and age, we formulate \(stateur\sim tenure + age\)
require(mice)
fit <- Benefits %$% lm(stateur ~ tenure + age)
fit
## 
## Call:
## lm(formula = stateur ~ tenure + age)
## 
## Coefficients:
## (Intercept)       tenure          age  
##     7.91128      0.02072     -0.01433
fit %>% summary
## 
## Call:
## lm(formula = stateur ~ tenure + age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3165 -1.8217 -0.3982  1.4691 10.7142 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.911283   0.130724  60.519  < 2e-16 ***
## tenure       0.020721   0.006517   3.179 0.001485 ** 
## age         -0.014327   0.003861  -3.711 0.000209 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.497 on 4874 degrees of freedom
## Multiple R-squared:  0.003322,  Adjusted R-squared:  0.002913 
## F-statistic: 8.122 on 2 and 4874 DF,  p-value: 0.0003009
To obtain a linear model with an interaction effect for tenure and age, we formulate \(stateur\sim tenure * age\)
fit <- Benefits %$% lm(stateur ~ tenure * age)
This is equivalent to:
fit <- Benefits %$% lm(stateur ~ tenure + age + tenure:age)
fit
## 
## Call:
## lm(formula = stateur ~ tenure * age)
## 
## Coefficients:
## (Intercept)       tenure          age   tenure:age  
##    7.686119     0.076781    -0.009180    -0.001174
fit %>% summary
## 
## Call:
## lm(formula = stateur ~ tenure * age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4060 -1.8214 -0.3735  1.4856 10.6933 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.6861194  0.1708825  44.979  < 2e-16 ***
## tenure       0.0767806  0.0281768   2.725  0.00645 ** 
## age         -0.0091803  0.0046074  -1.992  0.04637 *  
## tenure:age  -0.0011741  0.0005742  -2.045  0.04091 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.496 on 4873 degrees of freedom
## Multiple R-squared:  0.004176,  Adjusted R-squared:  0.003563 
## F-statistic: 6.812 on 3 and 4873 DF,  p-value: 0.0001405
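Under the hood, the interaction term is simply an extra column in the design matrix: the elementwise product of the two predictors. A small sketch with made-up data (not the Benefits data) to verify this:

```r
# toy data, purely for illustration
d  <- data.frame(a = c(1, 2, 3, 4), b = c(4, 5, 6, 7), y = c(1, 0, 1, 0))
mm <- model.matrix(y ~ a * b, data = d)

colnames(mm)                    # "(Intercept)" "a" "b" "a:b"
all(mm[, "a:b"] == d$a * d$b)   # TRUE: the interaction column is a * b
```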
If a categorical variable is entered into the function lm(), it is automatically converted to a set of dummy variables in R. The first factor level is always taken as the reference category. If we want another reference category, we can change it with the function relevel().
fit_c <- Benefits %$% lm(stateur ~ joblost)
fit_c %>% summary
## 
## Call:
## lm(formula = stateur ~ joblost)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4848 -1.8062 -0.4062  1.5152 10.6806 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                7.36412    0.05615 131.161  < 2e-16 ***
## joblostposition_abolished -0.04472    0.13656  -0.327    0.743    
## joblostseasonal_job_ended -0.05790    0.19582  -0.296    0.767    
## joblostslack_work          0.32072    0.07639   4.199 2.73e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.496 on 4873 degrees of freedom
## Multiple R-squared:  0.004427,  Adjusted R-squared:  0.003814 
## F-statistic: 7.223 on 3 and 4873 DF,  p-value: 7.81e-05
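A sketch of how relevel() changes the reference category, using a made-up factor (the same idea applies to Benefits$joblost):

```r
# toy data with factor g; the first level "a" is the default reference
d <- data.frame(y = c(1, 2, 3, 4, 5, 6),
                g = factor(c("a", "b", "c", "a", "b", "c")))

coef(lm(y ~ g, data = d))        # dummies gb and gc, contrasted against "a"

d$g <- relevel(d$g, ref = "c")   # make "c" the reference category instead
coef(lm(y ~ g, data = d))        # dummies ga and gb, contrasted against "c"
```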
names(fit_c) # the names of the list with output objects
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "contrasts"     "xlevels"       "call"          "terms"        
## [13] "model"
fit_c$coef # show the estimated coefficients
##               (Intercept) joblostposition_abolished 
##                7.36411943               -0.04471645 
## joblostseasonal_job_ended         joblostslack_work 
##               -0.05790474                0.32072122
coef(fit_c) # alternative
Benefits %>% 
  ggplot(aes(tenure, stateur)) + 
  geom_point() + 
  geom_smooth(method = "loess", col = "blue") + 
  geom_smooth(method = "lm", col = "orange")
The loess curve (blue) suggests slight non-linearity compared with the linear fit (orange).
Benefits %$% lm(stateur ~ tenure + I(tenure^2)) %>% summary()
## 
## Call:
## lm(formula = stateur ~ tenure + I(tenure^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2209 -1.8143 -0.3809  1.4551 10.5938 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.3648387  0.0619654 118.854  < 2e-16 ***
## tenure       0.0427128  0.0149071   2.865  0.00418 ** 
## I(tenure^2) -0.0013388  0.0005463  -2.451  0.01430 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.499 on 4874 degrees of freedom
## Multiple R-squared:  0.001736,  Adjusted R-squared:  0.001326 
## F-statistic: 4.238 on 2 and 4874 DF,  p-value: 0.01449
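The wrapper I() is needed because ^ has a special meaning inside a formula (crossing of terms); without it, x^2 collapses back to x. A minimal sketch with made-up data:

```r
d <- data.frame(x = 1:10, y = (1:10)^2)

# I(x^2) protects the arithmetic: intercept, x and x^2
length(coef(lm(y ~ x + I(x^2), data = d)))  # 3

# without I(), x^2 is read as x crossed with itself, i.e. just x
length(coef(lm(y ~ x + x^2, data = d)))     # 2
```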
par(mfrow = c(1, 2))
fit %>% plot(which = c(1, 3), cex = .6)
par(mfrow = c(1, 2))
boys %$% lm(bmi ~ age) %>% plot(which = c(1, 3), cex = .6)
fit %>% plot(which = 2, cex = .6)
The Q-Q plot shows divergence from normality in the tails.
par(mfrow = c(1, 2), cex = .6)
fit %>% plot(which = c(4, 5))
There are quite a few cases with standardized residuals \(|e_z|>2\) (right plot). No cases have a Cook's distance \(>1\), but some cases stand out.
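The quantities behind these plots can also be extracted directly, e.g. with rstandard() and cooks.distance(). A sketch on the built-in anscombe data:

```r
fit1 <- lm(y1 ~ x1, data = anscombe)

sum(abs(rstandard(fit1)) > 2)    # number of cases with |e_z| > 2
which(cooks.distance(fit1) > 1)  # cases exceeding the Cook's distance rule of thumb
```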
summary(fit)$r.squared
## [1] 0.004176409
summary(fit)$adj.r.squared
## [1] 0.003563343
Akaike’s Information Criterion (AIC)
fit %>% AIC()
## [1] 22768.67
and the Bayesian Information Criterion (BIC)
fit %>% BIC()
## [1] 22801.13
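Both criteria are mainly useful for comparing models fitted to the same data: lower is better, and BIC penalizes extra parameters more heavily than AIC. A sketch on the built-in mtcars data:

```r
f1 <- lm(mpg ~ wt,      data = mtcars)
f2 <- lm(mpg ~ wt + hp, data = mtcars)

AIC(f1, f2)  # the two-predictor model has the lower AIC here
BIC(f1, f2)
```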
The function dfbeta() calculates, for every case, the change in the estimated coefficients that results from deleting that case. (Its scaled counterpart dfbetas() expresses these changes in standard-error units.)
fit %>% dfbeta() %>% head(n = 7)
##     (Intercept)        tenure           age    tenure:age
## 1  0.0007342916 -2.224502e-04  3.640706e-07 -5.582413e-07
## 2  0.0035452252 -2.717886e-04 -7.398838e-05  5.377732e-06
## 3  0.0006738846 -2.839800e-04 -9.192347e-06  4.750169e-06
## 4  0.0005261632  1.964583e-05 -1.336573e-05 -2.128578e-06
## 5 -0.0005698390  9.901239e-05  5.378984e-06 -1.579094e-06
## 6 -0.0002833433  1.905711e-06  1.062632e-05 -2.695563e-07
## 7 -0.0006480895 -1.443784e-04  1.228765e-05  2.737813e-06
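We can verify what dfbeta() returns by refitting without a case: row i equals the full-data coefficients minus the coefficients obtained after deleting case i. A sketch on the built-in anscombe data:

```r
fit1   <- lm(y1 ~ x1, data = anscombe)
drop3  <- lm(y1 ~ x1, data = anscombe[-3, ])  # refit without case 3
manual <- coef(fit1) - coef(drop3)

all.equal(unname(manual), unname(dfbeta(fit1)[3, ]))  # TRUE
```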
Let’s use the simpler anscombe data as an example.
fit <- anscombe %$% lm(y1 ~ x1)
y_hat <- fit %>% fitted.values()
The residuals are then calculated as the observed values minus the fitted values:
anscombe$y1 - y_hat
##           1           2           3           4           5           6 
##  0.03900000 -0.05081818 -1.92127273  1.30909091 -0.17109091 -0.04136364 
##           7           8           9          10          11 
##  1.23936364 -0.74045455  1.83881818 -1.68072727  0.17945455
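We rarely need to compute this by hand: residuals() (and its shorthand resid()) returns exactly the observed minus the fitted values:

```r
fit1 <- lm(y1 ~ x1, data = anscombe)

# residuals() returns y - y_hat, matching the manual computation
all.equal(unname(residuals(fit1)),
          anscombe$y1 - unname(fitted(fit1)))  # TRUE
```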
If we introduce new values for the predictor x1, we can generate predicted values from the model
new.x1 <- data.frame(x1 = 1:20)
predict.lm(fit, newdata = new.x1)
##         1         2         3         4         5         6         7 
##  3.500182  4.000273  4.500364  5.000455  5.500545  6.000636  6.500727 
##         8         9        10        11        12        13        14 
##  7.000818  7.500909  8.001000  8.501091  9.001182  9.501273 10.001364 
##        15        16        17        18        19        20 
## 10.501455 11.001545 11.501636 12.001727 12.501818 13.001909
pred <- predict.lm(fit, newdata = new.x1)
lm(pred ~ new.x1$x1)$coefficients
## (Intercept)   new.x1$x1 
##   3.0000909   0.5000909
fit$coefficients
## (Intercept)          x1 
##   3.0000909   0.5000909
To obtain prediction intervals for the observed data, we ask for interval = "prediction":
predict(fit, interval = "prediction")
##          fit      lwr       upr
## 1   8.001000 5.067072 10.934928
## 2   7.000818 4.066890  9.934747
## 3   9.501273 6.390801 12.611745
## 4   7.500909 4.579129 10.422689
## 5   8.501091 5.531014 11.471168
## 6  10.001364 6.789620 13.213107
## 7   6.000636 2.971271  9.030002
## 8   5.000455 1.788711  8.212198
## 9   9.001182 5.971816 12.030547
## 10  6.500727 3.530650  9.470804
## 11  5.500545 2.390073  8.611017
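A prediction interval covers a new observation and is therefore wider than a confidence interval for the mean response, because it also includes the residual variance. A sketch on the anscombe fit:

```r
fit1 <- lm(y1 ~ x1, data = anscombe)
new  <- data.frame(x1 = 10)

ci <- predict(fit1, newdata = new, interval = "confidence")
pi <- predict(fit1, newdata = new, interval = "prediction")

# same point estimate, but the prediction interval is wider
(pi[, "upr"] - pi[, "lwr"]) > (ci[, "upr"] - ci[, "lwr"])  # TRUE
```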
Modeling functions in R:
- lm(): linear modeling
- glm(): generalized linear modeling
- gamlss::gamlss(): generalized additive models (for location, scale and shape)
- lme4::lmer(): linear mixed effect models
- lme4::glmer(): generalized linear mixed effect models
- lme4::nlmer(): non-linear mixed effect models