@dataknut
To learn how to do OLS regression modelling in R (markdown) and specifically how to use broom (Robinson 2016) and car (Fox and Weisberg 2011) for diagnostics, stargazer (Hlavac 2015) for model tables, data.table (Dowle et al. 2015) for data manipulation and ggplot (Wickham 2009) for results visualisation.
For more on regression try: http://socserv.socsci.mcmaster.ca/jfox/Courses/Brazil-2009/index.html
So many birds and just one stone…
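The code below assumes the packages have already been loaded. A minimal setup chunk (not shown in the original output) would look something like this:
# Setup: load the packages used below ----
library(broom)       # tidy model outputs
library(car)         # regression diagnostics
library(data.table)  # fast data manipulation
library(ggplot2)     # visualisation
library(stargazer)   # model result tables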
# R has a very useful built-in dataset called mtcars
# http://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html
# A data frame with 32 observations on 11 variables:
#  [, 1] mpg   Miles/(US) gallon
#  [, 2] cyl   Number of cylinders
#  [, 3] disp  Displacement (cu.in.)
#  [, 4] hp    Gross horsepower
#  [, 5] drat  Rear axle ratio
#  [, 6] wt    Weight (1000 lbs)
#  [, 7] qsec  1/4 mile time
#  [, 8] vs    V/S
#  [, 9] am    Transmission (0 = automatic, 1 = manual)
#  [,10] gear  Number of forward gears
#  [,11] carb  Number of carburetors
# Load mtcars ----
mtcars <- mtcars
summary(mtcars) # base method
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
pairs(~mpg + disp + hp + drat + wt, labels = c("Mpg", "Displacement", "Horsepower",
    "Rear axle ratio", "Weight"), data = mtcars, main = "Simple Scatterplot Matrix")
Test the normality of mpg (the outcome variable of interest) since linear models rest on this assumption (and quite a few others…). Strictly, the normality assumption applies to the model residuals, which we check after fitting, but a strongly non-normal outcome is an early warning sign.
# test with a histogram
hist(mtcars$mpg)
# test with a qq plot
qqnorm(mtcars$mpg)
qqline(mtcars$mpg, col = 2)
shapiro.test(mtcars$mpg)
##
## Shapiro-Wilk normality test
##
## data: mtcars$mpg
## W = 0.94756, p-value = 0.1229
# if p > 0.05 => consistent with normality
# is it? Beware: the Shapiro-Wilk test becomes over-sensitive as N grows
Mpg seems to approximate normality but with a slightly worrying tail effect at the right-hand extreme.
Run a model trying to see if qsec predicts mpg. This will print out a basic table of results.
# qsec = time to go 1/4 mile from stationary
mpgModel1 <- lm(mpg ~ qsec, mtcars)
# results?
summary(mpgModel1)
##
## Call:
## lm(formula = mpg ~ qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.8760 -3.4539 -0.7203 2.2774 11.6491
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.1140 10.0295 -0.510 0.6139
## qsec 1.4121 0.5592 2.525 0.0171 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.564 on 30 degrees of freedom
## Multiple R-squared: 0.1753, Adjusted R-squared: 0.1478
## F-statistic: 6.377 on 1 and 30 DF, p-value: 0.01708
Now run the standard diagnostics over the model results.
# Diagnostics ----
message("# Diagnostic plots")
## # Diagnostic plots
plot(mpgModel1)
message("# Normality of residuals")
## # Normality of residuals
hist(mpgModel1$residuals)
qqnorm(mpgModel1$residuals)
qqline(mpgModel1$residuals, col = 2)
shapiro.test(mpgModel1$residuals)
##
## Shapiro-Wilk normality test
##
## data: mpgModel1$residuals
## W = 0.94193, p-value = 0.08494
message("# Normality of standardised residuals")
## # Normality of standardised residuals
# it is usual to do these checks on standardised (here: studentised)
# residuals, but the results are much the same
# add the casewise diagnostics back into the data frame
mtcars$studentised.residuals <- rstudent(mpgModel1)
qqnorm(mtcars$studentised.residuals)
qqline(mtcars$studentised.residuals, col = 2)
shapiro.test(mtcars$studentised.residuals)
##
## Shapiro-Wilk normality test
##
## data: mtcars$studentised.residuals
## W = 0.93996, p-value = 0.07469
# if p > 0.05 => consistent with normality. Is it? But don't rely on the
# test, especially with large n
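To see why, here is a small simulated illustration (hypothetical, not part of the original analysis): the same mildly non-normal distribution tends to pass the test at small n and fail it decisively at large n.
# Shapiro-Wilk sensitivity to n (illustration: t-distributed draws stand
# in for 'mildly non-normal' data)
set.seed(1)
shapiro.test(rt(25, df = 5))    # mildly heavy-tailed: usually p > 0.05
shapiro.test(rt(2500, df = 5))  # same distribution: almost always p < 0.05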
Now try using the car package (Fox and Weisberg 2011) to do the same things.
# The 'car' package has some nice graphs to help here
car::qqPlot(mpgModel1) # shows default 95% CI
## Toyota Corolla Lotus Europa
## 20 28
car::spreadLevelPlot(mpgModel1)
##
## Suggested power transformation: -1.490415
message("# Do we think the variance of the residuals is constant?")
## # Do we think the variance of the residuals is constant?
message("# Did the plot suggest a transformation? If so, why?")
## # Did the plot suggest a transformation? If so, why?
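# (aside, not in the original: spreadLevelPlot suggested a power of about
# -1.5, i.e. an inverse transformation of mpg; a hypothetical re-fit on
# that scale would show whether it stabilises the residual spread)
mpgModel1t <- lm(I(mpg^(-1.5)) ~ qsec, data = mtcars)
car::spreadLevelPlot(mpgModel1t)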
message("# autocorrelation/independence of errors")
## # autocorrelation/independence of errors
car::durbinWatsonTest(mpgModel1)
## lag Autocorrelation D-W Statistic p-value
## 1 0.5922771 0.8065068 0
## Alternative hypothesis: rho != 0
# if p < 0.05 then we have a problem as it implies autocorrelation.
# What should we conclude? Why? Could you have spotted that in the model
# summary?
message("# homoskedasticity")
## # homoskedasticity
plot(mtcars$mpg, mpgModel1$residuals)
abline(h = mean(mpgModel1$residuals), col = "red") # add the mean of the residuals (yay, it's zero!)
message("# homoskedasticity: formal test")
## # homoskedasticity: formal test
car::ncvTest(mpgModel1)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.956535, Df = 1, p = 0.32806
# if p < 0.05 then there is evidence of heteroskedasticity.
# What do we conclude from the tests?
Go back to the model - what can we conclude from it?
summary(mpgModel1)
##
## Call:
## lm(formula = mpg ~ qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.8760 -3.4539 -0.7203 2.2774 11.6491
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.1140 10.0295 -0.510 0.6139
## qsec 1.4121 0.5592 2.525 0.0171 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.564 on 30 degrees of freedom
## Multiple R-squared: 0.1753, Adjusted R-squared: 0.1478
## F-statistic: 6.377 on 1 and 30 DF, p-value: 0.01708
There are an infinite number of ways to report regression model results and every journal has its own. We look at this in detail below using stargazer (Hlavac 2015) but the guidance here is also quite useful.
So our model was mostly OK (one violated assumption?) but the R-squared was quite low.
Maybe we should add the car’s weight?
# wt = weight of car
mpgModel2 <- lm(mpg ~ qsec + wt, mtcars)
# results?
summary(mpgModel2)
##
## Call:
## lm(formula = mpg ~ qsec + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3962 -2.1431 -0.2129 1.4915 5.7486
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.7462 5.2521 3.760 0.000765 ***
## qsec 0.9292 0.2650 3.506 0.001500 **
## wt -5.0480 0.4840 -10.430 2.52e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.596 on 29 degrees of freedom
## Multiple R-squared: 0.8264, Adjusted R-squared: 0.8144
## F-statistic: 69.03 on 2 and 29 DF, p-value: 9.395e-12
Run diagnostics.
# Diagnostics ----
car::qqPlot(mpgModel2) # shows default 95% CI
## Chrysler Imperial Fiat 128
## 17 18
car::spreadLevelPlot(mpgModel2)
##
## Suggested power transformation: 0.4388755
message("# Do we think the variance of the residuals is constant?")
## # Do we think the variance of the residuals is constant?
message("# Did the plot suggest a transformation? If so, why?")
## # Did the plot suggest a transformation? If so, why?
message("# autocorrelation/independence of errors")
## # autocorrelation/independence of errors
car::durbinWatsonTest(mpgModel2)
## lag Autocorrelation D-W Statistic p-value
## 1 0.2438102 1.49595 0.102
## Alternative hypothesis: rho != 0
# if p < 0.05 then we have a problem as it implies autocorrelation.
# What should we conclude? Why? Could you have spotted that in the model
# summary?
message("# homoskedasticity")
## # homoskedasticity
plot(mtcars$mpg, mpgModel2$residuals)
abline(h = mean(mpgModel2$residuals), col = "red") # add the mean of the residuals (yay, it's zero!)
message("# homoskedasticity: formal test")
## # homoskedasticity: formal test
car::ncvTest(mpgModel2)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.590986, Df = 1, p = 0.44204
# if p < 0.05 then there is evidence of heteroskedasticity
# but also
message("# additional assumption checks (now there are 2 predictors)")
## # additional assumption checks (now there are 2 predictors)
message("# -> collinearity")
## # -> collinearity
car::vif(mpgModel2)
## qsec wt
## 1.031487 1.031487
# if any values > 10 -> problem
message("# -> tolerance")
## # -> tolerance
1/car::vif(mpgModel2)
## qsec wt
## 0.9694744 0.9694744
# if any values < 0.2 -> possible problem
# if any values < 0.1 -> definitely a problem
# what do we conclude from the tests?
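As a cross-check on what VIF measures, a quick sketch (not in the original): a predictor's VIF is 1/(1 - R^2) from regressing it on the other predictors, which is why the two values match in a two-predictor model.
# VIF by hand (sketch): regress one predictor on the other
r2_wt <- summary(lm(wt ~ qsec, data = mtcars))$r.squared
1/(1 - r2_wt)  # should match car::vif(mpgModel2)["wt"]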
Whilst we’re here, we should also plot the residuals for model 2 against the fitted values (as opposed to the observed values which we did earlier). h/t to https://gist.github.com/apreshill/9d33891b5f9be4669ada20f76f101baa for this.
# save the residuals via broom
resids <- augment(mpgModel2)
# plot fitted vs residuals
ggplot(resids, aes(x = .fitted, y = .resid)) + geom_point(size = 1)
As with a plot of residuals against observed values, hopefully we didn’t see any obviously strange patterns (unlike that gist).
Now compare the models.
# comparing models
message("# test significant difference between models")
## # test significant difference between models
anova(mpgModel1, mpgModel2)
## Analysis of Variance Table
##
## Model 1: mpg ~ qsec
## Model 2: mpg ~ qsec + wt
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 928.66
## 2 29 195.46 1 733.19 108.78 2.519e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# what should we conclude from that?
Reporting should include coefficients, p values and confidence intervals for factors in each model as well as regression diagnostics so that readers can judge the goodness of fit and uncertainty for themselves (see https://amstat.tandfonline.com/doi/full/10.1080/00031305.2016.1154108 for guidance and also https://www.csus.edu/indiv/v/vangaasbeckk/courses/200a/sup/regressionresults.pdf – with the addition of e.g. 95% confidence intervals to the tables).
The p values tell you whether the ‘effect’ (coefficient) is statistically significant at a given threshold. By convention this is usually one of p < 0.05 (5%), p < 0.01 (1%), or p < 0.001 (0.1%). Sometimes this is (justifiably) relaxed to p < 0.1 (10%), for example. The choice of which to use is normative and driven by your appetite for Type 1 and Type 2 error risks and the uses to which you will put the results.
But only you can decide if it is IMPORTANT!
It is usually best to calculate and inspect confidence intervals for your estimates alongside the p values. A 95% confidence interval indicates the range of coefficient values compatible with the data at that level: where it excludes 0, the coefficient is statistically significant at p < 0.05. You can calculate an approximate interval using the standard error (s.e.) from the summary: estimate ± 1.96 × s.e. Or use confint(), which is more precise (it uses the exact t distribution).
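A quick sketch of both approaches for model 1 (the manual version uses the normal approximation, just as the plotting code towards the end does):
# manual approximation: estimate +/- 1.96 * s.e.
se1 <- summary(mpgModel1)$coefficients[, "Std. Error"]
cbind(lower = coef(mpgModel1) - qnorm(0.975) * se1,
      upper = coef(mpgModel1) + qnorm(0.975) * se1)
# exact t-based intervals
confint(mpgModel1)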
Print out the summaries again and calculate 95% confidence intervals and p values each time.
# Model 1: save the coefficients with confidence intervals and p values.
# The cbind function simply 'glues' the columns together side by side
mpgModel1Results_CI <- cbind(Coef = coef(mpgModel1), confint(mpgModel1), p = round(summary(mpgModel1)$coefficients[,
4], 3))
mpgModel1Results_CI
## Coef 2.5 % 97.5 % p
## (Intercept) -5.114038 -25.5970982 15.369022 0.614
## qsec 1.412125 0.2700654 2.554184 0.017
Notice that where p < 0.05 our 95% CI does not include 0. These tell you the same thing: that at a 95% threshold, the coefficient’s difference from 0 is statistically significant. Notice this is not the case for the constant (Intercept).
Now we do that again but for extra practice we use a Bonferroni correction to take into account the number of predictors we used. As with most things in statistics, there is active debate about the use of this…
# Model 1 (Bonferroni): use confint to report confidence intervals with a
# Bonferroni-corrected level
bc_p1 <- 0.05/length(mpgModel1$coefficients)
# save the results; cbind simply 'glues' the columns together side by side
mpgModel1Results_bf <- cbind(Coef = coef(mpgModel1), confint(mpgModel1, level = 1 -
bc_p1), p = round(summary(mpgModel1)$coefficients[, 4], 3))
mpgModel1Results_bf
## Coef 1.25 % 98.75 % p
## (Intercept) -5.114038 -28.77937200 18.551296 0.614
## qsec 1.412125 0.09263361 2.731616 0.017
The coefficient has not changed (of course) and nor has the default p value produced by the original lm model, but our confidence intervals have been adjusted. You can see that we are now using the more stringent 2.5% level and the intervals are wider. We would still conclude there is a statistically significant effect at this threshold but we are a bit less certain.
Is that the right interpretation?
Now do the same for model 2.
# Model 2 (Bonferroni): use confint to report confidence intervals with a
# Bonferroni-corrected level
bc_p2 <- 0.05/length(mpgModel2$coefficients)
# save the results; cbind simply 'glues' the columns together side by side
mpgModel2Results_bf <- cbind(Coef = coef(mpgModel2), confint(mpgModel2, level = 1 -
bc_p2), p = round(summary(mpgModel2)$coefficients[, 4], 3))
mpgModel2Results_bf
## Coef 0.833 % 99.167 % p
## (Intercept) 19.746223 6.4012162 33.091229 0.001
## qsec 0.929198 0.2558134 1.602583 0.001
## wt -5.047982 -6.2777749 -3.818189 0.000
Now we’ll try reporting the two models using the stargazer package (Hlavac 2015) to get pretty tables. Note we use options to automatically create the 95% CI and to report the results on just one line per predictor. This is especially helpful for models with a lot of variables. However, you will see that the default is to report p values as significance stars (*).
stargazer(mpgModel1, mpgModel2, title = "Model results", ci = TRUE, single.row = TRUE,
type = "html")
Model results. Dependent variable: mpg.

|  | (1) | (2) |
|---|---|---|
| qsec | 1.412** (0.316, 2.508) | 0.929*** (0.410, 1.449) |
| wt |  | -5.048*** (-5.997, -4.099) |
| Constant | -5.114 (-24.772, 14.544) | 19.746*** (9.452, 30.040) |
| Observations | 32 | 32 |
| R2 | 0.175 | 0.826 |
| Adjusted R2 | 0.148 | 0.814 |
| Residual Std. Error | 5.564 (df = 30) | 2.596 (df = 29) |
| F Statistic | 6.377** (df = 1; 30) | 69.033*** (df = 2; 29) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |  |
We can add the p values using stargazer options. Note that this puts them under the 95% CI - we may want them in a new column.
stargazer(mpgModel1, mpgModel2, title = "Model results", ci = TRUE, report = c("vcsp*"),
single.row = TRUE, type = "html")
Model results. Dependent variable: mpg.

|  | (1) | (2) |
|---|---|---|
| qsec | 1.412 (0.316, 2.508) | 0.929 (0.410, 1.449) |
|  | p = 0.018** | p = 0.002*** |
| wt |  | -5.048 (-5.997, -4.099) |
|  |  | p = 0.000*** |
| Constant | -5.114 (-24.772, 14.544) | 19.746 (9.452, 30.040) |
|  | p = 0.614 | p = 0.001*** |
| Observations | 32 | 32 |
| R2 | 0.175 | 0.826 |
| Adjusted R2 | 0.148 | 0.814 |
| Residual Std. Error | 5.564 (df = 30) | 2.596 (df = 29) |
| F Statistic | 6.377** (df = 1; 30) | 69.033*** (df = 2; 29) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |  |
Stargazer can simulate a range of journal output formats.
Let’s start with ‘all’ which prints everything. This is quite good for your first model report but probably wouldn’t go in an article.
stargazer(mpgModel1, mpgModel2, title = "Model results", ci = TRUE, style = "all",
single.row = TRUE, type = "html")
Model results. Dependent variable: mpg.

|  | (1) | (2) |
|---|---|---|
| qsec | 1.412** (0.316, 2.508) | 0.929*** (0.410, 1.449) |
|  | t = 2.525 | t = 3.506 |
|  | p = 0.018 | p = 0.002 |
| wt |  | -5.048*** (-5.997, -4.099) |
|  |  | t = -10.430 |
|  |  | p = 0.000 |
| Constant | -5.114 (-24.772, 14.544) | 19.746*** (9.452, 30.040) |
|  | t = -0.510 | t = 3.760 |
|  | p = 0.614 | p = 0.001 |
| Observations | 32 | 32 |
| R2 | 0.175 | 0.826 |
| Adjusted R2 | 0.148 | 0.814 |
| Residual Std. Error | 5.564 (df = 30) | 2.596 (df = 29) |
| F Statistic | 6.377** (df = 1; 30) (p = 0.018) | 69.033*** (df = 2; 29) (p = 0.000) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |  |
Now let’s try American Economic Review.
stargazer(mpgModel1, mpgModel2, title = "Model results", ci = TRUE, style = "aer",
single.row = TRUE, type = "html")
Model results. Dependent variable: mpg.

|  | (1) | (2) |
|---|---|---|
| qsec | 1.412** (0.316, 2.508) | 0.929*** (0.410, 1.449) |
| wt |  | -5.048*** (-5.997, -4.099) |
| Constant | -5.114 (-24.772, 14.544) | 19.746*** (9.452, 30.040) |
| Observations | 32 | 32 |
| R2 | 0.175 | 0.826 |
| Adjusted R2 | 0.148 | 0.814 |
| Residual Std. Error | 5.564 (df = 30) | 2.596 (df = 29) |
| F Statistic | 6.377** (df = 1; 30) | 69.033*** (df = 2; 29) |
| Notes: | ***Significant at the 1 percent level. |  |
|  | **Significant at the 5 percent level. |  |
|  | *Significant at the 10 percent level. |  |
Or perhaps Demography.
stargazer(mpgModel1, mpgModel2, title = "Model results", ci = TRUE, style = "demography",
single.row = TRUE, type = "html")
Model results. Dependent variable: mpg.

|  | Model 1 | Model 2 |
|---|---|---|
| qsec | 1.412* (0.316, 2.508) | 0.929** (0.410, 1.449) |
| wt |  | -5.048*** (-5.997, -4.099) |
| Constant | -5.114 (-24.772, 14.544) | 19.746*** (9.452, 30.040) |
| N | 32 | 32 |
| R2 | 0.175 | 0.826 |
| Adjusted R2 | 0.148 | 0.814 |
| Residual Std. Error | 5.564 (df = 30) | 2.596 (df = 29) |
| F Statistic | 6.377* (df = 1; 30) | 69.033*** (df = 2; 29) |
| Note: | *p < .05; **p < .01; ***p < .001 |  |
Now we’ll visualise them using broom and ggplot as we’ve found that non-statisticians can interpret plots more easily. This is especially useful for showing what the 95% confidence intervals ‘mean’.
Just to confuse you, we’re going to convert the data frames produced by broom into data.tables (Dowle et al. 2015). It makes little difference here (you can use data frames where we use data.tables) but data.table is good to get to know for data science because it does a lot of data wrangling tasks very fast.
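For a flavour of the syntax (an aside, not needed for what follows), a grouped summary is a one-liner in data.table:
# example data.table idiom: mean mpg by number of cylinders
as.data.table(mtcars)[, .(mean_mpg = mean(mpg)), by = cyl]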
# Broom converts the model results into a data frame (very useful!) We then
# convert it to a data.table. We like to use the DT suffix to remind us this
# is a data.table
mpgModel1DT <- as.data.table(tidy(mpgModel1))
# add the 95% CI
mpgModel1DT$ci_lower <- mpgModel1DT$estimate - qnorm(0.975) * mpgModel1DT$std.error
mpgModel1DT$ci_upper <- mpgModel1DT$estimate + qnorm(0.975) * mpgModel1DT$std.error
mpgModel1DT[, model := "Model 1"]  # add model label for ggplot to pick up
# repeat for model 2
mpgModel2DT <- as.data.table(tidy(mpgModel2))
mpgModel2DT$ci_lower <- mpgModel2DT$estimate - qnorm(0.975) * mpgModel2DT$std.error
mpgModel2DT$ci_upper <- mpgModel2DT$estimate + qnorm(0.975) * mpgModel2DT$std.error
mpgModel2DT[, model := "Model 2"]  # add model label for ggplot to pick up
# rbind the data tables so you have long form data
modelsDT <- rbind(mpgModel1DT, mpgModel2DT)
# plot the coefficients using colour to distinguish the models and plot the
# 95% CIs
ggplot(modelsDT, aes(x = term, y = estimate, fill = model)) + geom_bar(position = position_dodge(),
stat = "identity") + geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper),
width = 0.2, position = position_dodge(0.9)) + labs(title = "Model results",
x = "Variable", y = "Coefficient", caption = paste0("Model 1 & 2 results, Error bars = 95% CI",
"\n Data: mtcars")) + coord_flip() # rotate for legibility
So now we can easily ‘see’ and interpret our results.
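Bar charts can over-emphasise the baseline at zero; a common alternative (a sketch using the same modelsDT) is a point-and-range plot:
# alternative presentation: point estimates with 95% CI ranges
ggplot(modelsDT, aes(x = term, y = estimate, colour = model)) +
    geom_pointrange(aes(ymin = ci_lower, ymax = ci_upper),
        position = position_dodge(0.5)) +
    coord_flip() +
    labs(title = "Model results", x = "Variable", y = "Coefficient")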
Analysis completed in: 7.82 seconds using knitr in RStudio with R version 3.5.1 (2018-07-02) running on x86_64-apple-darwin15.6.0.
R packages used (stargazer, car, broom, ggplot2, data.table, knitr):
Dowle, M, A Srinivasan, T Short, S Lianoglou with contributions from R Saporta, and E Antonyan. 2015. Data.table: Extension of Data.frame. https://CRAN.R-project.org/package=data.table.
Fox, John, and Sanford Weisberg. 2011. An R Companion to Applied Regression. Second. Thousand Oaks CA: Sage. http://socserv.socsci.mcmaster.ca/jfox/Books/Companion.
Hlavac, Marek. 2015. Stargazer: Well-Formatted Regression and Summary Statistics Tables. Cambridge, USA: Harvard University. http://CRAN.R-project.org/package=stargazer.
R Core Team. 2016. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Robinson, David. 2016. Broom: Convert Statistical Analysis Objects into Tidy Data Frames. https://CRAN.R-project.org/package=broom.
Wickham, Hadley. 2009. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.
Xie, Yihui. 2016. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://CRAN.R-project.org/package=knitr.