For more information on this topic, I recommend you reading the following books:
Generalized Linear Models with Examples in R - Dunn
Vectorized Generalized Linear and Additive Models - Yee
Generalized Linear Models
In linear regression, we are fitting a model between a set of predictors and an outcome variable. The outcome variable is generally normally distributed. However, if the outcome is not normally distributed, you must use a generalized linear model to assess the association between a set of predictors and an outcome variable.
Link Functions
Due to the outcome variable being non-normally distributed, we must use link functions. The link function allows us to link the outcome variable to a linear model.
Logistic Regression
A logistic regression model is used when the outcome is binary, yes and no. The link function is the logit function that models the association between a set of linear predictors and the odds of observing yes:
\[
\log\left\{\frac{P(Y=1)}{P(Y=0)}\right\} = \boldsymbol{X^T\beta}
\] Fitting a logistic regression requires you to use the glm() function and specifying family=binomial. Then use the summary function to obtain the coefficients and broom::tidy()1
logit_glm <-glm(y ~ ap + hilo, data = bacteria, family = binomial)summary(logit_glm)
Call:
glm(formula = y ~ ap + hilo, family = binomial, data = bacteria)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1249 0.4701 0.5885 0.6778 0.8357
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.3539 0.2855 4.743 2.11e-06 ***
app 0.7933 0.3748 2.116 0.0343 *
hilolo -0.4816 0.3480 -1.384 0.1664
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 217.38 on 219 degrees of freedom
Residual deviance: 209.87 on 217 degrees of freedom
AIC: 215.87
Number of Fisher Scoring iterations: 4
A \(\beta\) coefficient is interpreted as an odds ratio. The odds of success when ap=p is 2.21 times higher than when ap=a, after adjusting for hilo.
Poisson Regression
Poisson regression is used when the outcome is count data. The link function is the log of the mean count:
\[
\log(\mu)=\boldsymbol{X^T\beta}
\]
Fitting a Poisson regression requires you to use the glm() function and specifying family=poisson. Then use the summary function to obtain the coefficients and broom::tidy()2
pois_glm <-glm(recur ~ treatment + number, data = bladder1, family = poisson)summary(pois_glm)
Call:
glm(formula = recur ~ treatment + number, family = poisson, data = bladder1)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.2327 -1.7483 -0.0463 0.9781 2.2114
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.00918 0.06057 16.661 < 2e-16 ***
treatmentpyridoxine 0.25506 0.06889 3.702 0.000214 ***
treatmentthiotepa -0.45167 0.08626 -5.236 1.64e-07 ***
number 0.11603 0.01620 7.164 7.82e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 868.47 on 293 degrees of freedom
Residual deviance: 772.19 on 290 degrees of freedom
AIC: 1529.5
Number of Fisher Scoring iterations: 5
The line pois_glm %>% broom::tidy(exp=T) provides the exponentiated values of the \(\beta\) coefficients. Interpreting the number variable, as the initial number of tumors increases, the mean number of recurrences increases by a factor of 1.12, adjusting for treatment status.
Multinomial Regression
Multinomial Regression is used when the outcome is categorical variable with 3 or more categories. We fit a model using a logit link function for each category and the reference variable. For \(k=1,\ldots,m\) categories and reference level \(2\), the logit link function is modeled as:
You can fit a multinomial regression model when using the vglm() function and setting family = multinomial(refLevel=k), where k is your reference group. For more information of the data, visit this site: https://stats.oarc.ucla.edu/r/dae/multinomial-logistic-regression/.
ml <-read_dta("https://stats.idre.ucla.edu/stat/data/hsbdemo.dta")mglm <-vglm(factor(prog) ~factor(ses) + write, data = ml, family =multinomial(refLevel ="2"))summary(mglm)
Call:
vglm(formula = factor(prog) ~ factor(ses) + write, family = multinomial(refLevel = "2"),
data = ml)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept):1 2.85219 1.16643 2.445 0.01448 *
(Intercept):2 5.21820 1.16351 4.485 7.30e-06 ***
factor(ses)2:1 -0.53329 0.44373 -1.202 0.22943
factor(ses)2:2 0.29139 0.47637 0.612 0.54074
factor(ses)3:1 -1.16283 0.51422 -2.261 0.02374 *
factor(ses)3:2 -0.98267 0.59553 -1.650 0.09893 .
write:1 -0.05793 0.02141 -2.706 0.00682 **
write:2 -0.11360 0.02222 -5.113 3.17e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Names of linear predictors: log(mu[,1]/mu[,2]), log(mu[,3]/mu[,2])
Residual deviance: 359.9635 on 392 degrees of freedom
Log-likelihood: -179.9817 on 392 degrees of freedom
Number of Fisher scoring iterations: 4
No Hauck-Donner effect found in any of the estimates
Reference group is level 2 of the response
The line exp(coef(mglm)) provides the exponentiated \beta coefficients for categorical variable. Interpreting write for group=1 (academic): as writing score increased by 1 unit, the odds of of being in the academic group decreases by a factor of 0.94, adjusting of social economic status.
Ordinal Regression
Ordinal regression is a subclass of multinomial regression, but the categories have an ordinal component to it. For example, a likert scale can be considered ordinal: Strongly Disagree to Strongly Agree. An ordinal regression model uses the logit link function for model the probability of observing something greater:
\[
\log\left\{\frac{P(Y\ge j)}{P(Y<j)}\right\}=\boldsymbol{X^T\beta}\ \ j\in\{1,2,\ldots,m\}
\] You can fit a ordinal regression model when using the vglm() function and setting family = propodds.
dat <-read_dta("https://stats.idre.ucla.edu/stat/data/ologit.dta")oglm <-vglm(apply~pared+public+gpa, data = dat, family = propodds)summary(oglm)
Call:
vglm(formula = apply ~ pared + public + gpa, family = propodds,
data = dat)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept):1 -2.20335 0.78440 -2.809 0.00497 **
(Intercept):2 -4.29879 0.80915 -5.313 1.08e-07 ***
pared 1.04766 0.26845 3.903 9.52e-05 ***
public -0.05867 0.28861 -0.203 0.83891
gpa 0.61575 0.26258 2.345 0.01903 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Names of linear predictors: logitlink(P[Y>=2]), logitlink(P[Y>=3])
Residual deviance: 717.0249 on 795 degrees of freedom
Log-likelihood: -358.5124 on 795 degrees of freedom
Number of Fisher scoring iterations: 4
No Hauck-Donner effect found in any of the estimates
Exponentiated coefficients:
pared public gpa
2.8509581 0.9430165 1.8510513
exp(coef(oglm))
(Intercept):1 (Intercept):2 pared public gpa
0.11043293 0.01358498 2.85095814 0.94301649 1.85105132
Intepreting gpa, as gpa increases by 1 point, the odds of being likely to apply increases by a factor of 1.851, adjusting for parental undergraduate education and plubic vs private college.
Questions
What is the link function for a gamma distribution?
When will you use gamma regression?
The lime data set contains 385 observations on small-leaved lime trees. The Foliage variable measures the foliage biomass. Fit a regression model between Foliage and the following covariates: DBH tree diameter and Age. Make sure to print out your results.
Submit your assignment as an html file. You can use either a QMD or RMD file to create the html
Footnotes
The double colon tells R to use a specific function from a package. Here, use the tidy function from the broom package.↩︎
The double colon tells R to use a specific function from a package. Here, use the tidy function from the broom package.↩︎