Lab 3: Generalized Linear Models

Published

October 26, 2022

R Packages Used

library(VGAM) # Multinomial and Ordinal Regression
Loading required package: stats4
Loading required package: splines
library(survival) # Data sets
library(MASS) # Data Set
library(GLMsData) # Data Set
library(tidyverse)
── Attaching packages ─────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.5
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.1
✔ readr   2.1.2     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
✖ dplyr::select() masks MASS::select()
library(haven) # Reading Stata File

References

For more information on this topic, I recommend reading the following books:

  1. Generalized Linear Models with Examples in R - Dunn

  2. Vectorized Generalized Linear and Additive Models - Yee

Generalized Linear Models

In linear regression, we fit a model between a set of predictors and an outcome variable that is assumed to be normally distributed. When the outcome is not normally distributed, you should instead use a generalized linear model to assess the association between the predictors and the outcome.
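As a quick orientation, the main thing that changes across the models in this lab is the family argument of glm() (or vglm() for the categorical outcomes). A minimal sketch using the built-in mtcars data, which is not one of the lab data sets:

# Illustrative only: the family argument matches the outcome type
fit_normal <- glm(mpg ~ wt, data = mtcars, family = gaussian)  # continuous outcome (equivalent to lm())
fit_binary <- glm(am ~ wt, data = mtcars, family = binomial)   # binary outcome (logistic regression)
fit_count  <- glm(carb ~ wt, data = mtcars, family = poisson)  # count outcome (Poisson regression)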

Logistic Regression

A logistic regression model is used when the outcome is binary (yes/no). The link function is the logit, which models the association between a set of linear predictors and the log-odds of observing a yes:

\[ \log\left\{\frac{P(Y=1)}{P(Y=0)}\right\} = \boldsymbol{X^T\beta} \]

To fit a logistic regression, use the glm() function and specify family=binomial. Then use the summary() function to obtain the coefficients, or broom::tidy()1 for a tidy table of results.

logit_glm <- glm(y ~ ap + hilo, data = bacteria, 
                 family = binomial)
summary(logit_glm)

Call:
glm(formula = y ~ ap + hilo, family = binomial, data = bacteria)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1249   0.4701   0.5885   0.6778   0.8357  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.3539     0.2855   4.743 2.11e-06 ***
app           0.7933     0.3748   2.116   0.0343 *  
hilolo       -0.4816     0.3480  -1.384   0.1664    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 217.38  on 219  degrees of freedom
Residual deviance: 209.87  on 217  degrees of freedom
AIC: 215.87

Number of Fisher Scoring iterations: 4
## OR ##

logit_glm %>% broom::tidy(exponentiate = TRUE)
# A tibble: 3 × 5
  term        estimate std.error statistic    p.value
  <chr>          <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)    3.87      0.285      4.74 0.00000211
2 app            2.21      0.375      2.12 0.0343    
3 hilolo         0.618     0.348     -1.38 0.166     

An exponentiated \(\beta\) coefficient is interpreted as an odds ratio. The odds of success when ap=p are 2.21 times the odds when ap=a, after adjusting for hilo.
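If you also want confidence intervals around the odds ratios, or the fitted probability of success for each observation, a short follow-up sketch using the same fitted object:

# Odds ratios with 95% confidence intervals
logit_glm %>% broom::tidy(exponentiate = TRUE, conf.int = TRUE)

# Fitted probabilities of y = "y" for the first few observations
head(predict(logit_glm, type = "response"))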

Poisson Regression

Poisson regression is used when the outcome is count data. The link function is the log of the mean count:

\[ \log(\mu)=\boldsymbol{X^T\beta} \]

To fit a Poisson regression, use the glm() function and specify family=poisson. Then use the summary() function to obtain the coefficients, or broom::tidy()2 for a tidy table of results.

pois_glm <- glm(recur ~ treatment + number, data = bladder1, 
                family = poisson)
summary(pois_glm)

Call:
glm(formula = recur ~ treatment + number, family = poisson, data = bladder1)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.2327  -1.7483  -0.0463   0.9781   2.2114  

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)          1.00918    0.06057  16.661  < 2e-16 ***
treatmentpyridoxine  0.25506    0.06889   3.702 0.000214 ***
treatmentthiotepa   -0.45167    0.08626  -5.236 1.64e-07 ***
number               0.11603    0.01620   7.164 7.82e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 868.47  on 293  degrees of freedom
Residual deviance: 772.19  on 290  degrees of freedom
AIC: 1529.5

Number of Fisher Scoring iterations: 5
pois_glm %>% broom::tidy(exponentiate = TRUE)
# A tibble: 4 × 5
  term                estimate std.error statistic  p.value
  <chr>                  <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)            2.74     0.0606     16.7  2.51e-62
2 treatmentpyridoxine    1.29     0.0689      3.70 2.14e- 4
3 treatmentthiotepa      0.637    0.0863     -5.24 1.64e- 7
4 number                 1.12     0.0162      7.16 7.82e-13

The line pois_glm %>% broom::tidy(exponentiate = TRUE) provides the exponentiated values of the \(\beta\) coefficients. Interpreting the number variable: for each additional initial tumor, the mean number of recurrences increases by a factor of 1.12, adjusting for treatment status.
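As with the logistic model, you can pull rate ratios with confidence intervals and fitted mean counts from the same object; a brief sketch:

# Rate ratios with 95% confidence intervals
pois_glm %>% broom::tidy(exponentiate = TRUE, conf.int = TRUE)

# Fitted mean number of recurrences for the first few subjects
head(predict(pois_glm, type = "response"))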

Multinomial Regression

Multinomial regression is used when the outcome is a categorical variable with 3 or more categories. We fit a model with a logit link function for each category relative to the reference category. For \(j=1,\ldots,m\) categories and reference level \(2\), the logit link function is modeled as:

\[ \log\left\{\frac{P(Y=j)}{P(Y=2)}\right\}=\boldsymbol{X^T\beta}\ \ j\in\{1,3,4,5,\ldots,m\} \]

You can fit a multinomial regression model by using the vglm() function and setting family = multinomial(refLevel=k), where k is your reference group. For more information on the data, visit this site: https://stats.oarc.ucla.edu/r/dae/multinomial-logistic-regression/.

ml <- read_dta("https://stats.idre.ucla.edu/stat/data/hsbdemo.dta")
mglm <- vglm(factor(prog) ~ factor(ses) + write, data = ml, family = multinomial(refLevel = "2"))
summary(mglm)

Call:
vglm(formula = factor(prog) ~ factor(ses) + write, family = multinomial(refLevel = "2"), 
    data = ml)

Coefficients: 
               Estimate Std. Error z value Pr(>|z|)    
(Intercept):1   2.85219    1.16643   2.445  0.01448 *  
(Intercept):2   5.21820    1.16351   4.485 7.30e-06 ***
factor(ses)2:1 -0.53329    0.44373  -1.202  0.22943    
factor(ses)2:2  0.29139    0.47637   0.612  0.54074    
factor(ses)3:1 -1.16283    0.51422  -2.261  0.02374 *  
factor(ses)3:2 -0.98267    0.59553  -1.650  0.09893 .  
write:1        -0.05793    0.02141  -2.706  0.00682 ** 
write:2        -0.11360    0.02222  -5.113 3.17e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Names of linear predictors: log(mu[,1]/mu[,2]), log(mu[,3]/mu[,2])

Residual deviance: 359.9635 on 392 degrees of freedom

Log-likelihood: -179.9817 on 392 degrees of freedom

Number of Fisher scoring iterations: 4 

No Hauck-Donner effect found in any of the estimates


Reference group is level  2  of the response
exp(coef(mglm))
 (Intercept):1  (Intercept):2 factor(ses)2:1 factor(ses)2:2 factor(ses)3:1 
    17.3256181    184.6016219      0.5866710      1.3382906      0.3125996 
factor(ses)3:2        write:1        write:2 
     0.3743102      0.9437175      0.8926126 

The line exp(coef(mglm)) provides the exponentiated \(\beta\) coefficients for each category. Interpreting write for group 1 (relative to the reference group, level 2): as the writing score increases by 1 unit, the odds of being in group 1 rather than the reference group decrease by a factor of 0.94, adjusting for socioeconomic status.
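To read the model on the probability scale, you can extract the fitted category probabilities from the vglm object; a minimal sketch:

# Fitted probabilities for each level of factor(prog), one row per student
# (each row sums to 1)
head(fitted(mglm))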

Ordinal Regression

Ordinal regression is a subclass of multinomial regression in which the categories have an ordinal component. For example, a Likert scale (Strongly Disagree to Strongly Agree) can be considered ordinal. An ordinal regression model uses the logit link function to model the probability of observing a response at or above a given category:

\[ \log\left\{\frac{P(Y\ge j)}{P(Y<j)}\right\}=\boldsymbol{X^T\beta}\ \ j\in\{2,3,\ldots,m\} \]

You can fit an ordinal regression model by using the vglm() function and setting family = propodds.

dat <- read_dta("https://stats.idre.ucla.edu/stat/data/ologit.dta")
oglm <- vglm(apply~pared+public+gpa, data = dat, family = propodds)
summary(oglm)

Call:
vglm(formula = apply ~ pared + public + gpa, family = propodds, 
    data = dat)

Coefficients: 
              Estimate Std. Error z value Pr(>|z|)    
(Intercept):1 -2.20335    0.78440  -2.809  0.00497 ** 
(Intercept):2 -4.29879    0.80915  -5.313 1.08e-07 ***
pared          1.04766    0.26845   3.903 9.52e-05 ***
public        -0.05867    0.28861  -0.203  0.83891    
gpa            0.61575    0.26258   2.345  0.01903 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Names of linear predictors: logitlink(P[Y>=2]), logitlink(P[Y>=3])

Residual deviance: 717.0249 on 795 degrees of freedom

Log-likelihood: -358.5124 on 795 degrees of freedom

Number of Fisher scoring iterations: 4 

No Hauck-Donner effect found in any of the estimates


Exponentiated coefficients:
    pared    public       gpa 
2.8509581 0.9430165 1.8510513 
exp(coef(oglm))
(Intercept):1 (Intercept):2         pared        public           gpa 
   0.11043293    0.01358498    2.85095814    0.94301649    1.85105132 

Interpreting gpa: as GPA increases by 1 point, the odds of being in a higher apply category increase by a factor of 1.851, adjusting for parental undergraduate education and public vs. private college.
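The proportional-odds model can also be read on the probability scale; a minimal sketch that pulls the fitted probabilities of each apply category from the vglm object:

# Fitted probabilities for each apply category, one row per student
# (each row sums to 1)
head(fitted(oglm))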

Questions

  1. What is the link function for a gamma distribution?

  2. When would you use gamma regression?

  3. The lime data set contains 385 observations on small-leaved lime trees. The Foliage variable measures the foliage biomass. Fit a regression model between Foliage and the following covariates: DBH (tree diameter) and Age. Make sure to print out your results.

Submit your assignment as an HTML file. You can use either a QMD or RMD file to create the HTML.

Footnotes

  1. The double colon tells R to use a specific function from a package. Here, use the tidy function from the broom package.↩︎

  2. The double colon tells R to use a specific function from a package. Here, use the tidy function from the broom package.↩︎