Dr. Hua Zhou’s slides
Now we have learnt regression modeling of binomial, multinomial, and count responses. How about nonnegative real responses and other response types? The generalized linear model (GLM) is a generic framework that encompasses normal regression, binomial regression, multinomial regression, and others. There are two essential components of the GLM framework: a distribution for the response \(Y\) and a link function that relates the mean of \(Y\) to the covariates \(\mathbf{x}\).
In GLM, the distribution of \(Y\) is from the exponential family of distributions of the form \[ f(y \mid \theta, \phi) = \exp \left[ \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi) \right]. \] \(\theta\) is called the canonical parameter or natural parameter and represents the location, while \(\phi\) is the dispersion parameter and represents the scale. Note the canonical parameter \(\theta\) is not necessarily the mean \(\mu\).
Normal or Gaussian: \[ f(y \mid \theta, \phi) = \frac{1}{\sqrt{2\pi}\sigma} \exp \left[ - \frac{(y - \mu)^2}{2\sigma^2} \right] = \exp \left[ \frac{y\mu - \mu^2/2}{\sigma^2} - \frac 12 \left( \frac{y^2}{\sigma^2} + \log(2\pi \sigma^2) \right) \right]. \] So we can write \[\begin{eqnarray*} \theta &=& \mu \\ \phi &=& \sigma^2 \\ a(\phi) &=& \phi \\ b(\theta) &=& \theta^2/2 \\ c(y, \phi) &=& -\frac 12 (y^2/\phi + \log(2\pi \phi)). \end{eqnarray*}\]
Binomial: If we treat \(Y\) as a binomial proportion; that is \(nY \sim \text{Bin}(n, p)\), then the density is \[\begin{eqnarray*} & & f(y \mid \theta, \phi) = \binom{n}{ny} p^{ny} (1 -p)^{n(1-y)} \\ &=& \exp \left[ ny \log p + n(1 - y) \log (1 - p) + \log \binom{n}{ny} \right] \\ &=& \exp \left[ \frac{y \log \frac{p}{1 - p} + \log (1 - p)}{1/n} + \log \binom{n}{ny} \right]. \end{eqnarray*}\] So we see \[\begin{eqnarray*} \theta &=& \log \frac{p}{1 - p} \\ \phi &=& 1 \\ a(\phi) &=& \frac{1}{n} \\ b(\theta) &=& - \log (1 - p) = \log (1 + \exp \theta) \\ c(y, \phi) &=& \log \binom{n}{ny}. \end{eqnarray*}\]
Poisson: \[ f(y \mid \theta, \phi) = e^{-\mu} \frac{\mu^y}{y!} = \exp (y \log \mu - \mu - \log y!). \] So we have \[\begin{eqnarray*} \theta &=& \log \mu \\ \phi &=& 1 \\ a(\phi) &=& 1 \\ b(\theta) &=& \exp \theta \\ c(y, \phi) &=& - \log y!. \end{eqnarray*}\]
Gamma has density \[ f(y \mid \nu, \lambda) = \frac{1}{\Gamma(\nu)} \lambda^{\nu} y^{\nu - 1} e^{-\lambda y}, \quad y > 0, \] where \(\nu\) is the shape parameter and \(\lambda\) is the rate parameter. For the purpose of GLM, it’s convenient to reparameterize by \(\lambda = \nu / \mu\) to get \[ f(y) = \frac{1}{\Gamma(\nu)} \left( \frac{\nu}{\mu} \right)^{\nu} y^{\nu - 1} e^{-y\nu / \mu} = \exp \left\{ \frac{- y \mu^{-1} - \log \mu}{\nu^{-1}} + (\nu-1) \log y + \nu \log \nu - \log \Gamma(\nu) \right\}. \] Now \(\mathbb{E}Y = \mu\) and \(\operatorname{Var}(Y) = \mu^2 / \nu = (\mathbb{E} Y)^2 / \nu\). So we have \[\begin{eqnarray*} \theta &=& - \mu^{-1} \\ \phi &=& \nu^{-1} \\ a(\phi) &=& \phi \\ b(\theta) &=& - \log (- \theta) \\ c(y, \phi) &=& (\phi^{-1} - 1) \log y - \phi^{-1} \log (\phi) - \log \Gamma(\phi^{-1}). \end{eqnarray*}\] Some books remove the minus sign in the canonical parameter/link, which is fine provided we take account of this in any derivations. For the canonical link \(\eta = \mu^{-1}\), the linear predictor must be positive, which the systematic component \(\mathbf{x}^T \boldsymbol{\beta}\) does not guarantee; this can cause problems. Other possible link functions are the log link \(\eta = \log \mu\) and the identity link \(\eta = \mu\).
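These link choices are available through R’s Gamma family object; a quick illustration of the link functions, evaluated at \(\mu = 2\):
Gamma(link = "inverse")$linkfun(2)  # canonical (inverse) link: 1/mu = 0.5
Gamma(link = "log")$linkfun(2)      # log link: log(2)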
Many other distributions, e.g., the inverse Gaussian, also fit into this framework.
Exponential family distributions have mean and variance \[\begin{eqnarray*} \mathbb{E}Y &=& \mu = b'(\theta) \\ \operatorname{Var}Y &=& \sigma^2 = b''(\theta) a(\phi). \end{eqnarray*}\] Show this in Friday’s Lab session. Thus the function \(b\) determines the moments of \(Y\).
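As a quick check against the Poisson case above: \(b(\theta) = e^{\theta}\) and \(a(\phi) = 1\), so \(\mathbb{E}Y = b'(\theta) = e^{\theta} = \mu\) and \(\operatorname{Var} Y = b''(\theta) a(\phi) = e^{\theta} = \mu\), recovering the familiar fact that the Poisson mean and variance coincide.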
Given the linear predictor or systematic component \[ \eta = \mathbf{x}^T \boldsymbol{\beta}, \] the link function, \(g\), relates the mean \(\mathbb{E} Y = \mu\) to the covariates \[ \eta = g(\mu). \]
In principle, any monotone, continuous, and differentiable function can be a link function. But there are some convenient and common choices for the standard GLMs.
For Gaussian linear model, the identity link, \(\eta = \mu\), is the obvious choice.
For the binomial model, we saw the logit, probit, and cloglog (complementary log-log) links.
For Poisson model, a standard choice is \(\eta = \log \mu\).
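In R, each family object bundles the link function and its inverse; a quick illustration:
fam <- binomial(link = "cloglog")
fam$linkfun(0.5)  # eta = g(mu)
fam$linkinv(0)    # mu = g^{-1}(eta)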
The canonical link has \(g\) such that \(\eta = g(\mu) = \theta\), the canonical parameter of the exponential family distribution. This means that \(g(b'(\theta))=\theta\). If a canonical link is used, \(\mathbf{X}^T \mathbf{y}\) is sufficient for \(\boldsymbol{\beta}\). The canonical link is mathematically and computationally convenient and is often the natural choice of link.
Family | Canonical Link | Variance Function |
---|---|---|
Normal | \(\eta=\mu\) | 1 |
Poisson | \(\eta=\log \mu\) | \(\mu\) |
Bernoulli | \(\eta=\log \left( \frac{\mu}{1 - \mu} \right)\) | \(\mu (1 - \mu)\) |
Gamma | \(\eta = \mu^{-1}\) | \(\mu^2\) |
Inverse Gaussian | \(\eta = \mu^{-2}\) | \(\mu^3\) |
GLM regression coefficients are estimated by MLE. Recall that the Newton-Raphson algorithm for maximizing a log-likelihood \(\ell(\boldsymbol{\beta})\) proceeds as \[ \boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + s [- \nabla^2 \ell(\boldsymbol{\beta}^{(t)})]^{-1} \nabla \ell(\boldsymbol{\beta}^{(t)}), \] where \(s>0\) is a step length, \(\nabla \ell\) is the score (gradient) vector, and \(-\nabla^2 \ell\) is the observed information matrix (negative Hessian).
For GLM, \[\begin{eqnarray*} \ell(\boldsymbol{\beta}) &=& \sum_{i=1}^n \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \\ \nabla \ell(\boldsymbol{\beta}) &=& \sum_{i=1}^n \frac{(y_i - \mu_i) \mu_i'(\eta_i)}{\sigma_i^2} \mathbf{x}_i \\ - \nabla^2 \ell(\boldsymbol{\beta}) &=& \sum_{i=1}^n \frac{[\mu_i'(\eta_i)]^2}{\sigma_i^2} \mathbf{x}_i \mathbf{x}_i^T - \sum_{i=1}^n \frac{(y_i - \mu_i) \mu_i''(\eta_i)}{\sigma_i^2} \mathbf{x}_i \mathbf{x}_i^T \\ & & + \sum_{i=1}^n \frac{(y_i - \mu_i) [\mu_i'(\eta_i)]^2 (d \sigma_i^{2} / d\mu_i)}{\sigma_i^4} \mathbf{x}_i \mathbf{x}_i^T. \end{eqnarray*}\] Show this in Friday’s Lab session. For GLMs with canonical links, the second term and third term cancel using the fact \[ \frac{d\mu_i}{d \eta_i} = \frac{d\mu_i}{d \theta_i} = \frac{d \, b'(\theta_i)}{d \theta_i} = b''(\theta_i) = \frac{\sigma_i^2}{a(\phi)}. \] Therefore, for a canonical link, the negative Hessian is positive semidefinite and Newton’s algorithm with line search is stable.
See Biostat 216 lecture notes for a refresher of matrix calculus.
How about non-canonical link? We use the expected (Fisher) information matrix \[ \mathbb{E} [- \nabla^2 \ell(\boldsymbol{\beta})] = \sum_{i=1}^n \frac{[\mu_i'(\eta_i)]^2}{\sigma_i^2} \mathbf{x}_i \mathbf{x}_i^T = \mathbf{X}^T \mathbf{W} \mathbf{X} \succeq 0, \] where \(\mathbf{W} = \text{diag}([\mu_i'(\eta_i)]^2/\sigma_i^2)\). This modified Newton-Raphson algorithm is called the Fisher scoring algorithm.
Take the binomial logistic regression as an example \[\begin{eqnarray*} \ell(\boldsymbol{\beta}) &=& \sum_{i=1}^n [y_i \log p_i + (n_i - y_i) \log (1 - p_i)] = \sum_{i=1}^n [y_i \mathbf{x}_i^T \boldsymbol{\beta} - n_i \log (1 + e^{\mathbf{x}_i^T \boldsymbol{\beta}})] \\ \nabla \ell(\boldsymbol{\beta}) &=& \sum_{i=1}^n \left( y_i \mathbf{x}_i - n_i \frac{\exp \mathbf{x}_i^T \boldsymbol{\beta}}{1 + \exp \mathbf{x}_i^T \boldsymbol{\beta}} \mathbf{x}_i \right) = \sum_{i=1}^n (y_i - n_i p_i) \mathbf{x}_i = \mathbf{X}^T (\mathbf{y} - \widehat{\boldsymbol{\mu}}) \\ - \nabla^2 \ell(\boldsymbol{\beta}) &=& \sum_{i=1}^n n_i p_i (1 - p_i) \mathbf{x}_i \mathbf{x}_i^T = \mathbf{X}^T \mathbf{W} \mathbf{X}, \quad \mathbf{W} = \text{diag}(w_1, \ldots, w_n), w_i = n_i p_i (1 - p_i) \\ \mathbb{E} [- \nabla^2 \ell(\boldsymbol{\beta})] &=& - \nabla^2 \ell(\boldsymbol{\beta}). \end{eqnarray*}\] The Fisher scoring algorithm proceeds as \[\begin{eqnarray*} \boldsymbol{\beta}^{(t+1)} &=& \boldsymbol{\beta}^{(t)} + s(\mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X})^{-1} \mathbf{X}^T (\mathbf{y} - \widehat{\boldsymbol{\mu}}^{(t)}) \\ &=& (\mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W}^{(t)} [\mathbf{X} \boldsymbol{\beta}^{(t)} + s (\mathbf{W}^{(t)})^{-1} (\mathbf{y} - \widehat{\boldsymbol{\mu}}^{(t)})] \\ &=& (\mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W}^{(t)} \mathbf{z}^{(t)}, \end{eqnarray*}\] where \[ \mathbf{z}^{(t)} = \mathbf{X} \boldsymbol{\beta}^{(t)} + s (\mathbf{W}^{(t)})^{-1} (\mathbf{y} - \widehat{\boldsymbol{\mu}}^{(t)}) \] are working responses. In this sense, the Fisher scoring algorithm for GLM is also called the IRWLS (iteratively reweighted least squares).
Recall the insecticide data `bliss` we used for binomial regression.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(faraway)
bliss
## dead alive conc
## 1 2 28 0
## 2 8 22 1
## 3 15 15 2
## 4 23 7 3
## 5 27 3 4
Let’s manually implement the first iteration of the Fisher scoring algorithm.
p <- bliss$dead / 30      # observed proportions of dead insects
eta <- logit(p)           # initial linear predictor
z <- eta                  # initial working response
w <- 30 * p * (1 - p)     # initial working weights
lmod <- lm(z ~ conc, weights = w, data = bliss)
coef(lmod)
## (Intercept) conc
## -2.302462 1.153587
The Fisher scoring algorithm converges within a few iterations.
y <- bliss$dead
for (iter in 1:5) {
  eta <- lmod$fit                # current linear predictor
  p <- ilogit(eta)               # current fitted probabilities
  w <- 30 * p * (1 - p)          # working weights
  z <- eta + (y - 30 * p) / w    # working response
  lmod <- lm(z ~ conc, weights = w, data = bliss)
  cat(iter, coef(lmod), "\n")
}
## 1 -2.323672 1.161847
## 2 -2.32379 1.161895
## 3 -2.32379 1.161895
## 4 -2.32379 1.161895
## 5 -2.32379 1.161895
Compare to the glm fit.
glm(cbind(dead, alive) ~ conc, family = binomial, data = bliss) %>%
summary()
##
## Call:
## glm(formula = cbind(dead, alive) ~ conc, family = binomial, data = bliss)
##
## Deviance Residuals:
## 1 2 3 4 5
## -0.4510 0.3597 0.0000 0.0643 -0.2045
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3238 0.4179 -5.561 2.69e-08 ***
## conc 1.1619 0.1814 6.405 1.51e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 64.76327 on 4 degrees of freedom
## Residual deviance: 0.37875 on 3 degrees of freedom
## AIC: 20.854
##
## Number of Fisher Scoring iterations: 4
The standard errors in the summary can be reproduced from the inverse of the expected information matrix \((\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1}\).
xm <- model.matrix(lmod)
wm <- diag(w)
sqrt(diag(solve(t(xm) %*% wm %*% xm)))  # standard errors from (X'WX)^{-1}
## (Intercept) conc
## 0.4178878 0.1814158
When considering the choice of model for data, two extremes are the null or intercept-only model and the full or saturated model.
The null model means there’s no relation between predictors and the response. Usually it means we fit a common mean \(\mu\) for all \(y\).
The full model means the data are explained exactly. Typically it means we use \(n\) parameters for \(n\) data points.
To assess the goodness of fit of a model, we might consider a likelihood ratio test against the full (saturated) model. For independent observations from an exponential family with \(a_i(\phi) = \phi\), this simplifies to \[ \frac{D(\mathbf{y}, \widehat{\boldsymbol{\mu}})}{\phi} = \frac{2 \sum_i [y_i(\tilde \theta_i - \hat \theta_i) - b(\tilde \theta_i) + b(\hat \theta_i)]}{\phi}, \] where \(\tilde \theta\) are the estimates under the full model and \(\hat \theta\) are the estimates under the model of interest. \(D(\mathbf{y}, \widehat{\boldsymbol{\mu}})\) is called the deviance and \(D(\mathbf{y}, \widehat{\boldsymbol{\mu}}) / \phi\) is the scaled deviance.
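For example, for the Poisson the saturated model fits \(\tilde \mu_i = y_i\), so \(\tilde \theta_i = \log y_i\), \(\hat \theta_i = \log \hat \mu_i\), \(b(\theta) = e^{\theta}\), and \(\phi = 1\), giving \[ D(\mathbf{y}, \widehat{\boldsymbol{\mu}}) = 2 \sum_i [y_i \log (y_i / \hat \mu_i) - (y_i - \hat \mu_i)], \] matching the table below.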
An alternative measure of goodness of fit is the Pearson’s \(X^2\) statistic \[ X^2 = \sum_i \frac{(y_i - \hat \mu_i)^2}{\operatorname{Var}(\hat \mu_i)}. \]
GLM | Deviance |
---|---|
Gaussian | \(\sum_i (y_i - \hat \mu_i)^2\) |
Poisson | \(2\sum_i [y_i \log(y_i / \hat \mu_i) - (y_i - \hat \mu_i)]\) |
Binomial | \(2 \sum_i [y_i \log(y_i / \hat \mu_i) + (n_i - y_i) \log((n_i - y_i)/(n_i - \hat \mu_i))]\) |
Gamma | \(2 \sum_i [- \log(y_i / \hat \mu_i) + (y_i - \hat \mu_i) / \hat \mu_i]\) |
Inverse Gaussian | \(\sum_i (y_i - \hat \mu_i)^2 / (\hat \mu_i^2 y_i)\) |
For the goodness of fit test, we use the fact that, under certain regularity conditions and provided the model is correct, the scaled deviance and Pearson’s \(X^2\) statistic are both asymptotically \(\chi^2\) with degrees of freedom equal to the difference in the numbers of identifiable parameters (here \(n - p\)).
For Gaussian, \(\phi\) is unknown so this test cannot be used. For binomial and Poisson, \(\phi=1\) so the test is practical. However, the accuracy of the asymptotic approximation is dubious for small batch sizes \(n_i\). For binary responses, the approximation is worthless.
To compare two nested models \(\Omega\) and \(\omega\), the difference of scaled deviances \(D_\omega - D_\Omega\) is asymptotically \(\chi^2\) with degrees of freedom equal to the difference in the number of identifiable parameters in the two models. For the Gaussian model and other models where the dispersion \(\phi\) is unknown, we can insert an estimate of \(\phi\) and compute an \(F\) test \[ \frac{(D_\omega - D_\Omega) / (\text{df}_{\omega} - \text{df}_{\Omega})}{\hat \phi}, \] where \(\hat \phi = X^2 / (n-p)\) is a good estimate of the dispersion. For Gaussian, the F-test is exact. For other models, the F-test is approximate.
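A minimal sketch of this F test on simulated data (not from the original notes), comparing nested gamma GLMs where \(\phi\) must be estimated:
set.seed(123)
n <- 50
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- rgamma(n, shape = 2, rate = 2 / exp(1 + dat$x1))  # mean exp(1 + x1)
small <- glm(y ~ x1, family = Gamma(link = "log"), data = dat)
big <- glm(y ~ x1 + x2, family = Gamma(link = "log"), data = dat)
anova(small, big, test = "F")  # F test using an estimate of the dispersion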
Recall the insecticide data `bliss` we used for binomial regression. The goodness of fit test by analysis of deviance shows that this model fits the data well. Comparing the fit to the null model also shows that the predictor `conc` is highly significant.
bliss
## dead alive conc
## 1 2 28 0
## 2 8 22 1
## 3 15 15 2
## 4 23 7 3
## 5 27 3 4
modl <- glm(cbind(dead, alive) ~ conc, family = binomial, data = bliss)
summary(modl)
##
## Call:
## glm(formula = cbind(dead, alive) ~ conc, family = binomial, data = bliss)
##
## Deviance Residuals:
## 1 2 3 4 5
## -0.4510 0.3597 0.0000 0.0643 -0.2045
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3238 0.4179 -5.561 2.69e-08 ***
## conc 1.1619 0.1814 6.405 1.51e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 64.76327 on 4 degrees of freedom
## Residual deviance: 0.37875 on 3 degrees of freedom
## AIC: 20.854
##
## Number of Fisher Scoring iterations: 4
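As a quick check (these tests are not part of the printed summary): the residual deviance 0.37875 on 3 degrees of freedom indicates a good fit, and the drop in deviance from the null model on 1 degree of freedom is highly significant.
pchisq(deviance(modl), df.residual(modl), lower.tail = FALSE)            # goodness of fit p-value
pchisq(modl$null.deviance - deviance(modl), df = 1, lower.tail = FALSE)  # significance of conc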
Pearson residual \[ r_p = \frac{y - \hat \mu}{\sqrt{\operatorname{Var}(\hat \mu)}}. \]
Deviance residual \[ r_D = \text{sign}(y - \hat \mu) \sqrt{d_i}, \] where \(d_i\) is the contribution of the \(i\)-th observation to the deviance.
These are different kinds of residuals from the binomial regression of the insecticide data `bliss`.
residuals(modl, type = "deviance") # deviance residuals
## 1 2 3 4 5
## -0.45101510 0.35969607 0.00000000 0.06430235 -0.20449347
residuals(modl, type = "pearson") # Pearson residuals
## 1 2 3 4 5
## -4.325234e-01 3.643729e-01 -3.648565e-15 6.414687e-02 -2.081068e-01
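As a check, squaring and summing the residuals recovers the overall statistics: \(\sum_i r_{D,i}^2\) equals the residual deviance and \(\sum_i r_{P,i}^2\) equals Pearson’s \(X^2\).
sum(residuals(modl, type = "deviance")^2)  # residual deviance, 0.37875
sum(residuals(modl, type = "pearson")^2)   # Pearson X^2 statistic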
residuals(modl, type = "response") # response - fitted values
## 1 2 3 4 5
## -2.250510e-02 2.834353e-02 -3.330669e-16 4.989802e-03 -1.082823e-02
bliss$dead / 30.0 - fitted(modl)
## 1 2 3 4 5
## -2.250510e-02 2.834353e-02 -3.330669e-16 4.989802e-03 -1.082823e-02
residuals(modl, type = "working") # working response
## 1 2 3 4 5
## -2.770876e-01 1.561410e-01 -1.332268e-15 2.748820e-02 -1.333195e-01
modl$residuals
## 1 2 3 4 5
## -2.770876e-01 1.561410e-01 -1.332268e-15 2.748820e-02 -1.333195e-01
We mostly use the deviance residuals for diagnostics.
influence(modl)$hat # leverages (diagonal of the hat matrix)
## 1 2 3 4 5
## 0.4255049 0.4133068 0.3223765 0.4133068 0.4255049
rstudent(modl) # jackknife (studentized) residuals
## 1 2 3 4 5
## -0.58478586 0.47213544 0.00000000 0.08386629 -0.27183519
influence(modl)$coef # leave-one-out changes in coefficients
## (Intercept) conc
## 1 -0.214001479 0.080663550
## 2 0.155671882 -0.047087302
## 3 0.000000000 0.000000000
## 4 -0.005841678 0.008417729
## 5 0.049263918 -0.036573429
cooks.distance(modl) # Cook's distances
## 1 2 3 4 5
## 1.205927e-01 7.970999e-02 4.673053e-30 2.470424e-03 2.791738e-02
Recall the `gala` data from the faraway package on species diversity on the Galapagos islands.
gala <- as_tibble(gala, rownames = "Island") %>%
print(n = Inf)
## # A tibble: 30 x 8
## Island Species Endemics Area Elevation Nearest Scruz Adjacent
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Baltra 58 23 25.1 346 0.6 0.6 1.84
## 2 Bartolome 31 21 1.24 109 0.6 26.3 572.
## 3 Caldwell 3 3 0.21 114 2.8 58.7 0.78
## 4 Champion 25 9 0.1 46 1.9 47.4 0.18
## 5 Coamano 2 1 0.05 77 1.9 1.9 904.
## 6 Daphne.Major 18 11 0.34 119 8 8 1.84
## 7 Daphne.Minor 24 0 0.08 93 6 12 0.34
## 8 Darwin 10 7 2.33 168 34.1 290. 2.85
## 9 Eden 8 4 0.03 71 0.4 0.4 18.0
## 10 Enderby 2 2 0.18 112 2.6 50.2 0.1
## 11 Espanola 97 26 58.3 198 1.1 88.3 0.570
## 12 Fernandina 93 35 634. 1494 4.3 95.3 4669.
## 13 Gardner1 58 17 0.570 49 1.1 93.1 58.3
## 14 Gardner2 5 4 0.78 227 4.6 62.2 0.21
## 15 Genovesa 40 19 17.4 76 47.4 92.2 129.
## 16 Isabela 347 89 4669. 1707 0.7 28.1 634.
## 17 Marchena 51 23 129. 343 29.1 85.9 59.6
## 18 Onslow 2 2 0.01 25 3.3 45.9 0.1
## 19 Pinta 104 37 59.6 777 29.1 120. 129.
## 20 Pinzon 108 33 18.0 458 10.7 10.7 0.03
## 21 Las.Plazas 12 9 0.23 94 0.5 0.6 25.1
## 22 Rabida 70 30 4.89 367 4.4 24.4 572.
## 23 SanCristobal 280 65 552. 716 45.2 66.6 0.570
## 24 SanSalvador 237 81 572. 906 0.2 19.8 4.89
## 25 SantaCruz 444 95 904. 864 0.6 0 0.52
## 26 SantaFe 62 28 24.1 259 16.5 16.5 0.52
## 27 SantaMaria 285 73 171. 640 2.6 49.2 0.1
## 28 Seymour 44 16 1.84 147 0.6 9.6 25.1
## 29 Tortuga 16 8 1.24 186 6.8 50.9 18.0
## 30 Wolf 21 12 2.85 253 34.1 255. 2.33
Fit a Poisson regression of Species on the geographic predictors, excluding Endemics.
modp <- glm(Species ~ . - Endemics - Island, family = poisson, data = gala)
summary(modp)
##
## Call:
## glm(formula = Species ~ . - Endemics - Island, family = poisson,
## data = gala)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.2752 -4.4966 -0.9443 1.9168 10.1849
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.155e+00 5.175e-02 60.963 < 2e-16 ***
## Area -5.799e-04 2.627e-05 -22.074 < 2e-16 ***
## Elevation 3.541e-03 8.741e-05 40.507 < 2e-16 ***
## Nearest 8.826e-03 1.821e-03 4.846 1.26e-06 ***
## Scruz -5.709e-03 6.256e-04 -9.126 < 2e-16 ***
## Adjacent -6.630e-04 2.933e-05 -22.608 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 3510.73 on 29 degrees of freedom
## Residual deviance: 716.85 on 24 degrees of freedom
## AIC: 889.68
##
## Number of Fisher Scoring iterations: 5
Plot the deviance residuals against the linear predictor \(\widehat{\eta}\).
gala %>%
mutate(devres = residuals(modp, type = "deviance"),
linpred = predict(modp, type = "link")) %>%
ggplot() +
geom_point(mapping = aes(x = linpred, y = devres)) +
labs(x = expression(hat(eta)), y = "Deviance residual")
If we plot the response residuals \(y_i - \widehat{\mu}_i\) against the linear predictor \(\widehat{\eta}\), then we see the variance increasing with \(\widehat{\mu}\), which is consistent with a Poisson model.
gala %>%
mutate(resres = residuals(modp, type = "response"),
linpred = predict(modp, type = "link")) %>%
ggplot() +
geom_point(mapping = aes(x = linpred, y = resres)) +
labs(x = expression(hat(eta)), y = "Response residual")
Plot the response `Species` against the predictor `Area`. We see that the majority of islands have a small area, with a few exceptions.
gala %>%
  ggplot() +
  geom_point(mapping = aes(x = Area, y = Species))
Plotting `Species` against log(`Area`) reveals a curvilinear relationship.
gala %>%
  ggplot() +
  geom_point(mapping = aes(x = log(Area), y = Species))
Examine the linearized (working) response \(\hat \eta + (y - \hat \mu) \frac{d\eta}{d\mu} = \hat \eta + (y - \hat \mu)/\hat \mu\) against log(Area).
gala %>%
mutate(eta = predict(modp, type = "link"),
mu = predict(modp, type = "response"),
res = residuals(modp, type = "response")) %>%
ggplot() +
geom_point(mapping = aes(x = log(Area), y = eta + res / mu)) +
labs(x = "log(Area)", y = "Linearized response")
The curvilinear pattern suggests using log-transformed predictors.
modpl <- glm(Species ~ log(Area) + log(Elevation) + log(Nearest) + log(Scruz + 0.5) + log(Adjacent), family = poisson, data = gala) # Scruz contains a zero, hence the +0.5
summary(modpl)
##
## Call:
## glm(formula = Species ~ log(Area) + log(Elevation) + log(Nearest) +
## log(Scruz + 0.5) + log(Adjacent), family = poisson, data = gala)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.4176 -2.7323 -0.4655 2.5544 8.3385
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.331143 0.286899 11.611 < 2e-16 ***
## log(Area) 0.350684 0.018004 19.479 < 2e-16 ***
## log(Elevation) 0.032465 0.057047 0.569 0.56929
## log(Nearest) -0.039966 0.014126 -2.829 0.00467 **
## log(Scruz + 0.5) -0.037242 0.013783 -2.702 0.00689 **
## log(Adjacent) -0.089481 0.006945 -12.885 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 3510.7 on 29 degrees of freedom
## Residual deviance: 360.0 on 24 degrees of freedom
## AIC: 532.83
##
## Number of Fisher Scoring iterations: 5
Check for outliers with a half-normal plot of the jackknife residuals.
halfnorm(rstudent(modpl))
Check for high-leverage observations with a half-normal plot of the hat values.
gali <- influence(modpl)
halfnorm(gali$hat)
To illustrate with the linear model, suppose we relax the assumption of the linear regression model \(\mathbf{Y} \sim \text{N}(\mathbf{X} \boldsymbol{\beta}, \sigma^2 \mathbf{I})\) to \[ \mathbf{Y} \sim \text{N}(\mathbf{X} \boldsymbol{\beta}, \boldsymbol{\Omega}), \] where \(\boldsymbol{\Omega}\) is an unknown covariance matrix. If we still use the least squares estimate \[ \widehat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}, \] then \[ \operatorname{Var} \widehat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} (\mathbf{X}^T \boldsymbol{\Omega} \mathbf{X}) (\mathbf{X}^T \mathbf{X})^{-1}. \]
For correct inference, we would like to plug in an estimate of \(\boldsymbol{\Omega}\) and use this sandwich estimator.
Generalization to GLM is similar.
library(sandwich)
glm(Species ~ . - Endemics - Island, family = poisson, data = gala) %>%
vcovHC() %>%
diag() %>%
sqrt()
## (Intercept) Area Elevation Nearest Scruz Adjacent
## 0.3185550946 0.0012176343 0.0011939774 0.0228021845 0.0065382200 0.0006212657
These standard errors are larger than the naive Poisson ones and closer to those from an overdispersed (quasi-)Poisson model.
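For comparison (a sketch, not part of the original output), a quasi-Poisson fit estimates a common dispersion \(\hat \phi\) and inflates all standard errors by \(\sqrt{\hat \phi}\):
glm(Species ~ . - Endemics - Island, family = quasipoisson, data = gala) %>%
  summary() %>%
  coef()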
Alternatively, robust estimation downweights outlying observations.
library(robust)
glmRob(Species ~ log(Area) + log(Adjacent), family = poisson, data = gala) %>%
summary()