Dr. Hua Zhou’s slides
Now we have learnt regression modeling of binomial, multinomial, and count responses. How about nonnegative real responses and other response types? The generalized linear model (GLM) is a generic framework that encompasses normal regression, binomial regression, multinomial regression, and others. There are two essential components of the GLM framework: a distribution for the response \(Y\) and a link function that relates the mean of \(Y\) to the covariates \(\mathbf{x}\).
In GLM, the distribution of \(Y\) is from the exponential family of distributions of the form \[ f(y \mid \theta, \phi) = \exp \left[ \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi) \right]. \] \(\theta\) is called the canonical parameter or natural parameter and represents the location, while \(\phi\) is the dispersion parameter and represents the scale. Note the canonical parameter \(\theta\) is not necessarily the mean \(\mu\).
Normal or Gaussian: \[ f(y \mid \theta, \phi) = \frac{1}{\sqrt{2\pi}\sigma} \exp \left[ - \frac{(y - \mu)^2}{2\sigma^2} \right] = \exp \left[ \frac{y\mu - \mu^2/2}{\sigma^2} - \frac 12 \left( \frac{y^2}{\sigma^2} + \log(2\pi \sigma^2) \right) \right]. \] So we can write \[\begin{eqnarray*} \theta &=& \mu \\ \phi &=& \sigma^2 \\ a(\phi) &=& \phi \\ b(\theta) &=& \theta^2/2 \\ c(y, \phi) &=& -\frac 12 (y^2/\phi + \log(2\pi \phi)). \end{eqnarray*}\]
Binomial: If we treat \(Y\) as a binomial proportion; that is \(nY \sim \text{Bin}(n, p)\), then the density is \[\begin{eqnarray*} & & f(y \mid \theta, \phi) = \binom{n}{ny} p^{ny} (1 -p)^{n(1-y)} \\ &=& \exp \left[ ny \log p + n(1 - y) \log (1 - p) + \log \binom{n}{ny} \right] \\ &=& \exp \left[ \frac{y \log \frac{p}{1 - p} + \log (1 - p)}{1/n} + \log \binom{n}{ny} \right]. \end{eqnarray*}\] So we see \[\begin{eqnarray*} \theta &=& \log \frac{p}{1 - p} \\ \phi &=& 1 \\ a(\phi) &=& \frac{1}{n} \\ b(\theta) &=& - \log (1 - p) = \log (1 + \exp \theta) \\ c(y, \phi) &=& \log \binom{n}{ny}. \end{eqnarray*}\]
Poisson: \[ f(y \mid \theta, \phi) = e^{-\mu} \frac{\mu^y}{y!} = \exp (y \log \mu - \mu - \log y!). \] So we have \[\begin{eqnarray*} \theta &=& \log \mu \\ \phi &=& 1 \\ a(\phi) &=& 1 \\ b(\theta) &=& \exp \theta \\ c(y, \phi) &=& - \log y!. \end{eqnarray*}\]
Gamma has density \[ f(y \mid \nu, \lambda) = \frac{1}{\Gamma(\nu)} \lambda^{\nu} y^{\nu - 1} e^{-\lambda y}, \quad y > 0, \] where \(\nu\) is the shape parameter and \(\lambda\) is the rate parameter. For the purpose of GLM, it’s convenient to reparameterize by \(\lambda = \nu / \mu\) to get \[ f(y) = \frac{1}{\Gamma(\nu)} \left( \frac{\nu}{\mu} \right)^{\nu} y^{\nu - 1} e^{-y\nu / \mu} = \exp \left\{ \frac{- y \mu^{-1} - \log \mu}{\nu^{-1}} + (\nu-1) \log y + \nu \log \nu - \log \Gamma(\nu) \right\}. \] Now \(\mathbb{E}Y = \mu\) and \(\operatorname{Var}(Y) = \mu^2 / \nu = (\mathbb{E} Y)^2 / \nu\). So we have \[\begin{eqnarray*} \theta &=& - \mu^{-1} \\ \phi &=& \nu^{-1} \\ a(\phi) &=& \phi \\ b(\theta) &=& - \log (- \theta) \\ c(y, \phi) &=& (\phi^{-1} - 1) \log y - \phi^{-1} \log (\phi) - \log \Gamma(\phi^{-1}). \end{eqnarray*}\] Some books remove the minus sign in the canonical parameter/link, which is fine provided we take account of this in any derivations. For the canonical link \(\eta = \mu^{-1}\), the linear predictor must be positive, which the systematic component \(\mathbf{x}^T \boldsymbol{\beta}\) does not guarantee; this can cause problems. Other possible link functions are the log link \(\eta = \log \mu\) and the identity link \(\eta = \mu\).
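These link choices are available through R’s Gamma family object; a quick illustration of the link functions, evaluated at \(\mu = 2\):
Gamma(link = "inverse")$linkfun(2)  # canonical (inverse) link: 1/mu = 0.5
Gamma(link = "log")$linkfun(2)      # log link: log(2)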
Many other distributions, e.g., the inverse Gaussian, also fit into this framework.
Exponential family distributions have mean and variance \[\begin{eqnarray*} \mathbb{E}Y &=& \mu = b'(\theta) \\ \operatorname{Var}Y &=& \sigma^2 = b''(\theta) a(\phi). \end{eqnarray*}\] Show this in Friday’s Lab session. Thus the function \(b\) determines the moments of \(Y\).
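As a quick check against the Poisson case above: \(b(\theta) = e^{\theta}\) and \(a(\phi) = 1\), so \(\mathbb{E}Y = b'(\theta) = e^{\theta} = \mu\) and \(\operatorname{Var} Y = b''(\theta) a(\phi) = e^{\theta} = \mu\), recovering the familiar fact that the Poisson mean and variance coincide.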
Given the linear predictor or systematic component \[ \eta = \mathbf{x}^T \boldsymbol{\beta}, \] the link function, \(g\), relates the mean \(\mathbb{E} Y = \mu\) to the covariates \[ \eta = g(\mu). \]
In principle, any monotone, continuous, and differentiable function can be a link function. But there are some convenient and common choices for the standard GLMs.
For Gaussian linear model, the identity link, \(\eta = \mu\), is the obvious choice.
For the binomial model, we saw the logit, probit, and cloglog (complementary log-log) links.
For Poisson model, a standard choice is \(\eta = \log \mu\).
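In R, each family object bundles the link function and its inverse; a quick illustration:
fam <- binomial(link = "cloglog")
fam$linkfun(0.5)  # eta = g(mu)
fam$linkinv(0)    # mu = g^{-1}(eta)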
The canonical link has \(g\) such that \(\eta = g(\mu) = \theta\), the canonical parameter of the exponential family distribution. This means that \(g(b'(\theta))=\theta\). If a canonical link is used, \(\mathbf{X}^T \mathbf{y}\) is sufficient for \(\boldsymbol{\beta}\). The canonical link is mathematically and computationally convenient and is often the natural choice of link.
Family | Canonical Link | Variance Function |
---|---|---|
Normal | \(\eta=\mu\) | 1 |
Poisson | \(\eta=\log \mu\) | \(\mu\) |
Bernoulli | \(\eta=\log \left( \frac{\mu}{1 - \mu} \right)\) | \(\mu (1 - \mu)\) |
Gamma | \(\eta = \mu^{-1}\) | \(\mu^2\) |
Inverse Gaussian | \(\eta = \mu^{-2}\) | \(\mu^3\) |
GLM regression coefficients are estimated by MLE. Recall that the Newton-Raphson algorithm for maximizing a log-likelihood \(\ell(\boldsymbol{\beta})\) proceeds as \[ \boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + s [- \nabla^2 \ell(\boldsymbol{\beta}^{(t)})]^{-1} \nabla \ell(\boldsymbol{\beta}^{(t)}), \] where \(s>0\) is a step length, \(\nabla \ell\) is the score (gradient) vector, and \(-\nabla^2 \ell\) is the observed information matrix (negative Hessian).
For GLM, \[\begin{eqnarray*} \ell(\boldsymbol{\beta}) &=& \sum_{i=1}^n \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \\ \nabla \ell(\boldsymbol{\beta}) &=& \sum_{i=1}^n \frac{(y_i - \mu_i) \mu_i'(\eta_i)}{\sigma_i^2} \mathbf{x}_i \\ - \nabla^2 \ell(\boldsymbol{\beta}) &=& \sum_{i=1}^n \frac{[\mu_i'(\eta_i)]^2}{\sigma_i^2} \mathbf{x}_i \mathbf{x}_i^T - \sum_{i=1}^n \frac{(y_i - \mu_i) \mu_i''(\eta_i)}{\sigma_i^2} \mathbf{x}_i \mathbf{x}_i^T \\ & & + \sum_{i=1}^n \frac{(y_i - \mu_i) [\mu_i'(\eta_i)]^2 (d \sigma_i^{2} / d\mu_i)}{\sigma_i^4} \mathbf{x}_i \mathbf{x}_i^T. \end{eqnarray*}\] Show this in Friday’s Lab session. For GLMs with canonical links, the second term and third term cancel using the fact \[ \frac{d\mu_i}{d \eta_i} = \frac{d\mu_i}{d \theta_i} = \frac{d \, b'(\theta_i)}{d \theta_i} = b''(\theta_i) = \frac{\sigma_i^2}{a(\phi)}. \] Therefore, for a canonical link, the negative Hessian is positive semidefinite and Newton’s algorithm with line search is stable.
See Biostat 216 lecture notes for a refresher of matrix calculus.
How about non-canonical link? We use the expected (Fisher) information matrix \[ \mathbb{E} [- \nabla^2 \ell(\boldsymbol{\beta})] = \sum_{i=1}^n \frac{[\mu_i'(\eta_i)]^2}{\sigma_i^2} \mathbf{x}_i \mathbf{x}_i^T = \mathbf{X}^T \mathbf{W} \mathbf{X} \succeq 0, \] where \(\mathbf{W} = \text{diag}([\mu_i'(\eta_i)]^2/\sigma_i^2)\). This modified Newton-Raphson algorithm is called the Fisher scoring algorithm.
Take the binomial logistic regression as an example \[\begin{eqnarray*} \ell(\boldsymbol{\beta}) &=& \sum_{i=1}^n [y_i \log p_i + (n_i - y_i) \log (1 - p_i)] = \sum_{i=1}^n [y_i \mathbf{x}_i^T \boldsymbol{\beta} - n_i \log (1 + e^{\mathbf{x}_i^T \boldsymbol{\beta}})] \\ \nabla \ell(\boldsymbol{\beta}) &=& \sum_{i=1}^n \left( y_i \mathbf{x}_i - n_i \frac{\exp \mathbf{x}_i^T \boldsymbol{\beta}}{1 + \exp \mathbf{x}_i^T \boldsymbol{\beta}} \mathbf{x}_i \right) = \sum_{i=1}^n (y_i - n_i p_i) \mathbf{x}_i = \mathbf{X}^T (\mathbf{y} - \widehat{\boldsymbol{\mu}}) \\ - \nabla^2 \ell(\boldsymbol{\beta}) &=& \sum_{i=1}^n n_i p_i (1 - p_i) \mathbf{x}_i \mathbf{x}_i^T = \mathbf{X}^T \mathbf{W} \mathbf{X}, \quad \mathbf{W} = \text{diag}(w_1, \ldots, w_n), w_i = n_i p_i (1 - p_i) \\ \mathbb{E} [- \nabla^2 \ell(\boldsymbol{\beta})] &=& - \nabla^2 \ell(\boldsymbol{\beta}). \end{eqnarray*}\] The Fisher scoring algorithm proceeds as \[\begin{eqnarray*} \boldsymbol{\beta}^{(t+1)} &=& \boldsymbol{\beta}^{(t)} + s(\mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X})^{-1} \mathbf{X}^T (\mathbf{y} - \widehat{\boldsymbol{\mu}}^{(t)}) \\ &=& (\mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W}^{(t)} [\mathbf{X} \boldsymbol{\beta}^{(t)} + s (\mathbf{W}^{(t)})^{-1} (\mathbf{y} - \widehat{\boldsymbol{\mu}}^{(t)})] \\ &=& (\mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W}^{(t)} \mathbf{z}^{(t)}, \end{eqnarray*}\] where \[ \mathbf{z}^{(t)} = \mathbf{X} \boldsymbol{\beta}^{(t)} + s (\mathbf{W}^{(t)})^{-1} (\mathbf{y} - \widehat{\boldsymbol{\mu}}^{(t)}) \] are working responses. In this sense, the Fisher scoring algorithm for GLM is also called the IRWLS (iteratively reweighted least squares).
Recall the insecticide data `bliss` we used for binomial regression.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(faraway)
bliss
## dead alive conc
## 1 2 28 0
## 2 8 22 1
## 3 15 15 2
## 4 23 7 3
## 5 27 3 4
Let’s manually implement the first iteration of the Fisher scoring algorithm.
p <- bliss$dead / 30      # observed proportions of dead insects
eta <- logit(p)           # initial linear predictor
z <- eta                  # initial working response
w <- 30 * p * (1 - p)     # initial working weights
lmod <- lm(z ~ conc, weights = w, data = bliss)
coef(lmod)
## (Intercept) conc
## -2.302462 1.153587
The Fisher scoring algorithm converges within a few iterations.
y <- bliss$dead
for (iter in 1:5) {
  eta <- lmod$fit                # current linear predictor
  p <- ilogit(eta)               # current fitted probabilities
  w <- 30 * p * (1 - p)          # working weights
  z <- eta + (y - 30 * p) / w    # working response
  lmod <- lm(z ~ conc, weights = w, data = bliss)
  cat(iter, coef(lmod), "\n")
}
## 1 -2.323672 1.161847
## 2 -2.32379 1.161895
## 3 -2.32379 1.161895
## 4 -2.32379 1.161895
## 5 -2.32379 1.161895
Compare to the glm fit.
glm(cbind(dead, alive) ~ conc, family = binomial, data = bliss) %>%
summary()
##
## Call:
## glm(formula = cbind(dead, alive) ~ conc, family = binomial, data = bliss)
##
## Deviance Residuals:
## 1 2 3 4 5
## -0.4510 0.3597 0.0000 0.0643 -0.2045
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3238 0.4179 -5.561 2.69e-08 ***
## conc 1.1619 0.1814 6.405 1.51e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 64.76327 on 4 degrees of freedom
## Residual deviance: 0.37875 on 3 degrees of freedom
## AIC: 20.854
##
## Number of Fisher Scoring iterations: 4
The standard errors in the summary can be reproduced from the inverse of the expected information matrix \((\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1}\).
xm <- model.matrix(lmod)
wm <- diag(w)
sqrt(diag(solve(t(xm) %*% wm %*% xm)))  # standard errors from (X'WX)^{-1}
## (Intercept) conc
## 0.4178878 0.1814158
When considering the choice of model for data, two extremes are the null or intercept-only model and the full or saturated model.
The null model means there’s no relation between predictors and the response. Usually it means we fit a common mean \(\mu\) for all \(y\).
The full model means the data are explained exactly. Typically it means we use \(n\) parameters for \(n\) data points.
To assess the goodness of fit of a model, we might consider a likelihood ratio test against the full (saturated) model. For independent observations from an exponential family with \(a_i(\phi) = \phi\), this simplifies to \[ \frac{D(\mathbf{y}, \widehat{\boldsymbol{\mu}})}{\phi} = \frac{2 \sum_i [y_i(\tilde \theta_i - \hat \theta_i) - b(\tilde \theta_i) + b(\hat \theta_i)]}{\phi}, \] where \(\tilde \theta\) are the estimates under the full model and \(\hat \theta\) are the estimates under the model of interest. \(D(\mathbf{y}, \widehat{\boldsymbol{\mu}})\) is called the deviance and \(D(\mathbf{y}, \widehat{\boldsymbol{\mu}}) / \phi\) is the scaled deviance.
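For example, for the Poisson the saturated model fits \(\tilde \mu_i = y_i\), so \(\tilde \theta_i = \log y_i\), \(\hat \theta_i = \log \hat \mu_i\), \(b(\theta) = e^{\theta}\), and \(\phi = 1\), giving \[ D(\mathbf{y}, \widehat{\boldsymbol{\mu}}) = 2 \sum_i [y_i \log (y_i / \hat \mu_i) - (y_i - \hat \mu_i)], \] matching the table below.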
An alternative measure of goodness of fit is the Pearson’s \(X^2\) statistic \[ X^2 = \sum_i \frac{(y_i - \hat \mu_i)^2}{\operatorname{Var}(\hat \mu_i)}. \]
GLM | Deviance |
---|---|
Gaussian | \(\sum_i (y_i - \hat \mu_i)^2\) |
Poisson | \(2\sum_i [y_i \log(y_i / \hat \mu_i) - (y_i - \hat \mu_i)]\) |
Binomial | \(2 \sum_i [y_i \log(y_i / \hat \mu_i) + (n_i - y_i) \log((n_i - y_i)/(n_i - \hat \mu_i))]\) |
Gamma | \(2 \sum_i [- \log(y_i / \hat \mu_i) + (y_i - \hat \mu_i) / \hat \mu_i]\) |
Inverse Gaussian | \(\sum_i (y_i - \hat \mu_i)^2 / (\hat \mu_i^2 y_i)\) |
For the goodness of fit test, we use the fact that, under certain regularity conditions and provided the model is correct, the scaled deviance and Pearson’s \(X^2\) statistic are both asymptotically \(\chi^2\) with degrees of freedom equal to the difference in the numbers of identifiable parameters (here \(n - p\)).
For Gaussian, \(\phi\) is unknown so this test cannot be used. For binomial and Poisson, \(\phi=1\) so the test is practical. However, the accuracy of the asymptotic approximation is dubious for small batch sizes \(n_i\). For binary responses, the approximation is worthless.
To compare two nested models \(\Omega\) and \(\omega\), the difference of scaled deviances \(D_\omega - D_\Omega\) is asymptotically \(\chi^2\) with degrees of freedom equal to the difference in the number of identifiable parameters in the two models. For the Gaussian model and other models where the dispersion \(\phi\) is unknown, we can insert an estimate of \(\phi\) and compute an \(F\) test \[ \frac{(D_\omega - D_\Omega) / (\text{df}_{\omega} - \text{df}_{\Omega})}{\hat \phi}, \] where \(\hat \phi = X^2 / (n-p)\) is a good estimate of the dispersion. For Gaussian, the F-test is exact. For other models, the F-test is approximate.
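A minimal sketch of this F test on simulated data (not from the original notes), comparing nested gamma GLMs where \(\phi\) must be estimated:
set.seed(123)
n <- 50
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- rgamma(n, shape = 2, rate = 2 / exp(1 + dat$x1))  # mean exp(1 + x1)
small <- glm(y ~ x1, family = Gamma(link = "log"), data = dat)
big <- glm(y ~ x1 + x2, family = Gamma(link = "log"), data = dat)
anova(small, big, test = "F")  # F test using an estimate of the dispersion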
Recall the insecticide data `bliss` we used for binomial regression. The goodness of fit test by analysis of deviance shows that this model fits the data well. Comparing the fit to the null model also shows that the predictor `conc` is highly significant.
bliss
## dead alive conc
## 1 2 28 0
## 2 8 22 1
## 3 15 15 2
## 4 23 7 3
## 5 27 3 4
modl <- glm(cbind(dead, alive) ~ conc, family = binomial, data = bliss)
summary(modl)
##
## Call:
## glm(formula = cbind(dead, alive) ~ conc, family = binomial, data = bliss)
##
## Deviance Residuals:
## 1 2 3 4 5
## -0.4510 0.3597 0.0000 0.0643 -0.2045
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3238 0.4179 -5.561 2.69e-08 ***
## conc 1.1619 0.1814 6.405 1.51e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 64.76327 on 4 degrees of freedom
## Residual deviance: 0.37875 on 3 degrees of freedom
## AIC: 20.854
##
## Number of Fisher Scoring iterations: 4
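As a quick check (these tests are not part of the printed summary): the residual deviance 0.37875 on 3 degrees of freedom indicates a good fit, and the drop in deviance from the null model on 1 degree of freedom is highly significant.
pchisq(deviance(modl), df.residual(modl), lower.tail = FALSE)            # goodness of fit p-value
pchisq(modl$null.deviance - deviance(modl), df = 1, lower.tail = FALSE)  # significance of conc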
Pearson residual \[ r_p = \frac{y - \hat \mu}{\sqrt{\operatorname{Var}(\hat \mu)}}. \]
Deviance residual \[ r_D = \text{sign}(y - \hat \mu) \sqrt{d_i}, \] where \(d_i\) is the contribution of the \(i\)-th observation to the deviance.
These are different kinds of residuals from the binomial regression of the insecticide data `bliss`.
residuals(modl, type = "deviance") # deviance residuals
## 1 2 3 4 5
## -0.45101510 0.35969607 0.00000000 0.06430235 -0.20449347
residuals(modl, type = "pearson") # Pearson residuals
## 1 2 3 4 5
## -4.325234e-01 3.643729e-01 -3.648565e-15 6.414687e-02 -2.081068e-01
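As a check, squaring and summing the residuals recovers the overall statistics: \(\sum_i r_{D,i}^2\) equals the residual deviance and \(\sum_i r_{P,i}^2\) equals Pearson’s \(X^2\).
sum(residuals(modl, type = "deviance")^2)  # residual deviance, 0.37875
sum(residuals(modl, type = "pearson")^2)   # Pearson X^2 statistic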
residuals(modl, type = "response") # response - fitted values
## 1 2 3 4 5
## -2.250510e-02 2.834353e-02 -3.330669e-16 4.989802e-03 -1.082823e-02
bliss$dead / 30.0 - fitted(modl)
## 1 2 3 4 5
## -2.250510e-02 2.834353e-02 -3.330669e-16 4.989802e-03 -1.082823e-02
residuals(modl, type = "working") # working response
## 1 2 3 4 5
## -2.770876e-01 1.561410e-01 -1.332268e-15 2.748820e-02 -1.333195e-01
modl$residuals
## 1 2 3 4 5
## -2.770876e-01 1.561410e-01 -1.332268e-15 2.748820e-02 -1.333195e-01
We mostly use the deviance residuals for diagnostics.
influence(modl)$hat # leverages (diagonal of the hat matrix)
## 1 2 3 4 5
## 0.4255049 0.4133068 0.3223765 0.4133068 0.4255049
rstudent(modl) # jackknife (studentized) residuals
## 1 2 3 4 5
## -0.58478586 0.47213544 0.00000000 0.08386629 -0.27183519
influence(modl)$coef # leave-one-out changes in coefficients
## (Intercept) conc
## 1 -0.214001479 0.080663550
## 2 0.155671882 -0.047087302
## 3 0.000000000 0.000000000
## 4 -0.005841678 0.008417729
## 5 0.049263918 -0.036573429
cooks.distance(modl) # Cook's distances
## 1 2 3 4 5
## 1.205927e-01 7.970999e-02 4.673053e-30 2.470424e-03 2.791738e-02
Recall the `gala` data from the faraway package on species diversity on the Galapagos islands.
gala <- as_tibble(gala, rownames = "Island") %>%
print(n = Inf)
## # A tibble: 30 x 8
## Island Species Endemics Area Elevation Nearest Scruz Adjacent
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Baltra 58 23 25.1 346 0.6 0.6 1.84
## 2 Bartolome 31 21 1.24 109 0.6 26.3 572.
## 3 Caldwell 3 3 0.21 114 2.8 58.7 0.78
## 4 Champion 25 9 0.1 46 1.9 47.4 0.18
## 5 Coamano 2 1 0.05 77 1.9 1.9 904.
## 6 Daphne.Major 18 11 0.34 119 8 8 1.84
## 7 Daphne.Minor 24 0 0.08 93 6 12 0.34
## 8 Darwin 10 7 2.33 168 34.1 290. 2.85
## 9 Eden 8 4 0.03 71 0.4 0.4 18.0
## 10 Enderby 2 2 0.18 112 2.6 50.2 0.1
## 11 Espanola 97 26 58.3 198 1.1 88.3 0.570
## 12 Fernandina 93 35 634. 1494 4.3 95.3 4669.
## 13 Gardner1 58 17 0.570 49 1.1 93.1 58.3
## 14 Gardner2 5 4 0.78 227 4.6 62.2 0.21
## 15 Genovesa 40 19 17.4 76 47.4 92.2 129.
## 16 Isabela 347 89 4669. 1707 0.7 28.1 634.
## 17 Marchena 51 23 129. 343 29.1 85.9 59.6
## 18 Onslow 2 2 0.01 25 3.3 45.9 0.1
## 19 Pinta 104 37 59.6 777 29.1 120. 129.
## 20 Pinzon 108 33 18.0 458 10.7 10.7 0.03
## 21 Las.Plazas 12 9 0.23 94 0.5 0.6 25.1
## 22 Rabida 70 30 4.89 367 4.4 24.4 572.
## 23 SanCristobal 280 65 552. 716 45.2 66.6 0.570
## 24 SanSalvador 237 81 572. 906 0.2 19.8 4.89
## 25 SantaCruz 444 95 904. 864 0.6 0 0.52
## 26 SantaFe 62 28 24.1 259 16.5 16.5 0.52
## 27 SantaMaria 285 73 171. 640 2.6 49.2 0.1
## 28 Seymour 44 16 1.84 147 0.6 9.6 25.1
## 29 Tortuga 16 8 1.24 186 6.8 50.9 18.0
## 30 Wolf 21 12 2.85 253 34.1 255. 2.33
Fit a Poisson regression of Species on the geographic predictors, excluding Endemics.
modp <- glm(Species ~ . - Endemics - Island, family = poisson, data = gala)
summary(modp)
##
## Call:
## glm(formula = Species ~ . - Endemics - Island, family = poisson,
## data = gala)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.2752 -4.4966 -0.9443 1.9168 10.1849
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.155e+00 5.175e-02 60.963 < 2e-16 ***
## Area -5.799e-04 2.627e-05 -22.074 < 2e-16 ***
## Elevation 3.541e-03 8.741e-05 40.507 < 2e-16 ***
## Nearest 8.826e-03 1.821e-03 4.846 1.26e-06 ***
## Scruz -5.709e-03 6.256e-04 -9.126 < 2e-16 ***
## Adjacent -6.630e-04 2.933e-05 -22.608 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 3510.73 on 29 degrees of freedom
## Residual deviance: 716.85 on 24 degrees of freedom
## AIC: 889.68
##
## Number of Fisher Scoring iterations: 5
Plot the deviance residuals against the linear predictor \(\widehat{\eta}\).
gala %>%
mutate(devres = residuals(modp, type = "deviance"),
linpred = predict(modp, type = "link")) %>%
ggplot() +
geom_point(mapping = aes(x = linpred, y = devres)) +
labs(x = expression(hat(eta)), y = "Deviance residual")
If we plot the response residuals \(y_i - \widehat{\mu}_i\) against the linear predictor \(\widehat{\eta}\), then we see the variance increasing with \(\widehat{\mu}\), which is consistent with a Poisson model.
gala %>%
mutate(resres = residuals(modp, type = "response"),
linpred = predict(modp, type = "link")) %>%
ggplot() +
geom_point(mapping = aes(x = linpred, y = resres)) +
labs(x = expression(hat(eta)), y = "Response residual")
Plot the response `Species` against the predictor `Area`. We see that the majority of islands have a small area, with a few exceptions.
gala %>%
  ggplot() +
  geom_point(mapping = aes(x = Area, y = Species))
Plotting `Species` against log(`Area`) reveals a curvilinear relationship.
gala %>%
  ggplot() +
  geom_point(mapping = aes(x = log(Area), y = Species))
Examine the linearized (working) response \(\hat \eta + (y - \hat \mu) \frac{d\eta}{d\mu} = \hat \eta + (y - \hat \mu)/\hat \mu\) against log(Area).
gala %>%
mutate(eta = predict(modp, type = "link"),
mu = predict(modp, type = "response"),
res = residuals(modp, type = "response")) %>%
ggplot() +
geom_point(mapping = aes(x = log(Area), y = eta + res / mu)) +
labs(x = "log(Area)", y = "Linearized response")
The curvilinear pattern suggests using log-transformed predictors.
modpl <- glm(Species ~ log(Area) + log(Elevation) + log(Nearest) + log(Scruz + 0.5) + log(Adjacent), family = poisson, data = gala) # Scruz contains a zero, hence the +0.5
summary(modpl)
##
## Call:
## glm(formula = Species ~ log(Area) + log(Elevation) + log(Nearest) +
## log(Scruz + 0.5) + log(Adjacent), family = poisson, data = gala)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.4176 -2.7323 -0.4655 2.5544 8.3385
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.331143 0.286899 11.611 < 2e-16 ***
## log(Area) 0.350684 0.018004 19.479 < 2e-16 ***
## log(Elevation) 0.032465 0.057047 0.569 0.56929
## log(Nearest) -0.039966 0.014126 -2.829 0.00467 **
## log(Scruz + 0.5) -0.037242 0.013783 -2.702 0.00689 **
## log(Adjacent) -0.089481 0.006945 -12.885 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 3510.7 on 29 degrees of freedom
## Residual deviance: 360.0 on 24 degrees of freedom
## AIC: 532.83
##
## Number of Fisher Scoring iterations: 5
Check for outliers with a half-normal plot of the jackknife residuals.
halfnorm(rstudent(modpl))
Check for high-leverage observations with a half-normal plot of the hat values.
gali <- influence(modpl)
halfnorm(gali$hat)
To illustrate with the linear model, suppose we relax the assumption of the linear regression model \(\mathbf{Y} \sim \text{N}(\mathbf{X} \boldsymbol{\beta}, \sigma^2 \mathbf{I})\) to \[ \mathbf{Y} \sim \text{N}(\mathbf{X} \boldsymbol{\beta}, \boldsymbol{\Omega}), \] where \(\boldsymbol{\Omega}\) is an unknown covariance matrix. If we still use the least squares estimate \[ \widehat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}, \] then \[ \operatorname{Var} \widehat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} (\mathbf{X}^T \boldsymbol{\Omega} \mathbf{X}) (\mathbf{X}^T \mathbf{X})^{-1}. \]
For correct inference, we would like to plug in an estimate of \(\boldsymbol{\Omega}\) and use this sandwich estimator.
Generalization to GLM is similar.
library(sandwich)
glm(Species ~ . - Endemics - Island, family = poisson, data = gala) %>%
vcovHC() %>%
diag() %>%
sqrt()
## (Intercept) Area Elevation Nearest Scruz Adjacent
## 0.3185550946 0.0012176343 0.0011939774 0.0228021845 0.0065382200 0.0006212657
These standard errors are larger than the naive Poisson ones and closer to those from an overdispersed (quasi-)Poisson model.
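For comparison (a sketch, not part of the original output), a quasi-Poisson fit estimates a common dispersion \(\hat \phi\) and inflates all standard errors by \(\sqrt{\hat \phi}\):
glm(Species ~ . - Endemics - Island, family = quasipoisson, data = gala) %>%
  summary() %>%
  coef()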
Alternatively, robust estimation downweights outlying observations.
library(robust)
glmRob(Species ~ log(Area) + log(Adjacent), family = poisson, data = gala) %>%
summary()