Matrix Algebra of Regression &
Introduction to Generalized Linear Models

Introduction

  • Recall the general linear model (GLM),

y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon

  • This can be simplified using matrix expressions,

\mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon},

  • where

    • \mathbf{Y} is the n \times 1 column vector of responses.
    • \mathbf{X} is the n \times p design matrix.
    • \boldsymbol{\beta} is the p \times 1 column vector of regression coefficients.
    • \boldsymbol{\varepsilon} is the n \times 1 column vector of error terms.

Regression in Matrix Form

  • Looking at the (literal) matrices,

\begin{align*} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} &= \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,(p-1)} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n,1} & x_{n,2} & \cdots & x_{n, (p-1)} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix} \end{align*}

  • where

    • \mathbf{Y} is the n \times 1 column vector of responses.
    • \mathbf{X} is the n \times p design matrix.
    • \boldsymbol{\beta} is the p \times 1 column vector of regression coefficients.
    • \boldsymbol{\varepsilon} is the n \times 1 column vector of error terms.

Vector of Error Terms

  • Recall the error term, \boldsymbol{\varepsilon} = \left[ \varepsilon_1, \ \varepsilon_2, \ \cdots, \ \varepsilon_n\right]^\text{T}, a vector of independent normal random variables with

    • an n \times 1 expectation vector, \text{E}\left[\boldsymbol{\varepsilon}\right] = \mathbf{0}, and

    • an n \times n variance-covariance matrix, \sigma^2 \mathbf{I},

\sigma^2 \mathbf{I} = \sigma^2 \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix}

  • Note that we do not have to assume independence …

    • … we just have to account for the dependence if the observations are not independent.

Estimation of \boldsymbol{\beta} and \boldsymbol{\varepsilon}

  • We estimate \boldsymbol{\beta},

\boldsymbol{\hat{\beta}} = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{Y}

  • … and find the predicted (or fitted) value,

\mathbf{\hat{Y}} = \mathbf{X} \boldsymbol{\hat{\beta}} = \mathbf{X}\left(\mathbf{X}'\mathbf{X}\right)^{-1} \mathbf{X}'\mathbf{Y}

  • … and find the residual,

\boldsymbol{\hat{\varepsilon}} \equiv \mathbf{e} = \mathbf{Y} - \mathbf{\hat{Y}} = \mathbf{Y} - \mathbf{X}\boldsymbol{\hat{\beta}}
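As a concrete check, these three formulas can be sketched in NumPy on simulated data (the variable names and simulation setup are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3                                   # n observations, p columns (incl. intercept)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# beta_hat = (X'X)^{-1} X'Y; solve() avoids forming the inverse explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat                           # fitted values, Y_hat = X beta_hat
e = y - y_hat                                  # residuals, e = Y - Y_hat
```

A useful sanity check: with an intercept in the model, the residuals are orthogonal to every column of \mathbf{X}.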

Design Matrix!

  • Let’s talk about the design matrix for a second.

\mathbf{X} = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,(p-1)} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n,1} & x_{n,2} & \cdots & x_{n, (p-1)} \end{bmatrix}

  • Why is there a column of 1’s?

  • What is p?

  • Note that for estimation to happen, \mathbf{X} must be full rank.

    • An r \times r matrix is said to be full rank if its rank is r.

    • This means that the matrix is nonsingular (it has an inverse).

    • For the n \times p design matrix (n > p), full rank means full column rank (rank p), which is exactly what makes \mathbf{X}'\mathbf{X} invertible.

    • Full rank also means that all columns are linearly independent.

Design Matrix!

  • Full rank also means that all columns are linearly independent.

  • Why am I harping on this?

  • This is why we have reference groups for categorical predictors!!

    • If we were to include all c classes, we would no longer have a full rank design matrix.

    • This is why, when we do not construct the indicator variables ourselves, software “chooses” a reference group for us.

      • We literally cannot include all groups as predictors because we cannot perform the estimation.
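A small NumPy sketch of this rank problem (the group labels and sizes are made up): with c = 3 classes, the intercept plus all three indicator columns sum to the column of 1’s, so the design matrix loses full rank; dropping one class as the reference group restores it.

```python
import numpy as np

groups = np.array([0, 0, 1, 1, 2, 2])            # hypothetical categorical predictor, c = 3
dummies = np.eye(3)[groups]                      # one indicator column per class

# Intercept AND all c indicators: the indicator columns sum to the 1's column
X_bad = np.column_stack([np.ones(6), dummies])
# Drop one class (the reference group): full rank again
X_ok = np.column_stack([np.ones(6), dummies[:, 1:]])

rank_bad = np.linalg.matrix_rank(X_bad)          # 3, but X_bad has 4 columns
rank_ok = np.linalg.matrix_rank(X_ok)            # 3, equal to the number of columns
```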

Hat Matrix!

  • Let’s revisit the predicted value,

\begin{align*} \mathbf{\hat{Y}} &= \mathbf{X} \boldsymbol{\hat{\beta}} \\ &= \mathbf{X}\left(\mathbf{X}'\mathbf{X}\right)^{-1} \mathbf{X}'\mathbf{Y} \\ &= \left( \mathbf{X}\left(\mathbf{X}'\mathbf{X}\right)^{-1} \mathbf{X}' \right) \mathbf{Y} \end{align*}

  • We call \mathbf{H} = \mathbf{X}\left(\mathbf{X}'\mathbf{X}\right)^{-1} \mathbf{X}' the hat matrix.

    • … because it gives \mathbf{Y} its hat.
  • Note that the hat matrix is symmetric and idempotent (i.e., \mathbf{H}\mathbf{H} = \mathbf{H}).
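Both properties are easy to verify numerically; here is a quick NumPy sketch on an arbitrary simulated design matrix (the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(10), rng.normal(size=(10, 2))])  # n = 10, p = 3

# Hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.solve(X.T @ X, X.T)

symmetric = np.allclose(H, H.T)      # H' = H
idempotent = np.allclose(H @ H, H)   # HH = H
```

A related fact worth noticing: the trace of \mathbf{H} equals p, the number of columns of \mathbf{X}.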

Hat Matrix and Residuals

  • We have an alternate way of expressing the residuals,

\begin{align*} \mathbf{e} &= \mathbf{Y} - \mathbf{\hat{Y}} \\ &= \mathbf{Y} - \mathbf{H}\mathbf{Y} \\ &= \left(\mathbf{I} - \mathbf{H} \right) \mathbf{Y} \end{align*}

  • This is why you will see references to elements from the hat matrix (h_{ii}) for calculations related to residual analysis.

    • e.g., standardized residuals, studentized residuals, Cook’s distance, etc.
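A short NumPy sketch of residuals via \left(\mathbf{I} - \mathbf{H}\right)\mathbf{Y}, and of how the leverages h_{ii} enter one such diagnostic; the simulated data and the standardized-residual formula used here, r_i = e_i / \sqrt{\text{MSE}\,(1 - h_{ii})}, are the usual textbook versions rather than anything taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)
e = (np.eye(n) - H) @ y                 # residuals, e = (I - H) Y

# The leverages h_ii feed residual diagnostics, e.g. standardized residuals
h = np.diag(H)
mse = e @ e / (n - p)
r = e / np.sqrt(mse * (1 - h))
```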

Wrap Up on Matrices

  • Note that we can go through the matrix-based calculations for everything we’ve learned so far, but that is not the point of this course.

  • Goal: give you an appreciation/understanding of what’s going on under the hood.

  • If you are interested, you can read more here:

Introduction to Generalized Linear Models

  • We will now extend to generalized linear models (GzLM).

    • These models allow us to model non-normal responses.
  • GzLMs have three components:

    • Random component: \mathbf{Y}, the response.

    • Linear predictor: \mathbf{X}\boldsymbol{\beta}, where \boldsymbol{\beta} is the parameter vector and \mathbf{X} is the n \times p design matrix that contains p-1 explanatory variables for the n observations.

    • Link function: g\left(\text{E}[\mathbf{Y}]\right) = \mathbf{X}\boldsymbol{\beta}

Random Component

  • The random component consists of the response, y

    • y has independent observations

    • y’s probability density or mass function is in the exponential family

  • A random variable, y, has a distribution in the exponential family if it can be written in the following form: f(y | \theta, \phi) = \exp \left\{ \frac{y\theta-b(\theta)}{a(\phi)} + c(y|\phi) \right\},

  • for specified functions a(\cdot), b(\cdot), and c(\cdot).

    • \theta is the parameter of interest (canonical or natural)

    • \phi is usually a nuisance parameter (scale or dispersion)

Random Component

  • Let’s show that the normal distribution is in the exponential family.

  • Assume y \sim N(\mu, \sigma^2); we will treat \sigma^2 as a nuisance parameter.

\begin{align*} f(y | \theta, \phi) &= \left(\frac{1}{2\pi \sigma^2} \right)^{1/2} \exp\left\{ \frac{-1}{2\sigma^2} (y-\mu)^2 \right\} \\ &= \exp\left\{ \frac{-y^2}{2\sigma^2} + \frac{y\mu}{\sigma^2} - \frac{\mu^2}{2\sigma^2} - \frac{1}{2} \ln\left(2\pi\sigma^2\right) \right\} \\ &= \exp\left\{\frac{y\mu-\mu^2/2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2} \ln\left(2\pi\sigma^2\right) \right\} \\ &= \exp \left\{ \frac{y\theta-b(\theta)}{a(\phi)} + c(y|\phi) \right\} \end{align*}

Random Component

\begin{align*} f(y | \theta, \phi) &= \exp\left\{\frac{y\mu-\mu^2/2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2} \ln\left(2\pi\sigma^2\right) \right\} \\ &= \exp \left\{ \frac{y\theta-b(\theta)}{a(\phi)} + c(y|\phi) \right\} \end{align*}

  • Linking the pieces,

    • \theta = \mu
    • \phi = \sigma^2
    • a(\phi) = \phi
    • b(\theta) = \mu^2/2 = \theta^2/2
    • c(y|\phi) = -\left(\frac{1}{2} \ln\left(2\pi\sigma^2\right) + \frac{y^2}{2\sigma^2}\right)
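This mapping can be checked numerically: plugging \theta = \mu, \phi = \sigma^2, a(\phi) = \phi, b(\theta) = \theta^2/2, and c(y|\phi) into the exponential-family form should reproduce the ordinary normal density (the test points below are arbitrary):

```python
import math

def normal_pdf(y, mu, sigma2):
    # Ordinary N(mu, sigma2) density
    return math.exp(-(y - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def expfam_pdf(y, mu, sigma2):
    # Exponential-family form with theta = mu and phi = sigma2
    theta, phi = mu, sigma2
    a = phi                                               # a(phi) = phi
    b = theta ** 2 / 2                                    # b(theta) = theta^2 / 2
    c = -(0.5 * math.log(2 * math.pi * phi) + y ** 2 / (2 * phi))
    return math.exp((y * theta - b) / a + c)

points = [(0.3, 1.0, 2.0), (-1.2, 0.5, 0.25)]
match = all(math.isclose(normal_pdf(y, m, s2), expfam_pdf(y, m, s2)) for y, m, s2 in points)
```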

Random Component

  • Why are we restricted to the exponential family?

    • There are general expressions for model likelihood equations.

    • There are asymptotic distributions of estimators for model parameters (i.e., \hat{\beta}).

      • This allows us to “know” things!
    • We have an algorithm for fitting models.

Linear Predictors

  • Let us first discuss notation.

    • x_{ij} is the jth explanatory variable for observation i

      • i=1, ... , n

      • j=1, ..., p-1

    • Each observation (or subject) has a vector, x_i = \left[1, x_{i,1}, ..., x_{i, p-1}\right]

      • The leading 1 is for the intercept, \beta_0.

      • These are put together into \mathbf{X}, the design matrix.

Linear Predictors

  • The linear predictor relates parameters (\eta_i) pertaining to E[y_i] to the explanatory variables using a linear combination, \eta_i = \beta_0 + \sum_{j=1}^{p-1} \beta_j x_{ij}

  • In matrix form, \boldsymbol{\eta} = \mathbf{X} \boldsymbol{\beta}

  • This is a linear predictor because it is linear in the parameters, \boldsymbol{\beta}.

    • The predictors can be nonlinear.

      • e.g., \text{time}^2 (quadratic term), x_1 x_2 (interaction)
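A small NumPy sketch of this point: a design matrix with a quadratic term and an interaction is nonlinear in the data, yet \boldsymbol{\eta} = \mathbf{X}\boldsymbol{\beta} is still linear in the coefficients (the predictors and coefficient values are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
t = rng.normal(size=n)             # a hypothetical "time" predictor
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Columns: intercept, time, time^2 (quadratic), x1, x2, x1*x2 (interaction)
X = np.column_stack([np.ones(n), t, t ** 2, x1, x2, x1 * x2])
beta = np.array([0.5, 1.0, -0.2, 0.8, 0.3, 0.1])
eta = X @ beta                     # eta = X beta
```

Linearity in \boldsymbol{\beta} means, for example, that scaling all coefficients scales \boldsymbol{\eta} the same way.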

Linear Predictors

  • We call \mathbf{X} the design matrix.

    • There are n rows (one for each observation).

    • There are p columns (a leading column of 1’s for the intercept plus the p-1 predictors).

  • In most studies, we have p < n.

    • Goal: summarize data using a smaller number of parameters.
  • Some studies have p \gg n, e.g., genetics.

    • These studies need special methodology that is not covered in this course.
  • GzLMs treat y_i as random, x_i as fixed.

    • These are called fixed-effects models.

    • There are also random effects \to fixed effects + random effects = mixed models.

      • Random effects are beyond the scope of this course.

Wrap Up: Generalized Linear Models

  • We have talked about the underlying matrix algebra in regression analysis.

  • We also introduced the generalized linear model (GzLM).

    • You can read more about the GzLM here.
  • Moving forward, we will be learning how to construct regression models using the following distributions:

    • Gamma
    • Beta*
    • Binomial
    • Multinomial
    • Poisson
    • Negative binomial

Wrap Up: Generalized Linear Models

  • Note that this course is just the beginning of modeling.

  • There are more complex models that are beyond the scope of this course!

    • Start thinking now about what you would like to study in Capstone.
  • In all of your classes, consider the topics building blocks and ask yourself how they relate to topics you’ve learned about in other courses.

    • e.g., if you go on to take Time Series and Spatial Analysis, you will see extensions of the GzLM.