y = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1}
\mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon},
where
\begin{align*} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} &= \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,(p-1)} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n,1} & x_{n,2} & \cdots & x_{n, (p-1)} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix} \end{align*}
Recall the error term, \boldsymbol{\varepsilon} = \left[ \varepsilon_1, \ \varepsilon_2, \ \cdots, \ \varepsilon_n\right]^\text{T}, a vector of independent normal random variables with
an n \times 1 expectation vector, \text{E}\left[\boldsymbol{\varepsilon}\right] = \mathbf{0}, and
an n \times n variance-covariance matrix, \sigma^2 \mathbf{I},
\sigma^2 \mathbf{I} = \sigma^2 \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix}
Note that we do not have to assume independence …
\boldsymbol{\hat{\beta}} = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{Y}
\mathbf{\hat{Y}} = \mathbf{X} \boldsymbol{\hat{\beta}} = \mathbf{X}\left(\mathbf{X}'\mathbf{X}\right)^{-1} \mathbf{X}'\mathbf{Y}
\boldsymbol{\hat{\varepsilon}} \equiv \mathbf{e} = \mathbf{Y} - \mathbf{\hat{Y}} = \mathbf{Y} - \mathbf{X}\boldsymbol{\hat{\beta}}
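The three formulas above translate directly into code. A minimal numpy sketch (the simulated data and the values of n, p, and \boldsymbol{\beta} are assumptions for illustration, not from the notes):

```python
import numpy as np

# Simulate a small data set and apply the matrix formulas:
# beta_hat = (X'X)^{-1} X'Y,  Y_hat = X beta_hat,  e = Y - Y_hat.
rng = np.random.default_rng(0)
n = 20
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])          # design matrix with a column of 1's
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(scale=0.1, size=n)  # Y = X beta + eps

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)       # (X'X)^{-1} X'Y
Y_hat = X @ beta_hat                               # fitted values
e = Y - Y_hat                                      # residuals

print(beta_hat)                                    # close to beta_true
```

As a side effect of the formula, the residuals are orthogonal to every column of \mathbf{X} (i.e., \mathbf{X}'\mathbf{e} = \mathbf{0}), which you can verify with `X.T @ e`.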
\mathbf{X} = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,(p-1)} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n,1} & x_{n,2} & \cdots & x_{n, (p-1)} \end{bmatrix}
Why is there a column of 1’s?
What is p?
Note that for estimation to happen (i.e., for \left(\mathbf{X}'\mathbf{X}\right)^{-1} to exist), \mathbf{X} must be of full column rank.
An r \times r matrix is said to be full rank if its rank is r.
This means that the matrix is nonsingular (it has an inverse).
Full rank also means that all columns are linearly independent.
Why am I harping on this?
This is why we have reference groups for categorical predictors!!
If we were to include all c classes, we would no longer have a full rank design matrix.
This is why, when we do not create the indicator variables ourselves, software “chooses” a reference group for us.
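The rank deficiency is easy to see numerically. A small sketch (the 3-class predictor and observation count are hypothetical): with an intercept, the c indicator columns sum to the column of 1's, so including all of them breaks full column rank.

```python
import numpy as np

# A hypothetical 3-class categorical predictor, 6 observations (2 per class).
classes = np.array([0, 0, 1, 1, 2, 2])
dummies = np.eye(3)[classes]              # one indicator column per class

# Intercept + all c = 3 indicator columns: the indicators sum to the
# intercept column, so X is NOT full column rank and X'X has no inverse.
X_bad = np.column_stack([np.ones(6), dummies])
print(np.linalg.matrix_rank(X_bad))       # 3, but X_bad has 4 columns

# Dropping one class (the reference group) restores full column rank.
X_good = np.column_stack([np.ones(6), dummies[:, 1:]])
print(np.linalg.matrix_rank(X_good))      # 3 == number of columns
```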
\begin{align*} \mathbf{\hat{Y}} &= \mathbf{X} \boldsymbol{\hat{\beta}} \\ &= \mathbf{X}\left(\mathbf{X}'\mathbf{X}\right)^{-1} \mathbf{X}'\mathbf{Y} \\ &= \left( \mathbf{X}\left(\mathbf{X}'\mathbf{X}\right)^{-1} \mathbf{X}' \right) \mathbf{Y} \end{align*}
We call \mathbf{H} = \mathbf{X}\left(\mathbf{X}'\mathbf{X}\right)^{-1} \mathbf{X}' the hat matrix.
Note that the hat matrix is symmetric and idempotent (i.e., \mathbf{H}\mathbf{H} = \mathbf{H}).
\begin{align*} \mathbf{e} &= \mathbf{Y} - \mathbf{\hat{Y}} \\ &= \mathbf{Y} - \mathbf{H}\mathbf{Y} \\ &= \left(\mathbf{I} - \mathbf{H} \right) \mathbf{Y} \end{align*}
This is why you will see references to elements from the hat matrix (h_{ii}) for calculations related to residual analysis.
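These properties can be checked directly. A quick numpy sketch (the design matrix is a simulated stand-in); it also verifies that the leverages h_{ii} sum to p, since \text{tr}(\mathbf{H}) = \text{rank}(\mathbf{X}) = p:

```python
import numpy as np

# Build H for a small simulated design matrix (n = 10, p = 3) and check
# the properties stated above.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(10), rng.normal(size=(10, 2))])
H = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(H, H.T))         # symmetric
print(np.allclose(H @ H, H))       # idempotent: HH = H
print(np.isclose(np.trace(H), 3))  # sum of leverages h_ii equals p
```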
We will now extend to generalized linear models (GzLMs).
GzLMs have three components:
Random component: \mathbf{Y}, the response.
Linear predictor: \mathbf{X}\boldsymbol{\beta}, where \boldsymbol{\beta} is the parameter vector and \mathbf{X} is the n \times p design matrix that contains p-1 explanatory variables for the n observations.
Link function: g\left(\text{E}[\mathbf{Y}]\right) = \mathbf{X}\boldsymbol{\beta}
The random component consists of the response, y
y has independent observations
y’s probability density or mass function is in the exponential family
A random variable, y, has a distribution in the exponential family if it can be written in the following form: f(y | \theta, \phi) = \exp \left\{ \frac{y\theta-b(\theta)}{a(\phi)} + c(y|\phi) \right\},
for specified functions a(\cdot), b(\cdot), and c(\cdot).
\theta is the parameter of interest (canonical or natural)
\phi is usually a nuisance parameter (scale or dispersion)
Let’s show that the normal distribution is in the exponential family.
Assume y \sim N(\mu, \sigma^2); we will treat \sigma^2 as a nuisance parameter.
\begin{align*} f(y | \theta, \phi) &= \left(\frac{1}{2\pi \sigma^2} \right)^{1/2} \exp\left\{ \frac{-1}{2\sigma^2} (y-\mu)^2 \right\} \\ &= \exp\left\{ \frac{-y^2}{2\sigma^2} + \frac{y\mu}{\sigma^2} - \frac{\mu^2}{2\sigma^2} - \frac{1}{2} \ln\left(2\pi\sigma^2\right) \right\} \\ &= \exp\left\{\frac{y\mu-\mu^2/2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2} \ln\left(2\pi\sigma^2\right) \right\} \\ &= \exp \left\{ \frac{y\theta-b(\theta)}{a(\phi)} + c(y|\phi) \right\} \end{align*}
Linking the pieces, \theta = \mu, \ b(\theta) = \theta^2/2, \ a(\phi) = \phi = \sigma^2, and c(y|\phi) = -\frac{y^2}{2\sigma^2} - \frac{1}{2} \ln\left(2\pi\sigma^2\right).
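As a quick numerical sanity check (a sketch, not from the notes), the pieces read off the derivation above reproduce the N(\mu, \sigma^2) density exactly:

```python
import math

# theta = mu, b(theta) = theta^2/2, a(phi) = sigma^2,
# c(y|phi) = -y^2/(2 sigma^2) - (1/2) ln(2 pi sigma^2)
def normal_pdf(y, mu, sigma2):
    return (2 * math.pi * sigma2) ** -0.5 * math.exp(-(y - mu) ** 2 / (2 * sigma2))

def exp_family(y, mu, sigma2):
    theta = mu
    b = theta ** 2 / 2
    a = sigma2
    c = -y ** 2 / (2 * sigma2) - 0.5 * math.log(2 * math.pi * sigma2)
    return math.exp((y * theta - b) / a + c)

for y, mu, s2 in [(0.3, 1.0, 2.0), (-1.2, 0.5, 0.25)]:
    assert math.isclose(normal_pdf(y, mu, s2), exp_family(y, mu, s2))
```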
Why are we restricted to the exponential family?
There are general expressions for model likelihood equations.
There are asymptotic distributions of estimators for model parameters (i.e., \hat{\beta}).
We have an algorithm for fitting models.
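The standard fitting algorithm is iteratively reweighted least squares (Fisher scoring). A minimal sketch for Poisson regression with the canonical log link (the simulated data, starting values, and iteration count are assumptions for illustration):

```python
import numpy as np

# Simulate Poisson counts with log-linear mean mu_i = exp(x_i' beta).
rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.3])
Y = rng.poisson(np.exp(X @ beta_true))

# IRLS: repeatedly solve a weighted least-squares problem with
# working weights W = diag(mu) and working response z.
beta = np.zeros(2)
for _ in range(25):
    eta = X @ beta
    mu = np.exp(eta)             # inverse of the log link
    W = mu                       # for Poisson, Var(y_i) = mu_i
    z = eta + (Y - mu) / mu      # working response
    XtW = X.T * W                # X' W (W diagonal)
    beta = np.linalg.solve(XtW @ X, XtW @ z)

print(beta)                      # close to beta_true
```

Each pass is just the weighted analogue of \boldsymbol{\hat{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}, which is why the exponential-family structure makes one algorithm work across distributions.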
Let us first discuss notation.
x_{ij} is the jth explanatory variable for observation i
i = 1, \ldots, n
j = 1, \ldots, p-1
Each observation (or subject) has a vector, x_i = \left[1, x_{i,1}, \ldots, x_{i, p-1}\right]
The leading 1 is for the intercept, \beta_0.
These are put together into \mathbf{X}, the design matrix.
The linear predictor relates parameters (\eta_i) pertaining to E[y_i] to the explanatory variables using a linear combination, \eta_i = \beta_0 + \sum_{j=1}^{p-1} \beta_j x_{ij}
In matrix form, \boldsymbol{\eta} = \mathbf{X} \boldsymbol{\beta}
This is a linear predictor because it is linear in the parameters, \boldsymbol{\beta}.
The predictors themselves can be nonlinear, e.g., \eta_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 is still linear in \boldsymbol{\beta}.
We call \mathbf{X} the design matrix.
There are n rows (one for each observation).
There are p columns (p-1 predictors plus the leading column of 1’s for the intercept).
In most studies, we have p < n.
Some studies have p \gg n, e.g., genetics.
GzLMs treat y_i as random, x_i as fixed.
These are called fixed-effects models.
There are also random effects \to fixed effects + random effects = mixed models.
The link function, g(\cdot), connects the random component with the linear predictor.
Let \mu_i = \text{E}[y_i]; the link function relates the mean to the linear predictor via g(\mu_i) = \eta_i (e.g., the identity link g(\mu_i) = \mu_i or the log link g(\mu_i) = \ln \mu_i).
In the exponential family representation, a certain parameter serves as its natural parameter.
The link function that transforms \mu_i to the natural parameter, \theta, is called the canonical link (for the normal distribution, \theta = \mu, so the canonical link is the identity).
We have talked about the underlying matrix algebra in regression analysis.
We also introduced generalized linear models.
Moving forward, we will be learning how to construct regression models using the following distributions:
Note that this course is just the beginning of modeling.
There are more complex models that are beyond the scope of this course!
In all of your classes, consider the topics building blocks and ask yourself how they relate to topics you’ve learned about in other courses.