There’s obviously a lot to say about linear regression, which is the workhorse of applied statistics. I don’t want to say (type) most of it, quite frankly. But what kind of map would this be if it didn’t at least have a small page on linear regression? If you’re deeply interested in applied linear regression, perhaps you should go read *Applied Linear Regression*.
Linear regression assumes the mean response is a linear function of the covariates:

$$\mathbb{E}[y \mid x] = x^\top \beta,$$

and we assume that $\mathrm{Var}(y \mid x) = \sigma^2$ is constant. (For non-constant variance, use weighted least squares.) We thus assume the model

$$y = X\beta + \varepsilon,$$

where the errors $\varepsilon$ are assumed to be mean zero and capture the differences between the observed response and the expected response. (Add a constant column to $X$ if it makes you happy.)
Typically one estimates $\beta$ by minimizing the “residual sum-of-squares” (RSS), which is the squared distance between our predictions ($X\beta$) and the true values ($y$), i.e.,

$$\hat{\beta} = \operatorname*{arg\,min}_{\beta} \; \|y - X\beta\|_2^2.$$

The resulting $\hat{\beta}$ is the (ordinary) least squares estimate.
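To make this concrete, here’s a minimal sketch of computing the least squares estimate with NumPy on simulated data (the variable names, dimensions, and noise scale are illustrative choices, not anything canonical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data from the assumed model: y = X @ beta + noise.
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])  # constant column included
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Ordinary least squares: minimize ||y - X @ beta||^2.
beta_hat, rss, rank, svals = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # should be close to beta_true
```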
The main assumptions behind linear regression are:
- linear relationship between response and covariates (surprise surprise)
- errors are independent and homoscedastic (fancy way of saying the error variance is the same for every observation)
- covariates are non-collinear (if this is violated, your coefficients can become unstable and lose meaning, making statistics computed on them useless; a quick numerical check is sketched after this list).
- You’ll often see people make the assumption that the errors are normally distributed. This isn’t strictly necessary for actually fitting your model, but if you want to construct confidence intervals or perform any hypothesis testing on the coefficients, then you need this kind of assumption.
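Collinearity is the easiest of these to check numerically. One common approach is to look at the condition number of the design matrix; here’s a small sketch, where the nearly duplicated covariate is a contrived illustration and any threshold you’d apply is a rule of thumb:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = 2.0 * x1 + rng.normal(scale=1e-3, size=n)  # nearly a linear copy of x1
X_bad = np.column_stack([np.ones(n), x1, x2])
X_ok = np.column_stack([np.ones(n), x1, rng.normal(size=n)])

# A large condition number flags near-linear dependence among the covariates,
# which is exactly when the least squares coefficients become unstable.
print(np.linalg.cond(X_ok))   # modest
print(np.linalg.cond(X_bad))  # enormous
```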
Linear regression is, of course, a generalized linear model with the identity link function, $g(\mu) = \mu$, where $\mu = \mathbb{E}[y \mid x]$.
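If you want to see this concretely and have statsmodels handy, a Gaussian GLM (whose default link is the identity) should reproduce the OLS coefficients. A quick sketch on simulated data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=100)

# OLS and a Gaussian GLM with the identity link are the same model,
# so the fitted coefficients agree up to numerical noise.
print(sm.OLS(y, X).fit().params)
print(sm.GLM(y, X, family=sm.families.Gaussian()).fit().params)
```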
In practice, once you’ve fit a model, you should run some diagnostic checks. This usually involves plotting the residuals (the differences between the labels and your model’s predictions) and checking that they are roughly mean-zero and show no obvious structure.
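For instance, a bare-bones residuals-versus-fitted plot with matplotlib (again a sketch on simulated data; what you’re hoping to see is a structureless cloud centered at zero):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat  # observed minus predicted

# A healthy plot shows no trend, no funnel shape, and a mean near zero.
fig, ax = plt.subplots()
ax.scatter(X @ beta_hat, residuals, alpha=0.6)
ax.axhline(0.0, linestyle="--", color="gray")
ax.set_xlabel("fitted values")
ax.set_ylabel("residuals")
plt.show()
```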