The linear model
A scientific model is an approximation of the behaviour of a natural system, based on observations of it. The heliocentric model of the solar system, famously championed by Galileo, is a good example of this.
A "linear" model is the approximation of a system using a line equation, which in early education usually takes the form \(y = ax + b\) (sometimes the \(a\) is replaced with an \(m\)). In this equation, \(x\) and \(y\) are variables, and \(a\) and \(b\) are parameters, which assume constant values that define the slope and y-intercept of the line, respectively. It is common, especially in statistics, to refer to these parameters with the Greek letter \(\beta\); so in what follows I'll be using \(\beta_0\) to refer to the intercept and \(\beta_1\) to refer to the slope.
So how can a linear model be useful for an aspiring scientist? As an example, suppose you live in the early 20th century, when the automobile industry was just getting off the ground, and you notice that more people seem to be getting into accidents after having consumed alcohol than when sober. You formulate this as a hypothesis: alcohol consumption impairs reaction time. You can express this as a linear model:
$$\hat{RT} = \beta_0 + \beta_1 BAC,$$
where \(\hat{RT}\) is the predicted reaction time, in milliseconds (called the outcome variable), and \(BAC\) is blood alcohol concentration, in grams per millilitre (called the predictor variable).
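To make this concrete, here's a minimal sketch in Python of the model as a prediction function. The coefficient values (a 250 ms sober baseline and a slope of 600) are hypothetical numbers I've made up purely for illustration:

```python
# A minimal sketch of the model RT_hat = beta0 + beta1 * BAC.
# The coefficient values below are hypothetical, for illustration only.

def predict_rt(bac, beta0=250.0, beta1=600.0):
    """Predicted reaction time (ms) for a given blood alcohol concentration."""
    return beta0 + beta1 * bac

# Example: predicted RT at a few (hypothetical) BAC values.
for bac in (0.0, 0.05, 0.10):
    print(f"BAC = {bac:.2f} -> predicted RT = {predict_rt(bac):.1f} ms")
```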
As a scientist, you design an experiment and measure your observations from a randomly selected, representative sample. You define your null hypothesis (the thing you are trying to disprove) as a complete lack of relationship between \(BAC\) and \(RT\) in the population of interest. This corresponds to a horizontal line, where \(\beta_1=0\):
$$H_0: \beta_1=0$$ $$H_a: \beta_1 \neq 0$$
What do you do next?
Fitting a linear model to observations
The first thing you want to know in this scenario is: "how well can I approximate the data with my model?"
Problem is, you don't know what \(\beta\) parameters to use. You can get an idea about this by plotting your variables against one another as a scatterplot:
These plots need a bit of explanation. For the scatterplot at top, each circle is a single participant, and its \(x\) and \(y\) positions represent the values of the two variables you measured from that participant (e.g., \(BAC\) and \(RT\)). The red line is the line we are trying to use to approximate the relationship between these variables. This line allows us to predict the value of \(y\) from an observed value of \(x\).
This predicted value of \(y\) is referred to as \(\hat{y}\) (pronounced "y hat"), and the fact that we are making a prediction can be expressed like this:
$$\hat{y}=\beta_0+\beta_1 x$$
The quality of our model can be assessed by looking at how much the model's prediction \(\hat{y}\) deviates from the actual observed value \(y\) for each data point. In other words, how far do our data points fall from the line? This is called residual error, and is shown above as the green lines.
In the above plots, you can use the slider to play with the value of \(\beta_1\) to try and minimize the model error. The line equation for that value is given in the red box. The model error can be quantified as a single value called mean squared error, or \(MSE\), shown in the green box. \(MSE\) can be obtained by squaring the residual errors (which makes them positive) and averaging across all data points:
$$MSE = \frac{\sum{(y-\hat{y})^2}}{N},$$
where \(N\) is the number of data points.
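To make the \(MSE\) calculation concrete, here's a minimal sketch in Python; the data points and candidate parameter values are made up purely for illustration:

```python
import numpy as np

# Made-up observations (x could be BAC, y could be RT); values are illustrative only.
x = np.array([0.00, 0.02, 0.04, 0.06, 0.08, 0.10])
y = np.array([248.0, 255.0, 262.0, 270.0, 281.0, 290.0])

def mse(y, y_hat):
    """Mean squared error: average of the squared residuals (y - y_hat)."""
    return np.mean((y - y_hat) ** 2)

# Evaluate a candidate line y_hat = beta0 + beta1 * x.
beta0, beta1 = 248.0, 400.0   # hypothetical parameter values
y_hat = beta0 + beta1 * x
print(f"MSE for beta1 = {beta1}: {mse(y, y_hat):.2f}")
```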
Our goal here is to find the values of \(\beta\) that best fit our observations, and this corresponds to the values where \(MSE\) is minimal. In the bottom plot, we are plotting \(MSE\) as a function of the different values of \(\beta_1\) that you are exploring with the slider. If you move the slider enough, you can see that this forms a U-shaped curve, with a minimum around 0.62.
This is the slope we are looking for!
The simplest way to solve this minimization problem (without playing with a slider) is through a method called ordinary least squares, commonly abbreviated as OLS, which for a single predictor variable is derived here (disclaimer: math!). The basic logic is: as long as our \(MSE\) values form a U-shaped (also called convex) curve, we can use calculus to determine analytically where the minimum is [1,2]. We first take the derivative (rate of change of \(MSE\) over values of \(\beta_1\)) and then determine at which value of \(\beta_1\) it is exactly 0 [3]:
$$\frac{d MSE}{d \beta_1}=0$$
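For the single-predictor case, setting this derivative to zero yields the standard closed-form OLS estimates, \(\hat{\beta}_1 = \frac{\sum{(x-\bar{x})(y-\bar{y})}}{\sum{(x-\bar{x})^2}}\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\). Here's a minimal sketch of that calculation (again with made-up data), checked against numpy's least-squares fit:

```python
import numpy as np

# Made-up observations, for illustration only.
x = np.array([0.00, 0.02, 0.04, 0.06, 0.08, 0.10])
y = np.array([248.0, 255.0, 262.0, 270.0, 281.0, 290.0])

# Closed-form OLS estimates for a single predictor:
#   beta1_hat = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
#   beta0_hat = y_bar - beta1_hat * x_bar
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar
print(f"OLS estimates: beta0 = {beta0_hat:.2f}, beta1 = {beta1_hat:.2f}")

# Sanity check against numpy's least-squares polynomial fit (degree 1).
beta1_np, beta0_np = np.polyfit(x, y, deg=1)
print(f"np.polyfit:    beta0 = {beta0_np:.2f}, beta1 = {beta1_np:.2f}")
```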
Statistical inference on linear models
Going back to our original example, suppose we do use OLS to obtain an optimal model for the relationship between \(BAC\) and \(RT\). Where do we go from here?
Our next task is to show that our estimated \(\beta\) parameters generalize to our population of interest. In other words, can we use this evidence to reject \(H_0\) and thereby infer that a relationship exists between our two variables? To do this, we need to show that our estimated \(\beta_1\) is further away from zero than would be expected by chance (within some acceptance threshold \(p < \alpha\)) from a random sample of size \(N\).
Intuitively, the issue we are grappling with is whether the difference between our model and the null model (i.e., where \(\beta_1=0\)) is sufficiently greater than the overall variation in our data. These are plotted below:
On the left, we've got the null model: in this case, \(\beta_1=0\) and our best guess at \(y\) is its mean: \(\hat{y}=\bar{y}\). In other words, \(x\) gives us no information whatsoever about \(y\). The deviations of each data point from the mean (\(y-\bar{y}\)), or total deviations, are shown as orange lines. In the middle, we've estimated a model to fit our data, shown again as a red line. The differences between this and the null model (\(\hat{y}-\bar{y}\)), or model deviations, are shown as purple lines. On the right, the remaining variability (\(y-\hat{y}\)), or residual deviations, are the same as we saw in the plot above, and are again coloured green.
Hopefully it is clear that the model and residual deviations sum to the total deviations, but also that the balance between the model and residual squared deviations (called \(SS_M\) and \(SS_R\), respectively) gives us an estimate of how well our model explains the data. In the extreme cases: for the null model, all deviation is residual and model deviation is 0: \(SS_T=SS_R\) (where \(SS_T\) is the sum of squared total deviations); and for a perfect model, all data points lie on the red line, and residual deviation is 0: \(SS_T=SS_M\).
Indeed, based on this balance, we can quantify the proportion of the variance of \(y\) that is explained by our model as:
$$R^2=\frac{SS_M}{SS_T}$$
You may note that this is analogous to the \(r^2\) for Pearson correlations.
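As a quick sketch of this decomposition (with made-up data, as before), we can compute the sums of squares directly, confirm that \(SS_T = SS_M + SS_R\), and check that \(R^2\) matches the squared Pearson correlation:

```python
import numpy as np

# Made-up observations, for illustration only.
x = np.array([0.00, 0.02, 0.04, 0.06, 0.08, 0.10])
y = np.array([248.0, 255.0, 262.0, 270.0, 281.0, 290.0])

# OLS fit (slope and intercept), as above.
beta1_hat, beta0_hat = np.polyfit(x, y, deg=1)
y_hat = beta0_hat + beta1_hat * x

# Total, model, and residual sums of squares.
ss_t = np.sum((y - y.mean()) ** 2)        # total deviations from the mean
ss_m = np.sum((y_hat - y.mean()) ** 2)    # deviations explained by the model
ss_r = np.sum((y - y_hat) ** 2)           # residual deviations

print(f"SS_T = {ss_t:.2f}, SS_M + SS_R = {ss_m + ss_r:.2f}")  # these should match
print(f"R^2 = {ss_m / ss_t:.4f}")
print(f"Squared Pearson r = {np.corrcoef(x, y)[0, 1] ** 2:.4f}")  # same value
```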
Another thing we can do is get a sampling distribution for this ratio that can be used to derive a p value (the probability of observing this ratio or higher when the null hypothesis is true). For this purpose, however, we use a slightly different ratio based on "mean" sums of squares:
$$MS_M = \frac{SS_M}{df_M}$$ $$MS_R = \frac{SS_R}{df_R}$$
I put "mean" in quotes because, confusingly, these are not the same as the \(MSE\) introduced above. Instead, they are divided by the degrees of freedom (\(df\)) for each quantity (more about those here). The reason we introduce df's at this point is to deal with sample size and model complexity. As is true of any sampling distribution, parameter estimates will have more variability for smaller sample sizes. Similarly, they will have more variability for more complex models (i.e., the number of predictor variables we are using to fit the data), because you can always overfit sample data with more parameters. This is reflected in the equations for these df's:
$$df_M=k$$ $$df_R=N-1-k,$$
where \(k\) is the number of predictor variables (in the case of a single \(x\) variable, this is 1). Both more predictor variables and fewer participants reduce the resulting ratio of \(MS_M\) to \(MS_R\), to account for this increased variability.
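For example, with a single predictor (\(k=1\)) and a hypothetical sample of \(N=20\) participants, we would have:

$$df_M = 1, \qquad df_R = 20 - 1 - 1 = 18$$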
Conveniently, the ratio of model to residual mean sums of squares has an F distribution under the null hypothesis, and is referred to as the F ratio:
$$F(df_M,df_R)=\frac{MS_M}{MS_R}$$
Comparing our F ratio to this distribution allows us to determine a p value. In the plot below [4], you can play with \(k\), \(N\), \(SS_M\), and \(SS_R\), to see their effect on the shape of the F distribution and the p value for our linear model. The p value is computed as the area under the F probability density function (PDF) between our F ratio (red line) and \(+\infty\), shown as the blue shaded area.
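Putting the pieces together, here's a sketch of the whole calculation, from sums of squares to the F ratio and its p value; it uses scipy's F distribution rather than the JavaScript routines behind the interactive plot, and the data are again made up:

```python
import numpy as np
from scipy import stats

# Made-up observations, for illustration only.
x = np.array([0.00, 0.02, 0.04, 0.06, 0.08, 0.10])
y = np.array([248.0, 255.0, 262.0, 270.0, 281.0, 290.0])

N, k = len(y), 1                            # sample size and number of predictors
beta1_hat, beta0_hat = np.polyfit(x, y, deg=1)
y_hat = beta0_hat + beta1_hat * x

ss_m = np.sum((y_hat - y.mean()) ** 2)      # model sum of squares
ss_r = np.sum((y - y_hat) ** 2)             # residual sum of squares

df_m, df_r = k, N - 1 - k                   # degrees of freedom
ms_m, ms_r = ss_m / df_m, ss_r / df_r       # "mean" sums of squares

f_ratio = ms_m / ms_r
p_value = stats.f.sf(f_ratio, df_m, df_r)   # area under the F PDF from our F ratio to +infinity
print(f"F({df_m},{df_r}) = {f_ratio:.2f}, p = {p_value:.4g}")
```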
Interpreting a simple linear regression result
Okay, so let's say you've obtained a p-value for your reaction time experiment, which is less than your pre-specified threshold of \(\alpha=0.05\). You can now go ahead and reject \(H_0\), on which basis you infer that there is, indeed, a likely relationship between \(RT\) and \(BAC\) in the general population.
Can you make the bolder inference that higher \(BAC\) causes slower \(RT\)? This depends on your research design. If you had randomly (and blindly) assigned alcohol to participants, then you could fairly confidently make this claim, as there are no conceivable confounding variables left uncontrolled. If, on the other hand, you had let participants determine their own alcohol intake, then you might have a problem: a third factor might conceivably have influenced both their choice of imbibement and their reaction time.
In general, unless your research design is strictly experimental (i.e., the predictor variable was manipulated in a controlled manner), it is fairly difficult to support a causal inference from a linear regression result. The more epidemiological the design (i.e., measuring many variables from a large sample with few or no controls), the harder this becomes. This is because unmeasured factors may well exist that influence or mediate the association between two measured ones. If we were, for instance, to use a questionnaire to ask people for their average alcohol consumption and frequency of automobile accidents, and find an association between these variables, we could not be certain that one variable causes the other.
One way to approach confounding is to measure the variables we think may be confounders. For example, if we hypothesized that males drink more on average than females, or that younger people partake more often than do older people, then we could add sex and age as covariates in our linear model.
That brings us to multiple linear regression, which is another kettle of fish that I will serve up in a future blog post. :)
Incidentally, the use of squared model errors instead of absolute values is largely so that the \(MSE\) curve shown above is U-shaped (smooth) rather than V-shaped; the derivative of a V-shaped curve is undefined at its minimum.
Handily, it can be shown that this so-called cost function for any input set will be convex (here's some more math), which makes OLS quite ubiquitous.
Actually, we need to use partial derivatives here, as we are minimizing with respect to both \(\beta_0\) and \(\beta_1\) (and more parameters for multiple linear regression), but for our purposes this illustration gets the point across!
Javascript code for computing the F-ratio probability density function (PDF) and cumulative distribution function (CDF) was obtained from this excellent site.