R Packages Used

```r
library(ggplot2)
library(ggtext)
```

08 12 2025
Before diving into assumptions and diagnostic tests, we need to understand residuals—the cornerstone of all regression diagnostics. Many students confuse residuals with error terms, but understanding the distinction is crucial for interpreting diagnostic plots and tests.
Residuals are the observed differences between actual values and predicted values from your fitted model:
\[e_i = y_i - \hat{y}_i\]
where \(y_i\) is the observed value, \(\hat{y}_i\) is the fitted (predicted) value, and \(e_i\) is the residual for observation \(i\).
In plain language: A residual tells you how far off your prediction was for each observation. If your model predicts a customer will spend 100 EUR but they actually spent 120 EUR, the residual is 20 EUR (see also Figure 1).
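In R, this subtraction is exactly what `residuals()` returns from a fitted model. A minimal sketch with made-up data (the variable names and values here are illustrative only):

```r
# Toy data: fit a simple linear model (values are fabricated for illustration)
set.seed(42)
spend <- data.frame(ads = 1:20, sales = 5 + 2 * (1:20) + rnorm(20))
model <- lm(sales ~ ads, data = spend)

# Residuals are observed minus fitted values
e <- residuals(model)
all.equal(unname(e), spend$sales - unname(fitted(model)))  # TRUE (to floating-point tolerance)
```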
Students often use “residual” and “error” interchangeably, but they are fundamentally different concepts:
| Concept | Symbol | What It Is | Can We Observe It? |
|---|---|---|---|
| Error term | \(\epsilon_i\) | True, unknown deviation from the population regression line | No - it’s a theoretical concept |
| Residual | \(e_i\) | Observed deviation from our fitted sample regression line | Yes - we calculate it from our data |
The error term (\(\epsilon_i\)) is the true, unobservable deviation in the population model:
\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]
This represents the “true” relationship in the population. We never observe \(\epsilon_i\) because we don’t know the true population parameters (\(\beta_0\) and \(\beta_1\)).
The residual (\(e_i\)) is what we actually observe from our fitted sample model:
\[y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + e_i\]
This is our estimate based on sample data. We calculate \(e_i\) because we have estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\).
Assumptions are about error terms, not residuals: When we say “errors are normally distributed” or “errors have constant variance,” we’re making statements about the unobservable \(\epsilon_i\) in the population.
Diagnostics use residuals to learn about errors: Since we can’t observe \(\epsilon_i\), we use residuals \(e_i\) as our best approximation. We examine residuals hoping they reveal the properties of the true errors.
Residuals are estimates of errors: Under the regression assumptions, residuals should behave similarly to errors. If they don’t (showing patterns, non-constant variance, etc.), it suggests the assumptions are violated.
You’ll encounter several types of residuals in diagnostic output:
1. Raw residuals (\(e_i\)): \[e_i = y_i - \hat{y}_i\]
These are what we’ve been discussing—the basic difference between observed and predicted values.
2. Standardized residuals (\(e_i^*\)): \[e_i^* = \frac{e_i}{\hat{\sigma}\sqrt{1-h_i}}\]
where \(\hat{\sigma}\) is the residual standard error and \(h_i\) is the leverage. Standardized residuals have approximately unit variance, making them comparable across observations. Values beyond ±2 or ±3 suggest potential outliers.
Leverage (denoted as \(h_i\) or “hat value”) measures how unusual or extreme an observation’s predictor values (X values) are compared to the rest of the data. It quantifies how far an observation is from the center of the predictor space.
Mathematical definition:
In simple linear regression, leverage for observation \(i\) is:
\[h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n}(x_j - \bar{x})^2}\]
In multiple regression, leverage values come from the diagonal of the “hat matrix” \(\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\), which “puts the hat” on \(\mathbf{y}\) to get \(\hat{\mathbf{y}}\).
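In R, leverage values come from `hatvalues()`. As a quick check, for a simple regression (assuming a fitted object `model` of the form `lm(y ~ x)`), they match the closed-form expression above:

```r
# Leverage (hat values) from a fitted simple regression `model`
h <- hatvalues(model)

# Reconstruct the simple-regression formula by hand
x <- model.matrix(model)[, 2]              # the single predictor column
h_manual <- 1 / length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)

all.equal(unname(h), unname(h_manual))     # TRUE for simple regression with intercept
```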
Key properties:
- \(1/n \le h_i \le 1\) for a model with an intercept.
- The leverage values sum to \(p\) (the number of parameters), so the average leverage is \(p/n\).
Interpretation:
- Values near \(1/n\) indicate observations close to the center of the predictor space.
- A common rule of thumb flags \(h_i > 2p/n\) (or the stricter \(3p/n\)) as high leverage.
Why leverage matters:
High leverage observations have greater potential to influence the regression line because they’re far from the center of the data. Think of a lever: observations far from the fulcrum (center) have more power to move the line.
However, high leverage alone doesn’t make a point influential: a high-leverage observation that falls close to the pattern of the rest of the data barely moves the fitted line.
Example:
Imagine studying the relationship between study hours and exam scores for students who studied 10-20 hours. A student who studied 40 hours would have high leverage—they’re far from the typical range. If they score as predicted by the pattern, they confirm it. If they score much lower or higher than expected, they could dramatically change your estimated relationship.
Why leverage appears in standardized residuals:
The formula \(e_i^* = \frac{e_i}{\hat{\sigma}\sqrt{1-h_i}}\) adjusts for the fact that observations with high leverage naturally have smaller residuals because the regression line is “pulled” toward them. Without this adjustment, high leverage points would appear to fit better than they actually do, masking potential problems.
3. Studentized residuals:
Similar to standardized residuals, but the variance estimate is computed with observation \(i\) excluded, which makes them more robust for identifying outliers.
Why standardize? Raw residuals naturally have different variances depending on the leverage of each observation. Standardization accounts for this, putting all residuals on the same scale for fair comparison.
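All three types are available directly in base R. A short sketch (assuming a fitted `lm` object `model`):

```r
e_raw  <- residuals(model)   # raw residuals: e_i = y_i - y_hat_i
e_std  <- rstandard(model)   # standardized: e_i / (sigma_hat * sqrt(1 - h_i))
e_stud <- rstudent(model)    # studentized: variance estimated without observation i

# Flag potential outliers beyond +/- 2 standardized units
which(abs(e_std) > 2)
```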
Residuals are the empirical manifestation of all our modeling assumptions. Here’s why they’re so important:
1. They reveal assumption violations: curvature suggests nonlinearity, funnel shapes suggest heteroscedasticity, trends over observation order suggest dependence, and skewed distributions suggest non-normality.
2. They’re model-free diagnostics:
Residuals don’t require you to know the “true” model. They simply show you what’s left unexplained after fitting your model. Large, systematic patterns in residuals mean your model is missing something important.
3. They provide a common framework:
Almost every diagnostic tool examines residuals in some way: residuals-vs-fitted plots, QQ plots, Scale-Location plots, and formal tests such as Shapiro-Wilk and Breusch-Pagan all operate on the residuals.
4. They quantify model performance:
The magnitude of residuals tells you about prediction accuracy. Ideally, residuals should be small, centered around zero, and free of systematic patterns.
When you examine diagnostic plots, you’re asking: “Do these residuals behave like random noise, or do they show systematic patterns that suggest model problems?”
If residuals are truly random (as they should be when assumptions hold), they should center around zero, show roughly constant spread, display no trends or patterns, and look approximately normally distributed.
Any deviation from this ideal suggests where your model or assumptions may be failing.
Imagine you’re modeling house prices based on square footage. For a typical house, the prediction lands close to the actual sale price and the residual is small. But for another house, an unusually large one, the model might predict far less than the actual price, producing a large positive residual.
If most large houses have large positive residuals, this pattern suggests your linear model is inadequate—perhaps you need a quadratic term or the relationship changes at higher square footage.
Linear regression relies on several key assumptions about the data and the relationship between variables. Understanding these assumptions is crucial because violations can lead to:
- Biased or inefficient coefficient estimates
- Incorrect standard errors, confidence intervals, and p-values
- Misleading conclusions and poor predictions
What it means: The relationship between each predictor variable and the outcome variable is linear. In mathematical terms, the conditional expectation of Y given X follows a linear function: \(E[Y|X] = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p\)
Why it matters: Linear regression models can only capture linear relationships. If the true relationship is curved or more complex, the model will systematically mispredict values, leading to biased estimates and poor fit.
What happens if violated: The model will show systematic patterns in residuals (curves, waves), predictions will be systematically wrong in certain ranges, and R² will underestimate the true strength of the relationship.
What it means: Each observation is independent of all other observations. The residual (error) for one observation does not depend on the residual for any other observation: \(Cov(\epsilon_i, \epsilon_j) = 0\) for \(i \neq j\).
Why it matters: Independence is required for valid standard errors and hypothesis tests. When observations are correlated (e.g., repeated measurements from the same person, or time series data), standard errors are typically underestimated, leading to overconfident conclusions.
What happens if violated: Standard errors are incorrect (usually too small), confidence intervals are too narrow, p-values are too small (Type I error inflation), and you may falsely conclude that relationships are significant.
Common violations: Time series data (autocorrelation), clustered data (students within schools), repeated measures (multiple observations per subject), spatial data (nearby locations are similar).
What it means: The variance of the residuals is constant across all levels of the predictor variables: \(Var(\epsilon_i) = \sigma^2\) for all \(i\). The “spread” of residuals should be the same whether you’re predicting low or high values.
Why it matters: Heteroscedasticity (non-constant variance) leads to inefficient estimates and incorrect standard errors. While coefficient estimates remain unbiased, hypothesis tests and confidence intervals become unreliable.
What happens if violated: Standard errors are incorrect, confidence intervals are wrong (may be too wide or too narrow), hypothesis tests are invalid, and the model is inefficient (there are better ways to estimate the relationships).
Common causes: The variance of Y naturally increases with X (e.g., spending variance increases with income), measurement error that varies, or misspecified functional form (wrong model).
What it means: The residuals (errors) follow a normal distribution: \(\epsilon_i \sim N(0, \sigma^2)\). Note that this assumption is about the residuals, not the variables themselves.
Why it matters: Normality is primarily required for valid hypothesis tests and confidence intervals, especially in small samples. The Central Limit Theorem helps here: with large samples (n > 30-50), inference remains approximately valid even with moderate departures from normality.
What happens if violated: Confidence intervals and hypothesis tests may be inaccurate, especially in small samples. Predictions and coefficient estimates remain unbiased, but uncertainty quantification becomes unreliable.
Important note: This is the least critical assumption for large samples due to the Central Limit Theorem. Focus more on linearity, independence, and homoscedasticity.
What it means: No single observation has disproportionate influence on the regression estimates. While not a formal distributional assumption, influential points can dramatically change your conclusions.
Why it matters: One or two unusual observations can completely change slope estimates, intercepts, and significance tests. You need to know if your results depend on just a few data points.
What happens if violated: Coefficients may be pulled toward outliers, standard errors may be inflated, R² may be artificially high or low, and conclusions may not generalize.
Key distinction:
- Outliers have unusual Y values (large residuals)
- High leverage points have unusual X values
- Influential points have both high leverage AND large residuals
| Assumption | Description | Visual Test | Quantitative Test |
|---|---|---|---|
| Linearity | The relationship between predictors and outcome is linear | Residuals vs Fitted plot (Tukey-Anscombe) | RESET test (Ramsey) |
| Independence | Observations are independent of each other | Residuals vs Order plot; ACF plot | Durbin-Watson test |
| Homoscedasticity | Constant variance of residuals across all levels of predictors | Residuals vs Fitted; Scale-Location plot | Breusch-Pagan test; White test |
| Normality | Residuals follow a normal distribution | QQ plot; Histogram of residuals | Shapiro-Wilk test; Kolmogorov-Smirnov test |
| No influential outliers | No single observation has undue influence on the model | Residuals vs Leverage plot; Cook’s distance plot | Cook’s distance (> 0.5 or > 1); DFFITS |
Visual diagnostics are your first and most important line of defense. R automatically generates four diagnostic plots with plot(lm_model) that provide a comprehensive visual assessment. Always examine these plots before moving to quantitative tests.
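The four plots discussed below can be produced in one call (sketch assumes a fitted `lm` object `model`):

```r
# Display all four base-R diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))  # reset the plotting layout afterwards
```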
Purpose: Checks linearity and homoscedasticity simultaneously.
How it works: Plots residuals (\(e_i = y_i - \hat{y}_i\)) against fitted values (\(\hat{y}_i\)). If the linearity and homoscedasticity assumptions hold, residuals should be randomly scattered around zero with no patterns or trends.
What to look for:
- A random, patternless scatter around the zero line (assumptions hold)
- A curved or U-shaped pattern (nonlinearity)
- A funnel shape, with spread widening or narrowing across fitted values (heteroscedasticity)
Interpretation: The smoothed line (blue) should be approximately horizontal at zero. Any systematic deviation indicates assumption violations.
Purpose: Checks whether residuals follow a normal distribution.
How it works: Plots the quantiles of standardized residuals against theoretical quantiles from a standard normal distribution. If residuals are normally distributed, points should fall along the diagonal reference line.
What to look for:
- Points falling close to the diagonal reference line (approximate normality)
- Systematic curvature at one end (skewness)
- Points peeling away from the line at both ends (heavy or light tails)
Interpretation: Minor deviations are often acceptable, especially with larger sample sizes (n > 30-50) where the Central Limit Theorem provides robustness. Focus on gross departures from normality rather than minor wiggles.
Purpose: Checks homoscedasticity more clearly than the Residuals vs Fitted plot.
How it works: Plots the square root of the absolute standardized residuals (\(\sqrt{|e_i^*|}\)) against fitted values. The square root transformation stabilizes variance and makes patterns of heteroscedasticity easier to detect.
What to look for:
- A roughly horizontal smoothed line with evenly spread points (constant variance)
- An upward or downward trend in the line, or spread that changes with fitted values (heteroscedasticity)
Interpretation: This plot makes heteroscedasticity violations more apparent than the standard Residuals vs Fitted plot because the square root transformation amplifies patterns in variance.
Purpose: Identifies influential observations that disproportionately affect the regression model.
How it works: Plots standardized residuals against leverage (hat values). Includes Cook’s distance contours to identify problematic points that are both unusual in their predictor values (high leverage) and poorly predicted by the model (large residuals).
Cook’s Distance measures the influence of each observation on the fitted values. It quantifies how much all fitted values change when observation \(i\) is deleted from the analysis:
\[D_i = \frac{\sum_{j=1}^{n}(\hat{y}_j - \hat{y}_{j(i)})^2}{p \cdot MSE}\]
where \(\hat{y}_{j(i)}\) is the predicted value for observation \(j\) when observation \(i\) is excluded, \(p\) is the number of parameters, and \(MSE\) is the mean squared error.
Alternatively, Cook’s distance combines leverage and residual size:
\[D_i = \frac{(e_i^*)^2}{p} \times \frac{h_i}{1-h_i}\]
where \(e_i^*\) is the standardized residual and \(h_i\) is the leverage (hat value).
Interpretation:
- \(D_i > 0.5\): noteworthy; worth inspecting
- \(D_i > 1\): highly influential
- A sample-size-based cutoff, \(D_i > 4/n\), is also commonly used
Contours on the plot:
The dashed curves on the Residuals vs Leverage plot show Cook’s distance contours (typically at 0.5 and 1.0). Points that fall outside these contours have high influence.
What makes a point influential?
Two factors contribute to influence:
- Leverage (\(h_i\)): how unusual the observation’s predictor values are
- Residual size (\(e_i^*\)): how poorly the model predicts the observation
A point needs BOTH high leverage AND a large residual to be truly influential. High leverage alone (if well-predicted) or large residuals alone (if typical X values) are less concerning.
What to do about influential points:
- Check for data-entry errors first
- Refit the model with and without the point and compare the estimates
- Report results both ways rather than silently dropping observations
What to look for:
- Points outside the Cook’s distance contours (dashed curves), especially in the upper-right or lower-right corners of the plot
Key concepts:
- Leverage measures unusual X values, residuals measure unusual Y values, and influence requires both at once
While visual diagnostics should always be your primary tool, quantitative tests can provide additional confirmation and are useful when you need to report formal test results. However, be aware that these tests can be overly sensitive in large samples, detecting trivial violations that don’t meaningfully affect your results.
Purpose: Formal test of whether residuals follow a normal distribution.
How it works: Tests the null hypothesis \(H_0\): the data come from a normal distribution. The test statistic (W) measures how well the ordered residuals match the expected pattern from a normal distribution, with \(W\) ranging from 0 to 1 (higher values indicate better normality).
R code:
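A minimal call (assumes a fitted `lm` object `model`):

```r
# Shapiro-Wilk test on the model residuals
shapiro.test(residuals(model))
```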
Decision rule:
- If p < 0.05, reject \(H_0\): the residuals deviate significantly from normality.
- If p ≥ 0.05, there is no significant evidence against normality.
Important notes:
- The test is very sensitive in large samples, flagging trivial departures that rarely matter in practice.
- In R, shapiro.test() accepts at most 5000 observations; for larger samples, rely on QQ plots.
Kolmogorov-Smirnov Test:

```r
ks.test(residuals(model), "pnorm", mean = 0, sd = sd(residuals(model)))
```

Anderson-Darling Test (requires the nortest package):

```r
library(nortest)
ad.test(residuals(model))
```

Purpose: Formal test for heteroscedasticity (non-constant variance).
How it works: Regresses the squared residuals on the predictors to test whether variance is related to predictor values. Under \(H_0\) (homoscedasticity), squared residuals should not be systematically related to predictors.
The test statistic follows a \(\chi^2\) distribution with degrees of freedom equal to the number of predictors.
R code:
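A minimal call using the lmtest package (assumes a fitted `lm` object `model`):

```r
# Breusch-Pagan test for heteroscedasticity
library(lmtest)
bptest(model)  # studentized Breusch-Pagan by default
```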
Decision rule:
- If p < 0.05, reject \(H_0\): there is evidence of heteroscedasticity.
Solutions if violated:
- Use heteroscedasticity-robust (HC) standard errors
- Transform the outcome variable (e.g., log(Y))
- Use weighted least squares
Purpose: More general test for heteroscedasticity than Breusch-Pagan.
How it works: Similar to Breusch-Pagan but includes cross-products and squared terms of predictors when regressing squared residuals. Tests for any form of heteroscedasticity, not just linear relationships between variance and predictors.
R code:
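One way to run a White-style test is to supply an auxiliary regression with squares and cross-products to bptest. A sketch, where the data frame `d` and predictors `x1`, `x2` are placeholders for your own variables:

```r
# White-style heteroscedasticity test: squared residuals regressed on
# predictors, their squares, and cross-products (x1, x2, d are hypothetical)
library(lmtest)
bptest(model, ~ x1 * x2 + I(x1^2) + I(x2^2), data = d)
```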
When to use: When you suspect complex patterns of heteroscedasticity that Breusch-Pagan might miss, such as variance that changes non-linearly with predictors.
When to use: Primarily for time series data or any data with a natural ordering where adjacent observations might be correlated.
Purpose: Tests for autocorrelation in residuals (correlation between consecutive residuals).
How it works: Computes a test statistic based on the differences between consecutive residuals:
\[DW = \frac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{\sum_{t=1}^{n}e_t^2}\]
The DW statistic ranges from 0 to 4:
- DW ≈ 2: no first-order autocorrelation
- DW < 2: positive autocorrelation (adjacent residuals tend to share the same sign)
- DW > 2: negative autocorrelation (adjacent residuals tend to alternate sign)
R code:
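A minimal call from the lmtest package (assumes the rows of the data are in time order):

```r
# Durbin-Watson test for first-order autocorrelation in the residuals
library(lmtest)
dwtest(model)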
Interpretation:
- As a rough guide, values between about 1.5 and 2.5 are usually unproblematic; use the test’s p-value for a formal decision.
Solutions if violated:
- Use autocorrelation-robust (Newey-West) standard errors
- Add lagged variables or model the time structure explicitly (e.g., GLS, ARIMA)
Important note: Only tests for first-order autocorrelation. For higher-order autocorrelation, examine ACF (autocorrelation function) plots or use Ljung-Box test.
Purpose: Tests for functional form misspecification - whether the linear form is adequate or if you need transformations/polynomial terms.
How it works: Adds powers of fitted values (e.g., \(\hat{y}^2\), \(\hat{y}^3\), \(\hat{y}^4\)) to the original model and tests whether they are jointly significant. If they are, the linear form is inadequate because higher-order terms improve the fit.
The logic: if the relationship is truly linear, powers of fitted values shouldn’t add predictive power.
R code:
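A minimal call from the lmtest package (the defaults already test the second and third powers of the fitted values):

```r
# Ramsey RESET test: do powers of y-hat add explanatory power?
library(lmtest)
resettest(model, power = 2:3, type = "fitted")
```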
Decision rule:
- If p < 0.05, reject \(H_0\): the linear functional form is inadequate.
Solutions if violated:
- Add polynomial terms or interactions
- Transform variables (e.g., log, square root)
- Reconsider which predictors belong in the model
When to use: Multiple regression with 2 or more predictors.
Purpose: Detects multicollinearity (high correlation among predictors), which inflates standard errors and makes it difficult to isolate individual predictor effects.
How it works: For each predictor, VIF measures how much the variance of its coefficient estimate is inflated due to correlation with other predictors:
\[VIF_j = \frac{1}{1 - R_j^2}\]
where \(R_j^2\) is the R² from regressing predictor \(j\) on all other predictors.
Interpretation: VIF quantifies how much more variance the coefficient has compared to if predictors were uncorrelated. \(VIF = 1\) means no correlation; \(VIF = 5\) means variance is 5 times larger than if uncorrelated.
R code:
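A minimal call using the car package (assumes a fitted multiple-regression `lm` object `model`):

```r
# Variance inflation factors, one per predictor
library(car)
vif(model)
```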
Decision rules:
- VIF < 5: little cause for concern
- VIF 5-10: moderate multicollinearity; interpret individual coefficients cautiously
- VIF > 10: serious multicollinearity
Solutions:
- Remove or combine correlated predictors
- Center variables before creating polynomial or interaction terms
- Consider dimension reduction (e.g., PCA) or regularized regression
Important note: High VIF doesn’t bias coefficient estimates, but it inflates standard errors, making it harder to detect significant effects. Your model can still make good predictions with high VIF.
Purpose: Systematically identify influential observations using a quantitative threshold.
How it works: Cook’s distance measures how much all fitted values change when observation \(i\) is deleted:
\[D_i = \frac{(e_i^*)^2}{p} \times \frac{h_i}{1-h_i}\]
where \(e_i^*\) is the standardized residual, \(h_i\) is leverage, and \(p\) is the number of parameters.
R code:
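A minimal sketch (assumes a fitted `lm` object `model`; the `4/n` cutoff is the sample-size-based rule of thumb):

```r
# Cook's distance for every observation
d_cook <- cooks.distance(model)

# Flag observations exceeding the common 4/n cutoff
n <- length(d_cook)
which(d_cook > 4 / n)

# Quick visual check with the cutoff as a dashed reference line
plot(d_cook, type = "h")
abline(h = 4 / n, lty = 2)
```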
Decision rules:
- \(D_i > 4/n\): worth inspecting
- \(D_i > 0.5\): noteworthy
- \(D_i > 1\): highly influential
What to do with influential points:
- Check for data-entry errors first
- Refit the model with and without the point and compare the estimates
- Report results both ways rather than silently dropping observations
Alternative influence measures:
- DFFITS (influence on an observation’s own fitted value)
- DFBETAS (influence on individual coefficients)
- Leverage (hat values) on their own
Follow this systematic approach for every regression analysis:
Examine each plot carefully: Residuals vs Fitted, Normal QQ, Scale-Location, and Residuals vs Leverage.
If linearity is violated: try transformations or polynomial terms, and confirm with the RESET test.
If heteroscedasticity is detected: confirm with the Breusch-Pagan test; consider robust standard errors or transforming the outcome.
If influential points are found: compute Cook’s distance, investigate the observations, and report results with and without them.
If time series data: run the Durbin-Watson test and inspect the ACF of the residuals.
Use these only when initial diagnostics suggest specific problems:
Visual diagnostics come first: Plots are more informative than p-values, especially in large samples where tests become overly sensitive.
Statistical tests are supplements: Use quantitative tests to confirm what you see visually, not as a replacement for visual inspection.
Context matters: Not all violations are equally serious. Linearity and independence are typically most critical; moderate violations of normality are often acceptable with reasonable sample sizes.
Large samples are robust: With n > 100, minor violations of normality and homoscedasticity have minimal practical impact due to the Central Limit Theorem and robustness of OLS.
Never automatically delete outliers: Understand why they’re outliers first. They may contain important information or indicate model misspecification.
Report transparently: If you identify violations, acknowledge them and discuss their potential impact. If you exclude influential observations, report results both ways.
Multiple diagnostics: Don’t rely on a single test or plot. A comprehensive assessment examines multiple perspectives on each assumption.
Remember: Perfect adherence to all assumptions is rare in real data. The goal is to understand where and how assumptions are violated, assess the severity of violations, and make informed decisions about how to proceed with your analysis.