Multiple Regression

Author

Professor MacDonald

Published

April 7, 2025

Multiple regression

  • Basic interpretation
  • Assumptions
  • Checks
  • Indicator variables
  • Interaction terms

Seattle housing¹

Basic multiple regression interpretation

House prices

When linear regression is not enough

  • \(R^2 = 0.278\) for sqft and sale_price (fit sketched below)

  • 27.8% of the variation in Price is accounted for

  • What about the other 72.2%?

  • Could include other lurking variables, such as the size of the lot the house sits on: more land, higher cost, right?

  • A regression with two or more predictor variables is called a multiple regression.
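
A minimal sketch of the simple regression in Python with statsmodels; the file name seattle.csv and the column names sqft_living and sale_price are assumptions, not taken from these slides.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("seattle.csv")  # hypothetical file and column names
fit = smf.ols("sale_price ~ sqft_living", data=df).fit()
print(fit.rsquared)   # about 0.278 for the data described in the slides
print(fit.summary())  # full regression table
```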

What is multiple regression?

  • For a simple regression, with one independent variable, the least squares line makes residuals as small as possible.

  • For multiple regression, the regression equation still makes the residuals as small as possible.

  • No longer trying to create a line, though; instead, a multidimensional hyperplane!

  • The calculations are difficult by hand, so software does the work.

Check grade and sale_price

What do you think will happen to the coefficient on grade when we add sqft?

Adding both terms

The results

  • \(R^2=0.3051\)

  • \(s_e=696500\)

  • Coefficient:

    • \(price = -872678 + 329.513\,sqft\_livingspace + 177278\,grade\)

How would you interpret this model and the diagnostic statistics?
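
A sketch of the two-predictor fit, under the same assumed file and column names as before:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("seattle.csv")  # hypothetical file and column names
fit = smf.ols("sale_price ~ sqft_living + grade", data=df).fit()
print(fit.params)            # intercept and the two slopes
print(fit.rsquared)          # about 0.3051 in the slides
print(fit.mse_resid ** 0.5)  # residual standard error, s_e
```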

Further investigation

What is different in multiple regression?

  • Meaning of coefficients has changed in a subtle way.

  • Is an extraordinarily versatile calculation, underlying many widely used statistical methods.

  • Offers a glimpse into statistical models that use more than two quantitative variables.

  • Models that use several variables can be a big step toward realistic and useful modeling of complex phenomena and relationships.

Multiple regression - coefficients

  • Can’t assume coefficients will stay the same

  • Coefficients change

  • Often in unexpected ways

  • Even changing signs

  • Be alert for a change in value

  • Be alert for a change in meaning

Multiple regression model

  • There may be no simple relationship between \(y\) and \(x_j\), yet \(b_j\) in a multiple regression may be quite different from zero

  • There may be a strong two-variable relationship between \(y\) and \(x_j\), yet \(b_j\) in a multiple regression may be almost zero

  • There may be a strong two-variable relationship between \(y\) and \(x_j\), yet \(b_j\) can be opposite in sign in a multiple regression (see the simulation sketch after this list)

  • Easy to extend the model with more predictors

  • Residuals \(e = y - \hat{y}\)
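
The sign flip is easy to demonstrate. Here is a small simulation sketch (all numbers invented) in which \(y\) has a strong positive two-variable relationship with \(x_2\), yet the multiple regression coefficient on \(x_2\) is negative:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)        # x2 strongly correlated with x1
y = 2 * x1 - 1 * x2 + 0.5 * rng.normal(size=n)

# Simple regression of y on x2 alone: the slope is positive
simple = sm.OLS(y, sm.add_constant(x2)).fit()
print(simple.params[1])                   # roughly +0.8

# Multiple regression of y on x1 and x2: the x2 slope is negative
X = sm.add_constant(np.column_stack([x1, x2]))
multiple = sm.OLS(y, X).fit()
print(multiple.params[2])                 # roughly -1
```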

Assumptions

Three key assumptions

  • Linearity assumption (straight enough condition)

  • No pattern in the residuals (no outliers; straight enough condition)

  • Equal variance assumption (does the plot thicken?)

Linearity assumption

  • Straight Enough Condition
    • We must check the scatterplot of each predictor variable vs. the response variable (a sketch follows this list)

    • The scatterplots do not need to show any discernible slope, but they should be reasonably straight

    • They cannot have bends or other nonlinearity

    • It can be easier to look at the plot of residuals
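
A sketch of the scatterplot check, again under the assumed seattle.csv file and column names:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("seattle.csv")  # hypothetical file and column names
# Pairwise scatterplots; scan the sale_price row for bends
# or other nonlinearity against each predictor
pd.plotting.scatter_matrix(df[["sale_price", "sqft_living", "grade"]])
plt.show()
```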

Check the residuals

  • Errors have a distribution that is:

    • Unimodal
    • Symmetric
    • Without outliers
  • Look at histogram of residuals

  • This assumption becomes less important as the sample size increases

Equal variance assumption

  • Same variability of the errors for all values of each predictor

  • Does the Plot Thicken? Condition: the spread around the line must be nearly constant.

  • Be alert for “fan” shaped pattern

  • Or other tendency for variability to grow or shrink in one part of the scatterplot

Decision loop

  • Straight Enough Condition: scatterplots of y-variable against each x-variable

    • If straight enough, fit multiple regression model
  • How were the data collected? Random sample? Does it represent an identifiable population? Collected over time? Check independence.

  • Find the residuals and predicted values.

  • Scatterplot of the residuals against predicted values: patternless, no bends, no thickening (sketched after this list)

  • Histogram of residuals: unimodal, symmetric, without outliers

  • If conditions check out, interpret regression model, and make predictions.
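
The residual checks in the loop can be sketched as follows, using the same assumed data set:

```python
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("seattle.csv")  # hypothetical file and column names
fit = smf.ols("sale_price ~ sqft_living + grade", data=df).fit()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
# Residuals vs. predicted values: want a patternless cloud,
# no bends, no thickening
ax1.scatter(fit.fittedvalues, fit.resid, alpha=0.3)
ax1.axhline(0, color="gray")
ax1.set_xlabel("predicted values")
ax1.set_ylabel("residuals")
# Histogram of residuals: want unimodal, symmetric, no outliers
ax2.hist(fit.resid, bins=30)
ax2.set_xlabel("residuals")
plt.show()
```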

Partial residual plots

One of the best ways to check the linearity condition is with a partial residual plot. This plot displays the relationship between a predictor variable and the response variable after removing the effects of all the other predictor variables in the model.

How to check variables individually

  • Checked overall equation for weirdness in residuals

  • What about each individual variable’s contribution to the regression?

  • Partial residual plot to the Rescue!

  • Look at plot to judge whether its form is straight enough.

Partial residual plots

Meaning of a partial residual plot

  • The least squares line fit to the plot has slope equal to the coefficient the plot illustrates.

  • The residuals are the same as the final residuals of the multiple regression

    • Judge the strength of the estimate of the plot's coefficient
  • Outliers appear just as they would in a simple scatterplot

  • The direction corresponds to the sign of the multiple regression coefficient (a sketch follows this list)
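
statsmodels ships this plot as a CCPR (component-plus-residual) plot, which graphs \(b_j x_j + e\) against \(x_j\). A sketch under the same assumed data set:

```python
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("seattle.csv")  # hypothetical file and column names
fit = smf.ols("sale_price ~ sqft_living + grade", data=df).fit()

# Partial residual (CCPR) plot for one predictor at a time
sm.graphics.plot_ccpr(fit, "sqft_living")
plt.show()
```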

Indicator variables

Wages

  • Indicator variables let us include categorical variables in our regression
    • In a union vs. not in a union
    • Often coded as 1 = true, 0 = false, but that's just convention; the particular coding doesn't really matter (remember, units don't matter for regression)
  • Regression equation (a sketch of the fit follows this list)
    • \(wages = b_0 + b_1\,exp + b_2\,union\)

Wages

Slopes of lines

Predict some values

  • Equation: \(wages = 747.5634 + 8.2430\,exp - 77.7134\,union\)
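
Plugging in the two values of the indicator gives two parallel lines:

  • Non-union (\(union = 0\)): \(wages = 747.5634 + 8.2430\,exp\)

  • Union (\(union = 1\)): \(wages = (747.5634 - 77.7134) + 8.2430\,exp = 669.8500 + 8.2430\,exp\)

Same slope for both groups; the union line is shifted down by 77.7134.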

Interaction terms

Interaction effects

  • What if lines are not roughly parallel?

  • An indicator variable that is 0 or 1 shifts the line up or down.

    • It can't change the slope
    • Works only when the groups share the same slope and differ only in intercept

Adjusting for different slopes

  • Introduce another constructed variable

  • This one is the product of the indicator variable and the predictor variable

  • The coefficient of this constructed interaction term gives the adjustment to the slope, \(b_1\), to be made for individuals in the indicated group (a sketch follows this list)
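
A sketch using the statsmodels formula interface, with the same assumed wages.csv:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("wages.csv")  # hypothetical file and column names
# In a statsmodels formula, exp * union expands to
# exp + union + exp:union (main effects plus the interaction term),
# where exp:union is literally the product of the two columns
fit = smf.ols("wages ~ exp * union", data=df).fit()
print(fit.params)  # the exp:union row is the slope adjustment
```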

Adjusting for different slopes

Different slopes for wages

Predict some values

  • Equation: \(wages = 710.7896 + 10.1421\,exp + 28.9884\,union - 5.2755\,union \times exp\)
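
Plugging in the two values of the indicator now gives two lines with different slopes:

  • Non-union (\(union = 0\)): \(wages = 710.7896 + 10.1421\,exp\)

  • Union (\(union = 1\)): \(wages = (710.7896 + 28.9884) + (10.1421 - 5.2755)\,exp = 739.7780 + 4.8666\,exp\)

Union members start higher but their wages grow more slowly with experience.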

Footnotes

  1. Credit to: https://crosscut.com/opinion/2020/11/washington-state-housing-question-and-answer