Multiple Regression
Multiple regression
- Basic interpretation
- Assumptions
- Checks
- Indicator variables
- Interaction terms
Basic multiple regression interpretation
House prices
When linear regression is not enough
\(R^2 = 0.278\) for sqft and sale_price
27.8% of the variation in sale price is accounted for
What about the other 72.2%?
We could include other lurking variables, such as the size of the lot a house sits on: more land, higher cost, right?
A regression with two or more predictor variables is called a multiple regression.
What is multiple regression?
For a simple regression with one independent variable, the least squares line makes the residuals as small as possible.
For multiple regression, the regression equation still makes the residuals as small as possible.
No longer fitting a line, though; instead, a multidimensional hyperplane!
The calculations are difficult by hand, so we rely on software.
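A minimal sketch of such a fit using Python's statsmodels, with hypothetical house-price data (values invented purely for illustration):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical house-price data; values invented for illustration
houses = pd.DataFrame({
    "sale_price": [450000, 610000, 320000, 780000, 525000, 690000],
    "sqft":       [1800,   2400,   1100,   3100,   2000,   2700],
    "grade":      [7,      8,      6,      9,      7,      8],
})

# Ordinary least squares chooses the hyperplane that minimizes
# the sum of squared residuals across all predictors at once
model = smf.ols("sale_price ~ sqft + grade", data=houses).fit()

print(model.params)    # intercept, b_sqft, b_grade
print(model.rsquared)  # R^2
```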
Check grade and sale_price
What do you think will happen to the coefficient on grade when we add sqft?
Adding both terms
The results
\(R^2=0.3051\)
\(s_e=696500\)
Coefficients:
- \(price = -872678 + 329.513\,sqft\_livingspace + 177278\,grade\)
How would you interpret this model and the diagnostic statistics?
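As a quick interpretation check, plug a hypothetical house into the fitted equation (say 2,000 sqft of living space at grade 8; values chosen only for illustration):

\[
price = -872678 + 329.513(2000) + 177278(8) = -872678 + 659026 + 1418224 = 1204572
\]

Each coefficient is the predicted change in price per unit change in its predictor, holding the other predictor fixed.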
Further investigation
What is different in multiple regression?
The meaning of the coefficients has changed in a subtle way.
Multiple regression is an extraordinarily versatile calculation, underlying many widely used statistical methods.
It offers a glimpse into statistical models that use more than two quantitative variables.
Models that use several variables can be a big step toward realistic and useful modeling of complex phenomena and relationships.
Multiple regression - coefficients
Can’t assume coefficients will stay the same
Coefficients change
Often in unexpected ways
Even changing signs
Be alert for a change in value
Be alert for a change in meaning
Multiple regression model
No simple relationship between \(y\) and \(x_j\), yet \(b_j\) in a multiple regression may be quite different from zero
Strong two-variable relationship between \(y\) and \(x_j\), yet \(b_j\) in a multiple regression may be almost zero
Strong two-variable relationship between \(y\) and \(x_j\), yet \(b_j\) can be opposite in sign in a multiple regression
Easy to extend the model with more predictors
Residuals \(e = y - \hat{y}\)
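With \(k\) predictors, the fitted model and its residuals take the general form:

\[
\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k, \qquad e = y - \hat{y}
\]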
Assumptions
Three key assumptions
Linearity assumption (straight enough condition)
Nearly normal residuals (unimodal, symmetric, no outliers)
Equal variance assumption (does the plot thicken?)
Linearity assumption
- Straight Enough Condition
We must check the scatterplot for each of the predictor variables vs. the response variable
The scatterplots do not need to show any discernible slope, but they should be reasonably straight
They cannot have bends or other nonlinearity
It can be easier to look at the plot of residuals
Check the residuals
Errors have a distribution that is:
- Unimodal
- Symmetric
- Without outliers
Look at histogram of residuals
This assumption becomes less important as the sample size increases
Equal variance assumption
Same variability of the errors for all values of each predictor
Does the Plot Thicken? Condition: the spread around the line must be nearly constant.
Be alert for “fan” shaped pattern
Or other tendency for variability to grow or shrink in one part of the scatterplot
Decision loop
Straight Enough Condition: scatterplots of y-variable against each x-variable
- If straight enough, fit multiple regression model
How were the data collected? Randomly? Do they represent an identifiable population? Collected over time? Check independence.
Find the residuals and predicted values.
Scatterplot of the residuals against predicted values: patternless, no bends, no thickening
Histogram of residuals: unimodal, symmetric, without outliers
If conditions check out, interpret regression model, and make predictions.
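A minimal Python sketch of the residual checks above, continuing the hypothetical `model` fit from the earlier sketch:

```python
import matplotlib.pyplot as plt

# Continue with the fitted `model` from the earlier statsmodels sketch
fitted = model.fittedvalues   # predicted values
resid = model.resid           # residuals e = y - y_hat

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs. predicted: want a patternless cloud with
# no bends and no thickening ("fan" shapes)
ax1.scatter(fitted, resid)
ax1.axhline(0, linestyle="--")
ax1.set_xlabel("predicted values")
ax1.set_ylabel("residuals")

# Histogram of residuals: want unimodal, symmetric, no outliers
ax2.hist(resid, bins=20)
ax2.set_xlabel("residuals")

plt.show()
```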
Partial residual plots
One of the best ways to check the linearity condition is with a partial residual plot. This plot displays the relationship between a predictor variable and the response variable after removing the effects of the other predictor variables.
How to check variables individually
We checked the overall equation for weirdness in the residuals
What about each individual variable's contribution to the regression?
Partial residual plots to the rescue!
Look at the plot to judge whether its form is straight enough.
Partial residual plots
Meaning of a partial residual plot
The least squares line fit to the plot has slope equal to the coefficient the plot illustrates.
The residuals are the same as the final residuals of the multiple regression
- Judge the strength of the estimate of the plot's coefficient
Outliers appear the same as they would in a simple scatterplot
The direction corresponds to the sign of the multiple regression coefficient
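A partial residual plot is easy to build by hand: add a predictor's own contribution back to the final residuals and plot against that predictor. A sketch continuing the hypothetical fit from earlier:

```python
import matplotlib.pyplot as plt

# Partial residuals for sqft: the final multiple regression residuals
# plus sqft's own contribution to the fit (continuing the earlier sketch)
partial = model.resid + model.params["sqft"] * houses["sqft"]

plt.scatter(houses["sqft"], partial)
plt.xlabel("sqft")
plt.ylabel("partial residual for sqft")
plt.show()
```

A least squares line through this plot has slope equal to the model's coefficient on sqft; statsmodels also offers these as built-in component-plus-residual (CCPR) plots.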
Indicator variables
Wages
- Indicator variables are for when we want to include categorical variables in our regression
- In a union vs. not in a union
- Often coded as 1 = true, 0 = false, but that's just convention; the coding doesn't really matter (remember, units don't matter for regression)
- Regression equation
- \(wages = b_0 + b_1\,exp + b_2\,union\)
Wages
Slopes of lines
Predict some values
- Equation: \(wages = 747.5634 + 8.2430\,exp - 77.7134\,union\)
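For example, at a hypothetical 10 years of experience (value chosen only for illustration), the indicator shifts the line down by the same amount everywhere:

\[
\begin{aligned}
union = 0:&\quad wages = 747.5634 + 8.2430(10) = 829.99 \\
union = 1:&\quad wages = 747.5634 + 8.2430(10) - 77.7134 = 752.28
\end{aligned}
\]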
Interaction terms
Interaction effects
What if lines are not roughly parallel?
Indicator variable that is 0 or 1 shifts line up or down.
- Can’t change slope
- Works only when the groups have the same slope, just different intercepts
Adjusting for different slopes
Introduce another constructed variable
This one is the product of an indicator variable and the predictor variable
The coefficient of this constructed interaction term gives the adjustment to the slope, \(b_1\), for the individuals in the indicated group.
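In Python's statsmodels formula interface, this constructed product term can be written directly; a minimal sketch with hypothetical wage data (values invented for illustration):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical wage data with a 0/1 union indicator; values invented
wages = pd.DataFrame({
    "wages": [820, 760, 905, 735, 880, 790, 950, 770],
    "exp":   [8,   5,   15,  3,   12,  7,   20,  6],
    "union": [0,   1,   0,   1,   0,   1,   0,   1],
})

# In the formula interface, "exp * union" expands to
# exp + union + exp:union, where exp:union is the product term
model = smf.ols("wages ~ exp * union", data=wages).fit()
print(model.params)  # exp:union adjusts the slope for union members
```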
Adjusting for different slopes
Different slopes for wages
Predict some values
- Equation: \(wages = 710.7896 + 10.1421\,exp + 28.9884\,union - 5.2755\,union \times exp\)
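Collecting terms shows how the interaction adjusts both the intercept and the slope for the union group:

\[
\begin{aligned}
union = 0:&\quad wages = 710.7896 + 10.1421\,exp \\
union = 1:&\quad wages = (710.7896 + 28.9884) + (10.1421 - 5.2755)\,exp = 739.7780 + 4.8666\,exp
\end{aligned}
\]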
Footnotes
Credit to: https://crosscut.com/opinion/2020/11/washington-state-housing-question-and-answer