Multiple Regression
Multiple regression
- Basic interpretation
- Assumptions
- Checks
- Indicator variables
- Interaction terms
Basic multiple regression interpretation
House prices
When linear regression is not enough
\(R^2 = 0.278\) for sqft and sale_price
27.8% of the variation in sale price is accounted for
What about the other 72.2%?
We could include other lurking variables, such as the size of the lot a house sits on: more land, higher cost, right?
A regression with two or more predictor variables is called a multiple regression.
What is multiple regression?
For a simple regression with one independent variable, the least squares line makes the residuals as small as possible.
For multiple regression, the regression equation still makes the residuals as small as possible.
No longer fitting a line, though; instead, a multidimensional hyperplane!
The calculations are difficult by hand, so we rely on software.
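A minimal sketch of such a fit using Python's statsmodels, with hypothetical house-price data (values invented purely for illustration):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical house-price data; values invented for illustration
houses = pd.DataFrame({
    "sale_price": [450000, 610000, 320000, 780000, 525000, 690000],
    "sqft":       [1800,   2400,   1100,   3100,   2000,   2700],
    "grade":      [7,      8,      6,      9,      7,      8],
})

# Ordinary least squares chooses the hyperplane that minimizes
# the sum of squared residuals across all predictors at once
model = smf.ols("sale_price ~ sqft + grade", data=houses).fit()

print(model.params)    # intercept, b_sqft, b_grade
print(model.rsquared)  # R^2
```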
Check grade and sale_price
What do you think will happen to the coefficient on grade when we add sqft?
Adding both terms
The results
\(R^2=0.3051\)
\(s_e=696500\)
Coefficients:
- \(price = -872678 + 329.513\,sqft\_livingspace + 177278\,grade\)
How would you interpret this model and the diagnostic statistics?
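As a quick interpretation check, plug a hypothetical house into the fitted equation (say 2,000 sqft of living space at grade 8; values chosen only for illustration):

\[
price = -872678 + 329.513(2000) + 177278(8) = -872678 + 659026 + 1418224 = 1204572
\]

Each coefficient is the predicted change in price per unit change in its predictor, holding the other predictor fixed.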
Further investigation
What is different in multiple regression?
The meaning of the coefficients has changed in a subtle way.
Multiple regression is an extraordinarily versatile calculation, underlying many widely used statistical methods.
It offers a glimpse into statistical models that use more than two quantitative variables.
Models that use several variables can be a big step toward realistic and useful modeling of complex phenomena and relationships.
Multiple regression - coefficients
Can’t assume coefficients will stay the same
Coefficients change
Often in unexpected ways
Even changing signs
Be alert for a change in value
Be alert for a change in meaning
Multiple regression model
No simple relationship between \(y\) and \(x_j\), yet \(b_j\) in a multiple regression may be quite different from zero
Strong two-variable relationship between \(y\) and \(x_j\), yet \(b_j\) in a multiple regression may be almost zero
Strong two-variable relationship between \(y\) and \(x_j\), yet \(b_j\) can be opposite in sign in a multiple regression
Easy to extend the model with more predictors
Residuals \(e = y - \hat{y}\)
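With \(k\) predictors, the fitted model and its residuals take the general form:

\[
\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k, \qquad e = y - \hat{y}
\]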
Assumptions
Three key assumptions
Linearity assumption (straight enough condition)
Nearly normal residuals (unimodal, symmetric, no outliers)
Equal variance assumption (does the plot thicken?)
Linearity assumption
- Straight Enough Condition
We must check the scatterplot for each of the predictor variables vs. the response variable
The scatterplots do not need to show any discernible slope, but they should be reasonably straight
They cannot have bends or other nonlinearity
It can be easier to look at the plot of residuals
Check the residuals
Errors have a distribution that is:
- Unimodal
- Symmetric
- Without outliers
Look at histogram of residuals
This assumption becomes less important as the sample size increases
Equal variance assumption
Same variability of the errors for all values of each predictor
Does the Plot Thicken? Condition: the spread around the line must be nearly constant.
Be alert for “fan” shaped pattern
Or other tendency for variability to grow or shrink in one part of the scatterplot
Decision loop
Straight Enough Condition: scatterplots of y-variable against each x-variable
- If straight enough, fit multiple regression model
How were the data collected? Randomly? Do they represent an identifiable population? Collected over time? Check independence.
Find the residuals and predicted values.
Scatterplot of the residuals against predicted values: patternless, no bends, no thickening
Histogram of residuals: unimodal, symmetric, without outliers
If conditions check out, interpret regression model, and make predictions.
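A minimal Python sketch of the residual checks above, continuing the hypothetical `model` fit from the earlier sketch:

```python
import matplotlib.pyplot as plt

# Continue with the fitted `model` from the earlier statsmodels sketch
fitted = model.fittedvalues   # predicted values
resid = model.resid           # residuals e = y - y_hat

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs. predicted: want a patternless cloud with
# no bends and no thickening ("fan" shapes)
ax1.scatter(fitted, resid)
ax1.axhline(0, linestyle="--")
ax1.set_xlabel("predicted values")
ax1.set_ylabel("residuals")

# Histogram of residuals: want unimodal, symmetric, no outliers
ax2.hist(resid, bins=20)
ax2.set_xlabel("residuals")

plt.show()
```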
Partial residual plots
One of the best ways to check the linearity condition is with a partial residual plot. This plot displays the relationship between a predictor variable and the response variable after removing the effects of the other predictor variables.
How to check variables individually
We checked the overall equation for weirdness in the residuals
What about each individual variable's contribution to the regression?
Partial residual plots to the rescue!
Look at the plot to judge whether its form is straight enough.
Partial residual plots
Meaning of a partial residual plot
The least squares line fit to the plot has slope equal to the coefficient the plot illustrates.
The residuals are the same as the final residuals of the multiple regression
- Judge the strength of the estimate of the plot's coefficient
Outliers appear the same as they would in a simple scatterplot
The direction corresponds to the sign of the multiple regression coefficient
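A partial residual plot is easy to build by hand: add a predictor's own contribution back to the final residuals and plot against that predictor. A sketch continuing the hypothetical fit from earlier:

```python
import matplotlib.pyplot as plt

# Partial residuals for sqft: the final multiple regression residuals
# plus sqft's own contribution to the fit (continuing the earlier sketch)
partial = model.resid + model.params["sqft"] * houses["sqft"]

plt.scatter(houses["sqft"], partial)
plt.xlabel("sqft")
plt.ylabel("partial residual for sqft")
plt.show()
```

A least squares line through this plot has slope equal to the model's coefficient on sqft; statsmodels also offers these as built-in component-plus-residual (CCPR) plots.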
Indicator variables
Wages
- Indicator variables are for when we want to include categorical variables in our regression
- In a union vs. not in a union
- Often coded as 1 = true, 0 = false, but that's just convention; the coding doesn't really matter (remember, units don't matter for regression)
- Regression equation
- \(wages = b_0 + b_1\,exp + b_2\,union\)
Wages
Slopes of lines
Predict some values
- Equation: \(wages = 747.5634 + 8.2430\,exp - 77.7134\,union\)
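For example, at a hypothetical 10 years of experience (value chosen only for illustration), the indicator shifts the line down by the same amount everywhere:

\[
\begin{aligned}
union = 0:&\quad wages = 747.5634 + 8.2430(10) = 829.99 \\
union = 1:&\quad wages = 747.5634 + 8.2430(10) - 77.7134 = 752.28
\end{aligned}
\]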
Interaction terms
Interaction effects
What if lines are not roughly parallel?
Indicator variable that is 0 or 1 shifts line up or down.
- Can’t change slope
- Works only when the groups have the same slope, just different intercepts
Adjusting for different slopes
Introduce another constructed variable
This one is the product of an indicator variable and the predictor variable
The coefficient of this constructed interaction term gives the adjustment to the slope, \(b_1\), for the individuals in the indicated group.
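In Python's statsmodels formula interface, this constructed product term can be written directly; a minimal sketch with hypothetical wage data (values invented for illustration):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical wage data with a 0/1 union indicator; values invented
wages = pd.DataFrame({
    "wages": [820, 760, 905, 735, 880, 790, 950, 770],
    "exp":   [8,   5,   15,  3,   12,  7,   20,  6],
    "union": [0,   1,   0,   1,   0,   1,   0,   1],
})

# In the formula interface, "exp * union" expands to
# exp + union + exp:union, where exp:union is the product term
model = smf.ols("wages ~ exp * union", data=wages).fit()
print(model.params)  # exp:union adjusts the slope for union members
```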
Adjusting for different slopes
Different slopes for wages
Predict some values
- Equation: \(wages = 710.7896 + 10.1421\,exp + 28.9884\,union - 5.2755\,union \times exp\)
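Collecting terms shows how the interaction adjusts both the intercept and the slope for the union group:

\[
\begin{aligned}
union = 0:&\quad wages = 710.7896 + 10.1421\,exp \\
union = 1:&\quad wages = (710.7896 + 28.9884) + (10.1421 - 5.2755)\,exp = 739.7780 + 4.8666\,exp
\end{aligned}
\]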
Footnotes
Credit to: https://crosscut.com/opinion/2020/11/washington-state-housing-question-and-answer