Regression Wisdom

Author

Professor MacDonald

Published

April 2, 2025

Regression issues

  • Model fit measures
  • Extrapolation
  • Outliers
    • Leverage
    • Influence
  • Interpreting a regression

Model fit measures

Regression results

We already know how to interpret many parts of these results. For now, we will focus on the model fit measures listed below; a short sketch of how they are produced follows the list.

  • \(R^2\)
  • \(s_e\)
  • Five-number summary of the residuals
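A minimal sketch of where these measures show up in R output, assuming the Auto data from the ISLR package (the coefficients quoted later in these slides appear to match it); any response/predictor pair would work the same way.

```r
# Fit a simple regression and read the fit measures off the summary
library(ISLR)

fit <- lm(mpg ~ weight, data = Auto)   # assumed example model
summary(fit)   # residual five-number summary, residual standard error (s_e), R-squared
```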

Residuals - histogram

How would you interpret this residual histogram?
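One way to draw such a histogram, reusing the hypothetical `fit` object from the sketch above:

```r
# Histogram of the residuals from the assumed mpg ~ weight fit
hist(resid(fit),
     breaks = 20,
     main = "Histogram of residuals",
     xlab = "Residual (observed - predicted mpg)")
```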

Correlation review

  • Recall that a correlation, \(r\), ranges from -1 to 1 and indicates the strength of the linear association between two variables
    • If \(r\) is exactly -1 or 1, there is no variation around the line: the points fall exactly on a straight line
    • Note that \(r\) does not indicate the slope
    • If \(r\) is 0, there is no linear relationship
      • What is the slope in that case? (See the sketch below)
      • \(\hat{y} = b_0 + b_1*x\), where \(b_1 = r*(s_y/s_x)\), so \(r = 0\) means the slope is 0 and \(\hat{y} = \bar{y}\)
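A small simulated check of that answer. The variable names are illustrative; any two unrelated variables behave the same way.

```r
# Simulate two unrelated variables and compare the slope to r * (s_y / s_x)
set.seed(1)
x <- rnorm(100)
y <- rnorm(100)           # generated independently of x, so r is near 0

r  <- cor(x, y)
b1 <- r * sd(y) / sd(x)   # slope implied by the correlation
c(from_r = b1, from_lm = unname(coef(lm(y ~ x))[2]))   # identical, and both near 0
```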

R squared definition

  • \(R^2\) is the square of \(r\) in the two-variable case, so it is between 0 and 1

  • But, unlike \(r\), \(R^2\) is meaningful in multivariate models

  • Percent of the total variation in the response explained by the model

  • One minus the sum of squared errors from our model divided by the sum of squared errors from the ‘braindead’ model of \(\hat{y}=\bar{y}\) (see the sketch below)

  • If \(R^2\) is small, our model doesn’t beat the ‘braindead’ model by very much
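A sketch of that calculation, again using the hypothetical `fit` from the first sketch:

```r
# R-squared by hand: 1 - SSE / SST, compared with what lm() reports
sse <- sum(resid(fit)^2)                    # squared errors from our model
sst <- sum((Auto$mpg - mean(Auto$mpg))^2)   # squared errors from y-hat = y-bar
c(by_hand = 1 - sse / sst, from_lm = summary(fit)$r.squared)
```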

When is \(R^2\) “big enough”?

  • \(R^2\) is useful, but only up to a point

  • The closer \(R^2\) is to 1, the more useful the model

    • How close is “close”?
    • Depends on the situation
    • \(R^2\) of 0.5 might be very bad for a model that uses height to predict weight
      • Should be more closely related
    • \(R^2\) of 0.5 might be very good for a model using test scores to predict future income
      • Response variable has a lot of factors that shape it and a lot of noise
  • Good practice: always report \(R^2\) and \(s_e\) and let readers evaluate the results for themselves

Extrapolating

  • The farther a new value is from the range of \(x\), the less trust we should place in the predicted value of \(y\)

  • Venturing into new \(x\) territory is called extrapolation

  • Dubious: it rests on the questionable assumption that nothing about the relationship between \(x\) and \(y\) changes for extreme values of \(x\)

Predicting MPG of cars

1970s data on automobiles

Predicting the Maybach

[Images: the Maybach, its interior, and its visit to North Korea]

Will our model do a good job predicting this car’s miles per gallon?

Can we predict this car’s MPG using our model?

Weight: 6581 pounds

Model: \(\hat{y} = b_0 + b_1*x\)
\(\hat{y} = 46.3 + -0.00767*6581\)
\(\hat{y} = -4.17\) miles per gallon

  • Nonsense prediction

Be wary of out-of-sample predictions
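The same warning, sketched with predict(). The coefficients quoted above match lm(mpg ~ weight) on the ISLR Auto data, so that fit is assumed here; the Maybach's weight sits far outside the observed range.

```r
# Extrapolating far beyond the observed weights gives a nonsense prediction
range(Auto$weight)                                  # roughly 1,600 to 5,100 pounds
predict(fit, newdata = data.frame(weight = 6581))   # about -4 mpg
```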

Outliers

Height and net worth

First, we create random data for two variables, height and log net worth. In the sketch below they are generated independently, so they have no true relationship to each other.
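A sketch of that simulation; the means and spreads are illustrative assumptions, not the exact values used in class.

```r
# Height and log net worth generated independently: no true relationship
set.seed(42)
n            <- 100
height       <- rnorm(n, mean = 66, sd = 4)   # inches
log_networth <- rnorm(n, mean = 11, sd = 1)   # unrelated to height by construction

sim_fit <- lm(log_networth ~ height)
summary(sim_fit)   # slope and R-squared both near 0
```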

Adding Steve Ballmer

[Image: Steve Ballmer]

What will happen to the slope and \(R^2\)?

[Plot: the data with Steve Ballmer added]

Adding Warren Buffett

[Image: Warren Buffett]

Adding Aleksandar Vučić

[Image: Aleksandar Vučić]
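A sketch of what adding one extreme person does to the simulated fit above. The added height and net worth are hypothetical stand-ins for someone like Ballmer or Buffett, not their real values.

```r
# Append one very tall, very wealthy observation and refit
height2       <- c(height, 80)                # an unusually tall person
log_networth2 <- c(log_networth, log(1e11))   # about $100 billion in net worth

sim_fit2 <- lm(log_networth2 ~ height2)
coef(sim_fit2)                # slope pulled away from the original near-zero value
summary(sim_fit2)$r.squared   # R-squared moves as well
```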

Leverage vs. influence

  • A data point whose \(x\) value is far from the mean of the rest of the \(x\) values is said to have high leverage.

  • Leverage points have the potential to strongly pull on the regression line.

  • A point is influential if omitting it from the analysis changes the model enough to make a meaningful difference.

  • Influence is determined by

    • The residual
    • The leverage
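R has built-in diagnostics for both ideas; a sketch using the simulated fit with the added extreme observation:

```r
# hatvalues() measures leverage; cooks.distance() combines residual and leverage
lev  <- hatvalues(sim_fit2)
infl <- cooks.distance(sim_fit2)

tail(lev, 1)    # leverage of the added point; compare with the typical value of ~2/n
tail(infl, 1)   # its Cook's distance; large values flag influential points
```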

Warnings

  • Influential points can hide in plots of residuals.

  • Points with high leverage pull the line close to them, so they tend to have small residuals.

  • Look for such points in a scatterplot of the original data.

  • Fit the regression model with and without the points and compare (see the sketch below).
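A sketch of that with-and-without comparison, continuing the simulated example, where sim_fit excludes the extreme point and sim_fit2 includes it:

```r
# Compare coefficients with and without the suspect point
rbind(with_point    = coef(sim_fit2),
      without_point = coef(sim_fit))   # a large change in slope signals influence
```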

Interpreting a regression

Step 1: develop some expectations

Horsepower vs. MPG

  • More powerful engines probably are less fuel efficient

  • Relationship is likely roughly linear

  • The exact relationship depends on the efficiency of the engine

    • Could be noisy

Step 2: make a picture
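A sketch of the picture, assuming the ISLR Auto data (whose horsepower quartiles match the values used later in these slides):

```r
# Scatterplot of the response against the predictor before fitting anything
plot(mpg ~ horsepower, data = Auto,
     xlab = "Horsepower",
     ylab = "Miles per gallon",
     main = "Horsepower vs. MPG")
```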

Step 3: check the conditions

  • Quantitative variable condition

  • Straight enough condition

  • Outlier condition

  • “Does the plot thicken?” condition

  • Conclusion?
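A sketch for checking the last two conditions: fit the line, then look at residuals against fitted values for straightness and constant spread. Again assumes the Auto data.

```r
# Residuals vs. fitted values: look for curvature and for "thickening" spread
hp_fit <- lm(mpg ~ horsepower, data = Auto)
plot(fitted(hp_fit), resid(hp_fit),
     xlab = "Fitted miles per gallon", ylab = "Residual")
abline(h = 0, lty = 2)
```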

Step 4: identify the units

  • Miles per gallon: amount of miles you can travel on one gallon of gas, a measure of efficiency.
    • Most gasoline-powered cars get between 10 and 40 MPG, with higher being better
  • Horsepower: power of the engine.
    • Typical values for standard cars are in the 100-200 range, higher meaning more powerful

Step 5: interpret the slope of the regression line

  • For every one-unit increase in horsepower, predicted miles per gallon decreases by about 0.16 units
    • Is that a lot or a little?

Step 6: determine reasonable values for the predictor variable
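A sketch for picking those values, assuming the Auto data; its quartiles match the 75 / 93.5 / 126 used below.

```r
# The quartiles of the predictor are reasonable values to plug in
summary(Auto$horsepower)
```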

Step 7: interpret the intercept

Step 8: solve for reasonable predictor values

Horsepower = Q1 = 75:

\(\hat{y} = b_0 + b_1*x\)
\(\hat{y} = 39.94 + -0.158*75\)
\(\hat{y} = 28.09\)

Horsepower = Median = 93.5:

\(\hat{y} = b_0 + b_1*x\)
\(\hat{y} = 39.94 + -0.158*93.5\)
\(\hat{y} = 25.17\)

Horsepower = Q3 = 126:

\(\hat{y} = b_0 + b_1*x\)
\(\hat{y} = 39.94 + -0.158*126\)
\(\hat{y} = 20.03\)
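The same three predictions via predict(), using the hp_fit model sketched earlier on the assumed Auto data:

```r
# Predicted mpg at the first quartile, median, and third quartile of horsepower
predict(hp_fit, newdata = data.frame(horsepower = c(75, 93.5, 126)))
# roughly 28.1, 25.2, and 20.0 mpg
```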

Step 9: interpret the residuals and identify their units

Step 10: view the distribution of the residuals

Step 11: interpret the residual standard error

Step 12: interpret the R squared
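A sketch covering Steps 10 through 12 on the assumed Auto fit:

```r
# Residual distribution, residual standard error, and R-squared
hist(resid(hp_fit), breaks = 20,
     xlab = "Residual (mpg)", main = "Residuals from mpg ~ horsepower")
summary(hp_fit)$sigma       # residual standard error, in miles per gallon
summary(hp_fit)$r.squared   # share of the variation in mpg explained by horsepower
```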

Step 13: think about confounders

  • What are some confounders, or “lurking variables”?
    • Categorical
    • Quantitative