Simple Linear Regression

Author

Professor MacDonald

Published

March 31, 2025

Linear model

Line of best fit

HDI definition

How to describe the relationship between GDP per capita and HDI score?

As we learned:

  • Direction
  • Form
  • Strength
  • Outliers

What do you expect the relationship between GDP per capita and HDI will be?

Scatterplot of GDP per capita and HDI

Smoother

Taking a guess

What do you think the intercept and the slope should be for a line of ‘good’ fit?

Least squares line

Slope: 0.000006057761, intercept: 0.589694

Linear model

It’s better if we come up with a more formal model: \(\hat{y}= b_0 + b_1x\)

\(\hat{y}\) is our predicted value, \(b_0\) is the \(y\)-intercept (the predicted value when \(x\) is 0), and \(b_1\) is the slope (the change in \(\hat{y}\) for a one-unit increase in \(x\)).

  • Helps with predictions
    • For values not in the sample, we can estimate their HDI score
  • Helps assess model fit - we can compare different lines more easily
    • More specifically we can calculate the residuals
    • A residual is the difference between the observed value and the value our line predicts - how much our line ‘missed’ by

Linear model for our data

\(\hat{y}= 0.589694 + 0.000006057761x\)
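As a minimal sketch of how this equation is used for prediction, the snippet below plugs a GDP per capita value into the fitted line. The function name `predict_hdi` and the example value of 10,000 are illustrative assumptions, and GDP per capita is assumed to be in the same units as the plotted data.

```python
# Coefficients reported on the slide for the least squares line.
INTERCEPT = 0.589694
SLOPE = 0.000006057761


def predict_hdi(gdp_per_capita):
    """Return the HDI predicted by the fitted line for a GDP per capita value."""
    return INTERCEPT + SLOPE * gdp_per_capita


# Hypothetical country with a GDP per capita of 10,000 (illustrative value only).
print(round(predict_hdi(10_000), 4))
```

Because the slope is so small, a prediction only moves noticeably when GDP per capita changes by thousands of units.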

Least squares line

  • But how to calculate?
  • Many different ways
    • Make a line minimizing the sum of the absolute deviations
    • Non-parametric lines
    • Make a line minimizing the sum of the squares of the deviations
  • Least squares line is most common
    • Advantages:
      • Easy to calculate
      • Well understood statistical properties
    • Disadvantages:
      • Line will be strongly influenced by outliers
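To illustrate the ‘easy to calculate’ point, here is a minimal sketch of the standard closed-form least squares formulas: the slope is the sum of cross-deviations divided by the sum of squared \(x\) deviations, and the intercept makes the line pass through the point of means. The function name and the toy data are assumptions for illustration, not the GDP/HDI data.

```python
def least_squares_line(x, y):
    """Return (b0, b1) for the line minimizing the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared x deviations.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # intercept: line passes through (x_bar, y_bar)
    return b0, b1


# Toy data, invented for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares_line(x, y)
print(b0, b1)
```

Changing one `y` value to something extreme and refitting is a quick way to see how strongly a single outlier can pull the line.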

Examining model fit

  • Checking the residuals
  • Residual standard deviation
  • \(R^2\)
  • Checking assumptions

Checking the residuals

All real datasets have noise so the real formula is:

\(y = b_0 + b_1x + e\)

Residual = Observed - Predicted

  • \(e = y - \hat{y}\)

We can easily plot the residuals: put the “size of the miss” on the \(y\) axis and the original data on the \(x\) axis.
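A minimal sketch of the residual calculation, using invented toy data and an assumed fitted line (\(b_0 = 2.2\), \(b_1 = 0.6\), the least squares line for these toy points), might look like this:

```python
# Toy data and an assumed fitted line, both for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6

# Predicted value for each x, then residual = observed - predicted.
predicted = [b0 + b1 * xi for xi in x]
residuals = [yi - yhat for yi, yhat in zip(y, predicted)]

for xi, e in zip(x, residuals):
    print(xi, round(e, 2))
```

Positive residuals mean the line predicted too low, and negative residuals mean it predicted too high. For a least squares line the residuals sum to zero.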

Residuals - our data

Graphing the residuals

Residuals vs. observed data

Residual standard deviation

  • Since the residuals are just another quantitative variable, we can examine their distribution
    • What to look for: symmetrical, no skew/outliers
    • Standard deviation not too large
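As a rough sketch, the residual standard deviation can be computed from the residuals directly. The version below divides by \(n - 2\), a common textbook convention for a line with two estimated coefficients. That divisor and the toy residuals are assumptions, not taken from the slides.

```python
import math

# Toy residuals, invented for illustration.
residuals = [-0.8, 0.6, 1.0, -0.6, -0.2]
n = len(residuals)

# Sum of squared residuals, divided by n - 2 (two estimated coefficients).
sse = sum(e ** 2 for e in residuals)
residual_sd = math.sqrt(sse / (n - 2))

print(round(residual_sd, 3))
```

The result is in the units of the response variable, so it can be read as the typical size of a miss.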

Residual standard deviation - our data

How would you interpret this histogram of the residuals?

\(R^2\)

\(R^2\) is just the return of \(r\), the correlation coefficient: in a simple linear regression, \(R^2\) is \(r\) squared. Remember:

  • \(r\) measures the strength of the linear association between \(x\) and \(y\)
    • That is, how reliably \(x\) varies with \(y\)
  • The correlation coefficient: 0.74
  • Our \(R^2\): 0.54
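The link between \(r\) and \(R^2\) can be checked numerically. The sketch below uses invented toy data and computes \(R^2\) in two ways: by squaring \(r\), and as one minus the ratio of the residual sum of squares to the total sum of squares. The variable names are illustrative.

```python
import math

# Toy data, invented for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)   # total sum of squares
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Correlation coefficient.
r = sxy / math.sqrt(sxx * syy)

# Least squares line and its residual sum of squares.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = 1 - sse / syy
print(round(r ** 2, 4), round(r_squared, 4))
```

Applied to the rounded values above, \(0.74^2 \approx 0.55\), close to the reported 0.54; the small gap is presumably due to rounding of \(r\).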

What do you think the \(R^2\) will change to when we remove the outlier?

  • The correlation coefficient for a model with the outlier removed:
  • Our \(R^2\) with the outlier removed:

How to interpret \(R^2\)

  • If there are no serious outliers and the relationship is linear, \(R^2\) can provide a useful measure of how strongly the predictor variable is related to the response variable
    • The two assumptions above are quite strong - always draw a picture to make sure they are true!
    • \(R^2\) should not be interpreted as how strongly \(x\) causes \(y\); we only know about association.

Regression assumptions

  • Quantitative variable assumption
  • Straight enough condition
  • Outlier condition
  • “Does the plot thicken?” condition

Have we met these?

Reexpressions

Log reexpressed

What will happen to the shape of the graph?
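As a hedged illustration of what a log reexpression can do, the sketch below compares the correlation before and after taking the base-10 log of the predictor. It assumes the reexpression is applied to GDP per capita, and the data are invented to follow a curved pattern; they are not the real GDP/HDI values.

```python
import math


def correlation(x, y):
    """Return the correlation coefficient r between two equal-length lists."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Invented values: y rises quickly at low x, then levels off.
gdp = [500, 1000, 5000, 10000, 50000]
hdi = [0.40, 0.48, 0.65, 0.72, 0.90]

# Reexpress the predictor on a log scale.
log_gdp = [math.log10(g) for g in gdp]

r_raw = correlation(gdp, hdi)
r_log = correlation(log_gdp, hdi)
print(round(r_raw, 2), round(r_log, 2))
```

With a curved pattern like this one, the log scale spreads out the low values and compresses the high ones, so the points fall much closer to a straight line.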

Log reexpressed - outlier

Any guess as to the outlier?

Outlier

Equatorial Guinea map

President’s son

President’s son’s cars

Graphing the residuals - log

Residual standard deviation - log

\(R^2\)

  • The correlation coefficient: 0.94
  • Our \(R^2\): 0.89

Regression assumptions

For the log reexpressed version, have the assumptions been met?

  • Quantitative variable assumption
  • Straight enough condition
  • Outlier condition
  • “Does the plot thicken?” condition