Simple Linear Regression

Author

Professor MacDonald

Published

March 31, 2025

Linear model

Line of best fit

How to describe the relationship between GDP per capita and HDI score?

As we learned:

Direction
Form
Strength
Outlier

What do you expect the relationship between GDP per capita and HDI will be?

Scatterplot of `GDP per capita` and `HDI`

Smoother

Taking a guess

What do you think the intercept and the slope should be for a line of ‘good’ fit?

Least squares line

Slope: 0.000006057761, intercept: 0.589694

Linear model

It’s better if we come up with a more formal model: \(\hat{y}= b_0 + b_1x\)

\(\hat{y}\) is our predicted value
\(b_0\) is the \(y\) intercept - the predicted value when \(x\) is 0
\(b_1\) is the slope - the predicted change in \(y\) for a one-unit increase in \(x\)

Helps with predictions
- For values not in the sample, we can estimate their HDI score
Helps assess model fit - we can compare different lines more easily
- More specifically we can calculate the residuals
- Residuals are the difference between the actually observed value and the value our line predicts - how much our line ‘missed’ by

Linear model for our data

\(\hat{y}= 0.589694 + 0.000006057761x\)
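To make the "helps with predictions" point concrete, here is a minimal sketch that plugs values into the fitted line. The intercept and slope are the ones reported on the slides; the function name `predict_hdi` and the GDP per capita inputs are made up for illustration.

```python
# Coefficients reported on the slides for the least squares line.
INTERCEPT = 0.589694
SLOPE = 0.000006057761


def predict_hdi(gdp_per_capita):
    """Return the predicted HDI (y-hat) for a given GDP per capita (x)."""
    return INTERCEPT + SLOPE * gdp_per_capita


# Hypothetical inputs, chosen only for illustration.
for gdp in (0, 20_000, 50_000):
    print(f"GDP per capita {gdp:>6}: predicted HDI = {predict_hdi(gdp):.3f}")
```

At a GDP per capita of 0 the prediction is just the intercept, and every additional 10,000 in GDP per capita raises the predicted HDI by about 0.06.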

Least squares line

But how to calculate?
Many different ways
- Make a line minimizing the sum of the absolute deviations (least absolute deviations)
- Non-parametric lines
- Make a line minimizing the sum of the squares of the deviations
Least squares line is most common
- Advantages:
  - Easy to calculate
  - Well understood statistical properties
- Disadvantages:
  - Line will be strongly influenced by outliers
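The "easy to calculate" and "influenced by outliers" points can both be seen in a short sketch. It uses the standard closed-form formulas for simple linear regression, \(b_1 = \sum (x - \bar{x})(y - \bar{y}) / \sum (x - \bar{x})^2\) and \(b_0 = \bar{y} - b_1\bar{x}\). The slides do not show how their line was computed, so treat this as one common way to do it; the function name and the data are made up.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the line minimizing the sum of squared deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up data lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
clean_fit = least_squares_line(xs, ys)

# Same data with one hypothetical outlier added at x = 6.
outlier_fit = least_squares_line(xs + [6], ys + [40])

print("without outlier (intercept, slope):", clean_fit)
print("with outlier    (intercept, slope):", outlier_fit)
```

A single extra point moves the slope from 2 to about 5.86, because squaring the deviations gives large misses a lot of weight.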

Examining model fit

Checking the residuals
Residual standard deviation
\(R^2\)
Checking assumptions

Checking the residuals

All real datasets have noise so the real formula is:

\(y = b_0 + b_1x + e\)

Residual = Observed - Predicted

\(e = y - \hat{y}\)

We can easily plot the residuals: put the “size of the miss” on the \(y\) axis and the original data on the \(x\) axis
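As a small illustration of \(e = y - \hat{y}\), the sketch below computes residuals for a candidate line. The data, the line \(\hat{y} = 1 + 2x\), and the function name are all hypothetical.

```python
def residuals(xs, ys, intercept, slope):
    """Return e = y - y_hat for every observation."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Made-up observations and a hypothetical candidate line y_hat = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3.5, 4.5, 7.5, 8.5]
es = residuals(xs, ys, intercept=1.0, slope=2.0)

for x, e in zip(xs, es):
    side = "above" if e > 0 else "below"
    print(f"x = {x}: residual = {e:+.1f} (point is {side} the line)")
```

A positive residual means the line under-predicted that observation; a negative one means it over-predicted.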

Residuals - our data

Graphing the residuals

Residuals vs. observed data

Residual standard deviation

Since the residuals are just another variable, we can also examine their distribution
- What to look for: symmetrical, no skew/outliers
- Standard deviation not too large
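The slides do not give a formula for the residual standard deviation. A common convention for simple linear regression, assumed here, is \(s_e = \sqrt{\sum e^2 / (n - 2)}\), where dividing by \(n - 2\) accounts for the two estimated coefficients. The residuals below are made up.

```python
import math


def residual_sd(es):
    """Residual standard deviation, dividing by n - 2 (two estimated coefficients)."""
    n = len(es)
    return math.sqrt(sum(e ** 2 for e in es) / (n - 2))


# Made-up residuals, for illustration only.
es = [0.5, -0.5, 0.5, -0.5]
print(f"residual standard deviation = {residual_sd(es):.3f}")
```

Whether a given value counts as "not too large" depends on the scale of the response variable, which for HDI is bounded between 0 and 1.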

Residual standard deviation - our data

How would you interpret this histogram of the residuals?

\(R^2\)

\(R^2\) is just the square of \(r\), the correlation coefficient. Remember:

\(r\) measures the strength of the association between \(x\) and \(y\)
- That is, how reliably \(x\) varies with \(y\)
The correlation coefficient: 0.74
Our \(R^2\): 0.54
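To see how \(r\) and \(R^2\) relate, this sketch computes \(r\) directly and \(R^2\) as one minus the ratio of the residual sum of squares to the total sum of squares. For a least squares line with one predictor the two agree: \(R^2 = r^2\). The data and function names are made up.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def r_squared(xs, ys):
    """R^2 of the least squares line: 1 - SSE / SST."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst


# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(f"r = {r:.3f}, r^2 = {r ** 2:.3f}, R^2 = {r_squared(xs, ys):.3f}")
```

Squaring the rounded 0.74 from the slide gives about 0.55; the reported 0.54 presumably comes from squaring the unrounded \(r\).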

What do you think the \(R^2\) will change to when we remove the outlier?

The correlation coefficient for a model with the outlier removed:

Our \(R^2\) with the outlier removed:

How to interpret \(R^2\)

If there are no serious outliers and the relationship is linear, \(R^2\) can provide a useful measure of how strongly the predictor variable is related to the response variable
- The two assumptions above are quite strong - you always need to draw a picture to check that they hold!
- \(R^2\) should not be interpreted as how strongly \(x\) causes \(y\); we only know about association.

Regression assumptions

Quantitative variable assumption
Straight enough condition
Outlier condition
“Does the plot thicken?” condition

Have we met these?

Reexpressions

Log reexpressed

What will happen to the shape of the graph?
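A minimal sketch of a log reexpression, assuming the log is applied to the predictor (GDP per capita); the slides do not state which variable is transformed or which base is used, and the base does not affect \(R^2\). The data are made up so that `hdi` rises quickly at low `gdp` and then levels off.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: a curved relationship on the raw scale.
gdp = [500, 1000, 2000, 4000, 8000, 16000, 32000, 64000]
hdi = [0.40, 0.47, 0.52, 0.60, 0.66, 0.73, 0.78, 0.86]

# Reexpress the predictor on a log scale (base 10 assumed here).
log_gdp = [math.log10(x) for x in gdp]

r2_raw = correlation(gdp, hdi) ** 2
r2_log = correlation(log_gdp, hdi) ** 2
print(f"R^2 on raw GDP per capita: {r2_raw:.2f}")
print(f"R^2 on log GDP per capita: {r2_log:.2f}")
```

When the raw scatterplot bends like this, the log scale pulls in the large GDP values and the points fall much closer to a straight line.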

Log reexpressed - outlier

Any guess as to the outlier?

Outlier

Graphing the residuals - log

Residual standard deviation - log

\(R^2\)

The correlation coefficient: 0.94
Our \(R^2\): 0.89

Regression assumptions

For the log reexpressed version, have the assumptions been met?

Quantitative variable assumption
Straight enough condition
Outlier condition
“Does the plot thicken?” condition
