Simple Linear Regression
Linear model
Line of best fit
How to describe the relationship between GDP per capita and HDI score?
As we learned:
- Direction
- Form
- Strength
- Outliers
What do you expect the relationship between GDP per capita and HDI will be?
Scatterplot of GDP per capita and HDI
Smoother
Taking a guess
What do you think the intercept and the slope should be for a line of ‘good’ fit?
Least squares line
Slope: 0.000006057761, intercept: 0.589694
Linear model
It’s better if we come up with a more formal model: \(\hat{y}= b_0 + b_1x\)
Here \(\hat{y}\) is our predicted value, \(b_0\) is the \(y\)-intercept (the predicted value when \(x\) is 0), and \(b_1\) is the slope (the change in \(\hat{y}\) for a one-unit increase in \(x\)).
- Helps with predictions
- For values not in the sample, we can estimate their HDI score
- Helps assess model fit - we can compare different lines more easily
- More specifically we can calculate the residuals
- Residuals are the difference between the actually observed value and the value our line predicts - how much our line ‘missed’ by
Linear model for our data
\(\hat{y}= 0.589694 + 0.000006057761x\)
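To make the "helps with predictions" point concrete, here is a minimal sketch that plugs values into the fitted line. The coefficients come from the slide; the function name `predict_hdi` and the example GDP values are assumptions made up for illustration.

```python
# Coefficients are taken from the slide; everything else is illustrative.
B0 = 0.589694          # intercept: predicted HDI when GDP per capita is 0
B1 = 0.000006057761    # slope: change in predicted HDI per 1 unit of GDP per capita


def predict_hdi(gdp_per_capita: float) -> float:
    """Return the predicted HDI (y-hat) for a given GDP per capita (x)."""
    return B0 + B1 * gdp_per_capita


# Hypothetical GDP per capita values, not taken from the dataset
for gdp in (0, 10_000, 50_000):
    print(f"GDP per capita {gdp:>6}: predicted HDI = {predict_hdi(gdp):.3f}")
```

Because the slope is so small per unit of GDP per capita, it is often easier to read it per 10,000 units: roughly 0.06 HDI points.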
Least squares line
- But how to calculate?
- Many different ways
- Make a line minimizing the least absolute deviations
- Non-parametric lines
- Make a line minimizing the sum of the squares of the deviations
- Least squares line is most common
- Advantages:
- Easy to calculate
- Well understood statistical properties
- Disadvantages:
- Line will be strongly influenced by outliers
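As a rough sketch of "easy to calculate" and "strongly influenced by outliers", the least squares slope and intercept can be computed directly from sums of deviations. The data below are made up for illustration (they are not the GDP/HDI data), and the function name `least_squares` is an assumption.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the line minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up points that lie exactly on y = 1 + 2x
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
b0, b1 = least_squares(xs, ys)
print(f"without outlier: intercept={b0:.2f}, slope={b1:.2f}")

# The same points plus one made-up outlier at (6, 0)
b0_out, b1_out = least_squares(xs + [6], ys + [0])
print(f"with outlier:    intercept={b0_out:.2f}, slope={b1_out:.2f}")
```

In this toy example a single unusual point pulls the slope from 2 down to about 0.14, which is the disadvantage listed above.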
Examining model fit
- Checking the residuals
- Residual standard deviation
- \(R^2\)
- Checking assumptions
Checking the residuals
All real datasets have noise, so the formula for an observed value includes an error term \(e\):
\(y = b_0 + b_1x + e\)
Residual = Observed - Predicted
- \(e = y - \hat{y}\)
We can easily plot the residuals: put the “size of the miss” on the \(y\) axis and the original data on the \(x\) axis.
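A small sketch of the residual calculation, using the fitted line from the earlier slide. The three (GDP per capita, observed HDI) pairs are hypothetical values invented for illustration, not rows from the dataset.

```python
# Coefficients from the slide's fitted line
B0 = 0.589694
B1 = 0.000006057761

# Hypothetical (GDP per capita, observed HDI) pairs, for illustration only
observations = [(1_000, 0.45), (10_000, 0.70), (40_000, 0.90)]


def residual(x: float, y: float) -> float:
    """Residual = observed - predicted, i.e. e = y - y_hat."""
    y_hat = B0 + B1 * x
    return y - y_hat


residuals = [residual(x, y) for x, y in observations]

# Each (x, e) pair is one point on a residual plot:
# original data on the x axis, size of the miss on the y axis.
for (x, y), e in zip(observations, residuals):
    print(f"x={x:>6}  observed={y:.3f}  residual={e:+.3f}")
```

A negative residual means the line overpredicted that observation; a positive one means it underpredicted.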
Residuals - our data
Graphing the residuals
Residuals vs. observed data
Residual standard deviation
- Since the residuals are just another set of values, we can also examine their distribution
- What to look for: symmetrical, no skew/outliers
- Standard deviation not too large
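One way to compute the residual standard deviation is sketched below. It assumes the common convention of dividing by \(n - 2\) (two parameters, slope and intercept, were estimated); the slides do not say which divisor they use, and the residuals here are hypothetical.

```python
import math


def residual_sd(residuals):
    """Residual standard deviation, assuming the n - 2 divisor for a fitted line."""
    n = len(residuals)
    return math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))


# Hypothetical residuals (they sum to zero, as least squares residuals do)
example_residuals = [-0.2, 0.1, 0.1, -0.1, 0.1]
s_e = residual_sd(example_residuals)
print(f"residual standard deviation: {s_e:.3f}")
```

The result is in the same units as the response variable, so "not too large" should be judged against the scale of HDI (0 to 1).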
Residual standard deviation - our data
How would you interpret this histogram of the residuals?
\(R^2\)
\(R^2\) is just the return of \(r\), the correlation coefficient: in simple linear regression, \(R^2\) is \(r\) squared. Remember:
- \(r\) measures the strength of the association between \(x\) and \(y\)
- That is, how reliably \(x\) varies with \(y\)
- The correlation coefficient: 0.74
- Our \(R^2\): 0.54
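The link between \(r\) and \(R^2\) can be checked numerically. The sketch below uses made-up data (not the GDP/HDI data) and computes \(R^2\) as one minus the ratio of residual variation to total variation, which is an assumption about the definition since the slides do not spell it out.

```python
import math

# Made-up data for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Correlation coefficient
r = sxy / math.sqrt(sxx * syy)

# Least squares line, then R^2 = 1 - (sum of squared residuals) / (total variation in y)
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - sse / syy

print(f"r = {r:.4f}, r^2 = {r ** 2:.4f}, R^2 = {r_squared:.4f}")
```

Squaring the rounded 0.74 gives about 0.55 rather than 0.54; the slide's value presumably comes from squaring the unrounded correlation.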
What do you think the \(R^2\) will change to when we remove the outlier?
- The correlation coefficient for a model with the outlier removed:
- Our \(R^2\) with the outlier removed:
How to interpret \(R^2\)
- If there are no serious outliers and the relationship is linear, \(R^2\) can provide a useful measure of how strongly the predictor variable is related to the response variable
- The two assumptions above are quite strong - you always need to draw a picture to check that they hold!
- \(R^2\) should not be interpreted as how strongly \(x\) causes \(y\); we only know about association.
Regression assumptions
- Quantitative variable assumption
- Straight enough condition
- Outlier condition
- “Does the plot thicken?” condition
Have we met these?
Reexpressions
Log reexpressed
What will happen to the shape of the graph?
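A minimal sketch of a log re-expression, assuming the predictor (GDP per capita) is the variable being logged and that base 10 is used; the slides do not state either, and the correlation does not depend on the base. The data are invented so that HDI rises by a roughly constant amount each time GDP per capita is multiplied by 10.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up values, for illustration only
gdp = [500, 1_000, 5_000, 10_000, 50_000, 100_000]
hdi = [0.40, 0.46, 0.60, 0.66, 0.80, 0.86]

# Re-express the predictor on a log scale
log_gdp = [math.log10(x) for x in gdp]

r_raw = correlation(gdp, hdi)
r_log = correlation(log_gdp, hdi)
print(f"r with raw GDP: {r_raw:.2f}")
print(f"r with log GDP: {r_log:.2f}")
```

On the raw scale the points bend (steep at low GDP, flat at high GDP); on the log scale they fall close to a straight line, so the linear model fits much better.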
Log reexpressed - outlier
Any guess as to the outlier?
Outlier
Graphing the residuals - log
Residual standard deviation - log
\(R^2\)
- The correlation coefficient: 0.94
- Our \(R^2\): 0.89
Regression assumptions
For the log reexpressed version, have the assumptions been met?
- Quantitative variable assumption
- Straight enough condition
- Outlier condition
- “Does the plot thicken?” condition