Returning to Regression

Author: Professor MacDonald

Published: April 28, 2025

Returning to regression

  • Moving from description to inference
  • Review of regression conditions
  • Sampling distribution of models
  • Inference for regression

Moving from description to inference

Returning to regression

How would you interpret all parts of this model except the $p$ value, $t$ value, and Std. Error?

Model

  • The equation of the least squares line: $\hat{mpg} = 46.32 - 15.35 \times weight$
  • The slope of $-15.35$ indicates that the miles per gallon of cars is on average 15.35 lower for each additional ton the car weighs.
  • How useful is the model?
  • The slope and intercept are descriptions of the data; we want to know how certain we are of this slope estimate.
  • We want to understand what the model can tell us beyond the 400 cars in the study.
  • Construct CIs and test hypotheses about the slope and intercept (see the sketch below).
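
As a concrete illustration, here is a minimal Python sketch of that workflow using statsmodels. The `weight`/`mpg` values are simulated stand-ins (the actual 400-car dataset isn't reproduced here), so the printed numbers are illustrative only.

```python
# Minimal sketch: fit the least squares line, then get a CI and a test
# for the slope. The data are simulated stand-ins for the 400-car dataset.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
weight = rng.uniform(1.0, 3.0, size=400)           # weight in tons (made up)
mpg = 46.32 - 15.35 * weight + rng.normal(0, 3, size=400)

X = sm.add_constant(weight)                        # add the intercept column
fit = sm.OLS(mpg, X).fit()

print(fit.params)      # b0, b1
print(fit.conf_int())  # 95% CIs for the intercept and slope
print(fit.pvalues)     # p values for H0: coefficient = 0
```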

Sample vs. model regression

  • Sample:
    • $\hat{y} = b_0 + b_1 x$
    • This gives a prediction for $y$ based on the sample.
  • Model:
    • $\mu_y = \beta_0 + \beta_1 x$
    • $\beta_0$ = y-intercept for the model
    • $\beta_1$ = slope for the model
    • The model assumes that for every value of $x$, the mean of the $y$’s lies on the line.

Slope variability

Just as with a mean, if we sample repeatedly from a population, chance alone will produce variation in our estimate of the slope of the relationship between the two variables.

We can then make a histogram of these slope estimates.

What shape should we expect the distribution of slope estimates to take, and why?
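
A quick simulation makes the point. This sketch (with made-up population values) draws many samples, refits the line each time, and summarizes the slope estimates; their histogram should look roughly normal, centered at the true slope.

```python
# Sketch: sample repeatedly from one simulated population and watch the
# slope estimate vary from sample to sample.
import numpy as np

rng = np.random.default_rng(1)
slopes = []
for _ in range(2000):
    x = rng.uniform(1.0, 3.0, size=400)
    y = 46.32 - 15.35 * x + rng.normal(0, 3, size=400)
    b1, b0 = np.polyfit(x, y, deg=1)   # least squares slope and intercept
    slopes.append(b1)

# The distribution of `slopes` should be roughly normal around -15.35.
print(np.mean(slopes), np.std(slopes))
```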

Errors

  • The model predicts the mean of $y$ for each $x$, but misses the actual individual values of $y$.
    • $\mu_y = \beta_0 + \beta_1 x$
  • The error, $e$, is the amount by which the line misses the value of $y$.
    • $y = \beta_0 + \beta_1 x + e$
  • This equation gives the exact value of each of the $y$’s.

Review of regression conditions

Straight enough condition

  • Straight enough condition
    • Does the scatterplot look relatively straight?
      • Don’t draw the line. It can fool you.
    • Look at scatterplot of the residuals.
      • Should have horizontal direction
      • Should not have a pattern
    • If straight enough, check the other assumptions.
    • If not straight, stop or re-express.

Independence assumption

  • Errors ($e$’s) must be independent of each other.
    • Check the residuals plot.
      • Should not have clumps, trends, or a pattern.
      • Should look very random.
    • To make inferences about the population, the sample must be representative.
    • For $x = \text{time}$, plot each residual vs. the residual one step later.
      • Should look very random.

We can partially check the independence assumption by viewing the residuals.
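
For data ordered in time, a lag-1 plot of the residuals is one way to do this. A sketch, using stand-in residuals rather than any real model's output:

```python
# Plot each residual against the next one; a shapeless cloud is what we
# want, while a trend suggests dependence. `resid` here is simulated.
import numpy as np
import matplotlib.pyplot as plt

resid = np.random.default_rng(2).normal(0, 3, size=400)   # stand-in residuals

plt.scatter(resid[:-1], resid[1:], s=10)
plt.xlabel("residual at step t")
plt.ylabel("residual at step t + 1")
plt.title("Lag-1 residual plot: should look random")
plt.show()
```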

Equal variance assumption

  • The variability of $y$ should be the same for all values of $x$.
    • The “does the plot thicken?” condition: spread along the line should be nearly constant.
      • A “fan shape” is bad.
    • The standard deviation of the residuals, $s_e$, will be used for CIs and hypothesis tests.
      • This requires the same variance for each $x$.

The equal variance assumption can also be checked by viewing the residual plot.
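
A sketch of that residual plot, again on simulated car data (fan shapes or bends here would signal trouble):

```python
# Residuals vs. fitted values: the band should have roughly constant width.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.uniform(1.0, 3.0, size=400)
y = 46.32 - 15.35 * x + rng.normal(0, 3, size=400)   # simulated data
b1, b0 = np.polyfit(x, y, deg=1)
fitted = b0 + b1 * x
resid = y - fitted

plt.scatter(fitted, resid, s=10)
plt.axhline(0, color="gray", linewidth=1)
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()
```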

Normal population assumption

  • As with the $t$ and $z$ tests, we rely on the Central Limit Theorem for the sampling distribution of our statistic to be approximately normal.
    • The statistic in this case is not a mean but a regression coefficient.
  • To meet this condition, the errors for each fixed $x$ must follow the Normal model.
    • It is good enough to check the Nearly Normal Condition and the Outlier Condition for the residuals.
    • Look at the histograms.
    • With large sample sizes, the Nearly Normal Condition is usually satisfied.

The normal population assumption can be partially checked by viewing a histogram of the residuals.
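
A histogram, paired with a normal probability plot, is an easy check. A sketch with stand-in residuals:

```python
# Nearly Normal check: histogram and normal probability (QQ) plot of the
# residuals. `resid` is simulated for illustration.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

resid = np.random.default_rng(4).normal(0, 3, size=400)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(resid, bins=30)
ax1.set_title("Histogram of residuals")
stats.probplot(resid, dist="norm", plot=ax2)   # points should hug the line
plt.tight_layout()
plt.show()
```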

Do these plots indicate that the conditions for regression are satisfied?

Which comes first, the conditions or the residuals?

  1. Check Straight Enough Condition with scatterplot.
  2. Fit regression, find predicted values and residuals.
  3. Make scatterplot of residuals and check for thickening, bends, and outliers.
  4. For data measured over time, use residuals plot to check for independence.
  5. Check the Nearly Normal Condition with a histogram and Normal plot of residuals.
  6. If no violations, proceed with inference.
  • Note: Stop if at any point there is a violation.

Sampling distribution of models

Sample-to-sample variation of the slope and intercept

  • Null hypothesis (usually): the regression slope is 0.
    • $H_0: \beta_1 = 0$
    • $H_a: \beta_1 \neq 0$
  • The $p$ values in the table come from tests of this hypothesis for each coefficient.
  • To calculate a $p$ value, we need to describe our sampling distribution.
  • The mean of the sampling distribution of the regression slope will be 0, under the null hypothesis.
  • We assume the shape of the sampling distribution will be normal (from the Central Limit Theorem).
  • Standard errors (our estimate of the standard deviation of the sampling distribution) must come from the data.
  • Each sample of 400 cars will produce its own line with slightly different $b_0$’s and $b_1$’s.

Spread around the line

  • Less scatter along the line → more consistent slope estimates.
  • The residual standard deviation, $s_e$, measures this scatter.

Spread around the line
  • Less scatter around the line means a smaller residual standard deviation and a stronger relationship between $x$ and $y$.
  • Some assess the strength of a regression by looking at $s_e$.
    • It has the same units as $y$.
    • It tells how close the data are to our model.
  • $r^2$ is the proportion of the variation of $y$ accounted for by $x$.

Spread in x’s

  • Larger $s_x$ (the SD of $x$) → more stable regression.

Sample size

  • Larger sample size → more stable regression.

Sample size stability

Standard error for the slope

  • Three aspects of the scatterplot, then, affect the standard error of the regression slope:
    • Spread around the model: $s_e$
    • Variation among the $x$ values: $s_x$
    • Sample size: $n$
  • These are in fact the only things that affect the standard error of the slope.

Standard error formally

  • The standard error for the slope parameter is estimated by:
    • $SE(b_1) = \frac{s_e}{\sqrt{n-1}\, s_x}$, where $s_e = \sqrt{\frac{\sum (y-\hat{y})^2}{n-2}}$
  • We can then calculate how many $t$ units our slope estimate is from the null hypothesis:
    • $t = \frac{b_1 - \beta_1}{SE(b_1)}$
    • This $t$ follows Student’s $t$ model with $df = n - 2$.
  • You don’t need to remember the SE formula; just note that the process is exactly the same as for a mean:
    • Find the number of $t$ units the slope is away from the null value.
    • Use that $t$ score to find a $p$ value: how likely it would be to observe a slope that far or farther from the null just by chance (see the sketch below).
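
To make the process concrete, this sketch computes $SE(b_1)$, $t$, and the $p$ value directly from the formulas above, on simulated data:

```python
# Compute SE(b1), the t statistic, and the two-sided p value by hand.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 400
x = rng.uniform(1.0, 3.0, size=n)
y = 46.32 - 15.35 * x + rng.normal(0, 3, size=n)   # simulated data

b1, b0 = np.polyfit(x, y, deg=1)
resid = y - (b0 + b1 * x)

s_e = np.sqrt(np.sum(resid**2) / (n - 2))   # residual standard deviation
s_x = np.std(x, ddof=1)                     # sample SD of the x's
se_b1 = s_e / (np.sqrt(n - 1) * s_x)        # standard error of the slope

t = (b1 - 0) / se_b1                        # distance from H0: beta1 = 0
p = 2 * stats.t.sf(abs(t), df=n - 2)        # two-sided p value
print(se_b1, t, p)
```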

Inference for regression

Example - wages

  • It has long been known that women earn less than men on average, and this fact is often taken as evidence of discrimination.
  • How can we test this with statistics, though?

Gender wage gap

What about a $t$ test?

  • A slam-dunk case, right?

What are some reasons why this $t$ test might be misleading?

Other factors

  • Women dropping out of the labor force
  • The impact of high-earning men on the calculation (outliers)
  • Women’s career choices
  • Differences in educational attainment
  • Other factors?

1. Check straight enough condition

Two-way scatterplots

Histogram of wage

Histogram of log(wage)

Do these plots suggest the straight enough condition has been met?

2. Fit the regression

Interpret this regression table.

3. Check the residuals

4. Check the Nearly Normal condition of the residuals

Do the residual plots suggest the conditions of the residuals have been met?

5. Proceed with inference

How should we understand the coefficient of $sex$ from the regression, compared to the simple difference of means between male and female?
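
One way to see the difference is to fit both models side by side. In this sketch the data frame and its columns (`logwage`, `female`, `educ`, `exper`) are hypothetical, built to mimic the situation rather than taken from the slides' wage data:

```python
# Raw gap vs. gap after controls: regressing on `female` alone reproduces
# the simple difference of means; adding controls changes the question to
# "holding education and experience fixed".
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 1000
df = pd.DataFrame({
    "female": rng.integers(0, 2, size=n),
    "educ": rng.normal(13, 2, size=n),
    "exper": rng.normal(10, 5, size=n),
})
df["logwage"] = (1.5 + 0.08 * df["educ"] + 0.02 * df["exper"]
                 - 0.10 * df["female"] + rng.normal(0, 0.4, size=n))

raw = smf.ols("logwage ~ female", data=df).fit()
adj = smf.ols("logwage ~ female + educ + exper", data=df).fit()
print(raw.params["female"])   # equals the difference in group means
print(adj.params["female"])   # gap independent of the other predictors
```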

Collinearity

  • Variables are said to be collinear when their correlation is very high.
    • Ex: HDI score and GDP per capita
  • The regression model tries to apportion responsibility among the predictors, crediting each predictor independent of the others.
    • i.e., what is the impact of $female$ INDEPENDENT of all other predictors?

Collinearity

  • When a predictor is collinear:
    • Its coefficient can be surprising: an unanticipated sign, or an unexpectedly large or small value.
    • The SE of the coefficient can be inflated, leading to a smaller $t$ statistic and a larger $p$ value (see the sketch below).
  • What to do?
    • Remove some of the predictors.
      • This simplifies the model and (usually) improves the $t$ statistic of the slope.
    • Keep the predictors that are most reliably measured, least expensive to find, or politically important.
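
A small simulation shows the SE inflation. Here `x1` and `x2` are nearly identical by construction; everything is made up for illustration:

```python
# Collinearity sketch: two nearly identical predictors fight over the
# same credit, so their slope SEs blow up; drop one and the SE shrinks.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.05, size=n)   # almost a copy of x1
y = 2 * x1 + rng.normal(size=n)

both = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
alone = sm.OLS(y, sm.add_constant(x1)).fit()

print(both.bse)    # inflated SEs for the collinear pair
print(alone.bse)   # much smaller SE with the redundant predictor removed
```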