Lecture 4.1 - Comparing Groups

Author

Professor MacDonald

Published

April 23, 2025

Comparing groups

What type of inference is a $t$ test?
A confidence interval for the difference between two means
The two-sample $t$ test: testing for the difference between two means
Experiments and causality

What type of inference is a $t$ test?

A review - descriptive vs. inference

Table 1: Inference types - general

Type of analysis	Descriptive	Inferential
Univariate	Histogram, bar chart	Confidence interval
Univariate compared to theoretical expectation	QQ plot	One proportion z test, one proportion t test
Comparing two variables	Scatterplot, two variable regression	Two proportion z test, two proportion t test
Comparing many variables	Multiple variable regression	Multiple variable regression

A review - one vs. two mean test

Table 2: Inference types - mean

One mean test	Two mean test
Comparing the mean of your sample to some statement about the world	Comparing the mean of one part of your sample to another part of your sample
Null hypothesis: based on some belief we have about the general population, i.e. students sleep 7.03 hours	Null hypothesis: no difference between groups

Example

Table 3: Inference example

One mean test	Two mean test
$H_0$ : Our sample mean of hours of sleep is the same as all students in the world	$H_0$ : The sample mean of male students hours slept is the same as the mean of female students hours slept
$H_a$ : Our sample mean is different than the world’s population mean	$H_a$ : The sample mean of female students is different than the sample mean of male students

A confidence interval for the difference between two means/proportions

Difference between means/proportions: standard error

Want to find the $SE$ for $\bar{y}_1-\bar{y}_2$
Start with theoretical properties:
- $SD(\bar{y}_1-\bar{y}_2) = \sqrt{Var(\bar{y}_1) + Var(\bar{y}_2)}$
- $SD(\bar{y}_1-\bar{y}_2) = \sqrt{\frac{\sigma^2_1}{n_1} + \frac{\sigma^2_2}{n_2}}$
Don’t know the population σ\sigma for each subsample, so use the sample SDSDs as before
- $SE(\bar{y}_1-\bar{y}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$

Example: proportions

658 male passengers on the Titanic; 135 survived
388 female passengers on the Titanic; 292 survived
$SD(\hat{p}_1-\hat{p}_2) = \sqrt{\frac{p_1q_1}{n_1} + \frac{p_1q_2}{n_2}}$
$SE(0.205-0.753) = \sqrt{\frac{0.205\times0.795}{658} + \frac{0.753\times0.247}{388}}$
$SE(0.205-0.753) = \sqrt{0.00024 + 0.000479}$
$SE(0.205-0.753) = 0.0269$

Confidence interval

What is the 95% confidence interval of the difference in means?
$\bar{y}_1-\bar{y}_2\pm Critical value \times SE$
$-0.548\pm z^* \times 0.0269$
$-0.548\pm 1.96 \times 0.0269$
$-0.548\pm 0.0528$

What can you conclude from this - how can you state the results? What are some factors that are omitted?

CI for the difference between two proportions/means

First find two-sample $z$ / $t$ interval for the difference in means
Then apply two-sample $z$ / $t$ test
Interval looks like others we have seen
- $\bar{y}_1-\bar{y}_2\pm ME$
- $ME = t^*/z^*\times SE(\bar{y}_1-\bar{y}_2)$
Uses the $z$ model (proportion) or Student’s $t$ model (mean)
The degrees of freedom for $t$ are complicated, so just use a computer

Sampling distribution for the difference between two means

When the conditions are met, the sampling distribution of the standardized sample difference between the means of two independent groups:
- $t = \frac{(\bar{y}_1-\bar{y}_2) - (\mu_1 - \mu_2)}{SE(\bar{y}_1-\bar{y}_2)}$
Uses the Student’s $t$ model
Degrees of freedom are found with a special formula
Think carefully here about what we are modeling

Assumptions

Independence assumption:
- Within each group, individual responses should be independent of each other.
- Knowing one response should not provide information about other responses.
Randomization condition:
- If responses are selected with randomization, their independence is likely.
Independent Groups Assumption
- Responses in the two groups are independent of each other.
- Knowing how one group responds should not provide information about the other group.

Assumptions continued

Nearly normal condition
- Check this for both groups
- A violation by either one, violates the condition
- $n < 15$ in either group: should not use these methods if the histogram or Normal probability plot shows severe skewness
- $n$ closer to 40 for both groups: mildly skewed histogram is OK
- $n > 40$ for both groups: Fine as long as no extreme outliers or extreme skewness

Confidence interval formally

When the conditions are met, the confidence interval for the difference between means from two independent groups is
- $(\bar{y}_1-\bar{y}_2)\pm t^*_{df}\times SE(\bar{y}_1-\bar{y}_2)$
- where $SE(\bar{y}_1-\bar{y}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
Critical value $t^*_{df}$ depends on confidence level $C$

The two sample $t$ test: testing for the difference between two means

A two sample $t$ test for difference between means

Conditions same as two-sample t-interval
- $H_0: \mu_1-\mu_2 = \Delta_0$ ( $\Delta_0$ usually $0$ )
When the conditions are met and the null hypothesis is true, use the Student’s tt model to find the pp value.
- $t = \frac{(\bar{y}_1-\bar{y}_2) - \Delta_0}{SE(\bar{y}_1-\bar{y}_2)}$
- $SE(\bar{y}_1-\bar{y}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$

Step by step example

Is there a difference in housing price depending on if the house has a view?
- Think →\rightarrow
  - Plan: I have housing prices from many thousands of houses in King County, assumed to have been sampled randomly.
  - Hypotheses
  - $H_0: \mu_w-\mu_{notw}=0$
  - $H_a: \mu_w-\mu_{notw}\ne0$

Step by step example

Think →\rightarrow
- Mean price of house with a view: 1772071, mean price no view: 1139608
- Model:
  - Randomization Condition: Subjects assigned to treatment groups randomly?
  - Independent Groups Assumption: Sampling method gives independent groups?
  - Nearly normal condition: Histograms are reasonably unimodal and symmetric?
  - The assumptions and conditions are reasonable?

After analyzing these assumptions , are we justified in using the Student’s t-model to perform a two-sample t-test?

Step by step example

Show
- Mechanics
  - Mean price of house with a view: 1772071
  - Mean price no view: 1139608
  - SD of view: 1128502
  - SD no view: 823799
  - $n$ of view: 433
  - $n$ of no view 21504

What is the formula we should use in the next step?

Step by step example

Show
- Mechanics
  - solve for the SE: $SE(\bar{y}_1-\bar{y}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
  - solve for the SE: $SE(\bar{y}_1-\bar{y}_2) = \sqrt{\frac{1128502}{433} + \frac{823799}{21504}}$
  - solve for the SE:
  - find $t$ score: $t = \frac{(\bar{y}_1-\bar{y}_2) - \Delta_0}{SE(\bar{y}_1-\bar{y}_2)}$
  - find $t$ score: $t = \frac{(1772071-1139608) - 0}{54523}$
  - find $t$ score: $t = 11.6$
  - find $p$ value: can use table with $df=433$ , $p = 0$

Step by step example

Alternatively, we can use the built-in $t$ test function

Show

What can we conclude, based on the results of this $t$ test? What are some assumptions of this $t$ test that may be violated?

Step by step example

Tell →\rightarrow
- Conclusion: the $p$ value = 0 is less than the critical value
- If there were no difference in the mean prices, then a difference this large would occur 1 times in millions of times
- Too rare to believe happened by chance? Yes
- Reject $H_0$ ? Yes
- Conclude that houses with a view are more expensive than regular houses? Yes

Experiments

Independence

Independence assumption:
- Within each group, individual responses should be independent of each other.
- Knowing one response should not provide information about other responses.
Randomization condition:
- If responses are selected with randomization, their independence is likely.
Independent Groups Assumption
- Responses in the two groups are independent of each other.
- Knowing how one group responds should not provide information about the other group.

The importance of the counterfactual

For causal inference, one should ask the counter-factual question, for those who received “treatment”, what would have happened to them if they hadn’t been treated?
That is, we only observe one state of reality (had more vegetables), but we want to know the DIFFERENCE the treatment had on the person by asking what would have happened if the DID NOT receive the treatment

The importance of the counterfactual

More formally, we are interested in the difference the treatment has on the response variable (Health)
On a child ( $y_1$ )that did receive more vegetables, we want to consider the case of what would have happened if they had NOT had the vegetables and find the treatment effect
- Or, $y_1^t- y_1^c$ = treatment effect ( $t$ denoting treatment; $c$ denoting control)
- Note that $y_1^t$ is observed, but $y_1^c$ is not.
The problem is one of missing data – how to estimate $y_1^c$ ?

Comparability problems

If subjects who receive treatment and those who do not are different in some important characteristics, we have selectivity bias – e.g. higher SES children were more likely to be in the vegetable treatment group
- Violates the independent group assumption because if rich children are more likely to be in the “eats vegetables” group we know that the observed value of the response variable, $heath$ , is likely to be higher
- Knowing which group they are in gives us some knowledge of what their observed value of $y$ will be
Often called “omitted variable bias.”
Big problem in observational studies – many variables are probably not present that we’d like to know
What are some omitted variables that might bias our finding that houses on the view have a higher price than houses not on the view?

Experiments

Experiments solve the omitted variable bias
Random assignment of treatment and control status ensures that subjects differ ON AVERAGE only in the treatment they receive
We can then compute the Average Treatment Effect (ATE) of being in the control vs. treatment
A $t$ test between treatment and control group will therefore be accurate
In observational studies, it is very rare to be able to guarantee that assignment to the two groups are independent of the response variable.
- If there are important other omitted variables that influence assignment to the two groups, need to control for these omitted variables via a multiple regression

Drawbacks of experiments

Lack of generalizability – Often done on college students or in contrived settings (external validity)
Cost – very expensive to run a full experiment
Ethics – why shouldn’t we give positive treatments to everyone?
Mechanically complicated
- Difficult to ensure proper randomization
- Difficult to design appropriate treatments
- Difficult to develop appropriate measurements

Multiple regression

Multiple Regression
- Attempts to control for, or estimate the treatment effect, for each variable included, INDEPENDENT of the other variables
- How sure are we of the treatment effect?
- $t$ test of the slope of the regression line
- Null hypothesis is that treatment variable (“eats vegetables”) makes no difference on response variable
- If slope is non-zero, it indicates that differences in treatment produce differences in response variable (increase education $\rightarrow$ increase in wages)

Comparing groups

What type of inference is a tt test?

A review - descriptive vs. inference

A review - one vs. two mean test

Example

A confidence interval for the difference between two means/proportions

Difference between means/proportions: standard error

Example: proportions

Confidence interval

CI for the difference between two proportions/means

Sampling distribution for the difference between two means

Assumptions

Assumptions continued

Confidence interval formally

The two sample tt test: testing for the difference between two means

A two sample tt test for difference between means

Step by step example

Step by step example

Step by step example

Step by step example

Step by step example

Step by step example

Experiments

Independence

The importance of the counterfactual

The importance of the counterfactual

Comparability problems

Experiments

Drawbacks of experiments

Multiple regression

What type of inference is a $t$ test?

The two sample $t$ test: testing for the difference between two means

A two sample $t$ test for difference between means