Lecture 4.1 - Comparing Groups

Author

Professor MacDonald

Published

April 23, 2025

Comparing groups

  • What type of inference is a $t$ test?
  • A confidence interval for the difference between two means
  • The two-sample $t$ test: testing for the difference between two means
  • Experiments and causality

Seattle waterfront

What type of inference is a $t$ test?

A review - descriptive vs. inference

Table 1: Inference types - general
Type of analysis | Descriptive | Inferential
Univariate | Histogram, bar chart | Confidence interval
Univariate compared to theoretical expectation | QQ plot | One-proportion z test, one-mean t test
Comparing two variables | Scatterplot, two-variable regression | Two-proportion z test, two-mean t test
Comparing many variables | Multiple variable regression | Multiple variable regression

A review - one vs. two mean test

Table 2: Inference types - mean
One mean test | Two mean test
Comparing the mean of your sample to some statement about the world | Comparing the mean of one part of your sample to another part of your sample
Null hypothesis: based on some belief we have about the general population, e.g. students sleep 7.03 hours | Null hypothesis: no difference between groups

Example

Table 3: Inference example
One mean test | Two mean test
$H_0$: Our sample mean of hours of sleep is the same as that of all students in the world | $H_0$: The sample mean of male students' hours slept is the same as the mean of female students' hours slept
$H_a$: Our sample mean is different from the world's population mean | $H_a$: The sample mean of female students is different from the sample mean of male students

A confidence interval for the difference between two means/proportions

Difference between means/proportions: standard error

  • Want to find the $SE$ for $\bar{y}_1-\bar{y}_2$
  • Start with theoretical properties:
    • $SD(\bar{y}_1-\bar{y}_2) = \sqrt{Var(\bar{y}_1) + Var(\bar{y}_2)}$
    • $SD(\bar{y}_1-\bar{y}_2) = \sqrt{\frac{\sigma^2_1}{n_1} + \frac{\sigma^2_2}{n_2}}$
  • Don’t know the population $\sigma$ for each subsample, so use the sample $SD$s as before
    • $SE(\bar{y}_1-\bar{y}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
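The SE formula above translates directly into code. A minimal sketch in Python (the slides don't specify a language; the sample SDs and sizes below are hypothetical, for illustration only):

```python
import math

def se_diff_means(s1, n1, s2, n2):
    """SE(ybar1 - ybar2) = sqrt(s1^2/n1 + s2^2/n2)."""
    return math.sqrt(s1**2 / n1 + s2**2 / n2)

# Hypothetical sample SDs and sizes, for illustration only
print(se_diff_means(4.0, 50, 3.0, 40))  # ≈ 0.738
```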

Example: proportions

  • 658 male passengers on the Titanic; 135 survived ($\hat{p}_1 = 0.205$)
  • 388 female passengers on the Titanic; 292 survived ($\hat{p}_2 = 0.753$)
  • $SE(\hat{p}_1-\hat{p}_2) = \sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$
  • $SE(0.205-0.753) = \sqrt{\frac{0.205\times0.795}{658} + \frac{0.753\times0.247}{388}}$
  • $SE(0.205-0.753) = \sqrt{0.000248 + 0.000479}$
  • $SE(0.205-0.753) = 0.0269$
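The same computation, sketched in Python from the raw counts on the slide:

```python
import math

# Titanic counts from the slide
n_m, surv_m = 658, 135   # male passengers, survivors
n_f, surv_f = 388, 292   # female passengers, survivors

p_m = surv_m / n_m       # ≈ 0.205
p_f = surv_f / n_f       # ≈ 0.753

se = math.sqrt(p_m * (1 - p_m) / n_m + p_f * (1 - p_f) / n_f)
print(round(se, 4))      # ≈ 0.027
```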

Confidence interval

  • What is the 95% confidence interval of the difference in proportions?

  • $\hat{p}_1-\hat{p}_2 \pm \text{critical value} \times SE$

  • $-0.548 \pm z^* \times 0.0269$

  • $-0.548 \pm 1.96 \times 0.0269$

  • $-0.548 \pm 0.0528$

What can you conclude from this - how can you state the results? What are some factors that are omitted?
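The interval above can be computed end to end; a sketch in Python using the slide's counts:

```python
import math

p_m, p_f = 135 / 658, 292 / 388           # survival proportions from the slide
se = math.sqrt(p_m * (1 - p_m) / 658 + p_f * (1 - p_f) / 388)
diff = p_m - p_f                          # ≈ -0.548
z_star = 1.96                             # 95% critical value
lo, hi = diff - z_star * se, diff + z_star * se
print(f"{lo:.3f} {hi:.3f}")               # interval ≈ (-0.600, -0.495)
```

The interval excludes zero, which is what lets us state that male and female survival rates plausibly differ.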

CI for the difference between two proportions/means

  • First find the two-sample $z$/$t$ interval for the difference in proportions/means
  • Then apply the two-sample $z$/$t$ test
  • Interval looks like others we have seen
    • $\bar{y}_1-\bar{y}_2 \pm ME$
    • $ME = t^*/z^* \times SE(\bar{y}_1-\bar{y}_2)$
  • Uses the $z$ model (proportion) or Student’s $t$ model (mean)
  • The degrees of freedom for $t$ are complicated, so just use a computer
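The "complicated" degrees of freedom the computer evaluates come from the Welch–Satterthwaite approximation, which can be sketched as:

```python
def welch_df(s1, n1, s2, n2):
    """Welch-Satterthwaite approximation to the t degrees of freedom."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    return (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Equal SDs and equal n = 10 give df = 18, matching the pooled test
print(round(welch_df(1.0, 10, 1.0, 10), 6))  # 18.0
```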

Sampling distribution for the difference between two means

  • When the conditions are met, the sampling distribution of the standardized sample difference between the means of two independent groups is

    • $t = \frac{(\bar{y}_1-\bar{y}_2) - (\mu_1 - \mu_2)}{SE(\bar{y}_1-\bar{y}_2)}$
  • Uses the Student’s $t$ model

  • Degrees of freedom are found with a special formula

  • Think carefully here about what we are modeling

Assumptions

  • Independence assumption:
    • Within each group, individual responses should be independent of each other.
    • Knowing one response should not provide information about other responses.
  • Randomization condition:
    • If responses are selected with randomization, their independence is likely.
  • Independent Groups Assumption
    • Responses in the two groups are independent of each other.
    • Knowing how one group responds should not provide information about the other group.

Assumptions continued

  • Nearly normal condition
    • Check this for both groups
    • A violation by either one violates the condition
    • $n < 15$ in either group: should not use these methods if the histogram or Normal probability plot shows severe skewness
    • $n$ closer to 40 for both groups: a mildly skewed histogram is OK
    • $n > 40$ for both groups: fine as long as there are no extreme outliers or extreme skewness

Confidence interval formally

  • When the conditions are met, the confidence interval for the difference between means from two independent groups is

    • $(\bar{y}_1-\bar{y}_2) \pm t^*_{df} \times SE(\bar{y}_1-\bar{y}_2)$

    • where $SE(\bar{y}_1-\bar{y}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$

  • The critical value $t^*_{df}$ depends on the confidence level $C$

The two-sample $t$ test: testing for the difference between two means

A two-sample $t$ test for difference between means

  • Conditions same as the two-sample $t$ interval
    • $H_0: \mu_1-\mu_2 = \Delta_0$ ($\Delta_0$ usually $0$)
  • When the conditions are met and the null hypothesis is true, use the Student’s $t$ model to find the $p$ value.
    • $t = \frac{(\bar{y}_1-\bar{y}_2) - \Delta_0}{SE(\bar{y}_1-\bar{y}_2)}$
    • $SE(\bar{y}_1-\bar{y}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
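The test statistic is a small function of the summary statistics. A minimal sketch (the numbers passed in are hypothetical, for illustration only):

```python
import math

def two_sample_t(ybar1, s1, n1, ybar2, s2, n2, delta0=0.0):
    """t statistic for H0: mu1 - mu2 = delta0."""
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    return (ybar1 - ybar2 - delta0) / se

# Hypothetical summary statistics, for illustration only
print(two_sample_t(10.0, 2.0, 100, 9.0, 2.0, 100))  # ≈ 3.54
```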

Step by step example

  • Is there a difference in housing price depending on whether the house has a view?
    • Think $\rightarrow$
      • Plan: I have housing prices from many thousands of houses in King County, assumed to have been sampled randomly.
      • Hypotheses
      • $H_0: \mu_w-\mu_{notw}=0$
      • $H_a: \mu_w-\mu_{notw}\ne0$

Step by step example

  • Think $\rightarrow$
    • Mean price of house with a view: 1772071; mean price no view: 1139608
    • Model:
      • Randomization Condition: Were subjects assigned to treatment groups randomly?
      • Independent Groups Assumption: Does the sampling method give independent groups?
      • Nearly normal condition: Are the histograms reasonably unimodal and symmetric?
      • Are the assumptions and conditions reasonable?

After analyzing these assumptions, are we justified in using the Student’s $t$ model to perform a two-sample $t$ test?

Step by step example

  • Show
    • Mechanics
      • Mean price of house with a view: 1772071
      • Mean price no view: 1139608
      • SD of view: 1128502
      • SD no view: 823799
      • $n$ of view: 433
      • $n$ of no view: 21504

What is the formula we should use in the next step?

Step by step example

  • Show
    • Mechanics
      • Solve for the SE: $SE(\bar{y}_1-\bar{y}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
      • Solve for the SE: $SE(\bar{y}_1-\bar{y}_2) = \sqrt{\frac{1128502^2}{433} + \frac{823799^2}{21504}}$
      • Solve for the SE: $SE(\bar{y}_1-\bar{y}_2) \approx 54523$
      • Find the $t$ score: $t = \frac{(\bar{y}_1-\bar{y}_2) - \Delta_0}{SE(\bar{y}_1-\bar{y}_2)}$
      • Find the $t$ score: $t = \frac{(1772071-1139608) - 0}{54523}$
      • Find the $t$ score: $t = 11.6$
      • Find the $p$ value: can use a table with $df = 433$; $p \approx 0$
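These mechanics can be reproduced from the summary statistics alone. A sketch assuming scipy is available (equal_var=False requests the Welch version, matching the SE formula used above):

```python
from scipy import stats

# Summary statistics from the slide (King County houses)
t, p = stats.ttest_ind_from_stats(
    mean1=1772071, std1=1128502, nobs1=433,    # view
    mean2=1139608, std2=823799, nobs2=21504,   # no view
    equal_var=False)                           # Welch's t test
print(round(t, 1), p)  # t ≈ 11.6; p is vanishingly small
```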

Step by step example

Alternatively, we can use the built-in $t$ test function

  • Show

What can we conclude, based on the results of this $t$ test? What are some assumptions of this $t$ test that may be violated?

Step by step example

  • Tell $\rightarrow$
    • Conclusion: the $p$ value $\approx 0$ is less than any reasonable significance level
    • If there were no difference in the mean prices, a difference this large would occur less than once in millions of samples
    • Too rare to believe it happened by chance? Yes
    • Reject $H_0$? Yes
    • Conclude that houses with a view are more expensive than houses without? Yes

Experiments

Independence

  • Independence assumption:
    • Within each group, individual responses should be independent of each other.
    • Knowing one response should not provide information about other responses.
  • Randomization condition:
    • If responses are selected with randomization, their independence is likely.
  • Independent Groups Assumption
    • Responses in the two groups are independent of each other.
    • Knowing how one group responds should not provide information about the other group.

The importance of the counterfactual

  • For causal inference, one should ask the counterfactual question: for those who received “treatment”, what would have happened to them if they hadn’t been treated?

  • That is, we only observe one state of reality (e.g., the child ate more vegetables), but we want to know the DIFFERENCE the treatment had on the person by asking what would have happened if they did NOT receive the treatment

The importance of the counterfactual

  • More formally, we are interested in the difference the treatment has on the response variable (Health)

  • For a child ($y_1$) that did receive more vegetables, we want to consider what would have happened if they had NOT had the vegetables, and find the treatment effect

    • Or, $y_1^t - y_1^c$ = treatment effect ($t$ denoting treatment; $c$ denoting control)
    • Note that $y_1^t$ is observed, but $y_1^c$ is not.
  • The problem is one of missing data – how do we estimate $y_1^c$?
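The potential-outcomes setup can be made concrete with a small simulation (all numbers are hypothetical). Because the code generates both $y^t$ and $y^c$ for everyone, it can show what real data hides: under random assignment, the observed group difference lands close to the true effect.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
true_effect = 2.0

# Simulated potential outcomes: y_c is health without the treatment,
# y_t is health with it. In real data only one of the two is observed.
y_c = rng.normal(50, 10, n)
y_t = y_c + true_effect

treated = rng.random(n) < 0.5  # random assignment
observed_diff = y_t[treated].mean() - y_c[~treated].mean()
print(observed_diff)  # close to true_effect under randomization
```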

Comparability problems

  • If subjects who receive treatment and those who do not differ in some important characteristics, we have selectivity bias – e.g. higher-SES children were more likely to be in the vegetable treatment group

    • Violates the independent groups assumption: if rich children are more likely to be in the “eats vegetables” group, we know the observed value of the response variable, $health$, is likely to be higher
    • Knowing which group they are in gives us some knowledge of what their observed value of $y$ will be
  • Often called “omitted variable bias.”

  • Big problem in observational studies – many variables are probably not present that we’d like to know

  • What are some omitted variables that might bias our finding that houses on the view have a higher price than houses not on the view?
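The selectivity-bias story can also be simulated. A sketch with hypothetical numbers: the true vegetable effect is 2, but SES raises both the chance of treatment and health, so the naive group comparison overstates the effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
ses = rng.normal(0, 1, n)              # socioeconomic status (confounder)
prob_veg = 1 / (1 + np.exp(-2 * ses))  # richer kids more likely treated
eats_veg = rng.random(n) < prob_veg
health = 50 + 5 * ses + 2 * eats_veg + rng.normal(0, 5, n)  # true effect = 2

naive_diff = health[eats_veg].mean() - health[~eats_veg].mean()
print(naive_diff)  # well above 2: SES inflates the apparent effect
```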

Experiments

  • Experiments solve the omitted variable bias problem
  • Random assignment of treatment and control status ensures that subjects differ ON AVERAGE only in the treatment they receive
  • We can then compute the Average Treatment Effect (ATE) of being in the treatment vs. the control group
  • A $t$ test between the treatment and control groups will therefore be accurate
  • In observational studies, it is very rare to be able to guarantee that assignment to the two groups is independent of the response variable.
    • If important omitted variables influence assignment to the two groups, we need to control for these omitted variables via a multiple regression

Drawbacks of experiments

  • Lack of generalizability – Often done on college students or in contrived settings (external validity)
  • Cost – very expensive to run a full experiment
  • Ethics – why shouldn’t we give positive treatments to everyone?
  • Mechanically complicated
    • Difficult to ensure proper randomization
    • Difficult to design appropriate treatments
    • Difficult to develop appropriate measurements

Multiple regression

  • Multiple Regression
    • Attempts to control for, or estimate, the treatment effect of each variable included, INDEPENDENT of the other variables
    • How sure are we of the treatment effect?
    • $t$ test of the slope of the regression line
    • Null hypothesis: the treatment variable (“eats vegetables”) makes no difference to the response variable
    • A non-zero slope indicates that differences in treatment produce differences in the response variable (increase education $\rightarrow$ increase in wages)
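The controlling logic above can be sketched with an ordinary least-squares fit (the data are hypothetical; the confounded "eats vegetables" assignment mirrors the earlier SES example). Including SES as a second predictor lets the regression recover roughly the true treatment effect that the naive comparison overstated.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
ses = rng.normal(0, 1, n)                                   # confounder
eats_veg = (rng.random(n) < 1 / (1 + np.exp(-2 * ses))).astype(float)
health = 50 + 5 * ses + 2 * eats_veg + rng.normal(0, 5, n)  # true effect = 2

# Multiple regression: health ~ intercept + ses + eats_veg
X = np.column_stack([np.ones(n), ses, eats_veg])
coef, *_ = np.linalg.lstsq(X, health, rcond=None)
print(coef)  # the slope on eats_veg is close to the true effect of 2
```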