Lecture 3.2 - Confidence Intervals - Means

Author

Professor MacDonald

Published

April 14, 2025

Confidence intervals - means

  • The central limit theorem
  • A confidence interval for the mean
  • Interpreting confidence intervals
  • Picking our interval up by our bootstraps
  • Thoughts about confidence intervals

House price revisited

  • Prices in King County Houses:
    • 21937 houses
    • Highly right skewed
    • Can define this as the entire population
    • Prices are quantitative

House price graph

Distribution

  • Distribution:

    • Min: 75000
    • Q1: 685000
    • Med: 906000
    • Q3: 1355000
    • Max: 23000000
    • Mean: 1152092
    • SD: 835505
  • Highly right skewed

  • SD almost as large as the median

  • If a distribution looks like this, what do you think the sampling distribution will look like when n=25? How about when n=200?

The central limit theorem

  • The Central Limit Theorem
    • The sampling distribution of any mean becomes nearly Normal as the sample size grows.
  • Requirements
    • Observations independent
    • Randomly collected sample
  • The sampling distribution of the means is close to Normal if either:
    • Large sample size
    • Population close to Normal

Samples = 100, $n$ = 200

Samples = 1000, $n$ = 200

Samples = 100000, $n$ = 200

Sampling distribution shape

  • As the number of samples taken grows toward infinity, the shape of the sampling distribution becomes more clearly Normal

  • With very few exceptions, the shape of the underlying distribution doesn’t matter

  • How about holding the number of samples fixed and changing $n$ in our sample of a skewed distribution?

$n$ = 10

$n$ = 25

$n$ = 50

$n$ = 100
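The effect of increasing $n$ can be sketched with a quick simulation. This is a hypothetical sketch using NumPy with a lognormal stand-in for the skewed price population (the real King County data is not reproduced here; the parameters are only chosen to mimic its scale and skew):

```python
import numpy as np

rng = np.random.default_rng(42)

# Right-skewed stand-in population (lognormal); 21937 matches the slide's count.
population = rng.lognormal(mean=13.7, sigma=0.6, size=21937)

def sd_of_sample_means(pop, n, n_samples=10_000):
    """Draw n_samples samples of size n and return the SD of their means."""
    idx = rng.integers(0, len(pop), size=(n_samples, n))
    return pop[idx].mean(axis=1).std()

sds = {n: sd_of_sample_means(population, n) for n in (10, 25, 50, 100)}
for n, sd in sds.items():
    print(f"n = {n:3d}: SD of sample means ≈ {sd:,.0f}")
```

The spread of the sample means shrinks roughly like $\sigma/\sqrt{n}$: quadrupling $n$ halves the SD of the sampling distribution.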

Central limit theorem formally

  • When a random sample is drawn from any population with mean $\mu$ and standard deviation $\sigma$, its sample mean, $\bar{y}$, has a sampling distribution with the same mean but whose standard deviation is $\frac{\sigma}{\sqrt{n}}$, and we write $\sigma(\bar{y})=SD(\bar{y})=\frac{\sigma}{\sqrt{n}}$

  • No matter what population the random sample comes from, the shape of the sampling distribution is approximately Normal as long as the sample size is large enough.

  • The larger the sample used, the more closely the Normal approximates the sampling distribution for the mean.

  • Practically, $n$ does not have to be very large for this to work in most cases

Practical issue with finding the sampling distribution sd

  • We almost never know $\sigma$

  • Natural thing is to use the sample standard deviation, $s$

  • With this, we can estimate the sampling distribution SD with SE:

    • $SE(\bar{y})=\frac{s}{\sqrt{n}}$
  • This formula works well for large samples, not so much for small

    • Problem: too much variation in the sample SD from sample to sample
  • For smaller $n$, need to turn to Gosset and a new family of models depending on sample size
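The sample-to-sample variation in $s$ can be seen directly by simulation. A minimal sketch, again using a hypothetical lognormal stand-in population, comparing how stable $SE = s/\sqrt{n}$ is for a small versus a moderate sample size:

```python
import numpy as np

rng = np.random.default_rng(1)
population = rng.lognormal(mean=13.7, sigma=0.6, size=21937)  # skewed stand-in

# How stable is SE = s/sqrt(n) from sample to sample?
cv = {}  # coefficient of variation of the SE estimate, by sample size
for n in (5, 50):
    ses = np.array([
        rng.choice(population, size=n, replace=False).std(ddof=1) / np.sqrt(n)
        for _ in range(2000)
    ])
    cv[n] = ses.std() / ses.mean()
    print(f"n = {n:2d}: SE varies with CV ≈ {cv[n]:.2f}")
```

The relative variability of the SE estimate is much larger at $n=5$ than at $n=50$, which is exactly why small samples need Gosset's correction.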

A confidence interval for the mean

Gosset the brewer

Guinness

Gosset

What Gosset discovered

  • At Guinness, Gosset experimented with beer.

  • The Normal Model was not right, especially for small samples.

  • Still bell shaped, but details differed, depending on $n$

  • Came up with the “Student’s $t$ Distribution” as the correct model

A practical sampling distribution model

  • When certain assumptions and conditions are met, the standardized sample mean is:

$t=\frac{\bar{y}-\mu}{SE(\bar{y})}$

  • The $t$ score indicates that the result should be interpreted with a Student’s $t$ model with $n-1$ degrees of freedom

  • We can estimate the standard deviation of the sampling distribution by:

$SE(\bar{y}) = \frac{s}{\sqrt{n}}$

Degrees of freedom

  • For every sample size $n$, there is a different Student’s $t$ distribution

  • Degrees of freedom: $df=n-1$

  • Similar to the $n-1$ calculation for sample standard deviation

  • Reason for this is a bit complicated; at this point can just remember to specify the $t$ distribution with $n-1$ degrees of freedom

Student’s $t$

One sample $t$ interval for the mean

  • When the assumptions are met, the confidence interval for the mean is:

$\bar{y} \pm t^*_{n-1}\times SE(\bar{y})$

  • The critical value, $t^*_{n-1}$, depends on the confidence level, $C$, and the degrees of freedom, $n-1$
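The formula translates directly into code. A minimal sketch using SciPy (the function name `t_interval` and the toy data are hypothetical, not from the lecture):

```python
import numpy as np
from scipy import stats

def t_interval(sample, confidence=0.95):
    """One-sample t interval: ybar ± t*_{n-1} × s/sqrt(n)."""
    y = np.asarray(sample, dtype=float)
    n = len(y)
    se = y.std(ddof=1) / np.sqrt(n)                      # SE(ybar) = s/sqrt(n)
    t_star = stats.t.ppf((1 + confidence) / 2, df=n - 1) # critical value
    return y.mean() - t_star * se, y.mean() + t_star * se

# Hypothetical toy data
data = [4.1, 5.0, 3.8, 4.6, 5.2, 4.4, 4.9, 4.0]
lo, hi = t_interval(data)
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")
```

The same interval can be obtained from `scipy.stats.t.interval` with `loc` set to the sample mean and `scale` set to the standard error.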

Example: A one sample $t$ interval for the mean

  • Price from one sample in King County

Average house price

  • $\bar{y}\pm t^*_{19} \times SE(\bar{y})$

  • $1118400\pm 2.09 \times \frac{789011}{\sqrt{20}}$

  • $1118400\pm 2.09 \times 176428.33$

  • $[739538,\ 1497262]$
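The same calculation can be checked with SciPy, plugging in the slide's summary statistics ($\bar{y} = 1118400$, $s = 789011$, $n = 20$). Note that using the exact critical value rather than the rounded 2.09 may give endpoints that differ slightly from a hand calculation:

```python
import numpy as np
from scipy import stats

# Summary statistics from the slide's sample of 20 King County prices
ybar, s, n = 1118400, 789011, 20

se = s / np.sqrt(n)                    # 789011 / sqrt(20)
t_star = stats.t.ppf(0.975, df=n - 1)  # exact critical value for df = 19
margin = t_star * se

print(f"SE ≈ {se:,.2f}")
print(f"t* ≈ {t_star:.3f}")
print(f"CI:  [{ybar - margin:,.0f}, {ybar + margin:,.0f}]")
```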

What is the right way to talk about this confidence interval?

Thoughts about $z$ and $t$

  • The Student’s t distribution:

    • Is unimodal.
    • Is symmetric about its mean.
    • Bell-shaped
  • Smaller values of $df$ have longer tails and larger standard deviation than the Normal.

  • As $df$ increases, it looks more and more like the Normal.

  • Is needed because we are using $s$ as an estimate for $\sigma$

  • If you happen to know $\sigma$, which almost never happens, use the Normal model and not Student’s $t$

  • As $n$ becomes larger, it is still safe to use the $t$ distribution because it essentially turns into the Normal distribution
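The convergence of $t$ to the Normal is easy to see by comparing 95% critical values as $df$ grows. A quick sketch with SciPy:

```python
from scipy import stats

z_star = stats.norm.ppf(0.975)  # Normal critical value, ≈ 1.960
dfs = [2, 9, 19, 49, 99, 999]
t_stars = [stats.t.ppf(0.975, df) for df in dfs]
for df, t_star in zip(dfs, t_stars):
    print(f"df = {df:4d}: t* = {t_star:.3f} (excess over z*: {t_star - z_star:+.3f})")
```

The critical values shrink monotonically toward 1.96, so for large samples the $t$ interval and the $z$ interval are practically identical.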

Assumptions and conditions

  • Independence Assumption
    • Data values should be mutually independent
    • Example: weighing yourself every day
  • Randomization Condition: The data should arise from a random sample or suitably randomized experiment.
    • Data from SRS almost surely independent
    • If doesn’t satisfy Randomization Condition, think about whether values are independent and whether sample is representative of the population.

Assumptions and conditions

  • Normal Population Assumption
    • Nearly Normal Condition: Distribution is unimodal and symmetric.
    • Check with a histogram.
    • $n < 15$: data should follow a Normal model closely. If outliers or strong skewness, don’t use $t$-methods
    • $15 < n < 40$: $t$-methods work well as long as data are unimodal and reasonably symmetric.
    • $n > 40$: $t$-methods are safe as long as data are not extremely skewed.
    • Similar to the rule for proportions that must have somewhat even distribution of yeses and noes

Example: Checking Assumptions and Conditions for Student’s $t$

  • Price of housing in King County

    • Independence Assumption: Yes

    • Nearly Normal Condition: No

Interpreting confidence intervals

What not to say

Don’t say:

  • “95% of the prices of houses in King County are between $739538 and $1497262.”
    • The CI is about the mean price, not about the individual houses.
  • “We are 95% confident that a randomly selected house price will be between $739538 and $1497262.”
    • Again, we are concerned here with the mean, not individual houses

What not to say continued

Don’t Say

  • “The mean price is $1118400 95% of the time.”
    • The population mean never changes. Only sample means vary from sample to sample.
  • “95% of all samples will have a mean price between $739538 and $1497262.”
    • This interval does not set the standard for all other intervals. This interval is no more likely to be correct than any other.

What you should say

Do Say

  • “I am 95% confident that the true mean price is between $739538 and $1497262.”

    • Technically: “95% of all random samples will produce intervals that cover the true value.”

The first statement is more personal and less technical.
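The technical statement can be checked by simulation: build a population, draw many random samples, compute a 95% $t$ interval from each, and count how often the interval covers the true mean. A hypothetical sketch with a lognormal stand-in population:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
population = rng.lognormal(mean=13.7, sigma=0.6, size=21937)  # skewed stand-in
mu = population.mean()  # "true" mean, known only because we built the population

n, trials = 200, 2000
t_star = stats.t.ppf(0.975, df=n - 1)
covered = 0
for _ in range(trials):
    sample = rng.choice(population, size=n, replace=False)
    se = sample.std(ddof=1) / np.sqrt(n)
    lo, hi = sample.mean() - t_star * se, sample.mean() + t_star * se
    covered += lo <= mu <= hi

coverage = covered / trials
print(f"{coverage:.1%} of intervals covered the true mean")
```

With $n = 200$, close to 95% of the simulated intervals cover $\mu$, which is exactly what the confidence level promises.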

Bootstrapping

Picking our interval up by our bootstraps

Keep in mind

  • The confidence interval (unlike the sampling distribution) is centered at $\bar{y}$ rather than at $\mu$.

  • We need to know how far to reach out from $\bar{y}$, so we need to estimate the population standard deviation. Estimating $\sigma$ means we need to refer to Student’s $t$-models.

  • Using Student’s $t$ requires the assumption that the underlying data follow a Normal model.

    • Practically, we need to check that the data distribution of our sample is at least unimodal and reasonably symmetric, with no outliers for $n<100$.

Bootstrapping

Process:

  • We have a random sample, representative of population.
  • Make copies and build a pseudo-population
  • Sample repeatedly from this population
  • Find means
  • Make a histogram
  • Observe how means are distributed and how much they vary
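The steps above can be sketched in a few lines of NumPy. Resampling the sample with replacement is equivalent to sampling from the copied pseudo-population, so the copies never need to be built explicitly (the function name and toy sample here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(7)

def bootstrap_ci(sample, n_boot=10_000, confidence=0.95):
    """Percentile bootstrap CI for the mean: resample with replacement,
    compute each resample's mean, and read off the central quantiles."""
    y = np.asarray(sample, dtype=float)
    idx = rng.integers(0, len(y), size=(n_boot, len(y)))  # resample indices
    boot_means = y[idx].mean(axis=1)
    alpha = (1 - confidence) / 2
    return np.quantile(boot_means, [alpha, 1 - alpha])

# Hypothetical sample of 20 prices from a skewed distribution
sample = rng.lognormal(mean=13.7, sigma=0.6, size=20)
lo, hi = bootstrap_ci(sample)
print(f"bootstrap 95% CI: [{lo:,.0f}, {hi:,.0f}]")
```

The histogram of `boot_means` plays the role of the sampling distribution, and no Normality assumption or $t$ table is needed.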

Bootstrapping

How will this bootstrapping confidence interval compare to the confidence interval calculated by classical means?

Thoughts about confidence intervals

Confidence intervals - what’s important

  • It’s not their precision.
  • Our specific confidence interval is random by nature
  • Changes with the sample
  • Important to know how they are constructed
  • Need to check assumptions and conditions
  • Contains our best guess of the mean
  • And how precise we think that guess is