Confidence Intervals - Proportions

Author: Professor MacDonald

Published: April 9, 2025

Confidence Intervals

  • The sampling distribution model for a proportion

  • When does the normal model work?

  • Confidence interval for a proportion

  • Interpreting confidence intervals

  • Margin of error: certainty vs. precision

The sampling distribution model for a proportion

Sampling model

  • Draw samples at random, $n = 100$

  • Samples vary

  • Can’t draw all possible samples of size 100; the number is astronomical

  • Draw a few thousand samples

  • Distribution is called the sampling distribution of the proportion.

What shape do you think the sampling distribution will have if we have sample size $n = 100$?
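
A minimal sketch (not from the slides) of this sampling idea in Python: it draws a few thousand samples of size $n = 100$ from a population whose true proportion is assumed to be 0.853 (the value that appears on a later slide) and summarizes the resulting sample proportions.

```python
# Sketch: simulate the sampling distribution of a proportion.
# Assumed values: p = 0.853 (true proportion), 5000 repeated samples.
import numpy as np

rng = np.random.default_rng(1)

p = 0.853          # assumed true population proportion
n = 100            # sample size
n_samples = 5000   # "a few thousand samples"

# each sample proportion = (number of successes in n draws) / n
p_hats = rng.binomial(n, p, size=n_samples) / n

print("mean of sample proportions:", round(p_hats.mean(), 3))  # close to p
print("sd of sample proportions:  ", round(p_hats.std(), 3))   # close to sqrt(p*(1-p)/n) ≈ 0.035
```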

Graph of a sampling distribution

  • Remember, this is a graph of the sample proportions, not of the actual data distribution

Random matters

  • Sampling distribution for a proportion

    • Symmetric - check

    • Unimodal - check

    • Centered at $p$: 0.853

    • Standard deviation: 0.035

    • Follows the Normal model - check

The Normal model for sampling

  • Samples don’t all have the same proportion.

  • Normal model is the right one for sample proportions.

  • Modeling how sample statistics, proportions or means, vary from sample to sample is powerful.

  • Allows us to quantify that variation.

  • Make statements about corresponding population parameter.

  • Make a model for the random behavior, then understand and use that model.

Which Normal model to choose?

  • Reminder: the Normal model is $N(\mu, \sigma^2)$

  • $\mu$, the mean, is $p$, the proportion we want to estimate; $n$ is the sample size

  • For proportions, $\sigma(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}$

  • This is the standard deviation of the SAMPLING DISTRIBUTION, that is, the distribution of $\hat{p}$ across infinitely many samples
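
As a quick check of the formula above (a sketch, using the values from the earlier slide), the standard deviation of the sampling distribution for $p = 0.853$ and $n = 100$ works out to about 0.035:

```python
# Sketch: SD of the sampling distribution, sigma = sqrt(p*(1-p)/n).
import math

p, n = 0.853, 100                  # values from the slides
sigma = math.sqrt(p * (1 - p) / n)
print(round(sigma, 3))             # ≈ 0.035
```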

Mean and standard deviation

Reminder - Normal model rule

  • Using this Normal model rule, we can tell how likely it is to see a certain $\hat{p}$ given the sampling distribution Normal model

  • Remember the 68–95–99.7 rule (1 SD, 2 SD, 3 SD); for other distances use technology (see the sketch after this list)

  • Most common: 95% of samples have sample proportion within two standard deviations of the true population proportion.

  • Knowing the sampling distribution tells us how much variation to expect

  • Called the sampling error in some contexts

  • Not really an error, just variability

  • Better to call it sampling variability
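
A sketch of what "use technology" can look like (scipy is one common choice; any Normal-probability tool works): the area of the Normal model within z standard deviations of the mean.

```python
# Sketch: Normal-model probabilities for distances other than 1, 2, or 3 SDs.
from scipy.stats import norm

for z in (1, 2, 3, 1.5, 2.5):
    prob = norm.cdf(z) - norm.cdf(-z)   # area within z SDs of the mean
    print(f"within {z} SD: {prob:.4f}")
# z = 1, 2, 3 reproduce the 68–95–99.7 rule; other distances need a table or computer.
```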

When does the normal model work?

  • Independence Assumption: check that the data were collected in a way that makes this assumption plausible

  • Randomization Condition: subjects randomly assigned treatments, or survey is simple random sample

  • 10% Condition: sample size less than 10% of the population size

  • Success/Failure Condition: there must be at least 10 expected successes and 10 expected failures: $n\hat{p}\geq10$ and $n(1-\hat{p})\geq10$

When does the normal model fail for the sampling distribution?

  • $p$ close to 0 or 1

  • People in this class who can dunk a basketball

  • Sample size 100

    • If the true $p = 0.001$, then probably none in a sample of 100
  • If we simulated samples of size 100 with $p = 0.001$

    • Distribution skewed right, can’t rely on Normal model percentages anymore
  • $n$ is fine, but $p$ is too small

What will the shape of the sampling distribution look like if $p = 0.001$?

Example simulation
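
One way such a simulation could be run (a sketch, with an assumed 5000 repeated samples): when $p = 0.001$ and $n = 100$, almost every sample has zero successes, so the distribution of sample proportions is strongly right-skewed and the Normal model's percentages no longer apply.

```python
# Sketch: sampling distribution when p is very small (p = 0.001, n = 100).
import numpy as np

rng = np.random.default_rng(2)
p, n, n_samples = 0.001, 100, 5000

p_hats = rng.binomial(n, p, size=n_samples) / n
values, counts = np.unique(p_hats, return_counts=True)
for v, c in zip(values, counts):
    print(f"p_hat = {v:.2f}: {c} samples")   # nearly all of the mass sits at 0.00
```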

Class sampling exercise

  • We know that about 50% of students at DKU plan to major, or have already selected a major, in the natural sciences

  • ? % of students in our class plan to major in the natural sciences

    • Is our class proportion unusually small?
  • Check conditions

    • Randomization condition
    • 10% condition
    • Success failure condition

Find how far we are from the population mean

  • Population standard deviation formula is:
    • $\sqrt{\frac{p(1-p)}{n}}$
    • $\hat{p}$ is the proportion of yeses
    • $n$ is the sample size
  • We are calculating using the SD of the population sampling distribution since we know it
    • If we don’t know the SD of the population sampling distribution, we have to use a different strategy, but that is not the case here
  • Knowing the SD, we can create a z score for the difference between our class and the population
    • The z score is how many SDs our class is from the population mean
      • $z = (\text{class proportion} - \text{DKU mean}) / SD$

Normal distribution percentages

Calculation for our class

  • $\sqrt{\frac{p(1-p)}{n}}$

  • $\hat{p}$ is the proportion of yeses

  • $n$ is the sample size

  • $z = \frac{\hat{p} - p}{SD(\hat{p})}$

  • 68–95–99.7 Rule: Values ? SDs above the mean occur less than ?% of the time. Our class mean appears to be far from / near the population mean

Calculate how likely our result would be if our class were a random sample of DKU students.
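
A sketch of that calculation in Python. The DKU proportion $p = 0.50$ comes from the slide; the class size and class proportion below are placeholders (the slide leaves them as "?"), so substitute the real values.

```python
# Sketch: z score for the class proportion relative to the DKU population.
import math
from scipy.stats import norm

p = 0.50            # DKU proportion of natural-science majors (from the slide)
n = 25              # placeholder class size: substitute the real value
p_hat_class = 0.40  # placeholder class proportion: substitute the real value

sd = math.sqrt(p * (1 - p) / n)    # SD of the sampling distribution (p is known here)
z = (p_hat_class - p) / sd         # how many SDs the class is from the DKU mean
tail = 2 * norm.cdf(-abs(z))       # two-sided probability of a result at least this extreme
print(f"SD = {sd:.3f}, z = {z:.2f}, P(result this extreme) = {tail:.3f}")
```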

Confidence intervals of proportions

Standard errors for proportions

  • What is the sampling distribution?

  • Usually we do not know the population proportion $p$.

  • Therefore, we cannot find the standard deviation of the sampling distribution, $\sqrt{\frac{p(1-p)}{n}}$

  • After taking a sample, we only know the sample proportion, which we use in place of $p$; the resulting estimate of the SD is called the standard error

    • $SE(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$

Example: bedrooms

  • Draw a random sample of 100 houses
  • $\sqrt{\frac{\hat{p}(1-\hat{p})}{100}}$

  • The sampling distribution should be approximately normal

What is a confidence interval?

  • Confidence interval: a way to express the range of plausible values for the parameter (in this case, the proportion of homes with at least three bedrooms)

  • We never know the true value, but we want to say something about how wide the range of plausible values is

  • What is a reasonable range?

    • Traditionally, the middle 95% of the sampling distribution (about two standard errors)
    • Mean of our sample ±\pm range of possible values we could get if we took additional samples

Example: bedrooms

  • Our mean: 0.87

  • Our estimated sampling distribution standard error:

    • $\sqrt{\frac{\hat{p}(1-\hat{p})}{100}}$
    • $\sqrt{\frac{0.87(1-0.87)}{100}}$
    • $\sqrt{\frac{0.1131}{100}}$
    • $\sqrt{0.001131}$
    • $\approx 0.0336$
  • A range of reasonable values if we sampled this again:

    • $2\times0.0336 \approx 0.067$
    • $0.87\pm0.067$

Statement: we are ~95% confident that this interval contains the true proportion of houses with three or more bedrooms in the population
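
A sketch of the calculation above, so the arithmetic can be checked directly:

```python
# Sketch: standard error and ~95% (2 SE) interval for the bedrooms example.
import math

p_hat, n = 0.87, 100
se = math.sqrt(p_hat * (1 - p_hat) / n)        # ≈ 0.0336
lower, upper = p_hat - 2 * se, p_hat + 2 * se
print(f"SE = {se:.4f}")
print(f"~95% interval: ({lower:.3f}, {upper:.3f})")   # ≈ (0.803, 0.937)
```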

Critical values

  • Critical values are the cutoff we use to determine what is ‘reasonable’

  • Derived from the Normal model

  • Can use any z-score as a cutoff

  • Corresponding multiplier of the SE is called the critical value.

  • For the Normal model, the critical value for this interval is denoted $z^*$.

  • To find it, we need to use a computer, a calculator, or a Normal probability table
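
For example, a sketch of finding $z^*$ with a computer (scipy here, but any Normal-probability tool works): $z^*$ is the point that leaves $(1 - C)/2$ in each tail of the standard Normal model.

```python
# Sketch: critical values z* for a few common confidence levels.
from scipy.stats import norm

for C in (0.90, 0.95, 0.99):
    z_star = norm.ppf(1 - (1 - C) / 2)
    print(f"C = {C:.0%}: z* = {z_star:.3f}")   # 95% gives z* ≈ 1.960
```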

Recap

  • Make sure the conditions are met, then find the level C confidence interval for $p$, the population proportion, using our estimate $\hat{p}$

  • The confidence interval is defined as $\hat{p}\pm z^* \times SE(\hat{p})$

  • $SE(\hat{p})$ is estimated by $\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$

  • $z^*$ specifies the number of SEs needed for C% of random samples to yield confidence intervals that capture the true parameter
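
The recap formula, written as a small helper function (a sketch; the function name and arguments are illustrative, not from the slides):

```python
# Sketch: a one-proportion z-interval, p_hat ± z* × SE(p_hat).
import math
from scipy.stats import norm

def one_proportion_z_interval(successes: int, n: int, confidence: float = 0.95):
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)       # standard error of p_hat
    z_star = norm.ppf(1 - (1 - confidence) / 2)   # critical value for level C
    return p_hat - z_star * se, p_hat + z_star * se

# 87 of 100 sampled houses with at least three bedrooms, as in the example
print(one_proportion_z_interval(87, 100))   # ≈ (0.804, 0.936)
```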

What you cannot say about $p$ from the sample

  1. “0.87 of all houses in King County have at least three bedrooms.”
  • No. Observations vary. Another sample would yield a different sample proportion.
  2. “It is probably true that 0.87 of all houses in King County have at least three bedrooms.”
  • No again. In fact, even if we didn’t know the true proportion, we’d know that it’s probably not exactly 0.87.

What you cannot say about $p$ from the sample

  3. “We don’t know exactly what proportion of houses in King County have at least three bedrooms, but we know that it’s within the interval $0.87\pm2\times0.0336$.”
  • No, but getting closer. We don’t know this for sure.
  4. “We don’t know exactly what proportion of houses in King County have at least three bedrooms, but the interval from 0.803 to 0.937 probably contains the true proportion.”
  • Right, but we can be more precise. We should specify how confident we are, not just say “probably.”

What you can say about $p$ from the sample

  5. “We are 95% confident that between 0.803 and 0.937 of houses in King County have at least three bedrooms.”
  • Statements like these are called confidence intervals. They’re the best we can do.

Naming the confidence interval

  • This confidence interval is a one-proportion z-interval.

    • “One” since there is a single proportion being estimated.
    • “Proportion” since we are interested in the proportion of the population.
    • “z-interval” since the distance of the interval relies on a normal sampling distribution model.

Interpreting confidence intervals

Capturing a proportion

  • The confidence interval may or may not contain the true population proportion.

  • Consider repeating the study over and over again, each time with the same sample size.

  • Each time we would get a different $\hat{p}$

  • From each $\hat{p}$, a different confidence interval could be computed.

  • About 95% of these confidence intervals will capture the true proportion.

  • 5% will be duds.

Random matters - confidence intervals

  • There are a huge number of confidence intervals that could be drawn.

  • In theory, all the confidence intervals could be listed.

    • 95% will “work” (capture the true proportion).
    • 5% will be “duds” (not capture the true proportion).
  • What about our confidence interval (0.803, 0.937)?

    • In this case, we can find out the true value
    • Most of the time we never know

Random matters - confidence intervals

100 samples CI
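
A sketch of the "100 samples" picture in code (assumed values: true proportion 0.87, sample size 100): build a 95% interval from each sample and count how many capture the truth. Roughly 95 of the 100 should "work".

```python
# Sketch: coverage of 100 confidence intervals built from repeated samples.
import numpy as np

rng = np.random.default_rng(3)
p_true, n, n_intervals = 0.87, 100, 100   # assumed/illustrative values

p_hats = rng.binomial(n, p_true, size=n_intervals) / n
se = np.sqrt(p_hats * (1 - p_hats) / n)
captured = (p_hats - 1.96 * se <= p_true) & (p_true <= p_hats + 1.96 * se)
print(f"{captured.sum()} of {n_intervals} intervals captured p = {p_true}")
```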

Margin of error: certainty vs. precision

Margin of error

  • Confidence interval for a population proportion: $\hat{p} \pm 2\times SE(\hat{p})$

  • The distance, $2\times SE(\hat{p})$, from $\hat{p}$ is called the margin of error

  • Confidence intervals can be applied to many statistics, not just proportions. Means, regression slopes, and other quantities can also have confidence intervals.

    • In general, a confidence interval has the form estimate ±\pm margin of error

Certainty vs. precision

  • Competing goals
    • For more certainty, we need to capture $p$ more often, so we need to make the interval wider.
    • For more precision, we need to provide tighter bounds on our estimate for $p$, so we need to make the interval narrower.
  • Instead of a 95% confidence interval, any percent can be used.
    • Increasing the confidence (e.g. 99%) increases the margin of error.
      • Need to make our range wider to make sure we don’t ‘miss’
    • Decreasing the confidence (e.g. 90%) decreases the margin of error.
      • Need to make our range smaller so as to be more specific about our guess
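
A sketch of the trade-off with numbers (using the bedrooms example's $\hat{p} = 0.87$, $n = 100$): for a fixed sample, raising the confidence level raises $z^*$ and therefore widens the margin of error.

```python
# Sketch: margin of error at different confidence levels for a fixed sample.
import math
from scipy.stats import norm

p_hat, n = 0.87, 100
se = math.sqrt(p_hat * (1 - p_hat) / n)

for C in (0.90, 0.95, 0.99):
    z_star = norm.ppf(1 - (1 - C) / 2)
    print(f"{C:.0%} confidence: ME = {z_star * se:.3f}")   # ME grows with C
```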

What sample size?

  • Can increase both certainty and precision by increasing sample size

  • For 95%, $z^* = 1.96$

  • The value of $\hat{p}$ that makes the ME largest is $\hat{p}=0.5$

  • If we want to ensure, say, a margin of error of less than 3%

    • $ME = z^*\times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
    • $0.03 = 1.96\times \sqrt{\frac{(0.5)(0.5)}{n}}$
  • Solving for $n$ gives $n\approx1067.1$

  • We need to survey at least 1068 people to ensure an ME of less than 0.03 for the 95% confidence interval.
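
The same sample-size calculation as a sketch in code, solving $ME = z^*\sqrt{\hat{p}(1-\hat{p})/n}$ for $n$ with the worst case $\hat{p} = 0.5$:

```python
# Sketch: sample size needed for ME < 0.03 at 95% confidence.
import math

z_star, me, p_hat = 1.96, 0.03, 0.5
n = (z_star ** 2) * p_hat * (1 - p_hat) / me ** 2
print(round(n, 1))       # ≈ 1067.1
print(math.ceil(n))      # round up: survey at least 1068 people
```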

Thoughts on sample size and ME

  • Obtaining a large sample size can be expensive and/or take a long time.

  • For a pilot study, an ME of 10% can be acceptable.

  • For full studies, an ME of less than 5% is better.

  • Public opinion polls typically use an ME of 3%, with $n = 1000$

  • If $p$ is expected to be very small, such as 0.005, then a much smaller ME, such as 0.1%, is required.

    • Common in medical studies