Lecture 1.3 - Advanced distributions

Author

Professor MacDonald

Published

March 24, 2025

More on distributions

Thoughts about comparing groups

  • Faceted histograms are a reasonable way to display distributions by a categorical variable
    • However, these displays become hard to interpret when the categorical variable has many levels
  • Side-by-side box plots are much easier to interpret
  • Box plots condense many important characteristics of a distribution into a summary display
  • Think carefully about how you treat outliers
  • Let’s view data from the 2023-2024 NBA season
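
As a rough illustration of what a box plot summarizes, the sketch below computes the five-number summary for each group and flags points beyond the whiskers. It assumes the common 1.5 × IQR whisker convention (the slides do not state which rule the plots use). The point totals are made-up values for illustration, not real NBA data, and the name `box_summary` is hypothetical.

```python
import statistics


def box_summary(values):
    """Five-number summary plus 1.5 * IQR fences and flagged outliers."""
    # method="inclusive" treats the data as the whole population of interest
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr  # assumed whisker rule (Tukey convention)
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "outliers": outliers,
    }


# Hypothetical points scored per game, grouped by win/loss (NOT real NBA data)
points_by_result = {
    "W": [118, 121, 109, 125, 131, 114, 120, 157],
    "L": [101, 98, 110, 105, 93, 112, 107, 104],
}

summaries = {result: box_summary(pts) for result, pts in points_by_result.items()}
for result, summary in summaries.items():
    print(result, summary)
```

A flagged point is only a candidate outlier. Whether to keep it is a judgment call about the data, not something the rule decides.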

Two group comparison

NBA side-by-side histograms of points scored by W/L

NBA boxplot comparison of points scored by W/L

NBA boxplot comparison of points scored by W/L (better)

Many group comparison

NBA side-by-side histograms of points scored by team

NBA boxplot comparison of points scored by team (better)

Your turn

  • Work with your neighbor to analyze a different set of statistics
    • Can be by division or not
    • Remember the key features of distributions
      • Shape
      • Center
      • Spread
  • Interpret your results

Checking outliers - assists

Outliers - assists

Assists > 40 - true outliers?

Checking outliers - points

Outliers - points

Points by team > 150 - true outliers?

In summary

  • Think about which kind of display is appropriate for comparing distributions
  • When conditioning on a categorical variable, boxplots are usually better
  • But boxplots lose information about shape, such as multiple modes
  • Think carefully about omitting outliers
  • Outliers may reveal important information about your dataset!

Titanic passengers and the Normal distribution

Titanic

Dataset of passengers on the Titanic

  • What are your expectations for how age should be distributed?
  • We are going to violate our first three rules:
    1. Make a picture
    2. Make a picture
    3. Make a picture

Were the passenger ages normally distributed?

To answer that question, we need some information about the distribution

Remember, our main information about distributions is:

  • Shape

  • Center

  • Spread

Information about age

  • Standard deviation: 14.4
  • Mean: 29.9
  • Normal model: \(N(\mu, \sigma) = N(29.9,14.4)\)
    • \(\mu\) is the theoretical mean
    • \(\sigma\) is the theoretical standard deviation
    • These values define the data-generating process
    • We only see some values from the data-generating process, but if we saw infinitely many, the mean would be \(\mu\) and the standard deviation would be \(\sigma\)
    • More on this in the second half of class
  • How can we check normality using this information?
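
One possible way to turn the Normal model into concrete age cutoffs is sketched below with Python's standard-library `statistics.NormalDist`. It plugs the mean and standard deviation from the slide into the model. The variable names (`age_model`, `bands`) are illustrative.

```python
from statistics import NormalDist

# Normal model from the slide: N(mu, sigma) = N(29.9, 14.4)
age_model = NormalDist(mu=29.9, sigma=14.4)

bands = {}
for k in (1, 2):
    low = age_model.mean - k * age_model.stdev
    high = age_model.mean + k * age_model.stdev
    # Share of the model's probability that falls between the two cutoffs
    share = age_model.cdf(high) - age_model.cdf(low)
    bands[k] = (low, high, share)
    print(f"within {k} sd: {low:.1f} to {high:.1f} years, model share {share:.3f}")
```

Under this model, about 68% of ages would fall between roughly 15.5 and 44.3 years, and about 95% between roughly 1.1 and 58.7 years.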

Checking normality

Thinking about normality

  • We can check normality by comparing the quantiles of our data with the known quantiles of the normal distribution
    • For a normal distribution, approximately 95% of the data lie within two standard deviations of the mean
    • Therefore, the lowest 2.5% of values lie more than two standard deviations below the mean, and the highest 2.5% lie more than two standard deviations above it
  • Similarly, we know the split for one standard deviation: about 16% below, 68% within, and 16% above
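
The comparison described above could be sketched as follows: count the share of observations below, within, and above k standard deviations of the sample mean, then set those shares next to what a normal distribution predicts. The helper names and the small list of ages are hypothetical, chosen only to make the example run. They are not the Titanic data.

```python
import statistics
from statistics import NormalDist


def band_shares(values, k):
    """Observed shares (below, within, above) k sample sds of the sample mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    n = len(values)
    below = sum(v < mean - k * sd for v in values) / n
    above = sum(v > mean + k * sd for v in values) / n
    return below, 1 - below - above, above


def normal_shares(k):
    """Shares (below, within, above) predicted by any normal distribution."""
    z = NormalDist()  # standard normal: mean 0, sd 1
    return z.cdf(-k), z.cdf(k) - z.cdf(-k), 1 - z.cdf(k)


# Hypothetical ages for illustration only
hypothetical_ages = [2, 18, 21, 22, 24, 25, 27, 28, 30, 31, 35, 36, 40, 45, 52, 64]

for k in (1, 2):
    observed = band_shares(hypothetical_ages, k)
    expected = normal_shares(k)
    print(k, [round(s, 3) for s in observed], [round(s, 3) for s in expected])
```

Large gaps between the observed and predicted shares suggest the data are not well described by a normal model.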

Data within standard deviations

Checking against the data

Histogram of ages from the data

Normality and scaling

  • Note that normality does not depend on the size of the standard deviation or the size of the mean
  • Could easily change the units to be months instead of years
    • Mean would be multiplied by 12
    • Standard deviation would be multiplied by 12
    • However, the proportion of observations within each standard deviation band would stay the same
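
A minimal sketch of the unit change, using made-up ages rather than the Titanic data: converting years to months rescales the mean and standard deviation by the same factor, so the share of observations within one standard deviation does not move. The helper `share_within` is a hypothetical name.

```python
import statistics


def share_within(values, k):
    """Share of observations within k sample sds of the sample mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return sum(abs(v - mean) <= k * sd for v in values) / len(values)


# Hypothetical ages for illustration only
ages_years = [2, 18, 21, 22, 24, 25, 27, 28, 30, 31, 35, 36, 40, 45, 52, 64]
ages_months = [12 * age for age in ages_years]

mean_ratio = statistics.fmean(ages_months) / statistics.fmean(ages_years)
sd_ratio = statistics.stdev(ages_months) / statistics.stdev(ages_years)
share_years = share_within(ages_years, 1)
share_months = share_within(ages_months, 1)

print(f"mean ratio: {mean_ratio:.1f}, sd ratio: {sd_ratio:.1f}")
print(f"share within 1 sd: years {share_years}, months {share_months}")
```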

Final thoughts on normality

When is the normal distribution useful?

  • When we know a data-generating process is normally distributed, we don’t even need to sample the population
    • Can find out exactly how much of the data lies within a given number of standard deviations of the mean
  • When we expect a data-generating process to be normally distributed, can test for deviations from normality
    • In the case of Titanic passengers, some parts of the distribution were more bunched up, others more spread out
  • Many of our statistical techniques require, or work better with, data that are ‘roughly’ normal
    • Will detail these in the coming weeks
  • We can transform our data to be closer to normal
    • Note that transformations can only correct skew; they won’t help if the data have multiple modes
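
To make the idea of correcting skew concrete, here is a small sketch that applies a log transformation to right-skewed values and compares skewness before and after. The log is just one commonly used choice, not necessarily the one intended for any particular variable in this lecture. The data are made up, and `skewness` is a hypothetical helper that averages the cubed z-scores.

```python
import math
import statistics


def skewness(values):
    """Average cubed z-score; positive means a longer right tail."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - mean) / sd) ** 3 for v in values)


# Hypothetical right-skewed values (all positive, so the log is defined)
raw = [1, 2, 2, 3, 3, 4, 5, 8, 13, 40]
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)

print(f"skewness before: {skew_raw:.2f}")
print(f"skewness after log: {skew_log:.2f}")
```

In this example the log pulls in the long right tail, so the skewness moves closer to zero without reaching it.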

What transformation would be helpful for age?