Lecture 1.3 - Advanced distributions

Author

Professor MacDonald

Published

March 24, 2025

More on distributions

Thoughts about comparing groups

Faceted histograms are a reasonable way to show distributions by a categorical variable
- However, these displays become hard to interpret when the categorical variable has many levels
Side-by-side box plots are much easier to interpret
Box plots condense many important characteristics of a distribution into a summary display
Think carefully about how you treat outliers
Let’s view data from the 2023-2024 NBA season
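A box plot is built from a five-number summary, so it can help to compute that summary per group before plotting. The sketch below uses only the Python standard library; the `points` values are made-up toy numbers, not real NBA data, and the "inclusive" quartile method is an assumption (plotting software may use a slightly different rule).

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    # "inclusive" treats the data as the whole population of interest;
    # this choice of quartile method is an assumption, not the lecture's.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))

# Hypothetical toy scores grouped by game result (NOT real NBA data).
points = {
    "W": [118, 121, 109, 125, 114, 130, 117],
    "L": [102, 110, 98, 107, 113, 95, 104],
}

summaries = {group: five_number_summary(vals) for group, vals in points.items()}
for group, summary in summaries.items():
    print(group, summary)
```

Each tuple corresponds to the whisker ends, box edges, and middle line of one box in a side-by-side display, before any outlier rule is applied.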

Two group comparison

NBA side-by-side histograms of points scored by W/L

NBA boxplot comparison of points scored by W/L

NBA boxplot comparison of points scored by W/L (better)

Many group comparison

NBA side-by-side histograms of points scored by team

NBA boxplot comparison of points scored by team (better)

Your turn

Work with your neighbor to analyze a different set of statistics
- Can be by division or not
- Remember the key features of distributions
  - Shape
  - Center
  - Spread
Interpret your results

Checking outliers - assists

Outliers - assists

Assists > 40 - true outliers?

Checking outliers - points

Outliers - points

Points by team > 150 - true outliers?

In summary

Think about which kind of display is appropriate for comparing distributions
When conditioning on a categorical variable, boxplots are usually better
But boxplots lose information about shape (for example, they can hide multiple modes)
Think carefully about omitting outliers
Outliers may reveal important information about your dataset!
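Before deciding whether to omit outliers, it helps to list them and look at each one. A minimal sketch follows, assuming the common box plot convention of flagging points more than 1.5 IQRs beyond the quartiles; the lecture does not state which rule its plots use, and the `assists` values are hypothetical toy numbers.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [v for v in values if v < low_fence or v > high_fence]

# Hypothetical assists per game for one team (NOT real NBA data).
assists = [22, 25, 27, 24, 26, 28, 23, 25, 41, 26, 24]

flagged = iqr_outliers(assists)
print("Flagged for a closer look:", flagged)
```

A flagged value is only a candidate: whether it is a data error or a genuinely unusual game is a judgment call that requires looking at the observation itself.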

Titanic passengers and the Normal distribution

Dataset of passengers on the Titanic

What are your expectations for how age should be distributed?

We are going to violate our first three rules:
1. Make a picture
2. Make a picture
3. Make a picture

Were the passenger ages normally distributed?

To answer that question, we need some information about the distribution

Remember, our main information about distributions is:

Shape
Center
Spread

Information about `age`

Standard deviation: 14.4
Mean: 29.9
Normal model: \(N(\mu, \sigma) = N(29.9,14.4)\)
- \(\mu\) is the theoretical mean
- \(\sigma\) is the theoretical standard deviation
- These values define the data generating process
- We only see some values from the data generating process, but if we saw infinitely many values, the mean would be \(\mu\) and the standard deviation would be \(\sigma\)
- More on this in the second half of class
How can we check normality using this information?
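As a starting point, the Normal model above can be written down directly and asked what it predicts. The sketch below uses `statistics.NormalDist` from the Python standard library with the mean and standard deviation from this slide; the variable names are illustrative.

```python
from statistics import NormalDist

# Normal model for age, using the mean and sd reported above.
age_model = NormalDist(mu=29.9, sigma=14.4)

bands = {}
for k in (1, 2):
    low = age_model.mean - k * age_model.stdev
    high = age_model.mean + k * age_model.stdev
    # Share of the model's probability between the two cutoffs.
    share = age_model.cdf(high) - age_model.cdf(low)
    bands[k] = (round(low, 1), round(high, 1), round(share, 3))
    print(f"within {k} sd: ages {low:.1f} to {high:.1f}, model share {share:.3f}")
```

If the ages really followed this model, about 68% of passengers would fall in the one-sd band and about 95% in the two-sd band, which gives concrete cutoffs to compare against the data.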

Checking normality

Thinking about normality

We can check normality by comparing the quantiles of our data with the known quantiles of the normal distribution
- We know approximately 95% of the data lies within two standard deviations of the mean
- Therefore, the lowest 2.5% of values lie more than two standard deviations below the mean, and the highest 2.5% lie more than two standard deviations above it
Similarly, we know the same information for data within one standard deviation of the mean (16% below, 68% within, 16% above)
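The comparison described above can be sketched as a small check: compute the share of observations within k standard deviations of the sample mean and set it beside the share a normal distribution would give. The `ages` list is a hypothetical toy sample, not the Titanic data, and the helper names are illustrative.

```python
from statistics import NormalDist, mean, stdev

def share_within(values, k):
    """Fraction of observations within k sample sds of the sample mean."""
    m, s = mean(values), stdev(values)
    return sum(abs(v - m) <= k * s for v in values) / len(values)

def normal_share(k):
    """Fraction of a normal distribution within k sds of its mean."""
    z = NormalDist()  # standard normal: mean 0, sd 1
    return z.cdf(k) - z.cdf(-k)

# Hypothetical toy ages (NOT the Titanic passenger data).
ages = [2, 18, 21, 22, 24, 25, 27, 28, 30, 31, 35, 36, 40, 45, 52, 71]

for k in (1, 2):
    print(f"k={k}: data {share_within(ages, k):.3f}, normal {normal_share(k):.3f}")
```

Large gaps between the two columns suggest the data is more bunched up or more spread out than a normal distribution in that region.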

Data within standard deviations

Checking against the data

Histogram of ages from the data

Normality and scaling

Note that normality does not depend on the size of the standard deviation or the size of the mean
We could easily change the units from years to months
- Mean would be multiplied by 12
- Standard deviation would be multiplied by 12
- However, the proportion of observations within each number of standard deviations of the mean would stay the same
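A quick way to see the scaling point is to convert a sample from years to months and recompute everything. This is a minimal sketch on hypothetical toy ages (not the Titanic data); `share_within` is an illustrative helper.

```python
from statistics import mean, stdev

def share_within(values, k):
    """Fraction of observations within k sample sds of the sample mean."""
    m, s = mean(values), stdev(values)
    return sum(abs(v - m) <= k * s for v in values) / len(values)

# Hypothetical toy ages in years (NOT the Titanic passenger data).
ages_years = [2, 18, 21, 22, 24, 25, 27, 28, 30, 31, 35, 36, 40, 45, 52, 71]
ages_months = [12 * a for a in ages_years]  # same people, different units

print("mean:", mean(ages_years), "years vs", mean(ages_months), "months")
print("sd:  ", stdev(ages_years), "years vs", stdev(ages_months), "months")
for k in (1, 2):
    print(f"share within {k} sd:",
          share_within(ages_years, k), "vs", share_within(ages_months, k))
```

The mean and standard deviation both scale by 12, so every observation keeps the same distance from the mean when measured in standard deviations.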

Final thoughts on normality

When is the normal distribution useful?

When we know a data-generating process is normally distributed, we don’t even need to sample the population
- Can find out exactly how much data is between a certain number of standard deviations
When we expect a data-generating process to be normally distributed, we can test for deviations from normality
- In the case of Titanic passengers, some parts of the distribution were more bunched up, others more spread out
Many of our statistical techniques require, or work better with, data that is ‘roughly’ normal
- Will detail these in the coming weeks
We can transform our data to be closer to normal
- Note that transformations can only correct skew; they won’t help if the data has multiple modes
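To illustrate how a transformation can reduce skew, the sketch below compares a moment-based skewness measure before and after a log transform. The data is a hypothetical right-skewed toy sample, and the log is only one common choice for right skew; nothing here is meant as the answer for `age`.

```python
from math import log
from statistics import mean

def skewness(values):
    """Moment-based sample skewness: 0 for symmetric data, > 0 for right skew."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - m) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical right-skewed toy data (all values must be positive for log).
raw = [1, 2, 2, 3, 3, 3, 4, 5, 8, 13, 40]
logged = [log(v) for v in raw]

print("skewness before:", round(skewness(raw), 2))
print("skewness after log:", round(skewness(logged), 2))
```

The log pulls in the long right tail, so the transformed values sit closer to symmetric; it would do nothing to merge two separate modes.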

Information about `age`

What transformation would be helpful for `age`?