Lecture 1.3 - Advanced distributions
More on distributions
Thoughts about comparing groups
- Faceted histograms are a reasonable display to show distributions by a categorical variable
- However, these displays become hard to interpret when the categorical variable has many levels
- Side-by-side box plots are much easier to interpret
- Box plots condense many important characteristics of a distribution into a single summary display
- Think carefully about how you treat outliers
- Let’s view data from the 2023-2024 NBA season
Two group comparison
NBA side-by-side histograms of points scored by W/L
NBA boxplot comparison of points scored by W/L
NBA boxplot comparison of points scored by W/L (better)
Many group comparison
NBA side-by-side histograms of points scored by team
NBA boxplot comparison of points scored by team (better)
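To make concrete what a side-by-side box plot summarizes, here is a minimal sketch that computes the five-number summary for each group. The scores are made up for illustration and are not the NBA data used in the lecture; the helper name `five_number_summary` is also hypothetical, and the "inclusive" quantile method is one of several common conventions.

```python
from statistics import quantiles

# Hypothetical points scored per game, grouped by result (NOT real NBA data).
points_by_result = {
    "W": [118, 124, 109, 131, 115, 122, 127, 112],
    "L": [101, 108, 97, 113, 104, 95, 110, 106],
}


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a box plot draws."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# One summary per group: this is what lets many groups sit side by side.
summaries = {
    group: five_number_summary(values)
    for group, values in points_by_result.items()
}

for group, summary in summaries.items():
    print(group, summary)
```

Each group is reduced to five numbers, which is why the display scales to many teams, and also why it loses detail such as multiple modes.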
Your turn
- Work with your neighbor to analyze a different set of statistics
- Can be by division or not
- Remember the key features of distributions
- Shape
- Center
- Spread
- Interpret your results
Checking outliers - assists
Outliers - assists
Assists > 40 - true outliers?
Checking outliers - points
Outliers - points
Points by team > 150 - true outliers?
In summary
- Think about which kind of display is appropriate for comparing distributions
- When conditioning on a categorical variable, boxplots are usually better
- But boxplots lose information
- Think carefully about omitting outliers
- Outliers may reveal important information about your dataset!
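Box plots usually flag outliers with a fence rule. The sketch below assumes the common 1.5 x IQR convention, which may differ from the rule used to draw the lecture's plots. The assists values are invented for illustration and the function name `iqr_outliers` is hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Hypothetical team assists per game (NOT real NBA data).
assists = [22, 25, 27, 24, 26, 28, 23, 29, 25, 44]

flagged = iqr_outliers(assists)
print(flagged)
```

A flagged value is a prompt to look at the underlying game, not a reason to delete it: it may be a data error or a genuinely unusual performance.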
Titanic passengers and the Normal distribution
Dataset of passengers on the Titanic
- What are your expectations for how age should be distributed?
- We are going to violate our first three rules:
- Make a picture
- Make a picture
- Make a picture
Were the passenger ages normally distributed?
To answer that question, we need some information about the distribution
Remember, our main information about distributions is:
- Shape
- Center
- Spread
Information about age
- Standard deviation: 14.4
- Mean: 29.9
- Normal model: \(N(\mu, \sigma) = N(29.9,14.4)\)
- \(\mu\) is the theoretical mean
- \(\sigma\) is the theoretical standard deviation
- These values define the data generating process
- We only see some values from the data generating process, but if we could see infinitely many, the mean would be \(\mu\) and the sd would be \(\sigma\)
- More on this in the second half of class
- How can we check normality using this information?
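One way to make the Normal model concrete is to ask what \(N(29.9, 14.4)\) would predict if it were true. The sketch below uses Python's standard-library `NormalDist`; the specific ages queried (18, 40, 0) are arbitrary choices for illustration, not values from the lecture.

```python
from statistics import NormalDist

# Normal model with the mean and standard deviation reported for age.
age_model = NormalDist(mu=29.9, sigma=14.4)

# Share of passengers the model places between ages 18 and 40.
share_18_to_40 = age_model.cdf(40) - age_model.cdf(18)

# Share the model places below age 0, which is impossible for real ages.
share_below_zero = age_model.cdf(0)

# Age below which the model places 97.5% of passengers.
age_975 = age_model.inv_cdf(0.975)

print(f"Model share aged 18 to 40: {share_18_to_40:.3f}")
print(f"Model share below age 0:   {share_below_zero:.3f}")
print(f"Model 97.5th percentile:   {age_975:.1f}")
```

Predictions like these can then be compared with what the data actually show; a model that puts noticeable weight on impossible ages is already a hint that the fit is only approximate.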
Checking normality
Thinking about normality
- We can check normality by comparing the quantiles of our data with the known quantiles of the normal distribution
- For a normal distribution, approximately 95% of the data lies within two standard deviations of the mean
- Therefore, the lowest 2.5% of the data lies more than two standard deviations below the mean, and the highest 2.5% lies more than two standard deviations above it
- Similarly, for one standard deviation the split is 16% below, 68% within, and 16% above
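The comparison described above can be sketched as follows: compute the share of observations within one and two standard deviations of the mean and set it against the share a normal distribution implies. The `ages` list is a made-up sample, not the Titanic data, and `share_within` is a hypothetical helper name.

```python
from statistics import NormalDist, mean, stdev


def share_within(values, k):
    """Share of values within k sample standard deviations of the mean."""
    m, s = mean(values), stdev(values)
    return sum(abs(v - m) <= k * s for v in values) / len(values)


# Hypothetical ages for illustration (NOT the Titanic passenger data).
ages = [2, 9, 18, 19, 21, 22, 24, 25, 27, 28,
        29, 30, 32, 35, 36, 40, 45, 50, 58, 71]

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Shares implied by a normal distribution, and shares seen in the sample.
expected = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2)}
observed = {k: share_within(ages, k) for k in (1, 2)}

for k in (1, 2):
    print(f"{k} sd: normal model {expected[k]:.3f}, sample {observed[k]:.3f}")
```

Large gaps between the two columns suggest the data are more bunched up or more spread out than a normal distribution would be.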
Data within standard deviations
Checking against the data
Histogram of ages from the data
Normality and scaling
- Note that normality does not depend on the size of the standard deviation or the size of the mean
- Could easily change the units to be months instead of years
- Mean would increase a lot
- Standard deviation would increase a lot
- However, the proportion of observations within each standard deviation of the mean would stay the same
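The unit change can be checked directly. This sketch converts a small set of invented ages from years to months and confirms that the mean and standard deviation scale by 12 while the shares within one and two standard deviations do not move. The data and the helper `share_within` are hypothetical.

```python
from statistics import mean, stdev


def share_within(values, k):
    """Share of values within k sample standard deviations of the mean."""
    m, s = mean(values), stdev(values)
    return sum(abs(v - m) <= k * s for v in values) / len(values)


# Hypothetical ages in years (NOT the Titanic passenger data).
ages_years = [4, 17, 21, 24, 26, 29, 31, 35, 42, 60]
ages_months = [12 * age for age in ages_years]

print("mean:", mean(ages_years), "years vs", mean(ages_months), "months")
print("sd:  ", stdev(ages_years), "years vs", stdev(ages_months), "months")

for k in (1, 2):
    print(f"share within {k} sd:",
          share_within(ages_years, k), "vs", share_within(ages_months, k))
```

The same reasoning applies to any change of units that multiplies by a positive constant or adds a constant.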
Final thoughts on normality
When is the normal distribution useful?
- When we know a data-generating process is normally distributed, and know its mean and standard deviation, we don’t even need to sample the population
- Can find out exactly how much of the data lies within a given number of standard deviations of the mean
- When we expect a data-generating process to be normally distributed, can test for deviations from normality
- In the case of Titanic passengers, some parts of the distribution were more bunched up, others more spread out
- Many of our statistical techniques require, or work better with, data that is ‘roughly’ normal
- Will detail these in the coming weeks
- We can transform our data to be closer to normal
- Note that transformations can only correct skew; they won’t work if the data has multiple modes
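As an illustration of a transformation reducing skew, the sketch below applies a log transform to a small right-skewed sample and compares the skewness before and after. The log is only one common choice and is an assumption here, since the lecture does not name a specific transformation; the data and the `skewness` helper are also hypothetical.

```python
import math
from statistics import mean


def skewness(values):
    """Moment-based sample skewness: third central moment over sd cubed."""
    m = mean(values)
    n = len(values)
    second_moment = sum((v - m) ** 2 for v in values) / n
    third_moment = sum((v - m) ** 3 for v in values) / n
    return third_moment / second_moment ** 1.5


# Hypothetical right-skewed, strictly positive data (log needs values > 0).
right_skewed = [1, 2, 2, 3, 3, 3, 4, 5, 8, 13, 40, 100]
logged = [math.log(v) for v in right_skewed]

skew_before = skewness(right_skewed)
skew_after = skewness(logged)

print(f"skewness before log: {skew_before:.2f}")
print(f"skewness after log:  {skew_after:.2f}")
```

A skewness closer to zero means a more symmetric distribution. The transform pulls in the long right tail, but it would not merge two separate modes into one.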