Rafael A Irizarry
September 4, 2014
Summarize the main characteristics of the data through numerical summaries or plots. These are used to make discoveries, motivate analysis approaches, or convey a message.
“The greatest value of a picture is when it forces us to notice what we never expected to see.”
John Tukey
A histogram of these test scores forces us to notice something somewhat problematic
Let's summarize the height data for our class
Timestamp Height Sex
1 9/2/2014 13:40:36 75 Male
2 9/2/2014 13:46:59 70 Male
3 9/2/2014 13:59:20 68 Male
4 9/2/2014 14:51:53 74 Male
5 9/2/2014 15:16:15 61 Male
6 9/2/2014 15:16:16 65 Female
Data is here:
https://docs.google.com/spreadsheet/pub?key=0ApvpBbD8HP4mdDlRUi1vdTlBQ3Rub2dJSUNVUDlDdVE&output=csv
Note that some entries are not in inches.
Timestamp Height Sex
22 9/2/2014 15:16:28 5' 4" Male
110 9/2/2014 15:16:52 5'7 Male
127 9/2/2014 15:16:56 5'7" Male
150 9/2/2014 15:17:09 5'3" Female
187 9/2/2014 15:18:00 5'8.11 Male
202 9/2/2014 15:19:48 5'11 Male
236 9/4/2014 0:46:45 5'9'' Male
55 9/2/2014 15:16:37 165cm Female
Fixing this is part of what we call data wrangling.
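A minimal sketch of this kind of wrangling, covering only the formats seen in the entries above (feet and inches, centimeters, plain numbers). The function name `height_to_inches` is hypothetical, and reading `5'8.11` as 5 feet 8.11 inches is an assumption about what the respondent meant.

```python
import re

def height_to_inches(raw):
    """Convert a free-text height entry to inches; return None if unrecognized."""
    text = raw.strip().lower()

    # Centimeters, e.g. "165cm"
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*cm", text)
    if m:
        return float(m.group(1)) / 2.54

    # Feet and optional inches, e.g. 5' 4", 5'7, 5'9''
    m = re.fullmatch(r"(\d+)\s*'\s*(\d+(?:\.\d+)?)?\s*(?:\"|'')?", text)
    if m:
        feet = int(m.group(1))
        inches = float(m.group(2)) if m.group(2) else 0.0
        return 12 * feet + inches

    # Plain number, assumed to already be in inches
    try:
        return float(text)
    except ValueError:
        return None

for entry in ["5' 4\"", "5'7", "5'9''", "165cm", "75"]:
    print(entry, "->", height_to_inches(entry))
```

Entries that match none of the patterns come back as `None`, so they can be reviewed separately instead of being silently guessed.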
After fixing the above issue, there are still some problems:
Timestamp Height Sex
12 9/2/2014 15:16:23 6.00 Male
40 9/2/2014 15:16:32 5.30 Female
66 9/2/2014 15:16:41 511.00 Male
84 9/2/2014 15:16:46 6.00 Male
99 9/2/2014 15:16:50 2.00 Female
126 9/2/2014 15:16:56 9000.00 Male
194 9/2/2014 15:18:14 5.25 Female
231 9/3/2014 21:43:00 5.50 Male
235 9/3/2014 23:55:37 11111.00 Male
241 9/4/2014 5:15:28 6.00 Female
242 9/4/2014 6:31:03 6.50 Male
244 9/4/2014 9:24:41 150.00 Female
We sometimes have to fix these “by hand”
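Before fixing entries by hand, it helps to list the ones that need attention. A minimal sketch, assuming a plausible range of 50 to 84 inches; both the cutoffs and the function name `flag_implausible` are assumptions made for illustration, not part of the original analysis.

```python
def flag_implausible(heights, low=50.0, high=84.0):
    """Return (position, value) pairs for values outside [low, high] inches."""
    return [(i, h) for i, h in enumerate(heights) if not (low <= h <= high)]

# A few values of the kind shown above, mixed with valid heights in inches
entries = [75.0, 6.0, 5.3, 511.0, 70.0, 9000.0, 150.0]
print(flag_implausible(entries))
```

The flagged values still need a human decision: 6.00 is probably feet, 150.00 probably centimeters, and 9000.00 probably not a real answer.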
\[ F(a) = \mbox{Pr}(\mbox{Height} \leq a) \]
Referred to as cumulative distribution function (CDF)
Histograms show \( F(b)-F(a) \), the proportion of values in each of several intervals \( (a,b] \).
They are easier to interpret than cumulative distribution functions.
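The relationship between the CDF and a histogram bar can be sketched with the empirical version of \( F \), here using the first six heights from the class data. The function names are hypothetical.

```python
def ecdf(values, a):
    """Empirical CDF: proportion of values less than or equal to a."""
    return sum(v <= a for v in values) / len(values)

def interval_proportion(values, a, b):
    """Proportion of values in (a, b], computed as F(b) - F(a)."""
    return ecdf(values, b) - ecdf(values, a)

heights = [75, 70, 68, 74, 61, 65]  # first six rows of the class data

print(ecdf(heights, 70))                     # 4 of the 6 values are <= 70
print(interval_proportion(heights, 65, 70))  # 68 and 70 fall in (65, 70]
```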
The distributions of many outcomes in nature are approximated by the normal distribution:
\[ \mbox{Pr}(a < Y < b) = \int_a^b \frac{1}{\sigma\sqrt{2\pi}} \exp \left\{ -\frac{1}{2} \left( \frac{y-\mu}{\sigma} \right)^2 \right\} \, dy \]
If our data follows the normal distribution then \( \mu \) and \( \sigma \) are a sufficient summary: they tell us everything! To see this let
\[ Z = \frac{Y-\mu}{\sigma} \]
then we have a formula that gives us \( \mbox{Pr}(Z < a) \) for any \( a \) without looking at the data. Because \( \mbox{Pr}(Y < a) = \mbox{Pr}\left(Z < \frac{a-\mu}{\sigma}\right) \), all we need to know is \( \mu \) and \( \sigma \).
Average SD
Male 70 3
Female 65 3
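As a sketch of how the two numbers are used, the standardization above turns any height into a \( Z \) value, and the standard normal CDF can be written with the error function from Python's `math` module. The function names are hypothetical; the average of 70 and SD of 3 are the male values from the table.

```python
from math import erf, sqrt

def standard_normal_cdf(z):
    """Pr(Z <= z) for a standard normal Z, written with the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def normal_cdf(a, mu, sigma):
    """Pr(Y <= a) when Y is normal with mean mu and standard deviation sigma."""
    z = (a - mu) / sigma  # standardize
    return standard_normal_cdf(z)

# Male average and SD from the table above
for height in [67, 70, 73]:
    print(height, round(normal_cdf(height, mu=70, sigma=3), 2))
```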
Here are the approximations for males
Height Real Approx
1 63 0.02 0.03
2 65 0.07 0.06
3 67 0.16 0.10
4 68 0.31 0.31
5 70 0.50 0.44
6 71 0.69 0.68
7 73 0.84 0.88
8 75 0.93 0.95
9 76 0.98 0.99
Observed versus normal approximation quantiles
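The pairs behind such a comparison can be sketched as follows: for each probability \( p \), take the quantile of the fitted normal distribution and the corresponding quantile of the observed data. This uses `statistics.NormalDist` from the Python standard library; the function name `qq_pairs` and the simple quantile rule are assumptions, and the six heights are only the first rows of the class data.

```python
import math
from statistics import NormalDist, mean, stdev

def qq_pairs(values, probs):
    """Return (normal quantile, observed quantile) pairs for each p in probs."""
    fitted = NormalDist(mean(values), stdev(values))
    ordered = sorted(values)
    n = len(ordered)
    pairs = []
    for p in probs:
        # Simple empirical quantile: smallest value whose ECDF is at least p
        k = max(0, math.ceil(p * n) - 1)
        pairs.append((fitted.inv_cdf(p), ordered[k]))
    return pairs

heights = [75, 70, 68, 74, 61, 65]  # first six rows of the class data
for theoretical, observed in qq_pairs(heights, [0.25, 0.5, 0.75]):
    print(round(theoretical, 1), observed)
```

If the normal approximation is good, the pairs fall close to the identity line.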
Many pairs of data are approximately bivariate normal
The regression line is defined by this formula
\[ \frac{Y - \mu_Y}{\sigma_Y} = \rho \frac{X-\mu_X}{\sigma_X} \]
For bivariate normal pairs of data these five numbers provide a complete summary:
\( \mu_X,\mu_Y,\sigma_X,\sigma_Y \mbox{ and } \rho \)
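Solving the regression line formula for \( Y \) gives a prediction from the five summary numbers alone. A minimal sketch; the function name is hypothetical and the numbers in the example are made up for illustration, not estimates from any data set in these slides.

```python
def regression_prediction(x, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Predict Y from X using (Y - mu_y)/sigma_y = rho * (X - mu_x)/sigma_x."""
    return mu_y + rho * sigma_y * (x - mu_x) / sigma_x

# Hypothetical summary numbers, chosen only to illustrate the arithmetic
prediction = regression_prediction(x=73, mu_x=70, mu_y=65, sigma_x=3, sigma_y=3, rho=0.5)
print(prediction)  # X is one SD above its mean, so Y is predicted rho = 0.5 SDs above its mean
```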
For example, look at compensation for 199 US CEOs (2000)
The average is $600,000, but 84% of these CEOs, not the 50% a normal distribution would imply, make less.
The normal approximation is not useful here.
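The check used here, the proportion of values below the average, can be sketched on a small made-up sample. The values below are hypothetical and are not the CEO data; they only show how a few very large values pull the average far above the median.

```python
from statistics import mean, median

def fraction_below_mean(values):
    """Proportion of values strictly below the average."""
    m = mean(values)
    return sum(v < m for v in values) / len(values)

# Made-up, right-skewed sample (not the CEO compensation data)
skewed = [1, 1, 2, 2, 3, 3, 4, 5, 10, 69]

print(mean(skewed), median(skewed))
print(fraction_below_mean(skewed))
```

For data that are close to normal this proportion is near 0.5, so a value far from 0.5 is a quick sign that the average and SD alone are not a good summary.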