Statistical Summaries and Exploratory Data Analysis

Rafael A Irizarry
September 4, 2014

Brief Announcement

Exploratory Data Analysis

Summarize the main characteristics of the data through summaries or plots. EDA is used to make discoveries, motivate analysis approaches, or convey a message.

“The greatest value of a picture is when it forces us to notice what we never expected to see.”

John Tukey

Graphics Editor at The New York Times

Amanda Cox
M.S. in Statistics from the University of Washington (2005)
Responsible for most of the infographics we see in The New York Times

Grades from NYC Regents Exam

In New York City you need a 65 to pass the Regents exam.
Data on these scores are collected for several reasons.

Grades from NYC Regents Exam

A histogram of these test scores forces us to notice something somewhat problematic.

Voting Patterns by State

Here is another example of an advanced visualization based on the histogram idea.

CS109 Heights Data

Let's summarize the height data for our class:

          Timestamp Height    Sex
1 9/2/2014 13:40:36     75   Male
2 9/2/2014 13:46:59     70   Male
3 9/2/2014 13:59:20     68   Male
4 9/2/2014 14:51:53     74   Male
5 9/2/2014 15:16:15     61   Male
6 9/2/2014 15:16:16     65 Female

Data is here:

https://docs.google.com/spreadsheet/pub?key=0ApvpBbD8HP4mdDlRUi1vdTlBQ3Rub2dJSUNVUDlDdVE&output=csv

Motivating Data Wrangling

Note that some entries are not in inches.

            Timestamp Height    Sex
22  9/2/2014 15:16:28  5' 4"   Male
110 9/2/2014 15:16:52    5'7   Male
127 9/2/2014 15:16:56   5'7"   Male
150 9/2/2014 15:17:09   5'3" Female
187 9/2/2014 15:18:00 5'8.11   Male
202 9/2/2014 15:19:48   5'11   Male
236  9/4/2014 0:46:45  5'9''   Male
55  9/2/2014 15:16:37  165cm Female

Fixing this is part of what we call data wrangling.
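As a rough illustration of this kind of fix, here is a minimal sketch that converts entries like the ones above to inches. The function name `height_to_inches`, the regular expressions, and the decision to read `5'8.11` as 5 feet 8.11 inches are all assumptions made for this example, not the method used in class.

```python
import re

CM_PER_INCH = 2.54

def height_to_inches(raw):
    """Convert a raw height entry to inches, or return None if unrecognized."""
    text = raw.strip().lower()

    # Centimeters, e.g. "165cm" or "165 cm".
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*cm", text)
    if match:
        return float(match.group(1)) / CM_PER_INCH

    # Feet and inches, e.g. 5' 4", 5'7, 5'9''. The closing inch mark is optional.
    match = re.fullmatch(r"(\d+)\s*'\s*(\d+(?:\.\d+)?)?\s*(?:\"|'')?", text)
    if match:
        feet = int(match.group(1))
        inches = float(match.group(2)) if match.group(2) else 0.0
        return 12 * feet + inches

    # Plain number: assume (hypothetically) that it is already in inches.
    if re.fullmatch(r"\d+(?:\.\d+)?", text):
        return float(text)

    return None

for entry in ["5' 4\"", "5'7", "5'9''", "165cm", "75"]:
    print(entry, "->", height_to_inches(entry))
```

Entries that match none of the patterns come back as `None`, so they can be reviewed rather than silently guessed.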

Data Wrangling

After fixing the above issue, there are still some problems:

            Timestamp   Height    Sex
12  9/2/2014 15:16:23     6.00   Male
40  9/2/2014 15:16:32     5.30 Female
66  9/2/2014 15:16:41   511.00   Male
84  9/2/2014 15:16:46     6.00   Male
99  9/2/2014 15:16:50     2.00 Female
126 9/2/2014 15:16:56  9000.00   Male
194 9/2/2014 15:18:14     5.25 Female
231 9/3/2014 21:43:00     5.50   Male
235 9/3/2014 23:55:37 11111.00   Male
241  9/4/2014 5:15:28     6.00 Female
242  9/4/2014 6:31:03     6.50   Male
244  9/4/2014 9:24:41   150.00 Female

We sometimes have to fix these “by hand”.
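Before fixing anything by hand, it helps to list the entries that need attention. The sketch below flags values outside a plausible range in inches. The bounds of 50 and 84 inches and the name `flag_implausible` are assumptions chosen for illustration.

```python
def flag_implausible(heights, low=50.0, high=84.0):
    """Return (index, value) pairs whose value falls outside [low, high] inches.

    The default bounds are illustrative assumptions, not values from the course.
    """
    return [(i, h) for i, h in enumerate(heights) if not (low <= h <= high)]

# Some values from the table above, mixed with two plausible entries (70 and 65).
sample = [6.00, 5.30, 511.00, 70.0, 9000.00, 150.00, 65.0]
suspects = flag_implausible(sample)
print(suspects)
```

The flagged values still need human judgment: 511 may be a mistyped 5'11, 150 may be in centimeters, and 9000 is probably not a real answer.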

Distributions

\[ F(a) = \mbox{Pr}(\mbox{Height} \leq a) \]

[Figure: plot of $ F(a) $]

This function is referred to as the cumulative distribution function (CDF).
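For a list of observed heights, $ F(a) $ can be estimated by the proportion of observations at or below $ a $. The sketch below builds this empirical CDF; the helper name `ecdf` is hypothetical, and the six heights are the first rows of the class table shown earlier.

```python
from bisect import bisect_right

def ecdf(data):
    """Return a function F where F(a) is the proportion of data <= a."""
    ordered = sorted(data)
    n = len(ordered)

    def F(a):
        # bisect_right counts how many sorted values are <= a.
        return bisect_right(ordered, a) / n

    return F

heights = [75, 70, 68, 74, 61, 65]
F = ecdf(heights)
print(F(60), F(68), F(75))
```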

Distributions

Histograms show $ F(b)-F(a) $ for several intervals $ (a,b] $.

[Figure: histogram]

Histograms are easier to interpret than cumulative distribution functions.
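The link between the two displays can be checked numerically: each bar height is a difference of two CDF values. This is a minimal sketch in which the function name `interval_proportions` and the bin edges are assumptions made for the example.

```python
def interval_proportions(data, edges):
    """Proportion of data in each interval (a, b] given consecutive edges."""
    n = len(data)

    def F(a):
        # Empirical CDF: proportion of observations <= a.
        return sum(1 for x in data if x <= a) / n

    return [((a, b), F(b) - F(a)) for a, b in zip(edges, edges[1:])]

heights = [75, 70, 68, 74, 61, 65]
bars = interval_proportions(heights, [60, 65, 70, 75])
for (a, b), p in bars:
    print(f"({a}, {b}]: {p:.3f}")
```

When the edges cover the whole range of the data, the proportions add up to 1.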

Normal Approximation

The distributions of many outcomes in nature are well approximated by the normal distribution:

\[ \mbox{Pr}(a < Y < b) = \int_a^b \frac{1}{\sigma\sqrt{2\pi}} \exp \left\{ -\frac{1}{2} \left( \frac{y-\mu}{\sigma} \right)^2 \right\} \, dy \]

Y represents a data point
$ \mu $ is the average (also called the mean)
$ \sigma $ is the standard deviation
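The integral above has no elementary closed form, but it can be evaluated with the error function. The sketch below does this with Python's standard library; the function names are hypothetical, and the values $ \mu = 70 $ and $ \sigma = 3 $ are used only as an illustration.

```python
from math import erf, sqrt

def normal_cdf(y, mu, sigma):
    """Pr(Y <= y) for a normal distribution with mean mu and SD sigma."""
    return 0.5 * (1.0 + erf((y - mu) / (sigma * sqrt(2.0))))

def normal_prob(a, b, mu, sigma):
    """Pr(a < Y < b), i.e. the integral of the normal density from a to b."""
    return normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)

# Probability of falling within one SD of the average (illustrative values).
p = normal_prob(67, 73, mu=70, sigma=3)
print(round(p, 4))
```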

Normal Approximation

If our data follow the normal distribution, then $ \mu $ and $ \sigma $ are a sufficient summary: they tell us everything! To see this, let

\[ Z = \frac{Y-\mu}{\sigma} \]

then we have a formula that gives us $ \mbox{Pr}(Z < a) $ for any $ a $ without looking at the data. All we need to know is $ \mu $ and $ \sigma $.

       Average SD
Male        70  3
Female      65  3

Standard Units

$ Z=(Y-\mu)/\sigma $ is said to be in standard units.
$ Z $ answers the question: how many SDs away from the average is $ Y $?
In CS109 a 6'4" male (76 inches) is 2 SDs above the male average of 70, thus $ Z=2 $
Without even counting we know that fewer than 5% are this tall
68% are within 1 SD
95% are within 2 SDs
>99% are within 3 SDs
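These figures can be reproduced from the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `standard_units` is hypothetical, and the mean of 70 and SD of 3 are the male summary values given earlier.

```python
from statistics import NormalDist

def standard_units(y, mu, sigma):
    """Number of SDs that y lies away from the average mu."""
    return (y - mu) / sigma

# A 6'4" male is 76 inches tall.
z = standard_units(76, mu=70, sigma=3)

std_normal = NormalDist()  # mean 0, SD 1
# Proportion within k SDs of the average, for k = 1, 2, 3.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}
# Proportion at least as tall as z SDs above the average.
upper_tail = 1 - std_normal.cdf(z)

print(z, within, upper_tail)
```

Under the normal approximation, the proportion taller than $ Z=2 $ is a bit over 2%, which is consistent with the "less than 5%" bound.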

How good is the normal approximation?

Here are the approximations for males

  Height Real Approx
1     63 0.02   0.03
2     65 0.07   0.06
3     67 0.16   0.10
4     68 0.31   0.31
5     70 0.50   0.44
6     71 0.69   0.68
7     73 0.84   0.88
8     75 0.93   0.95
9     76 0.98   0.99
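One way to build a comparison of this kind is to compute, for each height, the observed proportion at or below it alongside the normal CDF fitted to the same data. The sketch below is not necessarily how the table above was produced. The name `compare_to_normal` is hypothetical, and the six heights are only the first rows of the class table, used for illustration.

```python
from statistics import NormalDist

def compare_to_normal(data, thresholds):
    """Rows of (threshold, observed proportion <= it, normal approximation)."""
    # Fit a normal distribution using the sample mean and sample SD.
    approx = NormalDist.from_samples(data)
    n = len(data)
    rows = []
    for a in thresholds:
        observed = sum(1 for x in data if x <= a) / n
        rows.append((a, observed, approx.cdf(a)))
    return rows

heights = [75, 70, 68, 74, 61, 65]
rows = compare_to_normal(heights, [65, 70, 75])
for a, observed, approx in rows:
    print(f"{a}  {observed:.2f}  {approx:.2f}")
```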

QQ-plots

A QQ-plot shows the observed quantiles versus the normal approximation quantiles.

[Figure: QQ-plot]
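The points of such a plot can be computed by pairing each sorted observation with the quantile that a fitted normal distribution assigns to the same probability. This is a minimal sketch; the name `qq_points` and the plotting positions $ (i + 0.5)/n $ are assumptions, and other conventions exist.

```python
from statistics import NormalDist, fmean, stdev

def qq_points(data):
    """Pair each sorted observation with the matching normal quantile."""
    ordered = sorted(data)
    n = len(ordered)
    approx = NormalDist(fmean(ordered), stdev(ordered))
    # Positions (i + 0.5) / n avoid asking for the 0% or 100% quantile.
    probs = [(i + 0.5) / n for i in range(n)]
    return [(approx.inv_cdf(p), y) for p, y in zip(probs, ordered)]

heights = [75, 70, 68, 74, 61, 65]
points = qq_points(heights)
for theoretical, observed in points:
    print(f"{theoretical:.1f}  {observed}")
```

If the normal approximation is good, the pairs fall close to the identity line.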

Two variables

Normal approximation for two variables

Many pairs of data are bivariate normal.

[Figure: paired data with the regression line in blue]

The blue line is the average within each stratum.
It is called the regression line.

Regression line

The regression line is defined by this formula

\[ \frac{Y - \mu_Y}{\sigma_Y} = \rho \frac{X-\mu_X}{\sigma_X} \]

$ \rho $ is called the correlation
For father and son heights it is 0.5

For bivariate normal pairs of data these five numbers provide a complete summary:

$ \mu_X,\mu_Y,\sigma_X,\sigma_Y \mbox{ and } \rho $
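Solving the regression line formula for $ Y $ gives a prediction from the five summary numbers alone. In the sketch below the function name is hypothetical. The correlation of 0.5 comes from the slide, while using an average of 70 and an SD of 3 for both fathers and sons is an assumption made for illustration.

```python
def regression_prediction(x, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Predicted Y for a given X under the regression line.

    Rearranges (Y - mu_y) / sigma_y = rho * (X - mu_x) / sigma_x for Y.
    """
    return mu_y + rho * sigma_y * (x - mu_x) / sigma_x

# Assumed summaries: both generations average 70 inches with an SD of 3.
son = regression_prediction(76, mu_x=70, mu_y=70, sigma_x=3, sigma_y=3, rho=0.5)
print(son)
```

A father who is 2 SDs above average is predicted to have a son only 1 SD above average, because the father's standard units are multiplied by the correlation.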

Anscombe's quartet

Most data is not normal

For example, look at compensation for 199 US CEOs (2000)

[Figure: compensation for 199 US CEOs (2000)]

The average is $600,000, but 84% of the CEOs, not 50%, make less than that.

The normal approximation is not useful here.
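The effect is easy to reproduce with any right-skewed data. The sketch below uses made-up pay figures, not the CEO data, to show how one large value pulls the average above most of the observations. The function name `share_below_mean` is hypothetical.

```python
from statistics import fmean, median

def share_below_mean(values):
    """Proportion of values that are strictly below the average."""
    m = fmean(values)
    return sum(1 for v in values if v < m) / len(values)

# Made-up, right-skewed pay figures in dollars (NOT the CEO data).
pay = [200_000, 250_000, 300_000, 350_000, 400_000,
       450_000, 500_000, 550_000, 600_000, 6_400_000]

print(fmean(pay))             # average
print(median(pay))            # median
print(share_below_mean(pay))  # proportion below the average
```

For symmetric data such as a normal distribution the proportion below the average would be close to 50%. Here it is far higher, so the average and SD alone are a poor summary.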