Bayesian Statistics

Rafael A. Irizarry
October 16, 2014

Homework2

Cystic Fibrosis Test

  • A test for cystic fibrosis has an accuracy of 99%:

\[ \mbox{Prob}(+|D)=0.99, \mbox{Prob}(-|\mbox{no } D)=0.99, \]

  • If we select a random person and they test positive, what is the probability that they have the disease?

  • We write this as \( \mbox{Prob}(D|+)? \)

  • The cystic fibrosis rate is 1 in 3,900, so \( \mbox{Prob}(D)=0.00025 \)

Bayes Rule

\[ \mbox{Pr}(A|B) = \frac{\mbox{Pr}(B|A)\mbox{Pr}(A)}{\mbox{Pr}(B)} \]

Bayes Rule

\[ \begin{eqnarray*} \mbox{Prob}(D|+) & = & \frac{ \mbox{Prob}(+|D) \cdot \mbox{Prob}(D)} {\mbox{Prob}(+)} \\ & = & \frac{\mbox{Prob}(+|D)\cdot \mbox{Prob}(D)} {\mbox{Prob}(+|D) \cdot \mbox{Prob}(D) + \mbox{Prob}(+|\mbox{no } D) \mbox{Prob}(\mbox{no } D)} \\ & = & \frac{0.99 \cdot 0.00025}{0.99 \cdot 0.00025 + 0.01 \cdot 0.99975} \\ & = & 0.02 \;\;\; \mbox{not} \; \; \; 0.99 \end{eqnarray*} \]
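The calculation can be checked numerically. The following is a minimal sketch, assuming the prevalence is exactly 1 in 3,900 and that both \( \mbox{Prob}(+|D) \) and \( \mbox{Prob}(-|\mbox{no } D) \) equal 0.99. The function name `posterior_prob` is illustrative only.

```python
def posterior_prob(prior, sensitivity, specificity):
    """Return Prob(D | +) using Bayes rule."""
    # Prob(+) = Prob(+|D) Prob(D) + Prob(+|no D) Prob(no D)
    false_positive_rate = 1.0 - specificity
    prob_positive = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / prob_positive


prevalence = 1 / 3900  # assumed: rate of exactly 1 in 3,900
prob_d_given_pos = posterior_prob(prevalence, 0.99, 0.99)
print(f"Prob(D|+) = {prob_d_given_pos:.3f}")
```

Because the disease is rare, most positive tests come from the much larger healthy group, which is why the result is far below 0.99.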

Simulation

[Figure: simulation plot]

In the plot: \( \mbox{Prob}(A|B) \) = % red in the bottom left, \( \mbox{Prob}(A) \) = % red on top, \( \mbox{Prob}(B|A) \) = % not X, \( \mbox{Prob}(B) \) = total points in the bottom left.
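The same conditional probability can be estimated by simulating a population and counting. This is a rough sketch and not the code behind the slide's plot; the population size and seed are arbitrary assumptions, and the prevalence is taken as 1 in 3,900.

```python
import random

random.seed(2014)  # arbitrary seed, for reproducibility only

N = 1_000_000          # simulated population size (assumption)
PREVALENCE = 1 / 3900  # Prob(D)
ACCURACY = 0.99        # Prob(+|D) and Prob(-|no D)

positives = 0
true_positives = 0
for _ in range(N):
    has_disease = random.random() < PREVALENCE
    # The test reports the correct status with probability ACCURACY.
    correct = random.random() < ACCURACY
    tests_positive = has_disease if correct else not has_disease
    if tests_positive:
        positives += 1
        true_positives += has_disease

# Fraction of positive tests that belong to people with the disease.
estimate = true_positives / positives
print(f"Estimated Prob(D|+) = {estimate:.3f}")
```

The estimate is the share of diseased people among those who test positive, so it should land near the value given by Bayes rule, up to simulation noise.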

Bayes in Practice

José Iglesias 2013

Month   At Bats   Hits   AVG
April   20        9      .450

What is your prediction for his average in October?

Note: No one has finished a season batting .400 since Ted Williams in 1941!

Distribution of AVG

This is for all players with more than 500 at bats (AB) in 2010, 2011, and 2012

[Figure: distribution of batting averages]

Average is .275 and SD is 0.027

José Iglesias’ April batting average

  • Should we trade him?

  • What is the SE of our estimate?

  • \[ \sqrt{\frac{.450 (1-.450)}{20}}=.111 \]

  • Confidence interval?

  • .450-.222 to .450+.222 = .228 to .672
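The standard error and interval above can be reproduced in a few lines. This is a minimal sketch that uses the same \( \pm 2 \) SE rule as the slide, which approximates a 95% interval.

```python
import math

hits, at_bats = 9, 20
avg = hits / at_bats  # April batting average, .450

# Binomial standard error of a proportion.
se = math.sqrt(avg * (1 - avg) / at_bats)
lower, upper = avg - 2 * se, avg + 2 * se
print(f"SE = {se:.3f}, interval = {lower:.3f} to {upper:.3f}")
```

With only 20 at bats the interval is very wide, so the April average alone says little about his true ability.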

Hierarchical Model

Pick a random player. What is their batting average?

\[ \begin{eqnarray*} \theta &\sim& N(\mu, \tau^2) \mbox{ is called a prior}\\ Y | \theta &\sim& N(\theta, \sigma^2) \mbox{ is called a sampling distribution} \end{eqnarray*} \]

Two levels of variability:

  • Player to player variability
  • Variability due to luck when batting

Hierarchical Model

\[ \begin{eqnarray*} \theta &\sim& N(\mu, \tau^2) \mbox{ is called a prior}\\ Y | \theta &\sim& N(\theta, \sigma^2) \mbox{ is called a sampling distribution} \end{eqnarray*} \]

  • \( \theta \) is our player's “intrinsic” average value
  • \( \mu \) is the average of all players
  • \( \tau \) is the SD of all players
  • \( Y \) is the observed average
  • \( \sigma \) is the variability due to luck at each AB

Hierarchical Model

Here are the equations with our data

\[ \begin{eqnarray*} \theta &\sim& N(.275, .027^2) \\ Y | \theta &\sim& N(\theta, .110^2) \end{eqnarray*} \]
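The two levels of variability can be illustrated by simulating from the model. This is a sketch under the normal assumptions stated above; the number of simulated players and the seed are arbitrary choices.

```python
import random
import statistics

random.seed(1)  # arbitrary seed, for reproducibility only

MU, TAU = 0.275, 0.027  # prior mean and SD across players
SIGMA = 0.110           # SD due to luck over 20 at bats
N_PLAYERS = 50_000      # number of simulated players (assumption)

# Level 1: each player's intrinsic average.
thetas = [random.gauss(MU, TAU) for _ in range(N_PLAYERS)]
# Level 2: the observed average, given that player's intrinsic average.
observed = [random.gauss(theta, SIGMA) for theta in thetas]

sd_theta = statistics.pstdev(thetas)
sd_observed = statistics.pstdev(observed)
print(f"SD of theta: {sd_theta:.3f}")
print(f"SD of Y:     {sd_observed:.3f}")
```

The observed averages spread far more than the intrinsic ones because their variance is \( \tau^2 + \sigma^2 \), and here luck dominates.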

Posterior Distribution

The continuous version of Bayes rule can be used here

\[ \begin{eqnarray*} f_{\theta|Y}(\theta|Y)&=&\frac{f_{Y|\theta}(Y|\theta) f_{\theta}(\theta)}{f_Y(Y)}\\ &=&\frac{f_{Y|\theta}(Y|\theta) f_{\theta}(\theta)}{\int_{\theta}f_{Y|\theta}(Y|\theta)f_{\theta}(\theta)\, d\theta}\\ \end{eqnarray*} \]

We are particularly interested in the \( \theta \) that maximizes \( f_{\theta|Y}(\theta|Y) \).

In our case, the posterior can be shown to be normal, so the maximum is at the average \( \mbox{E}(\theta|y) \)
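The continuous Bayes rule can also be evaluated numerically on a grid, which gives a check on the claim that the maximum and the average coincide. This is a sketch that uses the José Iglesias values (\( \mu=.275 \), \( \tau=.027 \), \( \sigma=.110 \), \( Y=.450 \)); the grid range and step size are arbitrary choices.

```python
import math

MU, TAU = 0.275, 0.027  # prior mean and SD
SIGMA = 0.110           # sampling SD
Y = 0.450               # observed April average


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


STEP = 0.0001
grid = [i * STEP for i in range(10_001)]  # theta from 0 to 1

# Numerator of Bayes rule: likelihood times prior at each theta.
unnormalized = [normal_pdf(Y, t, SIGMA) * normal_pdf(t, MU, TAU) for t in grid]
# Denominator: Riemann-sum approximation of the integral over theta.
marginal = sum(unnormalized) * STEP
posterior = [u / marginal for u in unnormalized]

post_mode = grid[posterior.index(max(posterior))]
post_mean = sum(t * p for t, p in zip(grid, posterior)) * STEP
print(f"mode = {post_mode:.3f}, mean = {post_mean:.3f}")
```

The grid approach needs no closed-form result, so it also works as a sanity check when the prior or sampling distribution is not normal.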

Posterior Distribution

We can show the average of this distribution is the following:

\[ \begin{eqnarray*} \mbox{E}(\theta|y) &=& B \mu + (1-B) Y\\ &=& \mu + (1-B)(Y-\mu)\\ B &=& \frac{\sigma^2}{\sigma^2+\tau^2} \end{eqnarray*} \]

Posterior Distribution

In the case of José Iglesias, we have:

\[ \begin{eqnarray*} E(\theta | Y=.450) &=& B \times .275 + (1 - B) \times .450 \\ &=& .275 + (1 - B)(.450 - .275) \\ B &=&\frac{.110^2}{.110^2 + .027^2} = 0.943\\ E(\theta | Y=.450) &\approx& .285\\ \end{eqnarray*} \]
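The shrinkage calculation can be written out directly. This is a minimal sketch using \( \mu=.275 \), \( \tau=.027 \), \( \sigma=.110 \), and \( Y=.450 \); the variable names are illustrative.

```python
MU, TAU = 0.275, 0.027  # prior mean and SD across players
SIGMA = 0.110           # SD of the observed average due to luck
Y = 0.450               # observed April average

# B is the weight given to the prior mean.
B = SIGMA**2 / (SIGMA**2 + TAU**2)
posterior_mean = MU + (1 - B) * (Y - MU)
print(f"B = {B:.3f}, E(theta | Y) = {posterior_mean:.3f}")
```

Because \( \sigma \) is much larger than \( \tau \), \( B \) is close to 1 and the prediction stays close to the league average rather than the April average.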

Posterior Distribution

The variance can be shown to be:

\[ \begin{eqnarray*} \mbox{var}(\theta|y) &=& \frac{1}{1/\sigma^2+1/\tau^2} \\ &=& \frac{1}{1/.110^2 + 1/.027^2} \end{eqnarray*} \]

In our example the SD is 0.026
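The posterior SD follows from the variance formula. This is a small sketch with \( \sigma=.110 \) and \( \tau=.027 \); it also checks the algebraically equivalent form \( B\tau^2 \), which is a known identity and not something stated on the slide.

```python
import math

TAU, SIGMA = 0.027, 0.110

posterior_var = 1 / (1 / SIGMA**2 + 1 / TAU**2)
posterior_sd = math.sqrt(posterior_var)

# Equivalent form: var = B * tau^2, with B = sigma^2 / (sigma^2 + tau^2).
B = SIGMA**2 / (SIGMA**2 + TAU**2)
alt_var = B * TAU**2

half_width = 2 * posterior_sd  # half-width of a +/- 2 SD credible interval
print(f"posterior SD = {posterior_sd:.3f}, half-width = {half_width:.3f}")
```

The posterior SD is slightly smaller than the prior SD of .027, reflecting the small amount of information contributed by 20 at bats.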

Results

Month             At Bats   Hits   AVG
April             20        9      .450
May               26        11     .423
June              86        34     .395
July              83        17     .205
August            85        25     .294
September         50        10     .200
Total w/o April   330       97     .293

Frequentist confidence interval = .450 \( \pm \) 0.220

Empirical Bayes credible interval = .285 \( \pm \) 0.052

Actual = .293
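The comparison can be made explicit by recomputing the rest-of-season average from the monthly rows and checking it against both intervals. This is a sketch with the table values and interval half-widths copied from this section.

```python
# (month, at bats, hits) for May through September, from the results table
rest_of_season = [
    ("May", 26, 11),
    ("June", 86, 34),
    ("July", 83, 17),
    ("August", 85, 25),
    ("September", 50, 10),
]

total_ab = sum(ab for _, ab, _ in rest_of_season)
total_hits = sum(h for _, _, h in rest_of_season)
actual = total_hits / total_ab

intervals = {
    "frequentist": (0.450 - 0.220, 0.450 + 0.220),
    "empirical Bayes": (0.285 - 0.052, 0.285 + 0.052),
}
coverage = {}
for name, (lo, hi) in intervals.items():
    coverage[name] = lo <= actual <= hi
    print(f"{name}: [{lo:.3f}, {hi:.3f}], width {hi - lo:.3f}, "
          f"covers actual: {coverage[name]}")
```

Both intervals contain the rest-of-season average, but the empirical Bayes interval is centered much closer to it and is roughly a quarter of the width.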

Elections

The US Senate has 100 senators

Currently we have 55 Dems (includes 2 independents) versus 45 Republicans.

In this election, 36 Senate seats are being contested

[Figure: Senate map]

21 are held by Democrats and 15 by Republicans.

Aggregators

FiveThirtyEight

[Figure: FiveThirtyEight confidence intervals]

Average of polls by state with SEs

[Figure: average of polls by state with SEs]

Note: Kansas, New Hampshire and North Carolina

Poll results by state

[Figure: poll results by state]

Kansas (results by pollster)

60% R, 56% I, 62% I, 51% R, 72% I, 61% R, 61% R, Tie, Tie, Tie

[Figure: Kansas poll results by pollster]

Kansas (results by day)

[Figure: Kansas poll results by day]

New Hampshire

75% D, 81% D, 69% D, 60% D, 89% D, 84% D, 99% D, 3 Lean D

[Figure: New Hampshire poll results]

New Hampshire

[Figure: New Hampshire poll results. Different colors are different pollsters.]

North Carolina

81% D, 78% D, 73% D, 60% D, 75% D, 71% D, 96% D, Tie, Tie, Lean D

[Figure: North Carolina poll results]

North Carolina

[Figure: North Carolina poll results]

Confidence Interval

A closer look at NC

Mindless confidence interval:

[-4.3, -2.1]

Using only recent polls

[-3.8, -1.6]

Bayesian Analysis

Use previous North Carolina results as the prior: average 0 and SD 5

Two-level model: \[ \begin{eqnarray*} \theta &\sim& N(\mu, \tau^2) \\ Y | \theta &\sim& N(\theta, 0.5^2) \end{eqnarray*} \]

Observed \( Y = -2.4 \)
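The same posterior formulas used for batting averages apply here. This is a sketch assuming \( \mu=0 \) and \( \tau=5 \) from previous North Carolina results, \( \sigma=0.5 \), and \( Y=-2.4 \); the final probability relies on the normal model being adequate.

```python
import math

MU, TAU = 0.0, 5.0  # prior mean and SD from previous results
SIGMA = 0.5         # SD of the observed poll average
Y = -2.4            # observed poll average

B = SIGMA**2 / (SIGMA**2 + TAU**2)
posterior_mean = MU + (1 - B) * (Y - MU)
posterior_sd = math.sqrt(1 / (1 / SIGMA**2 + 1 / TAU**2))

# Posterior probability that theta is negative, via the normal CDF.
z = (0 - posterior_mean) / posterior_sd
prob_negative = 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(f"B = {B:.4f}")
print(f"posterior mean = {posterior_mean:.2f}, SD = {posterior_sd:.2f}")
print(f"Prob(theta < 0 | Y) = {prob_negative:.6f}")
```

With such a wide prior, \( B \) is near 0 and the posterior is driven almost entirely by the polls, which is one reason the choice of prior and of additional levels deserves scrutiny.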

How do we construct a prior? Are there other levels?