Rafael A. Irizarry
October 16, 2014
\[ \mbox{Prob}(+|D)=0.99, \mbox{Prob}(-|\mbox{no } D)=0.99 \]
If we select a random person and they test positive, what is the probability that they have the disease?
We write this as \( \mbox{Prob}(D|+) \).
The cystic fibrosis rate is 1 in 3,900, so \( \mbox{Prob}(D) \approx 0.00025 \).
\[ \mbox{Pr}(A|B) = \frac{\mbox{Pr}(B|A)\mbox{Pr}(A)}{\mbox{Pr}(B)} \]
\[ \begin{eqnarray*} \mbox{Prob}(D|+) & = & \frac{ \mbox{Prob}(+|D) \cdot \mbox{Prob}(D)} {\mbox{Prob}(+)} \\ & = & \frac{\mbox{Prob}(+|D)\cdot \mbox{Prob}(D)} {\mbox{Prob}(+|D) \cdot \mbox{Prob}(D) + \mbox{Prob}(+|\mbox{no } D) \cdot \mbox{Prob}(\mbox{no } D)} \\ & = & \frac{0.99 \cdot 0.00025}{0.99 \cdot 0.00025 + 0.01 \cdot 0.99975} \\ & = & 0.02 \;\;\; \mbox{not} \; \; \; 0.99 \end{eqnarray*} \]
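The arithmetic can be checked with a few lines of Python. This is a minimal sketch: the function name `posterior_prob` and its argument names are illustrative choices, and the prevalence is entered as 1/3900 rather than as a rounded decimal.

```python
def posterior_prob(sensitivity, specificity, prevalence):
    """Prob(D | +) from Bayes' rule for a binary test."""
    # Prob(+) = Prob(+|D) Prob(D) + Prob(+|no D) Prob(no D)
    false_positive_rate = 1.0 - specificity
    prob_positive = (sensitivity * prevalence
                     + false_positive_rate * (1.0 - prevalence))
    return sensitivity * prevalence / prob_positive


# Test accuracy and disease rate as stated in the text.
p = posterior_prob(sensitivity=0.99, specificity=0.99, prevalence=1 / 3900)
print(round(p, 2))  # prints 0.02
```

Even with a 99% accurate test, the disease is so rare that most positive results come from people without it.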
\( \mbox{Prob}(A|B) \) = % red bottom left, \( \mbox{Prob}(A) \) = % red on top, \( \mbox{Prob}(B|A) \) = % not X, \( \mbox{Prob}(B) \) = total points bottom left
José Iglesias 2013
Month | At Bats | Hits | AVG |
---|---|---|---|
April | 20 | 9 | .450 |
What is your prediction for his average in October?
Note: No one has finished a season batting .400 since Ted Williams in 1941!
Across all players with more than 500 at bats (AB) in 2010, 2011, and 2012, the average is .275 and the SD is 0.027.
Should we trade him?
What is the SE of our estimate?
\[ \sqrt{\frac{.450 (1-.450)}{20}}=.111 \]
Confidence interval?
.450-.222 to .450+.222 = .228 to .672
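The standard error and interval above follow from treating the 20 at bats as independent trials with a common success probability. A minimal sketch of that calculation, where the function name `proportion_interval` and the choice of two standard errors are illustrative assumptions:

```python
import math


def proportion_interval(hits, at_bats, n_se=2.0):
    """Observed average, its standard error, and avg +/- n_se * SE."""
    avg = hits / at_bats
    se = math.sqrt(avg * (1.0 - avg) / at_bats)
    return avg, se, (avg - n_se * se, avg + n_se * se)


# April numbers for Iglesias: 9 hits in 20 at bats.
avg, se, (lower, upper) = proportion_interval(hits=9, at_bats=20)
print(f"AVG={avg:.3f} SE={se:.3f} interval=({lower:.3f}, {upper:.3f})")
```

With only 20 at bats the interval is so wide that it tells us very little about his true ability.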
Pick a random player. What is their batting average?
\[ \begin{eqnarray*} \theta &\sim& N(\mu, \tau^2) \mbox{ is called a prior}\\ Y | \theta &\sim& N(\theta, \sigma^2) \mbox{ is called a sampling distribution} \end{eqnarray*} \]
Two levels of variability: player-to-player variability in true ability \( \theta \) (the prior) and chance variability in the observed average \( Y \) given \( \theta \) (the sampling distribution).
Here are the equations with our data:
\[ \begin{eqnarray*} \theta &\sim& N(.275, .027^2) \\ Y | \theta &\sim& N(\theta, .110^2) \end{eqnarray*} \]
The continuous version of Bayes' rule can be used here:
\[ \begin{eqnarray*} f_{\theta|Y}(\theta|Y)&=&\frac{f_{Y|\theta}(Y|\theta) f_{\theta}(\theta)}{f_Y(Y)}\\ &=&\frac{f_{Y|\theta}(Y|\theta) f_{\theta}(\theta)}{\int_{\theta}f_{Y|\theta}(Y|\theta)f_{\theta}(\theta)\,d\theta} \end{eqnarray*} \]
We are particularly interested in the \( \theta \) that maximizes \( f_{\theta|Y}(\theta|Y) \).
In our case, the posterior distribution can be shown to be normal, so this maximizer is its average, \( \mbox{E}(\theta|Y) \).
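As a numerical check of the continuous Bayes rule, the posterior can be approximated on a grid using the prior and sampling distribution values given above for José Iglesias. This is only a sketch: the grid over [0, 1], the step size, and the helper `normal_pdf` are illustrative choices, and the integral in the denominator is replaced by a Riemann sum.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


mu, tau = 0.275, 0.027    # prior mean and SD
y, sigma = 0.450, 0.110   # observed April average and its SE

step = 0.0001
grid = [i * step for i in range(10001)]  # candidate theta values in [0, 1]

# Numerator of Bayes' rule at each grid point: f(y | theta) * f(theta)
unnormalized = [normal_pdf(y, t, sigma) * normal_pdf(t, mu, tau) for t in grid]

# Denominator f_Y(y): Riemann-sum approximation of the integral over theta
f_y = sum(unnormalized) * step
posterior = [u / f_y for u in unnormalized]

# Grid point with the highest posterior density, and the posterior mean
mode = grid[max(range(len(grid)), key=posterior.__getitem__)]
mean = sum(t * p for t, p in zip(grid, posterior)) * step
print(round(mode, 3), round(mean, 3))
```

The maximizer and the mean agree to three decimals, as expected for a symmetric, bell-shaped posterior.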
We can show the average of this distribution is the following:
\[ \begin{eqnarray*} \mbox{E}(\theta|Y) &=& B \mu + (1-B) Y\\ &=& \mu + (1-B)(Y-\mu)\\ B &=& \frac{\sigma^2}{\sigma^2+\tau^2} \end{eqnarray*} \]
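A sketch of why this holds: multiply the two normal densities, drop every factor that does not depend on \( \theta \), and collect terms in \( \theta \):

\[ \begin{eqnarray*} f_{\theta|Y}(\theta|Y) &\propto& \exp\left\{ -\frac{(Y-\theta)^2}{2\sigma^2} - \frac{(\theta-\mu)^2}{2\tau^2} \right\} \\ &\propto& \exp\left\{ -\frac{1}{2}\left(\frac{1}{\sigma^2}+\frac{1}{\tau^2}\right)\theta^2 + \left(\frac{Y}{\sigma^2}+\frac{\mu}{\tau^2}\right)\theta \right\} \end{eqnarray*} \]

This is the kernel of a normal density in \( \theta \), and its mean is the ratio of the linear coefficient to the quadratic one:

\[ \frac{Y/\sigma^2+\mu/\tau^2}{1/\sigma^2+1/\tau^2} = \frac{\sigma^2}{\sigma^2+\tau^2}\,\mu + \frac{\tau^2}{\sigma^2+\tau^2}\,Y = B\mu + (1-B)Y \]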
In the case of José Iglesias, we have:
\[ \begin{eqnarray*} E(\theta | Y=.450) &=& B \times .275 + (1 - B) \times .450 \\ &=& .275 + (1 - B)(.450 - .275) \\ B &=&\frac{.110^2}{.110^2 + .027^2} = 0.943\\ E(\theta | Y=.450) &\approx& .285 \end{eqnarray*} \]
The variance can be shown to be:
\[ \begin{eqnarray*} \mbox{var}(\theta|Y) &=& \frac{1}{1/\sigma^2+1/\tau^2} \\ &=& \frac{1}{1/.110^2 + 1/.027^2} \end{eqnarray*} \]
In our example the SD is 0.026
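The closed-form posterior mean and SD can be computed directly from the four numbers in the model. A minimal sketch, where the function name `shrink` and its argument names are illustrative:

```python
import math


def shrink(y, sigma, mu, tau):
    """Shrinkage factor B, posterior mean, and posterior SD of theta
    for the model theta ~ N(mu, tau^2), Y | theta ~ N(theta, sigma^2)."""
    b = sigma**2 / (sigma**2 + tau**2)
    post_mean = b * mu + (1.0 - b) * y
    post_sd = math.sqrt(1.0 / (1.0 / sigma**2 + 1.0 / tau**2))
    return b, post_mean, post_sd


# Iglesias in April: observed .450 with SE .110; league prior .275 with SD .027.
b, post_mean, post_sd = shrink(y=0.450, sigma=0.110, mu=0.275, tau=0.027)
print(f"B={b:.3f} mean={post_mean:.3f} SD={post_sd:.3f}")
# prints B=0.943 mean=0.285 SD=0.026
```

Because B is close to 1, the estimate is pulled almost all the way back to the league average.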
Month | At Bats | Hits | AVG |
---|---|---|---|
April | 20 | 9 | .450 |
May | 26 | 11 | .423 |
June | 86 | 34 | .395 |
July | 83 | 17 | .205 |
August | 85 | 25 | .294 |
September | 50 | 10 | .200 |
Total w/o April | 330 | 97 | .293 |
Frequentist confidence interval = .450 \( \pm \) 0.220
Empirical Bayes credible interval = .285 \( \pm \) 0.052
Actual = .293
The US Senate has 100 senators.
Currently there are 55 Democrats (including 2 independents) and 45 Republicans.
In this election, 36 Senate seats are being contested.
21 are held by Democrats and 15 by Republicans.
Note: Kansas, New Hampshire and North Carolina
60% R, 56% I, 62% I, 51% R, 72% I, 61% R, 61% R, Tie, Tie, Tie
75% D, 81% D, 69% D, 60% D, 89% D, 84% D, 99% D, 3 Lean D
Different colors are different pollsters
81% D, 78% D, 73% D, 60% D, 75% D, 71% D, 96% D, Tie, Tie, Lean D
A closer look at NC
Mindless confidence interval:
[-4.3, -2.1]
Using only recent polls:
[-3.8, -1.6]
Using previous North Carolina results as the prior: average 0 and SD 5.
Two-level model: \[ \begin{eqnarray*} \theta &\sim& N(\mu, \tau^2) \\ Y | \theta &\sim& N(\theta, 0.5^2) \end{eqnarray*} \]
Observed \( Y = -2.4 \)
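Plugging these numbers into the two-level normal model gives a posterior for \( \theta \). The sketch below assumes the prior is \( \mu = 0 \), \( \tau = 5 \) (the previous North Carolina results) and takes no position on which party a negative difference favors; the variable names are illustrative.

```python
import math

mu, tau = 0.0, 5.0       # assumed prior: average 0 and SD 5
y, sigma = -2.4, 0.5     # observed poll difference and its SE

# Shrinkage factor, posterior mean, and posterior SD
b = sigma**2 / (sigma**2 + tau**2)
post_mean = mu + (1.0 - b) * (y - mu)
post_sd = math.sqrt(1.0 / (1.0 / sigma**2 + 1.0 / tau**2))

# Posterior probability that theta is below zero, using the normal CDF
# written in terms of the error function.
prob_negative = 0.5 * (1.0 + math.erf((0.0 - post_mean) / (post_sd * math.sqrt(2.0))))

print(f"B={b:.4f} mean={post_mean:.2f} SD={post_sd:.2f}")
print(f"Prob(theta < 0 | Y) = {prob_negative:.4f}")
```

Here the prior is much wider than the sampling distribution, so B is near 0 and the posterior stays close to the observed value. This is the opposite of the batting average example.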
How do we construct a prior? Are there other levels?