Exploring Data: Examples from baseball and global income

Rafael A Irizarry
September 11, 2014

Quick Review of Normal CDF

If Y has mean 70 and SD 3, what proportion of Y > 76?

\[ \mbox{Pr}(Y > 76) = \int_{76}^{-\infty} \frac{1}{3\sqrt{2\pi}} \exp \left\{ -\frac{1}{2} \left( \frac{y-70}{3} \right)^2 \right\} \, dy \]

Quick Review of normal CDF

Easier for me to think of: \[ \Phi(z) = \int_{\infty}^{z} \frac{1}{\sqrt{2\pi}} \exp \left\{ -\frac{1}{2} y ^2 \right\} \, dy \]

  • This is the formula for standard units: \( \mu=0 \) and \( \sigma=1 \)
  • Now Z=(76-70)/3 = 2
  • \( \mbox{Pr}(Z>2) = 1 - \Phi(2) \)
  • Memorize a python command for \( \Phi(z) \)

Baseball Database

Sean Lahman's Baseball Database provides a zip file with:

SchoolsPlayers.csv, SeriesPost.csv, Teams.csv, TeamsFranchises.csv, TeamsHalf.csv, AllstarFull.csv, Appearances.csv, AwardsManagers.csv, AwardsPlayers.csv, AwardsShareManagers.csv, AwardsSharePlayers.csv, Batting.csv, BattingPost.csv, Fielding.csv, FieldingOF.csv, FieldingPost.csv, HallOfFame.csv, Managers.csv, ManagersHalf.csv, Master.csv, Pitching.csv, PitchingPost.csv, readme2013.txt, Salaries.csv, Schools.csv

Python for Data Analysis provides pointers on how to import

Teams table

Read the documentation. Column names are

yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV IPouts HA HRA BBA SOA E DP FP name park attendance BPF PPF teamIDBR teamIDlahman45 teamIDretro

Teams table

We need these four

   yearID teamID  W  L
1    1871    PH1 21  7
2    1871    CH1 19  9
3    1871    BS1 20 10
4    1871    WS3 15 15
5    1871    NY2 16 17
6    1871    TRO 13 15
7    1871    FW1  7 12
8    1871    CL1 10 19
9    1871    RC1  4 21
10   1872    BS1 39  8
2745 X 4

Teams table

We need to add salary

   yearID teamID  W  L salary
1    1871    PH1 21  7      ?
2    1871    CH1 19  9      ?
3    1871    BS1 20 10      ?
4    1871    WS3 15 15      ?
5    1871    NY2 16 17      ?
6    1871    TRO 13 15      ?
7    1871    FW1  7 12      ?
8    1871    CL1 10 19      ?
9    1871    RC1  4 21      ?
10   1872    BS1 39  8      ?

Salaries

      yearID teamID lgID  playerID   salary
20031   2009    NYA   AL rodrial01 33000000
20846   2010    NYA   AL rodrial01 33000000
21673   2011    NYA   AL rodrial01 32000000
22525   2012    NYA   AL rodrial01 30000000
23379   2013    NYA   AL rodrial01 29000000
19179   2008    NYA   AL rodrial01 28000000
21617   2011    LAA   AL wellsve01 26187500
16687   2005    NYA   AL rodrial01 26000000
23794   2013    PHI   NL   leecl02 25000000
23380   2013    NYA   AL wellsve01 24642857
20847   2010    NYA   AL sabatcc01 24285714
21674   2011    NYA   AL sabatcc01 24285714
23381   2013    NYA   AL sabatcc01 24285714
22469   2012    LAA   AL wellsve01 24187500
20383   2009    LAN   NL ramirma02 23854494
23956 X 5

Note: not by team

Split-apply-combine

  • Split by team/year
  • Apply a sum to the salaries
  • Comgine back into a table
    teamID yearID    salary
513    NYA   2013 231978886
404    LAN   2013 223362196
505    NYA   2005 208306817
508    NYA   2008 207896789
510    NYA   2010 206333389
511    NYA   2011 202275028
509    NYA   2009 201449189
512    NYA   2012 196522289
506    NYA   2006 194663079
507    NYA   2007 189259045

Hadley Wickham (2011) Journal of Statistical Software

Boston Red Sox v New York Yankees

Through the years

plot of chunk unnamed-chunk-8

Merge

  teamID yearID   salary
1    ANA   1997 31135472
2    ANA   1998 41281000
3    ANA   1999 55388166
4    ANA   2000 51464167
5    ANA   2001 47535167
  yearID teamID  W  L
1   1871    PH1 21  7
2   1871    CH1 19  9
3   1871    BS1 20 10
4   1871    WS3 15 15
5   1871    NY2 16 17

Merge

  yearID teamID  W  L   salary
1   1997    ANA 84 78 31135472
2   1998    ANA 85 77 41281000
3   1999    ANA 70 92 55388166
4   2000    ANA 82 80 51464167
5   2001    ANA 75 87 47535167

Plot

plot of chunk unnamed-chunk-12

Adding a Regression Line

plot of chunk unnamed-chunk-13

Residuals