Exploring Data: Examples from baseball and global income

Rafael A Irizarry
September 11, 2014

Quick Review of Normal CDF

If Y has mean 70 and SD 3, what proportion of Y > 76?

\[ \mbox{Pr}(Y > 76) = \int_{76}^{-\infty} \frac{1}{3\sqrt{2\pi}} \exp \left\{ -\frac{1}{2} \left( \frac{y-70}{3} \right)^2 \right\} \, dy \]

Quick Review of normal CDF

Easier for me to think of: \[ \Phi(z) = \int_{\infty}^{z} \frac{1}{\sqrt{2\pi}} \exp \left\{ -\frac{1}{2} y ^2 \right\} \, dy \]

  • This is the formula for standard units: \( \mu=0 \) and \( \sigma=1 \)
  • Now Z=(76-70)/3 = 2
  • \( \mbox{Pr}(Z>2) = 1 - \Phi(2) \)
  • Memorize a python command for \( \Phi(z) \)

Baseball Database

Sean Lahman's Baseball Database provides a zip file with:

SchoolsPlayers.csv, SeriesPost.csv, Teams.csv, TeamsFranchises.csv, TeamsHalf.csv, AllstarFull.csv, Appearances.csv, AwardsManagers.csv, AwardsPlayers.csv, AwardsShareManagers.csv, AwardsSharePlayers.csv, Batting.csv, BattingPost.csv, Fielding.csv, FieldingOF.csv, FieldingPost.csv, HallOfFame.csv, Managers.csv, ManagersHalf.csv, Master.csv, Pitching.csv, PitchingPost.csv, readme2013.txt, Salaries.csv, Schools.csv

Python for Data Analysis provides pointers on how to import

Teams table

Read the documentation. Column names are

yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV IPouts HA HRA BBA SOA E DP FP name park attendance BPF PPF teamIDBR teamIDlahman45 teamIDretro

Teams table

We need these four

   yearID teamID  W  L
1    1871    PH1 21  7
2    1871    CH1 19  9
3    1871    BS1 20 10
4    1871    WS3 15 15
5    1871    NY2 16 17
6    1871    TRO 13 15
7    1871    FW1  7 12
8    1871    CL1 10 19
9    1871    RC1  4 21
10   1872    BS1 39  8
2745 X 4

Teams table

We need to add salary

   yearID teamID  W  L salary
1    1871    PH1 21  7      ?
2    1871    CH1 19  9      ?
3    1871    BS1 20 10      ?
4    1871    WS3 15 15      ?
5    1871    NY2 16 17      ?
6    1871    TRO 13 15      ?
7    1871    FW1  7 12      ?
8    1871    CL1 10 19      ?
9    1871    RC1  4 21      ?
10   1872    BS1 39  8      ?

Salaries

      yearID teamID lgID  playerID   salary
20031   2009    NYA   AL rodrial01 33000000
20846   2010    NYA   AL rodrial01 33000000
21673   2011    NYA   AL rodrial01 32000000
22525   2012    NYA   AL rodrial01 30000000
23379   2013    NYA   AL rodrial01 29000000
19179   2008    NYA   AL rodrial01 28000000
21617   2011    LAA   AL wellsve01 26187500
16687   2005    NYA   AL rodrial01 26000000
23794   2013    PHI   NL   leecl02 25000000
23380   2013    NYA   AL wellsve01 24642857
20847   2010    NYA   AL sabatcc01 24285714
21674   2011    NYA   AL sabatcc01 24285714
23381   2013    NYA   AL sabatcc01 24285714
22469   2012    LAA   AL wellsve01 24187500
20383   2009    LAN   NL ramirma02 23854494
23956 X 5

Note: not by team

Split-apply-combine

  • Split by team/year
  • Apply a sum to the salaries
  • Comgine back into a table
    teamID yearID    salary
513    NYA   2013 231978886
404    LAN   2013 223362196
505    NYA   2005 208306817
508    NYA   2008 207896789
510    NYA   2010 206333389
511    NYA   2011 202275028
509    NYA   2009 201449189
512    NYA   2012 196522289
506    NYA   2006 194663079
507    NYA   2007 189259045

Hadley Wickham (2011) Journal of Statistical Software

Boston Red Sox v New York Yankees

Through the years

plot of chunk unnamed-chunk-8

Merge

  teamID yearID   salary
1    ANA   1997 31135472
2    ANA   1998 41281000
3    ANA   1999 55388166
4    ANA   2000 51464167
5    ANA   2001 47535167
  yearID teamID  W  L
1   1871    PH1 21  7
2   1871    CH1 19  9
3   1871    BS1 20 10
4   1871    WS3 15 15
5   1871    NY2 16 17

Merge

  yearID teamID  W  L   salary
1   1997    ANA 84 78 31135472
2   1998    ANA 85 77 41281000
3   1999    ANA 70 92 55388166
4   2000    ANA 82 80 51464167
5   2001    ANA 75 87 47535167

Plot

plot of chunk unnamed-chunk-12

Adding a Regression Line

plot of chunk unnamed-chunk-13

Residuals

plot of chunk unnamed-chunk-14

Global Income Data

Start by getting regions (from World Atlas)

Despite the name it is html not csv

                    Country Region
1                   Algeria AFRICA
2                    Angola AFRICA
3                     Benin AFRICA
4                  Botswana AFRICA
5                   Burkina AFRICA
6                   Burundi AFRICA
7                  Cameroon AFRICA
8                Cape Verde AFRICA
9  Central African Republic AFRICA
10                     Chad AFRICA

Gapminder

We use income per person data from Gapminder

     Afghanistan Albania Algeria
1800       472.1   601.2   766.3
1801       472.1   601.8   766.2
1802       472.1   602.5   766.2
1803       472.1   603.1   766.2
1804       472.1   603.7   766.2
1805       472.1   604.3   766.2
1806       472.1   605.0   766.1
1807       472.1   605.6   766.1
1808       472.1   606.2   766.1
1809       472.1   606.9   766.1

Income distribution

plot of chunk unnamed-chunk-17

Income distribution

plot of chunk unnamed-chunk-18

Comparing Continents

First we map to continents. These “countries” don't map:

Abkhazia; Akrotiri and Dhekelia; American Samoa; Anguilla; Aruba; Bermuda; British Virgin Islands; Burkina Faso; Cayman Islands; Central African Rep.; Channel Islands; Christmas Island; Cocos Island; Congo, Dem. Rep.; Congo, Rep.; Cook Islands; Cote d'Ivoire; Czech Rep.; Czechoslovakia; Dominican Rep.; East Germany; Eritrea and Ethiopia; Faeroe Islands; Falkland Islands (Malvinas); French Guiana; French Polynesia; Gibraltar; Greenland; Guadeloupe; Guam; Guernsey; Holy See; Hong Kong, China; Isle of Man; Jersey; Korea, Dem. Rep.; Korea, Rep.; Korea, United; Kosovo; Macao, China; Macedonia, FYR; Martinique; Mayotte; Micronesia, Fed. Sts.; Montserrat; Myanmar; Nagorno-Karabakh; Netherlands Antilles; New Caledonia; Niue; Norfolk Island; Northern Cyprus; Northern Mariana Islands; Pitcairn; Puerto Rico; Reunion; Russia; Saint Barthélemy; Saint Helena; Saint Martin; Saint-Pierre-et-Miquelon; Serbia and Montenegro; Serbia excluding Kosovo; Slovak Republic; Somaliland; South Ossetia; Svalbard; Taiwan; Timor-Leste; Tokelau; Transnistria; Turks and Caicos Islands; USSR; Wallis et Futuna; West Bank and Gaza; West Germany; Western Sahara; Virgin Islands (U.S.); Yemen Arab Republic (Former); Yemen Democratic (Former); Yemen, Rep.; Yugoslavia; Åland

Comparing Continents

  • Split by continent
  • Summarize each
  • Join in a table or plot
              median  IQR
AFRICA           800  456
ASIA             946  739
OCEANIA         1120 1026
NORTH AMERICA   1787  758
SOUTH AMERICA   2039  644
EUROPE          3143 2627

Comparing Continents

plot of chunk unnamed-chunk-21

Comparing regions

plot of chunk unnamed-chunk-22