Rafael A Irizarry
September 11, 2014
If Y has mean 70 and SD 3, what proportion of Y > 76?
\[ \mbox{Pr}(Y > 76) = \int_{76}^{-\infty} \frac{1}{3\sqrt{2\pi}} \exp \left\{ -\frac{1}{2} \left( \frac{y-70}{3} \right)^2 \right\} \, dy \]
Easier for me to think of: \[ \Phi(z) = \int_{\infty}^{z} \frac{1}{\sqrt{2\pi}} \exp \left\{ -\frac{1}{2} y ^2 \right\} \, dy \]
Sean Lahman's Baseball Database provides a zip file with:
SchoolsPlayers.csv, SeriesPost.csv, Teams.csv, TeamsFranchises.csv, TeamsHalf.csv, AllstarFull.csv, Appearances.csv, AwardsManagers.csv, AwardsPlayers.csv, AwardsShareManagers.csv, AwardsSharePlayers.csv, Batting.csv, BattingPost.csv, Fielding.csv, FieldingOF.csv, FieldingPost.csv, HallOfFame.csv, Managers.csv, ManagersHalf.csv, Master.csv, Pitching.csv, PitchingPost.csv, readme2013.txt, Salaries.csv, Schools.csv
Python for Data Analysis provides pointers on how to import
Read the documentation. Column names are
yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV IPouts HA HRA BBA SOA E DP FP name park attendance BPF PPF teamIDBR teamIDlahman45 teamIDretro
We need these four
yearID teamID W L
1 1871 PH1 21 7
2 1871 CH1 19 9
3 1871 BS1 20 10
4 1871 WS3 15 15
5 1871 NY2 16 17
6 1871 TRO 13 15
7 1871 FW1 7 12
8 1871 CL1 10 19
9 1871 RC1 4 21
10 1872 BS1 39 8
2745 X 4
We need to add salary
yearID teamID W L salary
1 1871 PH1 21 7 ?
2 1871 CH1 19 9 ?
3 1871 BS1 20 10 ?
4 1871 WS3 15 15 ?
5 1871 NY2 16 17 ?
6 1871 TRO 13 15 ?
7 1871 FW1 7 12 ?
8 1871 CL1 10 19 ?
9 1871 RC1 4 21 ?
10 1872 BS1 39 8 ?
yearID teamID lgID playerID salary
20031 2009 NYA AL rodrial01 33000000
20846 2010 NYA AL rodrial01 33000000
21673 2011 NYA AL rodrial01 32000000
22525 2012 NYA AL rodrial01 30000000
23379 2013 NYA AL rodrial01 29000000
19179 2008 NYA AL rodrial01 28000000
21617 2011 LAA AL wellsve01 26187500
16687 2005 NYA AL rodrial01 26000000
23794 2013 PHI NL leecl02 25000000
23380 2013 NYA AL wellsve01 24642857
20847 2010 NYA AL sabatcc01 24285714
21674 2011 NYA AL sabatcc01 24285714
23381 2013 NYA AL sabatcc01 24285714
22469 2012 LAA AL wellsve01 24187500
20383 2009 LAN NL ramirma02 23854494
23956 X 5
Note: not by team
teamID yearID salary
513 NYA 2013 231978886
404 LAN 2013 223362196
505 NYA 2005 208306817
508 NYA 2008 207896789
510 NYA 2010 206333389
511 NYA 2011 202275028
509 NYA 2009 201449189
512 NYA 2012 196522289
506 NYA 2006 194663079
507 NYA 2007 189259045
Hadley Wickham (2011) Journal of Statistical Software
Through the years
teamID yearID salary
1 ANA 1997 31135472
2 ANA 1998 41281000
3 ANA 1999 55388166
4 ANA 2000 51464167
5 ANA 2001 47535167
yearID teamID W L
1 1871 PH1 21 7
2 1871 CH1 19 9
3 1871 BS1 20 10
4 1871 WS3 15 15
5 1871 NY2 16 17
yearID teamID W L salary
1 1997 ANA 84 78 31135472
2 1998 ANA 85 77 41281000
3 1999 ANA 70 92 55388166
4 2000 ANA 82 80 51464167
5 2001 ANA 75 87 47535167
Start by getting regions (from World Atlas)
Despite the name it is html not csv
Country Region
1 Algeria AFRICA
2 Angola AFRICA
3 Benin AFRICA
4 Botswana AFRICA
5 Burkina AFRICA
6 Burundi AFRICA
7 Cameroon AFRICA
8 Cape Verde AFRICA
9 Central African Republic AFRICA
10 Chad AFRICA
We use income per person data from Gapminder
Afghanistan Albania Algeria
1800 472.1 601.2 766.3
1801 472.1 601.8 766.2
1802 472.1 602.5 766.2
1803 472.1 603.1 766.2
1804 472.1 603.7 766.2
1805 472.1 604.3 766.2
1806 472.1 605.0 766.1
1807 472.1 605.6 766.1
1808 472.1 606.2 766.1
1809 472.1 606.9 766.1
First we map to continents. These “countries” don't map:
Abkhazia; Akrotiri and Dhekelia; American Samoa; Anguilla; Aruba; Bermuda; British Virgin Islands; Burkina Faso; Cayman Islands; Central African Rep.; Channel Islands; Christmas Island; Cocos Island; Congo, Dem. Rep.; Congo, Rep.; Cook Islands; Cote d'Ivoire; Czech Rep.; Czechoslovakia; Dominican Rep.; East Germany; Eritrea and Ethiopia; Faeroe Islands; Falkland Islands (Malvinas); French Guiana; French Polynesia; Gibraltar; Greenland; Guadeloupe; Guam; Guernsey; Holy See; Hong Kong, China; Isle of Man; Jersey; Korea, Dem. Rep.; Korea, Rep.; Korea, United; Kosovo; Macao, China; Macedonia, FYR; Martinique; Mayotte; Micronesia, Fed. Sts.; Montserrat; Myanmar; Nagorno-Karabakh; Netherlands Antilles; New Caledonia; Niue; Norfolk Island; Northern Cyprus; Northern Mariana Islands; Pitcairn; Puerto Rico; Reunion; Russia; Saint Barthélemy; Saint Helena; Saint Martin; Saint-Pierre-et-Miquelon; Serbia and Montenegro; Serbia excluding Kosovo; Slovak Republic; Somaliland; South Ossetia; Svalbard; Taiwan; Timor-Leste; Tokelau; Transnistria; Turks and Caicos Islands; USSR; Wallis et Futuna; West Bank and Gaza; West Germany; Western Sahara; Virgin Islands (U.S.); Yemen Arab Republic (Former); Yemen Democratic (Former); Yemen, Rep.; Yugoslavia; Åland
median IQR
AFRICA 800 456
ASIA 946 739
OCEANIA 1120 1026
NORTH AMERICA 1787 758
SOUTH AMERICA 2039 644
EUROPE 3143 2627