Analysis of the Olympics dataset
Datasets
The core datasets used for this task were taken from Kaggle.
This is a well-analysed dataset with over 100 kernels pinned against it on Kaggle. To enrich my analysis, and make it original, I also scraped some other datasets:
I used a dataset of capital city latitudes & longitudes from Techslides.
I manually created a new dataset of Olympic host city altitudes, latitudes & longitudes from mapcoordinates.
Running environment
I used a Jupyter Notebook (Python 3) to narrate my analysis.
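The geodesic-distance calculations rely on geopy. The install command itself did not survive the export; presumably a cell along these lines produced the output below:
!pip install geopy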
Requirement already satisfied: geopy in /Users/williamneedham/anaconda/lib/python3.6/site-packages (1.17.0)
Requirement already satisfied: geographiclib<2,>=1.49 in /Users/williamneedham/anaconda/lib/python3.6/site-packages (from geopy) (1.49)
# core imports used throughout the notebook
import unittest
import pandas as pd
import numpy as np
import seaborn as sns
from fuzzywuzzy import fuzz
from geopy.distance import geodesic  # vincenty is deprecated in newer geopy; geodesic is what the analysis uses
%pylab inline
%matplotlib inline
Contents
My analysis is laid out as follows:
Stage 1: Initial look at the data
Stage 2: Feature engineering
Stage 3: Research questions
Following a review of the data available and a brainstorming session, I decided to focus my efforts on the following areas:
Summary statistics to explore how athletes have changed over the years
Analysis of some interesting countries
Is there a correlation between Distance Travelled to the Games and performance, and has that changed since 1980?
Stage 4: Conclusions
Conclusions and further research options
...and finally - what about the Jamaican bobsleigh team?
Contact details
#Loading athlete data
data = pd.read_csv('athlete_events.csv', index_col = 'ID')
data.head(10)
A couple of early observations, upon loading the data:
NAs in the Medal column have been converted to NaN; this needs correcting.
The Team variable sometimes combines more than one country, which is interesting but unhelpful; I will merge with the NOC table to clean this up.
It looks like Games is just a concatenation of Year + Season; it should be removed, once checked.
There are some NaNs in the Height & Weight columns, which will need cleaning if those columns are used.
There are some interesting sports from way-back-when - including Tug-Of-War; maybe these would be interesting to look at!
Fill NaN values in Medal column with 'None'
#Replace NA in Medal column with None
data['Medal'] = data['Medal'].fillna("None")
data['Medal'].head()
String replacement: replacing incorrect/shortened country names in the NOC dataset
Later on in the analysis, I fuzzy-match country names to join with the capital cities metadata dataset. After doing that, I came back and updated this part of the script to include the replacements that the fuzzy-matching algorithm highlighted (i.e. before the NOC table is merged into the main data frame), then re-ran the script.
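The cell that loads the NOC table did not survive the export. A minimal sketch, assuming the standard noc_regions.csv file that ships with the same Kaggle dataset (columns NOC, region, notes):
# load the NOC -> region lookup table (assumed filename, from the same Kaggle dataset)
NOC = pd.read_csv('noc_regions.csv')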
NOC['region'].replace(['US', 'USA', 'UK', 'Saint Kitts', 'Trinidad', 'Boliva', 'Virgin Islands, US', 'Curacao'],
                      ['United States', 'United States', 'United Kingdom', 'Saint Kitts and Nevis',
                       'Trinidad and Tobago', 'Bolivia', 'US Virgin Islands', 'Curaçao'],
                      inplace=True)
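The merge of the cleaned NOC table into the main data frame is also missing from the export; a sketch of the likely cell, joining on the shared NOC code (the 'notes' column it brings in is dropped just below):
# merge the cleaned region names into the main data frame
data = data.merge(NOC, on='NOC', how='left')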
# before discarding, check that Games == Year + Season for all values, then discard
if data['Games'].equals(data['Year'].map(str) + " " + data['Season']):
    data = data.drop(['notes', 'Games'], axis=1)
else:
    data = data.drop(['notes'], axis=1)
We can see above that Games == Year + Season for all values, and so the column is removed from the dataset.
Review of null values in the dataset
# assign a list of (ColumnName, NullCount) tuples to null_list using list comprehension
null_list = [(col, data[col].isnull().sum()) for col in data.columns]
print(null_list)
We can see that only the Age, Height, Weight & the newly created region columns contain null values. My instinct tells me that this is because they didn't record age/height/weight measurements before a certain date. We'll have to investigate the region nulls a bit further.
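One quick way to sanity-check that instinct is to look at the share of missing measurements by decade; a minimal sketch (the decade bucketing is my own choice, not part of the original analysis):
# share of missing Height values by decade - if record-keeping improved over time,
# the early decades should dominate
data.groupby(data['Year'] // 10 * 10)['Height'].apply(lambda s: s.isnull().mean())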
Investigating the region nulls
We can see below that these are being caused by the Refugee Olympic Athletes Team, Unknowns and Tuvalu. I will remove these rows from the dataset.
#find out which points haven't merged properly
missing_region_datapoints = data[data.region.isnull()]
print(len(missing_region_datapoints))
missing_region_datapoints.sample(5)
# remove rows with NaN in the region column, using prints to check
print(len(data))
data = data[pd.notnull(data['region'])]
print(len(data))
Feature engineering
I chose to use external datasets to enrich the analysis and build novel feature sets. This section shows how I went about it. Here, most of the code is wrapped in unit tests to ensure the transformations have been carried out as expected.
Mapping host cities to host countries
Firstly, I created a map of host cities to host countries, which wasn't originally in the dataset. This will be used when looking at the 'home-advantage' hypothesis.
#manually created a dictionary map from city -> country
city_to_country = { 'Barcelona':'Spain', 'London':'UK', 'Antwerpen':'Belgium', 'Paris':'France', 'Calgary':'Canada',
'Albertville':'France', 'Lillehammer':'Norway', 'Los Angeles':'USA', 'Salt Lake City':'USA',
'Helsinki':'Finland', 'Lake Placid':'USA', 'Sydney':'Australia', 'Atlanta':'USA', 'Stockholm':'Sweden',
'Sochi':'Russia', 'Nagano':'Japan', 'Torino':'Italy', 'Beijing':'China', 'Rio de Janeiro':'Brazil', 'Athina':'Greece',
'Squaw Valley':'USA', 'Innsbruck':'Austria', 'Sarajevo': 'Bosnia and Herzegovina', 'Mexico City':'Mexico', 'Munich': 'Germany',
'Seoul': 'South Korea', 'Berlin': 'Germany', 'Oslo': 'Norway', "Cortina d'Ampezzo":'Italy', 'Melbourne': 'Australia', 'Roma': 'Italy',
'Amsterdam': 'Netherlands', 'Montreal': 'Canada', 'Moskva': 'Russia', 'Tokyo':'Japan', 'Vancouver':'Canada', 'Grenoble':'France',
'Sapporo':'Japan', 'Chamonix':'France', 'St. Louis':'USA', 'Sankt Moritz':'Switzerland',
'Garmisch-Partenkirchen':'Germany'}
# simple dictionary map
def map_cities_to_countries(data, dict_city_to_country):
    data['Host_Country'] = data['City'].map(dict_city_to_country)
    return data
# unit test the mapping to ensure it did not introduce nulls into the dataset.
class TestMapping(unittest.TestCase):

    # Create the unit test
    def test_mapping(self):
        map_cities_to_countries(data, city_to_country)
        # the mapping writes Host_Country, so that is the column to check
        count_na = data['Host_Country'].isnull().sum()
        # Test that the mapping has not introduced NA's
        self.assertEqual(0, count_na)

# Run the test
unittest.main(argv=['ignored', '-v'], exit=False)
test_mapping (__main__.TestMapping) ... ok
----------------------------------------------------------------------
Ran 1 test in 0.071s
# load host-city metadata; drop the first row as it duplicates the column headers
host_cities_metadata = pd.read_csv('hostcities.csv', names=['HostCity', 'HostLatitude', 'HostLongitude', 'HostAltitude'])
host_cities_metadata = host_cities_metadata.iloc[1:]
host_cities_metadata.head()
#merge the host cities metadata into the main data frame
data = pd.merge(data, host_cities_metadata, left_on='City', right_on='HostCity')
data.head(5)
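The check cell itself appears to be missing from the export; a minimal sketch of the kind of check meant here:
# sanity check: rows that failed to pick up host-city metadata (should be empty)
data[data['HostLatitude'].isnull()]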
The empty dataframe above shows the merge was successful.
This dataset contains CountryName, CapitalName, CapitalLatitude, CapitalLongitude, CountryCode (not used) and ContinentName. My plan is to merge this dataset on the athlete's 'region' variable, which we originally got from the NOC dataset. I'm doing this in preparation for answering the 'distance-travelled' research question; my hypothesis is that athletes who travel furthest to the Games are most disadvantaged, due to jet-lag/acclimatisation etc.
I appreciate that using the capital city lat/long of an athlete's country is not ideal (they may not live in the capital, or might live abroad etc.), but it serves as an OK proxy for this analysis.
#load in the data
names = ['CountryName', 'CapitalName', 'CapitalLatitude','CapitalLongitude', 'CountryCode', 'ContinentName']
capital_cities_metadata = pd.read_csv('country-capitals.csv', names = names)
#remove the first row as it's a duplicate of the column headers
capital_cities_metadata = capital_cities_metadata.iloc[1:]
# describe method to understand the shape of the dataset.
capital_cities_metadata.describe()
Before I merge the capital cities metadata into the main 'data' frame, I'm going to perform an outer join to create a new variable called data_new. In doing so, I can identify country names that are not joining properly.
This then becomes a fuzzy matching problem between two lists of strings from different sources.
# Create my merged data frame (outer join, so unmatched regions appear with NaN lat/longs)
data_new = data.merge(capital_cities_metadata,
                      left_on='region',
                      right_on='CountryName',
                      how='outer')
# Select the rows where the join failed to find a capital city
missing_latitude = data_new[data_new.CapitalLatitude.isnull()]
# Display a sample of the results
missing_latitude.reset_index(inplace=True, drop=True)
missing_latitude.sample(40)
missing_latitude['region'].unique()
array(['Ivory Coast', 'Saint Vincent', 'Individual Olympic Athletes',
'Virgin Islands, British', 'Antigua', 'Gambia', 'Brunei',
'Micronesia'], dtype=object)
Fuzzy matching of Country Names from the NOC dataset and the country capitals dataset
To facilitate as clean a merge as possible, I used a fuzzy matching algorithm from the fuzzywuzzy package. This highlighted inconsistencies between the strings, which I then dealt with earlier in the analysis (up the page) before re-running the script. This was because I needed to make the string replacements prior to merging the NOC and main data frames.
def match_name(name, list_names, min_score=0):
    # -1 score in case we don't get any matches
    max_score = -1
    # return an empty name if there is no match
    max_name = ""
    # iterate over all names in the other list
    for name2 in list_names:
        # find the fuzzy match score
        score = fuzz.ratio(name, name2)
        # check whether we are above our threshold and have a better score
        if (score > min_score) & (score > max_score):
            max_name = name2
            max_score = score
    return (max_name, max_score)
/Users/williamneedham/anaconda/lib/python3.6/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
# iterating over the regions that failed to match a capital city (found above)
dict_list = []  # collect the match results here
for name in missing_latitude.region.unique():
    # Use our method to find the best match; we can set a threshold here
    match = match_name(name, capital_cities_metadata.CountryName, 50)
    # New dict for storing the result
    dict_ = {}
    dict_.update({"country_name": name})
    dict_.update({"match_name": match[0]})
    dict_.update({"score": match[1]})
    dict_list.append(dict_)
merge_table = pd.DataFrame(dict_list)
# Display results
merge_table
#merge the capital cities metadata into the main data frame
data = pd.merge(data, capital_cities_metadata, left_on='region', right_on='CountryName')
data.head()
# remove rows with NaN in the region column, using prints to check
print(len(data))
data = data[pd.notnull(data.region)]
print(len(data))
Calculating how far each athlete has travelled to participate in the Games
Now that we have lat/long pairs for 1) where the athlete lives, and 2) where the Games were held, we can calculate the Vincenty distance between them (from GeoPy).
The Vincenty distance is the geodesic distance between two points on the Earth's surface, taking into account the Earth's elliptic shape. Note that newer versions of geopy deprecate vincenty in favour of geodesic (the function the code below actually calls), which computes the same ellipsoidal distance more robustly.
def calculate_vincenty_distance(distances):
    """
    First creates (Latitude, Longitude) tuples for both the athlete's home capital and the Olympic host city.
    Then calculates a new column 'vincenty_distance' using np.vectorize (as the distance function requires two inputs).
    Inputs:  distances - dataframe with cols = ['CapitalLatitude', 'CapitalLongitude', 'HostLatitude', 'HostLongitude']
    Outputs: same dataframe with new column ['vincenty_distance']
    """
    distances['AthleteLatLong'] = list(zip(distances.CapitalLatitude, distances.CapitalLongitude))
    distances['HostCityLatLong'] = list(zip(distances.HostLatitude, distances.HostLongitude))
    distances['vincenty_distance'] = np.vectorize(geodesic)(distances['AthleteLatLong'], distances['HostCityLatLong'])
    return distances
# create a new dataframe away from the original data; the result is concatenated back into the original dataframe later
distances = data.loc[:, ['CapitalLongitude', 'CapitalLatitude', 'HostLongitude', 'HostLatitude']].copy()
# normally I'd use one class for all unit tests, but I've created a new one here for code readability in a notebook.
class TestCalculations(unittest.TestCase):

    # Create the unit test
    def test_vincenty_calculation(self):
        # test tuples taken from the first row of the 'data' dataframe
        athleteLatLong0 = (39.91666667, 116.383333)  # athlete's home capital
        hostLatLong0 = (41.3828939, 2.1774322)       # host city
        # calculate the distance directly on the test data
        distanceTest = geodesic(athleteLatLong0, hostLatLong0).km
        # calculate the distance using our function calculate_vincenty_distance
        calculate_vincenty_distance(distances)
        # get the result for the first row (.km extracts a float from the Distance object)
        function_output = distances.loc[0, 'vincenty_distance'].km
        # Test that the results of the two methods are equal
        self.assertAlmostEqual(function_output, distanceTest, 5)

# Run the test
unittest.main(argv=['ignored', '-v'], exit=False)
test_vincenty_calculation (__main__.TestCalculations) ... ok
test_mapping (__main__.TestMapping) ... ok
----------------------------------------------------------------------
Ran 2 tests in 88.062s
# cast the vincenty distance from a Distance object (repr like '123.4 km') to a float, by stripping the ' km' suffix
distances['vincenty_distance'] = distances['vincenty_distance'].astype(str).str[:-3].astype(float)
data = pd.concat([data, distances['vincenty_distance']], axis=1)
#take a look at the results
data.head()
data.columns
Index(['Name', 'Sex', 'Age', 'Height', 'Weight', 'Team', 'NOC', 'Year',
'Season', 'City', 'Sport', 'Event', 'Medal', 'region', 'Host_Country',
'HostCity', 'HostLatitude', 'HostLongitude', 'HostAltitude',
'CountryName', 'CapitalName', 'CapitalLatitude', 'CapitalLongitude',
'CountryCode', 'ContinentName', 'vincenty_distance'],
dtype='object')
cols_to_keep = ['Name', 'Sex', 'Age', 'Team', 'Height', 'Weight', 'Year', 'Season',
'City', 'Sport', 'Event', 'Medal', 'region', 'Host_Country',
'CapitalName', 'ContinentName', 'vincenty_distance', 'HostAltitude']
clean_data = pd.DataFrame(data, columns=cols_to_keep)
clean_data.head()
clean_data.columns = ['Name', 'Sex', 'Age', 'Team', 'Height', 'Weight', 'Year', 'Season',
'City', 'Sport', 'Event', 'Medal', 'AthleteCountry', 'HostCountry',
'AthletesHomeCapital', 'AthletesHomeContinent', 'DistanceTravelled', 'HostCityAltitude']
clean_data.head()
# one-hot encode the Medal column into Bronze/Gold/None/Silver indicator columns
df_medals = pd.get_dummies(clean_data['Medal'])
# Join the dummy variables to the main dataframe
clean_data = pd.concat([clean_data, df_medals], axis=1)
pd.options.display.max_columns = None
clean_data.head()
clean_data['TotalMedals'] = clean_data['Gold'] + clean_data['Silver'] + clean_data['Bronze']
clean_data['AnyMedals'] = np.where(clean_data.loc[:,'Medal'] == 'None', 0, 1)
clean_data.sample(10)
Summary statistics to explore how athletes have changed over the years
In this question, I'm looking to explore how the sex, age, and number of competitors have changed over the last 120 years.
The changing profile of the Olympic athlete (part 1)
# Proportion of male vs female competitors per Games
ProportionMvW = data.groupby(['Year', 'Sex']).agg({'Name': 'count'})
percentages = ProportionMvW.groupby(level=0).apply(lambda x: 100 * x / float(x.sum()))
percentages = percentages.reset_index()
a4_dims = (11.7, 8.27)  # assumed value: A4 paper dimensions in inches (not defined in the surviving cells)
fig, ax = pyplot.subplots(figsize=a4_dims)
ax = sns.barplot(x="Year", y="Name", hue="Sex", data=percentages, palette="Set2")
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha="right")
ax.set_title('Proportion of men and women competing in the Olympic Games')
ax.set_ylabel('Percentage of athletes for each gender')
ax.legend(loc="upper right")
plt.show()
Insights:
Good news! The ratio of male to female competitors is slowly becoming more balanced - could 2020 be the year we see equal numbers of male and female competitors?
The changing profile of the Olympic athlete (part 2)
The visualisation below shows how the distribution of men's and women's ages has changed since 1896, split by Sex and Summer/Winter Olympics.
sns.set(rc={'figure.figsize': (20, 20)})
g = sns.FacetGrid(clean_data, col="Season", row="Sex", size=6)
g = g.map(sns.boxplot, "Year", "Age")
g.set_xticklabels(rotation=90)  # rotate the year labels for readability
show()
Insights:
Some 'athletes' in the 20th century were over 80 years old when they competed! Worth a quick look back at the data to see what they competed in (see the sketch after this list).
As might be expected, the variance in athletes' ages is greatly reduced in the modern era (with the median age being in the early 20s). Interestingly, the median age at the Winter Olympics seems to be higher.
There is higher variance at the Summer Olympics - I would hypothesise this is down to the wider variety of sports on offer there.
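As a quick follow-up to the first insight, a minimal sketch of that look back at the data (the 80-year cut-off simply mirrors the figure quoted above):
# which events did the 80+ year-old competitors take part in?
clean_data[clean_data['Age'] > 80][['Name', 'Age', 'Year', 'Sport', 'Event']].drop_duplicates()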
# create a table of Year, AthleteCountry, Name & Sex, to be aggregated into counts later
mini_country_table = clean_data[clean_data['Season'] == 'Summer'] \
    .loc[:, ['Year', 'AthleteCountry', 'Name', 'Sex']] \
    .drop_duplicates()
mini_country_table.head()
# Create a pivot table to count gender wise representation of each team in each year
CountAthletesByCountry = pd.pivot_table(mini_country_table,
index = ['Year', 'AthleteCountry'],
columns = 'Sex',
aggfunc = 'count').reset_index()
# rename columns as per column names in the 0th level
CountAthletesByCountry.columns = CountAthletesByCountry.columns.get_level_values(0)
# rename the columns appropriately
CountAthletesByCountry.columns = ['Year', 'AthleteCountry', 'Female_Athletes', 'Male_Athletes']
# get total athletes per team-year
CountAthletesByCountry['Total_Athletes'] = CountAthletesByCountry['Female_Athletes'] + \
CountAthletesByCountry['Male_Athletes']
# subset each country of interest (.copy() avoids pandas' SettingWithCopyWarning)
uk_athletes = CountAthletesByCountry[CountAthletesByCountry['AthleteCountry'] == "United Kingdom"].copy()
uk_athletes.fillna(0, inplace=True)
uk_athletes.set_index('Year', inplace=True)
swedish_athletes = CountAthletesByCountry[CountAthletesByCountry['AthleteCountry'] == "Sweden"].copy()
swedish_athletes.set_index('Year', inplace=True)
japanese_athletes = CountAthletesByCountry[CountAthletesByCountry['AthleteCountry'] == "Japan"].copy()
japanese_athletes.set_index('Year', inplace=True)
germany_athletes = CountAthletesByCountry[CountAthletesByCountry['AthleteCountry'] == "Germany"].copy()
germany_athletes.set_index('Year', inplace=True)
# Plot the values of male, female and total athletes using bar charts and the line charts.
fig, ((ax1, ax2), (ax3, ax4)) = subplots(nrows = 2, ncols = 2, figsize = (20, 20), sharey = True)
fig.subplots_adjust(hspace = 0.3)
# Plot team UK's contingent size
ax1.bar(uk_athletes.index.values, uk_athletes['Male_Athletes'], width = -1, align = 'edge', label = 'Male Athletes')
ax1.bar(uk_athletes.index.values, uk_athletes['Female_Athletes'], width = 1, align = 'edge', label = 'Female Athletes')
ax1.plot(uk_athletes.index.values, uk_athletes['Total_Athletes'], linestyle = ':', color = 'black', label = 'Total Athletes',
marker = 'o')
ax1.legend(loc="upper right")
ax1.set_title('UK Athletes :\nParticipation since 1896')
ax1.set_ylabel('Count of Athletes')
# Plot German athlete participation
ax2.bar(germany_athletes.index.values, germany_athletes['Male_Athletes'], width = -1, align = 'edge', label = 'Male Athletes')
ax2.bar(germany_athletes.index.values, germany_athletes['Female_Athletes'], width = 1, align = 'edge', label = 'Female Athletes')
ax2.plot(germany_athletes.index.values, germany_athletes['Total_Athletes'], linestyle = ':', color = 'black', label = 'Total Athletes',
marker = 'o')
ax2.set_title('Germany Athletes :\nParticipation since 1896')
ax2.legend(loc="upper left")
ax2.set_ylabel('Count of Athletes')
# Plot Japan's contingent size
ax3.bar(japanese_athletes.index.values, japanese_athletes['Male_Athletes'], width = -1, align = 'edge', label = 'Male Athletes')
ax3.bar(japanese_athletes.index.values, japanese_athletes['Female_Athletes'], width = 1, align = 'edge', label = 'Female Athletes')
ax3.plot(japanese_athletes.index.values, japanese_athletes['Total_Athletes'], linestyle = ':', color = 'black', label = 'Total Athletes',
marker = 'o')
ax3.set_title('Japanese Athletes :\nParticipation since 1896')
ax3.set_ylabel('Count of Athletes')
ax3.legend(loc="upper right")
# Plot team Swedens's contingent size
ax4.bar(swedish_athletes.index.values, swedish_athletes['Male_Athletes'], width = -1, align = 'edge', label = 'Male Athletes')
ax4.bar(swedish_athletes.index.values, swedish_athletes['Female_Athletes'], width = 1, align = 'edge', label = 'Female Athletes')
ax4.plot(swedish_athletes.index.values, swedish_athletes['Total_Athletes'], linestyle = ':', color = 'black', label = 'Total Athletes',
marker = 'o')
ax4.set_title('Swedish Athletes :\nParticipation since 1896')
ax4.set_ylabel('Count of Athletes')
ax4.legend(loc="upper left")
show()
Insights:
Overall gender balance has been improving slowly since the mid 20th century and is nearly equal for all countries considered.
Swedish female participation has risen above male participation in the last 3 Olympic Games.
UK participation peaked in 2012 - when the UK hosted in London! This peak was mirrored by German participation in the 1972 Games, held in West Germany.
The 1916, 1940 and 1944 Games were cancelled due to the World Wars.
Is there a correlation between Distance Travelled to the Games and performance, and has that changed since 1980?
This is where we bring in the engineered feature 'DistanceTravelled' - the Vincenty distance between an athlete's home capital and where the Games took place. My hypothesis is that distance travelled has an impact on the number of medals a country wins (due to physiological factors like jet-lag, acclimatisation etc.); I further hypothesise that this effect will be reduced in the modern era (I've chosen pre/post 1980 as the split).
To investigate these hypotheses, I've split the dataset (pre and post 1980) and calculated the correlation between distance travelled and medals won. Obviously, whether a country has a high medal count is a multivariate problem and would require a more in-depth analysis in practice.
# Create a new dataset to be merged with the medal tally dataset below
subset_distancetravelled = clean_data.loc[:, ['Year', 'AthleteCountry', 'DistanceTravelled']].drop_duplicates()
medal_tally = clean_data.groupby(['Year','AthleteCountry'])['TotalMedals'].agg('sum').reset_index()
medal_tally_byDistance = medal_tally.merge(subset_distancetravelled,
left_on = ['Year', 'AthleteCountry'],
right_on = ['Year', 'AthleteCountry'],
how = 'left')
#create two datasets to compare correlations
Pre_1980_MedalTallyByDistance = medal_tally_byDistance[medal_tally_byDistance['Year']<=1980]
Post_1980_MedalTallyByDistance = medal_tally_byDistance[medal_tally_byDistance['Year']>1980]
medal_tally_byDistance.head()
selected_rows = Pre_1980_MedalTallyByDistance['TotalMedals'] > 0
correlation = Pre_1980_MedalTallyByDistance.loc[selected_rows, ['DistanceTravelled', 'TotalMedals']].corr().loc['DistanceTravelled', 'TotalMedals']
plot(Pre_1980_MedalTallyByDistance.loc[selected_rows, 'DistanceTravelled'],
Pre_1980_MedalTallyByDistance.loc[selected_rows, 'TotalMedals'] ,
linestyle = 'none',
marker = 'o',
alpha = 0.4)
xlabel('Distance Travelled')
ylabel('Number of Medals')
title('PRE-1980 - Distance travelled versus medal tally')
text(np.nanpercentile(Pre_1980_MedalTallyByDistance['DistanceTravelled'], 99.6),
max(Pre_1980_MedalTallyByDistance['TotalMedals']) - 50,
"Correlation = " + str(correlation))
selected_rows = Post_1980_MedalTallyByDistance['TotalMedals'] > 0
correlation = Post_1980_MedalTallyByDistance.loc[selected_rows, ['DistanceTravelled', 'TotalMedals']].corr().loc['DistanceTravelled', 'TotalMedals']
plot(Post_1980_MedalTallyByDistance.loc[selected_rows, 'DistanceTravelled'],
Post_1980_MedalTallyByDistance.loc[selected_rows, 'TotalMedals'] ,
linestyle = 'none',
marker = 'o',
alpha = 0.4)
xlabel('Distance Travelled')
ylabel('Number of Medals')
title('POST-1980 - Distance travelled versus medal tally')
text(np.nanpercentile(Post_1980_MedalTallyByDistance['DistanceTravelled'], 99.6),
max(Post_1980_MedalTallyByDistance['TotalMedals']) - 50,
"Correlation = " + str(correlation))
Insights:
Negative correlations were recorded for both the Pre-1980 and Post-1980 datasets, suggesting that a greater distance travelled is associated with a lower medal haul.
The correlation coefficient was -0.1465 for pre-1980 and -0.0615 for post-1980. This supports my hypothesis that the distance-travelled effect is weaker in the modern era (where flights are cheaper and more comfortable, and athletes often spend time preparing in the host country beforehand).
Admittedly, though, both coefficients are small, and as such the DistanceTravelled effect is weaker than I thought it would be.
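Since both coefficients are small, a sensible follow-up is to check whether they are statistically distinguishable from zero. A minimal sketch, assuming scipy is available (it is not used elsewhere in this notebook):
from scipy.stats import pearsonr
# correlation and p-value for the post-1980 subset (repeat for pre-1980)
subset = Post_1980_MedalTallyByDistance.loc[Post_1980_MedalTallyByDistance['TotalMedals'] > 0,
                                            ['DistanceTravelled', 'TotalMedals']].dropna()
r, p = pearsonr(subset['DistanceTravelled'], subset['TotalMedals'])
print("r = %.3f, p = %.3g" % (r, p))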
Conclusions and further research options
This was an enjoyable task! I focused my efforts on data engineering and manufactured some novel features. The following insights were of most interest to me:
it is good to see gender balance finally becoming a reality (after over 120 years of competition!),
following high variance in the early years, most athletes are now in their early 20s (with a slightly higher median age at the Winter Olympics),
of the countries selected for this study (UK, Germany, Sweden & Japan), Germany sends the most athletes on average each year.
Given more time, I would like to:
add more summary statistics,
explore covariance,
build predictive models (i.e. predicting the medals table at the next Games),
conduct a clustering analysis to see what separates the best athletes from merely OK athletes.
jamaicans = clean_data[(clean_data['AthleteCountry'] == 'Jamaica') & (clean_data['Sport'] == 'Bobsleigh') ]
jamaican_grouped = jamaicans.groupby(['Year','City'])['TotalMedals'].agg('sum').reset_index()
jamaican_grouped
new_data = {'Year': [2018, 2022], 'City': ["Pyeongchang", "Beijing"], 'TotalMedals': [0, np.nan]}
future_results = pd.DataFrame(new_data)
combined = pd.concat([jamaican_grouped, future_results])
combined
plt.bar(combined['Year'], combined['TotalMedals'])
plt.title('Jamaican bobsleigh team: medals won by Games', fontsize=20)
plt.xlabel('Olympic Year', fontsize=18)
plt.ylabel('Number of medals won', fontsize=16)
plt.ylim((0, 10))
plt.xticks(combined['Year'])
plt.show()