
Analysis of the Olympics dataset

Datasets

The core datasets used for this task were taken from Kaggle.

This is a well-analysed dataset, with over 100 kernels pinned against it on Kaggle. To enrich my analysis and make it original, I also scraped some other datasets:

  • I used a dataset of capital city latitudes & longitudes from Techslides.
  • I manually created a new dataset of Olympic host city altitudes, latitudes & longitudes from mapcoordinates.

    Running environment

    I used a Jupyter Notebook (Python 3) to narrate my analysis.

    !pip install geopy

    Requirement already satisfied: geopy in /Users/williamneedham/anaconda/lib/python3.6/site-packages (1.17.0)
    Requirement already satisfied: geographiclib<2,>=1.49 in /Users/williamneedham/anaconda/lib/python3.6/site-packages (from geopy) (1.49)
    import unittest
    import pandas as pd
    import numpy as np
    import seaborn as sns
    from geopy.distance import geodesic   # note: geopy's vincenty is deprecated; geodesic is used throughout
    %pylab inline
    

    Contents

    My analysis is laid out as follows:

    Stage 1: Initial look at the data

    Stage 2: Feature engineering

    Stage 3: Research questions

    Following a review of the data available and a brainstorming session, I decided to focus my efforts on the following areas:

  • Summary statistics to explore how athletes have changed over the years
  • Analysis of some interesting countries
  • Is there a correlation between Distance Travelled to the Games and performance, and has that changed since 1980?

    Stage 4: Conclusions

    Conclusions and further research options ...and finally - what about the Jamaican bobsleigh team?

    Contact details

    #Loading athlete data
    data = pd.read_csv('athlete_events.csv', index_col = 'ID')
    data.head(10)
    

    A couple of early observations, upon loading the data:

  • NAs in the Medal column have been converted to NaN; this needs correcting.
  • The Team variable sometimes combines more than one country, which is interesting but unhelpful; I will merge with the NOC table to clean this up.
  • It looks like Games is just a concatenation of Year + Season; it should be removed, once checked.
  • There are some NaNs in the Height & Weight columns, which will need cleaning if these columns are used.
  • There are some interesting sports from way back when - including Tug-Of-War - maybe these would be interesting to look at!

    Fill NaN values in the Medal column with 'None'

    #Replace NA in Medal column with None
    data['Medal'] = data['Medal'].fillna("None")
    data['Medal'].head()
    

    String replacement: replacing incorrect/shortened country names in the NOC dataset

    Later in the analysis, I fuzzy-match country names to join with the capital cities metadata dataset. After doing that, I came back and updated this part of the script to apply the replacements the algorithm highlighted - i.e. before the NOC table is merged into the main data frame.

    NOC['region'].replace(['US', 'USA', 'UK', 'Saint Kitts', 'Trinidad', 'Boliva', 'Virgin Islands, US', 'Curacao'], 
                          ['United States', 'United States', 'United Kingdom', 'Saint Kitts and Nevis', 
                           'Trinidad and Tobago', 'Bolivia', 'US Virgin Islands', 'Curaçao'], 
                          inplace = True)
    
    #before discarding, check to see if Games is in Year + Season for all values, then discard. 
    if (data['Games'].equals(data['Year'].map(str) + " " + data['Season'])):
        data = data.drop(['notes', 'Games'], axis=1)
    else: 
        data = data.drop(['notes'], axis=1)
    

    We can see above that Games == Year + Season for all values, and so is removed from the dataset.

    Review of null values in the dataset

    # assign a list of (ColumnName, NullCount) tuples to null_list using list comprehension
    null_list = [(col, data[col].isnull().sum()) for col in data.columns]
    print(null_list)
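    # equivalent one-liner using the built-in pandas method (a minimal alternative):
    # data.isnull().sum()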
    

    We can see that only the Age, Height, Weight and the newly created region columns contain null values. My instinct tells me that this is because age/height/weight measurements simply weren't recorded before a certain date. We'll have to investigate the region nulls a bit further.
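
    As a quick sanity check on that instinct (a sketch; it simply groups the null share of Height by Games year, using the columns already in data):

    # fraction of rows with missing Height, per Games year
    height_null_share = data['Height'].isnull().groupby(data['Year']).mean()
    height_null_share.head(10)   # the earliest Games should show a much higher null share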

    Investigating the region nulls

    We can see below that these are being caused by the Refugee Olympic Athletes Team, Unknowns and Tuvalu. I will remove these rows from the dataset.

    #find out which points haven't merged properly
    missing_region_datapoints = data[data.region.isnull()]
    print(len(missing_region_datapoints))
    missing_region_datapoints.sample(5)
    
    # remove rows with NaN in the region column, using prints to check
    print(len(data))
    data = data[pd.notnull(data['region'])]
    print(len(data))
    

    Feature engineering

    I chose to use external datasets to enrich the analysis and build novel feature sets. This section shows how I went about it. Here, most of the code is wrapped in unit tests to ensure the transformations have been carried out as expected.

    Mapping host cities to host countries

    First, I created a map of host cities to host countries, which wasn't originally in the dataset. This will be used when looking at the 'home advantage' hypothesis.

    #manually created a dictionary map from city -> country 
    city_to_country = { 'Barcelona':'Spain', 'London':'UK', 'Antwerpen':'Belgium', 'Paris':'France', 'Calgary':'Canada',
           'Albertville':'France', 'Lillehammer':'Norway', 'Los Angeles':'USA', 'Salt Lake City':'USA',
           'Helsinki':'Finland', 'Lake Placid':'USA', 'Sydney':'Australia', 'Atlanta':'USA', 'Stockholm':'Sweden',
           'Sochi':'Russia', 'Nagano':'Japan', 'Torino':'Italy', 'Beijing':'China', 'Rio de Janeiro':'Brazil', 'Athina':'Greece',
           'Squaw Valley':'USA', 'Innsbruck':'Austria', 'Sarajevo': 'Bosnia and Herzegovina', 'Mexico City':'Mexico', 'Munich': 'Germany',
           'Seoul': 'South Korea', 'Berlin': 'Germany', 'Oslo': 'Norway', "Cortina d'Ampezzo":'Italy', 'Melbourne': 'Australia', 'Roma': 'Italy',
           'Amsterdam': 'Netherlands', 'Montreal': 'Canada', 'Moskva': 'Russia', 'Tokyo':'Japan', 'Vancouver':'Canada', 'Grenoble':'France',
           'Sapporo':'Japan', 'Chamonix':'France', 'St. Louis':'USA', 'Sankt Moritz':'Switzerland',
           'Garmisch-Partenkirchen':'Germany'}
    #simple dictionary map from City -> Host_Country
    def map_cities_to_countries(data, dict_city_to_country): 
        data['Host_Country'] = data['City'].map(dict_city_to_country)
        return data
    
    # unit test the mapping to ensure it did not introduce nulls into the dataset. 
    class TestMapping(unittest.TestCase):
        # Create the unit test
        def test_mapping(self):
            map_cities_to_countries(data, city_to_country)
            count_na = data['Host_Country'].isnull().sum()
            # Test that the mapping has not introduced NAs
            self.assertEqual(0, count_na)
    # Run the test 
    unittest.main(argv=['ignored', '-v'], exit=False)        
    
    test_mapping (__main__.TestMapping) ... ok
    ----------------------------------------------------------------------
    Ran 1 test in 0.071s
    
    #read in the host city metadata; the first row duplicates the column headers, so drop it
    host_cities_metadata = pd.read_csv('hostcities.csv', names=['HostCity', 'HostLatitude','HostLongitude','HostAltitude'])
    host_cities_metadata = host_cities_metadata.iloc[1:]
    host_cities_metadata.head()
    
    #merge the host cities metadata into the main data frame
    data = pd.merge(data, host_cities_metadata, left_on='City', right_on='HostCity')
    data.head(5)
    

    The dataframe above now includes the host city metadata columns, showing the merge was successful.

    Read in and merge the athletes capital city metadata

    This dataset contains the CountryName, CapitalName, CapitalLatitude, CapitalLongitude, CountryCode (not used) and ContinentName. My plan is to merge this dataset on the athletes' 'region' variable, which we originally got from the NOC dataset. I'm doing this in preparation for answering the 'distance travelled' research question; my hypothesis is that athletes who travel the furthest to the Games are most disadvantaged, due to jet lag/acclimatisation etc. I appreciate that using the capital city lat/long of an athlete's country is not ideal (they may not live in the capital, or might live abroad, etc.), but it serves as an OK proxy for this analysis.

    #load in the data
    names = ['CountryName', 'CapitalName', 'CapitalLatitude','CapitalLongitude', 'CountryCode', 'ContinentName']
    capital_cities_metadata = pd.read_csv('country-capitals.csv', names = names)
    #remove the first row as it's a duplicate of the column headers
    capital_cities_metadata = capital_cities_metadata.iloc[1:]
    # describe method to understand the shape of the dataset. 
    capital_cities_metadata.describe()
    

    Merge on capital_cities_metadata.

    Before I merge the capital cities metadata dataset into the main 'data' frame, I'm going to perform an outer join to create a new variable called data_new. In doing so, I can identify country names that are not joining properly. This then becomes a fuzzy matching problem between two lists of strings from different sources.

    # Create the merged data frame (outer join, so unmatched countries are kept)
    data_new = data.merge(capital_cities_metadata, 
                            left_on='region', 
                            right_on='CountryName', 
                            how='outer')
    # Select the rows where the capital city metadata failed to join
    missing_latitude = data_new[data_new.CapitalLatitude.isnull()]
    # Display the results
    missing_latitude.reset_index(inplace=True, drop=True)
    missing_latitude.sample(40)
    missing_latitude['region'].unique()
    
    array(['Ivory Coast', 'Saint Vincent', 'Individual Olympic Athletes',
           'Virgin Islands, British', 'Antigua', 'Gambia', 'Brunei',
           'Micronesia'], dtype=object)

    Fuzzy matching of Country Names from the NOC dataset and the country capitals dataset

    To facilitate as clean a merge as possible, I used a fuzzy matching algorithm from the fuzzywuzzy package. This highlighted inconsistencies in the strings, which I then dealt with earlier in the analysis (up the page) before re-running the script, because the string replacements needed to happen before the NOC and main data frames were merged.

    from fuzzywuzzy import fuzz

    def match_name(name, list_names, min_score=0):
        # -1 score in case we don't get any matches
        max_score = -1
        # Return an empty name if there is no match
        max_name = ""
        # Iterate over all names in the other list
        for name2 in list_names:
            # Find the fuzzy match score
            score = fuzz.ratio(name, name2)
            # Check if we are above the threshold and have a better score
            if (score > min_score) & (score > max_score):
                max_name = name2
                max_score = score
        return (max_name, max_score)
    /Users/williamneedham/anaconda/lib/python3.6/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
      warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
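
    To get a feel for what the scorer returns before applying it in bulk (a quick sketch; the pairs are taken from the unmatched names above):

    # print fuzz.ratio scores for a couple of candidate pairs
    for a, b in [('Saint Vincent', 'Saint Vincent and the Grenadines'),
                 ('Antigua', 'Antigua and Barbuda')]:
        print(a, '->', b, ':', fuzz.ratio(a, b))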
    # iterating over the countries without capital city matches, found above
    dict_list = []
    for name in missing_latitude.region.unique():
        # Use our method to find the best match; we can set a threshold here
        match = match_name(name, capital_cities_metadata.CountryName, 50)
        # New dict for storing data
        dict_ = {}
        dict_.update({"country_name" : name})
        dict_.update({"match_name" : match[0]})
        dict_.update({"score" : match[1]})
        dict_list.append(dict_)
    merge_table = pd.DataFrame(dict_list)
    # Display results
    merge_table
    
    #merge the capital cities metadata into the main data frame
    data = pd.merge(data, capital_cities_metadata, left_on='region', right_on='CountryName')
    data.head()
    
    # remove rows with NaN in the region column, using prints to check
    print(len(data))
    data = data[pd.notnull(data.region)]
    print(len(data))
    

    Calculating how far each athlete has travelled to participate in the games

    Now that we have lat/long pairs for 1) where the athlete lives and 2) where the Games were held, we can calculate the Vincenty distance (via GeoPy).

    The Vincenty distance is the geodesic distance between two points on the Earth's surface, calculated using Vincenty's method, which takes the Earth's ellipsoidal shape into account.
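
    As a quick illustration (a sketch with approximate coordinates, separate from the analysis itself):

    # geodesic (ellipsoidal) vs great-circle (spherical) distance between two capitals
    from geopy.distance import geodesic, great_circle
    london = (51.5074, -0.1278)        # approximate lat/long of London
    tokyo = (35.6762, 139.6503)        # approximate lat/long of Tokyo
    print(geodesic(london, tokyo).km)      # accounts for the Earth's ellipsoidal shape
    print(great_circle(london, tokyo).km)  # spherical approximation, for comparison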

    def calculate_vincenty_distance(distances):
        """
        This function first creates (Latitude, Longitude) tuples for both the Athlete's Home City and the Olympic Host City. 
        Then, it calculates a new column 'vincenty_distance' using np.vectorize (as the geodesic method requires two inputs).
        Inputs: distances - dataframe with cols = ['CapitalLatitude', 'CapitalLongitude', 'HostLatitude', 'HostLongitude']
        Outputs: same dataframe with new column ['vincenty_distance']
        """
        distances['AthleteLatLong'] = list(zip(distances.CapitalLatitude, distances.CapitalLongitude))
        distances['HostCityLatLong'] = list(zip(distances.HostLatitude, distances.HostLongitude))
        distances['vincenty_distance'] = np.vectorize(geodesic)(distances['AthleteLatLong'], distances['HostCityLatLong'])
        return distances
    # create a new dataframe away from the original data, and then concat back into the original dataframe
    new_data = data
    distances = new_data.loc[:, ['CapitalLongitude', 'CapitalLatitude', 'HostLongitude', 'HostLatitude']]
    #normally I'd have one class for all unit tests, but I've created a new one here for code readability in a notebook. 
    class TestCalculations(unittest.TestCase):
        # Create the unit test
        def test_vincenty_calculation(self):
            #test tuple from the first row of the 'distances' dataframe: the athlete's home capital
            athleteLatLong0 = (39.91666667, 116.383333)
            #test tuple from the first row: the host city
            hostLatLong0 = (41.3828939, 2.1774322) 
            #calculate the geodesic distance on the test data directly
            distanceTest = geodesic(athleteLatLong0, hostLatLong0).km
            #calculate the distance using our function calculate_vincenty_distance
            calculate_vincenty_distance(distances)
            # get the result of the first row
            function_output = distances.loc[0, 'vincenty_distance'].km
            # Test that the results of the two methods are (almost) equal. 
            self.assertAlmostEqual(function_output, distanceTest, 5)
    # Run the tests 
    unittest.main(argv=['ignored', '-v'], exit=False)        
    
    test_vincenty_calculation (__main__.TestCalculations) ... ok
    test_mapping (__main__.TestMapping) ... ok
    ----------------------------------------------------------------------
    Ran 2 tests in 88.062s
    
    #cast the vincenty distance from a Distance object to float (strip the trailing ' km')
    distances['vincenty_distance'] = distances['vincenty_distance'].astype(str).str[:-3].astype(float)
    data = pd.concat([data, distances['vincenty_distance']], axis=1)
    #take a look at the results
    data.head()
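    # An equivalent cast (a sketch, assuming the column holds geopy Distance
    # objects, which expose their numeric value via a .km attribute):
    # distances['vincenty_distance'] = distances['vincenty_distance'].apply(lambda d: d.km)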
    
    data.columns
    
    Index(['Name', 'Sex', 'Age', 'Height', 'Weight', 'Team', 'NOC', 'Year',
           'Season', 'City', 'Sport', 'Event', 'Medal', 'region', 'Host_Country',
           'HostCity', 'HostLatitude', 'HostLongitude', 'HostAltitude',
           'CountryName', 'CapitalName', 'CapitalLatitude', 'CapitalLongitude',
           'CountryCode', 'ContinentName', 'vincenty_distance'],
          dtype='object')
    # keep only the columns needed for the analysis, then rename them for readability
    cols_to_keep = ['Name', 'Sex', 'Age', 'Team', 'Height', 'Weight', 'Year', 'Season',
           'City', 'Sport', 'Event', 'Medal', 'region', 'Host_Country', 
                    'CapitalName', 'ContinentName', 'vincenty_distance', 'HostAltitude']
    clean_data = pd.DataFrame(data, columns=cols_to_keep)
    clean_data.head()
    
    clean_data.columns = ['Name', 'Sex', 'Age', 'Team', 'Height', 'Weight', 'Year', 'Season',
           'City', 'Sport', 'Event', 'Medal', 'AthleteCountry', 'HostCountry', 
                    'AthletesHomeCapital', 'AthletesHomeContinent', 'DistanceTravelled', 'HostCityAltitude']
    clean_data.head()
    
    # one-hot encode the Medal column (Gold / Silver / Bronze / None)
    df_medals = pd.get_dummies(clean_data['Medal'])
    # Join the dummy variables to the main dataframe
    clean_data = pd.concat([clean_data, df_medals], axis=1)
    pd.options.display.max_columns = None
    clean_data.head()
    
    # total medals per row, plus a binary flag for winning any medal at all
    clean_data['TotalMedals'] = clean_data['Gold'] + clean_data['Silver'] + clean_data['Bronze']
    clean_data['AnyMedals'] = np.where(clean_data.loc[:,'Medal'] == 'None', 0, 1)
    clean_data.sample(10)
    

    ======================================================

    Summary statistics to explore how athletes have changed over the years

    In this question, I'm looking to explore how the sex, age, and number of competitors have changed over the last 120 years.

    The changing profile of the Olympic athlete (part 1)

    #Proportion of male vs female competitors per Games
    ProportionMvW = data.groupby(['Year', 'Sex']).agg({'Name': 'count'})
    percentages = ProportionMvW.groupby(level=0).apply(lambda x:
                                                     100 * x / float(x.sum()))
    percentages = percentages.reset_index()
    a4_dims = (11.7, 8.27)   # A4 dimensions in inches
    fig, ax = pyplot.subplots(figsize=a4_dims)
    ax = sns.barplot(x="Year", y="Name", hue="Sex", data=percentages, palette="Set2") 
    ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha="right")
    ax.set_title('Proportion of men and women competing in the Olympic Games')
    ax.set_ylabel('Percentage of athletes for each gender')
    ax.legend(loc="upper right")
    plt.show()
    

    Insights:

  • Good news! The ratio of male to female competitors is slowly becoming more balanced - could 2020 be the year that we have an equal number of female and male competitors?

    The changing profile of the Olympic athlete (part 2)

    The visualisation below shows how the distribution of men's and women's ages has changed since 1896, split by Sex and Summer/Winter Olympics.

    sns.set(rc={'figure.figsize':(20,20)})
    g = sns.FacetGrid(clean_data, col="Season",  row="Sex", size=6)
    g = g.map(sns.boxplot, "Year", 'Age')
    g.set_xticklabels(rotation=90)
    show()
    

    Insights:

  • Some 'athletes' in the 20th century were over 80 years old when they competed! Worth a quick look back at the data to see what they competed in - see the sketch below.
  • As might be expected, variance in the modern athletes' ages is greatly reduced (with the median age being in the early 20s). Interestingly, the median age at the Winter Olympics seems to be higher.
  • There is higher variance at the Summer Olympics - I would hypothesise this is down to the wider variety of sports on offer there.
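
    A quick follow-up on the first point (a sketch; it uses the column names already in clean_data):

    # take a look at the 80+ year-old competitors and what they entered
    oldest = clean_data[clean_data['Age'] > 80]
    oldest.loc[:, ['Name', 'Age', 'Year', 'Sport', 'Event']].drop_duplicates()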
    #create a table of year, country name & sex to be merged into a count of each later. 
    mini_country_table = clean_data[clean_data['Season'] == 'Summer'].loc[:, ['Year', 'AthleteCountry', 'Name', 'Sex']].drop_duplicates()
    mini_country_table.head()
    
    # Create a pivot table to count gender wise representation of each team in each year
    CountAthletesByCountry = pd.pivot_table(mini_country_table,
                                            index = ['Year', 'AthleteCountry'],
                                            columns = 'Sex',
                                            aggfunc = 'count').reset_index()
    # rename columns as per column names in the 0th level
    CountAthletesByCountry.columns = CountAthletesByCountry.columns.get_level_values(0)
    # rename the columns appropriately
    CountAthletesByCountry.columns = ['Year', 'AthleteCountry', 'Female_Athletes', 'Male_Athletes']
    # get total athletes per team-year
    CountAthletesByCountry['Total_Athletes'] = CountAthletesByCountry['Female_Athletes'] + \
    CountAthletesByCountry['Male_Athletes']
    # subset each country of interest (using .copy() to avoid SettingWithCopyWarning)
    uk_athletes = CountAthletesByCountry[CountAthletesByCountry['AthleteCountry'] == "United Kingdom"].copy()
    uk_athletes.fillna(0, inplace = True)
    uk_athletes.set_index('Year', inplace = True)
    swedish_athletes = CountAthletesByCountry[CountAthletesByCountry['AthleteCountry'] == "Sweden"].copy()
    swedish_athletes.set_index('Year', inplace = True)
    japanese_athletes = CountAthletesByCountry[CountAthletesByCountry['AthleteCountry'] == "Japan"].copy()
    japanese_athletes.set_index('Year', inplace = True)
    germany_athletes = CountAthletesByCountry[CountAthletesByCountry['AthleteCountry'] == "Germany"].copy()
    germany_athletes.set_index('Year', inplace = True)
    # Plot the values of male, female and total athletes using bar charts and the line charts.
    fig, ((ax1, ax2), (ax3, ax4)) = subplots(nrows = 2, ncols = 2, figsize = (20, 20), sharey = True)
    fig.subplots_adjust(hspace = 0.3)
    # Plot the UK's contingent size
    ax1.bar(uk_athletes.index.values, uk_athletes['Male_Athletes'], width = -1, align = 'edge', label = 'Male Athletes')
    ax1.bar(uk_athletes.index.values, uk_athletes['Female_Athletes'], width = 1, align = 'edge', label = 'Female Athletes')
    ax1.plot(uk_athletes.index.values, uk_athletes['Total_Athletes'], linestyle = ':', color = 'black', label = 'Total Athletes',
            marker = 'o')
    ax1.legend(loc="upper right")
    ax1.set_title('UK Athletes :\nParticipation since 1896')
    ax1.set_ylabel('Count of Athletes')
    # Plot German athlete participation
    ax2.bar(germany_athletes.index.values, germany_athletes['Male_Athletes'], width = -1, align = 'edge', label = 'Male Athletes')
    ax2.bar(germany_athletes.index.values, germany_athletes['Female_Athletes'], width = 1, align = 'edge', label = 'Female Athletes')
    ax2.plot(germany_athletes.index.values, germany_athletes['Total_Athletes'], linestyle = ':', color = 'black', label = 'Total Athletes',
            marker = 'o')
    ax2.set_title('Germany Athletes :\nParticipation since 1896')
    ax2.legend(loc="upper left")
    ax2.set_ylabel('Count of Athletes')
    # Plot Japan's contingent size
    ax3.bar(japanese_athletes.index.values, japanese_athletes['Male_Athletes'], width = -1, align = 'edge', label = 'Male Athletes')
    ax3.bar(japanese_athletes.index.values, japanese_athletes['Female_Athletes'], width = 1, align = 'edge', label = 'Female Athletes')
    ax3.plot(japanese_athletes.index.values, japanese_athletes['Total_Athletes'], linestyle = ':', color = 'black', label = 'Total Athletes', 
             marker = 'o')
    ax3.set_title('Japanese Athletes :\nParticipation since 1896')
    ax3.set_ylabel('Count of Athletes')
    ax3.legend(loc="upper right")
    # Plot team Sweden's contingent size
    ax4.bar(swedish_athletes.index.values, swedish_athletes['Male_Athletes'], width = -1, align = 'edge', label = 'Male Athletes')
    ax4.bar(swedish_athletes.index.values, swedish_athletes['Female_Athletes'], width = 1, align = 'edge', label = 'Female Athletes')
    ax4.plot(swedish_athletes.index.values, swedish_athletes['Total_Athletes'], linestyle = ':', color = 'black', label = 'Total Athletes',
            marker = 'o')
    ax4.set_title('Swedish Athletes :\nParticipation since 1896')
    ax4.set_ylabel('Count of Athletes')
    ax4.legend(loc="upper left")
    show()
    

    Insights:

  • Overall gender balance has been improving slowly since the mid 20th century and is nearly equal for all countries considered.
  • Swedish female participation has risen above male participation in the last 3 Olympic Games.
  • UK participation peaked in 2012 - when it hosted in London! This peak trend was mirrored for German participation in the 1972 Games, held in West Germany.
  • The 1916, 1940 and 1944 Games were cancelled due to the World Wars.

    Is there a correlation between Distance Travelled to the Games and performance, and has that changed since 1980?

    This is where we bring in the engineered feature 'DistanceTravelled' - the Vincenty distance between an athlete's home capital and where the Games took place. My hypothesis is that distance travelled has an impact on the number of medals a country receives (due to physiological factors like jet lag, blood flow, etc.); I further hypothesise that this effect will be reduced in the modern era (I've chosen pre/post 1980 as the split).

    To investigate these hypotheses, I've split the dataset (pre and post 1980) and calculated the correlation between distance travelled and medals won. Obviously, whether a country has a high medal count is a multivariate problem and would require a more in-depth analysis in practice.

    Subset_Medalists = clean_data['AnyMedals'] == 1
    #Create a new dataset to be merged with the medal table dataset below
    subset_distancetravelled = clean_data.loc[:, ['Year', 'AthleteCountry', 'DistanceTravelled']].drop_duplicates()
    medal_tally = clean_data.groupby(['Year','AthleteCountry'])['TotalMedals'].agg('sum').reset_index()
    medal_tally_byDistance = medal_tally.merge(subset_distancetravelled, 
                                               left_on = ['Year', 'AthleteCountry'], 
                                               right_on = ['Year', 'AthleteCountry'], 
                                               how = 'left')
    #create two datasets to compare correlations
    Pre_1980_MedalTallyByDistance = medal_tally_byDistance[medal_tally_byDistance['Year'] <= 1980]
    Post_1980_MedalTallyByDistance = medal_tally_byDistance[medal_tally_byDistance['Year'] > 1980]
    medal_tally_byDistance.head()
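
    As a side note, the correlations below are plain Pearson coefficients from DataFrame.corr(); if scipy is available, pearsonr also returns a p-value, which helps judge how seriously to take weak correlations like these (a sketch, not part of the original analysis):

    # correlation plus p-value on the medal-winning rows (assumes scipy is installed)
    from scipy.stats import pearsonr
    subset = medal_tally_byDistance.dropna(subset=['DistanceTravelled', 'TotalMedals'])
    subset = subset[subset['TotalMedals'] > 0]
    r, p = pearsonr(subset['DistanceTravelled'], subset['TotalMedals'])
    print("r = %.3f, p = %.3g" % (r, p))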
    selected_rows = Pre_1980_MedalTallyByDistance['TotalMedals'] > 0
    correlation = Pre_1980_MedalTallyByDistance.loc[selected_rows, ['DistanceTravelled', 'TotalMedals']].corr()['TotalMedals'][0]
    plot(Pre_1980_MedalTallyByDistance.loc[selected_rows, 'DistanceTravelled'], 
         Pre_1980_MedalTallyByDistance.loc[selected_rows, 'TotalMedals'] , 
         linestyle = 'none', 
         marker = 'o',
         alpha = 0.4)
    xlabel('Distance Travelled')
    ylabel('Number of Medals')
    title('PRE-1980 - Distance travelled versus medal tally')
    text(np.nanpercentile(Pre_1980_MedalTallyByDistance['DistanceTravelled'], 99.6), 
         max(Pre_1980_MedalTallyByDistance['TotalMedals']) - 50,
         "Correlation = " + str(correlation))
    
    selected_rows = Post_1980_MedalTallyByDistance['TotalMedals'] > 0
    correlation = Post_1980_MedalTallyByDistance.loc[selected_rows, ['DistanceTravelled', 'TotalMedals']].corr()['TotalMedals'][0]
    plot(Post_1980_MedalTallyByDistance.loc[selected_rows, 'DistanceTravelled'], 
         Post_1980_MedalTallyByDistance.loc[selected_rows, 'TotalMedals'] , 
         linestyle = 'none', 
         marker = 'o', 
        alpha = 0.4)
    xlabel('Distance Travelled')
    ylabel('Number of Medals')
    title('POST-1980 - Distance travelled versus medal tally')
    text(np.nanpercentile(Post_1980_MedalTallyByDistance['DistanceTravelled'], 99.6), 
         max(Post_1980_MedalTallyByDistance['TotalMedals']) - 50,
         "Correlation = " + str(correlation))
    

    Insights:

  • Negative correlations were recorded for both the Pre-1980 and Post-1980 datasets, meaning a higher distance travelled is associated with a lower medal haul.
  • The correlation coefficient was roughly -0.147 for pre-1980 and -0.061 for post-1980. This is consistent with my hypothesis: the distance-travelled effect is weaker in the modern era (where flights are cheaper/more comfortable and athletes probably spend time preparing in the host country beforehand).
  • Admittedly, though, the coefficients are small, and as such the DistanceTravelled effect is smaller than I thought it would be.

    Conclusions and further research options

    This was an enjoyable task! I focused my efforts on data engineering and manufactured some novel features. The following insights were of most interest to me:

  • Good to see gender balance finally becoming a reality (after over 120 years of competition!).
  • Following high variance in the early years, most athletes are in their early 20s (with a slightly higher median age at the Winter Olympics).
  • Of the countries selected for this study (UK, Germany, Sweden & Japan), Germany sends the most athletes on average each year.

    Given more time, I would like to:

  • add more summary statistics,
  • explore covariance,
  • build predictive models (i.e. predicting the medals table at the next Games),
  • conduct a clustering analysis to see what separates the best athletes from OK athletes.
    #subset the Jamaican bobsleigh team's results and tally medals by Games
    jamaicans = clean_data[(clean_data['AthleteCountry'] == 'Jamaica') & (clean_data['Sport'] == 'Bobsleigh')]
    jamaican_grouped = jamaicans.groupby(['Year','City'])['TotalMedals'].agg('sum').reset_index()
    jamaican_grouped
    
    # append the next two Winter Games (no medals in Pyeongchang; Beijing still unknown)
    new_data = {'Year': [2018, 2022], 'City': ["Pyeongchang", "Beijing"], 'TotalMedals': [0, np.nan]}
    future_results = pd.DataFrame(new_data)
    combined = pd.concat([jamaican_grouped, future_results])
    combined
    
    plt.bar(combined['Year'], combined['TotalMedals'])
    plt.title('Jamaican bobsleigh team: medals by Olympic year', fontsize=20)
    plt.xlabel('Olympic Year', fontsize=18)
    plt.ylabel('Number of medals won', fontsize=16)
    plt.ylim((0,10))
    plt.xticks(combined['Year'])
    plt.show()
    