Sunday, August 21, 2016

A Bayesian Olympics medals table

As the 2016 Rio Olympics draw to a close, much of the media coverage here in the UK focuses on how many medals Team GB has won, and how this compares to other countries and to previous Olympics. Team GB has done particularly well this year, rising to 2nd in the medal table (as of Sunday afternoon) and increasing the number of medals won compared to London - the first time a host country has improved its medal haul in the subsequent Olympics.

The medal table has become an increasingly prominent feature of the Olympics (at least in the UK). Many people have pointed out an simple flaw in looking at a country’s position in the table as a measure of its sporting ‘quality’ (whatever that means): larger countries win more medals, simply by having more people. The USA, China and in the past the Soviet Union have been large countries dominating the upper echelons of the table. The obvious way to compare countries ‘fairly’ is to look at a per capita medal table. One website that has done this places the Bahamas at the top of its list of per capita gold medals. On the one hand correcting for population size in this way seems like a sensible thing to do if you want to know whether a country performed well for its size or not. But I can’t help noticing that of the top 10 countries in this list, none has a population onf more than 10m people, and two have populations below 1m. A single gold medal in the Bahamas puts them top of the list. This suggests to me that places at the top of the per capita table are likely to be the result of statistical noise - whichever of the many small countries compteting manages to win one gold tops the table.

A more robust solution is to treat the medal table as a statistical sample that is generated from the underlying sporting quality of each country, and to try to infer this quality from the data that we observe. To do this we can use Bayesian inference. Let the quality of a country in Olympic sport be represented by a single number, \(q\), such that the expected number of gold medals that country will win is \(qN\), with \(N\) being the population of the country (I’ll ignore complications about differing proportions of athlete-age population). Bayes’ rule tells us that our belief about the quality of a country should be represented by a probability distribution that combines our prior beliefs about \(q\), \(P(q)\) and the likelihood of observing the medals we saw given a specific value of q, \(P(\textrm{# Golds = g} \mid q)\): \[ P(q \mid \textrm{# Golds = g}) \propto P(q)P(\textrm{# Golds = g} \mid q) \] The likelihood is easy to define. Given that gaining a gold is a rare event, the number of golds won should follow a Poisson distribution. Therefore: \[ P(\textrm{# Golds = g} \mid q) = \frac{(qN)^g \exp(-qN)}{g!} \] For the prior distribution of \(q\) we can use the Principle of Maximum Entropy: we use a distribution that has the most uncertainty given the facts that we know. We know what the mean number of golds per person over the whole world must be, since the total number of golds, \(G\) and the world population, \(N_W\) is fixed at the time of the Olympics. The maximum-entropy distribution defined over positive numbers and with a known mean is the exponential distribution: \[ P(q) = \frac{N_W}{G}\exp(-\frac{qN_W}{G}) \] Putting this together and discarding constants we get \[ P(q \mid \textrm{# Golds = g}) \propto q^g \exp \left(-q\left(N + \frac{N_W}{G}\right) \right) \] If we want a single number to represent this distribution we should use the mean value \(\bar{q} = \int_0^1 qP(q \mid \textrm{# Golds = g}) dq\), which we can calculate as below: \[ \bar{q} = \frac{\int_0^1 q^{g+1} \exp \left(-q\left(N + \frac{N_W}{G}\right)\right)dq}{\int_0^1q^{g} \exp \left(-q\left(N + \frac{N_W}{G}\right)\right)dq} \\ = \frac{g+1}{N + \frac{N_W}{G}}\frac{1-\exp(-(N + \frac{N_W}{G}))\sum_{i=0}^{g+1} \frac{(N + \frac{N_W}{G})^i}{i!}}{1-\exp(-(N + \frac{N_W}{G}))\sum_{i=0}^{g} \frac{(N + \frac{N_W}{G})^i}{i!}} \] where the final step is done using repeated integration by parts. In practice the exponential terms in the final expression tend to be extremely small, so this can be approximated as \(\bar{q} = \frac{g+1}{N + N_W/G}\). This shows what effect the Bayesian prior has: the simple per capita estimate is just \(\frac{g}{N}\); using the prior effectively increases the medal count by 1 and the population count by \(N_W/G\), the worldwide number of people per medal, so it is as if the country got one more gold medal at the cost of having an additional population of the worldwide average needed to do this.

So I’m sure if you’ve slogged through the mathematics this far you’re dying to know what the Bayesian medal table actually looks like. Here is the R code used to do the above calculations, and then finally the medal table:

library(knitr)

#Read in data
medal_table = read.delim("medal_table.txt")
medal_table$Population = as.numeric(gsub(",", "", as.character(medal_table$Population)))


#Define prior distribution mean parameter
world_pop = 7.4e9
prior_mean = sum(medal_table$Gold)/world_pop

#Define useful function for calculating posterior mean
myf <- function(n, k){
  s = rep(0, k)
  for (ii in 0:k){
    s[ii] = -n + ii*log(n) - lfactorial(ii)
  }
  
  y = 1 - sum(exp(s))
  return(y)
}

#Loop over countries and calculate posterior mean 
medal_table$Quality = rep(NA, dim(medal_table)[1])
for (i in 1:dim(medal_table)[1]){
  k = medal_table$Gold[i]
  n = medal_table$Population[i] + 1/prior_mean
  
  #Calculate mean of the posterior distribution
  medal_table$Quality[i] = ((k+1)/n)*myf(n, k+1)/myf(n, k)
  
}

#Order results by quality and print
medal_table_print=medal_table[order(medal_table$Quality, decreasing=TRUE), c("Country", "Gold", "Population", "Quality")]
#Print only countries with quality higher than the prior
medal_table_print = medal_table_print[which(medal_table_print$Quality > prior_mean), ]
row.names(medal_table_print) <-NULL

kable(medal_table_print, digits = 9)
Country Gold Population Quality
Great Britain 27 65138232 3.13e-07
Hungary 8 9844686 2.64e-07
Jamaica 6 2725941 2.60e-07
Netherlands 8 16936520 2.19e-07
Croatia 5 4224404 2.11e-07
Australia 8 23781169 1.88e-07
New Zealand 4 4595700 1.74e-07
Germany 17 81413145 1.70e-07
Cuba 5 11389562 1.69e-07
United States 46 321418820 1.36e-07
South Korea 9 50617045 1.34e-07
Switzerland 3 8286976 1.23e-07
France 10 66808385 1.21e-07
Russian Federation 19 144096812 1.19e-07
Greece 3 10823732 1.14e-07
Spain 7 46418269 1.13e-07
Georgia 2 3679000 1.08e-07
Italy 8 60802085 1.06e-07
Slovakia 2 5424050 1.01e-07
Denmark 2 5676002 1.00e-07
Kenya 6 46050302 1.00e-07
Serbia 2 7098247 9.60e-08
Kazakhstan 3 17544126 9.60e-08
Uzbekistan 4 31299500 9.00e-08
Sweden 2 9798871 8.80e-08
Japan 12 126958472 8.60e-08
Belgium 2 11285721 8.50e-08
Canada 4 35851774 8.30e-08
Bahamas 1 388019 8.10e-08
Fiji 1 892145 8.00e-08
Bahrain 1 1377237 7.80e-08
Kosovo 1 1859203 7.70e-08
Slovenia 1 2063768 7.60e-08
Armenia 1 3017712 7.40e-08
Puerto Rico 1 3474182 7.20e-08
Singapore 1 5535002 6.70e-08
Jordan 1 7594547 6.30e-08
Tajikistan 1 8481855 6.10e-08
North Korea 2 25155317 6.10e-08
Belarus 1 9513000 5.90e-08
Argentina 3 43416755 5.90e-08
Azerbaijan 1 9651349 5.90e-08
Czech Republic 1 10551219 5.80e-08
Colombia 3 48228704 5.50e-08
Poland 2 37999494 4.80e-08
Romania 1 19832389 4.50e-08
Ukraine 2 45198200 4.30e-08
Cote d’Ivoire 1 22701556 4.30e-08
Taiwan 1 23510000 4.20e-08

Team GB tops the chart! Mathematically, this is because GB combines a large rate of medals per capita with a large population. Therefore it has the statistical weight to move the inferred value of \(q\) away from the prior expectation. Smaller countries with several golds like Jamaica also do well, but tiny Bahamas is now much further down the list - 1 gold medal just isn’t enough information to tell you much about the underlying rate at which a country tends to win golds.

You could easily extend this analysis by aggregating the results of previous Olympics too. With data from more years there would be more evidence to move the quality of smaller countries away from the prior. In terms of predicting the future performance of countries you would need to decide on an appropriate weighting of past results, which you could in principle do by trying to make a predictive model for the 2016 results from 2012, 2008 etc. Data from Rio and previous Olympics is available here

Additional note: this is my first blog post written entirely in R Markdown.