Wednesday, January 7, 2015

What we do when we do regressions in social science

The Nobel prize is the most prestigious award a scientist, economist (not really a Nobel prize, says every scientist simultaneously), writer or statesman (dubious) can win. As well as conferring enormous status on the recipient, these awards also carry substantial monetary value, both directly and in terms of future earnings. As with any prestigious and lucrative award, we'd like to think that the prizes are given on a purely meritocratic basis. But as we've seen in previous posts, academic selections are rarely free from the suspicion of bias.

Is anyone surprised that a disproportionate number of previous winners have been Swedish? After all, the prizes (except for the peace prize) are awarded by Swedish institutions, most prominently the Royal Swedish Academy of Sciences. More glaringly, an overwhelming majority of winners have come from western nations which are culturally similar to Sweden.

So are the Swedes culturally biased? I thought this question would be a good way to demonstrate the basic techniques used to answer such questions in empirical social science, as well as to discuss the problems with these approaches. So here we go...

Let's start by clarifying the hypothesis: The Nobel committee is biased towards awarding prizes to individuals from nations culturally similar to Sweden.

How do we measure cultural similarity? Thankfully the Swedish-founded World Values Survey has toured the world, asking people a series of questions about their values to try and answer this exact question. Their results are broken down into two main axes of values: survival versus self-expression values, and traditional versus secular-rational values (keen observers may note that these terms are somewhat value-loaded in themselves!). The results for many countries are shown in the plot below.

Conveniently Sweden is placed in the top right of the graph (everyone gasps in surprise). We can approximate the cultural distance between any country and Sweden by the distance separating them on this plot.
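To make the distance measure concrete, here is a minimal sketch of how it could be computed as a straight-line distance on the two-axis plot. The country names and scores are made-up placeholders rather than World Values Survey data, and `cultural_distance` is a hypothetical helper, not the code used for this post.

```python
import math

# Hypothetical (traditional-vs-secular, survival-vs-self-expression) scores.
# Placeholder numbers for illustration only, NOT real survey values.
scores = {
    "Sweden": (1.9, 2.3),
    "Country A": (1.2, 1.5),
    "Country B": (-1.0, -0.5),
}

def cultural_distance(country, reference="Sweden", table=scores):
    """Straight-line (Euclidean) distance between two countries on the value map."""
    x, y = table[country]
    ref_x, ref_y = table[reference]
    return math.hypot(x - ref_x, y - ref_y)

distances = {name: cultural_distance(name) for name in scores}
for name, d in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{name}: {d:.2f}")
```

Treating both axes as equally important is itself a choice; weighting one axis more heavily would change which countries count as "close".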

I collected data on per-capita Nobel prize awards by nation (data source) for 41 countries, along with their cultural distance from Sweden. The plot below shows that more culturally distant countries are definitely awarded fewer prizes.

So are the Swedes biased then? Not so fast! Of course cultural separation might not be the only force at play here. Western nations are rich and spend a significant proportion of their income on research and development. We'd expect this to yield more and better science, and thus to win more prizes. Sure enough, we see in the plot below that countries with higher research spending do tend to win more prizes (data source).

So what a good paid-up social scientist would do next is to 'control' for research spending before judging if any bias exists. For this we need to do a bit of regression. Let's say that the rate, R, at which individuals from a nation are awarded Nobel prizes is partly due to cultural distance, C, and partly due to spending, S:

R = aC + bS 

where a and b are coefficients that express how strong each influence is. Technically what we're going to do is use a Generalized Linear Model with a Poisson distribution to model the number of prizes per 10 million citizens each country receives. With the S factor there, if only research spending is to blame for the disparity of prizes we should find that a is close to zero and not statistically significant. Carrying out a regression like this tells us how likely we would be to see an association between R and C at least this strong by random chance alone, given that R is also correlated with S. When we carry this out in Matlab (not R stats!), we get highly significant effects for both culture and spending. So a social scientist would say that there is a significant effect of cultural distance on number of prizes, after controlling for research spending.

p-values: (culture: p = 1e-12,  spending: p = 0.2e-8)
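For readers who want to see the mechanics, below is a rough sketch of fitting a Poisson regression by Newton's method in plain Python. It is not the Matlab analysis itself: the eight data points are invented for illustration, the log link (log R = intercept + aC + bS) and the intercept term are assumptions on my part, and the helper names `solve` and `fit_poisson` are hypothetical. It only estimates the coefficients and does not compute p-values.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_poisson(X, y, iters=50, tol=1e-10):
    """Newton-Raphson for a Poisson GLM with log link: log E[y] = X . beta.
    The first column of X is assumed to be the constant 1 (intercept)."""
    p = len(X[0])
    beta = [math.log(sum(y) / len(y))] + [0.0] * (p - 1)  # intercept-only start
    for _ in range(iters):
        mu = [math.exp(sum(x * b for x, b in zip(row, beta))) for row in X]
        grad = [sum(row[j] * (yi - mi) for row, yi, mi in zip(X, y, mu))
                for j in range(p)]
        hess = [[sum(row[j] * row[k] * mi for row, mi in zip(X, mu))
                 for k in range(p)] for j in range(p)]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta

# Invented data for eight imaginary countries (NOT the data used in this post).
culture = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]   # distance from Sweden
spending = [3.0, 2.5, 2.8, 1.5, 2.0, 1.0, 1.2, 0.5]  # research spending
prizes = [7, 4, 4, 2, 2, 1, 1, 0]                    # prize counts

X = [[1.0, c, s] for c, s in zip(culture, spending)]
beta = fit_poisson(X, prizes)
fitted = [math.exp(sum(x * b for x, b in zip(row, beta))) for row in X]
print("intercept, a (culture), b (spending):", [round(b, 3) for b in beta])
```

In practice you would hand this job to a statistics package, which also reports standard errors and p-values for a and b.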

Everyone knows, however, that p-values suck. A better way to test whether both culture and spending have effects is to do model selection. That is, to see whether a model including only culture, only spending or both is best at predicting the data we see. I calculated approximate values of the marginal likelihood of the data for all three models, i.e. the probability of the data under each model. Comparing these to a simple null model that prizes are given at the same rate to all countries, we get the results below, again showing that including both effects gives a better prediction than either alone (marginal likelihoods shown in log values).
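One crude way to approximate a marginal likelihood is to average the likelihood over a grid of parameter values, which amounts to a uniform prior on that grid. The sketch below does this for the four models on invented data. The grid range, the uniform prior, the log link and the function names are all assumptions made for illustration, and this is not necessarily the approximation used for the results above.

```python
import math
from itertools import product

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_evidence(y, predictors, grid):
    """Log marginal likelihood of a Poisson model log mu = b0 + sum_j b_j * x_j,
    averaging the likelihood over a uniform prior on grid^(number of coefficients)."""
    k = len(predictors) + 1
    const = sum(math.lgamma(yi + 1) for yi in y)  # log(y!) terms
    logliks = []
    for coefs in product(grid, repeat=k):
        ll = -const
        for i, yi in enumerate(y):
            eta = coefs[0] + sum(b * col[i] for b, col in zip(coefs[1:], predictors))
            ll += yi * eta - math.exp(eta)  # Poisson log-likelihood term
        logliks.append(ll)
    return logsumexp(logliks) - math.log(len(logliks))

# Invented data for eight imaginary countries (NOT the data used in this post).
culture = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
spending = [3.0, 2.5, 2.8, 1.5, 2.0, 1.0, 1.2, 0.5]
prizes = [7, 4, 4, 2, 2, 1, 1, 0]

grid = [round(-3 + 0.2 * i, 1) for i in range(31)]  # coefficients from -3 to 3

log_ev = {
    "null": log_evidence(prizes, [], grid),
    "culture": log_evidence(prizes, [culture], grid),
    "spending": log_evidence(prizes, [spending], grid),
    "both": log_evidence(prizes, [culture, spending], grid),
}
for name, value in log_ev.items():
    print(f"{name}: log marginal likelihood = {value:.2f}")
```

Because the likelihood is averaged over the whole prior, a model with more coefficients is automatically penalised unless the extra flexibility really improves the fit. The answer also depends on the prior range chosen, which is one reason to read these numbers with care.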

So surely now we can conclude that the Swedes are biased? Well conclude away...but be prepared to be wrong. Or right. Who knows? Because although this basic procedure (with a little more tweaking and a few more control variables) is ubiquitous in social science, where the observational study is king, it rarely tells us anything conclusively.

On the simplest level, there may well be an additional factor which we haven't controlled for that causes all these apparent effects. Maybe it's really cultural distance to the USA that matters. Maybe (God forbid!) Sweden really does produce unusually excellent research. In estimating bias we often assume that fundamentally all nations, genders or whatever category are genuinely equal before any bias kicks in (for example in this previous post). This may be the enlightened thing to do, but it is certainly a strong assumption that we should be aware of.

But beyond these simple problems, there lie deeper issues. What we have just done is a case of Ecological Regression, which, though widely used, is essentially a precise codification of the Ecological Fallacy. For instance, if developing an academic culture, producing high-quality research and winning international science prizes tended to make a country more liberal, richer and more secular and self-expressive, then we'd see exactly the same results, without any need for a bias on the part of the ever fair and impartial Swedish Academy.
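A toy simulation makes the point. In the sketch below, an imaginary committee is perfectly fair: prizes depend only on each country's research strength. Research strength also pulls a country's culture towards Sweden's. All the numbers and variable names are invented for illustration.

```python
import math
import random

random.seed(1)  # fixed seed so the sketch is repeatable

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

n_countries = 41

# Research strength of each imaginary country, between 0 and 1.
strength = [random.random() for _ in range(n_countries)]

# Prize rate depends ONLY on research strength plus noise: no cultural bias.
prize_rate = [10 * q + random.gauss(0, 1) for q in strength]

# Stronger research cultures end up culturally closer to Sweden (assumed).
distance = [3 * (1 - q) + random.gauss(0, 0.3) for q in strength]

r = pearson(distance, prize_rate)
print(f"correlation between cultural distance and prize rate: {r:.2f}")
```

The correlation comes out strongly negative even though cultural distance never enters the rule for awarding prizes, which is exactly the pattern the regression would happily report as bias.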

So what can we conclude then? Generally, to be very cautious about over-interpreting correlations, or even significant regression coefficients after controlling for other factors. Causation is a slippery beast, and Ecological Regression won't pin it down for you, no matter how many stories the Daily Mail runs saying that X causes cancer. Is the Swedish Academy biased towards western scientists? I genuinely don't know, and this data won't tell me. I wouldn't be surprised if they were, any more than I'd be surprised if grant awarding agencies were biased in favour of men. But unless someone can do a double-blind randomised test, you can continue believing whatever you like about the meritocratic value of our most prestigious prize.