Wednesday, July 13, 2016

Brexit: a statistical demographic analysis

Britain voted for Brexit, defying the predictions from Betfair's prediction market. I was in the US at the time, giving me the dubious privilege of watching the votes come in without having to stay up all night. I am a (relatively) young, (relatively) affluent graduate and resident of a major UK city, so you will be completely unsurprised to learn that I voted to remain.

There has been a lot of discussion in the press since the vote regarding different demographic splits between remain and leave voters. We are told that city-dwellers, graduates, the young and the affluent tended to vote remain, while poorer voters, those in small towns and villages, those without higher education and older voters tended to vote leave. The Scottish and the Northern Irish voted in, the English and the Welsh voted out. The Guardian provides a breakdown of these trends, which appear to show a nation divided. I assume the data they use comes from the 2011 UK Census.

In an effort to channel my increasing angst in a positive direction I set out to do a more thorough statistical analysis of the data The Guardian presented, to identify which demographic factors were most important in determining how people voted. After scraping the data from the Guardian website I first reproduced the graphs The Guardian had displayed (see the end of this post for scraping details). NB: I could have aggregated data from the UK Census directly, but this was quicker and ensured I was using exactly the same measures as The Guardian. My demographic data are all in arbitrary units, since I had to scrape the values in pixel units from the webpage, but this won't affect the statistics I wish to do - in fact, scaling each demographic variable to lie between 0 and 1 helps us to compare the magnitude of effects. On each subplot I have given the correlation coefficient between the demographic indicator and the proportion of leave voters.
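To make the point about arbitrary units concrete, here is a minimal Python sketch of rescaling to the 0-1 range and computing a correlation coefficient. The numbers are made up for illustration and the function names are my own, not taken from the original analysis. It shows that the correlation is the same before and after rescaling.

```python
from math import sqrt

def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot rescale a constant variable")
    return [(v - lo) / (hi - lo) for v in values]

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up pixel positions for one indicator, and made-up leave shares.
pixels = [120.0, 340.0, 205.0, 410.0, 290.0]
leave_share = [0.62, 0.41, 0.55, 0.35, 0.48]

scaled = min_max_scale(pixels)

# A positive linear rescaling leaves the correlation unchanged.
r_raw = pearson_r(pixels, leave_share)
r_scaled = pearson_r(scaled, leave_share)
```

Note that a rescaling which flips the axis (as pixel positions measured from the top of a plot would need) changes the sign of the correlation but not its magnitude.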

In short, these plots (working left to right and top to bottom) seem to indicate that:
  1. Voters with degrees tend to vote remain
  2. Voters with no formal qualifications tend to vote leave
  3. Voters with higher incomes tend to vote remain
  4. Voters in the ABC1 classes tend to vote remain
  5. Older voters tend to vote leave
  6. Voters in areas with more non-UK born residents tend to vote remain
So far, so much in agreement with the general terms of discussion. How do these perceptions hold up when we actually do some statistics on the data?

The tool I used for this analysis is the Generalised Linear Mixed Effects Model. I specified the model as:

proportion voting leave ~ (1 | Region) + proportion with higher education + proportion with no formal qualifications + median income + proportion in ABC1 social classes + median age + proportion not born in UK

This model states that the proportion of leave voters in an electoral area is determined by the demographic characteristics plotted above, but with regional variations specified by the random effect (1 | Region). We know that each nation of the UK had distinctly different voting patterns, quite separate from their different demographics, e.g. older voters in Scotland didn't necessarily vote the same way as similarly-aged voters in England. We'd better account for this in the analysis if we want to identify the real underlying effects.

Running this model in R (lme4::glmer, scaling the independent variables to have zero mean and unit standard deviation) we infer estimated effect sizes for each of the demographic variables. Below I've listed these and plotted the effect sizes with 95% confidence intervals for visual comparison. Points plotted to the left of the vertical grey line indicate a negative effect on the leave vote, those on the right a positive effect.
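For readers unfamiliar with the scaling step, the following Python sketch shows what "zero mean and unit standard deviation" means for a single variable. It is an illustration only: the values are made up, the function name is hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption about how the scaling was done.

```python
from statistics import mean, stdev

def standardise(values):
    """Centre to zero mean and scale to unit sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Made-up median ages for five areas.
median_age = [38.0, 45.0, 41.0, 52.0, 34.0]
z = standardise(median_age)
```

After this transformation an effect size reads as "change in the outcome per one standard deviation of the predictor", which is what makes the estimates for different indicators comparable.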

Some of the initial impressions from the data are borne out in this analysis. The intercept is weakly positive, indicating that overall the nation voted to leave (albeit by such a slim margin that the intercept is not significantly greater than zero! - worth noting by those claiming an uncontestable mandate). By far the most important predictor of how an individual will vote is whether or not they have had any higher education. Older voters do tend to vote leave in greater numbers (in fact this tendency is shown more strongly here than we saw in the first set of plots). But some of the other results are surprising. The proportion of residents who are not born in the UK has a negligible effect on how that area will vote. Class has a relatively weak effect despite showing one of the strongest correlations. Voters with higher incomes are more likely to vote leave (all other things being equal). Perhaps most surprising, areas where more people have no formal qualifications are substantially less likely to vote leave (again, all other things being equal). The strong positive correlation between the proportion with no formal qualifications and the leave vote seen in the first figure appears to be a side effect of the strong anti-correlation between the proportion with no formal qualifications and the proportion with higher education.
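The sign reversal for "no formal qualifications" can be reproduced with a toy example. The sketch below uses constructed data (not the real figures) and plain two-predictor least squares rather than the mixed model, so it only illustrates the mechanism. The leave share is built so that both predictors have a negative effect, yet leave share still correlates positively with no_quals on its own, because no_quals is strongly anti-correlated with higher_ed. All variable names are hypothetical.

```python
def centred_sums(xs, ys):
    """Sum of products of deviations from the means (n times the covariance)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys))

# Constructed data: no_quals falls as higher_ed rises, plus a small wiggle
# so that the two predictors are not perfectly collinear.
higher_ed = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
wiggle = [0.02, -0.02, -0.01, 0.01, 0.02, -0.02]
no_quals = [0.6 - 0.8 * h + w for h, w in zip(higher_ed, wiggle)]

# Leave share built so that BOTH predictors have a negative partial effect.
leave = [0.9 - 1.0 * h - 0.3 * q for h, q in zip(higher_ed, no_quals)]

# Slope from regressing leave on no_quals alone.
simple_slope = centred_sums(no_quals, leave) / centred_sums(no_quals, no_quals)

# Two-predictor least squares via the centred normal equations.
s11 = centred_sums(higher_ed, higher_ed)
s22 = centred_sums(no_quals, no_quals)
s12 = centred_sums(higher_ed, no_quals)
s1y = centred_sums(higher_ed, leave)
s2y = centred_sums(no_quals, leave)
det = s11 * s22 - s12 * s12
b_higher_ed = (s22 * s1y - s12 * s2y) / det
b_no_quals = (s11 * s2y - s12 * s1y) / det
```

Here simple_slope comes out positive while b_no_quals recovers the negative coefficient the data were built with, which is the same pattern as in the results above.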

Of course, that caveat "all other things being equal" is doing a lot of work. It's rare to find someone with a high income, but with no higher education and who would not be classified as being in the ABC1 social classes. Likewise there are not many areas where there are simultaneously a large number of graduates and a large number of people without formal qualifications. Nonetheless, the differences between the statistical results and the original impression from the data plots should make us pause before reading too much into the apparent demographic trends.

This analysis was a simple effort with a readily available model - hopefully some more sophisticated analysis will reveal a clearer picture. In particular, including interactions between these different indicators may give better predictions. As usual in such analyses, we should be aware of all the caveats surrounding ecological regression - data based on individual characteristics would be preferable, but that may be a pipedream.

How I got the data: scraping, xml and awk

The Guardian is one of the best newspapers in the world for presenting real data and analysis to the public. That it is free to access is an amazing privilege for those of us who are interested in the real evidence behind the headlines. It regularly presents beautiful summaries of important data in an easily understood format. 

However, on this page where the demographic data is plotted, there is no information on how one might view the original data in numerical form. That is the newspaper's prerogative, and may be due to worries that other publications would piggyback on the hard work Guardian journalists do in finding the information. It does however make Open Science difficult.

To get the data I needed I first inspected the elements comprising the interactive plot (in Chrome, right click: inspect).

Then I found the XML entries that gave the screen coordinates for each circle plotted on each graph.

I copied this element, which specifies the location of each circle and, thankfully, a code for the electoral area, into a text file, getting text that looks like this:

To get the raw x, y positions for each circle I processed this text file using an awk script (credit for awk-ing goes to Roman Garnett). Using an xml processing tool may be more efficient (or at least more sensible).

awk 'BEGIN {RS="<"} /^circle/ {gsub("[[:punct:]]", " "); gsub("data id", "dataid"); for (i = 1; i <= NF; i++) {if ($i ~ "cx" || $i ~ "cy" || $i ~ "dataid") {printf "%s ", $(i + 1)}} printf "\n"}' input_file >> output_file
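As a sketch of the "more sensible" route, the same extraction could be done with a parser from the Python standard library. The attribute names cx, cy and data-id are inferred from the awk script above. The sample markup and area codes are placeholders, since the real element copied from the page may look different.

```python
from html.parser import HTMLParser

class CircleExtractor(HTMLParser):
    """Collect (data-id, cx, cy) from every <circle> element."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag != "circle":
            return
        a = dict(attrs)
        # Skip circles that lack any of the three attributes.
        if all(k in a for k in ("data-id", "cx", "cy")):
            self.rows.append((a["data-id"], float(a["cx"]), float(a["cy"])))

# Hypothetical fragment; the real markup copied from the page may differ.
sample = (
    '<g class="plot">'
    '<circle cx="101.5" cy="220.25" r="3" data-id="A001"></circle>'
    '<circle cx="87" cy="143.75" r="3" data-id="A002"/>'
    '</g>'
)

parser = CircleExtractor()
parser.feed(sample)
parser.close()
rows = parser.rows
```

One possible advantage of reading the attributes directly is that decimal points and minus signs in the coordinates survive intact. The awk one-liner replaces all punctuation with spaces first, so if the coordinates are not whole numbers it would appear to keep only the part before the decimal point.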

I rescaled these data so that every demographic indicator lies between 0 and 1, and then matched these data with far more easily obtainable data on how each electoral area voted from The Electoral Commission. (NB: the raw numbers are inverted in scale when collected from the website, because they indicate pixel positions from the top of the graph element.)
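The rescaling and matching steps might look something like the following Python sketch. The area codes, pixel values and vote shares are all made up, and the function name is my own. It flips the pixel axis while rescaling to 0-1, then keeps only the areas that appear in both sources.

```python
def rescale_inverted(pixel_values):
    """Map pixel offsets from the top of the plot onto [0, 1], flipping the
    axis so the largest offset (lowest point on screen) becomes 0."""
    lo, hi = min(pixel_values), max(pixel_values)
    if hi == lo:
        raise ValueError("cannot rescale a constant variable")
    return [(hi - v) / (hi - lo) for v in pixel_values]

# Hypothetical scraped values: area code -> cy pixel position.
scraped = {"A001": 220.0, "A002": 140.0, "A003": 60.0}

# Hypothetical leave shares keyed by the same kind of area code.
leave_share = {"A001": 0.61, "A003": 0.39, "A004": 0.52}

codes = list(scraped)
scaled = dict(zip(codes, rescale_inverted([scraped[c] for c in codes])))

# Keep only areas present in both sources (an inner join on area code).
matched = {c: (scaled[c], leave_share[c]) for c in codes if c in leave_share}
```

Only a coordinate measured from the top of the plot needs flipping. A horizontal position can be rescaled without the inversion.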

I am a little uncertain about whether one should make these data openly accessible. On the one hand the raw numbers I used are all publicly accessible on The Guardian's webpage (with a bit of work!), and could in principle be retrieved from the UK Census. On the other hand The Guardian didn't publish the numerical data, and so I will respect that and not do so here. These instructions should be sufficient to allow you to get the data yourself should you wish, and I would suggest contacting The Guardian if you want to do anything remotely commercial with them.


  1. Thanks -- it was interesting to see this in a regression context. There is a strong correlation between lacking formal qualifications and older age, which is mostly education policy (now you get lower grades at GCSE rather than nothing); may also be a small correlation with being educated abroad. So, age effect more important than lacking quals, maybe (?). The income/earnings data will have come from a different data source, like Annual Survey of Hours and Earnings rather than Census 2011 (which will have been the key source for lots of the data).

  2. Yes, my suspicion was that the lack of formal qualifications is not a fundamentally useful variable. It has such a strong anti-correlation with % with HE and as you say is very age dependent (I never established for sure that, say, one GCSE below C grade would give you 'formal qualifications', but I assume so). However, it does improve model fit significantly when included, compared to the same model without it (log-likelihood difference of over 7000). It could be a useful distinction within older voters, where the existence of FQ may be a more 50/50 split than having HE, since HE was so rare.