Prawns and Probability: July 2016

Monday, July 18, 2016

Will your job be automated? A critique of the predictions of Frey and Osborne

You cannot have failed to encounter the current hype and/or panic about job automation. The basic story is compelling. Drawing on the availability of Big Data, artificial intelligence is progressing at a breakneck speed, solving problems that once seemed like science fiction: driverless cars, recognising people in photos, giving eerily accurate suggestions about which films we might want to watch or even what email replies we might want to give. More mundane tasks that were once the preserve of highly-trained professionals are also at risk, such as legal research. A computer can scan millions of legal texts for relevant information while a lawyer is still finding the reference for the text they need.

All of this has led to a widespread belief that many people face the loss of their job in the near future. Of course, automation has been with us since the industrial revolution, and in some areas even before then. Resistance to, and despair about, automation is as old as automation itself. But the new panic is about the possible scale of job losses, and the lack of useful employment opportunities for those displaced. An oft-quoted figure is that 47% of U.S. jobs are at risk of automation.

The figure of 47% originates in the work of Carl Frey and Michael Osborne, of Oxford University. Frey and Osborne persuasively argue that the progress in data collection, data analytics and artificial intelligence puts many tasks that were previously thought to be out of reach for computers and robots within touching distance of being automated. They contend that advances in pattern recognition mean that computers, which previously had been used to automate routine tasks, such as performing repeated calculations or fitting parts together in factories, will increasingly be able to tackle non-routine tasks. For example, Siri and similar artificial personal assistants take in unstructured voice requests and determine what the user wants, where to seek the required information and how to present it to them. With enough data, they suggest, almost any task can be automated by looking for patterns in the data that inform the task at hand:

"...we argue that it is largely already technologically possible to automate almost any task, provided that sufficient amounts of data are gathered for pattern recognition." [F&O]

These arguments are persuasive, and there is no doubt that modern machine-learning research has made great strides - it is worth trying to recall how outlandish some of today's AI technologies would have seemed just 10 years ago. Nonetheless, others such as Neil Lawrence, of Sheffield University, have argued that relying on huge data sets in this way is not the same thing as true artificial intelligence. Only a few organisations in the world such as Google and Facebook have access to truly vast amounts of data about our daily behaviours, and a great deal of their research is dedicated to targeting adverts at us with increasing precision. Moreover, if a computer needs a vast data set to learn what it should do, how readily can it adapt to new tasks? Will there always be a big enough relevant data set that has been, or even could be, collected? What about tasks where the computer may not have access to 'the grid' and the vast centres where data is stored? These are big questions that drive significant bodies of research in AI. Given these uncertainties, it is worth considering how F&O arrive at the rather precise number of 47% for the proportion of jobs at risk.

Fittingly enough, F&O use machine-learning itself to determine whether a job is automatable. They use a tool called Gaussian process classification (GPC) to predict whether a job is automatable, based on the characteristics of that job, as defined and measured in a data set called O*NET, collected by the US Department of Labor. O*NET lists the skills and knowledge required to perform each job. To use GPC to make predictions requires two things, a set of predictors (in this case the O*NET data) and a matching set of known outputs on which to train the classifier. In plain terms, they require not only the job characteristics, but also, for some of these jobs, a known risk of automation. Where does this second part come from? In short, they make an educated guess (or more precisely, they ask a group of well-informed people to make such a guess). In the paper they describe this process:

"First, together with a group of ML researchers, we subjectively hand-labelled 70 occupations, assigning 1 if automatable, and 0 if not. For our subjective assessments, we draw upon a workshop held at the Oxford University Engineering Sciences Department, examining the automatability of a wide range of tasks. Our label assignments were based on eyeballing the O∗NET tasks and job description of each occupation. This information is particular to each occupation, as opposed to standardised across different jobs. The hand-labelling of the occupations was made by answering the question “Can the tasks of this job be sufficiently specified, conditional on the availability of big data, to be performed by state of the art computer-controlled equipment”. [F&O]

To make the process plain, they took 70 of the jobs in the data set about which they were most confident, and made their best guess as to whether these were going to be automated. They then use the GPC to translate these subjective opinions about 70 jobs into predictions on the other 600 or so in the data set. Essentially they train the GPC to learn what it is about certain jobs that makes them believe they will be automated. Ultimately then, the GPC propagates this subjective opinion to all the other jobs, and determines that 47% are predicted to be automated.
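The way a classifier propagates a handful of subjective labels can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the occupations and their two features are synthetic, the labeller's rule is invented, and a simple kernel-weighted vote stands in for the Gaussian process classifier that F&O actually used.

```python
import math
import random


def rbf_similarity(a, b, length_scale=0.2):
    """Squared-exponential similarity between two feature vectors."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-squared_distance / (2 * length_scale ** 2))


def predicted_probability(job, labelled_jobs, labels):
    """Kernel-weighted average of the hand labels (a crude stand-in for GPC)."""
    weights = [rbf_similarity(job, other) for other in labelled_jobs]
    return sum(w * y for w, y in zip(weights, labels)) / sum(weights)


def labeller_opinion(job):
    """Hypothetical subjective rule: low social perceptiveness means automatable."""
    social_perceptiveness, _manual_dexterity = job
    return 1 if social_perceptiveness < 0.5 else 0


random.seed(0)
# Synthetic occupations described by two made-up features in [0, 1].
jobs = [(random.random(), random.random()) for _ in range(700)]
labelled_jobs, unlabelled_jobs = jobs[:70], jobs[70:]
labels = [labeller_opinion(job) for job in labelled_jobs]

# Propagate the 70 hand labels to the remaining occupations.
predictions = [
    1 if predicted_probability(job, labelled_jobs, labels) > 0.5 else 0
    for job in unlabelled_jobs
]
share_at_risk = sum(predictions) / len(predictions)
agreement = sum(
    p == labeller_opinion(job) for p, job in zip(predictions, unlabelled_jobs)
) / len(unlabelled_jobs)

print(f"share predicted automatable: {share_at_risk:.2f}")
print(f"agreement with the labeller's own rule: {agreement:.2f}")
```

The headline share that comes out is determined almost entirely by the rule used to produce the 70 labels; the classifier adds consistency, not independent evidence.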

As a side effect, the GPC is also able to identify the factors that seemed to influence whether the workshop participants thought a job would be automated. The factors identified seem reasonable: jobs requiring high social perceptiveness have a low risk for example. But we should perhaps treat these findings with care - the very fact that they seem reasonable to us suggests that they also seemed reasonable to the people making the predictions - no wonder then that they labelled jobs requiring high social perceptiveness as less likely to be automated. Moreover, while the participants of a workshop at the Oxford University Engineering Sciences Department no doubt have greater expertise than the average person in determining the capabilities of machines, we should also be aware that such groups are somewhat self-selected for technological optimism - few people choose to become researchers in artificial intelligence if they do not believe it is important, any more than you would become a teacher if you didn't think education made a difference. Any biases or blind spots these individuals might have will be translated into the final figure of 47%, as well as the characteristics chosen as most important.

There is a danger when reading the paper (if one does; no doubt many news sources do not) that one can be impressed by the mathematical sophistication of the GPC prediction machinery. It is an impressive piece of technical work. But the GPC can only work with what it is given - it generalises from known examples in the data. The old saying about computer science, 'garbage in, garbage out', is overly pejorative here - the predictions the GPC has been trained on are not garbage, but the best educated guesses of well informed people. They are internally consistent - the GPC can predict well the predictions made by workshop participants for unseen examples. But the GPC cannot predict more accurately than the individuals themselves. It is important to realise that the trained GPC is effectively a machine for making the predictions these same individuals would have made themselves if they were asked. With all the uncertainties involved in a still nascent and quickly changing field, making precise predictions is extremely speculative. Just imagine how different many of these predictions would have been if people had been asked 10 years ago. What might they look like in 10 years' time?

All of this makes me very skeptical about the now ubiquitous assumption that mass job losses are inevitable. In many ways I hope they are - we should hope that more of the tasks we only do out of necessity will be automated, as long as the economic gains can be spread equitably (a whole other ball game!). But a narrative of huge disruption feeds into the rather millennial milieu in which we find ourselves, plagued with doubts about our economic system, possible catastrophic climate change, antibiotic resistance etc. It is very tempting to believe that disruptive, destructive change is now a permanent feature of our lives. F&O, to their credit, do not take this line - I have seen Michael Osborne present his work previously and he speaks to all the great possibilities automation creates. It is also worth noting that many tasks that can be automated take an amazingly long time to be so. I recently took a trip to the National Coal Mining Museum, where I was amazed to learn that very few mines had any serious machinery involved in the actual hacking off of coal until nationalisation and unionisation drove up labour costs and pushed efficiency up the agenda after the war. I'm perpetually amazed, as a renter, how many people think dishwashers are optional! As Frey & Osborne note, but few news outlets pick up on, automation will only happen if the cost of labour is sufficiently high - many government policies are directed explicitly at lowering the cost of labour to the employer.

We shall no doubt see feats of automation in our lifetimes that would stagger us today, just as the household appliances created in the 20th century would amaze our ancestors. But exactly which jobs will disappear, when they will do so and how many people will become unemployed? I would not want to guess.

Reference: [F&O] The future of employment: How susceptible are jobs to computerisation? Carl Benedikt Frey and Michael A. Osborne

Wednesday, July 13, 2016

Brexit: a statistical demographic analysis

Britain voted for Brexit, defying the predictions from Betfair's prediction market. I was in the US at the time, giving me the dubious privilege of watching the votes come in without having to stay up all night. I am a (relatively) young, (relatively) affluent graduate and resident of a major UK city, so you will be completely unsurprised to learn that I voted to remain.

There has been a lot of discussion in the press since the vote regarding different demographic splits between remain and leave voters. We are told that city-dwellers, graduates, the young and the affluent tended to vote remain, while poorer voters, those in small towns and villages, those without higher education and older voters tended to vote leave. The Scottish and the Irish voted in, the English and the Welsh voted out. The Guardian provides a breakdown of these trends, which appear to show a nation divided. I assume the data they use comes from the 2011 UK Census.

In an effort to channel my increasing angst in a positive direction I set out to do a more thorough statistical analysis of the data The Guardian presented to identify which demographic factors were most important in determining how people voted. After scraping the data from the Guardian website I first reproduced the graphs The Guardian had displayed (see end for scraping details. NB: I could have aggregated data from the UK Census directly, but this was quicker and ensured I was using exactly the same measures as the Guardian). My demographic data are all in arbitrary units, since I had to scrape the values in pixel units from the webpage, but this won't affect the statistics I wish to do - in fact, scaling each demographic variable to lie between 0 and 1 helps us to compare the magnitude of effects. On each subplot I have given the correlation coefficient between the demographic indicator and the proportion of leave voters.

In short these plots (working left to right and top to bottom) seem to indicate that:

Voters with degrees tend to vote remain
Voters with no formal qualifications tend to vote leave
Voters with higher incomes tend to vote remain
Voters in the ABC1 classes tend to vote remain
Older voters tend to vote leave
Voters in areas with more non-UK born residents tend to vote remain

So far, so much in agreement with the general terms of discussion. How do these perceptions hold up when we actually do some statistics on the data?

The tool I used for this analysis is the Generalised Linear Mixed Effects Model. I specified the model as:

proportion voting leave ~ (1 | Region) + proportion with higher education + proportion with no formal qualifications + median income + proportion in ABC1 social classes + median age + proportion not born in UK

This model states that the proportion of leave voters in an electoral area is determined by the demographic characteristics plotted above, but with regional variations specified by the random effect (1 | Region). We know that each nation of the UK had distinctly different voting patterns, quite separate from their different demographics, e.g. older voters in Scotland didn't necessarily vote the same way as similarly-aged voters in England. We'd better account for this in the analysis if we want to identify the real underlying effects.

Running this model in R (lme4::glmer, scaling the independent variables to have zero mean and unit standard deviation) we infer estimated effect sizes for each of the demographic variables. Below I've listed these and plotted the effect sizes with 95% confidence intervals for visual comparison. Points plotted to the left of the vertical grey line indicate a negative effect on the leave vote, those on the right a positive effect.
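The scaling step mentioned above (zero mean, unit standard deviation) is what makes effect sizes comparable across predictors measured in different units. A minimal Python sketch follows; the function name and the example ages are made up for illustration, and the sample standard deviation is used because that is what R's scale() computes.

```python
import statistics


def standardise(values):
    """Shift to zero mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # n - 1 denominator, matching R's scale()
    return [(v - mean) / sd for v in values]


# Made-up median ages for five hypothetical electoral areas.
median_age = [34.0, 41.5, 39.0, 47.2, 44.3]
scaled_age = standardise(median_age)
print([round(v, 2) for v in scaled_age])
```

After this transformation a coefficient can be read as the change in the response per one standard deviation of the predictor.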

Some of the initial impressions from the data are borne out in this analysis. The intercept is weakly positive, indicating that overall the nation voted to leave (albeit by such a slim margin that the intercept is not significantly greater than zero! - worth noting by those claiming an uncontestable mandate). By far the most important predictor of how an individual will vote is whether or not they have had any higher education. Older voters do tend to vote leave in greater numbers (in fact this tendency is shown more strongly here than we saw in the first set of plots). But some of the other results are surprising. The proportion of residents who are not born in the UK has a negligible effect on how that area will vote. Class has a relatively weak effect despite showing one of the strongest correlations. Voters with higher incomes are more likely to vote leave (all other things being equal). Perhaps most surprising, areas where more people have no formal qualifications are substantially less likely to vote leave (again, all other things being equal). The strong positive correlation between the proportion with no formal qualifications and the leave vote seen in the first figure appears to be a side effect of the strong anti-correlation between the proportion with no formal qualifications and the proportion with higher education.
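How a predictor can correlate positively with the outcome and still receive a negative coefficient is easy to reproduce on synthetic data. The sketch below is an illustration only: every number in it is invented rather than taken from the referendum data, and it uses plain least squares with no regional random effect, not the mixed model described above.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def ols(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(values) for values in zip(*columns)]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * target for r, target in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [v - factor * w for v, w in zip(a[row], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


random.seed(1)
# Invented areas: the two education measures are strongly anti-correlated.
degree = [random.uniform(0.1, 0.5) for _ in range(400)]
no_qual = [0.6 - d + random.gauss(0, 0.03) for d in degree]
# By construction BOTH predictors lower the leave share.
leave = [
    0.9 - 1.0 * d - 0.3 * q + random.gauss(0, 0.02)
    for d, q in zip(degree, no_qual)
]

intercept, degree_effect, no_qual_effect = ols([degree, no_qual], leave)
print(f"correlation(no_qual, leave) = {pearson(no_qual, leave):+.2f}")
print(f"regression effect of no_qual = {no_qual_effect:+.2f}")
```

Here the marginal correlation is positive only because areas with few graduates also tend to have many residents without qualifications; once the graduate share is held fixed, the sign of the second effect reverses.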

Of course, that caveat 'all other things being equal' is doing a lot of work. It's rare to find someone with a high income, but with no higher education and who would not be classified as being in the ABC1 social classes. Likewise there are not many areas where there are simultaneously a large number of graduates and a large number of people without formal qualifications. Nonetheless, the differences between the statistical results and the original impression from the data plots should make us pause before reading too much into the apparent demographic trends.

This analysis was a simple effort with a readily available model - hopefully some more sophisticated analysis will reveal a clearer picture. In particular, including interactions between these different indicators may give better predictions. As usual in such analyses, we should be aware of all the caveats surrounding ecological regression - data based on individual characteristics would be preferable, but that may be a pipedream.

How I got the data: scraping, xml and awk

The Guardian is one of the best newspapers in the world for presenting real data and analysis to the public. That it is free to access is an amazing privilege for those of us who are interested in the real evidence behind the headlines. It regularly presents beautiful summaries of important data in an easily understood format.

However, on this page where the demographic data is plotted, there is no information on how one might view the original data in numerical form. That is the newspaper's prerogative, and may be due to worries that other publications would piggyback on the hard work Guardian journalists do in finding the information. It does however make Open Science difficult.

To get the data I needed I first inspected the elements comprising the interactive plot (in Chrome, right click: inspect)

Then I found the xml entries that gave the screen coordinates for each circle plotted on each graph

I copied this element, which specifies the location of each circle and, thankfully, a code for the electoral area, into a text file, getting text that looks like this:

To get the raw x, y positions for each circle I processed this text file using an awk script (credit for awk-ing goes to Roman Garnett). Using an xml processing tool may be more efficient (or at least more sensible).

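awk 'BEGIN { RS = "<" }    # one record per tag
/^circle/ {
    gsub("[[:punct:]]", " ")    # punctuation to spaces
    gsub("data id", "dataid")   # rejoin the attribute name
    for (i = 1; i <= NF; i++) {
        # print the value after each cx, cy and dataid
        if ($i ~ "cx" || $i ~ "cy" || $i ~ "dataid") { printf "%s ", $(i + 1) }
    }
    printf "\n"
}' input_file >> output_file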
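For anyone who would rather use an XML parser than awk, a minimal Python sketch might look like the following. The attribute names cx, cy and data-id are inferred from the awk script above; the sample markup, the area codes and the function name are invented for illustration and are not copied from The Guardian's page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup in the general shape the awk script expects.
SAMPLE = (
    '<g>'
    '<circle cx="12.5" cy="80" r="3" data-id="AREA1"></circle>'
    '<circle cx="40" cy="22.25" r="3" data-id="AREA2"></circle>'
    '</g>'
)


def circle_positions(svg_text):
    """Return (area code, x pixel, y pixel) for every circle element."""
    root = ET.fromstring(svg_text)
    rows = []
    for element in root.iter():
        # Drop any XML namespace prefix such as '{http://www.w3.org/2000/svg}'.
        tag = element.tag.rsplit('}', 1)[-1]
        if tag == 'circle':
            rows.append(
                (
                    element.get('data-id'),
                    float(element.get('cx')),
                    float(element.get('cy')),
                )
            )
    return rows


for area, x, y in circle_positions(SAMPLE):
    print(area, x, y)
```

Unlike the punctuation-stripping approach, a parser keeps decimal points and minus signs in the coordinates intact, but it does require the copied fragment to be well-formed XML with a single root element.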

I rescaled these data so that every demographic indicator lies between 0 and 1, and then matched these data with far more easily obtainable data on how each electoral area voted from The Electoral Commission. (NB: the raw numbers are inverted in scale when collected from the website, because they indicate pixel positions from the top of the graph element.)
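The rescaling and inversion described above can be sketched as a small Python helper. The function name, the invert flag and the example pixel values are assumptions made for illustration, not the code actually used for the analysis.

```python
def rescale(values, invert=False):
    """Min-max rescale to [0, 1]; optionally flip so large inputs map to 0."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("cannot rescale a constant column")
    scaled = [(v - low) / (high - low) for v in values]
    return [1.0 - s for s in scaled] if invert else scaled


# Made-up pixel positions. Vertical pixels count down from the top of the
# plot, so the vertical axis is inverted to recover the plotted quantity.
cx_pixels = [10.0, 20.0, 30.0]
cy_pixels = [300.0, 150.0, 0.0]

print(rescale(cx_pixels))               # [0.0, 0.5, 1.0]
print(rescale(cy_pixels, invert=True))  # [0.0, 0.5, 1.0]
```

Because the transformation is linear, it changes the units of each indicator but leaves correlation magnitudes and the relative ordering of areas untouched.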

I am a little uncertain about whether one should make these data openly accessible. On the one hand the raw numbers I used are all publicly accessible on The Guardian's webpage (with a bit of work!), and could in principle be retrieved from the UK Census. On the other hand The Guardian didn't publish the numerical data, and so I will respect that and not do so here. These instructions should be sufficient to allow you to get the data yourself should you wish, and I would suggest contacting The Guardian if you want to do anything remotely commercial with them.