Monday, November 6, 2017

Feedback in academic selection

I recently finished reading Cathy O'Neil's excellent book, 'Weapons of Math Destruction', which describes how the large-scale use of algorithms and metrics is creating dangerous systems that produce and perpetuate unfairness and pathological behaviour. I highly recommend it to anyone interested in how the seemingly opaque systems that increasingly govern our lives work, and how they came into existence.

One thing O'Neil wrote about that struck a chord with me was systems that have no feedback to tell them whether they are working well. For instance, O'Neil writes about the use of baseball statistics to select a winning team. In this case, if the algorithms don't work, the team won't win and the team's statisticians are forced to change their models. She compares this to university ranking systems, where the true quality of a university is measured via a range of proxies, such as entrance scores, employment stats, publication metrics etc. In this case there is no external factor that can determine whether these measures are right or wrong, so in effect the proxies become the quality. As a result universities spend a lot of time chasing good scores on these proxies, rather than attending to their fundamental purpose of research and education.

As I was reading this I started thinking about how many systems in academia, and elsewhere, operate with a similar lack of useful feedback. As a result, many decisions are being made without any meaningful opportunity to reflect on whether these decisions, and the criteria on which they were based, were any good. For example, in the past few years I have sat on both sides of various hiring committees. These typically involve a group of faculty members interviewing several candidates, reviewing their work and watching their presentations, before collectively deciding which would best serve the needs of the department. This collective decision can be more or less equally shared between members of the committee, and may focus on particular immediate needs such as teaching shortages, or more generalised goals such as departmental research directions and reputation. In some institutions the candidates face a relatively short interview, while in others (particularly in the USA), they meet with many members of the department over several days. Different systems no doubt have their own particular merits and downsides.

What is rarely done though is to precisely define what the department hopes to achieve with this hire. Even rarer is to evaluate later whether the hire was the right decision. For instance, a department may want to increase its research reputation. This is a goal which may mean different things to different people - some may think it implies gaining more research funding, others may consider that publications in top tier journals are more important. To define a measure of success, the department could decide that it wants the hired candidate to publish as many papers as possible in a defined set of acceptable journals, or to bring in as much grant income as they can. It can then measure the success of the decision with respect to these numbers later.

But there remains a problem here. What threshold determines a good decision? The goal of the hiring committee was to select the best candidate. The committee should not be considered a success if it picked one good candidate from among many others who were just as good, nor a failure if it hired one poor candidate from a generally weak field. To decide if the hiring process was successful or not, it is necessary to keep track of the paths not taken, the candidates not selected. Academics are fairly easy to keep track of online - we have a strong tendency to build up elaborate online presences to advertise our research. Therefore it should be possible to keep an eye on shortlisted candidates who were not hired, and see how they perform.

Such a process raises statistical and ethical issues. Selected candidates may perform better simply because they were given a chance while others were not. Would it be ethical or wise for the department to make tenure contingent on outperforming the other shortlisted candidates? (I would argue not, but this would be similar to the practice of hiring more assistant professors than the department plans to give tenure to.) Nonetheless, applied sparingly and with a little common sense, it could give some idea as to whether hiring committees were able to judge which candidates were genuinely the best for the job more accurately than picking from the shortlist at random. This could then be used as evidence for improving hiring procedures in the future. For example, a department aiming to improve its research ranking might choose to employ young academics with papers in top journals, only to find that they struggled to replicate this success without the support of their previous supervisor. Over time they could recognise this pattern and look for more evidence of independent work and research leadership.
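
As a rough illustration, the comparison against random picking could be sketched in a few lines of Python. Everything here is hypothetical: the function names, the idea of reducing each candidate's later performance to a single score (say, papers in the agreed set of journals), and the numbers themselves are assumptions made up for the example, not real data. For each past search, the sketch computes the fraction of rejected shortlisted candidates that the hire went on to outperform. A committee picking at random would average about 0.5 on this measure.

```python
from statistics import mean


def hire_beat_fraction(hired_score, rejected_scores):
    """Fraction of rejected shortlisted candidates the hire outperformed.

    Ties count as half a win, so random picking averages 0.5.
    """
    wins = sum(
        1.0 if hired_score > r else 0.5 if hired_score == r else 0.0
        for r in rejected_scores
    )
    return wins / len(rejected_scores)


def committee_skill(searches):
    """Mean beat-fraction across several past searches."""
    return mean(hire_beat_fraction(hired, rejected) for hired, rejected in searches)


# Hypothetical records: (hire's later score, later scores of the rejected shortlist).
# These numbers are invented purely for illustration.
searches = [
    (12, [8, 15, 5]),
    (7, [9, 10]),
    (20, [4, 6, 11, 3]),
]

skill = committee_skill(searches)
print(f"Average fraction of rejected candidates outperformed: {skill:.2f}")
```

A value well above 0.5 over many searches would hint that the committee is doing better than chance, though with the caveat already mentioned: the hired candidate had the advantage of actually getting the job, so the comparison is biased in the committee's favour.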

Similar questions could be asked of selection procedures for allocating grant money and publishing papers. In some cases there is a process for evaluating success (grant funders ask for reports, journals check their citations and impact factors), but all too rarely do those doing the selecting evaluate whether the people, papers or proposals that they rejected would have been better than those they selected, i.e. whether they succeeded in the task of selecting the best. Without this feedback, it is easy for institutions to lapse into making selections based on intuitively sensible criteria which have little hard evidence to support them.