Wednesday, August 10, 2016

In defence of the Journal Impact Factor

With the possible exception of the BBC, academia must be the institution that spends the biggest percentage of its time criticising itself. The 'science of science' is an established field of research in its own right. Researchers have rightly raised awareness of how statistical methods are misused, how career and funding incentives could be better aligned with good scientific practice, and how performance metrics are used inappropriately.

Perhaps the most frequent target of criticism in this last category is the Journal Impact Factor (JIF). The JIF is a measure of how many citations papers in a specific journal tend to receive. Precisely, it is defined as the mean number of citations received in the last year by articles published in the journal in the previous two years.
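To make the definition concrete, here is a minimal sketch of the calculation in Python. All the article names and citation counts are made up for illustration:

```python
# Minimal sketch of the JIF calculation (hypothetical data).
# JIF for year Y = citations received in Y to articles published
# in years Y-1 and Y-2, divided by the number of those articles.

citations_2016 = {            # citations received in 2016, per article
    "paper_a": 40, "paper_b": 3, "paper_c": 0,
    "paper_d": 7, "paper_e": 1,
}  # all five articles published in 2014-2015

jif_2016 = sum(citations_2016.values()) / len(citations_2016)
print(jif_2016)  # 10.2
```

Note how the single heavily cited paper_a dominates the average, which is exactly the skewness complaint in point 1 below.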

A list of common complaints about the JIF would include:
  1. Citation distributions are skewed, with many papers receiving few citations and a few papers receiving many citations. As such the JIF is a poor representation of a 'typical' paper in the journal.
  2. The JIF is a statistic of relevance to the journal, but is inappropriately used to judge individual papers or researchers, which are better judged by their own number of citations.
  3. Journals' pursuit of higher JIF scores biases them towards eye-catching papers and positive results, rather than solid research, replication studies and negative results.
  4. Journals engage in dodgy practices in order to artificially inflate their JIF.

I don't hold any great admiration for the JIF, but my instinctive contrarianism has made me sceptical of these complaints. As an exercise in devil's advocacy, I'll try to give answers to each of them.

1. Citation distributions certainly are skewed. So are lots of things. People's salaries, for instance, are highly skewed. Take a look at this plot of household income from Wikipedia.
That huge bar on the right-hand side indicates a long tail of households with very high incomes. Because of this the mean income is about 20% higher than the median. This skew is more pronounced in some countries than others: the US and the UK have substantially more income inequality than most continental European countries, for instance. As such one should treat differences in mean income between countries with a little caution - the higher mean income in the US compared to most European countries is predominantly due to a small number of wealthy individuals. That being said, do we seriously doubt that there is a difference between a country with a mean income of $50,000 a year and one with $10,000 a year? Clearly mean incomes tell us something about the quality of life in different countries, the prospects of their citizens, and their overall clout in the world. Compare the plot above with the distribution of citations to two journals, also from Wikipedia. They have the same basic features.


Sure, it may be daft to claim that a journal with a JIF of 5 is substantially different to one with a JIF of 4.5. One should not fetishise irrelevant differences just because they are presented with apparently high precision. But the truth of the matter is that knowing that one paper was published in a high JIF journal and another in a low JIF journal gives you some information about the likely quality of each. There will be many exceptions where bad papers appear in good journals and vice versa. But as long as it provides some information people will continue to use it. Seeking to banish impact factors from discussion will only make this use more opaque.

The skewed nature of the distribution does introduce extra uncertainty into estimating a population mean. It is often stated that, because citations follow an approximate power-law distribution, the mean of the distribution has no descriptive value. This is untrue. Estimates of the power-law exponent for citation distributions are generally in excess of 3, meaning that both the mean and the variance of the distribution are well defined. As such the Law of Large Numbers and the Central Limit Theorem apply, and the sample mean converges to the underlying mean of the distribution, with approximately normally distributed uncertainty. Therefore the JIF does what it says on the tin: it gives a reasonable estimate of the expected number of citations a paper in that journal will receive.
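This convergence is easy to check numerically. The sketch below draws from a Pareto distribution with tail exponent 3.5 - a stand-in for a citation distribution with the relevant skew, not real citation data - and shows that the mean of a journal-sized sample of 200 'papers' is a stable estimate of the underlying mean:

```python
import random

random.seed(1)
alpha = 3.5                        # tail exponent > 3: finite mean and variance
true_mean = alpha / (alpha - 1)    # Pareto (x_m = 1) mean = 1.4

# Simulate 1000 "journals" of 200 papers each: individual values are
# heavily skewed, yet each journal's sample mean sits close to true_mean.
sample_means = [
    sum(random.paretovariate(alpha) for _ in range(200)) / 200
    for _ in range(1000)
]
avg = sum(sample_means) / len(sample_means)
print(round(true_mean, 3), round(avg, 3))
```

With a tail exponent below 2 the variance would be infinite and this averaging would behave far more erratically; the point is that estimated citation exponents are comfortably above that threshold.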

For describing what is likely to happen to a single paper, the median would be a better measure than the mean. But few people are claiming that a switch from mean to median would fix their issues with the JIF.
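A toy example of the gap between the two measures, using invented citation counts for a small journal:

```python
import statistics

# Hypothetical citation counts for seven papers in one journal:
cites = [0, 0, 1, 2, 3, 5, 40]

print(statistics.mean(cites))    # JIF-style average, pulled up by the one hit paper
print(statistics.median(cites))  # closer to what a 'typical' paper receives
```

The mean here is above 7 while the median is 2: the average is dominated by the single highly cited paper.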

2. This is the point I take most issue with. In a recent preprint on bioRxiv relating to the use of JIFs, the authors claim in their abstract that:

Although there are differences among journals across the spectrum of JIFs, the citation distributions overlap extensively, demonstrating that the citation performance of individual papers cannot be inferred from the JIF.

This obviously relates strongly to the discussion of point 1. To what extent can I predict how many citations a paper will receive, based on the JIF of the publishing journal?

Overlapping distributions. A simple riposte to the above quote is that distributions overlapping does not make them useless. The distributions of height in men and women overlap a lot. There are many men below 5 ft 8 in tall and many women taller than 5 ft 10 in. Nonetheless, the mean height of men is significantly greater than the mean height of women, and knowing someone's gender gives you a lot of predictive power when estimating their height. Likewise there are plenty of people in developing countries who have incomes greater than the average British worker, but no one thinks the country someone lives in is irrelevant in determining their income. The case of JIFs differs from these examples only in the quantitative degree of overlap. Since JIFs are relatively stable over time, by definition the JIF must give accurate information about the expected number of citations a paper will receive. Indeed, studies show that the JIF is a better predictor of the citations a paper will receive than subjective judgements about paper quality. Unless the JIF were fluctuating wildly over time this simply has to be true.
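The height example can be simulated directly. Assuming roughly normal heights with means of 178 cm and 165 cm and a common 7 cm standard deviation (illustrative values, not survey data), the distributions overlap substantially, yet the group means differ sharply:

```python
import random
import statistics

random.seed(0)
# Hypothetical heights (cm) for two overlapping groups:
men = [random.gauss(178, 7) for _ in range(10000)]
women = [random.gauss(165, 7) for _ in range(10000)]

mean_men = statistics.mean(men)
mean_women = statistics.mean(women)
# Fraction of women taller than the average man: nonzero overlap...
overlap = sum(w > mean_men for w in women) / len(women)
# ...yet the difference in means is large and reliable.
gap = mean_men - mean_women

print(round(gap, 1), round(overlap, 3))
```

A few percent of one group exceed the other group's mean, so individual predictions will sometimes be wrong, but group membership still carries real information - which is all anyone claims for the JIF.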

Journal-level vs article-level metrics. My major gripe about this point is not whether the JIF is a useful predictor of the number of citations a paper will get. It is the idea that the actual number of citations received is somehow a superior estimate of a paper's quality. New publication houses such as the Public Library of Science like to champion 'article-level metrics' over the JIF, arguing that a paper should be judged independently of the journal it appears in. If we lived in a world where everyone took the trouble to read, consider and evaluate papers in their entirety, I'd be perfectly happy to get on the ditch-the-JIF bandwagon. But that simply isn't going to happen. If we stop looking at journal metrics we are left looking at article-level metrics such as the number of citations or the social-media response. But the very arguments against JIFs are at least as valid against article-level metrics. The highly skewed distribution of citations is not necessarily due to a highly skewed distribution of article quality, but reflects the nature (or Nature?) of the science citation game. The simplest explanation for this skew is that papers with many citations tend to be cited more in future. This could be because they are intrinsically better papers, but the effect tends to be exponential rather than linear in time, suggesting that the appearance of a paper in reference lists adds to its salience for future citers. Moreover, papers with famous authors, papers with many co-authors and papers in popular areas tend to receive more citations. Untangling quality effects from random noise is extremely difficult. Do I think citation metrics are useless? No. But are they a clean estimate of a single paper's quality relative to other work? Not at all.

Citation-process noise reduction and new papers. In fact, in my opinion the JIF is superior to article-level citations in many instances. Consider that if a paper is published in, say, PNAS, several referees have read the paper in detail (hopefully) and decided that it is a piece of work that meets the general standards of that journal. The JIF of PNAS (which is about 9) then does the useful statistical job of averaging over all papers that meet that standard, removing a lot of noise in the process, and telling us something about how good the average paper meeting those standards is. In science we usually favour statistics from large sample sizes rather than single data points. Why should you be punished because your excellent paper wasn't one of the few runaway citation successes? Is the JIF perfect? Of course not! Publication in leading journals is also biased towards established leading scientists and their protégés, towards native English speakers, and so on. Using the median would probably be more informative about the prospects of a 'typical' paper. But is it better to use only the citations to a specific article? Absolutely not. For one thing, the journal a paper is published in gives immediate information about the paper, whereas citations can take years to build up. For researchers with few previous papers (I refuse to use the term Early Career Researchers, which seems to apply to anyone below 50 now), this can make a serious difference.

3. Journals want people to read them, or more importantly they want librarians to subscribe to them. As long as journals make their money from subscriptions they will always want eye-catching results, and forever neglect less glamorous work - especially in the journals that leading publication houses use as the eye-candy to get people to subscribe en masse to their less-read titles. Trying to increase their JIF scores is as much a symptom of this problem as it is a cause.

Librarians simply are not going to look at the full distribution of citations from a journal when making subscription decisions. They want a few numbers - ideally just one - to use to make that choice. We could, for instance, redefine the JIF to be based on the median number of citations. This might stop top journals chasing the few geese that lay the golden eggs of mega-citations (which seem to be far more likely to later be proved flawed or even retracted). But ultimately journals will always want research that is more likely to be read and cited. I am more concerned with the mountains of academic research published at great expense and hardly ever read.

The only exception to this rule is journals that receive all their money from authors paying to publish. PLoS One charges about $1500 per article to authors for publication, and promises to publish anything that is technically correct. I will leave the reader to guess whether I think this is a good idea. (I used to publish in PLoS One, but now I conveniently can't afford to do so anymore).

Personally I'm broadly in favour of more open science (not capitalised), and of the use of open repositories such as arXiv and bioRxiv. I'm interested in the possibilities of formalised post-publication review. I think the amount of money spent on academic publishing is a disgrace. Everyone should learn to typeset their own papers properly, as is standard in computer science. I hear about and read most papers via the Twitter grapevine rather than by browsing particular journals, but this is a process subject to a whole load of biases of its own.

4. Goodhart's law. Whatever metric you choose as a proxy for quality will become corrupted if rewards accrue to those with higher scores. Gaming is all but inevitable. Perfectly reasonable standards of behaviour should be adopted, such as counting papers as published when they first appear online rather than much later in print. The best way to enforce such standards is to shun journals that obviously engage in dodgy practices - but almost no one is going to do this if the journal is a leader in their field. You already know a rough order of journal quality in your field, so gaming tactics that add a few points to the JIF should not unduly trouble you.

Ultimately the best way to judge a paper is to read it. Within our respective fields we all know what a 'good' journal is, and what it takes to be published there. Anyone who judges a researcher or makes a hiring decision by simply adding up the JIFs of all their papers is a fool. So is someone making the same decision based on total citations or the h-index. Detailed data on the citation distributions of various leading journals is now accessible, so if you want to find out which journal gives you the best chance of getting those h-index-improving n citations, you can.

The 'game' of a scientific career is noisy, prejudiced, unfair and no sure way to health, wealth and happiness. The same is true of almost any career. Do work you believe in and enjoy, make reasonable adjustments to adapt to the system and don't make becoming a professor at a top institution or publishing in Science or Nature your only goals in life. This much is obvious. But the JIF is no more flawed than any other reductionist metric of outputs, and getting rid of it will, in and of itself, solve absolutely nothing.