April 24, 2017

Peer Review, Replication and Publication Bias

In 2005, three MIT graduate students, Jeremy Stribling, Dan Aguayo and Maxwell Krohn, wrote the program SCIgen to generate fake papers. In their sting, they submitted a SCIgen-generated paper to the 2005 World Multiconference on Systemics, Cybernetics and Informatics. That paper was entitled “Rooter: A Methodology for the Typical Unification of Access Points and Redundancy”.

Here’s the abstract from the fake paper:

Many physicists would agree that, had it not been for congestion control, the evaluation of web browsers might never have occurred. In fact, few hackers worldwide would disagree with the essential unification of voice-over-IP and public private key pair. In order to solve this riddle, we confirm that SMPs can be made stochastic, cacheable, and interposable…

The three authors were invited to speak at the conference, where they exposed the hoax. The SCIgen program is freely available on the internet for anyone to download and use.

As recently as 2013, at least 16 SCIgen papers had been found in Springer journals.

According to the paper by Cyril and Dominique Labbé entitled “Duplicate and Fake Publications in the Scientific Literature: How many SCIgen papers in Computer Science?”, SCIgen papers had an acceptance rate of 13.3% at the ACM digital library and 28% at the Institute of Electrical and Electronics Engineers (IEEE).

Now, certainly the ACM digital library and the IEEE are not the most prestigious venues. But 16 got into Springer. I don’t know what percentage of SCIgen papers got in, but some did. And if completely bogus, ridiculous nonsense-jargon papers could get in at least some of the time, what about papers that aren’t so transparently bogus, whose authors are smarter liars than a text-spinning algorithm?

This is the point. Nobody would say that the prestigious journals are literally churning out thousands of SCIgen papers, but the fact that sometimes SCIgen papers can get through calls into question the seriousness of the peer review process.

Another sting operation was run by John Bohannon. Bohannon wrote essentially the same paper 304 times, about a lichen extract that supposedly inhibited cancer growth. The paper had glaring flaws, which he describes in his Science article, “Who’s Afraid of Peer Review?”.

Among them was a claimed correlation between lichen exposure and cancer inhibition when his own chart showed zero correlation. He posed as researchers from various third-world institutes, using randomly generated names for the authors and institutions of his 304 fake papers, and shuffling paragraphs around.

These are the same text-“spinning” techniques used by spammers to get past spam filters. He also ran his original text through Google Translate into French and back into English, then manually corrected the biggest errors in the final translation, so that the paper had correct grammar but the idiom of a non-native speaker.

The 304 slightly different papers were sent to 304 journals. In total, 157 were accepted, 98 were rejected, 29 journals were derelict, and 20 were still reviewing the paper by the time Bohannon published the results of his sting.

He sent the paper to 167 journals listed in the Directory of Open Access Journals (DOAJ), 121 journals on Jeffrey Beall’s list, and 16 journals that appeared on both Beall’s list and the DOAJ.

Beall’s list is a list of journals that Jeffrey Beall has determined to be bogus. The Directory of Open Access Journals is run by Lars Bjørnshauge, a library scientist at Lund University in Sweden.

Bohannon says of the DOAJ,

Without revealing my plan, I asked DOAJ staff members how journals make it onto their list. “The title must first be suggested to us through a form on our website,” explained DOAJ’s Linnéa Stenson. “If a journal hasn’t published enough, we contact the editor or publisher and ask them to come back to us when the title has published more content.” Before listing a journal, they review it based on information provided by the publisher.

The results of the sting were as follows:

Reaction                     DOAJ     Beall's List   Overlap
Rejected w/o peer review     44.4%    3.1%           3 (total)
Rejected with peer review    11.1%    10.3%          2 (total)
Accepted w/o peer review     24.3%    48.5%          6 (total)
Accepted with peer review    20.1%    38.1%          3 (total)
Total responses              144      97             14

The fact that junk journals accepted a junk article is not interesting. What is interesting is that journals run by Sage, Elsevier and Wolters Kluwer all accepted Bohannon’s bogus paper.

Sage’s Journal of International Medical Research accepted the paper,

Wolters Kluwer’s Journal of Natural Pharmaceuticals accepted the paper, and

Elsevier’s Drug Invention Today accepted the paper.

Springer, Sage, Wolters Kluwer and Elsevier all went into damage control mode with apologies and statements.

For example, Elsevier says that it doesn’t actually own Drug Invention Today. The problem, though, is that the journal is published by Elsevier, and anyone who reads something from Drug Invention Today will see a big “Elsevier” logo right up top, because it’s published right alongside Elsevier’s other journals. The fact that Elsevier doesn’t legally own the journal is a red herring; this distinction was only highlighted by Elsevier after it got caught in the sting.

Same with Wolters Kluwer’s Journal of Natural Pharmaceuticals. Wolters Kluwer shut down that journal in response to the sting. But there’s no reason to believe that the Journal of Natural Pharmaceuticals was any worse than any of Wolters Kluwer’s other journals; it just happened to be the one targeted by Bohannon.

Bohannon’s sting and the SCIgen sting show that horrifically bad papers can get through with some regularity. From what I see, these intentionally terrible SCIgen or Bohannon-esque papers get past the big, prestigious journals maybe 2% of the time.

Now, I am NOT saying that these big journals routinely publish papers that are as bad as the intentionally bad paper Bohannon wrote. I am saying that the fact that such “boringly bad” papers can occasionally get through – even into the damage-controlling big journals – calls into question the review process. If Bohannon’s paper can sometimes get through, how often do better-crafted fakes get through?

Another sting was run in 1998 by Fiona Godlee, an editor of the British Medical Journal. She took a paper that was about to be published in the BMJ, modified it so that it contained 8 major errors, and sent it to 420 reviewers. Only 221 of the reviewers responded. The median number of errors found by the respondents was 2; 35 respondents didn’t find a single error, and nobody found more than 5.

In 2008, a similar study was done, again by Fiona Godlee and again with BMJ reviewers. This time the paper contained 9 “major” errors and 5 “minor” errors, and it was sent to 607 reviewers.

The study compared the effects of training reviewers on how many errors they caught. One group had face-to-face training, one group was self-taught, and the other had no training (the control).

These were the results:

Average number of major errors found by group (out of 9):

Group              Avg. Number of Errors Found
1 (Control)        2.74
2 (Self-Taught)    3.01
3 (Face-to-Face)   3.12

Average number of minor errors found by group (out of 5):

Group              Avg. Number of Errors Found
1 (Control)        1.07
2 (Self-Taught)    0.85
3 (Face-to-Face)   1.03

While presented as a study testing the efficacy of a reviewer training program, it is a de facto sting against the British Medical Journal.

In 2014, Journal Citation Reports gave the British Medical Journal an impact factor of 16.378, putting it in 4th place among all general medical journals in the world. In my opinion, the fact that Godlee actually ran these stings is evidence that the British Medical Journal is probably better than average, and that other journals, which don’t even bother with this kind of self-testing, are probably even worse.

In 1997, for the paper “Who Reviews the Reviewers? Feasibility of Using a Fictitious Manuscript to Evaluate Peer Reviewer Performance”, the authors sent an intentionally flawed manuscript to all of the reviewers at Annals of Emergency Medicine. At the time of submission, the journal had an acceptance rate of 26%. Today the journal has an impact factor of around 4.33, which is about average.

Of the 262 reviewers the manuscript was sent to, the responses were:

63 – No response
117 – Rejection
67 – Revision
15 – Immediate acceptance

But keep in mind that the opinions of the reviewers who didn’t respond would not be included in the decision to accept or reject the article.

By my estimate, this fake study would have had about a 1.4% chance of being published by Annals of Emergency Medicine. This assumes it would be assigned 3 reviewers, that the editors would require unanimous acceptance to publish, and that half of the reviewers who requested revisions would later accept the paper while the other half would later reject it.
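
Here is the arithmetic behind that 1.4% figure as a small Python sketch, using the response counts above. The three-reviewer panel and the 50/50 split on revision requests are my own assumptions, not anything from the paper:

```python
# Back-of-the-envelope estimate of the fake manuscript's chance of acceptance,
# using the Annals of Emergency Medicine response counts above and my own
# assumptions: 3 reviewers per submission, unanimous acceptance required,
# and half of the "revise" verdicts eventually turning into acceptances.

responded = 262 - 63                        # 199 reviewers actually responded
immediate_accept = 15
revise = 67

effective_accepts = immediate_accept + revise / 2        # 48.5
p_one_reviewer_accepts = effective_accepts / responded   # ~0.24

p_published = p_one_reviewer_accepts ** 3   # all 3 assigned reviewers accept
print(f"{p_published:.1%}")                 # ~1.4%
```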

So this paper probably would not have been accepted. What is more interesting is the number of errors identified:

Verdict   Avg. Number of Errors Found
Accept    1.73
Reject    3.91
Revise    2.96
Total     3.423
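
As a quick consistency check, the “Total” row appears to be just the response-weighted average of the three verdict groups. A minimal sketch, assuming the response counts listed above:

```python
# Sanity check on the "Total" row: the response-weighted mean of the three
# verdict groups, using the response counts from the breakdown above.
# The small difference from the reported 3.423 is presumably rounding
# in the per-group averages.

groups = {                  # verdict: (number of reviewers, avg. errors found)
    "Accept": (15, 1.73),
    "Reject": (117, 3.91),
    "Revise": (67, 2.96),
}

n_total = sum(n for n, _ in groups.values())                        # 199
weighted_mean = sum(n * avg for n, avg in groups.values()) / n_total
print(f"{weighted_mean:.3f}")                                       # ~3.426
```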

Some of the highlights are:

Only 30.2% of reviewers noted that there was no randomization of treatment. Only 0.5% (one half of one percent, i.e. just one of the 199 reviewers) saw that the p-value calculations were incorrect. Only 10.6% noted that the drug being tested, propranolol, wasn’t being compared to a known agent.

Even though this article probably wouldn’t have been accepted under my assumptions, the fact that so few of the intentionally planted errors were found is, in my opinion, a condemnation. And Annals of Emergency Medicine seems to be an average journal, so these results are probably typical.

In 2000, the journal Brain, which is an Oxford publication, looked into the agreement between reviewers of articles at other journals. Unfortunately, those journals agreed to this only on the condition that they remain anonymous. So we’re trusting Oxford that they picked good journals.

Journal A:
Acceptance agreement: 47% vs. 42.5% expected by chance alone
Priority agreement: 35% vs. 42.5% expected by chance alone

Journal B:
Acceptance agreement: 61% vs. 45.74% expected by chance alone
Priority agreement: 61% vs. 46.32% expected by chance alone

By the way, I inferred the numbers for chance here by counting the pixels in the bar chart. This is something I find myself doing ALL THE TIME when looking at published peer reviewed papers.
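
For reference, here is how an “agreement expected by chance” baseline like the ones above is typically computed: it’s the expected-agreement term from Cohen’s kappa, i.e. the probability that two independent reviewers land on the same rating given only their overall rating frequencies. A minimal sketch; the rating frequencies below are hypothetical and are not taken from the Brain study:

```python
# Minimal sketch of a chance-agreement baseline (the expected-agreement term
# in Cohen's kappa). The rating frequencies below are hypothetical, chosen
# only to illustrate the calculation; they are NOT taken from the Brain study.

def chance_agreement(rates_reviewer_a, rates_reviewer_b):
    """Probability that two independent reviewers give the same rating by
    chance: the sum over rating categories of the product of their rates."""
    return sum(a * b for a, b in zip(rates_reviewer_a, rates_reviewer_b))

# Suppose both reviewers rate manuscripts on a 3-point scale
# (reject / revise / accept) with these hypothetical frequencies:
reviewer_a = [0.45, 0.30, 0.25]
reviewer_b = [0.40, 0.30, 0.30]

print(f"{chance_agreement(reviewer_a, reviewer_b):.1%}")   # 34.5%
```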

Replication

One other way to examine the efficacy of peer review is to look at replication. Now it’s possible that some researcher can make a bogus study with bogus results, and other researchers will replicate bogus findings for whatever reason. It’s also possible that a researcher will not quite follow the same methods, even though they think they are.

But the first step to replication is having the methods available. In 2013, Melissa Haendel et al. looked at 238 biomedical papers from 84 journals. Across all of the studies, the percentage of resources needed for replication that could actually be identified was as follows:

Resource             Percent Identifiable
Antibody             44%
Cell Lines           43%
Constructs           25%
Knockdown Reagents   83%
Organisms            77%

Only 5 of the journals analyzed had, by her definition, “stringent” resource reporting guidelines. Another issue is authors’ outright unwillingness to share data.

A report from the Institute of Medicine by Christine Laine, entitled “Sharing Clinical Research Data: A Workshop”, asked 389 researchers from 2008 to 2012 how willing they would be to share protocols and raw data.

In 2008, 80% of the respondents said they would be willing to share protocol details beyond what was covered in the methods section, but only 60% said they would be willing to share raw data.

In 2012, only 60% of researchers said they were willing to provide additional protocols, and only 45% said they would be willing to share raw data.

Keep in mind this is just a survey. In my opinion, this overestimates how many REALLY would share this information.

Part of this may explain the difficulty of replication. In 2012, the company Amgen reported on its attempt to replicate 53 of what it labeled “landmark studies” on cancer. These are generally defined as studies published in “high impact” journals – journals that are cited a lot, and whose articles are themselves cited a lot.

The result was that they failed to replicate 47 of the 53 studies selected. Maybe their difficulty in replicating had to do with not getting the original data and protocols, or maybe it’s because those studies were bunk. Maybe Amgen had some axe to grind and the whole thing was done in bad faith.

But that’s the problem. We don’t know. It’s just not being replicated.

Bayer had somewhat more success with its replication attempts in 2011. Khusru Asadullah headed an effort to replicate 67 studies Bayer was interested in. Asadullah and his team were able to replicate only 14 of the 67 studies, marginally better than Amgen’s result.

One could cry “conflict of interest” because these are corporate studies, not university ones. But that objection seems dubious, since Bayer and Amgen ran these replication attempts for the purpose of product development.

The website psychfiledrawer.org collects replication attempts in the field of psychology. Its “article list” page shows 20 successful replications and 42 failures.

11 studies have only successful replications, 21 have only failed replications, and 5 have a mixture of the two. This surprised me, as my preconception was that fields like psychology and sociology would have MORE unsuccessful replications than fields like medicine or computer science.

Bias Toward Positive Results

Another major problem is the bias toward positive results:

In the November 2010 paper “Testing for the Presence of Positive-Outcome Bias in Peer Review: A Randomized Controlled Trial”, the researchers sent test manuscripts to 238 reviewers for The Journal of Bone and Joint Surgery and Clinical Orthopaedics and Related Research.

Each reviewer was randomly given one of two versions of a paper on the effect of giving an antibiotic after surgery. The two versions were identical in everything EXCEPT the conclusions. The version that showed no effect for the antibiotic was accepted 80% of the time, whereas the version that concluded a positive effect was accepted 97.3% of the time.

In addition, the reviewers of the paper with no positive results found more methodological errors. This is evidence of something I suspected, which is that scientists will find more methodological problems with ideas they don’t find appealing, all else being equal.

In the paper “Publication Bias: The ‘File-Drawer’ Problem in Scientific Inference”, Jeffrey Scargle argued that researchers themselves are generally more interested in publishing positive results, while negative results (those consistent with the null hypothesis) are filed away.

Combine this with the fact that reviewers themselves are probably biased toward positive results, and you have another reason to doubt peer review.

Studies showing that most studies usually aren’t replicable have been replicated many times. This is a robust finding.

J. Scott Armstrong of The Wharton School at the University of Pennsylvania wrote what is, in my opinion, a scathing evaluation of what peer review is like in his paper “Peer Review for Journals: Evidence on Quality Control, Fairness, and Innovation”. The paper is mostly an analysis of other studies on peer review, but after looking at that data, here is how he described the process:

“These papers are then reviewed by people who are working in related areas but generally not on that same problem. So the reviewers typically have less experience with the problem than do the authors….

… Reviewers generally work without extrinsic rewards. Their names are not revealed, so their reputations do not depend on their doing high quality reviews…

… In any event, on average, reviewers spend between two and six hours in reviewing a paper although they often wait for months before doing their reviews. They seldom use structured procedures. Rarely do they contribute new data or conduct analyses. Typically, they are not held accountable for following proper scientific procedures…

… Reviewers’ recommendations often differ from one another, as shown by Cicchetti (1991). Most authors have probably experienced this. For example, here are reviews for one of my papers:

Referee #1: ‘. . . The paper is not deemed scientific enough to merit publication.’

Referee #2: ‘. . . This follows in the best tradition of science that encourages debate through replication.’

Authors are critical of the quality of the reviews that they receive. Bradley (1981) asked authors about their experience on the last compulsorily revised article published in a refereed journal. When asked whether the changes advocated by the referees were based on whim, bias, or personal preference, only 23% said none were, while 31% said that this applied to some important changes. Forty percent of the respondents said that some of the referees had not read the paper carefully.”

Now, one may argue that while peer review is seriously flawed, it shows that the universities are at least trying to create some method to combat bias, and that, as biased as they are, peer review boards are certainly less biased than the general public.

And maybe this is true, but I have not seen anyone who says this give any real reason to believe it.

“So peer review is a flawed process, full of easily identified defects with little evidence that it works. Nevertheless, it is likely to remain central to science and journals because there is no obvious alternative, and scientists and editors have a continuing belief in peer review. How odd that science should be rooted in belief.”

– Richard Smith, editor of the British Medical Journal for 13 years

In terms of the benefits of the journal system and peer review, there are two arguments that I have heard for it:

  1. That peer review keeps out bad science
  2. That it helps proliferate research

Regarding point #1, how on earth did researchers distinguish real science from quackery before peer review? Maybe because they themselves are actually capable of distinguishing it. After all, aren’t they supposed to be experts? Why do you need referees, usually 3 experts, to review something before other experts can review it once it’s published in the journals?

And I believe the evidence presented in this section calls the general efficacy of peer review into question.

In terms of proliferating research, maybe this was important before the internet. I suspect not, as the real explosion in publications and science journals didn’t happen until after World War II, which suggests that something other than the logistics of idea proliferation was the cause.

But today, what do journals do? Well, they keep articles in limbo for 3-9 months or longer, which delays the article being reviewed by everyone else, and then, once articles do get published, they sit behind a paywall.

“Peer review” is a phrase people like to say. In a sense, all scientific papers are peer reviewed: all of these researchers have colleagues, and if they publish in a non-peer-reviewed venue, the work still gets criticized, they modify it, and so on. “Peer reviewed” is really code for “got through the journal barrier”.

But after all of this, I am left wondering how it got this way.

In an article in The Guardian, Randy Schekman characterizes the situation as follows:

“We all know what distorting incentives have done to finance and banking. The incentives my colleagues face are not huge bonuses, but the professional rewards that accompany publication in prestigious journals – chiefly Nature, Cell and Science.

These luxury journals are supposed to be the epitome of quality, publishing only the best research. Because funding and appointment panels often use place of publication as a proxy for quality of science, appearing in these titles often leads to grants and professorships.

It is common, and encouraged by many journals, for research to be judged by the impact factor of the journal that publishes it. But as a journal’s score is an average, it says little about the quality of any individual piece of research. What is more, citation is sometimes, but not always, linked to quality. A paper can become highly cited because it is good science – or because it is eye-catching, provocative or wrong. Luxury-journal editors know this, so they accept papers that will make waves because they explore sexy subjects or make challenging claims. This influences the science that scientists do. It builds bubbles in fashionable fields where researchers can make the bold claims these journals want, while discouraging other important work, such as replication studies.

In extreme cases, the lure of the luxury journal can encourage the cutting of corners, and contribute to the escalating number of papers that are retracted as flawed or fraudulent. Science alone has recently retracted high-profile papers reporting cloned human embryos, links between littering and violence, and the genetic profiles of centenarians. Perhaps worse, it has not retracted claims that a microbe is able to use arsenic in its DNA instead of phosphorus, despite overwhelming scientific criticism.

Like many successful researchers, I have published in the big brands, including the papers that won me the Nobel prize for medicine, which I will be honoured to collect tomorrow. But no longer. I have now committed my lab to avoiding luxury journals, and I encourage others to do likewise.

Just as Wall Street needs to break the hold of the bonus culture, which drives risk-taking that is rational for individuals but damaging to the financial system, so science must break the tyranny of the luxury journals. The result will be better research that better serves science and society.”

So how did this happen? Well, I think that the rise of peer review, and the shockingly baseless acceptance of it, is related to the increase in the immediate power of the university.

According to the paper “The rate of growth in scientific publication and the decline in coverage provided by Science Citation Index”, there were roughly 32 times as many chemistry papers published in 2012 as in 1937. This is based on the overall rate of increase the authors found from 1907 to 2007; I then continued the trend out a few more years, as sketched in the code after the lists below.

Chemistry papers as a multiple of the 1907 value (1907 value = 1):

1907 – 1
1922 – 2
1937 – 4
1952 – 8
1967 – 16
1982 – 32
1997 – 64
2012* – 128

Mathematics papers as a multiple of the 1907 value (1907 value = 1):

1907 – 1
1919 – 2
1931 – 4
1943 – 8
1955 – 16
1967 – 32
1979 – 64
1991 – 128
2003 – 256
2015* – 512
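
The extrapolation behind these lists is nothing fancier than compounding a fixed doubling period. Here is a minimal sketch, assuming doubling periods of roughly 15 years for chemistry and 12 years for mathematics (the periods implied by the series above); the starred years are my own extension of the trend past the paper’s data:

```python
# Reconstruction of the doubling series above: papers as a multiple of the
# 1907 volume, assuming growth compounds with a fixed doubling period
# (roughly 15 years for chemistry and 12 years for mathematics, as implied
# by the series above). The final entries extrapolate past the paper's data.

def multiple_of_1907(year, doubling_period_years):
    return 2 ** ((year - 1907) / doubling_period_years)

print("Chemistry:")
for year in range(1907, 2013, 15):
    print(f"  {year} - {multiple_of_1907(year, 15):.0f}")

print("Mathematics:")
for year in range(1907, 2016, 12):
    print(f"  {year} - {multiple_of_1907(year, 12):.0f}")
```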

These are not exact figures, just approximations. However, the magnitude of the increase is so great that the qualitative argument I am making (that science has been inflated into a science industry) still holds. Halve the magnitude of the increase shown here, or double it, and the argument is still supported.

And that argument is that there is now a great journal industry. The big journals don’t provide anything of real value. Yes, there are subscriptions, and universities are sure to stock up on all the best journals! But that’s not really the point. Universities carry these journals for the authority points, and people publish in them for the authority points.

Universities like to be able to say that their professors are published in all the best journals, and getting research funding from organizations like the National Science Foundation (established in 1950) is made easier by being published in all sorts of journals.

Professors are increasingly judged by how much they are published, and in what journals, and how often they are cited.

And this great journal industry, this great scam, this sham, this hoax, has resulted in professors teaching less, teaching assistants teaching the actual classes at college, and, as a result, an ever-increasing volume of research papers published every year.

The increasingly authoritarian nature of the University, created largely by the rise of peer review, has in my opinion produced ANOTHER kind of authoritarianism – an authoritarian view of knowledge and truth.

Because there are so many papers, and these papers use unnecessary mathemagical tricks that are difficult to understand, and they are behind paywalls, this creates an impenetrable mass on any subject. And so if, for example, you wish to say that global warming is a great big scientific siren song, one common response is to point to the great mass of impenetrable research.

And it is impenetrable in 3 ways:

  1. Its volume – caused by professors wanting to get published in peer-reviewed journals so they can get tenure. Roughly 1.486 million research papers were published in 2010.
  2. The literal paywalls – caused by professors publishing in peer-reviewed journals instead of publicly available locations because they want to “get published”.
  3. The use of unnecessary mathemagical techniques and obscure language – caused by researchers who want to produce big effect sizes in a paper where there are none and hide problems behind obscure language that referees only sometimes recognize.

In conjunction with overspecialization, the increasing impenetrability of research has the effect of scientists just going along with the consensus – if they even know what the consensus is. You can’t argue about a topic because you literally can’t read the research.

And so a great number of people, especially a certain kind of person, simply go with cues. Their thinking goes like this:

“Scientists are supermen. They speak a completely different language, understand mathematical and statistical concepts at a level I never will, and they are 99% honest because of the hanging razor of peer review and the discerning eyes on their work.”

And so when there is an argument between perceived University consensus and someone who opposes it, there are these people who are impervious to argument. They think,

“Well, you may pose some tricky questions and objections that I can’t counter, but one of these scientist supermen easily could. Really, this isn’t an exchange. This is me trying to advance the true view – established by supermen in Universities via the peer review process – against the heathens.”

On a final note, one problem with the criticism of peer review you normally hear is that it is reformist. Reform is for something that has worked in the past but has stopped working. There’s no evidence that peer review ever worked, so there’s no need for reform. Just abolish it.

It’s a practice that emerged in response to social pressures and university status games, not scientific necessity. It emerged from the ugliest traits in people, and it is an authoritarian truth filter that belongs in the Middle Ages. It lies by calling itself “peer review”, implying that anyone who opposes journal gatekeeping is against review by peers, and it’s just another tool to make the lives of heterodox scientists even more difficult.

It is a medieval process that ought to be scrapped root and branch.

Sources

“Peer review: a flawed process at the heart of science and journals” – Richard Smith:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1420798/

“Peer Review for Journals: Evidence on Quality Control, Fairness, and Innovation” – J. Scott Armstrong, The Wharton School, University of Pennsylvania:
http://repository.upenn.edu/cgi/viewcontent.cgi?article=1110&context=marketing_papers

“Duplicate and Fake Publications in the Scientific Literature: How many SCIgen papers in Computer Science?” – Cyril Labbé and Dominique Labbé:
https://hal.archives-ouvertes.fr/hal-00641906v2/document

The “Rooter” paper:
http://pdos.csail.mit.edu/scigen/rooter.pdf

“Who’s Afraid of Peer Review?” – John Bohannon:
http://www.sciencemag.org/content/342/6154/60.full

“Gender Factors in Reviewer Recommendations for Manuscript Publication” – Margaret E. Lloyd:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1286270/pdf/jaba00090-0150.pdf

“Accuracy of References in Five Entomology Journals” – Cynthia Kristof:
http://files.eric.ed.gov/fulltext/ED389341.pdf

“Statistical Reviewing for Medical Journals”:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.471.730&rep=rep1&type=pdf

“Confirmational response bias among social work journals”:
http://sth.sagepub.com/content/15/1/9.abstract

“What errors do peer reviewers detect, and does training improve their ability to detect them?” – Fiona Godlee et al., 2008:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2586872/

“On the reproducibility of science: unique identification of research resources in the biomedical literature” – Melissa Haendel et al.:
https://peerj.com/articles/148/

“Sharing Clinical Research Data: A Workshop”:
http://blogs.nature.com/spoonful/2013/09/researchers-less-willing-to-share-study-details-according-to-journals-survey.html

Amgen replication attempts:
http://www.nature.com/nature/journal/v483/n7391/full/483531a.html

Bayer replication attempts:
http://www.nature.com/nrd/journal/v10/n9/full/nrd3439-c1.html

Psych File Drawer article list:
http://www.psychfiledrawer.org/view_article_list.php

“Who Reviews the Reviewers? Feasibility of Using a Fictitious Manuscript to Evaluate Peer Reviewer Performance”:
http://www.annemergmed.com/article/S0196-0644(98)70006-X/fulltext

Oxford (Brain) reviewer agreement study:
http://brain.oxfordjournals.org/content/123/9/1964

Randy Schekman’s Guardian article:
http://www.theguardian.com/commentisfree/2013/dec/09/how-journals-nature-science-cell-damage-science

“The rate of growth in scientific publication and the decline in coverage provided by Science Citation Index”:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2909426/

“Testing for the Presence of Positive-Outcome Bias in Peer Review: A Randomized Controlled Trial”:
http://archinte.jamanetwork.com/article.aspx?articleid=226270

“Publication Bias: The ‘File-Drawer’ Problem in Scientific Inference” – Jeffrey Scargle:
http://www.scientificexploration.org/journal/jse_14_1_scargle.pdf
