Braun Tibor, Schubert András (eds.): Peer Review in Scientific Research: Selected Papers from the Literature on the Topic (MTAK Informatics and Science Analysis Series 7, 1993)
DOMENIC V. CICCHETTI: The Reliability of Peer Review for Manuscript and Grant Submissions: A Cross-Disciplinary Investigation
people relied more on personal intuition than on the collective views of often faceless referees, and their success rates did not compare unfavourably. The citation rates (not necessarily an index of excellence) for the Lancet persistently show it to be amongst the top six medical journals, but it is only in recent years that outside referees have played any significant part in the acceptance of articles for publication. Although it would be unwise to turn the clock back to these early beginnings, a quarter turn anticlockwise would not be out of place.

Do peer reviewers really agree more on rejections than acceptances? A random-agreement benchmark says they do not

Gerald S. Wasserman
Department of Psychological Sciences, Purdue University, West Lafayette, IN 47907
Electronic mail: codelab@psych.purdue.edu

Cicchetti ably summarizes the accumulating evidence indicating that peer review reliability is unimpressive, and he correctly concludes that this finding has implications that should influence the structure of peer review systems. He weakens this conclusion, however, by adding the notion that reviewers agree fairly strongly with each other when they reject, even if they do not agree strongly when they accept. This notion is troubling for two reasons.

First, it leads to a seductive rationalization for inaction: It is easy and quite comforting to say that there will never be enough money (or journal space, or whatever) to allocate to all good research; therefore, we should be content if we can make sure that scarce resources are not wasted by allocating any of them to bad research. The present systems would supposedly accomplish this, despite their low overall reliability, if they really did consistently exclude bad research.

The second reason is that the notion is counterintuitive: Acceptance and rejection are just opposite sides of the same dichotomy, and what is true for the one should be true for the other. This intuition prompted me to examine the evidence that led Cicchetti to his notion. Specifically, I compared his actual results with the benchmark results one would obtain if no real agreement existed and reviewers were making purely random judgments. This examination shows that the intuition is correct and his notion is unfounded. I will illustrate the examination with a detailed analysis of the data Cicchetti presents in his Table 6, and I will present expressions to calculate the general case.

Figure 1 gives a graphical representation of Cicchetti's data. It represents the peer review process as a sequential flow chart, even though the reviews are actually done independently. The input to the process is 150 grant proposals, of which Table 6 indicates that 52 got high ratings and 98 got low ratings. I have interpreted these tabular entries to mean that the reviewers' average rating was high for 52 of the 150 proposals and low for the other 98, and I have added the further assumption that the average individual reviewer's performance is given by the collective average of all the reviewers' performances. The proposals are read by one peer reviewer (Rater 1), who accordingly judges (on average) that 52 proposals are high and should be accepted (YES), while 98 proposals are low and should be rejected (NO). These proposals are then read by the second peer reviewer (Rater 2), who also says YES to 52 (i.e., 28 + 24) proposals and NO to 98 (i.e., 24 + 74) proposals (which shows that the flowchart representation does not depend on which reviewer actually judged first).
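The flowchart's arithmetic can be checked in a few lines. The sketch below is illustrative code, not part of the original commentary; it assumes the 2 x 2 split of the 150 proposals quoted above (28 YES-YES, 24 YES-NO, 24 NO-YES, 74 NO-NO) and recovers the acceptance and rejection agreement percentages discussed next.

```python
# Illustrative reconstruction of Figure 1 from the counts quoted above
# (assumed split of 150 proposals: 28 YES-YES, 24 YES-NO, 24 NO-YES, 74 NO-NO).

yes_yes, yes_no = 28, 24  # Rater 1 YES (52 proposals): Rater 2 says YES / NO
no_yes, no_no = 24, 74    # Rater 1 NO  (98 proposals): Rater 2 says YES / NO

assert yes_yes + yes_no + no_yes + no_no == 150       # all proposals accounted for

acceptance_agreement = yes_yes / (yes_yes + yes_no)   # 28 / 52
rejection_agreement = no_no / (no_yes + no_no)        # 74 / 98

print(f"acceptance agreement: {acceptance_agreement:.0%}")  # 54%
print(f"rejection agreement:  {rejection_agreement:.0%}")   # 76%
print(f"difference:           {rejection_agreement - acceptance_agreement:.0%}")  # 22%
```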
I backed into the relation between the two raters' judgments by using the agreement percentages in Table 6. They show, as one would expect from the weak reliabilities reported in Table 6, that the reviewers' judgments are only weakly correlated: Rater 2 gives 28 YESs to the 52 proposals rated YES by Rater 1; this figure is computed from the tabular acceptance agreement of 54%. On the negative side, Rater 2 gives 74 NOs to the 98 proposals rated NO by Rater 1; this figure is computed from the tabular rejection agreement of 76%.

Figure 1 (Wasserman). Flow chart intended to represent the evaluation of grant proposals by two peer reviewers (acceptance agreement = 54%, rejection agreement = 76%, difference = 22%). Data were taken from Table 6 of the target article. See text for detailed explanation.

Figure 1 shows, as noted in the target article, that rejection agreement is 22 percentage points higher than acceptance agreement. It is on this kind of difference that Cicchetti bases his notion. But the first question should be: Against what benchmark should these numbers be evaluated?

Figure 2 shows a benchmark created by a completely random process. In this case, the peer reviewers do not read the proposals. Instead, each reviewer has a bucket that contains 150 balls: 52 white and 98 black. Each proposal is "judged" by reaching into the bucket and pulling out a ball, a white ball meaning YES and a black ball meaning NO.

Figure 2 (Wasserman). Flow chart representing a benchmark based on purely random agreements between two separate reviewers evaluating 150 proposals (acceptance agreement = 35%, rejection agreement = 65%, difference = 30%). The first reviewer does not read the proposals, but instead evaluates the proposals by pulling balls from a bucket with 52 white balls (YES) and 98 black balls (NO). The second reviewer has another such bucket to use independently.
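The same arithmetic yields the chance benchmark of Figure 2. The sketch below is again illustrative code rather than Wasserman's own "expressions for the general case", which are not reproduced in this excerpt; it treats each reviewer as answering YES with the base-rate probability p = 52/150, so that under independence the chance acceptance agreement is p, the chance rejection agreement is 1 - p, and their gap is 1 - 2p.

```python
# Chance benchmark for two "reviewers" who ignore the proposals and each
# answer YES with the overall base rate p. Under independence,
# P(Rater 2 = YES | Rater 1 = YES) = p and P(Rater 2 = NO | Rater 1 = NO) = 1 - p.

def chance_benchmark(p_accept):
    """Return (chance acceptance agreement, chance rejection agreement, gap)."""
    acc = p_accept        # chance that Rater 2 echoes a YES
    rej = 1.0 - p_accept  # chance that Rater 2 echoes a NO
    return acc, rej, rej - acc

p = 52 / 150  # base acceptance rate from Table 6 (52 high ratings out of 150)
acc, rej, gap = chance_benchmark(p)
print(f"chance acceptance agreement: {acc:.1%}")  # 34.7% (Figure 2 shows 35%)
print(f"chance rejection agreement:  {rej:.1%}")  # 65.3% (Figure 2 shows 65%)
print(f"chance difference:           {gap:.1%}")  # 30.7% (Figure 2 shows 30%)
```

On these numbers the purely random gap of about 30 percentage points already exceeds the observed 22-point gap, and it is on this comparison that the commentary's argument turns.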