Braun Tibor, Schubert András (eds.): Peer Review in Scientific Research: Selected Papers from the Literature (MTAK Informatics and Science Analysis Series 7, 1993)
DOMENIC V. CICCHETTI: The Reliability of Peer Review for Manuscript and Grant Submissions: A Cross-Disciplinary Investigation
for an example.) This matter deserves more formal study, but accurate judgments of scientific importance are probably reliable only years after the fact of publication, with the wisdom of hindsight.

In summary, let us simply grant that the peer review system is inherently unreliable, to a great extent. Two reasonable people, both experts in their fields, can look at the same manuscript or grant proposal and reach quite different conclusions about its merit. But if scientists cannot really make valid judgments about such matters (which seems likely, too), then the unreliability may not actually be harmful. Perhaps the randomness introduced into the system is good for it, if even reliable judgments have little validity.

If these conclusions are indeed facts, should we be depressed and give up peer review? I don't think so. After all, peer review does function well (a) to eliminate the real "bloopers," and (b) to provide expert opinion to authors, which is often helpful (in my experience). And there seems no reasonable alternative to peer review, no system that would work so well without engendering more problems than it solved. My recommendation is that editors and grant administrators recognize fully the potential flaws in the peer review system and work around them. In cases of divided opinion, editors may use the heuristic of "when in doubt, accept" (cited by Cicchetti). My view is that, in most fields, the unreliability of peer review does little harm and may do good, assuming that several journals are appropriate outlets for a piece of work. If a paper is rejected by one, the negative reviews can be used as advice for improvement for resubmission elsewhere. Given several outlets, persistent authors, and unreliability in the peer review system, worthy papers will eventually see the light of day, even if not in the outlet of first choice, and only after some delay.

The situation with regard to grant proposals is less optimistic, mainly because there are fewer sources of funds. A negative evaluation is more likely to mean that the work will not be carried out. Evaluating proposed research seems even more fraught with difficulty than evaluating completed work. One solution would be to follow the Canadian system in which (as I understand it) many researchers are given small seed grants at the beginning of their careers, and the system then rewards those who carry forward successful research programs. Perhaps in awarding grants we should place greater emphasis on the applicant's past record of research and less emphasis on the writing of a promissory note (in the form of a proposal) for future work. This recommendation assumes that judges can evaluate research records with greater reliability and validity than they can evaluate research proposals, a topic that awaits future investigation.

Some indices of the reliability of peer review

Robert Rosenthal
Department of Psychology, Harvard University, Cambridge, MA 02138

Cicchetti has performed an important service to the several sciences by summarizing what is known about the reliability of peer review. Given the impact of Behavioral and Brain Sciences target articles, it is likely that his paper will encourage further research and further thinking about the reliability of peer review. Its impact may also extend to encouraging the use of various indices of the reliability of judgments.
It is therefore of special importance to be clear about several issues relevant to the choice of indices of reliability. The purpose of this commentary is to suggest some friendly amendments to the evaluations of several indices of reliability referred to or used in the target article.

Three more-information-efficient indices. Three of these indices of reliability are very information-efficient in the sense that they use all the information available and give a single, unequivocal, focused, single-df, easy-to-interpret index of the magnitude of relationship (Rosenthal 1987; Rosenthal & Rosnow 1985; Rosenthal & Rubin 1982). These are the Pearson R, the intraclass correlation, and Cohen's (1960) kappa applied to the 2 × 2 table. Especially for that case of the intraclass R in which each rater judges all stimuli, all three of these indices are equivalent to product-moment correlations. Indeed, Fisher developed the intraclass R to be able to apply the Pearson R to twin data in which it would be arbitrary to designate either twin as the X or the Y. Fisher originally dealt with this situation by listing each twin pair twice, once as XY and once as YX (Snedecor & Cochran 1967). Cohen's kappa in the 2 × 2 case is equivalent to the Pearson R in its 0,1 incarnation, an R sometimes referred to as the phi coefficient. In short, these three indices all tell essentially the same story, so it seems inconsistent to label the intraclass R as appropriate (Cicchetti, sect. 3.3) and the Pearson R, from which the intraclass is derived, as inappropriate (sect. 3.4). The Pearson R "ignores the extent to which given pairs of reviewers disagree on any single evaluation" precisely to the same degree that the intraclass R (Model II) does. If it is desired that absolute differences in raters' judgments be considered, intraclass R Model I can be used. Incidentally, it should be noted that the equations given for intraclass R Models I and II are not standard. [Corrected in printed version, Ed.] The definitional equation (Guilford 1954; Snedecor & Cochran 1980) for Model I is:

$$R_{\mathrm{I}} = \frac{MS_S - MS_E}{MS_S + (r - 1)\,MS_E} \qquad \text{(Model I)} \tag{1}$$

where $MS_E$ pools the raters and residual mean squares, whereas for Model II it is:

$$R_{\mathrm{II}} = \frac{MS_S - MS_{RS}}{MS_S + (r - 1)\,MS_{RS}} \qquad \text{(Model II)} \tag{2}$$

where $MS_{RS}$ is the residual mean square only.

Three less-information-efficient indices. Three of these indices are usually less information-efficient, sometimes very much so: rates of agreement (sect. 4.7), χ² (sect. 4.7), and kappa for tables larger than 2 × 2 in which kappa has not been weighted to become effectively a focused, single-df, effect-size estimate. Rates of agreement suffer from the problem that nearly perfect agreement can occur with an actual R near zero (Rosenthal 1984; 1987). χ² suffers from being a product of R² and N, so that it is driven up not only by increases in reliability but by increases in sample size as well (Rosenthal & Rosnow 1984). Kappa on df > 1 suffers from the same problem as any other diffuse or omnibus procedure, namely, that whatever its size, we cannot tell where the agreements or disagreements arise unless kappa approaches unity so that there are no disagreements (see Fleiss 1981).
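To make these relations concrete, here is a minimal Python sketch; it is not part of Rosenthal's commentary, and the function names and the small demonstration table are illustrative assumptions only. It computes Cohen's kappa and the Pearson R on the 0,1 coding (the phi coefficient) for a 2 × 2 agreement table, and the intraclass correlations of equations (1) and (2) from the mean squares of a stimuli × raters layout in which every rater judges every stimulus.

```python
import numpy as np

def kappa_and_phi(table):
    """Cohen's kappa and the Pearson R on the 0,1 coding (phi) for a 2 x 2 table.

    table[i, j] = number of manuscripts placed in category i by referee A
    and category j by referee B.
    """
    t = np.asarray(table, dtype=float)
    n = t.sum()
    p_obs = np.trace(t) / n                                # observed agreement
    p_exp = (t.sum(axis=1) * t.sum(axis=0)).sum() / n**2   # chance agreement
    kappa = (p_obs - p_exp) / (1.0 - p_exp)

    a, b, c, d = t.ravel()                                 # the four cell counts
    phi = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return kappa, phi

def intraclass_models(ratings):
    """Intraclass correlations of equations (1) and (2).

    ratings is a stimuli x raters array; every rater judges every stimulus.
    Model I pools the raters and residual sums of squares into MS_E;
    Model II uses the residual (stimuli x raters) mean square MS_RS alone.
    """
    s, r = ratings.shape
    grand = ratings.mean()
    ms_s = r * ((ratings.mean(axis=1) - grand) ** 2).sum() / (s - 1)   # stimuli
    ss_r = s * ((ratings.mean(axis=0) - grand) ** 2).sum()             # raters
    resid = (ratings - ratings.mean(axis=1, keepdims=True)
             - ratings.mean(axis=0, keepdims=True) + grand)
    ss_rs = (resid ** 2).sum()
    ms_rs = ss_rs / ((s - 1) * (r - 1))           # residual mean square, MS_RS
    ms_e = (ss_r + ss_rs) / (s * (r - 1))         # pooled raters + residual, MS_E

    model_i = (ms_s - ms_e) / (ms_s + (r - 1) * ms_e)       # equation (1)
    model_ii = (ms_s - ms_rs) / (ms_s + (r - 1) * ms_rs)    # equation (2)
    return model_i, model_ii

# Hypothetical accept/reject counts for two referees, for illustration only.
table = np.array([[60, 25],
                  [25, 40]])
print(kappa_and_phi(table))   # with matched marginals, kappa and phi coincide
```

On the condensed 2 × 2 data of Cicchetti's Note 6 (not reproduced here), both indices returned by the first function should agree at the value of .145 reported in the example below.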
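Rosenthal's complaint that χ² confounds the strength of agreement with the number of cases can be checked the same way: for a 2 × 2 table, χ² equals N times the squared phi correlation. The sketch below (again illustrative, using a hypothetical table) scales the same agreement pattern up and watches χ² grow while phi stays put.

```python
import numpy as np

def chi_square(table):
    """Pearson chi-square computed from the expected cell counts."""
    t = np.asarray(table, dtype=float)
    n = t.sum()
    expected = np.outer(t.sum(axis=1), t.sum(axis=0)) / n
    return ((t - expected) ** 2 / expected).sum()

base = np.array([[60, 25],      # hypothetical accept/reject counts,
                 [25, 40]])     # for illustration only
for scale in (1, 2, 10):        # same pattern of agreement, more manuscripts
    t = scale * base
    n = t.sum()
    a, b, c, d = t.ravel().astype(float)
    phi = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    # chi-square equals N * phi**2 here: it grows with N while phi is fixed
    print(int(n), round(phi, 3), round(chi_square(t), 1), round(n * phi ** 2, 1))
```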
An example. Because of the valuable information provided in Cicchetti's Note 6 we essentially had the raw data for the Journal of Abnormal Psychology set of 1,313 articles and the ratings of two referees for each article. Each referee could use 4 levels of evaluation, so the data could be cast into a 4 × 4 table of agreement. The product-moment R using linear contrast scores of −3, −1, +1, +3 for the 4 levels of evaluation was .189. The corresponding kappa was .108. When the 4 × 4 table was condensed to a 2 × 2 table, the product-moment R was identical to kappa; both were .145, illustrating both the loss of information in going from 4 levels to 2 and the equivalence of R and kappa for a 2 × 2 table (df = 1).

The same data of Note 6 can be used to address an additional issue. In section 4.7, agreement rates had been used to assess the question of whether reviewers agree more on decisions to reject than on those to accept manuscripts. Table 5 of the target article shows agreement levels of 44% on decisions to accept and 70% on decisions to reject for the data on the Journal of Abnormal Psychology. Using kappa or Pearson R, however,