DOMENIC V. CICCHETTI: The Reliability of Peer Review for Manuscript and Grant Submissions: A Cross-Disciplinary Investigation
future performance of NSF reviewing? Our data suggest that the difference in agreement levels on proposals with low and high ratings might even intensify. In contrast to this, Wiener et al. (1977) provided some suggestive data (variances and standard deviations) to show that reviewer agreement levels were highest on AHA grants receiving top grades, worst for grants receiving the lowest grades, and intermediate for grants receiving intermediate grades. An obvious question would concern what percentage of AHA grants were in fact funded during this 1974-75 period of grant submission and review. Again, further research and additional analyses of existing data on the peer review of grant proposals are urgently needed to help clarify these important issues.

5.2. Caveat #2: Field studies of peer review lack necessary controls for proper interpretation.

Because varying sets or numbers of reviewers examine different manuscripts or grant proposals, it is never possible in what we have called naturalistic research designs (sect. 3.1) to determine how much of the unreliability results from differences in the characteristics of the reviewers themselves: for example, level of experience; harshness or leniency as critics; the quality or technical difficulty of the manuscripts, abstracts, or proposals under review; blindness or openness of reviews; or some attribute that may be masked in the reasons the reviewer offers for recommending rejection. Such attributes include the following: theoretical biases; biases against statistically nonsignificant results; and the prestige of the author or institution. To make matters even more complicated, the unreliability of peer review may in fact involve some complex interaction among some or all of these or still other uncontrolled variables.

6. Clarifying issues of interpretation

6.1. Quasi-experimental and experimental studies of peer review.

To study directly the influence of the prestige of the author's affiliation on the reliability of peer review, Peters and Ceci (1982) resubmitted 12 articles that had already been published in prestigious psychology journals (between one and a half and three years earlier) by authors from highly regarded and well-published American psychology departments. The authors' names and affiliations were fictionalized, the latter being made much less prestigious (e.g., "Tri-Valley Center for Human Potential"). Only 3 of the 12 resubmissions were recognized as having been published previously. All but 2 of the 18 referees and editors recommended rejection of the resubmitted publications. One weakness of this study was the authors' contention that the findings provided evidence of reviewer bias in favor of high-status authors or high-status affiliations. A plausible alternative explanation has been offered by critics, namely, that the results provide evidence of reviewer bias against low-status authors and/or institutions. As Peters and Ceci appropriately respond, however, "While we do not know for certain which of the two forms of bias is more likely, neither is desirable" (Peters & Ceci 1982, p. 247). Consistent with Peters & Ceci's findings, a large number of authors using research designs other than quasi-experimental ones have reported a relationship between author affiliation and the likelihood of publication in major journals (e.g., see Berelson 1960; Beyer 1978; Cleary & Edwards 1960; Crane 1967; Goodrich 1945; Kraus 1950; Pfeffer et al. 1977; Yotopoulos 1961).
A second criticism of the Peters & Ceci study is that it lacked an appropriate control group consisting of previously rejected manuscripts resubmitted for further review. Smigel and Ross (1970) tested just that: they resubmitted an "accidental" sample of eight rejected manuscripts that had remained in their editorial files to a new set of reviewers under a new editor of Social Problems. Of these, seven were rejected by both editorial referees, and one was conditionally accepted by one referee, with no opinion given by the second. Whatever interpretation one chooses to make of these findings (since neither study included proper controls), the results are consistent with the data presented in Tables 5 and 6, namely, that reviewers have much less difficulty in agreeing on rejection than on acceptance.

In one of the best controlled studies of peer review (89% response rate, random assignment to experimental conditions), Mahoney (1977) invited 75 guest reviewers of the Journal of Applied Behavior Analysis to review manuscripts that all tested the same dominant behavior modification hypothesis. The manuscripts had identical Introduction and Methodology sections but varied systematically in whether the Results and Discussion sections were (i) not provided at all, or the findings were described as (ii) "positive," (iii) "negative," or (iv) "mixed." Referees were asked to judge each manuscript on the basis of overall scientific merit (publishability) and to apply normative criteria, including ratings of topical relevance, methodology, and data presentation.

The referees of the manuscripts reporting positive results usually recommended acceptance with moderate revisions. The referees who received papers showing mixed results consistently opted for rejection. Those who read manuscripts giving negative results typically recommended rejection or major revisions. Referees evaluating manuscripts that reported no results at all gave more positive recommendations than those whose manuscripts had a Results section. For both the positive and the negative manuscripts there was an R of .94 between ratings of perceived adequacy of "methodology" and potential publishability; there was a corresponding R of .56 between the perceived adequacy of "data presentation" and publishability. In another set of analyses, marked discrepancies were found between what referees predicted as their expected levels of interrater reliability on the various evaluative criteria and what turned out to be their actual levels of interrater reliability: the predicted reliability (R_i) levels for the criteria (e.g., adequacy of methodology, extent of overall scientific contribution) varied within a narrow range of .69 to .74, whereas the actual levels of R_i ranged between -.07 (below chance expectancy) and +.30. In fact, Mahoney's finding of an R_i of only .03 between referee ratings of methodologic adequacy, coupled with an R of .94 between perceived adequacy of the methodology and publishability (a contrast illustrated in the sketch below), is entirely consistent with the findings of two naturalistic studies discussed earlier (Cicchetti & Eron 1979; Scott 1974) and also with the results of two
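To make the contrast between these two kinds of coefficient concrete, the following minimal sketch (in Python, using invented ratings, not Mahoney's data) compares a within-referee correlation between methodology and publishability ratings with a between-referee correlation on the methodology criterion; the latter is what an interrater reliability coefficient such as R_i estimates. Simple Pearson correlations are used here purely for illustration; the R_i values reported in the text are interrater reliability coefficients whose exact computation differs.

    # Minimal illustrative sketch (hypothetical numbers, not Mahoney's data).
    # "within-referee r" correlates one referee's methodology ratings with that
    # same referee's publishability ratings across manuscripts (this kind of
    # coefficient can be very high), while "between-referee r" correlates two
    # different referees' methodology ratings for the same manuscripts --
    # the interrater agreement, which can be near zero at the same time.
    from math import sqrt

    def pearson_r(x, y):
        # Plain Pearson product-moment correlation between two rating vectors.
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sqrt(sum((a - mx) ** 2 for a in x))
        sy = sqrt(sum((b - my) ** 2 for b in y))
        return cov / (sx * sy)

    # Invented ratings (1 = poor ... 6 = excellent) for eight manuscripts.
    ref_a_methodology    = [5, 2, 4, 6, 1, 3, 5, 2]
    ref_a_publishability = [5, 1, 4, 6, 2, 3, 6, 2]  # tracks referee A's own methodology view
    ref_b_methodology    = [3, 2, 5, 2, 3, 6, 4, 3]  # an independent second referee

    print("within-referee r (methodology vs. publishability):",
          round(pearson_r(ref_a_methodology, ref_a_publishability), 2))
    print("between-referee r (agreement on methodology):",
          round(pearson_r(ref_a_methodology, ref_b_methodology), 2))

With these invented numbers the within-referee correlation comes out high (about .94) while the between-referee correlation is essentially zero, illustrating how a referee's global judgment can track his or her own perception of the methodology even when two referees' perceptions of that same methodology scarcely agree.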