Braun Tibor, Schubert András (szerk.): Szakértői bírálat (peer review) a tudományos kutatásban : Válogatott tanulmányok a téma szakirodalmából (A MTAK Informatikai És Tudományelemzési Sorozata 7., 1993)

DOMENIC V. CICCHETTI: The Reliability of Peer Review for Manuscript and Grant Submissions: A Cross-Disciplinary Investigation

future performance of NSF reviewing? Our data suggest that the difference in agreement levels on proposals with low and high ratings might even intensify. In contrast, Wiener et al. (1977) provided some suggestive data (variances and standard deviations) indicating that reviewer agreement levels were highest for AHA grants receiving top grades, lowest for grants receiving the poorest grades, and intermediate for grants receiving intermediate grades. An obvious question is what percentage of AHA grants were in fact funded during this 1974-75 period of grant submission and review. Again, further research and additional analyses of existing data on the peer review of grant proposals are urgently needed to help clarify these important issues.

5.2. Caveat #2: Field studies of peer review lack necessary controls for proper interpretation. Because varying sets or numbers of reviewers examine different manuscripts or grant proposals, it is never possible in what we have called naturalistic research designs (sect. 3.1) to determine how much of the unreliability results from differences in the characteristics of the reviewers themselves: for example, level of experience; harshness or leniency as critics; the quality or technical difficulty of the manuscripts, abstracts, or proposals under review; blindness or openness of reviews; or some attribute that may be masked in the reasons the reviewer offers for recommending rejection. Such attributes include the following: theoretical biases; biases against statistically nonsignificant results; and the prestige of the author or institution. To make matters even more complicated, the unreliability of peer review may in fact involve some complex interaction among some or all of these or still other uncontrolled variables.

6. Clarifying issues of interpretation

6.1. Quasi-experimental and experimental studies of peer review. To study directly the influence of the prestige of the author's affiliation on the reliability of peer review, Peters and Ceci (1982) resubmitted 12 articles that had already been published in prestigious psychology journals (between one and one-half and three years earlier) by authors from highly regarded and well-published American psychology departments. The authors' names and affiliations were fictionalized, the latter being made much less prestigious (e.g., "Tri-Valley Center for Human Potential"). Only 3 of the 12 resubmissions were recognized as having been published previously. All but 2 of the 18 referees and editors recommended rejection of the resubmitted publications. One weakness of this study was the authors' contention that the findings provided evidence of reviewer bias in favor of high-status authors or high-status affiliations. A plausible alternative explanation has been offered by critics, namely, that the results provide evidence of reviewer bias against low-status authors and/or institutions. As Peters and Ceci appropriately respond, however: "While we do not know for certain which of the two forms of bias is more likely, neither is desirable" (Peters & Ceci 1982, p. 247). Consistent with Peters & Ceci's findings, a large number of authors using research designs other than quasi-experimental ones have reported a relationship between author affiliation and the likelihood of publication in major journals (e.g., see Berelson 1960; Beyer 1978; Cleary & Edwards 1960; Crane 1967; Goodrich 1945; Kraus 1950; Pfeffer et al. 1977; Yotopoulos 1961).
A second criticism of the Peters & Ceci study is that it lacked an appropriate control group consisting of previously rejected manuscripts resubmitted for further review. Smigel and Ross (1970) tested just that: They resubmitted an "accidental" sample of eight rejected manuscripts that had remained in their editorial files to a new set of reviewers under a new editor of Social Problems. Of these, seven were rejected by both editorial referees and one was conditionally accepted by one referee, with no opinion given by the second. Whatever interpretation one chooses to make of these findings (since neither study included proper controls), the results are consistent with the data presented in Tables 5 and 6, namely, that reviewers have much less difficulty in agreeing on rejection than on acceptance.

In one of the best-controlled studies of peer review (89% response rate, random assignment to experimental conditions), Mahoney (1977) invited 75 guest reviewers of the Journal of Applied Behavior Analysis to review manuscripts that all tested the same dominant behavior modification hypothesis. The manuscripts had identical Introduction and Methodology sections, but varied systematically in whether the Results and Discussion sections were (i) not provided at all, or the findings were described as (ii) "positive," (iii) "negative," or (iv) "mixed." Referees were asked to judge the manuscript on the basis of overall scientific merit (publishability) and to apply normative criteria, including ratings of topical relevance, methodology, and data presentation. The referees of the manuscripts reporting positive results usually recommended acceptance with moderate revisions. The referees who received papers showing mixed results consistently opted for rejection. Those who read manuscripts giving negative results typically recommended rejection or major revisions. Referees evaluating manuscripts that reported no results at all gave more positive recommendations than those whose manuscripts had a Results section.

For both the positive and the negative manuscripts there was an R of .94 between ratings of perceived adequacy of "methodology" and potential publishability; there was a corresponding R of .56 between the perceived adequacy of "data presentation" and publishability. In another set of analyses, marked discrepancies were found between the levels of interrater reliability referees predicted for the various evaluative criteria and their actual levels of interrater reliability: The predicted reliability (R_i) levels for the criteria (e.g., adequacy of methodology, extent of overall scientific contribution) varied within a narrow range of .69 to .74. The actual levels of R_i ranged between -.07 (below chance expectancy) and +.30. In fact, Mahoney's finding of an R_i of only .03 between referee ratings of methodologic adequacy, coupled with an R of .94 between perceived adequacy of the methodology and publishability, is entirely consistent with the findings of two naturalistic studies discussed earlier (Cicchetti & Eron 1979; Scott 1974) and also with the results of two
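[Editorial illustration.] The interrater reliability figures quoted above (the R_i values) summarize how closely independent referees' numeric ratings of the same submissions agree. As a purely illustrative sketch, the short Python program below computes one common such coefficient, the one-way random-effects intraclass correlation ICC(1,1), for a small set of hypothetical two-referee ratings; both the ratings and the choice of ICC model are assumptions made here for illustration and are not taken from Mahoney (1977) or the other studies cited.

    # Illustrative only: hypothetical referee ratings, not data from the studies cited above.
    import numpy as np

    def icc_oneway(ratings: np.ndarray) -> float:
        """One-way random-effects intraclass correlation, ICC(1,1).

        ratings: array of shape (n_targets, n_raters), e.g. manuscripts x referees.
        """
        n, k = ratings.shape
        grand_mean = ratings.mean()
        target_means = ratings.mean(axis=1)

        # Mean squares from a one-way ANOVA: between targets and within targets.
        ms_between = k * ((target_means - grand_mean) ** 2).sum() / (n - 1)
        ms_within = ((ratings - target_means[:, None]) ** 2).sum() / (n * (k - 1))

        return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

    # Hypothetical example: six manuscripts, each rated 1-6 by two referees.
    ratings = np.array([[5, 4],
                        [2, 3],
                        [6, 6],
                        [3, 1],
                        [4, 5],
                        [2, 2]])
    print(f"ICC(1,1) = {icc_oneway(ratings):.2f}")

A coefficient near zero, like the near-chance R_i values Mahoney observed for ratings of methodological adequacy, means the two referees' orderings of the same manuscripts barely agree, even when each referee's own ratings correlate strongly with his or her publishability judgment.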
