Tibor Braun, András Schubert (eds.): Peer Review in Scientific Research: Selected Papers from the Literature on the Subject (MTAK Informatics and Science Analysis Series 7, 1993)
DOMENIC V. CICCHETTI: The Reliability of Peer Review for Manuscript and Grant Submissions: A Cross-Disciplinary Investigation
completely conclusive and the editors must judge as best they can the inconclusive evidence which bears on the subjective acceptance criteria (Adair & Trigg 1979, p. 476). Consistent with Adair's assessment, Lazarus (1982, p. 219) notes that with respect to levels of interreviewer agreement for manuscripts submitted to the Physical Review Letters, "in only 10-15% of cases do two referees agree on acceptance or rejection the first time around and this with the authors' and institutional identities known!" Adair (1982) has expressed optimism that this situation will improve. Formal studies of the reliability of peer review for manuscripts submitted to physical science journals, especially in the more general areas, must be conducted, however, so that our conclusions can be based on more quantitative results than have been available thus far. Since the Physical Review Letters has been considered one of the two most prestigious publications in the field (Beyer 1978; Lodahl 1970), and, like the general journals in behavioral science and medicine, it does use the two-initial-referee system, a more quantitative assessment of peer-review practices should be of more than passing interest to an important segment of the scientific community. If such a study were undertaken, we would predict that levels of referee consensus for Physical Review Letters would be of the same relatively low order of magnitude (typically below an R_i of .40) characterizing general journals in many other disciplines.

The 1985-86 rejection rates of Physical Review Letters (consistent with the ordering of those for the Physical Review) are highest for the general subfields of general physics (74%, or 631 manuscripts [MS] rejected/854 MS received) and cross-disciplinary physics (68%, or 71/106); the rejection rate was lowest for the much more specific subfield, atoms and molecules (52%, or 243/470). Moreover, these data are consistent with journal rejection rates in psychology (Summary Report of Journal Operations 1988), in which general-focus journals have the highest rejection rates, for example, the Journal of Applied Psychology (93%), Psychological Review (89%), and the Journal of Experimental Psychology (JEP): General (81%). At the same time, the more specifically focused journals have the lowest rejection rates, for example, JEP: Learning, Memory, and Cognition (58%), the Journal of Comparative Psychology (39%), and Behavioral Neuroscience (also 39%). These data are also consistent with those reported by Lock (1985) for medical journals. Similarly, Hargens (1988, p. 139) notes that "cultural anthropology journals have higher rejection rates than physical anthropology journals, and rates for journals in social, abnormal, and educational psychology exceed those in experimental, comparative, and physiological psychology." During the early 1980s, the general-focus (cultural) journal, the American Anthropologist, had a rejection rate of 85%, while the American Journal of Physical Anthropology evidenced a sharply contrasting rejection rate of only 22% (Hargens 1988, p. 150).

Our work and that of Hargens and Herting (1990b) support the argument that while manuscripts submitted to the journals studied in the behavioral and medical areas seem routinely to receive at least two independent reviews, this option is used in physics and related fields only when a manuscript seems problematic.
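The rejection rates quoted above are simply manuscripts rejected divided by manuscripts received. The short Python sketch below is our own illustrative check (it is not part of the original analyses); it recomputes the Physical Review Letters percentages from the counts given in the text.

# Illustrative check of the rejection-rate arithmetic quoted above:
# each rate is manuscripts rejected divided by manuscripts received.
counts = {
    "general physics": (631, 854),
    "cross-disciplinary physics": (71, 106),
    "atoms and molecules": (243, 470),
}
for subfield, (rejected, received) in counts.items():
    rate = 100.0 * rejected / received
    print(f"{subfield}: {rate:.0f}% rejected ({rejected}/{received})")
# The output agrees with the quoted figures to within a percentage point.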
In contrast to the experience of the Physical Review and other physics journals (e.g., Abt 1988), fewer than 1 in 4 manuscripts (22% of 274 manuscripts) submitted to the general Journal of Abnormal Psychology in 1973 received reviews based on the deliberations of a single referee. Moreover, the overwhelming majority of them (52/59, or 88%) were rejected.

Since the only comprehensive study of peer review of grant proposals was undertaken by Cole et al. (1981), this area is completely open to further research. Roy (1985) reminds us that there are five systems of grant review which are so different that criticisms aimed at one of them are not applicable to the others. For example, although all five systems use mail reviewers, they differ in terms of: (a) who selects the reviewers (i.e., program managers or peers unknown to the program managers); (b) the specific method of grant evaluation (average of referees' ratings, or the decision of an independent panel of peers); and (c) whether or not peer reviews are followed by a panel site visit. One interesting research question accordingly concerns how such differences might influence both the reliability and validity of grant reviews.

4.6. Reliability of grant reviews. The major source of data on the reliability of grant reviews is NSF grant submissions in three areas of study (chemical dynamics, economics, and solid state physics) as analyzed by Cole & Cole (1981, pp. 71-79). Three sets of reviews were considered: (1) NSF "open" (nonblind) reviews, (2) Committee on Science and Public Policy of the National Academy of Sciences (COSPUP) "open" reviews, and (3) COSPUP "blind" reviews. Commenting on the interreferee reliability estimates from these data, Cole and Cole (1985, p. 38) wrote, "We have treated the reviewer variances as rough indicators of disagreement among reviewers." In order to derive direct indicators of disagreement among reviewers, we first identified the problem of assessing grant-review reliability as a case of a more general problem in which: (1) there are two or more independent examiners per subject or object being evaluated; (2) both the number and the identity of the examiners can vary from subject to subject; and (3) the data derive from a continuous, dimensional, or quasi-dimensional scale of measurement. In their description of a computer program for analyzing such data, Cicchetti and Showalter (1988, pp. 717-18) noted that "an area of inquiry to which this design would apply is the assessment of the reliability of the peer review of grant applications. Thus, there may be two independent reviewers for grant A and four different independent reviewers for grant B. The range of possible ratings may be between, say, 10 (lowest score possible) and 50 (highest score possible), such as the evaluation schema used by referees in the peer review of National Science Foundation (NSF) grants (e.g., Cole & Cole 1981)." As mentioned in section 3.3, the statistic of choice would be the intraclass correlation coefficient (R_i, Model III), based on the average number of reviews per grant proposal, as discussed in both Bartko & Carpenter (1976) and Cicchetti & Showalter (1988). These results are presented (for the first time) in Table 4 and once again indicate rather low levels of chance-corrected agreement. These range from .18 for COSPUP
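To make the design just described concrete, the following Python sketch computes a one-way random-effects intraclass correlation for targets rated by varying numbers of independent reviewers. It is our own illustrative reconstruction, not the Cicchetti and Showalter (1988) program, and whether this one-way formulation corresponds exactly to the Model III computation cited above is an assumption on our part; the grant labels and ratings shown are hypothetical NSF-style scores on a 10-50 scale.

# A minimal sketch (not the published program) of a one-way random-effects
# intraclass correlation when each grant is rated by a different number of reviewers.
def icc_one_way(groups):
    """groups: one list of reviewer ratings per grant proposal."""
    n = len(groups)                                 # number of grants
    N = sum(len(g) for g in groups)                 # total number of ratings
    grand_mean = sum(sum(g) for g in groups) / N

    # Between-grant and within-grant mean squares from the one-way ANOVA.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    bms = ss_between / (n - 1)
    wms = ss_within / (N - n)

    # Effective (average) number of reviewers per grant when group sizes differ.
    k0 = (N - sum(len(g) ** 2 for g in groups) / N) / (n - 1)

    icc_single = (bms - wms) / (bms + (k0 - 1) * wms)   # reliability of a single review
    icc_average = (bms - wms) / bms                     # reliability of the averaged reviews
    return icc_single, icc_average

# Hypothetical ratings on a 10-50 scale: grant A has two reviewers, grant B has four, and so on.
ratings = [[38, 44], [22, 30, 27, 35], [41, 40, 46], [15, 28]]
single, average = icc_one_way(ratings)
print(f"single-review R_i = {single:.2f}, average-review R_i = {average:.2f}")

On this reading, the average-measure coefficient reflects the reliability of the mean of the reviews each proposal actually receives, which is how we interpret the "average number of reviews per grant proposal" phrasing above.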