Braun Tibor, Schubert András (eds.): Expert Review (Peer Review) in Scientific Research: Selected Papers from the Literature on the Topic (MTAK Informatics and Science Analysis Series 7, 1993)
DOMENIC V. CICCHETTI: The Reliability of Peer Review for Manuscript and Grant Submissions: A Cross-Disciplinary Investigation
validity of peer review is considerably trickier than designing studies to assess inter-reviewer reliabilities. In particular, difficulties in selecting an appropriate criterion measure with which to assess research quality have hindered efforts to conduct empirical research on this topic. Researchers typically use journal citation frequency as a criterion measure in these studies, testing the hypothesis that, if manuscript reviews have predictive validity, then papers that receive highly positive reviews should be those that report the most important, well-designed studies. These papers should therefore be cited more frequently than papers that receive less positive reviews (Gottfredson 1978).

Although citation frequencies have been used to assess journal quality and the eminence of individual researchers (Garfield 1972; Lindzey 1977), the use of citation indices as a measure of the quality of a particular piece of research is questionable for several reasons. First, we make a number of assumptions regarding the quality of research based on the journal in which it appears. If a paper is published in a prestigious journal, we infer that it must be good and valuable research. Were the same paper to appear in a less prestigious journal, it would most likely be seen as less rigorous and important, and we would be less likely to cite it. Clearly, the well-known "halo effect" (Nisbett & Wilson 1977) influences our perceptions of psychological research.

Second, variables unrelated to research quality will influence the number of citations a paper receives. Mediocre research in an area that is tangentially related to a variety of topics will probably receive a greater number of citations than excellent research in a more obscure and narrowly defined area. Research on experimental design and methodology tends to be the most widely cited in all branches of science (see Lindsey 1978). This is not surprising, given that such papers have implications for a wide variety of topics.

Third, if a relationship between citation frequency and research quality does exist, this relationship is not likely to be linear. The relationship between research quality and citation frequency probably takes the form of a J-shaped curve, with exceedingly bad research cited more frequently than mediocre research (e.g., as an example of an idea or a line of research that turned out to be a blind alley, or as an example of what not to do in a particular area).

Finally, this outcome criterion does not allow the predictive validity of negative manuscript reviews to be assessed. Because studies receiving negative reviews may never be published (or may be published in obscure journals having very limited readerships), it is not possible to use criteria such as citation indices to assess the validity of these reviews.

In any case, there have been very few studies of the predictive validity of peer review, and the results of these have not been reassuring. Gottfredson (1978) compared reviewers' ratings of psychological research papers to the number of citations received by these papers in the first nine years following publication. He found only low to moderate correlations between reviewers' estimates of manuscript quality and impact and the number of citations received by a paper. Reviewers' ratings of research impact were most strongly predictive of subsequent citation frequencies (R = .37). Ratings of research quality did not fare as well (R = .24).
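To put these coefficients in perspective, they can be converted to shared variance via the coefficient of determination; the following arithmetic is an illustrative aside, not part of Gottfredson's report:

\[ R_{\text{impact}}^{2} = (0.37)^{2} \approx 0.14, \qquad R_{\text{quality}}^{2} = (0.24)^{2} \approx 0.06 . \]

That is, even the better of the two reviewer ratings accounts for only about 14% of the variance in subsequent citation counts, and ratings of research quality for about 6%.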
Thus, we know that: (1) inter-reviewer reliability in peer review is generally low (as Cicchetti et al. and others have demonstrated); and (2) we have no hard evidence that reviews have predictive (or discriminant) validity. To the extent that "confirmatory bias" (Mahoney 1985) and other variables unrelated to research quality demonstrably affect the outcome of peer reviews, the internal validity of the review process is also compromised. To anyone interested in the process of scientific inquiry and the dissemination of scientific knowledge, such findings are - to say the least - a bit unnerving. Because we regard peer review as a "test" or measure of the scientific worth of manuscripts and grant proposals (Bornstein 1990; Eichorn & VandenBos 1985), we should be able to demonstrate that this "test" is psychometrically sound. Yet, even a cursory reading of the American Psychological Association's Standards for educational and psychological testing (APA 1985) reveals that peer review fails miserably with respect to every technical criterion for establishing the reliability and validity of an assessment instrument (see APA 1985, pp. 9-44). If one attempted to publish research involving an assessment tool whose reliability and validity data were as weak as those of the peer review process, there is no question that studies involving this psychometrically flawed instrument would be deemed unacceptable for publication.

It is not too late to make changes in the peer review process that will help improve its reliability and validity. Cicchetti makes some useful suggestions in this area, and other researchers (e.g., Bornstein 1990; in press; Mahoney 1985; 1987) have also proposed procedures for improving the review process. At any rate, in addition to investigating reliability in manuscript and grant proposal assessments, we must now rigorously assess the predictive and discriminant validity of peer review. Altering the peer review process in order to maximize its reliability and validity may be difficult (in practical/procedural terms), costly (in monetary terms), and somewhat risky (the changes could create new problems instead of solving the old ones). I believe, however, that the costs and risks associated with changing - even experimenting with - the review process are far less than the costs and risks of continuing to support uncritically a process that, in its current form, has many significant flaws.

Does group discussion contribute to the reliability of complex judgments?

Patricia Cohen
Columbia University School of Public Health and New York State Psychiatric Institute, New York, NY 10032

Cicchetti is to be congratulated on a useful summary of our knowledge in this field. It seems reasonable to conclude that these complex human judgments cannot be made very reliably, a state of affairs that has also been demonstrated in other arenas, including student and personnel selection and the identification of diagnostic levels of psychopathology. As a long line of research in these areas has shown, when more objective indices are available, they will typically have higher validities than decisions based on human judgment alone. Such objective measures in manuscript evaluation might include the status of the institution and the publication record of the authors. Alas, in such a case the "objective" criteria lead directly to the kind of bias that peer review is designed to minimize.
When objective criteria are biased, the only sound alternative is to increase the number of evaluators, assuming that they will be less subject to such bias. Here practical constraints intrude, as it is hardly possible for all journals to obtain a sufficient number of reviewers for all articles to ensure a reliable composite review (see the sketch at the end of this excerpt). The situation is somewhat different with regard to peer review of grant proposals, however. Here two reviewers typically examine the entire set of materials provided and report both a summary and their critiques to a larger panel. The larger group then discusses the material presented to them and may often review some portion of the proposal as well, should a specific issue require it. This larger panel may be thought of as a means of increasing the size of the review panel and is certainly intended to improve the reliability and validity of the resulting judgment. To my knowledge, no hard evidence is available regarding the effectiveness of this subsequent segment of the review process. Because the judgments are very far from independent, and
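The difficulty Cohen points to with "a sufficient number of reviewers" can be made concrete with the classical Spearman-Brown formula for the reliability of a composite of k reviewers; the numerical values below are illustrative assumptions of this sketch, not figures reported in the target article or the commentaries:

\[ R_{k} = \frac{k\,\bar{r}}{1 + (k-1)\,\bar{r}} , \]

where \(\bar{r}\) is the mean correlation between single reviewers. Supposing, for illustration, \(\bar{r} = 0.30\), two reviewers yield a composite reliability of \(R_{2} \approx 0.46\), and roughly ten reviewers would be needed before the composite exceeds 0.80, a panel size few journals could routinely assemble for every submission.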