Braun Tibor, Schubert András (eds.): Peer Review in Scientific Research: Selected Studies from the Literature on the Topic (The MTAK Series on Informatics and Science Analysis 7, 1993)
DOMENIC V. CICCHETTI: The Reliability of Peer Review for Manuscript and Grant Submissions: A Cross-Disciplinary Investigation
... editor must play a very active role; Cicchetti apparently does not. In general, the proportion of manuscripts (or grant submissions) that one can accept has an important influence on the issues that concerned Cicchetti. When evaluating a submission to a journal (or a grant proposal), most reviewers are quite aware of the percentage of manuscripts (or grants) that can be approved. This awareness has a significant influence on the percentage of submissions that they rate as excellent, very good, and so on. Currently, at the National Institute of Mental Health (NIMH), grant submissions need to be close to a rating score of 125 (meaning an average rating of 1.25 on a 1-5 scale) to be funded. A rating of two on the NIMH scale reads "very good," but the experienced reviewer knows very well that a vote of two on a five-point scale for NIMH is a vote not to fund. Hence, reviewers even interpret descriptive labels on a scale differently depending on the percentage of potentially successful applicants.

The proportion of successful applicants is an important influence on the differences Cicchetti observed between the sciences. In general, the natural sciences and the behavioral sciences differ on this important statistic; the natural sciences typically have a higher acceptance rate for both grants and manuscript submissions. I argue that the differences Cicchetti observed between these groups - what he specifically referred to as differences in emphasis on acceptance versus rejection - are a function of this difference in the probability of success. I would argue that if one were to equate the behavioral and the natural sciences for probability of success, whether regarding article or grant submissions, one would no longer find the differences observed by Cicchetti.

Cicchetti also seems confused about the role of biases in the judgmental process. The reputation of the investigator, the quality of the institution the investigator works for, and prior work by the investigator all influence the judgment that a reviewer might make. What Cicchetti seems to miss is that all of these biases artificially inflate the kind of "rating reliability" he emphasizes. Cicchetti seems to imply that biases decrease reliability. In the sense that I mean the term they probably do, but such biases would increase the simple numerical correlation between reviewers' ratings that Cicchetti is concerned with.

Most of the issues described above apply to grants as well as manuscripts. There are some significant differences, however, that are worth noting. The potential impact of a delay is different for a grant than for a manuscript. As Cicchetti notes, 80 to 90% of manuscripts rejected by the journals to which they are submitted ultimately get published elsewhere. Having an article rejected by one journal may mean only a delay in publication of two to four months. A delay of four to six months necessitated by a grant resubmission (which would not be unusual and may be minimal) may force an investigator to shut down a research team that had been carefully built up over a period of years. In that sense, it is especially important that we focus on making wise judgments in grant review and give it greater weight than manuscript review. For example, as rating scores inflate for grants, a "blackball" becomes a critical problem, as the numerical sketch below illustrates.
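To make that arithmetic concrete, here is a minimal sketch using hypothetical ratings (not data from the target article or this commentary); it assumes the NIMH-style 1-5 scale described above, with the priority score computed as the mean rating times 100 and a score near 125 needed for funding.

    # Hypothetical ratings on a 1-5 scale (1 = best); priority score = mean rating x 100.
    def priority_score(ratings):
        return 100 * sum(ratings) / len(ratings)

    panel = [1.0, 1.5, 1.0, 1.5]           # four enthusiastic reviewers
    with_blackball = panel + [5.0]         # the same panel plus one dissenting "blackball" vote

    print(priority_score(panel))           # 125.0 -> right at the assumed funding cutoff
    print(priority_score(with_blackball))  # 200.0 -> far outside the fundable range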
If an agency is required to average all ratings of a proposal in the decision to fund, and if a score very near 100 is necessary to be funded, then a single reviewer can blackball a grant proposal by giving it a four or five (recognizing simultaneously that a score of two is a recommendation not to fund). This is a problem particularly for controversial, new, or innovative research. The potential for blackballing grant proposals is a critical variable for protecting advances in science. We need greater flexibility for granting agencies, the sequestering of funds for especially innovative or new ideas, and a loosening of the requirement that all ratings of grant proposals be counted. Regarding the last, one might either report the median rather than the mean, or average only the N - 1 best ratings; that is, one might throw out the worst rating of any grant proposal and average the remaining ones.

Making wise decisions about publishing articles and funding grants is critical for normal progress in the sciences. Viewing the reliability of reviewers' judgments as simply correlations between ratings is to miss the most important part of that judgmental process. What is most important is that the outcome of the editorial decision or the agency's funding decision be a wise one, one that facilitates the development of our sciences. In no way does a wise decision depend upon a high correlation between the ratings of reviewers.

Do we really want more "reliable" reviewers?

Helena Chmura Kraemer
Department of Psychiatry and Behavioral Sciences, Stanford University, Stanford, CA 94306
Electronic mail: mn.kra@forsythe.stanford.edu

First of all, congratulations to Cicchetti for his excellent target article. This paper represents a comprehensive, stimulating, and provocative discussion of issues that not only profoundly affect our individual professional lives but also the quality, consistency, and rapidity of progress within our respective fields. It is particularly interesting to read this paper from the perspective of the various roles each of us is asked to play in our professional lives: as author and reviewer of papers and proposals, as well as "consumer" of the results of published papers. It is difficult, perhaps impossible, to be objective about one's own work, as a researcher, a reviewer, a "consumer," or an editor. The standards one might apply to a review of the review process differ fundamentally across these perspectives. Accordingly, the major contribution of Cicchetti (and others whom he cites) is the objective, unemotional, and quantitative approach to these issues. Only with such an approach is there hope of identifying or correcting faults in the review process.

I doubt that I was invited to comment because of any such perspectives on the problem; more probably it was because of my research on the design and analysis of reliability studies and on kappa and intraclass correlation coefficients. I will briskly discharge my duties with regard to purely statistical issues and move on to more interesting themes. I have a few points of disagreement on what was done: the choice of forms of coefficients, the use of null tests, the use of point rather than confidence interval estimates, and the use of asymptotic approximations to distributions rather than jackknife or bootstrap methods (Block & Kraemer 1989).
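Since kappa and intraclass correlation coefficients recur throughout this discussion, a minimal self-contained sketch of Cohen's kappa for two reviewers making accept/reject judgments may be helpful; the ratings below are hypothetical and are not data from the target article or this commentary.

    # Cohen's kappa for two reviewers judging the same eight manuscripts
    # as "accept" or "reject"; all ratings are hypothetical.
    from collections import Counter

    reviewer_a = ["accept", "reject", "accept", "reject", "reject", "accept", "reject", "reject"]
    reviewer_b = ["accept", "reject", "reject", "reject", "accept", "accept", "reject", "reject"]

    n = len(reviewer_a)
    observed = sum(a == b for a, b in zip(reviewer_a, reviewer_b)) / n   # raw agreement: 0.75

    # Agreement expected by chance, from each reviewer's marginal accept/reject rates
    freq_a, freq_b = Counter(reviewer_a), Counter(reviewer_b)
    chance = sum(freq_a[c] * freq_b[c] for c in ("accept", "reject")) / n ** 2

    kappa = (observed - chance) / (1 - chance)
    print(round(kappa, 2))   # 0.47: substantial raw agreement shrinks once chance agreement is removed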
If the authors and I were required to resolve such issues, I would predict we would happily reach solutions agreeable to us all, and that any resulting changes would scarcely affect the messages delivered in this paper. A kappa of .3 might become .2, or vice versa. A wide confidence interval might lessen interest in one reported study, whereas a short one might highlight another.

Instead, what requires reconsideration is not the magnitude of the reported reliabilities, but what to make of them. Difficulties arise because the word "reliability" is misleading when used with a general audience likely to interpret it in the sense of "to be trusted." A "valid" measure is one "to be trusted." A "reliable" measure may only be a highly reproducible wrong answer. Two facts about reliability are well known: (1) one may have a perfectly reliable (precise) measure that totally lacks validity (accuracy), and (2) one may improve reliability (precision) at the cost of validity (accuracy). Whether we err in judging the review process by assessing interreviewer reliability is therefore not a trivial question.

My impression is that editors frequently seek reviewers with different expertise related to the various areas pertinent to the