Braun Tibor, Schubert András (eds.): Peer Review in Scientific Research: Selected Papers from the Literature on the Topic (MTAK Informatics and Scientometrics Series 7, 1993)
DOMENIC V. CICCHETTI: The Reliability of Peer Review for Manuscript and Grant Submissions: A Cross-Disciplinary Investigation
research administrator does run out of funds at 155. Should we abandon the current system, or shore it up for the present, hoping for a better time when increased research budgets will at least allow more approved projects to be funded? The reliability question would not go away, but fewer people with high quality projects would go unfunded, and the problem would appear less pressing. Cicchetti does not advocate abandoning peer review, but he does offer several recommendations to improve the system. Let us consider them.

(1) "To improve the reliability of peer review, a minimum of three independent referees has been recommended." This is a very useful recommendation, and one that the research program in the Department of Veterans Affairs has used successfully for the last five years. The DVA Medical Research Service uses four independent reviewers for each submitted research proposal in its Merit Review Program (equivalent to the NIH R01 Program). Two written independent reviews are submitted by scientists who are not members of the Merit Review Board; these reviewers are selected because their own work is closely related to the applicant's problems. Two additional written reviews are prepared by the primary- and secondary-reviewer members of the Merit Review Board. Thus, when considering the application, members of the board can debate the relative importance of each critique and weigh the specific evaluations of four independent reviewers in light of this debate. The strength of this process goes beyond the number of reviews per proposal, to the discussion and analysis of the four independent critiques that take place during the review session, leading to a consensus evaluation by board members. This process still cannot accurately fine-tune shades of excellence, but it is a practical method of weighing the opinions of four independent reviewers.

(2) "Using author anonymity or 'blind' review." This recommendation is impractical for the grant review process. Since past scientific productivity is a critical element of grant review, it would be extremely awkward to try to disguise the authorship of the proposal as well as the authorship of published papers from previous funding or training periods. Author anonymity would, in my opinion, significantly weaken the grant review process.

(3) "Revealing reviewer identity." Cicchetti comes down on the side of encouraging reviewers to reveal their identity voluntarily. Our experience in the grant review process is that anonymity is critical. I believe it is important to avoid personalizing the review process. It should be emphasized that the evaluation of a project is the consensus opinion of a committee whose membership is public knowledge. The strength of the process is that of evaluation by a group of experts.

(4) "Author review of referees." This recommendation is attractive and could in fact help weed out inappropriate reviewers. In addition to author reviews of referees, members of peer review committees could also identify reviewers who are problematic. Consideration will be given to implementing this idea in the DVA's scientific review process. The DVA system has already adopted the practice of encouraging applicants to list potential reviewers, as well as reviewers they would prefer not to have evaluate their proposals.

(5) "Developing a peer review appeals system for grant submission." This should be a critical aspect of any grant review process.
The peer review system is a human enterprise and thus not perfect. There must be a mechanism for applicants to appeal the results. The DVA system has had an effective appeals procedure for over a decade. Appeal has proved to be a complex and sensitive area, and we have found it necessary to revisit the ground rules for appeals periodically.

Cicchetti has raised some important concerns about the peer review system and has made some useful recommendations for improving it. While much of the stress in the current grant review process is a function of the small percentage of high quality research projects that can be funded, efforts to analyze and strengthen peer review are to be applauded.

Referee agreement in context

Lowell L. Hargens
Department of Sociology, University of Illinois, Urbana, IL 61801
Electronic mail: hargens@uiucvmd.bitnet

Cicchetti provides a valuable summary of the procedures and results of studies of interreferee agreement in peer review. Many will be surprised by the generally modest associations between referee recommendations, with most studies yielding intraclass correlations in a range between .20 and .35. Cicchetti characterizes these levels of agreement as "poor," and others have claimed that they indicate that chance plays an important, if not dominant, role in the assessment of scholarship (Cole et al. 1981; Lindsey 1988; Mahoney 1976). These modest associations should be viewed in the context of the entire peer review process, however; failure to do so gives a misleadingly pessimistic impression of the value of referees' assessments. Below I focus on referees' evaluations of manuscripts submitted to scholarly journals, but my arguments hold in general for peer review of grant proposals, too.

Those who argue that the modest associations between referees' evaluations imply that referee evaluations have low reliability usually base their interpretation on a psychometric perspective. This perspective, often called "classical test theory," views different referees' recommendations as parallel measures of a latent trait (see Lord & Novick 1968); usually the trait is taken to be the "scholarly quality" of a manuscript. Editors' discussions of the strategies they use in selecting referees cast doubt on the appropriateness of this perspective, however. They report frequently choosing referees they think will be sensitive to different aspects of a manuscript, perhaps one to judge its analytic procedures and another, its substantive contribution (Campbell 1982; Roediger 1987). If these different aspects are only moderately correlated across manuscripts, referees' assessments should show low agreement. In addition, in some fields scholars belong to competing "schools," and editors sometimes intentionally solicit evaluations from members of both sides of a controversy (Hull 1988). If an editor always followed this strategy for controversial submissions and a large proportion of submissions were controversial, referees' evaluations might be negatively correlated. Thus, referees' evaluations are often not parallel measures of a latent unidimensional trait, and the low observed associations do not necessarily imply that peer-review evaluations are unreliable.
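This multidimensionality argument can be illustrated numerically. The following is a minimal simulation sketch in Python; the sample size, aspect correlation, and noise level are illustrative assumptions, not figures from any study cited here. It simulates two referees who each reliably judge a different, moderately correlated aspect of the same manuscripts, then computes the one-way intraclass correlation, ICC(1,1), of the kind summarized by Cicchetti:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # assumed number of manuscripts

# Two moderately correlated "aspects" of each manuscript,
# e.g., analytic rigor vs. substantive contribution.
aspect_corr = 0.4
cov = [[1.0, aspect_corr], [aspect_corr, 1.0]]
aspects = rng.multivariate_normal([0.0, 0.0], cov, size=n)

# Referee 1 judges mainly aspect 0, referee 2 mainly aspect 1;
# each adds some private rating noise.
noise = 0.5
r1 = aspects[:, 0] + noise * rng.standard_normal(n)
r2 = aspects[:, 1] + noise * rng.standard_normal(n)

# One-way random-effects intraclass correlation, ICC(1,1),
# computed from the usual ANOVA mean squares for k ratings.
ratings = np.column_stack([r1, r2])
k = ratings.shape[1]
ms_between = k * ratings.mean(axis=1).var(ddof=1)
ms_within = ((ratings - ratings.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))
icc = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
print(f"ICC(1,1) = {icc:.2f}")
```

With these assumed parameters the estimate lands on the order of .3, inside the .20 to .35 range quoted above, even though each simulated referee judges his or her own aspect quite reliably.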
Editors' summary rejection of submissions may also produce low associations between referees' recommendations. Papers that are not sent out for review are necessarily omitted from referee agreement studies. If editors are right in claiming that summarily rejected papers are of very poor quality or are obviously inappropriate for the journals to which they have been submitted, then screening out such submissions reduces the range of papers evaluated by referees, and hence the association between referees' evaluations. Thus, reported associations between referee evaluations cannot be assessed without also considering journals' summary rejection rates. Fragmentary data on this question indicate that summary rejection rates for prestigious social science and medical journals can be as high as 50% (Gordon 1978; Zuckerman & Merton 1971).

Even if referees' recommendations were parallel measures of manuscript "quality" and editors never rejected papers summarily, the modest levels of referee agreement summarized by Cicchetti should not be taken as an indication of the reliability of the entire review process. Under the assumptions of classical test theory, referee-recommendation intraclass-correlation coefficients estimate the reliability of the average individual referee's evaluation (Tinsley & Weiss 1975); the reliability of an assessment based upon two or three referee evaluations should be considerably higher. (See also Cronbach, 1981, who noted
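The gain from pooling referees that Hargens describes can be made concrete with the standard Spearman-Brown prophecy formula for the reliability of a mean of k parallel ratings, reliability_k = k*r / (1 + (k-1)*r). A minimal sketch, using the .20 to .35 single-referee range cited above as illustrative inputs (the formula is a textbook result; its application here is my gloss, not a calculation from the commentary):

```python
# Spearman-Brown prophecy formula: reliability of the mean of k
# parallel ratings, given single-rating reliability r.
def spearman_brown(r: float, k: int) -> float:
    return k * r / (1 + (k - 1) * r)

# Illustrative single-referee reliabilities spanning the .20-.35
# range of intraclass correlations summarized by Cicchetti.
for r in (0.20, 0.30, 0.35):
    for k in (2, 3):
        print(f"r = {r:.2f}, k = {k} referees: {spearman_brown(r, k):.2f}")
```

Under these assumptions, a single-referee reliability of .30 rises to about .46 with two referees and about .56 with three, which is the sense in which a decision based on several reports can be considerably more reliable than any single report.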