Braun Tibor, Schubert András (szerk.): Szakértői bírálat (peer review) a tudományos kutatásban : Válogatott tanulmányok a téma szakirodalmából (A MTAK Informatikai És Tudományelemzési Sorozata 7., 1993)
DOMENIC V. CICCHETTI: The Reliability of Peer Review for Manuscript and Grant Submissions: A Cross-Disciplinary Investigation
the point," or "excessively verbose" to judge the succinctness (attribute) of a given manuscript. In peer review, referees evaluate the attributes of scientific documents according to sets of specific criteria. Editors or granting officials then apply additional evaluation criteria to the reviewers' reports to decide whether or not to accept a manuscript or fund a proposal.

A number of evaluation criteria are relevant to the review of manuscripts as well as grant proposals. For example, reviewers are usually expected to use criteria to assess (1) the relevance and completeness of the review of the research literature; (2) the author's level of originality or imaginativeness; (3) the adequacy of the research methodology; (4) the data-analytic strategies; (5) the importance (usefulness) of actual or expected findings; and (6) the clarity and organization of the information the author presents. Other attributes are specific to either manuscript or grant review. Reviewers and editors must judge a manuscript's level of interest to the readership of the journal, whether its length is justified, and how much space is available in the journal. Grant reviewers and program directors must judge the applicant's prior scientific contributions (or "track record"), the adequacy of the institutional setting in which the research would be undertaken, the appropriateness of the budget request relative to the objectives stated in the proposal, and the availability of funds.¹

3. Empirical issues: Methodology and data-analytic strategies

3.1. Research designs used in peer-review studies. A wide spectrum of research designs has been used in studying peer review, including:

1. Qualitative or semiquantitative studies of reviews of selected journal manuscripts (Ingelfinger 1974; McCartney 1978; Patterson 1969; Smigel & Ross 1970)

2. Quantitative studies of hypothetical reviews of manuscripts (requiring referees to evaluate or rank-order the value of normative attributes "as if" they were applying them to actual submitted manuscripts; e.g., Kerr et al. 1977; Lindsey 1978; Rowney & Zenisek 1980)

3. Quantitative naturalistic studies of the reliability of referee reports on scientific documents, including papers submitted to professional societies (e.g., Cicchetti & Conn 1976; Conn 1974), journal manuscripts (Cicchetti 1980; Cicchetti & Conn 1978; Cicchetti & Eron 1979; Hargens & Herting 1990a; Ingelfinger 1974; Lock 1985; Orr & Kassab 1965; Scott 1974; Smigel & Ross 1970; and Whitehurst 1983; 1984), and grant proposals (Cole & Cole 1981; 1985; Cole et al. 1978; 1981)

4. Quasi-experimental studies (Peters & Ceci 1982)

5. Experimental studies of the reliability of the peer-review process (Armstrong 1980; 1982a; Goodstein & Brazis 1970; Mahoney 1977; 1978; and Mahoney et al. 1978).

The distinction between quasi-experimental and experimental studies is based on the extent to which alternative interpretations of a given result can be ruled out. We agree with Peters & Ceci (1982, p. 246) that "the quasi-experimental design ... is, in general, insufficient to rule out alternative explanations unequivocally," but we also agree with the same authors (p. 247) "that quasi-experimental designs can provide causal inferences when used along with convergent and cogent reasoning and analysis."
In a broader sense, when conclusions drawn from quasi-experimental and experimental studies are consistent with those from less well controlled studies (such as the first three research designs just described), one can be more confident that the missing controls did not materially influence the direction or quality of the results. This point will be reemphasized later when we compare conclusions from peer-review studies that differ widely in how well potentially relevant variables were controlled.

3.2. Types of reliability assessments. One purpose of this paper is to examine the reliability of the peer-review process. Accordingly, it is important to analyze how reliability has been measured and what statistical approaches have been used. Depending on the specific research question, any of several types of reliability measures could be appropriate: internal consistency, interreferee agreement, or even stability across time. The most common measure is interreferee agreement at a single point in time. Interreferee reliability is defined as the extent to which two or more independent reviews of the same scientific document agree. To choose an appropriate statistic for assessing levels of interreferee agreement, it must be known whether or not the same referees evaluated all the documents under study and whether or not the same number of referees evaluated each document. The statistic should also assess how much referee agreement is influenced by chance alone (e.g., Watkins 1979). Finally, the scale of measurement by which the data are expressed needs to be identified.

3.3. Appropriate statistics. Which reliability statistics are appropriate varies according to whether the reviewers evaluate papers for presentation at scientific meetings, manuscripts submitted to professional journals, or grant proposals submitted for research funding. Papers submitted to scientific meetings are sometimes all evaluated by the same two referees, since the scientific documents are usually rather brief (e.g., extended abstracts). Here, either the unweighted kappa statistic (Cohen 1960) or the weighted one (Cohen 1968) would be appropriate.² The choice depends on the evaluative scale available to the referees. A nominal dichotomous scale such as "accept" or "reject" would require unweighted kappa, whereas an ordinal or rank-ordered evaluative scale, such as one ranging from "reject" to "accept only if time and space are available" to "accept unconditionally," would require the weighted kappa statistic. When the same three or more referees all independently evaluate the same set of papers, the intraclass correlation coefficient (R_i), Model II, would be appropriate (e.g., see Bartko 1966; 1974; 1976; Bartko & Carpenter 1976; Cicchetti et al. 1976; Cicchetti & Conn 1976; Fleiss 1981).³ Manuscripts submitted to professional journals are evaluated by different sets of reviewers, since it is obviously not feasible for the same two or more reviewers to undertake all the assessments. A statistic of choice here would be Model I of the R_i.⁴ (For more information about
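To make the choice described in section 3.3 concrete, the following sketch (not part of the original article; the referee ratings and function names are hypothetical) computes both Cohen's unweighted kappa and a linearly weighted kappa for two referees who rate the same ten submissions on the three-point ordinal scale mentioned above ("reject," "accept only if time and space are available," "accept unconditionally").

```python
# Illustrative sketch only: chance-corrected agreement between two referees
# who both rate the same set of submissions. All data below are hypothetical.
import numpy as np

def joint_proportions(r1, r2, k):
    """k x k table of joint rating proportions for two referees."""
    m = np.zeros((k, k))
    for a, b in zip(r1, r2):
        m[a, b] += 1
    return m / m.sum()

def cohen_kappa(r1, r2, k, weights=None):
    """Unweighted kappa (Cohen 1960) or weighted kappa (Cohen 1968).

    weights=None     -> nominal scale (all disagreements equally serious)
    weights='linear' -> ordinal scale, disagreement weight |i - j|
    """
    p = joint_proportions(r1, r2, k)
    expected = np.outer(p.sum(axis=1), p.sum(axis=0))  # chance-level table
    i, j = np.indices((k, k))
    if weights is None:
        w = (i != j).astype(float)        # 0 on the diagonal, 1 elsewhere
    else:
        w = np.abs(i - j) / (k - 1)       # linear disagreement weights
    # kappa = 1 - (observed weighted disagreement / chance weighted disagreement)
    return 1.0 - (w * p).sum() / (w * expected).sum()

# Hypothetical recommendations on a 3-point ordinal scale:
# 0 = reject, 1 = accept only if time/space allow, 2 = accept unconditionally
referee_1 = [0, 0, 1, 2, 2, 1, 0, 2, 1, 1]
referee_2 = [0, 1, 1, 2, 1, 1, 0, 2, 2, 1]

print("unweighted kappa:", round(cohen_kappa(referee_1, referee_2, 3), 3))
print("weighted kappa:  ", round(cohen_kappa(referee_1, referee_2, 3, 'linear'), 3))
```

For the case in which the same three or more referees rate every paper, the intraclass correlation (Model II) would instead be estimated from a two-way analysis-of-variance decomposition of the full referee-by-paper rating matrix; that computation is not shown here.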