Braun Tibor, Schubert András (eds.): Peer Review in Scientific Research: Selected Papers from the Literature on the Topic (MTAK Informatics and Science Analysis Series 7, 1993)
DOMENIC V. CICCHETTI: The Reliability of Peer Review for Manuscript and Grant Submissions: A Cross-Disciplinary Investigation
initial term MS +. I am indebted to another of the commentators, Hargens, who relayed this information to me several months ago by telephone. Readers will note that Rosenthal's commentary also questioned this R_i formula. Finally, in Table 6, the missing R_i for the combined data (.32) has now been inserted, and the R_i for NSF and COSPUP reviews of proposals in Chemical Dynamics should be .16 rather than .12.

Next examined are the more formal and involved criticisms of the methodologic, statistical, or data-analytic techniques presented in the target article.

1.2. Interpreting levels of kappa and R_i. Eckberg is concerned about the presumed arbitrariness of the Cicchetti & Sparrow (1981) strength-of-agreement values for kappa and intraclass correlation coefficients, namely: POOR (below .40); FAIR (.40-.59); GOOD (.60-.74); EXCELLENT (.75 and above). These values are similar to those provided by Fleiss (1981), although he uses a wider range to encompass values between .40 and .74 (designated as FAIR to MODERATE). Earlier, Landis and Koch (1977) proposed six evaluative categories: less than zero = POOR; 0-.20 = SLIGHT; .21-.40 = FAIR; .41-.60 = MODERATE; .61-.80 = SUBSTANTIAL; .81 and above = ALMOST PERFECT (see also Feinstein 1987, p. 185). These examples show the similarity of the guidelines research biostatisticians recommend for differentiating mere statistical significance (kappa or R_i larger than 0) from significance that may also be of practical or clinical usefulness. The general concept is analogous to Cohen's (1988) suggested effect sizes (ES) for interpreting sample correlation values (i.e., an R of .15 representing a SMALL, .30 a MEDIUM, and .50 a LARGE effect, when compared to expected values of zero). More important, these guidelines are consistent with the frequency with which high and low kappa values are reported for many clinical phenomena. Koran (1975a; 1975b) has shown that when kappa is used to assess interexaminer reliability for the presence or absence of a wide range of clinical signs and symptoms, values rarely exceed .70.

Concerning the application of these guidelines, Eckberg questions the plausibility of a specific hypothesis (sect. 4.5), namely, that if a formal study were conducted on the reliability of peer reviews for manuscripts submitted to Physical Review Letters (PRL), the resulting reliability would be of the same order of magnitude (e.g., R_i below .40) as that which characterizes general journals in many other disciplines. Given that an average of five or more PRL reviewers is required to arrive at a consensus, coupled with a 45% rejection rate (Adair & Trigg 1979, sect. 4.5), I would consider the hypothesis reasonable rather than what Eckberg characterizes as "pure speculation."
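To make these cutoff points concrete, the following minimal sketch in Python (an editorial illustration only; the function names are hypothetical and do not appear in the original article) labels an observed kappa or R_i value according to the Cicchetti & Sparrow (1981) bands and the Landis & Koch (1977) categories quoted above.

    # Illustrative sketch: map a chance-corrected agreement coefficient
    # (kappa or an intraclass R_i) onto the two sets of published guidelines.

    def cicchetti_sparrow_band(value):
        """Cicchetti & Sparrow (1981): POOR / FAIR / GOOD / EXCELLENT."""
        if value < 0.40:
            return "POOR"
        if value < 0.60:
            return "FAIR"
        if value < 0.75:
            return "GOOD"
        return "EXCELLENT"

    def landis_koch_band(value):
        """Landis & Koch (1977): six evaluative categories."""
        if value < 0.0:
            return "POOR"
        if value <= 0.20:
            return "SLIGHT"
        if value <= 0.40:
            return "FAIR"
        if value <= 0.60:
            return "MODERATE"
        if value <= 0.80:
            return "SUBSTANTIAL"
        return "ALMOST PERFECT"

    # Values taken from the text: the corrected Table 6 entries (.16, .32)
    # and Koran's observed ceiling of about .70 for clinical signs.
    for k in (0.16, 0.32, 0.70):
        print(f"kappa/R_i = {k:.2f}: {cicchetti_sparrow_band(k)} "
              f"(Cicchetti & Sparrow), {landis_koch_band(k)} (Landis & Koch)")

Run as written, the three sample values are labeled POOR/SLIGHT, POOR/FAIR, and GOOD/SUBSTANTIAL, respectively, which shows how the two sets of guidelines differ mainly in how finely they subdivide the lower range.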
1.3. Choice of statistical tests. Gilmore and Rosenthal suggested that other statistical tests may have been at least as appropriate as the ones that were used. Gilmore prefers his "shared uncertainty index" (Gilmore 1979) to the kappa and R_i approaches described in the target article. For the reasons cited in section 3.3, I do not share his preference. To the list of multiple and unique advantages of kappa over any and all of its competitors, I would add the following:

a. Kappa has been widely generalized to fit (1) varying scales of measurement (Cicchetti 1976; Cohen 1968) and different types of rater and subject reliability research designs, for example, (2) 3 or more raters (Fleiss 1971; Fleiss et al. 1979; Landis & Koch 1977); (3) differing numbers of raters per subject (Fleiss & Cuzick 1979); (4) multiple diagnoses per patient (Kraemer 1980; Mezzich et al. 1981); (5) multiple observations on small numbers of subjects (Gross 1986); (6) single-subject reliability assessments (Kraemer 1979); (7) separate reliability assessments for each category on a given clinical scale (Cicchetti 1985; Cicchetti, Lee et al. 1978; Spitzer & Fleiss 1974).

b. Other generalizations include those in which (8) rater uncertainty of response is the focus (Gillett 1985); (9) the rating categories have not been defined in advance (Brennan & Light 1974; Brook & Stirling 1984); (10) multiple raters are analyzed pair by pair, when each pair rates the same set of subjects (Conger 1980) or different sets of subjects (Uebersax 1981; 1982); (11) the data are continuous, with a focus on the duration rather than the frequency of joint events (Conger 1985); (12) jackknifing functions are used to reduce bias in estimating standard errors of kappa (Davies & Fleiss 1982; Kraemer 1980).

c. Kappa has also been (13) subjected to a number of empirical studies for testing and confirming or modifying the way it can be applied appropriately (e.g., Cicchetti 1981; Cicchetti & Fleiss 1977; Fleiss & Cicchetti 1978; Fleiss et al. 1969; 1979).

d. Kappa (nominal data) and weighted kappa (ordinal data) have been shown, under certain specified conditions, to be (14) equivalent to various models of the intraclass correlation coefficient (R_i) (e.g., Fleiss 1975; 1981; Fleiss & Cohen 1973; Krippendorff 1970; Shrout et al. 1987).

Finally, e. Kappa and kappa-type statistics have also been used in conjunction with a number of multivariate approaches to reliability analysis: (15) cluster analysis (Blashfield 1976); (16) signal detection models (e.g., Kraemer 1988); (17) latent structure agreement analysis (Uebersax & Grove 1989); (18) latent structure modeling of ordered-category rating agreement (Uebersax 1989); and (19) Kraemer (1982) has shown, in the 2 x 2 case, the relationship between kappa values and the sensitivity and specificity of a given diagnostic procedure.

Rosenthal writes, in the 2 x 2 case, of three "more-information-efficient" indices: kappa, R_i, and the standard Pearsonian product-moment correlation (R), or the phi coefficient. He describes these indices as mathematically equivalent for the reliability research design in which the same two examiners independently evaluate all subjects (or objects). Rosenthal prefers their usage to three "less-information-efficient" statistics, namely, "rate of agreement," or what Rogot and Goldberg (1966) refer to as the "crude index of agreement" (uncorrected for chance); chi square(d); and unweighted kappa for 3 x 3 and larger tables. I agree with some of Rosenthal's conclusions. First, kappa, R_i, and R (or phi) will be identical only when marginal frequencies or category assignments are identical for each of any two independent reviews. For peer review, if the acceptance (approval) and rejection (disapproval) rates are the same for both independent sets of reviews (e.g., 20% acceptances and 80% rejections),
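The condition just stated, identical marginal acceptance rates for the two sets of reviews, can be checked numerically. The short Python sketch below is an editorial illustration with hypothetical names, not a reconstruction of any calculation in the article; it computes unweighted kappa and the phi (Pearson) coefficient for a 2 x 2 accept/reject table in which both reviews accept 20% and reject 80% of submissions, and then for a table with unequal marginals, and the two indices coincide only in the first case.

    import math

    def kappa_and_phi(a, b, c, d):
        """Unweighted kappa and phi for a 2 x 2 table of two independent reviews:
        a = both accept, b = review 1 accepts / review 2 rejects,
        c = review 1 rejects / review 2 accepts, d = both reject."""
        n = a + b + c + d
        p_obs = (a + d) / n                                       # observed agreement
        p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2  # chance-expected agreement
        kappa = (p_obs - p_exp) / (1 - p_exp)
        phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
        return kappa, phi

    # Identical marginals: both reviews accept 20 of 100 submissions (20% / 80%).
    print(kappa_and_phi(a=10, b=10, c=10, d=70))   # (0.375, 0.375) -- identical
    # Unequal marginals: review 1 accepts 20, review 2 accepts 30 of 100.
    print(kappa_and_phi(a=12, b=8, c=18, d=62))    # kappa and phi now differ

With identical marginals, the chance-corrected agreement (kappa) and the product-moment correlation of the two sets of binary decisions are algebraically the same quantity; once the marginals diverge, the two indices separate.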