Braun Tibor, Schubert András (szerk.): Szakértői bírálat (peer review) a tudományos kutatásban : Válogatott tanulmányok a téma szakirodalmából (A MTAK Informatikai És Tudományelemzési Sorozata 7., 1993)

DOMENIC V. CICCHETTI: The Reliability of Peer Review for Manuscript and Grant Submissions: A Cross-Disciplinary Investigation

…then the data in the resulting 2 x 2 or four-fold table will produce identical results, whether one applies kappa, R_i, or phi (e.g., see Cicchetti 1988; Cohen 1960; Fleiss 1975; 1981). The example cited (the reviews for manuscripts submitted to the Journal of Abnormal Psychology, JAP, Footnote 6 of the target article) illustrates this equivalence, as Rosenthal correctly notes. This occurs because there is no intuitively obvious way to distinguish "first" reviews from "second" reviews. Therefore, the required Model I R_i that is applied to the data will produce equal rater marginals (category assignments to "accept" and "reject") for the two independent sets of reviews. In such a situation, the three mathematical formulae (for kappa, R_i, and phi) become equivalent. These identities also hold in the Model II case (same two raters throughout), provided, again, that the category assignments are identical. When these assignments are not identical (the much more usual case), kappa, R_i, and phi (or R) will assume different values, the difference depending on the specific distributions of the two category assignments.

As an example of the effect of unequal category assignments on the values of kappa, R_i, and phi, consider the data presented in Table 6 (target article). Here there was interest in distinguishing two identifiable sources of average ratings, namely those made by NSF and those made by COSPUP. The full data on which the condensed Table 6 entries are based, for the area "Economics," are shown in Table 1.

Table 1. Average NSF and COSPUP ratings of 50 proposals in the field of "Economics"

NSF              COSPUP low (10-39)   COSPUP high (40-50)   All proposals
Low (10-39)              29                    3                 32
High (40-50)              9                    9                 18
All proposals            38                   12                 50

Here, R_i (Model II) = kappa = .44. If we had instead considered that the distinction between NSF and COSPUP ratings is not of concern and used R_i (Model I), which would take into account that different pairs of reviewers viewed different proposals, its value would be .38. In either case, R (or phi) would equal .41. Thus, kappa, R_i, and R are identical when category assignments are identical, but not under any other combination of category assignments (the more usual case).
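To make the arithmetic concrete, the following minimal Python sketch computes Cohen's kappa and the phi coefficient for a 2 x 2 agreement table. The Table 1 counts reproduce the kappa (= R_i, Model II) of about .44 cited above; the second table, whose two sets of category assignments are identical (equal marginals), is a hypothetical example added here only to show kappa and phi coinciding, as in the JAP case. Function names are illustrative, not from the original commentary.

import math

def kappa_2x2(table):
    """Cohen's (1960) kappa for a 2 x 2 agreement table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    p_obs = (a + d) / n                                            # observed agreement
    p_chance = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2    # agreement expected by chance
    return (p_obs - p_chance) / (1 - p_chance)

def phi_2x2(table):
    """Phi coefficient (Pearson r between two dichotomies) for a 2 x 2 table."""
    (a, b), (c, d) = table
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Table 1 counts: NSF (rows) versus COSPUP (columns) ratings of 50 proposals.
table_1 = [[29, 3], [9, 9]]
print(round(kappa_2x2(table_1), 2))          # 0.44, the kappa / R_i (Model II) value cited above

# Hypothetical table with identical category assignments (equal marginals):
# kappa and phi then take the same value.
equal_marginals = [[30, 10], [10, 50]]
print(round(kappa_2x2(equal_marginals), 2),  # 0.58
      round(phi_2x2(equal_marginals), 2))    # 0.58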
Concerning Rosenthal's second point, I would agree that chi square(d) should not be used as a measure of examiner agreement, for the reasons he cites, as well as because chi square(d) measures associations of any type, whereas kappa and R_i measure agreement per se. I would partially agree with Rosenthal's caveat about applying unweighted kappa as an omnibus statistic to 3 or more categories of interest. Although the overall value of kappa might be of somewhat limited interest, the decomposition of kappa into levels of specific agreement (observed and chance-corrected) on a category-by-category basis would, in fact, be quite informative (e.g., Fleiss 1981, p. 220). For peer review, there might be interest in the extent to which reviewers agree on such conceptually distinct evaluation attributes (nominal variables) as: importance of the problem under investigation; adequacy of research design; and interpretation of research results. Each evaluative attribute could be scored as "acceptable" or "unacceptable." If the reliability design were such that the same two reviewers evaluated all submissions independently, then the generalization of kappa developed by Davies and Fleiss (1982) would apply. If the reviewers varied from one submission to another, then the kappa statistic developed by Fleiss (1971) and extended by Fleiss et al. (1979) would be relevant. Again, while the overall (omnibus) kappa value averaged over the 3 categories of interest might be of limited value, the levels of observed and chance-corrected agreement on each evaluative attribute would be quite meaningful. On the other hand, if the overall kappa value were not even statistically significant, one would be less interested in the specific category reliability assessments. For these reasons, and the ones expressed in my reply to Gilmore, I would conclude that kappa is more "information-efficient" than its competitors.

Finally, with respect to Rosenthal's application of kappa to the acceptance and rejection figures for the JAP data given in Table 5, my two values are .14, as is true for overall kappa (again, the 2 x 2 equal marginals case). Although these values, as well as the 70% and 40% agreement levels, are describing the same data, each conveys valuable, though different, information, as explained more fully in my upcoming replies to Demorest and Wasserman.

Reanalyzing data from Tables 5 and 6 respectively, Demorest and Wasserman arrive at the same conclusion, namely, that chance-corrected agreement on rejection (disapproval) is no better than on acceptance (approval). They are both right. The phenomenon, as Demorest correctly notes, however, is specific to degrees-of-freedom limitations inherent in data deriving from a 2 x 2 contingency table. As noted in my discussion of Rosenthal's commentary, overall kappa values are always mathematically identical to specific kappa values for acceptance and rejection (e.g., see also Cicchetti 1980; Cicchetti & Feinstein 1990; Fleiss 1975).

A very important and relevant issue, however, discussed neither by Demorest and Wasserman, nor by the target article itself, still needs to be addressed. As noted recently (Cicchetti 1988, p. 621), the same kappa value can be reflected in a wide range of observed agreement levels. Some will be of substantive (practical or clinical) value and others will not. It thus becomes necessary to set some specific criterion for judging the usefulness of both observed and chance-corrected levels of agreement as they may occur together. My colleagues and I have suggested that one should require a minimum level of agreement of 70% before correcting for chance, and an accompanying level of at least .40 ("fair" agreement) after correcting for chance (see Volkmar et al. 1988, p. 92). If we apply these criteria to the data presented by Demorest, in Table 2, namely, category-specific agreement levels for reviews of manuscripts submitted to the American Psychologist, the only category that meets these standards is category 5 ("reject"), for which the observed level of reviewer agreement is 75.9% and the chance-corrected level (weighted kappa) is .52. Consistent with these results, reviewer agreement levels on 866 manuscripts submitted to a purposely unidentified Major Sub-
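The dual criterion described above (at least 70% observed agreement and at least .40 chance-corrected agreement; Volkmar et al. 1988) can be expressed as a small check. The Python sketch below is illustrative only: the function name and the second, failing example are hypothetical, while the first call uses the category 5 figures (75.9% observed agreement, weighted kappa .52) quoted from Demorest's Table 2.

def meets_dual_criterion(observed_pct, chance_corrected,
                         min_observed=70.0, min_corrected=0.40):
    """Dual criterion in the spirit of Volkmar et al. (1988): require both a
    minimum observed agreement (in percent) and a minimum chance-corrected level."""
    return observed_pct >= min_observed and chance_corrected >= min_corrected

# Category 5 ("reject") in Demorest's Table 2: 75.9% observed agreement and a
# weighted kappa of .52 -- the only category reported to meet both standards.
print(meets_dual_criterion(75.9, 0.52))   # True

# A hypothetical category with high raw agreement but weak chance-corrected
# agreement fails, which is why both thresholds are needed.
print(meets_dual_criterion(85.0, 0.25))   # False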
