Braun Tibor, Schubert András (eds.): Szakértői bírálat (peer review) a tudományos kutatásban: Válogatott tanulmányok a téma szakirodalmából [Peer review in scientific research: Selected papers from the literature on the topic] (MTAK Informatics and Science Analysis Series 7, 1993)

DOMENIC V. CICCHETTI: The Reliability of Peer Review for Manuscript and Grant Submissions: A Cross-Disciplinary Investigation

(For the mathematical relationships between kappa and models of R_I, see (a) Fleiss 1975 for the nominal-dichotomous case; and (b) Fleiss & Cohen 1973 and Krippendorff 1970 for the ordinal case.)

For the peer review of grant proposals, some granting agencies have used different sets of reviewers with the same number throughout (e.g., the American Heart Association, as described in Wiener et al. 1977); this is analogous to manuscript review. Other granting agencies (e.g., the National Science Foundation [NSF], as described in Cole & Cole 1981), however, not only use different sets of reviewers for each evaluated document, but the number of reviews varies from one proposal to the next. This design calls for R_I Model III based on the average number of reviews per proposal (e.g., see Bartko & Carpenter 1976; Cicchetti & Showalter 1988).5

It should also be mentioned that Gilmore (1979, based on an earlier approach reported in Garner & McGill 1956) has described yet another statistic for assessing the reliability of peer review: It fits the case of a dichotomous decision (e.g., "accept" or "reject"), with two ratings per document, and it does not distinguish between ratings all made by the same pair of referees and those made by different pairs of referees. Gilmore notes that the statistic is "very similar to the percentage of explained variance." The statistic therefore has some conceptual similarity to lambda (due to Goodman & Kruskal 1954), a statistic that, with minor modifications, has been shown by Fleiss (1975) to be mathematically equivalent to kappa in the dichotomous case.

3.4. Inappropriate statistics.

Two additional statistical tests have been applied, on occasion, to assess levels of interrater reliability for manuscript review. Both tests suffer from major defects. The first is the standard Pearsonian product-moment correlation (R). This statistic assesses the extent to which two independent sets of ratings (e.g., manuscript or grant reviews) covary in the same order, but it ignores the extent to which given pairs of reviewers disagree on any single evaluation (e.g., see Bartko 1966; 1974; 1976; Bartko & Carpenter 1976; Kazdin 1982; Robinson 1957). In the specific context of journal manuscript reviews, Hendrick (1976; 1977) was able to demonstrate artifactually inflated levels of reviewer agreement when the Pearson r, rather than R_I, was used to make the reliability assessment.

Recently, Whitehurst (1983; 1984) reintroduced another statistic for assessing levels of referee consensus. The statistic was developed by Finn (1970) and can be symbolized by R_F. The mathematical difference between the R_F and R_I (or kappa) statistics derives from an underlying assumption about chance agreement levels between any set of raters. Statistics such as R_F use levels of chance agreement that assume that "every judgment has the same probability of occurring under the hypothesis that the judges have no understanding of the scale applied and their ratings are purely random" (e.g., Lawlis & Lu 1972, pp. 17-18). In the specific context of manuscript review, this would mean that the recommendation to accept, reject, or resubmit a specific article would occur equally frequently, by chance alone. Given the known high rejection rates of many journals (often in excess of 80%), this definition of chance agreement cannot be valid (e.g., see Cicchetti 1985).
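As a minimal numerical sketch of this point, the following Python fragment compares a uniform-chance baseline of the kind assumed by R_F-type statistics (simplified here to a chance-corrected agreement index) with the marginal-based baseline used by Cohen's kappa, for a dichotomous accept/reject decision. All counts are hypothetical, chosen only to mimic a journal with a high rejection rate; they are not data from any of the studies cited.

```python
# Hedged illustration: uniform-chance vs. marginal-based chance agreement.
# The joint decisions below are invented for a hypothetical high-rejection journal.
from collections import Counter

# Decisions of two referees on 100 hypothetical manuscripts.
pairs = (
    [("reject", "reject")] * 80 +   # both referees reject
    [("accept", "reject")] * 7 +
    [("reject", "accept")] * 7 +
    [("accept", "accept")] * 6      # both referees accept
)
n = len(pairs)

# Observed proportion of agreement.
p_obs = sum(a == b for a, b in pairs) / n

# Chance agreement if every category were equally likely
# (the assumption behind R_F-type statistics): with 2 categories, 1/2.
p_chance_uniform = 1 / 2

# Chance agreement based on the referees' actual marginal rates
# (the definition used by Cohen's kappa).
marg_1 = Counter(a for a, _ in pairs)
marg_2 = Counter(b for _, b in pairs)
p_chance_marginal = sum(
    (marg_1[c] / n) * (marg_2[c] / n) for c in ("accept", "reject")
)

# Chance-corrected coefficients under the two baselines.
rf_style = (p_obs - p_chance_uniform) / (1 - p_chance_uniform)
kappa = (p_obs - p_chance_marginal) / (1 - p_chance_marginal)

print(f"observed agreement:          {p_obs:.2f}")
print(f"chance (uniform baseline):   {p_chance_uniform:.2f}")
print(f"chance (marginal baseline):  {p_chance_marginal:.2f}")
print(f"R_F-style coefficient:       {rf_style:.2f}")
print(f"Cohen's kappa:               {kappa:.2f}")
```

With these invented counts the uniform-chance coefficient (about .72) comes out nearly twice as large as kappa (about .38), simply because a chance baseline of .50 is far too low when each referee rejects 87% of the submissions.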
Consistent with this argument, it has recently been shown that R_F (but not R_I or kappa) would fail to distinguish chance reviewer agreement from substantially higher levels (i.e., see again Cicchetti 1985).6

4. Empirical issues: Major studies in peer review

4.1. Evaluative criteria: Scientists judge their value.

Five studies are briefly considered here. Each bears on how scientists place weight on the various evaluative criteria we have mentioned. All five studies examined (1) the "importance" of the study to the field and (2) the perceived adequacy of the "research design" on their rating lists; otherwise, they were quite different. Two used "as if" designs for major behavioral science manuscripts and depended on mail responses, but the journals they studied were not the same ones; response rates also varied widely (50% in Wolff 1970, and 82% in Lindsey 1978). Two other studies used actual manuscript reviews, but again not the same journals (i.e., the Journal of Personality and Social Psychology in Scott 1974, and the Journal of Abnormal Psychology in a study by Cicchetti & Eron 1979). The fifth study (Cicchetti & Conn 1976; Conn 1974) used only three referees, who made "blind" assessments (author's identity unknown) of extended abstracts sent to a major professional medical society (the American Association for the Study of Liver Disease). The five studies also differed in data-analytic techniques. The "as if" studies asked referees to rank order the set of evaluative criteria as if they were being used for recommending the acceptance or rejection of a hypothetical manuscript. The remaining three studies used the size of the correlation between the ranking of a given evaluative criterion and the judged level of scientific merit of the document. Despite the extreme heterogeneity of these studies, all five indicated that the level of perceived "importance" of the contribution to the field and the perceived level of adequacy of the "research design" were the two most important evaluative criteria referees use for judging the merit of a given scientific document.

Although we are not aware of comparable studies on the peer review of grant proposals, information derived from a study (Wiener et al. 1977) of the reliability of reviews of grants submitted to the American Heart Association (AHA, New York State Affiliate) merits brief discussion. Primary reviewers (two were assigned to each proposal) were given a set of 10 criteria to use in evaluating each grant. Each criterion received an a priori weight ranging from a low of 1 to a maximum of 2.5. Four criteria received the maximum weight, and three of them pertained to importance and research design issues. They were: (1) "the value of the expected data in increasing knowledge in a scientific field or in advancing the diagnosis and therapy of vascular disease"; (2) "Methodology: Is it valid and feasible?"; and (3) research plan: (a) overall rationale; (b) quality of individual experiments, controls. (For further details, see Wiener et al. 1977, p. 307.)

4.2. Reliability of evaluative criteria.

How well do pairs of referees agree in evaluating the relevance of criteria as they apply them to the same scientific documents? Available data for both manuscripts and abstracts (once again derived from several sources) are presented in Table 1 and indicate levels of interreviewer agreement. These