Braun Tibor, Schubert András (eds.): Szakértői bírálat (peer review) a tudományos kutatásban [Peer Review in Scientific Research]: Selected Studies from the Literature (Informatics and Science Analysis Series of the Library of the Hungarian Academy of Sciences, 7, 1993)

DOMENIC V. CICCHETTI: The Reliability of Peer Review for Manuscript and Grant Submissions: A Cross-Disciplinary Investigation

...value of 83.99. In distinct contrast, the continuity-corrected chi-square value of 3.413 (p = .06) for the 72 manuscripts submitted to Developmental Review (entry 3 of Table 5) increases to 4.46 (p = .02) when the correction for continuity is not used. Similar effects can be noted for the data in Table 6.

Eckberg asks two additional questions: (1) In the case of comparing NSF and COSPUP open reviews (Table 6), how was it decided who would be the two reviewers? Each average COSPUP rating for a given grant proposal (the first "reviewer") was compared to each average NSF rating (the second "reviewer"). (2) Why is the number of disagreements exactly the same in both the "Acceptance" and "Rejection" columns (Table 5) and in the "High" and "Low Ratings" columns (Table 6)? This is because the disagreed-on cases for acceptance and rejection cannot differ in the 2 x 2 case, owing to degrees-of-freedom restrictions (see also Cicchetti 1988, Tables 6-10, pp. 611-615, and p. 619).

1.5. Interpreting the data in Table 3

Based on experience with behavioral psychology journals, Cone notes that journals with lower submission rates will tend to have higher acceptance rates; this variable therefore needs to be controlled in peer review research. He concludes that the data presented in Table 3 (target article) provide partial support for this notion in the case of manuscripts submitted to the Physical Review (PR). For example, the Nuclear Physics section of PR has a higher acceptance rate and a lower submission rate than sections with two or three times as many submissions, such as Condensed Matter or General Physics.

A more comprehensive analysis of these data does not support Cone's contention. The two sections with the lowest submission rates, Nuclear Physics and Particles & Fields, with a combined submission rate of 31.5% (1658/5264), have a combined acceptance rate of 73.3% (1215/1658). There is a similar combined acceptance rate of 75.3% (2717/3606) for the two sections (General Physics and Condensed Matter) with more than twice the percentage of submissions (3606/5264, or 68.5%, vs. 1658/5264, or 31.5%). Chi-square (corrected, 1 df) = 2.35 (p = n.s.). More important, the strength of association (effect size, ES; Cohen 1988) between manuscript submission rate and acceptance rate, as measured by phi (i.e., the square root of the uncorrected chi-square, 1 df, divided by N), is only 0.02, a zero-order effect. (A computational sketch of this comparison is given at the end of this section.)

In a related issue, pertaining again to the type of data presented in Table 3, Cone contends that there is no evidence for my assertion that "manuscripts requiring more than one reviewer tend to be those that are problematic." This is based on a misunderstanding of how the single initial referee system works. In the field of physics (e.g., Physical Review, PR), the editor sends a manuscript initially to a single reviewer. If the reviewer recommends acceptance, the editor typically supports that decision. Only when the initial referee detects a problem (i.e., recommends rejection) is the manuscript sent to a second referee. If the second referee also recommends rejection, the editor typically rejects the article. If the second reviewer recommends acceptance, however, the paper is viewed as "problematic." Such a manuscript is usually sent to a third referee, who will decide the fate of the submission (see also Hargens 1988).
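To make the sequential logic concrete, the following is a minimal sketch of the single-initial-referee workflow just described; the function name and the recommendation labels are illustrative assumptions, not part of the Response or of any journal's actual editorial system.

```python
# Illustrative sketch of the sequential single-initial-referee rule described above.
# Function name and labels are hypothetical.

def editorial_decision(recommendations):
    """recommendations: 'accept'/'reject' verdicts obtained one at a time, in order."""
    if recommendations[0] == "accept":
        return "accept"              # editor typically follows the first referee's acceptance
    if recommendations[1] == "reject":
        return "reject"              # two rejections: editor typically rejects
    # split verdict: the manuscript is "problematic"; a third referee decides
    return recommendations[2]

print(editorial_decision(["accept"]))                      # accept
print(editorial_decision(["reject", "reject"]))            # reject
print(editorial_decision(["reject", "accept", "accept"]))  # accept (third referee decides)
```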
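As for the acceptance-rate comparison above, the following is a minimal sketch (not from the Response itself) that recomputes the corrected and uncorrected chi-square values and the phi coefficient from the counts quoted in the text; small differences from the published value of 2.35 may reflect rounding in the source.

```python
# 2 x 2 table for the Physical Review data quoted above:
# rows = section pair, columns = accepted / rejected.
from math import sqrt

a, b = 1215, 1658 - 1215   # Nuclear Physics + Particles & Fields: accepted, rejected
c, d = 2717, 3606 - 2717   # General Physics + Condensed Matter: accepted, rejected
n = a + b + c + d          # 5264 total submissions

denom = (a + b) * (c + d) * (a + c) * (b + d)

chi2_uncorrected = n * (a * d - b * c) ** 2 / denom
chi2_corrected = n * (abs(a * d - b * c) - n / 2) ** 2 / denom  # Yates continuity correction

phi = sqrt(chi2_uncorrected / n)   # effect size for a 2 x 2 table

# Both chi-square values fall below 3.84, the .05 critical value at 1 df (n.s.).
print(f"uncorrected chi-square (1 df): {chi2_uncorrected:.2f}")
print(f"corrected chi-square (1 df):   {chi2_corrected:.2f}")
print(f"phi: {phi:.3f}")            # approximately 0.02, a zero-order effect
```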
Kiesler's comments about "explaining" differences between natural and behavioral scientists in terms of their "success" with manuscript or grant applications seem confused, so I am unable to respond. They presumably have something to do with the data presented in Table 3, but I simply cannot follow his arguments. Clarification in BBS Continuing Commentary is suggested.

The next several sections of my Response focus on varying interpretations of the overall results presented in the target article, namely, that across disciplines and types of submission (manuscript, grant), levels of interreferee agreement (corrected for chance) tend to be rather low (R usually below .40). (A short illustrative sketch of one such chance-corrected statistic is given at the end of this excerpt.)

2. Interpretation of the results

2.1. Reliability levels were correct as reported

A majority of commentators accepted the low levels of reliability as valid and offered a number of suggestions for improving the reliability (and at times even the validity) of peer reviews (Adams, Bornstein, Cohen, Cole, Colman, Cone, Crandall, Delcomyn, Fletcher, Gilmore, Gorman, Greene, Kraemer, Laming, Lock, Mahoney, Nelson, Roediger, Rourke, Salzinger, Tyrer, and Zentall). These views are discussed in later sections of the report.

Cole feels that both editors and granting officials need to admit that, since reliability is so poor, much high-quality research is rejected or disapproved, whereas some poor-quality research is accepted or funded. Therefore, editors should gradually increase the number of manuscripts they accept, and granting officials should set funding aside for meritorious but disapproved proposals. The major problem with this otherwise good idea is that the time required to reverse a funding decision may equal or exceed the time required to revise the proposal and resubmit it to the same or a different funding agency.

Zentall, Roediger, and Laming doubt that levels of reliability could ever be improved substantially. Zentall argues that much of the disagreement reflects deep theoretical and methodological (confirmational) biases. Similarly, Roediger argues that the corpus of psychological literature has demonstrated consistently that human judgments on such complex issues as hiring decisions or clinical diagnoses are of questionable reliability and validity; hence, the similarity in results for peer reviews is to be expected. Laming, in a most imaginative commentary, argues by analogy with the results of a number of psychophysical studies across sensory modalities that the constantly shifting frames of reference with which successive stimuli are compared limit the accuracy of human judgments, to the extent that about two-thirds of the variability in judgments can be attributed to variability in frames of reference. Thus, it is the absence of a stable frame of reference that sets limits on the extent of judgmental accuracy. Applying this knowledge to the peer review of manuscript and grant submissions, Laming concludes that the shared variability between independent reviews would be restricted to an upper limit of about 0.33. He ends his commentary on a rather sombre and pessimistic note, which he contrasts with my own more optimistic view of progress in science (in general) and peer review (in particular). Laming's pessi-
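The chance-corrected agreement statistic R referred to at the beginning of this excerpt belongs to the kappa/intraclass family. The following is a minimal, self-contained sketch; the 2 x 2 accept/reject counts are purely hypothetical (not data from the target article) and are chosen only to show how substantial raw agreement between two referees can still yield a chance-corrected coefficient below .40.

```python
# Illustrative chance-corrected agreement (Cohen's kappa) for two referees.
# The counts below are hypothetical, for demonstration only.

def cohen_kappa(table):
    """table[i][j]: number of manuscripts rated category i by referee 1 and j by referee 2."""
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(len(table))) / n
    row_marg = [sum(row) / n for row in table]
    col_marg = [sum(table[i][j] for i in range(len(table))) / n
                for j in range(len(table))]
    expected = sum(r * c for r, c in zip(row_marg, col_marg))  # chance agreement
    return (observed - expected) / (1 - expected)

# Hypothetical decisions on 100 manuscripts: 75% raw agreement,
# yet the chance-corrected value is only about .38.
hypothetical = [[15, 10],   # referee 1 accepts; referee 2 accepts / rejects
                [15, 60]]   # referee 1 rejects; referee 2 accepts / rejects
print(round(cohen_kappa(hypothetical), 2))
```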
