Braun Tibor, Schubert András (eds.): Szakértői bírálat (peer review) a tudományos kutatásban [Peer review in scientific research]: Selected papers from the literature on the subject (MTAK Informatics and Science Analysis Series 7, 1993)

DOMENIC V. CICCHETTI: The Reliability of Peer Review for Manuscript and Grant Submissions: A Cross-Disciplinary Investigation

Table 2. Category-specific agreement levels for 866 submissions to a Major Subspecialty Medical Journal

Reviewer Recommendation    Average Frequency of Usage (%)    Observed Agreement (%)    Chance Agreement (%)    Corrected for Chance
Accept/Excellent                          5                          53                        34                    .29
Accept/As Is                              7                          65                        53                    .27
Accept/Revise                            21                          78                        68                    .33
Resubmit                                 24                          78                        75                    .12
Specialty Journal                        10                          74                        72                    .08
Reject                                   33                          81                        66                    .44
All Recommendations                     100                          77                        67                    .30

Note. Weighted kappa (Cohen 1968; Fleiss, Cohen & Everitt 1969) was used with a weighting system developed and recommended by Cicchetti (1976), Cicchetti & Fleiss (1977), and Cicchetti & Sparrow (1981), in which complete reviewer agreement is assigned a weight of 1, followed by disagreement that is one ordinal category apart (.8), two categories apart (.6), three (.4), four (.2), and five categories apart (0, i.e., "Accept/Excellent" vs. "Reject Outright"); a computational sketch of this weighting scheme is given at the end of this section. The corresponding R_i value for these data was .37, as shown in Table 2 of the target article. Source: Cicchetti & Conn (1976).

Category-specific agreement levels for the 866 submissions to a major subspecialty medical journal (Cicchetti & Conn 1976) are shown in Table 2. The only reviewer recommendation category that meets the Volkmar et al. (1988) criterion is "reject," with an observed rate of agreement of 81% and a chance-corrected level of .44. In summary, the data indicate that the accompanying levels of observed agreement are substantially higher for negative than for positive evaluations, and that the phenomenon holds, more generally, in the 3-or-more-category case, where varying levels of chance-corrected agreement are possible.

1.4. Interpreting the data in Tables 1, 2, 5, and 6

Eckberg raises the question of why the numbers of manuscript reviews for the Journal of Abnormal Psychology (JAP) and the Journal of Personality and Social Psychology (JPSP) vary from Table 1 to Table 2. For JPSP manuscripts, the two samples were different ones. The JAP data in Table 2 are based on a complete sample of 1,319 manuscripts submitted between 1973 and 1978; they focus on overall reviewer recommendations (scientific merit). The data in Table 1 (target article) are based on evaluation criteria (deriving from specific rating forms) that reviewers applied to JAP manuscripts submitted between 1976 and 1978. For the approximately 50% of the remaining manuscripts (1974-1975), these rating forms were unavailable to reviewers. To clarify this issue in Table 1, row A now reads: "For manuscripts submitted to the Journal of Abnormal Psychology (1976-1978)," rather than (1973-1978).

Referring to the data presented in Tables 5 and 6 (target article), Eckberg wonders why I conclude that reviewers agree more on rejection than on acceptance, rather than that reviewers simply reject more often than they accept. He also wonders whether the chi square(d) values in Tables 5 and 6 are incorrect. Concerning the first question, the data do, in fact, indicate substantially more agreement on rejection than on acceptance. This phenomenon is conceptually independent of the fact that reviewers recommend rejection much more often than acceptance. Take the data for JAP (first entry of Table 5). Of the 462 manuscripts that received positive reviewer recommendations, 44%, or 203, were in agreement. Of the 857 manuscripts receiving negative recommendations, however, there was agreement on 70%, or 600. The question raised here is simply whether there is significantly more agreement on rejection than on acceptance.
The chi square(d) value of 83.99 means that the difference is statistically significant beyond the .00001 level. The figures reported in both Tables 5 and 6 are all correct as reported in the target article. Two factors will cause chi square(d) values to vary, however. The most obvious (and least important) pertains to how many places beyond the decimal point are considered; this produces differences from simple rounding. The conceptually more serious source of variation arises from whether the chi square(d) test (here with 1 degree of freedom) is applied with or without the Yates (1934) correction factor. Fleiss (1981, p. 27) argues correctly that "because the incorporation of the correction for continuity brings probabilities associated with χ² and Z into closer agreement with the exact probabilities than when it is not incorporated, the correction should always be used." Soper et al. (1988) demonstrated in a recent computer simulation that the random application of the chi square(d) test to neuropsychological data resulted, as expected, in values that were indistinguishable from nominal or chance levels (e.g., .05 or .01) when the continuity correction was used; when it was not, many more significant chi square(d) values were produced than were warranted by the data. These results support Fleiss's arguments and are also consistent with the recommendations of Delucchi (1983, p. 169) and, much earlier, of Lewis and Burke (1949). Given the necessity of using the correction for continuity, what effect would its (incorrect) nonusage have on the chi square(d) and p values shown in Tables 5 and 6? The effects range from trivial to substantial, depending on the size of the continuity-corrected chi square(d) value and the number of cases on which the test is based. Thus the chi square(d) value for JAP, based on 1,319 cases, increases to 85.08, which, "p-wise," is indistinguishable from the reported continuity-corrected chi square(d) value.
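As a rough illustration of the chi square(d) comparison described above, the sketch below applies the test to a 2 × 2 table built from the figures quoted in the text (203 of 462 positive recommendations in agreement versus 600 of 857 negative ones), once with and once without the Yates continuity correction. The layout of the table is an assumption on my part; because the full Table 5 of the target article is not reproduced in this excerpt, the resulting statistics will only approximate the reported 83.99 and 85.08, and the sketch is meant only to show the mechanics of the corrected and uncorrected tests.

```python
# Minimal sketch: chi-square on a 2x2 agreement table, with and without the
# Yates continuity correction. The cell layout (recommendation sign x whether
# the second reviewer agreed) is assumed, not taken from the original Table 5.
from scipy.stats import chi2_contingency

table = [
    [203, 462 - 203],  # positive recommendations: agreements, disagreements
    [600, 857 - 600],  # negative recommendations: agreements, disagreements
]

for correction in (True, False):
    chi2, p, dof, _ = chi2_contingency(table, correction=correction)
    label = "with Yates correction" if correction else "without correction"
    print(f"chi-square {label}: {chi2:.2f} (df={dof}, p={p:.2e})")
```

As the text argues, the uncorrected statistic comes out somewhat larger than the continuity-corrected one, while the associated p values are indistinguishable for a sample of this size.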
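For readers reconstructing the "Corrected for Chance" column of Table 2, the reported values are consistent with the usual kappa-type correction, in which observed agreement is adjusted for the agreement expected by chance; this reading of the column is an inference from the reported figures rather than a formula stated in the excerpt. Taking the "Reject" row as a worked example:

\[
\kappa \;=\; \frac{p_o - p_c}{1 - p_c} \;=\; \frac{0.81 - 0.66}{1 - 0.66} \;\approx\; 0.44 .
\]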
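The weighting scheme described in the note to Table 2 amounts to linear agreement weights on an ordinal scale: full agreement receives weight 1, and each additional category of disagreement subtracts .2, i.e. w = 1 - |i - j|/(k - 1) for k = 6 categories. The sketch below is a generic implementation of Cohen's (1968) weighted kappa under that scheme, not the program used in the studies cited, and the reviewer ratings in it are hypothetical.

```python
# A minimal sketch of weighted kappa with the linear weighting scheme from the
# note to Table 2 (weights 1, .8, .6, .4, .2, 0 for 0-5 categories of ordinal
# disagreement). The example ratings are hypothetical, for illustration only.
import numpy as np

def weighted_kappa(r1, r2, k):
    """Cohen's weighted kappa for two raters using linear agreement weights."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    # k x k table of joint ratings, converted to proportions
    counts = np.zeros((k, k))
    for a, b in zip(r1, r2):
        counts[a, b] += 1
    p_obs = counts / counts.sum()
    # expected proportions under independence of the two raters
    p_exp = np.outer(p_obs.sum(axis=1), p_obs.sum(axis=0))
    # linear agreement weights: 1, .8, .6, .4, .2, 0 when k = 6
    i, j = np.indices((k, k))
    w = 1 - np.abs(i - j) / (k - 1)
    return ((w * p_obs).sum() - (w * p_exp).sum()) / (1 - (w * p_exp).sum())

# hypothetical reviewer ratings on the 6-point scale
# (0 = Accept/Excellent ... 5 = Reject)
reviewer_1 = [0, 2, 5, 5, 3, 1, 4, 5, 2, 3]
reviewer_2 = [1, 2, 5, 4, 3, 2, 5, 5, 1, 3]
print(f"weighted kappa: {weighted_kappa(reviewer_1, reviewer_2, k=6):.2f}")
```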
