Braun Tibor, Schubert András (eds.): Szakértői bírálat (peer review) a tudományos kutatásban: Válogatott tanulmányok a téma szakirodalmából [Peer review in scientific research: Selected papers from the literature on the topic] (MTAK Informatics and Science Analysis Series 7, 1993)

DOMENIC V. CICCHETTI: The Reliability of Peer Review for Manuscript and Grant Submissions: A Cross-Disciplinary Investigation

its fictions and errors, but never make any real progress. Referring specifically to "educational and psychological studies" (p. 310) as examples, Feynman (1985) has characterized this type of science as "cargo cult science: They follow all the apparent precepts and forms of scientific investigation, but they are missing something essential, because the planes don't land" (p. 311). What is missing is this: "It's a kind of scientific integrity, a principle of scientific thought that corresponds to a kind of utter honesty - a kind of leaning over backwards" (op. cit.).

Lately, signs have multiplied that Feynman's characterization of the behavioral sciences is less humorous than it sounds. Especially if one adopts a historical perspective, one finds two recurrent invariants: both (a) absurd and sterile research trends and (b) transparently erroneous claims persist far longer than would be expected on the basis of Rubin's roulette theory.

Estes (1975) cites axiomatic measurement theory as an example of (a): "One reason for the relative paucity of connections between measurement theory and substantive theory in psychology may arise from the fact that models for measurement have largely been developed independently as a body of abstract formal theory with empirical interpretations being left to a later stage" (p. 273). That much is clear. What is unclear is why it took 20 years to notice that "the difficulty with this approach is that the later stage often fails to materialize" (op. cit.). Similarly, in his recent autobiography, Luce (1989) cites mathematical learning theory as another example of false starts: "In learning, hundreds of papers studying and testing stochastic operator and Markov models have, in my opinion, come to very little" (p. 286). Purely random peer review might have produced this insight earlier. "At the risk of offending some colleagues," Luce can think of "only three areas where mathematical modelling can be shown to have had a profound impact." One of them is psychological testing, which, in his view, "is more mathematized than most people realize" (p. 285).

This may have been a mixed blessing, however, since it provides numerous examples of (b). Perhaps the best known instance is the Burt scandal, which was less a scandal about Burt than about his peers: "What . . . are we to make of the fact that Burt's transparently fraudulent data were accepted for so long, and so unanimously, by the 'experts' in the field?" (Kamin 1981, p. 105).

My own experiences are linked to the rediscovery of the factor indeterminacy in the early '70s (Schönemann 1971; Schönemann & Wang 1972; Steiger & Schönemann 1976). The significance of the indeterminacy is that it vitiates all claims that "intelligence" can be operationally defined as "g." This flaw of Spearman's factor model went unnoticed for a quarter of a century, until Wilson (1928) finally pointed it out in a review of Spearman's (1927) Abilities of man.

More recently, I found that one of the most popular formulae for estimating "heritability," Holzinger's h², which is supposed to estimate the proportion of genetic variance in the total (genetic plus environmental) variance, is erroneous because Holzinger (in Newman et al. 1937, pp. 94-116) had made a mistake in his derivations. As a result, h² contains no environmental variance at all (Schönemann 1989).
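To make the disputed quantity concrete: a heritability coefficient of this kind is meant to estimate the genetic share of the total variance. The twin-correlation formula given below is the textbook form commonly attributed to Holzinger; it is supplied only for orientation and is not quoted from the commentary or from Schönemann (1989). The intended quantity is

    h^2 = \frac{\sigma_G^2}{\sigma_G^2 + \sigma_E^2},

while the index usually credited to Holzinger is computed from monozygotic and dizygotic twin correlations as

    H = \frac{r_{MZ} - r_{DZ}}{1 - r_{DZ}}.

The commentary's charge is that the derivation offered for such an index does not in fact deliver the intended ratio, because the environmental component σ_E² drops out of it entirely.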
When this mistake was finally spotted after 60 years of uninterrupted use of h², several statistical editors refused to publish the correction for a variety of reasons that had nothing to do with the facts at issue: "As in the earlier review by another journal, the referees do not claim to have found mathematical errors in your development" (Solomon 1989). [See also Wahlsten "Insensitivity of the Analysis of Variance to Heredity-Environment Interaction" BBS 13 (1) 1990.]

A final example of (b) is Rosenthal & Rubin's (1978) failsafe solution of the "file-drawer problem" of meta-analysis. They proposed a formula intended to estimate the ("failsafe") number of suppressed studies from the number of published studies. Because they retrieved 345 studies on the experimenter expectancy effect, they estimated this number as 65,000 and dismissed the bias hypothesis as unreasonable for these data because "it is unreasonable to suppose that there existed enough unretrieved nonsignificant studies to overwhelm the studies we were able to retrieve" (p. 385).

Darlington (1980) soon noticed a problem with this reasoning, however: "Imagine that all the tested null hypotheses in a certain area are true, and that the only results published are the 5% of studies which achieve significance by chance. Suppose the 345 studies were published this way . . . then we are imagining that the total number of studies performed was T = 20 x 345 = 6900. Thus, a correct analysis of the data from the 345 published studies should in fact lead to T = 6900" - not 65,000, as Rosenthal and Rubin's failsafe formula had predicted (the arithmetic is sketched below).

Almost a decade went by after Rosenthal and Rubin published their Behavioral and Brain Sciences target article before Statistical Science, patterned after BBS, finally published an obliquely worded criticism of Rosenthal and Rubin's failsafe logic (Iyengar & Greenhouse 1988). After first praising the failsafe method as a "clever formulation of the file-drawer problem," the authors point out "several drawbacks that limit its usefulness" (p. 115). One such drawback is "the assumption that the unpublished studies are in fact a random sample of all studies that were done" (p. 110), because it conflicts with the very file-drawer hypothesis the failsafe number is supposed to cure: "Now if there were publication bias in favor of studies with statistically significant findings, then the Z values for the unpublished studies would not be a sample from the standard normal distribution" (p. 115).

In this case, too, the recorded evidence is at odds with the charitable null hypothesis that the long delay in correcting Rosenthal & Rubin's claims was solely due to chance. In fact, not just one but two authors repeatedly tried to alert editors that something was amiss with the failsafe argument Rosenthal (1979) had described in more detail in the Psychological Bulletin. Shortly after the article appeared, Darlington (1980) submitted a Note in which he challenged the failsafe argument with the simple counterargument cited earlier, concluding: "(Rosenthal's) formula appears to be incorrect, grossly overestimating X in some cases and grossly underestimating it in other cases" (Darlington 1981, Abstract). The editor encouraged him to revise his paper and then rejected the revision. A few years later, Thomas (1985) reached the same conclusion: "The solution proposed by Rosenthal for the 'file drawer problem' is a product of faulty reasoning and should be forgotten" (Abstract).
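The arithmetic behind Darlington's counterexample, and behind Thomas's verdict, can be sketched briefly. The fail-safe formula shown is the commonly cited form of Rosenthal's (1979) fail-safe N, supplied here as background rather than quoted from the commentary: the number of unpublished null-result studies needed to pull the combined result back to p = .05 (one-tailed) is usually written

    N_{fs} = \frac{\left( \sum_{i=1}^{k} Z_i \right)^2}{z_{\alpha}^2} - k, \qquad z_{\alpha} = 1.645, \; z_{\alpha}^2 = 2.706,

where k is the number of retrieved studies; this is the calculation that, for the k = 345 expectancy studies, produced the figure of roughly 65,000. Darlington's scenario instead assumes that every null hypothesis is true and that only the 5% of studies reaching significance by chance are published, so the implied total number of studies is simply

    T = \frac{k}{0.05} = 20k = 20 \times 345 = 6900,

which is why the fail-safe number can grossly overstate (or understate) how much suppressed evidence there would have to be.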
Pinpointing the flaw in Rosenthal and Rubin's reasoning precisely: "The conclusion is inescapable. In general Z* [the failsafe number; and by similar argument Z'] are not standard normal, i.e. n(0,1) in distribution" (p. 9), Thomas anticipated Iyengar and Greenhouse by more than five years. In the end, neither Darlington nor Thomas received any credit.

To summarize: As long as the validity of peer review is negative, as these and numerous other examples suggest, the rational course of action is to diminish its reliability further, not to enhance it.

Disagreement among journal reviewers: No cause for undue alarm

Lawrence J. Stricker
Educational Testing Service, Princeton, NJ 08541

The modest agreement among reviewers of journal manuscripts amply documented by Cicchetti is not a cause for undue alarm.

1. Interreviewer reliability is greater than it seems. The average intraclass correlation or kappa of approximately .30 between reviewers' ratings describes the reliability of ratings by a single reviewer.¹ Reviewers are analogous to test items in this situation, and the value of .30 is akin to the reliability of one test item.
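Stricker's analogy between reviewers and test items can be made concrete with the Spearman-Brown prophecy formula; the formula is standard psychometrics, and the numerical values below are an illustrative calculation rather than figures taken from the commentary. If a single reviewer's reliability is r_1 ≈ .30, the reliability of the mean rating of k independent reviewers is

    r_k = \frac{k \, r_1}{1 + (k - 1) \, r_1},

so two reviewers already give r_2 ≈ .60/1.30 ≈ .46 and three give r_3 ≈ .90/1.60 ≈ .56, much as a test assembled from many individually weak items can still be reliable as a whole.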
