Braun Tibor, Schubert András (eds.): Peer Review in Scientific Research: Selected Papers from the Literature of the Field (MTAK Informatics and Science Analysis Series 7, 1993)
DOMENIC V. CICCHETTI: The Reliability of Peer Review for Manuscript and Grant Submissions: A Cross-Disciplinary Investigation
His pessimism is based on his examination of journal articles in his field of interest (experimental psychology) that were published between 50 and 100 years ago. He concludes that if more than 90% of this research had never been published, the state of experimental psychology would be no different than it is today.

In contrast to the pessimism shared by Laming, Roediger, and Zentall, I must state emphatically that the progress in my own field of inquiry, assessing the reliability and validity of standard and state-of-the-art diagnostic instruments in both behavioral science and medicine, has been nothing short of dramatic. Thus, my colleagues and I have developed highly reliable and valid instruments over a wide range of disorders:

1. In behavioral science, for example, adaptive behavior (Sparrow et al. 1984a; 1984b; 1985), alexithymia (Krystal et al. 1986), personality disorders (Cicchetti & Tyrer 1988; Tyrer, Cicchetti et al. 1984; Tyrer, Strauss et al. 1984), anxiety (Tyrer, Owen et al. 1984), affective behaviors of demented patients (Nelson et al. 1989), and dissociative disorders (Steinberg et al. 1990); and

2. In medicine, the Yale Observation Scales for identifying seriously ill febrile children (McCarthy et al. 1982; McCarthy et al. 1990), new methods for classifying cataracts both in vitro (Cicchetti et al. 1982) and in vivo (Cotlier et al. 1982), and the accuracy of the barium enema in diagnosing (a) Hirschsprung disease (Rosenfield et al. 1984) and (b) acute appendicitis (Garcia et al. 1987).

For each of these diverse areas, we have consistently shown levels of reliability in the GOOD to EXCELLENT range (usually kappa or R_i values of .90 and above), as well as good evidence for validity. When I became actively involved in research more than two decades ago, there was little optimism that the low levels of reliability and accuracy of judgment (especially in the behavioral sciences) would ever become "respectable." Yet, less than a decade ago, the field of psychiatric diagnosis had improved dramatically, as encapsulated in the writings of Grove et al. (1981, p. 408):

For years, achieving adequate diagnostic reliability in psychiatry was considered to be a hopeless undertaking. A number of landmark studies suggested that psychiatrists looking at the same patients frequently disagreed about the appropriate diagnoses. As a consequence, the importance of diagnosis was minimized in both research and clinical work. . . . The reversal of nihilistic attitudes about psychiatric diagnosis has led to a rigorous (and successful) attempt to rework the entire American diagnostic system used by clinicians, DSM-III, which demonstrated in field trials that good agreement could be achieved even in routine practice.

The specific details about how I believe that similar breakthroughs can be made in the field of peer review (namely, improving both its reliability and validity) are expressed in a later section of the report.

2.2. Reliability levels are worse than indicated.

Schönemann gives examples from the published literature in which false claims about a number of phenomena have been made and perpetuated (e.g., indeterminacy, heritability, the results of mathematical modeling). Because manuscripts with high reliability (editors and reviewers agreed at the time that they should be published) turn out to have "negative" validity, this can only mean, he argues, that reliability is lower than one would think, perhaps at random or chance levels.
Although Schönemann's argument has a certain face-validity appeal, I am hard pressed to calculate the actual frequency with which the unfortunate phenomena he reports occur, relative to the mammoth corpus of research that has been published. In mathematical terms, we are faced with trying to interpret a ratio with both an unknown numerator (the number of invalid published research findings) and an unknown denominator (the total number of nonredundant published findings). In short, it is not possible for me to draw a cause-and-effect conclusion on these matters given the data presented thus far. Perhaps, given the enormous volume of published research appearing in such diverse outlets, one could never arrive at a valid conclusion.

2.3. Reliability is better than indicated.

Several commentators (Hargens, Marsh & Ball) mention that reliability levels may have been underestimated by taking into account only the recommendations of two independent reviewers. Marsh & Ball, for example, note that in addition to the initial two reviews, the editor often has his own review, author revisions, and further reviews of the revised manuscript on which to base a decision, thereby probably increasing the reliability of the process. The additional review, however, whether by the editor or a third reviewer, is often not an independent one and so may be heavily influenced by the results of the initial two reviews.

Despite this problem, there is a factor mentioned both by Hargens and by Marsh & Ball that one can test empirically, namely, that the editor's weeding out of very poor quality manuscripts (rejected without being sent out for review) might reduce the variance and thereby increase the levels of interreviewer agreement, because these very submissions are the type we have shown to produce the highest levels of consensus. Hargens cites both Gordon (1977) and Zuckerman and Merton (1971) to suggest that rates of summary rejection by the editor alone may reach levels as high as 50% for prestigious journals in both social science and medicine.

Fortunately, I have been able to analyze further some additional data deriving from reviews for the Journal of Abnormal Psychology (JAP) during 1973 and 1977. As given in Table 3, and based on 996 submissions, there was an overall R_i (or kappa) value of .24, with 73% agreement on rejection, 51% agreement on acceptance, and 65% overall agreement. In addition to these 996 submissions, the editor received 384 further manuscripts. He rejected 333 (86.7%) and accepted the remaining 51 (13.3%). If we make the assumption that the rejected manuscripts would also have been rejected by another independent reviewer, because of their obvious poor quality or inappropriateness for JAP, the results show that overall agreement increases from 65% to 74%; agreement on rejection increases from 73% to 82%; agreement on acceptance remains at 51%; and R_i (or kappa) increases from .24 to .34.

In conclusion, even if one assumes that the reliability of negative editorial reviews is perfect, it may not have a profound effect on increasing the reliability of the peer review process. Thus, whereas the agreement level on rejection improves, the lack of a corresponding increase in reliability for acceptance keeps the R_i value at relatively low levels.
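To make the arithmetic behind these figures concrete, the short Python sketch below computes overall agreement, specific agreement on acceptance and on rejection, and Cohen's kappa from a two-reviewer 2x2 table, and then folds the editor's 333 summary rejections into the table as if they were unanimous rejections. The cell counts used here are hypothetical: the actual reviewer-by-reviewer table is not reported in this excerpt, so the values were chosen only to approximate the summary statistics quoted above (about 65% overall agreement and a kappa near .24 before the adjustment, about 74% and .34 after it); they should not be read as the actual JAP data.

    # Minimal sketch of the agreement arithmetic (not the author's own analysis code).
    # Cell counts for a two-reviewer table on 996 submissions are hypothetical,
    # chosen only to approximate the summary figures reported in the text.

    def agreement_stats(a, b, c, d):
        """Agreement indices for a 2x2 reviewer-by-reviewer table.

        a = both reviewers recommend acceptance
        b, c = the two kinds of disagreement
        d = both reviewers recommend rejection
        """
        n = a + b + c + d
        overall = (a + d) / n                    # overall (raw) agreement
        acc = 2 * a / (2 * a + b + c)            # specific agreement on acceptance
        rej = 2 * d / (2 * d + b + c)            # specific agreement on rejection
        # Chance-expected agreement from each reviewer's marginal acceptance rate,
        # then Cohen's kappa as chance-corrected agreement
        p1, p2 = (a + b) / n, (a + c) / n
        chance = p1 * p2 + (1 - p1) * (1 - p2)
        kappa = (overall - chance) / (1 - chance)
        return overall, acc, rej, kappa

    # Hypothetical counts approximating the 996 reviewed JAP submissions
    a, b, c, d = 182, 171, 171, 472

    print("Two independent reviews: overall=%.2f accept=%.2f reject=%.2f kappa=%.2f"
          % agreement_stats(a, b, c, d))

    # Treat the editor's 333 summary rejections as additional unanimous rejections
    print("Plus summary rejections: overall=%.2f accept=%.2f reject=%.2f kappa=%.2f"
          % agreement_stats(a, b, c, d + 333))

Run as written, the first line prints roughly 66% overall agreement with a kappa near .25, and the second roughly 74% with a kappa near .34, mirroring the pattern described above: adding unanimous rejections raises overall and rejection-specific agreement, but because acceptance-specific agreement is unchanged, the chance-corrected coefficient remains comparatively modest.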