Braun Tibor, Schubert András (eds.): Peer Review in Scientific Research: Selected Studies from the Literature on the Subject (The MTAK Informatics and Science Analysis Series 7, 1993)

DOMENIC V. CICCHETTI: The Reliability of Peer Review for Manuscript and Grant Submissions: A Cross-Disciplinary Investigation

[A weighted kappa] places maximal weight on avoiding the Type I error. How to estimate such a kappa from reliability data may not be known (see the sketch at the end of this commentary), but the greater agreement reported among reviewers for rejection than for acceptance gives some hope that, in avoiding Type I errors, the review process may be operating better than the unweighted kappa indicates. Be that as it may, some of the strategies proposed (sects. 7.7-7.9) are directed not at improving validity or reliability per se, but at reducing what is here labelled Type II error. I agree that Type II error should be reduced, but not at the cost of increasing Type I error.

Well-done reviews leading to rejection may be beneficial in the long run. If a fatal flaw is detected, it prevents embarrassment, because one is allowed to withdraw quietly. If a flaw is remediable, the authors have the opportunity to revise the paper into one of substantially higher quality than the original. In my view, it is the editor's responsibility (not the authors') to detect and ignore poorly done reviews, or to ignore the occasional weak points in otherwise well-done reviews.

In place of the appeals processes Cicchetti suggests (sects. 7.8, 7.9), let me propose an external quality control panel. For each published paper, the names of the authors, reviewers, and editors, together with their recommendations, would be filed with this panel. The panel would then receive and compile all challenges to the scientific validity of the results reported in the Abstract of the paper (i.e., ignoring typos or minor errors). A few such challenges now appear as letters to the editors or as papers submitted to the same or other journals, but they are subject to review by the same editors, reviewers, and sometimes the same submitters, who may have erred in the first place. If enough evidence accumulates in these challenges to indicate a major flaw, one sufficient to raise questions about the validity of the overall conclusions, the journal should publish a summary of the challenges compiled by the quality control panel, along with the names of the authors and of the editors and reviewers who recommended publication. No attempt at adjudication should be made. Any paper on which such a question is raised should be required to remain listed in the authors' CVs, followed by a note such as "Results questioned (reference)."

I share what I perceive as Cicchetti's view that, with respect to the review process, the cup is more full than empty, but that there is merit in seeking to fill it further. I would differ in proposing that the review process be judged more by the results it produces (valid findings) than by the procedures it uses to produce those results, such as the "reliability" of reviewers. Regardless of which criteria are emphasized, however, Cicchetti's approach does an excellent job of discussing what should be done. Finally, the value of the discussion lies as much in its potential to make readers reevaluate their roles in the review process as in the specific proposals presented.
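As a sketch of what such a kappa might look like (this formalization is mine, not the commentary's), Cohen's (1968) weighted kappa is the natural starting point:

\[
  \kappa_w \;=\; 1 - \frac{\sum_{i,j} w_{ij}\, p_{ij}}{\sum_{i,j} w_{ij}\, e_{ij}},
  \qquad e_{ij} = p_{i\cdot}\, p_{\cdot j},
\]

where $p_{ij}$ is the observed proportion of submissions placed in category $i$ by one reviewer and in category $j$ by the other, $e_{ij}$ is the proportion expected by chance, and $w_{ij} \ge 0$ (with $w_{ii} = 0$) is a penalty the analyst assigns to each kind of disagreement. Weighting most heavily the cells that pair acceptance with rejection would emphasize precisely the disagreements most likely to let a flawed paper through; the difficulty noted above is that in reliability data neither reviewer's verdict is ground truth, so no choice of $w_{ij}$ can isolate Type I errors directly.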
Why is the reliability of peer review so low?

Donald Laming
Department of Experimental Psychology, University of Cambridge, Cambridge, England CB2 3EB
Electronic mail: drjl@phx.cam.ac.uk

I compliment Cicchetti on a careful and detailed survey of studies of peer review in many different disciplines. Of the 58 tabulated correlations between independent referees, only four fall short of 0.18 and only four exceed 0.40 (of which the highest reduced to 0.38 on replication). What is to be made of these low levels of interreferee agreement? Cicchetti is dispassionate in his presentation and I am not sure how he feels about these results. But most scientists would, I believe, say these levels are not good enough and need to be improved. I am going to argue, on the contrary, that significant improvement may not be possible.

1. Summary of argument. Laboratory studies of absolute judgment of simple stimuli (frequencies of pure tones, for example, or sound pressure levels) show that such judgments are nevertheless relative - relative, usually, to the preceding stimulus in the experiment. This means that successive stimuli are compared with a constantly shifting frame of reference that limits the accuracy of judgment much more than any specifically sensory confusion does. Three quite different statistics from studies of the judgment of sound intensity indicate that variation in the frame of reference accounts for about two-thirds of the variability of the judgments. Now transpose that result into the field of peer review. Two different referees use two different frames of reference for the evaluation of a submitted article or grant proposal. If those different reference frames contributed two-thirds of the variability of each evaluation, the correlation between independent peer reviews would be limited to about 0.33. I now fill out the details of my argument.

1.1. Absolute identification of simple stimuli. The most compelling example of the limited accuracy of absolute judgment comes from Pollack (1952). Pollack presented a series of tones to his subjects with frequencies selected at random from some number (m) of chosen values in the range 100 to 8,000 Hz, the number of different values ranging from 2 to 14 in different parts of the experiment. Each tone was presented for 2.5 sec at about an 85 dB loudness level. The subject identified the tone by assigning it a number in the range 1 to m, and was then told the correct identification. As the number of different auditory frequencies and response categories increased above four (up to which point identification was nearly error-free), errors increased at such a rate that the accuracy of identification never exceeded a level equivalent to the use of just five categories without error. This result - specifically the limit of five categories without error - is not peculiar to frequencies of tone; it is obtained for many other sensory attributes as well, with only a few exceptions (see Garner 1962, Chapter 3; Laming 1984, p. 155, Table 1).

This surprisingly low limit does not depend on sensory confusability. Jesteadt, Wier and Green (1977) found that there were about 2,000 just noticeable differences between 100 and 8,000 Hz. Moreover, in a series of supplementary experiments, Pollack (1952) manipulated several variables that affect discriminability without materially increasing the accuracy of identification of single tones. The only manipulation that increased accuracy was the presentation of a fixed reference tone (of a frequency known to the subject) prior to each stimulus to be judged (Pollack 1953). The limit to the accuracy of absolute judgment has to do with the lack of a stable frame of reference.
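In information-theoretic terms (a standard gloss, following Miller 1956, rather than anything in Laming's text), Pollack's ceiling of five error-free categories amounts to a fixed channel capacity for absolute identification:

\[
  T \;\le\; \log_2 5 \;\approx\; 2.3 \text{ bits per stimulus},
\]

where $T$ is the information transmitted, no matter how many stimulus values $m$ are presented and no matter how discriminable they are in pairwise comparison.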
The same conclusion may be drawn in a quite different manner from a study of magnitude estimation by Baird et al. (1980). When the intensities of two successive noise bursts differed by not more than 5 dB, the respective judgments (log magnitude estimates) correlated at about +0.8. So some 0.64 of the variability in the judgment of the second stimulus was inherited from error in judging its predecessor (see Laming, in press). This result has been replicated several times: it is found in the work of Jesteadt, Luce & Green (1977), Green et al. (1977), Luce & Green (1978), and Green et al. (1980). All these experiments used the intensity of a pure tone as the stimulus attribute to be judged, but Baird et al. (1980) have demonstrated the result with the area of an arbitrary geometric figure as well.

1.2. Transmission of error in absolute judgments. If each stimulus, and the judgment assigned to it, is used as a reference point for the judgment of its successor, any error in the first assignment will be transmitted to the second. Herein lies a substantial source of inaccuracy. The experiment by Baird et al. indicates that about two-thirds of the error of judgment may be accounted for in this way. There are two other experiments (more particularly, different statistics from two other experiments, not mere replications of this present result) that also point to a proportion of about two-thirds.
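To make the arithmetic of the summary argument explicit (the decomposition below is a reconstruction, not Laming's own notation): the Baird et al. correlation of $r = +0.8$ gives $r^2 = 0.64 \approx 2/3$ of the variance of one judgment inherited from its predecessor. Transposed to peer review, write each referee's evaluation as

\[
  X_k = S + F_k, \qquad k = 1, 2,
\]

where $S$ is the component common to both referees (the merit of the submission as it would be judged from a shared frame of reference) and $F_1, F_2$ are independent frame-of-reference components. If the frames contribute two-thirds of each evaluation's variance, then $\operatorname{Var}(F_k) = 2\operatorname{Var}(S)$ and

\[
  \operatorname{corr}(X_1, X_2)
  \;=\; \frac{\operatorname{Var}(S)}{\operatorname{Var}(S) + \operatorname{Var}(F_k)}
  \;=\; \frac{1}{3} \;\approx\; 0.33,
\]

which is the ceiling on interreferee correlation claimed in section 1.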
