Braun Tibor, Schubert András (eds.): Szakértői bírálat (peer review) a tudományos kutatásban: Válogatott tanulmányok a téma szakirodalmából [Peer review in scientific research: Selected studies from the literature of the field] (MTAK Informatikai és Tudományelemzési Sorozata 7, 1993)

DOMENIC V. CICCHETTI: The Reliability of Peer Review for Manuscript and Grant Submissions: A Cross-Disciplinary Investigation

J. C. Stevens and Tulving (1957) reported on a class of 70 undergraduate students making their first-ever magnitude estimates of loudness. Subsequently, S. S. Stevens (1971, Figure 8) plotted the cumulative distributions of those judgments (after "modulus equalisation" [Stevens 1971, p. 428] to remove differences in the absolute scale of different subjects' judgments) to show that those distributions were approximately log-normal. The inverse gradient of the cumulative distribution function (cumulative normal probability versus log estimate) estimates the standard deviation, and the variabilities of successive log-magnitude estimates, calculated in this manner, are tabulated by Laming (1984, p. 168, Table 2). For the very first judgment by each of the 70 subjects the variance was 0.010. For the second judgment the variance was 0.020, and so on, increasing to an asymptotic level of about 0.030. So the variability contributed by a single stimulus presentation amounted to about one-third of the asymptotic value. The other two-thirds must have been inherited from the preceding judgment.

Of necessity, magnitude estimation requires that the subject receive no knowledge of results, lest it bias the judgments. In absolute identification, on the other hand, feedback after each trial is the rule. The two procedures can nevertheless be compared by conducting an absolute identification experiment without feedback, using the same set of stimuli and the same presentation schedule in both kinds of experiment. Braida & Durlach (1972, Experiment 1) is a case in point.

My third estimate comes from an as yet unpublished replication of Braida & Durlach's experiment. The stimuli were ten 1-kHz tones of 0.5 sec duration, ranging in level from 50 to 86 dB SPL in 4 dB steps. For the first 3,000 trials the subject was asked simply to estimate the loudness (without being told that there were only 10 different stimuli). For the next 3,000 trials the subject was asked to identify the stimulus, but without feedback. The final 3,000 trials were again absolute identification, but with the correct response indicated immediately after each judgment. The data from all three tasks were analysed using Torgerson's (1958, Chapter 10) model of categorical judgment (see also Braida & Durlach 1972). This model estimates d′ for the separation between adjacent pairs of stimuli (cf. Luce et al. 1982), and Figure 1 plots the cumulative d′ for one subject performing each of the three tasks. There is not much difference between discrimination in the magnitude estimation and absolute identification tasks, both without feedback. When immediate knowledge of results is provided, however, d′ improves. Comparing the aggregate from 50 to 82 dB, d′ improves by the factor 1.79, which is equivalent to a decrease in the model variance to 0.31 (= 1.79⁻²) of its former value. Evidently, immediate knowledge of results substitutes the correct response for the response actually made in the point of reference used for the ensuing judgment, thereby preventing the transmission of error from one trial to the next. The proportion of error inherited from the preceding trial by this particular subject is therefore 0.69.

I have no theoretical foundation for this proportion of about two-thirds; it probably signifies no more than a fortuitous selection of experimental sources.
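In outline, the variance bookkeeping behind these proportions runs as follows (a worked restatement of the figures just quoted; the labels under the braces are supplied here for clarity and are not part of the original tabulation):

% New error per presentation plus error inherited from the preceding
% trial gives the asymptotic variance of the log-magnitude estimates:
\[
  \underbrace{0.010}_{\text{single presentation}}
  \;+\; \underbrace{0.020}_{\text{inherited error}}
  \;\approx\; \underbrace{0.030}_{\text{asymptotic variance}},
  \qquad
  \frac{0.020}{0.030} \;=\; \tfrac{2}{3}.
\]
% With immediate knowledge of results, d' improves by the factor 1.79,
% so the model variance falls to 1.79^{-2} of its former value:
\[
  1.79^{-2} \;\approx\; 0.31,
  \qquad
  1 - 0.31 \;=\; 0.69
  \quad \text{(proportion of error inherited without feedback)}.
\]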
But while, of necessity, the experiments are all somewhat similar, my three estimates are obtained from different kinds of experimental statistic. For this reason, two-thirds is a defensible value to transpose to the domain of peer review.

[Figure 1 (Laming). Cumulative d′, cumulated from the smallest stimulus (50 dB) upward, for magnitude estimation, for absolute identification without feedback, and for identification with immediate knowledge of results. Abscissa: stimulus amplitude in dB SPL.]

1.3. Application to peer review. In section 2 of his target article Cicchetti discusses a set of evaluative attributes and specific criteria for peer review of journal articles and grant proposals. He seems to envisage that these criteria are internalised by referees, possibly in the manner in which some musicians exercise "perfect pitch." An alternative scenario treats those criteria as no more than empty verbal formulae, which do not generate any particular behavioural correspondence between the bases of judgment by different referees. Instead, different referees formulate their judgments against different backgrounds of ideas which, according to the foregoing analysis, should account for about two-thirds of the variability of their assessments. If those different frames of reference are truly independent, then the scope for agreement between two referees is limited to the remaining one-third of the variance, and that one-third corresponds as closely as one could reasonably expect to the spread of correlations reported by the authors.

Test theory (Gulliksen 1950, p. 13) tells us that the reliability of any one assessment may be measured by the correlation between two independent judges, here 0.33. An editor, however, has two or more such assessments on which to base a decision whether to accept an article for publication. The Spearman-Brown formula (Gulliksen 1950, p. 63) tells us that the combination of two independent referees raises the reliability of the editorial decision to 0.50 (the calculation is spelled out in the worked formula below). But I think most scientists would still regard this as unacceptably low. Some further exploration of the process of assessment is needed to discover what improvement may be possible.

2. Comparison with the marking of examinations. Public examinations in the United Kingdom (GCE "O" and "A" levels) typically achieve a mark-remark correlation of 0.9 or better (Murphy 1978; 1982). The re-marking in Murphy's studies was undertaken by an independent examiner and in that respect is comparable to peer review. I exclude from this comparison examinations in mathematics, physics, and kindred subjects, because for those subjects examiners are provided with, in effect, a list of the admissible answers and the marks to be assigned to each. Even though a professional article falls within one of those disciplines, it is nevertheless open-ended (the choice of topic is always at the author's disposal), so the correct comparison must be with examinations in subjects such as
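To spell out the Spearman-Brown step cited in section 1.3 (a worked restatement of Gulliksen's 1950 formula with the values quoted there; r_1 is the single-referee reliability and n the number of independent referees):

% Reliability of a composite of n independent assessments:
\[
  r_n \;=\; \frac{n\,r_1}{1 + (n-1)\,r_1},
  \qquad
  r_2 \;=\; \frac{2 \times 0.33}{1 + 0.33} \;\approx\; 0.50 .
\]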
