Braun Tibor, Schubert András (eds.): Peer Review in Scientific Research: Selected Papers from the Literature of the Field (MTAK Informatics and Science Analysis Series 7, 1993)

DOMENIC V. CICCHETTI: The Reliability of Peer Review for Manuscript and Grant Submissions: A Cross-Disciplinary Investigation

English, history, and sociology. In these subjects a reliability of 0.9 or better is still commonly achieved. The difference vis-à-vis peer review is the use of a marking scheme, however imprecise, which is practised by the examiners. This difference is immediately apparent in comparison with university examinations (Byrne 1980; Cox 1967; Eells 1930; Hartog et al. 1936; Laming 1990; again excluding mathematical and physical sciences), which usually have no such scheme. Take away any pretence at a marking scheme and the reliability of examination marks falls to near the levels reported for peer review. There is a substantial argument to be made in favour of my alternative scenario, that the "specific criteria" referees are assumed to use have little or no behavioural reality.

3. Scientific progress?

The argument that follows next is, I suppose, a flagrant abuse of classical test theory. But it provides the vehicle for a particular pessimistic view of scientific progress that needs to be exposed to scrutiny. Suppose that referees typically accord a weight w to an article or a grant proposal to be appraised, the residue (1 - w) being contributed by the variability in the background with which the article is implicitly compared. The quantity (1 - w) corresponds to the estimate of two-thirds in section 1.3. Suppose, however, that the frames of reference implicitly used by different referees are not independent, but correlate r with each other, because the referees are chosen from within a common scientific tradition. The correlation to be expected between two independent assessments is then (1 - w)r + w. If a journal editor bases his decision on the reports of n different referees, application of the general Spearman-Brown formula (Gulliksen 1950, p. 78) suggests that the editorial decision will have a reliability of

$$ r' = \frac{n[(1-w)r + w]}{(n-1)[(1-w)r + w] + 1} \qquad (1) $$

In Equation 1, r' is the concordance between the articles ultimately published and the criteria referees are supposed to apply. It is also the correlation between the frames of reference with respect to which subsequent referees will formulate their assessments of the next generation of journal submissions. What happens to successive values of r'? Do they converge to a limit and, if so, what is the value of that limit? For admissible values of n and w the process does, indeed, converge, and the only possible limit is 1.

That is fine; the process of peer review converges on a common frame of reference that, in a scientific discipline, is presumably in concordance with the state of Nature. But the uncomfortable import of the correlations reported in the target article is that this does not seem to be happening. The only reconciliation of the theoretical argument and the empirical data that I can at present think of runs as follows: Once attention is confined to the rather narrow stratum of potentially plausible grant proposals and publishable papers, referees are, for the most part, unable to tell the meritorious from the rest, and scientific "progress" is principally a random progression.

It is clear from his espousal of proposition (a) in section 8 that Cicchetti does not share my pessimism. He takes an optimistic view of scientific progress, but on what evidence? The optimistic view envisages that most published research will have some detectable effect on the state of the field 50 or even 100 years hence.
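As a check on the convergence claim behind Equation 1, the following minimal numerical sketch simply iterates the formula across generations of referees. The values w = 1/3 (the complement of the two-thirds estimate cited from section 1.3) and n = 2 referees per decision, like the Python rendering itself and the function name, are illustrative assumptions of mine rather than anything given in the commentary.

```python
# Iterating Equation 1: r' = n[(1-w)r + w] / {(n-1)[(1-w)r + w] + 1}.
# Assumed illustrative values: w = 1/3 (so the background residue 1-w
# matches the two-thirds estimate of section 1.3) and n = 2 referees.

def editorial_reliability(r: float, w: float, n: int) -> float:
    """One application of the general Spearman-Brown formula (Equation 1)."""
    single = (1 - w) * r + w  # expected correlation between two assessments
    return n * single / ((n - 1) * single + 1)

w, n = 1 / 3, 2
r = 0.0  # start from fully independent frames of reference
for generation in range(1, 11):
    r = editorial_reliability(r, w, n)
    print(f"generation {generation:2d}: r' = {r:.4f}")

# r = 1 is the fixed point: at r = 1 the single-rating correlation is
# (1-w) + w = 1, so r' = n/n = 1; the printed sequence climbs toward it.
```

On these assumptions r' rises from 0.50 past 0.99 within six generations, which is what makes the flat inter-referee correlations reported in the target article so awkward for the optimistic reading.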
It is difficult to see what evidence could be brought to bear on the proposition that most published research will have such a detectable effect 50 or 100 years hence. But I have had occasion to consult journal articles in my subject (experimental psychology) from 50 and sometimes 100 years ago. On those occasions, I have often glanced at the table of contents of the journal volume being consulted, just to see what else was there. It is interesting to discover an article of historical significance that one has heard of in a different context. But, usually, nine articles out of ten, even 19 out of 20, have proved to be completely unknown. The present state of my subject would be no different if those articles had never been published. Is the situation any different today?

Should the blinded lead the blinded?

Stephen P. Lock
British Medical Journal, Tavistock Square, London WC1H 9JR, England

Given the apparent inherent variation in opinions, not only between referees themselves, but between referees and editors, what can be done to improve things? My personal hierarchy of proposals would start with the editors' subcategorizing the questions they expect reviewers to consider. For example, instead of answering the question "Is the work original?", the referee could indicate: "New to me; known to me: (a) by rumour, (b) by personal communication, (c) from presentation at a conference (with or without abstract), (d) from published work, or (e) from retrieval from a database."

Next, I would advocate two unconfirmed hunches. First, that the quality of a decision is enhanced by having an editorial "hanging committee" (named by analogy with the selection body of the London Royal Academy of Arts), which discusses most of the articles with "grey" reviewers' reports (that is, 2-4 on a 1-5 scale of reject/accept). Second, that for a very general journal better-quality reviews are obtained if the choice of reviewer is delegated to an assistant editor in the subfield; expert knowledge from one competent reviewer is more helpful for making a decision, in my view, than two or more opinions from referees with no specific expertise. (Both proposals are encoded schematically in the sketch below.)

Paramount among my suggestions, however, is the need for blind review - or at least for editors to study it under their own circumstances. To earlier suggestions of the cogency of this view by Mahoney (1977) and Peters and Ceci (1982) must be added the results of the rigorous study by McNutt et al. (1990). Not only did the last show that blinding was feasible for the editorial office and successful for 76% of reviewers, but on a 3-point scale there was a 21% improvement in the quality of reviews, as well as a striking increase in the proportion of excellent reviews among the blinded reviewers. So, in addition to replicating these findings for other journals, another study that is now urgent would determine the effect of blinding on the editors themselves, particularly in view of some recent findings (Garfinkel et al. 1990). Some 25 manuscripts that had already been revised and accepted for publication in the Journal of Pediatrics were sent for re-review by two additional referees, and then reevaluation by three experienced, independent assistant editors. Most manuscripts were thought by the new reviewers to have defects that warranted further revision, but, though one of the participating assistant editors would have requested revision more often than the others, there was infrequent disagreement among them about the basic decision to accept or reject.
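Here is a minimal sketch, in Python, of how an editorial office might encode Lock's first two proposals. The names (Originality, needs_hanging_committee) and data shapes are hypothetical illustrations of mine; Lock himself specifies only the answer categories for the originality question and the "grey" band of 2-4 on a 1-5 scale.

```python
from enum import Enum

class Originality(Enum):
    """Lock's proposed subcategories for 'Is the work original?'."""
    NEW_TO_ME = "new to me"
    KNOWN_BY_RUMOUR = "known to me by rumour"
    KNOWN_BY_PERSONAL_COMMUNICATION = "known to me by personal communication"
    KNOWN_FROM_CONFERENCE = "known to me from a conference presentation"
    KNOWN_FROM_PUBLISHED_WORK = "known to me from published work"
    KNOWN_FROM_DATABASE = "known to me from database retrieval"

def needs_hanging_committee(scores: list[int]) -> bool:
    """Flag a manuscript for the editorial 'hanging committee' if any
    reviewer's report is 'grey': 2-4 on the 1-5 reject/accept scale."""
    return any(2 <= s <= 4 for s in scores)

# Usage: one clear accept (5) plus one grey report (3) goes to committee;
# two clear-cut reports (1 and 5) do not.
print(needs_hanging_committee([5, 3]))  # True
print(needs_hanging_committee([1, 5]))  # False
```

The point of the encoding is Lock's: a graded, named response forces the reviewer to say *how* the work is known rather than merely whether it is original, and the committee rule reserves editorial discussion for the reports where reviewer opinion is least informative.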
My second group of comments relates to publication bias, in particular the preference for original over replicative work and for manuscripts reporting positive results. Possibly, now that editors have recognised the pitfalls of this attitude, which were well discussed at the First International Congress on Peer Review in Biomedical Publication (Chalmers 1990; Chalmers et al. 1990; Dickersin 1990; Sharp 1990), the problem will diminish, particularly if authors appeal on this account. Nevertheless, the editorial decision must depend on circumstances: For example, whatever the findings, I believe that the British Medical Journal would be interested in publishing other studies (however long and detailed) of the incidence of leukaemia in children of fathers who had been exposed to high doses of radioactivity in their work in various atomic power stations, thus confirming or refuting the recent work by Gardner et al. (1990) at Sellafield.

The recent introduction of structured abstracts (Ad Hoc Working Group for Critical Appraisal of the Medical Literature, 1987) may also make it easier for editors to find the space for confirmatory reports or reports with negative results. With a limit of 400 words and a tightly defined vocabulary, these allow a detailed statement of the study's objectives, setting, methods,
