Az Eszterházy Károly Tanárképző Főiskola Tudományos Közleményei. 1996. Vol. 1. Eger Journal of English Studies.(Acta Academiae Paedagogicae Agriensis : Nova series ; Tom. 24)
Ramesh Krishnamurthy: Change and continuity at COBUILD (1986-1996)
between Cobuild staff and academics at the University. The note of continuity is particularly evident in our publications. For example, most of the innovative features of the original COBUILD Dictionary (1987) have been retained in the new edition (1995), although many of them have been refined or enhanced.

2 Data

2.1 Size

The 1987 COBUILD Dictionary was based on an initial detailed examination of a 7.3 million word corpus (six million words of written texts, 1.3 million words of spoken texts). This analysis was subsequently enhanced with reference to a 20 million word corpus prior to publication of the dictionary. By 1995, the corpus (now called the Bank of English) had grown to over 211 million words, and provided the evidence for the 1995 edition. During 1996, we anticipate that it will exceed 300 million words.

2.2 Vintage

The 20 million word corpus consisted mainly of data from 1975-1985, whereas the 211 million word corpus data originated largely from 1985-1995. So the new corpus was, of course, more up-to-date and reflected many linguistic and real world changes.

2.3 Corpus And Subcorpora

The 20 million word corpus was stored as one single entity. The 211 million word corpus is held as 16 subcorpora, distinguished by source or text type. This allows finer tuning of the analysis, including contrastive studies of different genres of language, such as informal speech and broadcast speech, broadsheet newspapers and tabloids, etc. We hope to add two new subcorpora during 1996.

2.4 Integerization

Another major difference is not apparent to the user, but has had a substantial impact on the speed of the corpus retrieval programs: the 20 million word corpus was stored as characters, and therefore even the simplest search program had to match each character of the search word with each character of the corpus word. The 211 million word corpus is held as integers (i.e. each word is encoded as a number), so the search program now only has to match one number with another.
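
The idea of integerization can be illustrated with a minimal sketch. It is not the Cobuild retrieval software, whose encoding scheme is not described here. The function names, the toy sentence, and the choice to number word forms in order of first appearance are all assumptions made for illustration.

```python
from typing import Dict, List


def build_vocabulary(tokens: List[str]) -> Dict[str, int]:
    """Assign each distinct word form a number (assumed scheme: order of first appearance)."""
    vocab: Dict[str, int] = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def integerize(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Replace every word in the corpus with its number."""
    return [vocab[token] for token in tokens]


def find_positions(encoded: List[int], vocab: Dict[str, int], word: str) -> List[int]:
    """Look the search word up once, then compare one number per corpus word."""
    word_id = vocab.get(word)
    if word_id is None:
        # The word never occurs in the corpus, so no scan is needed.
        return []
    return [i for i, current in enumerate(encoded) if current == word_id]


# Hypothetical toy corpus, standing in for the real data.
tokens = "the cat sat on the mat because the mat was warm".split()
vocab = build_vocabulary(tokens)
encoded = integerize(tokens, vocab)
positions = find_positions(encoded, vocab, "mat")
print(positions)
```

In this sketch the search word is converted to its number once, and each corpus position then needs a single integer comparison instead of a character-by-character match.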