CALL FOR PAPERS

ACL Workshop: COMPARING CORPORA

October 2000
Hong Kong University of Science and Technology

THEME
=====

Anyone who has worked with corpora will be all too aware of differences
between them. Depending on the differences, it may, or may not, be
reasonable to expect results based on one corpus to also be valid for
another. It may, or may not, be appropriate for a grammar, or parser,
based on one to perform well on another. It may, or may not, be
straightforward to port an application from a domain of the first text
type to a domain of the second.

Currently, characterisations of corpora are mostly textual and at
different levels of generality. A corpus is described as ``Wall Street
Journal'' or ``transcripts of business meetings'' or ``foreign learners'
essays (intermediate grade)''. It would be desirable to be able to place
a new corpus in relation to existing ones, and to be able to quantify
similarities and differences.

Allied to corpus-similarity is corpus-homogeneity. An understanding of
homogeneity is a prerequisite to a measure of similarity -- it makes
little sense to compare a corpus sampled across many genres, like the
Brown, with a corpus of weather forecasts, without first accounting for
the one being broad, the other narrow.

Given the centrality of corpora to contemporary language engineering, it
is remarkable how little research there has been to date on the
question. Biber's work, coming from sociolinguistics, has made a
considerable impact, with various researchers in computational
linguistics taking forward the model (Biber 1989, 1995). Studies in text
classification, genre and sublanguage are also salient, but it is rarely
evident how well the technologies developed in these fields are suited
to measuring corpus similarity or homogeneity.

The workshop will welcome contributions concerned with measuring and
comparing corpora using quantitative methods, from any field.

Where and when
==============

The workshop will last half a day and will be on either 7th or 8th Oct,
the main ACL conference being 3rd-6th Oct. The venue will be the same as
for the main conference.

Submissions:
============

Submissions are limited to original, unpublished work. Papers may not
exceed 3200 words (exclusive of title page and references). They must be
received by July 8, 2000, in hard copy (4 copies) OR postscript OR rtf
format. Electronic submissions should be sent to
compcorp@itri.brighton.ac.uk and hard copies should be mailed to:

    Compcorp submission
    ITRI
    University of Brighton
    Lewes Road
    Brighton BN2 4GJ
    United Kingdom

Important Dates:

    July 8, 2000          Submission (of full-length paper)
    August 17, 2000       Acceptance notice
    September 5, 2000     Camera-ready paper due
    October 7 or 8        Workshop date

Co-ordinators
=============

Adam Kilgarriff - University of Brighton, UK
Tony Berber Sardinha - Catholic University of Sao Paulo, Brazil

Programme committee
===================

Douglas Biber             Northern Arizona University
Jeremy Clear              University of Birmingham
Ted Dunning               MusicMatch Software, Inc.
Tomaz Erjavec             Jozef Stefan Institute, Slovenia
Pascale Fung              University of Science and Technology, Hong Kong
Sylviane Granger (tbc)    Universite Catholique de Louvain
Greg Grefenstette (tbc)   XRCE, Grenoble
Benoit Habert             LIMSI, France
Przemek Kaszubski (tbc)   Adam Mickiewicz University, Poland
Adam Kilgarriff           University of Brighton
David Lee                 University of Lancaster
Oliver Mason              University of Birmingham
Doug Oard                 University of Maryland
Tony Rose                 Canon Research
Tony Berber Sardinha      Catholic University of Sao Paulo, Brazil
George Tambouratzis       ILSP, Athens
Christopher Tribble       King's College, London University

Website
=======

http://www.itri.bton.ac.uk/events/compcorp