CALL FOR PAPERS

Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization

Workshop at the Annual Meeting of the Association for Computational Linguistics (ACL 2005)
Ann Arbor, Michigan
June 29, 2005

Submission Deadline: May 2, 2005

We are aware that this submission deadline conflicts with a proposal deadline from a US funding agency. If this conflict causes you significant hardship, please contact the workshop organizers and let us know.

http://www.isi.edu/~cyl/MTSE2005/

This one-day workshop will focus on the challenges that the MT and summarization communities face in developing valid and useful evaluation measures. Our aim is to bring these two communities together to learn from each other's approaches. In the past few years, we have witnessed, in both MT and summarization evaluation, the introduction of n-gram-based intrinsic metrics that automatically score system outputs against human-produced reference documents (e.g., IBM's BLEU and its ISI/USC counterpart, ROUGE). Similarly, there has been renewed interest in user applications and task-based extrinsic measures in both communities (e.g., DUC'05 and TIDES'04). Most recently, evaluation efforts have tested for correlations to cross-validate independently derived intrinsic and extrinsic assessments of system outputs against each other and against human judgments of output quality, such as accuracy and fluency.

The concrete questions that we hope to see addressed in this workshop include, but are not limited to:

- How adequately do intrinsic measures capture the variation between system outputs and human-generated reference documents (summaries or translations)? What methods exist for calibrating and controlling the variation in linguistic complexity and content across input test sets and reference sets? How much variation exists within these constructed sets? How does that variation affect different intrinsic measures? How many reference documents are needed for effective scoring?

- How can intrinsic measures go beyond simple n-gram matching to quantify the similarity between system outputs and human references? What other features and weighting alternatives lead to better metrics for both MT and summarization? How can intrinsic measures capture fluency and adequacy? What new types of intrinsic metrics are needed to adequately evaluate non-extractive summaries and paraphrasing (e.g., interlingual) translations?

- How effectively do extrinsic (or proxy extrinsic) measures capture the quality of system output as needed for downstream use in human tasks, such as triage (document relevance judgments), extraction (factual question answering), and report writing, and in automated tasks, such as filtering, information extraction, and question answering? For example, when is an MT system good enough that a summarization system benefits from the additional information available in the MT output?

- How should metrics for MT and summarization be assessed and compared? What characteristics should a good metric possess? When is one evaluation method better than another? What are the most effective ways of assessing the correlation tests and statistical models that seek to predict human task performance or human notions of output quality (e.g., fluency and adequacy) from "cheaper" automatic metrics? How reliable are human judgments?

Anyone with an interest in MT or summarization evaluation research, or in issues pertaining to the combination of MT and summarization, is encouraged to participate in the workshop.
We are looking for research papers on the aforementioned topics, as well as position papers that identify limitations in current approaches and describe promising future research directions.

SHARED DATA SETS

To facilitate the comparison of different measures during the workshop, we will make data sets available in advance for workshop participants to test their approaches to evaluation. For details on accessing the data sets, please go to the workshop's website at http://www.isi.edu/~cyl/MTSE2005.

WORKSHOP FORMAT

The workshop will include presentations of research papers and short reports, an invited report on the TIDES 2005 multi-lingual, multi-document summarization evaluation, and significant discussion time to compare the results of different researchers. The workshop will conclude with a panel of invited discussants addressing future research directions.

TARGET AUDIENCE

The topic of this workshop should be of significant interest to the entire MT and summarization research communities, as well as to commercial developers of MT and summarization systems. It should be of particular interest to the program managers and participants of the MT and summarization programs funded by the US Government, where common evaluations are an integral part of the research program.

SUBMISSION INFORMATION

Submissions will consist of regular full papers, reports on evaluations using the shared data sets, and position papers, formatted following the ACL 2005 guidelines. Details for submission will be posted on the workshop website. The submission and review processes will be handled electronically.

IMPORTANT DATES

All submissions due: Mon, May 2, 2005
(We are aware that this submission deadline conflicts with a proposal deadline from a US funding agency. If this conflict causes you significant hardship, please contact the workshop organizers and let us know.)
Notification: Sun, May 22, 2005
Camera-ready papers due: Wed, June 1, 2005

ORGANIZERS

Jade Goldstein, jgstewa@afterlife.ncsc.mil, DoD, USA
Alon Lavie, alavie@cs.cmu.edu, LTI, CMU, USA
Chin-Yew Lin, cyl@isi.edu, Information Sciences Institute, USC, USA
Clare Voss, voss@arl.army.mil, Army Research Laboratory, USA

PROGRAM COMMITTEE

Yasuhiro Akiba (ATR, Japan)
Leslie Barrett (TransClick, USA)
Bonnie Dorr (U Maryland, USA)
Tony Hartley (U Leeds, UK)
John Henderson (MITRE, USA)
Chiori Hori (LTI, CMU, USA)
Eduard Hovy (ISI/USC, USA)
Doug Jones (MIT Lincoln Laboratory, USA)
Philipp Koehn (CSAIL, MIT, USA)
Marie-Francine Moens (Katholieke Universiteit Leuven, Belgium)
Hermann Ney (RWTH Aachen, Germany)
Franz Och (Google, USA)
Becky Passonneau (Columbia U, NY, USA)
Andrei Popescu-Belis (ISSCO/TIM/ETI, U Geneva, Switzerland)
Dragomir Radev (U Michigan, USA)
Karen Sparck Jones (Computer Laboratory, Cambridge U, UK)
Simone Teufel (Computer Laboratory, Cambridge U, UK)
Nicola Ueffing (RWTH Aachen, Germany)
Hans van Halteren (U Nijmegen, The Netherlands)
Michelle Vanni (ARL, USA)
Dekai Wu (HKUST, Hong Kong)