CALL FOR PAPERS

Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization

Workshop at the Annual Meeting of the Association for Computational Linguistics (ACL 2005)
Ann Arbor, Michigan
June 29, 2005

Submission Deadline: May 2, 2005

We are aware that this submission deadline conflicts with a proposal deadline from a US funding agency. If this conflict causes you significant hardship, please contact the workshop organizers and let us know.

http://www.isi.edu/~cyl/MTSE2005/

This one-day workshop will focus on the challenges that the MT and summarization communities face in developing valid and useful evaluation measures. Our aim is to bring these two communities together to learn from each other's approaches. In the past few years, we have witnessed, in both MT and summarization evaluation, the introduction of n-gram-based intrinsic metrics that automatically score system outputs against human-produced reference documents (e.g., IBM's BLEU and its ISI/USC counterpart, ROUGE). Similarly, there has been renewed interest in user applications and task-based extrinsic measures in both communities (e.g., DUC'05 and TIDES'04). Most recently, evaluation efforts have tested for correlations to cross-validate independently derived intrinsic and extrinsic assessments of system outputs against each other and against human judgments of output quality, such as accuracy and fluency.

The concrete questions that we hope to see addressed in this workshop include, but are not limited to:

- How adequately do intrinsic measures capture the variation between system outputs and human-generated reference documents (summaries or translations)? What methods exist for calibrating and controlling the variation in linguistic complexity and content across input test sets and reference sets? How much variation exists within these constructed sets? How does that variation affect different intrinsic measures? How many reference documents are needed for effective scoring?

- How can intrinsic measures go beyond simple n-gram matching to quantify the similarity between system outputs and human references? What other features and weighting alternatives lead to better metrics for both MT and summarization? How can intrinsic measures capture fluency and adequacy? What new types of intrinsic metrics are needed to adequately evaluate non-extractive summaries and paraphrasing (e.g., interlingual) translations?

- How effectively do extrinsic (or proxy extrinsic) measures capture the quality of system output as needed for downstream use in human tasks, such as triage (document relevance judgments), extraction (factual question answering), and report writing, and in automated tasks, such as filtering, information extraction, and question answering? For example, when is an MT system good enough that a summarization system benefits from the additional information available in the MT output?

- How should metrics for MT and summarization be assessed and compared? What characteristics should a good metric possess? When is one evaluation method better than another? What are the most effective ways of assessing the correlation tests and statistical models that seek to predict human task performance or human notions of output quality (e.g., fluency and adequacy) from "cheaper" automatic metrics? How reliable are human judgments?

Anyone with an interest in MT or summarization evaluation research, or in issues pertaining to the combination of MT and summarization, is encouraged to participate in the workshop.
We are looking for research papers on the aforementioned topics, as well as position papers that identify limitations in current approaches and describe promising future research directions.

SHARED DATA SETS

To facilitate the comparison of different measures during the workshop, we will make data sets available in advance for workshop participants to test their approaches to evaluation. For details on accessing the data sets, please go to the workshop's website at http://www.isi.edu/~cyl/MTSE2005.

WORKSHOP FORMAT

The workshop will include presentations of research papers and short reports, an invited report on the TIDES 2005 multi-lingual, multi-document summarization evaluation, and significant discussion time to compare the results of different researchers. The workshop will conclude with a panel of invited discussants addressing future research directions.

TARGET AUDIENCE

The topic of this workshop should be of significant interest to the entire MT and summarization research communities, as well as to commercial developers of MT and summarization systems. It should be of particular interest to the program managers and participants of the MT and summarization programs funded by the US Government, where common evaluations are an integral part of the research program.

SUBMISSION INFORMATION

Submissions will consist of regular full papers, reports on evaluations using the shared data sets, and position papers, formatted following the ACL 2005 guidelines. Details for submission will be posted on the workshop website. The submission and review processes will be handled electronically.

IMPORTANT DATES

All submissions due: Mon, May 2, 2005
(We are aware that this submission deadline conflicts with a proposal deadline from a US funding agency. If this conflict causes you significant hardship, please contact the workshop organizers and let us know.)
Notification: Sun, May 22, 2005
Camera-ready papers due: Wed, June 1, 2005

ORGANIZERS

Jade Goldstein, jgstewa@afterlife.ncsc.mil, DoD, USA
Alon Lavie, alavie@cs.cmu.edu, LTI, CMU, USA
Chin-Yew Lin, cyl@isi.edu, Information Sciences Institute, USC, USA
Clare Voss, voss@arl.army.mil, Army Research Laboratory, USA

PROGRAM COMMITTEE

Yasuhiro Akiba (ATR, Japan)
Leslie Barrett (TransClick, USA)
Bonnie Dorr (U Maryland, USA)
Tony Hartley (U Leeds, UK)
John Henderson (MITRE, USA)
Chiori Hori (LTI, CMU, USA)
Eduard Hovy (ISI/USC, USA)
Doug Jones (MIT Lincoln Laboratory, USA)
Philipp Koehn (CSAIL, MIT, USA)
Marie-Francine Moens (Katholieke Universiteit Leuven, Belgium)
Hermann Ney (RWTH Aachen, Germany)
Franz Och (Google, USA)
Becky Passonneau (Columbia U, NY, USA)
Andrei Popescu-Belis (ISSCO/TIM/ETI, U Geneva, Switzerland)
Dragomir Radev (U Michigan, USA)
Karen Sparck Jones (Computer Laboratory, Cambridge U, UK)
Simone Teufel (Computer Laboratory, Cambridge U, UK)
Nicola Ueffing (RWTH Aachen, Germany)
Hans van Halteren (U Nijmegen, The Netherlands)
Michelle Vanni (ARL, USA)
Dekai Wu (HKUST, Hong Kong)