BAGEL Statistical NLG - Training and Evaluation Corpora


BAGEL is a fully trainable language generator based on dynamic Bayesian networks. It can be trained on semantically aligned utterances in the target application domain. Training utterances can be collected iteratively using uncertainty-based active learning, by searching for the semantic input that yields the lowest confidence score under BAGEL's current model. BAGEL was shown to generate natural and informative utterances from unseen input semantics (Mairesse et al., 2010).
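The active-learning query step described above can be sketched as uncertainty sampling: pick the candidate semantic input that the current model is least confident about. A minimal sketch follows; the `confidence` function here is a hypothetical stand-in for BAGEL's model score, and the toy scoring rule is purely illustrative.

```python
def select_query(candidates, confidence):
    """Uncertainty sampling: return the semantic input the model is least
    confident about, so it can be sent to a human annotator next.

    `confidence` is a hypothetical placeholder for the generator's
    confidence score; any real implementation would query the trained model.
    """
    return min(candidates, key=confidence)

# Toy example: pretend longer dialogue acts are harder for the model,
# i.e. they receive a lower confidence score.
acts = [
    "inform(food(Chinese))",
    "inform(area(centre),food(Chinese),pricerange(cheap))",
]
hardest = select_query(acts, confidence=lambda da: -len(da))
```

In each round, the selected input is annotated by a human and added to the training set before the model is retrained and the pool is re-scored.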

You can download the 404 annotated utterances used for training and automated evaluation in the ACL paper (2 paraphrases for each of 202 distinct dialogue acts), produced and aligned by 42 untrained annotators using Amazon Mechanical Turk. The participants first generated an utterance from an abstract semantic representation as in this form, and they were then asked to align the semantic concepts using this interface. The utterances were manually corrected to remove alignment inconsistencies and spelling errors. The resulting semantic stacks are represented between brackets before the labelled phrase, e.g. [food+Chinese] represents the stack inform(food(Chinese)) (the inform symbol is omitted in the data). Concepts are pushed onto the stack from the right. The dialogue act representation follows the CUED dialogue act scheme. Non-enumerable values such as place names are replaced by an X in the abstract dialogue act definition (ABSTRACT_DA), as well as in the utterance. Results reported in the paper were obtained through a 10-fold cross-validation.
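The bracketed-stack annotation described above can be read back into (stack, phrase) pairs with a short parser. This is a minimal sketch assuming one aligned utterance per line, with each phrase preceded by its semantic stack in brackets and stack concepts separated by `+`; the exact file layout may differ from this assumption.

```python
import re

def parse_aligned_utterance(line):
    """Split a BAGEL-style aligned utterance into (semantic stack, phrase) pairs.

    Each phrase is preceded by its semantic stack in brackets, with concepts
    pushed from the right, e.g. "[food+Chinese] serves Chinese food".
    The inform() act is omitted in the data, so it is not reconstructed here.
    """
    pairs = []
    for match in re.finditer(r'\[([^\]]+)\]\s*([^\[]+)', line):
        stack = match.group(1).split('+')   # e.g. ['food', 'Chinese']
        phrase = match.group(2).strip()
        pairs.append((stack, phrase))
    return pairs

# Non-enumerable values appear as X in both the dialogue act and the utterance.
example = "[name+X] X [food+Chinese] serves Chinese food"
print(parse_aligned_utterance(example))
```

Note that non-enumerable slot values surface as the literal token X on both sides of the alignment, so a generator trained on this data substitutes the concrete value at realisation time.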

You can also download the naturalness and informativeness ratings of the generated utterances used for the human evaluation of the different learning configurations (the full training set does not include the test fold, i.e. models were trained on 90% of the training corpus). The evaluation data was collected from 18 native speakers using Amazon Mechanical Turk, evaluating 8 utterances for each of the 202 dialogue acts. AL_N refers to models trained on N utterances collected using active learning, RAND indicates random sampling, and GOLD indicates an utterance produced by a human. Ratings range from 1 to 5, with 1=bad and 5=excellent. More details can be found in the ACL paper.
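Given ratings in the form described above, per-configuration means are straightforward to compute. This is a minimal sketch assuming each row holds a configuration label (AL_N, RAND, or GOLD) plus naturalness and informativeness ratings on the 1-5 scale; the row layout and the sample values are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (configuration, naturalness, informativeness), one per
# judged utterance. Labels follow the corpus: AL_<N>, RAND, or GOLD.
ratings = [
    ("AL_100", 4, 5),
    ("AL_100", 3, 4),
    ("RAND", 2, 3),
    ("GOLD", 5, 5),
]

def mean_by_config(rows):
    """Return {configuration: (mean naturalness, mean informativeness)}."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for config, nat, inf in rows:
        s = sums[config]
        s[0] += nat
        s[1] += inf
        s[2] += 1
    return {c: (nat / n, inf / n) for c, (nat, inf, n) in sums.items()}

print(mean_by_config(ratings))
```

Comparing the AL_N means against RAND at matching training-set sizes is the natural way to reproduce the paper's active-learning versus random-sampling comparison, with GOLD as a human upper bound.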





Francois Mairesse, 2010