Training and Evaluation Corpora for the BAGEL Statistical Language Generator
BAGEL is a fully trainable language generator based on dynamic Bayesian networks. It can be trained on semantically-aligned utterances in the target application domain. These training utterances can be queried iteratively using uncertainty-based active learning, by searching for the semantic input yielding the lowest confidence score according to BAGEL's model. BAGEL was shown to generate natural and informative utterances from unseen input semantics (Mairesse et al., 2010).
You can download the 404 annotated
utterances used for training and automated evaluation in the
ACL paper (2 paraphrases for 202 distinct dialogue acts), produced and
aligned by 42 untrained annotators using Amazon Mechanical Turk. The
participants first generated an utterance from an abstract semantic
representation, and they were then asked to align the semantic
concepts using a dedicated annotation interface. The utterances
were manually corrected to avoid alignment inconsistencies and
spelling errors. The resulting semantic stacks are represented between
brackets before the labelled phrase, e.g. [food+Chinese]
represents the stack inform(food(Chinese))
(the inform symbol is omitted in the data). Concepts are
pushed on the right. The dialogue act representation follows the CUED
dialogue act scheme. Non-enumerable values such as place names are
replaced by an X in the abstract dialogue act definition
(ABSTRACT_DA), as well as in the utterance. Results reported in the paper were obtained through a
10-fold cross-validation.
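The bracketed annotation scheme above can be read mechanically. The sketch below, which assumes the aligned utterances are stored as plain-text lines in the format described (the exact file layout of the release may differ), splits an annotated line into (semantic stack, phrase) pairs:

```python
import re

def parse_aligned_utterance(line):
    """Split a BAGEL-style aligned utterance into (stack, phrase) pairs.

    Each semantic stack appears between brackets before the phrase it
    labels, e.g. "[food+Chinese] Chinese food". Concepts are pushed on
    the right, so ['food', 'Chinese'] stands for inform(food(Chinese)),
    with the inform symbol omitted as in the released data.
    NOTE: this line format is an assumption; adapt it to the actual files.
    """
    pairs = []
    for match in re.finditer(r'\[([^\]]+)\]\s*([^\[]*)', line):
        stack = match.group(1).split('+')  # concept stack, bottom first
        phrase = match.group(2).strip()    # surface phrase it labels
        pairs.append((stack, phrase))
    return pairs

# Example: one labelled phrase with a two-concept stack.
print(parse_aligned_utterance("[food+Chinese] Chinese food"))
# → [(['food', 'Chinese'], 'Chinese food')]
```

Non-enumerable values appear as X in both the stack and the phrase, so the same parser applies to abstract dialogue acts.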
You can also download the naturalness and
informativeness ratings of the generated utterances used for the human evaluation of the
different learning configurations (the full training set does not
include the test fold, i.e. models were trained on 90% of the training
corpus). The evaluation data was collected from 18 native speakers
using Amazon Mechanical Turk, evaluating 8 utterances for each of the 202
dialogue acts. AL_N refers to models trained on N utterances collected
using active learning, RAND indicates
random sampling, and GOLD indicates an utterance produced by a
human. Ratings range from 1 to 5, with 1=bad and 5=excellent. More details can be found in
the ACL paper.
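To compare learning configurations from the rating file, one would typically average the 1-5 ratings per condition (AL_N, RAND, GOLD). A minimal sketch, assuming the ratings can be read as (condition, rating) pairs (the actual column layout of the released file may differ):

```python
from collections import defaultdict

def mean_ratings(rows):
    """Average ratings per system condition.

    rows: iterable of (condition, rating) pairs, where condition is a
    label such as 'AL_100', 'RAND' or 'GOLD' and rating is an integer
    from 1 (bad) to 5 (excellent). The input shape is illustrative,
    not the exact format of the released rating file.
    """
    totals = defaultdict(lambda: [0, 0])  # condition -> [sum, count]
    for condition, rating in rows:
        totals[condition][0] += rating
        totals[condition][1] += 1
    return {c: s / n for c, (s, n) in totals.items()}

# Toy ratings, not taken from the real evaluation data.
ratings = [("GOLD", 5), ("GOLD", 4), ("AL_100", 3), ("RAND", 2)]
print(mean_ratings(ratings))
# → {'GOLD': 4.5, 'AL_100': 3.0, 'RAND': 2.0}
```

With the real data, each of the 202 dialogue acts contributes 8 rated utterances, so the per-condition means can also be computed per dialogue act before averaging.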
Francois Mairesse, 2010