When Labeling AI Chatbots, Context Is a Double-Edged Sword

Authors:
(1) Clemencia Siro, University of Amsterdam, Amsterdam, The Netherlands;
(2) Mohammad Aliannejadi, University of Amsterdam, Amsterdam, The Netherlands;
(3) Maarten de Rijke, University of Amsterdam, Amsterdam, The Netherlands.
Table of Links
Abstract and 1 Introduction
2 Methodology and 2.1 Experimental data and tasks
2.2 Automatic generation of diverse dialogue contexts
2.3 Crowdsource experiments
2.4 Experimental conditions
2.5 Participants
3 Results and Analysis and 3.1 Data statistics
3.2 RQ1: Effect of varying amount of dialogue context
3.3 RQ2: Effect of automatically generated dialogue context
4 Discussion and Implications
5 Related Work
6 Conclusion, Limitations, and Ethical Considerations
7 Acknowledgements and References
A. Appendix
Abstract
Crowdsourced labels play a crucial role in evaluating task-oriented dialogue systems (TDSs). Obtaining high-quality and consistent ground-truth labels from annotators presents challenges. When evaluating a TDS, annotators must fully comprehend the dialogue before providing judgments. Previous studies suggest using only a portion of the dialogue context in the annotation process. However, the impact of this limitation on label quality remains unexplored. This study investigates the influence of dialogue context on annotation quality, considering truncated context for relevance and usefulness labeling. We further propose using large language models (LLMs) to summarize the dialogue context into a rich yet concise description, and we study how doing so affects annotators' performance. We find that reducing context leads to more positive ratings. Conversely, providing the entire dialogue context yields higher-quality relevance ratings but introduces ambiguity in usefulness ratings. Using the first user utterance as context produces ratings consistent with those obtained using the entire dialogue, at significantly reduced annotation effort. Our findings show how task design, particularly the availability of dialogue context, affects the quality and consistency of crowdsourced evaluation labels.[1]
1 Introduction
With recent advances in pre-trained language models and large language models (LLMs), task-oriented dialogue systems (TDSs) have redefined how people seek information, presenting a more natural approach for users to engage with information sources (Budzianowski and Vulić, 2019; Wu et al., 2020). As TDSs become increasingly integral to information-seeking processes, the question of how to accurately and effectively evaluate their performance becomes critical. Due to the poor correlation of automatic metrics with human-generated labels (Deriu et al., 2021), evaluation of TDSs has shifted towards relying on user ratings or crowdsourced labels as ground-truth measures (Li et al., 2019).
Various crowdsourcing techniques have been employed to collect ground-truth labels, such as sequential labeling (Sun et al., 2021), where annotators go through the dialogue and annotate the utterances one by one. This approach introduces certain risks into the annotation process, such as annotator fatigue and high cognitive load in very long dialogues, since annotators must remember and track the state of the dialogue as they annotate the utterances (Siro et al., 2022). While following and understanding the dialogue context is crucial and can influence annotators' ratings, reading and comprehending very long dialogues can degrade their performance.
To address this issue, another line of research proposes to randomly sample only a few utterances from each dialogue for annotation (Mehri and Eskenazi, 2020; Siro et al., 2022, 2023). While this addresses the high cognitive load and fatigue, limiting annotators' understanding of the dialogue poses obvious risks, such as unreliable and biased labels (Schmitt and Ultes, 2015; Siro et al., 2022). In particular, the amount of dialogue context can introduce biases: annotators who lack rich context may unintentionally lean towards positive or negative ratings, neglecting the broader quality of the response. Thus, offering annotators too little context risks misleading judgments, potentially leading to inaccurate or inconsistent labels. Conversely, flooding annotators with excessive information can overwhelm them, yielding diminishing returns in label quality.
Prior work has investigated factors that affect the quality and consistency of crowdsourced evaluation labels, including annotator characteristics, task design, cognitive load, and evaluation protocols (see, e.g., Parmar et al., 2023; Roitero et al., 2021, 2020; Santhanam et al., 2020). However, no previous work has studied the effect of random sampling and the number of sampled utterances on annotation quality.
In this study, we aim to address this research gap by investigating how different amounts of contextual information impact the quality and consistency of crowdsourced labels for TDSs, contributing to an understanding of the impact of such design choices. We crowdsource labels for two major evaluation aspects, relevance and usefulness, and compare annotation quality under different dialogue context truncation strategies.
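To make the truncation setup concrete, the sketch below shows one way such conditions could be constructed from a dialogue represented as a list of utterances. This is an illustrative assumption, not the authors' implementation: the helper build_context, the example dialogue, and the mapping of turn counts to condition names are hypothetical.

```python
# Hypothetical sketch of building truncated dialogue contexts for annotation.
from typing import Dict, List


def build_context(dialogue: List[Dict[str, str]], target_idx: int, n_turns: int) -> List[Dict[str, str]]:
    """Return the last `n_turns` utterances preceding the target system response.

    n_turns = 0 shows no prior context; a large value (e.g., 7) approximates
    the full dialogue history for typical dialogue lengths.
    """
    start = max(0, target_idx - n_turns)
    return dialogue[start:target_idx]


# Example dialogue: alternating user/system utterances (made up for illustration).
dialogue = [
    {"speaker": "user", "text": "I need a cheap Italian restaurant in the centre."},
    {"speaker": "system", "text": "Pizza Hut City Centre is a cheap Italian option."},
    {"speaker": "user", "text": "Can you book a table for two at 19:00?"},
    {"speaker": "system", "text": "Your table for two at 19:00 is booked."},
]

# Annotators judging the last system response (index 3) would see, e.g.:
no_context = build_context(dialogue, target_idx=3, n_turns=0)    # no-context-style condition
full_context = build_context(dialogue, target_idx=3, n_turns=7)  # full-context-style condition
```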
To address the challenge of insufficient context at the turn level, we propose using heuristic methods and LLMs to generate the user's information need and a summary of the dialogue. LLMs can play the role of annotation assistants (Faggioli et al., 2023) by summarizing the dialogue history, enabling annotators to understand the dialogue context more efficiently and effectively before annotating an utterance. To this end, we use GPT-4 for dialogue context summarization and compare annotators' performance under different conditions and context sizes. Through these experiments, we answer two main questions: (RQ1) How does varying the amount of dialogue context affect the crowdsourced evaluation of TDSs? (RQ2) Can the consistency of crowdsourced labels be improved with automatically generated supplementary context?
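As an illustration of this summarization step, the following sketch shows how an LLM could be prompted to produce a short dialogue summary to show annotators in place of a long history, assuming the OpenAI chat completions client. The prompt wording, parameters, and the helper summarize_dialogue_context are hypothetical and do not reproduce the authors' exact setup.

```python
# Illustrative sketch: summarize the dialogue history with GPT-4 before annotation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def summarize_dialogue_context(history: list[str]) -> str:
    """Ask the model for a short summary of the dialogue so far, stating the
    user's information need and what has already been resolved."""
    prompt = (
        "Summarize the following task-oriented dialogue in 2-3 sentences, "
        "stating the user's information need and what has been resolved so far:\n\n"
        + "\n".join(history)
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```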
Our findings reveal that the availability of previous dialogue context significantly influences annotators' ratings, with a noticeable impact on their quality. Without prior context, annotators tend to assign more positive ratings to system responses, possibly due to insufficient evidence for penalization, introducing a positivity bias. In contrast, presenting the entire dialogue context yields higher relevance ratings. For usefulness, presenting the entire dialogue context introduces ambiguity and slightly lowers annotator agreement. This highlights the delicate balance required in the amount of contextual information provided for evaluation. The inclusion of automatically generated dialogue context enhances annotator agreement in the no-context (C0) condition while reducing annotation time compared to the full-context (C7) condition, offering an ideal balance between annotator effort and performance.
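This section does not state which agreement measure underlies these observations. As a generic illustration only, chance-corrected agreement between two annotators' ratings can be computed with Cohen's kappa via scikit-learn; the ratings below are made up.

```python
# Illustration: chance-corrected agreement between two annotators' ratings.
from sklearn.metrics import cohen_kappa_score

annotator_a = [2, 1, 2, 0, 1, 2]  # hypothetical relevance ratings per response
annotator_b = [2, 1, 1, 0, 1, 2]

print(cohen_kappa_score(annotator_a, annotator_b))  # kappa close to 1 = strong agreement
```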
Our findings extend to other task-oriented conversational tasks, such as conversational search and preference elicitation, both of which rely on crowdsourced experiments to assess system performance.
[1] To foster research in this area, we release our data publicly at https://github.com/Clemenciah/Effects-of-Dialogue-Context