[Stanford Ph.D. Thesis] Language Model Design and Evaluation for Human-Computer Interaction


Source: Zhuanzhi (专知)
This article is an introduction to the thesis; suggested reading time: 5 minutes.
The thesis focuses on designing and evaluating LMs for human-computer interaction.


https://searchworks.stanford.edu/view/14784050

Although language models (LMs) are ubiquitous in real-world applications (e.g., web search, text auto-completion, and content generation), most LMs are neither optimized for human interaction nor evaluated in that regard. To address this gap, this thesis focuses on designing and evaluating LMs for human-computer interaction. We first focus on a specific need that authors encounter in the revision process: coming up with content given the surrounding context. To support this need, we propose a training method that enables any pre-trained LM to fill in blanks (text infilling), better facilitating human-computer interaction. Second, we build CoAuthor, a platform for capturing traces of human-LM interactions. With CoAuthor, we show how to collect large-scale interaction datasets and analyze these traces, providing unique insights into the capabilities of LMs in language, ideation, and collaboration. Third, we propose a new evaluation framework, Human-AI Language-based Interaction Evaluation (HALIE), which defines the components of interactive systems and metrics for human-LM interaction tasks beyond writing. Finally, we discuss open challenges and future directions in this field.
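To make the infilling setup concrete, here is a minimal sketch of how such a training example might be constructed, in the spirit of infilling by language modeling (Donahue et al., 2020); the special token strings and the helper function are illustrative assumptions, not the thesis's exact implementation.

```python
# Minimal sketch: turning plain text into an infilling training example,
# in the spirit of infilling by language modeling (Donahue et al., 2020).
# The tokens "[blank]", "[sep]", and "[answer]" are illustrative; the
# actual tokens and format in the thesis may differ.

def make_infilling_example(text: str, spans: list[tuple[int, int]]) -> str:
    """Replace each (start, end) span with a [blank] token, then append
    the removed spans after a [sep] token, each terminated by [answer]."""
    masked_parts = []
    answers = []
    prev_end = 0
    for start, end in sorted(spans):
        masked_parts.append(text[prev_end:start])
        masked_parts.append("[blank]")
        answers.append(text[start:end] + "[answer]")
        prev_end = end
    masked_parts.append(text[prev_end:])
    return "".join(masked_parts) + "[sep]" + "".join(answers)

example = make_infilling_example(
    "She ate leftover pasta for lunch.",
    [(8, 22)],  # mask "leftover pasta"
)
print(example)
# She ate [blank] for lunch.[sep]leftover pasta[answer]
```

Because the masked text is followed by its answers in a single string, an ordinary left-to-right LM can be fine-tuned on such examples to condition on both sides of a blank, without any architectural change.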

Writing a dissertation in a rapidly changing environment is a special endeavor. The field of natural language processing (NLP) is going through an era of constant change and innovation, and this thesis aims to capture a snapshot of the field while examining one timeless quality of this shifting landscape: designing and evaluating language models (LMs). LMs have, to say the least, grown tremendously since I started my Ph.D. research. In 2017, the most common way to build an LM was to choose a specific task, collect a custom dataset, design a custom model, and train that model from scratch, as I demonstrated in my first project (Lee et al., 2019). By 2023, even people without prior knowledge of NLP or programming could quickly "build" and interact with LMs to perform a wide range of tasks by prompting pre-trained LMs through APIs or simple user interfaces, as I demonstrated in my follow-up projects (Lee et al., 2022a,b; Bommasani et al., 2023).

However, despite the unprecedented capabilities and widespread applications of recent LMs (Radford et al., 2019; Brown et al., 2020; Rae et al., 2021; Zhang et al., 2022; Chowdhery et al., 2022; Lieber et al., 2021; OpenAI, 2022, 2023), most existing LM research in NLP focuses on non-interactive scenarios: given an input text, the model generates an output text, and only the quality of the output matters. In such cases, human involvement is either ignored or limited to specific purposes or forms, such as human evaluation of model outputs (Ribeiro et al., 2020; Kiela et al., 2021) or constrained interactions like dialogue (Paranjape et al., 2020; Thoppilan et al., 2022; Shuster et al., 2022). Almost all benchmarks, even those that incorporate diverse tasks (Gehrmann et al., 2021; Hendrycks et al., 2021; Liang et al., 2022), take this non-interactive view. In contrast, a central thesis of my work is to place interaction at the forefront of LM design and evaluation. Taking question answering as an example, instead of building a model that works in isolation (i.e., taking predefined questions as input and comparing the output with predefined answers in static benchmarks), I focus on interactive scenarios, in which users engage in an iterative process of writing questions, querying the model, interpreting and processing its output, tailoring their questions to that output, and gradually adapting their strategies as they learn about the model. My work on story writing follows a similar philosophy (Lee et al., 2022a). The LM I aim to develop is not one that generates an entire story by itself (Fig. 1.1a), but one that augments and supports our writing process (Fig. 1.1b), for instance by generating parts of a story for users to choose from and adjust. This interactive use of LMs is consistent with the vision of intelligence augmentation articulated by Engelbart (1962) and Skagestad (1993).
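To make the contrast with static benchmarks concrete, below is a minimal sketch of such an interactive question-answering loop; `query_lm` is a hypothetical placeholder for whatever LM API the system uses, not a real client.

```python
import time

def query_lm(prompt: str) -> str:
    """Hypothetical placeholder for an LM call; in practice this would
    hit an LM API. Returns a canned response so the sketch runs."""
    return f"(model response to: {prompt!r})"

def interactive_qa_session() -> list[dict]:
    """A static benchmark scores one (question, answer) pair; an
    interactive session is a sequence of events in which the user asks,
    reads the output, and reformulates. Logging every event turns the
    session itself into an analyzable interaction trace."""
    trace = []
    while True:
        question = input("question (empty line to stop): ").strip()
        if not question:
            break
        answer = query_lm(question)
        print(answer)
        trace.append({
            "time": time.time(),
            "question": question,  # may be a refinement of the last one
            "answer": answer,
        })
    return trace
```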

For the human-computer interaction (HCI) community, recent LMs present exciting opportunities for novel interaction design. We are starting to see many applications and prototypes that use LMs for rapid prototyping and for designing novel natural language interactions (Calderwood et al., 2020; Buschek et al., 2021; Wang et al., 2021; Chen et al., 2021; Chakrabarty et al., 2022; Ippolito et al., 2022; Valencia et al., 2023). To study the generative capabilities of LMs, the most traditional approach in HCI is contextual inquiry: inviting and interviewing users (Calderwood et al., 2020; Clark et al., 2018b; Gero and Chilton, 2019; Wu et al., 2020, 2022; Yang et al., 2019a). However, because contextual inquiry is time- and resource-intensive, it is better suited to capturing subjective interpretations of LM capabilities than to covering diverse contexts.

At the heart of my research are interaction traces: the sequences of events that unfold during interactions between human users and LMs (Fig. 1.1b). These traces comprise various behaviors, including key presses, cursor movements, system queries, and navigation through system suggestions. They carry rich information, capture the dynamics of human-LM interaction, and provide insight into the capabilities of LMs in interactive scenarios. For example, by examining the frequency of user queries, we can quantify how much users rely on LMs and how helpful LM responses are. Interaction traces also reveal the strategies users adopt when interacting with LMs, as well as the temporal properties of those interactions. Last but not least, interaction traces enable coverage of diverse contexts: designers can capture human-LM interactions at scale once, then reuse and replay them many times for analysis.
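As a sketch of what working with interaction traces can look like, consider the simplified event record and reliance metric below; the field and event names are assumptions for illustration, not the actual CoAuthor schema.

```python
from dataclasses import dataclass

@dataclass
class Event:
    """One step in an interaction trace. Field and event names here are
    simplified assumptions, not the exact CoAuthor schema."""
    timestamp: float   # seconds since session start
    name: str          # e.g., "text-insert", "suggestion-query",
                       # "suggestion-accept", "cursor-move"
    payload: str = ""  # inserted text, chosen suggestion, etc.

def reliance_on_lm(trace: list[Event]) -> float:
    """Quantify how much a user leaned on the LM: the fraction of
    suggestion queries whose results were accepted."""
    queries = sum(1 for e in trace if e.name == "suggestion-query")
    accepts = sum(1 for e in trace if e.name == "suggestion-accept")
    return accepts / queries if queries else 0.0
```

Because a trace is just data, the same recorded session can be replayed against many such metrics after the fact, which is what makes large-scale reuse possible.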

I believe that by leveraging these interaction traces, the NLP and HCI communities can devise more targeted and user-centered approaches to LM development and deployment.

This thesis comprises the following chapters:

•  Chapter 2 builds the foundational understanding for subsequent chapters by providing background on language models (LMs), HCI, and the design space of HCI in writing.

•  Chapter 3 delves into a specific interactive context, the revision process in writing, and focuses on a user need that most LMs cannot directly address. Specifically, we propose a training method that enables LMs to fill in blanks (i.e., text infilling).

•  Chapter 4 introduces CoAuthor, a platform designed to capture and analyze human-LM interactions in collaborative writing. The platform facilitates the collection of interaction traces, resulting in a rich and reproducible dataset. Using this dataset, I show how examining these traces yields invaluable insights into the capabilities of LMs in areas such as language, ideation, and collaboration.

•  Chapter 5 proposes a new evaluation framework, Human-AI Language-based Interaction Evaluation (HALIE), which defines the fundamental components of an interactive system and introduces new metrics for evaluating human-LM interaction on tasks beyond writing (a sketch of one such process-level metric follows this chapter list). The framework covers a wider range of interaction scenarios, enabling a fuller understanding and evaluation of LM performance across situations.

•  Chapter 6 discusses open challenges in the field of HCI to stimulate further research and innovation.
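To illustrate the difference between output-only evaluation and the process-level metrics referenced in the Chapter 5 bullet above, here is a hedged sketch of one such metric; the event names are illustrative assumptions rather than HALIE's actual schema.

```python
# Sketch of a process-level metric in the spirit of HALIE: instead of
# scoring only the final text, measure a property of the interaction
# itself. Event names ("user-insert", "suggestion-accept") are
# illustrative assumptions, not HALIE's actual schema.

def model_contribution(events: list[dict]) -> float:
    """Fraction of characters in the final document that originated
    from accepted model suggestions rather than user typing.
    (Ignores deletions for simplicity.)"""
    user_chars = sum(len(e["text"]) for e in events
                     if e["name"] == "user-insert")
    model_chars = sum(len(e["text"]) for e in events
                      if e["name"] == "suggestion-accept")
    total = user_chars + model_chars
    return model_chars / total if total else 0.0

session = [
    {"name": "user-insert", "text": "Once upon a time"},
    {"name": "suggestion-accept", "text": ", a robot learned to write."},
]
print(f"{model_contribution(session):.0%} of characters came from the LM")
```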

Parts of this thesis have been published at academic conferences. Chapter 3 is based on Donahue et al. (2020), presented at the Annual Meeting of the Association for Computational Linguistics (ACL) 2020. Chapter 4 is based on Lee et al. (2022a), presented at the ACM Conference on Human Factors in Computing Systems (CHI) 2022. Chapter 5 is based on Lee et al. (2022b), which is currently under review.


Source: blog.csdn.net/tMb8Z9Vdm66wH68VX1/article/details/131989707