Enhancing Foundation Model Robustness with Human-Centric Comparative Oversight Mechanisms
Institution: Stanford Department of Computer Science
Timeline: September 2023-Present
Project Background:
The development of foundation models has led to transformative applications across various fields, but these models have limitations. One significant challenge is the inadequacy of human feedback mechanisms for refining model behavior. Current practices such as Reinforcement Learning from Human Feedback (RLHF) often rely on binary judgments, an overly simplistic representation of human preferences. This rudimentary form of feedback also makes it difficult to address issues of bias, fairness, and lack of transparency. Recent studies indicate that human feedback is more nuanced than binary judgments can capture, highlighting a gap in the validity and reliability of such approaches, and recent work has shown that both "inter-rater agreement rates" and "intra-method agreement rates" are suboptimal under these conventional feedback methods.
Project Goals:
This project proposes new mechanisms for learning from human feedback with guarantees of truthfulness. The proposed mechanisms will enhance robustness and provide new methods for AI oversight. To overcome the limitations above, we suggest a multi-stage evaluation approach, inspired by work on amplification and drawing on mechanism design ideas such as the ESP game. This method encourages truthful reporting of subjective data and permits the elicitation of rich, natural-language critiques. In essence, we aim to implement comparative oversight as a more nuanced, human-centered approach to improving the robustness and trustworthiness of foundation models.
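To make the agreement idea concrete, the following Python sketch shows an ESP-game-style matching reward applied to comparative judgments: two evaluators independently state which of two samples they prefer, attach a free-form critique, and receive a bonus only when their preferences coincide. The Judgment structure, reward values, and function names here are illustrative assumptions, not the mechanism the project will ultimately specify.

# Minimal sketch of an ESP-game-style agreement reward for comparative
# judgments. The reward values and the Judgment structure are illustrative
# assumptions, not the project's final mechanism design.
from dataclasses import dataclass

@dataclass
class Judgment:
    preferred: str   # "A" or "B": which of the two samples the evaluator prefers
    critique: str    # free-form natural-language justification

def agreement_reward(j1: Judgment, j2: Judgment,
                     match_bonus: float = 1.0, base_pay: float = 0.2) -> float:
    """Pay a base rate for participation plus a bonus only when the two
    independent preferences coincide, as in the ESP game's matching rule."""
    return base_pay + (match_bonus if j1.preferred == j2.preferred else 0.0)

# Example: two evaluators compare the same pair of model outputs.
r1 = Judgment(preferred="A", critique="A answers the question directly.")
r2 = Judgment(preferred="A", critique="A is more concise and accurate.")
print(agreement_reward(r1, r2))  # 1.2: base pay plus the matching bonus

Because the bonus depends only on agreement with another independent evaluator, honest reporting of one's actual preference is a natural strategy under this kind of rule, which is the property the project seeks to verify empirically.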
Methods:
The project will be conducted in two phases: simulation development and a pilot study. In the first phase, we will develop basic simulations that use GPT to generate synthetic samples for comparative assessment. The goal is to create an initial codebase and a baseline for oversight tasks, which will allow us to refine the evaluation prompts and experimental design. In the second phase, we will conduct a pilot study on Amazon Mechanical Turk to test our oversight agreement mechanism with real human evaluators. We will recruit participants to provide comparative reviews of GPT-generated samples, as well as oversight judgments on the agreement between reviews. The pilot study will manipulate the preference elicitation method as the independent variable in a between-subjects design. We will gather inter-rater agreement and intra-method agreement as the key dependent variables for evaluating the proposed method.
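As an illustration of how the dependent variables might be computed, the following Python sketch estimates inter-rater agreement as the mean pairwise Cohen's kappa within each elicitation condition. The condition names, rater IDs, and judgment coding are hypothetical placeholders, not the project's actual data schema.

# Sketch of the planned agreement analysis, assuming each evaluator's
# comparative judgments on a shared item set are coded as categorical labels
# ("A", "B", or "tie"). All data below are hypothetical placeholders.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings: condition -> {rater_id: judgments on the same items}
ratings = {
    "binary_choice": {
        "r1": ["A", "B", "A", "A", "B"],
        "r2": ["A", "B", "B", "A", "B"],
    },
    "comparative_critique": {
        "r1": ["A", "B", "A", "tie", "B"],
        "r2": ["A", "B", "A", "A", "B"],
    },
}

def mean_pairwise_kappa(raters):
    """Average Cohen's kappa over all rater pairs (inter-rater agreement)."""
    scores = [
        cohen_kappa_score(raters[a], raters[b])
        for a, b in combinations(raters, 2)
    ]
    return sum(scores) / len(scores)

for condition, raters in ratings.items():
    print(f"{condition}: mean pairwise kappa = {mean_pairwise_kappa(raters):.2f}")

Intra-method agreement could be estimated analogously by comparing repeated judgments from the same evaluator under the same elicitation method, using the same kappa computation.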