Thoughts and Lessons for Planning Rater Studies in AI


Planning and conducting rater studies has become a core activity in AI research as more and more tasks are highly ambiguous, depend on subjective opinions, or require specialized expertise. This is true across model types, from LLMs and multimodal models to purely generative image or video models, but also for standard classification tasks. Unfortunately, many engineers and researchers are not used to working with raters directly, in part because AI research was largely driven by standardized benchmarks over the last decades. Having been involved in various projects that required sophisticated rater studies, I want to use this article to share a recipe for how I — and many others I work with — go about successfully designing and conducting rater studies.

This recipe has five key parts that I learned to go through in this exact order:

  • Defining the goal
  • Using the right data
  • Designing the questionnaire
  • Recruiting the raters
  • Performing analysis

When planning a rater study, all of the above steps need to be addressed and thought through, including the analysis, even if there is no data to analyze yet. The most important and thus first step, however, is defining the rater study's goal.

Defining the Goal

In my experience, each rater study can address at most one primary goal. There might be secondary goals that can be addressed along the way, but these have to be aligned very well with the study's primary goal. For all the decisions that have to be made — selecting data, drafting questions/tasks, recruiting raters, etc. — there should ideally be only one goal informing them. This follows from a simple argument by contradiction: Suppose we are planning a rater study to address two equally important and sufficiently distinct goals; they have to be distinct, otherwise one could be formulated as a sub-goal of the other. This means we are essentially trying to conduct two studies. Their ideal designs might be similar, but they cannot be the same since the goals differ. If we could simulate the study, we would simply run both in parallel or in sequence and share as much logic as possible. With human raters we cannot do this, because humans — experts or not — come with many biases and limited attention. Trying to make raters perform two fairly different tasks in parallel or in sequence will almost surely reduce the quality of the labels we get for both tasks, by introducing unwanted biases or increasing error rates.

The primary goal of a rater study is usually defined by some need. Often, in AI, we aim to obtain labels for training or evaluating a single, fairly specific AI task. For example, for training, the task will also determine the format of the labels (e.g., class labels, pairwise comparisons, etc.); for evaluation, the task will often be informed by a claim to be made in a paper or a leaderboard to be built. Being able to write down the goal precisely and communicate it to colleagues is a good sign that the goal is clear. For evaluation, explicitly formalizing the metrics and labels is a good start; for training, the intended loss and label format should be clear. In some settings it may also be worth clarifying the goal with domain experts, product owners, or even potential raters. The key is to keep iterating until the goal can be written down in a clear, concise manner.

Using the Right Data

Now that the labeling goal is clear, including an idea of the labels to be obtained, it is time to think about how to select the right data to be labeled. For some tasks, the data is already given, for example when re-labeling or adding labels to a fixed training or test set. Oftentimes, however, selecting data is part of planning the rater study. This might include selecting data from a large set of available data, acquiring the required data, or synthetically generating it. For now, I will ignore the case where data needs to be acquired in the “real world” since that can be extremely involved by itself. Instead, I will consider cases where data can be selected from large databases or generated synthetically.

In this setting, data selection usually starts with budget considerations. At a high level, rater studies can usually be broken down into individual tasks. This could be one task for each selected example or one task for each label to be obtained; sometimes, for example for pairwise comparisons, a task might also involve two or more examples. Then, the number of tasks, the number of replications, the time per task, and the rater qualifications usually determine the price of a rater study. The latter two will be discussed below, leaving the number of tasks and the number of replications as the key considerations here; these determine how many examples should be selected. For a start, rough numbers are enough. It is more important to know whether to label 1000s out of 100000s or 10s out of 100s. The former will have to be automated, while the latter could be done manually by researchers, engineers, or domain experts.
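As a back-of-the-envelope sketch of this kind of budgeting, the snippet below computes total rater hours and cost under a simple per-hour cost model; all numbers and the cost model itself are hypothetical placeholders, not figures from any real study.

```python
# Rough budget estimate for a rater study; all numbers are hypothetical
# placeholders and the per-hour cost model is an assumption.

def estimate_budget(num_tasks, replications, minutes_per_task, hourly_rate):
    """Return total rater hours and cost for a simple per-hour cost model."""
    total_ratings = num_tasks * replications
    total_hours = total_ratings * minutes_per_task / 60.0
    return total_hours, total_hours * hourly_rate

# Example: 2,000 pairwise comparisons, 3 ratings each, 2 minutes per task.
hours, cost = estimate_budget(num_tasks=2000, replications=3,
                              minutes_per_task=2, hourly_rate=40.0)
print(f"{hours:.0f} rater hours, approx. ${cost:,.0f}")
```

Even rough numbers like these quickly show whether a handful of experts can label the data manually or whether the study needs a larger rater pool and more automation.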

Once the rough numbers are clear, the main question is which examples to select. In my experience, there are two key considerations: maximizing the information obtained from labeling by selecting “interesting” examples, and making sure no unwanted biases are introduced through this selection. What counts as “interesting” varies by task, but it often includes selecting difficult cases. These could be examples where AI models, auto-raters, or experts make mistakes or disagree with each other, or examples from a new domain where there is little training data and AI models are confidently wrong. On the other hand, it is important to think about how this selection may introduce biases into the obtained labels. This often boils down to keeping the distributions over important attributes of the examples fixed. For example, when labeling medical data, preserving the distribution over patient demographics may be important; otherwise, training or evaluating AI models with the newly labeled data would introduce an unwanted bias that may hurt performance for patient groups that are under-represented in the obtained labels. Similarly, for generative AI, it can happen that “interesting” prompts come from fairly specific domains, ignoring many others.

To be more concrete and pragmatic, selecting “interesting” examples is often about formulating appropriate heuristics that automatically sub-select examples. Continuously monitoring the attribute distributions of interest can then be done semi-manually by looking at histograms or key statistics, and the process can be refined to reach a rough target number of examples without introducing unwanted biases. These heuristics may be built on top of auto-raters, detectors of specific attributes for the modality of interest, or existing labels, among many other options; it is important to calibrate any thresholds or models towards high recall of interesting examples. Coming up with the right heuristics is the tricky bit of data selection and requires intuition and domain expertise. It also benefits from feedback and discussion to ensure that people agree on which examples are deemed interesting and to avoid introducing biases that are not obvious at first sight.
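A minimal sketch of this pattern is shown below, assuming hypothetical per-example fields such as an auto-rater disagreement score and a demographic attribute; real studies would plug in their own heuristics and attributes of interest.

```python
# Sketch: select "interesting" examples via a high-recall heuristic while
# monitoring an attribute distribution. The fields (disagreement, group)
# and the threshold are hypothetical placeholders.
from collections import Counter
import random

random.seed(0)
pool = [{"id": i,
         "disagreement": random.random(),   # e.g., auto-rater disagreement
         "group": random.choice("ABC")}     # e.g., a demographic attribute
        for i in range(100_000)]

# Calibrate the threshold towards high recall of interesting examples,
# i.e., err on the side of including too many rather than too few.
threshold = 0.7
candidates = [ex for ex in pool if ex["disagreement"] > threshold]

def distribution(examples):
    counts = Counter(ex["group"] for ex in examples)
    return {g: round(counts[g] / len(examples), 3) for g in sorted(counts)}

# Compare attribute distributions before and after selection to spot
# unwanted shifts introduced by the heuristic.
print("pool:      ", distribution(pool))
print("candidates:", distribution(candidates))
```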

After narrowing down data selection to a significantly smaller set of examples (for example, selecting 1000s from 100000s), it is worth running a more manual validation phase with engineers, researchers, or selected expert raters who double-check some of the heuristics. The idea is to get a sense of the accuracy of the heuristics used, which can then be refined and re-run to select the final set of examples to be labeled (for example, selecting 100s from 1000s). Often, this process involves randomness, in which case it is worth generating multiple candidate sets and then using manual inspection to pick or even adapt the right one. Ultimately, this will be the set of examples to be labeled.

Designing the Questionnaire

The questionnaire includes the instructions and the actual rating tasks. Usually this is a set of questions that, put together, address the primary goal of the rater study. Sometimes it can be as simple as a single question, for example, selecting a label in a well-defined classification task. More often than not, however, there will be auxiliary questions even in such simple cases. For example, these could be questions asking about confidence, or allowing the raters to flag images for quality or safety issues or because they do not fit any pre-specified class. On top of this, the instructions need to be as unambiguous as possible. For example, what is the rater supposed to select if an image fits two classes, even though this was not anticipated to happen? This illustrates that questionnaire design can be very complex, even for seemingly standard labeling tasks.

In my experience, the following process works well for coming up with the questions to be asked and how to ask them. First, write everything down as explicitly as possible, including the exact questions, the flow through the questions, and how the examples are revealed and shown. Here, I will not focus too much on how different modalities are best presented, as this can be quite complex by itself, e.g., for annotating videos, audio, or large volumes of text. Second, set up a mock questionnaire with a few concrete examples and work through it with colleagues, domain experts, and/or potential raters to get feedback. This is important because people understand instructions and questions differently, and a key goal of this step is to remove any confusion before launching the rater study. Third, keep track of the feedback and of any discussions and design decisions made along the way. I learned that similar issues come up repeatedly when gathering feedback, and it is important to make sure that design decisions are backed up by reasoning, references, data, or experiments.
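To illustrate the first step, the questionnaire can be written down as a small, explicit spec that colleagues can review and comment on before anything is implemented; the questions, options, and flow rule below are purely hypothetical.

```python
# Hypothetical questionnaire spec, written down explicitly so that the exact
# questions, answer options, and flow can be reviewed before implementation.
QUESTIONNAIRE = [
    {
        "id": "quality_flag",
        "text": "Is the image clear enough to be labeled?",
        "type": "single_select",
        "options": ["yes", "no"],
        "required": True,
        # Flow: if "no", skip the remaining questions for this example.
        "skip_rest_if": "no",
    },
    {
        "id": "label",
        "text": "Which class best describes the image?",
        "type": "single_select",
        "options": ["cat", "dog", "other / does not fit any class"],
        "required": True,
    },
    {
        "id": "confidence",
        "text": "How confident are you in your answer?",
        "type": "single_select",
        "options": ["low", "medium", "high"],
        "required": False,
    },
]
```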

The primary purpose of doing questionnaire design properly is to reduce the likelihood of introducing rater biases, rater errors and misunderstandings as much as possible. This is because, once the rater study is launched, even simple misunderstandings can easily render a significant fraction of your ratings useless. Because rater studies often involve significant budget and time investment, reducing these risks ultimately saves money and time.

Finally, the questionnaire needs to be implemented. In my opinion, it is always worth considering a range of options. At one extreme, some studies can be conducted with simple Google Forms or spreadsheets, especially if the questionnaire covers few examples with low replication, no rater tracking is required, and the rating task is rather simple. At the other extreme, there are rather complex tools, internal or from various vendors, that allow setting up custom UIs, complex interaction flows with various modalities, etc. Different options offer different trade-offs in terms of time investment, tool complexity, and fit to the study's goal. In all cases, testing is key. This means making sure that all questions are correctly marked as required or optional, that exactly the right number of answers can be selected (e.g., in multi-selects), that text fields allow all required special characters and sufficiently long text, that error messages are displayed correctly, etc. Again, the main goal is to reduce errors that come from raters misunderstanding the UI or from the form incorrectly processing inputs.
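To make the testing step concrete, here is a rough sketch of the kind of response validation worth covering with tests; the question ids and constraints are hypothetical and would mirror whatever the actual form enforces.

```python
# Sketch: validate rater responses against the questionnaire constraints.
# The question ids and rules below are hypothetical placeholders.
SPEC = {
    "label":   {"required": True,  "max_selected": 1},
    "issues":  {"required": False, "max_selected": 3},  # multi-select
    "comment": {"required": False, "max_length": 500},  # free text
}

def validate(response):
    """Return a list of validation errors for a single rater response."""
    errors = []
    for qid, rules in SPEC.items():
        answer = response.get(qid)
        if rules.get("required") and not answer:
            errors.append(f"{qid}: required but missing")
        if isinstance(answer, list) and "max_selected" in rules:
            if len(answer) > rules["max_selected"]:
                errors.append(f"{qid}: too many options selected")
        if isinstance(answer, str) and "max_length" in rules:
            if len(answer) > rules["max_length"]:
                errors.append(f"{qid}: text too long")
    return errors

print(validate({"label": ["cat"],
                "issues": ["blurry", "offensive", "dark", "cropped"]}))
# ['issues: too many options selected']
```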

Recruiting the Raters

The available raters are usually determined by internal rater pools or external providers. Requesting and onboarding new raters may take a significant amount of time and requires a fairly clear idea of how rater recruitment works. In many settings, providers have a reasonable pool of raters for fairly generic tasks where the average human is expected to perform well. However, there may still be questions around balancing demographics (countries, gender, etc.) or appropriate language skills and equipment (e.g., for audio tasks). In more complex or specialized domains such as safety or health, the requirements get more involved. In safety, for example, we often care about particularly diverse rater pools, and raters need to be aware of potentially offensive content. In health, requiring raters with specific certifications or specializations can make recruitment and scheduling more difficult as availability is lower. It also affects the budget, since more raters may be required for a given throughput (for example, per week or month) and specialized raters are generally more expensive. Also, instructions and questions may need to be tailored to the raters' experience and expectations.

As a first step, I usually think about the ideal rater profile in terms of skills, experience and availability. It is worth writing this down and discussing it. Often, this is informed by the type of evaluations or claims in a paper I want to make. Often, this exercise also reveals a prioritization of requirements. I also recommend thinking about no-gos, meaning the minimum requirements without which the rater study cannot go ahead. Then, starting discussions with providers will usually inform trade-offs to be made between requirements, budget, and availability.

A final issue that is easily neglected is the assignment of tasks to raters. In simple cases, this is not hugely important and tasks can just be assigned randomly. However, in other cases, having each rater go through all examples can be useful, for example, to get estimates of rater reliability.
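As a minimal sketch, random assignment with a fixed number of replications per task and the constraint that no rater sees the same task twice could look as follows; the numbers are hypothetical.

```python
# Sketch: randomly assign tasks to raters so that every task receives a fixed
# number of replications and no rater sees the same task twice.
import random

random.seed(0)
tasks = [f"task_{i}" for i in range(20)]
raters = [f"rater_{j}" for j in range(5)]
replications = 3

assignments = {r: [] for r in raters}
for task in tasks:
    # Pick distinct raters for each replication of this task.
    for rater in random.sample(raters, replications):
        assignments[rater].append(task)

for rater, assigned in assignments.items():
    print(rater, "->", len(assigned), "tasks")
```

For reliability estimates, a shared subset of tasks could instead be routed to every rater, trading some budget for overlap.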

Analysis

The analysis is the only part of the rater study that can easily be repeated and iterated on. Nevertheless, it is important to think about the analysis ahead of launching the rater study. This is to make sure that the ratings collected meet the rater study’s goal. For evaluation tasks, this often involves defining the metric of interest and determining how ratings are aggregated and filtered. For training tasks, this involves setting up losses and data pipelines. These simple steps can often reveal blind spots in the ratings to be obtained which in turn may inform all previous steps. This work can also be combined with a pilot study with very few participants to test everything end-to-end. Alternatively, it is easy to work with synthetic ratings (e.g., from one or multiple auto-raters or using random ratings).
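As a concrete dry run of this idea, the snippet below uses synthetic ratings together with majority-vote aggregation and a simple pairwise-agreement statistic; both are placeholder choices that a real study would replace with its actual metrics and aggregation rules.

```python
# Dry run of an analysis pipeline on synthetic ratings: majority-vote
# aggregation per example plus a simple pairwise-agreement statistic.
import random
from collections import Counter
from itertools import combinations

random.seed(0)
labels = ["A", "B", "C"]
# Synthetic ratings: 50 examples, 3 raters per example.
ratings = {i: [random.choice(labels) for _ in range(3)] for i in range(50)}

def majority_vote(votes):
    return Counter(votes).most_common(1)[0][0]

aggregated = {i: majority_vote(votes) for i, votes in ratings.items()}

def pairwise_agreement(votes):
    pairs = list(combinations(votes, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

mean_agreement = sum(pairwise_agreement(v) for v in ratings.values()) / len(ratings)
print(f"mean pairwise agreement: {mean_agreement:.2f}")
```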

Conclusion

Planning a rater study is a challenging endeavor and can be extremely technical. Rater studies are becoming a key component of many deployed systems and of technical reports around large models. Above, I outlined some lessons I learned about how to plan rater studies, focusing on five key elements: the rater study goal, the right data, the questionnaire, rater recruitment, and the analysis. Across all of these, collecting decisions and discussions in a central document can be incredibly useful. Beyond that, each element can grow into a rather significant and research-heavy sub-project of its own. Ultimately, I found working on rater studies to be an incredibly collaborative and iterative process that touches on a wide range of skill sets — both technical and non-technical.


