How do we design AI models, such as large language models, that behave in accordance with human intentions and social, legal, and ethical principles? That question is at the heart of the field of research known as AI alignment.
Today, one widely adopted approach to AI alignment involves collecting examples of possible AI outputs, asking humans to annotate which output is better, and training the model to follow those preferences. But this approach fails to capture the rationale behind preference decisions: why some outputs are better than others. Annotators must decide how to balance multiple goals, such as accuracy, protecting privacy, or avoiding harm, often without clear guidance on which principle should win when they conflict.
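The article doesn't tie the work to any one training recipe, but the preference-based approach it describes typically reduces to a pairwise objective: a reward model is trained so that the output annotators preferred scores higher than the one they rejected. Below is a minimal, illustrative sketch of that Bradley-Terry-style loss; the toy linear reward model and random embeddings are assumptions for demonstration, not the systems studied in the paper.

```python
import numpy as np

# Toy stand-in for a reward model: a linear scorer over fixed-size
# embeddings. (Real reward models are fine-tuned language models.)
rng = np.random.default_rng(0)
w = rng.normal(size=8)

def reward(embedding: np.ndarray) -> float:
    """Scalar score for one candidate output."""
    return float(w @ embedding)

def preference_loss(chosen: np.ndarray, rejected: np.ndarray) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    Minimizing it pushes the annotator-preferred output's score above
    the rejected output's score."""
    margin = reward(chosen) - reward(rejected)
    return float(np.logaddexp(0.0, -margin))  # numerically stable form

# One annotated comparison: the human preferred output A over output B.
emb_a, emb_b = rng.normal(size=8), rng.normal(size=8)
print(f"pairwise loss: {preference_loss(emb_a, emb_b):.3f}")
```

Note what this objective never sees: the annotator's reason for the choice. That missing rationale is exactly the gap the SEAS team set out to examine.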
Researchers at the Harvard John A. Paulson School of Engineering and Applied Sciences (SEAS) have developed new tools to examine this largely invisible force shaping modern AI systems.
The research, led by Flavio du Pin Calmon, the Thomas D. Cabot Associate Professor of Electrical Engineering at SEAS, won the best paper award at the New England NLP Meeting Series.
Drawing on legal theories of judicial discretion, the team demonstrated that some degree of judgment is unavoidable when AI is trained to follow broad principles such as “protect privacy” or “avoid harm.” But by analyzing widely used safety datasets, they found that this discretion is currently exercised with little structure or oversight. Principles frequently conflict, annotators often override what the written rules would suggest, and different models learn strikingly different ways of prioritizing values like helpfulness, safety, and freedom of expression.
To make this process interpretable, the researchers introduced new quantitative metrics and tools to compare human annotators, reward models, and state-of-the-art language models, uncovering large gaps in how each applies the same stated principles of alignment.
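The article doesn't spell out the metrics themselves, but the simplest version of the idea is to measure how often two judges, say a human annotator and a reward model, prefer the same output on the same comparison pairs, corrected for the agreement expected by chance. The sketch below computes raw agreement and Cohen's kappa on hypothetical verdicts; the data and function names are illustrative assumptions, not the paper's actual metrics.

```python
from collections import Counter

def agreement_and_kappa(labels_a, labels_b):
    """Raw agreement rate and Cohen's kappa between two judges who
    each picked a preferred output ('A' or 'B') for the same pairs."""
    assert len(labels_a) == len(labels_b) > 0
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each judge's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / n**2
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa

# Hypothetical verdicts: a human annotator vs. a reward model on 8 pairs.
human = ["A", "A", "B", "A", "B", "B", "A", "A"]
model = ["A", "B", "B", "A", "B", "A", "A", "B"]
obs, kappa = agreement_and_kappa(human, model)
print(f"agreement = {obs:.2f}, kappa = {kappa:.2f}")
```

Low agreement on the same stated principles, whether between two humans or between a human and a model, is one concrete way the "large gaps" the researchers describe would show up in practice.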
The researchers argued that to make AI systems more transparent, accountable, and trustworthy, AI developers and regulators need to treat alignment more like a legal system, with explicit rules for how principles are applied, documentation of discretionary choices in datasets, and mechanisms for human review and correction.
Read the full paper here.