Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

EigenBench: A Comparative Behavioral Measure of Value Alignment

Created by
  • Haebom

Author

Jonathn Chang, Leonhard Piff, Suvadip Sana, Jasmine X. Li, Lionel Levine

Outline

EigenBench is a novel benchmarking methodology for the value alignment problem in AI. To address the lack of quantitative metrics, it proposes a black-box approach that comparatively evaluates the degree of value alignment across language models. It takes as input an ensemble of models, a constitution describing a value system, and a scenario dataset, and outputs a vector of scores quantifying each model's alignment with the given constitution. Each model evaluates the outputs of the other models across the scenarios, and the EigenTrust algorithm aggregates these judgments into a score reflecting the weighted average judgment of the entire ensemble. The method is designed to quantify traits on which even reasonable judges may disagree, without relying on ground-truth labels. Experiments using prompted personas to test the sensitivity of EigenBench scores to models versus prompts showed that most of the variance is explained by the prompts, while small residuals quantify the inherent biases of the models themselves.
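The aggregation step described above can be sketched with a small power iteration. This is an illustrative implementation of the standard EigenTrust recurrence (row-normalize the judgment matrix, then iterate to its principal left eigenvector), not the paper's actual code; the function name and the example judgment values are assumptions.

```python
import numpy as np

def eigentrust(judgments, tol=1e-9, max_iter=1000):
    """Aggregate pairwise judgment scores into global scores.

    judgments[i][j] >= 0: how favorably judge i rated model j's outputs.
    Returns the stationary score vector, i.e. the principal left
    eigenvector of the row-normalized judgment matrix.
    """
    C = np.asarray(judgments, dtype=float)
    C = C / C.sum(axis=1, keepdims=True)  # each judge's ratings sum to 1
    t = np.full(C.shape[0], 1.0 / C.shape[0])  # start from a uniform prior
    for _ in range(max_iter):
        t_next = C.T @ t  # re-weight each judge's ratings by its own score
        if np.linalg.norm(t_next - t, 1) < tol:
            break
        t = t_next
    return t

# Hypothetical 3-model ensemble: model i never rates itself (diagonal 0).
scores = eigentrust([[0, 2, 1],
                     [1, 0, 3],
                     [2, 2, 0]])
```

The fixed point gives more weight to judgments made by models that are themselves rated highly, which is how a single score can reflect the ensemble's weighted consensus rather than a flat average.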

Takeaways, Limitations

Takeaways:
  • A new method for quantitatively measuring the value alignment of AI
  • A black-box approach that does not rely on ground-truth labels
  • Suggests the possibility of measuring a model's own value propensities
Limitations:
  • The influence of the prompt appears to be greater than that of the model (raising questions about how accurately the model's own value propensities are measured)
  • The nature of the EigenTrust algorithm may make the results difficult to interpret
  • Generalizability needs to be verified across diverse value systems and scenarios