Human Evaluation

The practice of having humans judge AI outputs for quality, accuracy, helpfulness, and safety. Human evaluation remains the gold standard for assessing language models, as automated metrics often fail to capture nuanced aspects of output quality.
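As a minimal sketch of what a human evaluation pipeline might collect, the hypothetical Rating record and aggregate function below average 1-5 scores from several raters per model output across the criteria named above. The field names, rating scale, and aggregation by simple mean are illustrative assumptions, not a standard protocol.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical record: one human judgment of one model output,
# scored 1-5 on each criterion mentioned in the entry above.
@dataclass
class Rating:
    output_id: str
    rater_id: str
    quality: int
    accuracy: int
    helpfulness: int
    safety: int

def aggregate(ratings: list[Rating]) -> dict[str, dict[str, float]]:
    """Average each criterion per output across all human raters."""
    by_output: dict[str, list[Rating]] = {}
    for r in ratings:
        by_output.setdefault(r.output_id, []).append(r)
    return {
        oid: {
            "quality": mean(r.quality for r in rs),
            "accuracy": mean(r.accuracy for r in rs),
            "helpfulness": mean(r.helpfulness for r in rs),
            "safety": mean(r.safety for r in rs),
        }
        for oid, rs in by_output.items()
    }

if __name__ == "__main__":
    # Two raters score the same output; averaging smooths individual disagreement.
    ratings = [
        Rating("out-1", "rater-a", 4, 5, 4, 5),
        Rating("out-1", "rater-b", 3, 4, 4, 5),
        Rating("out-2", "rater-a", 2, 3, 2, 4),
    ]
    print(aggregate(ratings))
```

In practice, evaluations of this kind also report inter-rater agreement and often use pairwise comparisons (see Elo Rating below) rather than absolute scores.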

Related terms

Benchmark, Elo Rating, Red Teaming