Benchmark
A standardized test or dataset used to measure and compare AI model performance. Common LLM benchmarks include MMLU (general knowledge), HumanEval (code generation), and GSM8K (grade-school math). Benchmarks provide objective, comparable metrics, but high scores may not fully reflect real-world usefulness.
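At its core, benchmarking means running a model over a fixed set of inputs with known reference answers and reporting an aggregate score. A minimal sketch of that loop, using a hypothetical toy dataset and a canned stand-in for a model (not any real benchmark or API), computing exact-match accuracy:

```python
# Toy benchmark: a fixed list of (question, reference answer) items.
# Both the dataset and the model below are hypothetical stand-ins.
toy_benchmark = [
    {"question": "2 + 2 = ?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
    {"question": "Largest planet?", "answer": "Jupiter"},
]

def toy_model(question: str) -> str:
    """Stand-in for an LLM call; returns canned predictions."""
    canned = {
        "2 + 2 = ?": "4",
        "Capital of France?": "Paris",
        "Largest planet?": "Saturn",  # deliberately wrong
    }
    return canned.get(question, "")

def evaluate(model, benchmark) -> float:
    """Score a model on a benchmark: fraction of exact-match answers."""
    correct = sum(model(item["question"]) == item["answer"] for item in benchmark)
    return correct / len(benchmark)

accuracy = evaluate(toy_model, toy_benchmark)
print(f"accuracy = {accuracy:.2f}")  # 2 of 3 answers match
```

Real benchmarks differ mainly in scale and scoring: GSM8K-style math is often graded by extracting the final number, and HumanEval-style coding by executing the generated code against unit tests rather than by string match.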