About the Evaluation category

This subforum is dedicated to measurement and benchmarking. Discussions may include datasets, benchmarks, evaluation methodologies, reproducibility, comparisons between models or systems, and critiques of existing metrics.