One of the most advanced artificial intelligence (AI) technologies of recent years has been the language model (LM), which is the subject of this study. The aim is to benchmark many LMs against one another in order to improve the transparency of these models. The results are intended to provide a fuller characterization of LMs, rather than focusing on a single aspect, so as to increase their positive societal impact.

This study develops a framework for designing an LM benchmark, centered on scenarios, models, and metrics, which taxonomizes the vast design space of language model evaluation into scenarios and metrics.

For a set of core scenarios, the benchmark comprehensively measures seven major metrics: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. Existing LMs are then evaluated under the standardized conditions of the benchmark, ensuring that models can be directly compared across many scenarios and metrics.
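Abstractly, such a benchmark fills a grid of scenarios by metrics for each model. The following minimal Python sketch illustrates the idea; the scenario names, the toy model, and the single accuracy metric are all hypothetical placeholders, not the benchmark's actual contents.

```python
def accuracy(predictions, references):
    # Fraction of predictions that exactly match the reference answer.
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Each scenario pairs inputs with reference answers (toy data for illustration).
scenarios = {
    "question_answering": (["2+2=?"], ["4"]),
    "sentiment": (["great movie"], ["positive"]),
}

def toy_model(text):
    # A "model" here is just a function from input text to output text.
    return {"2+2=?": "4", "great movie": "positive"}.get(text, "")

def evaluate(model, scenarios, metrics):
    # Fill the (scenario, metric) grid so models can be compared cell by cell.
    results = {}
    for name, (inputs, refs) in scenarios.items():
        preds = [model(x) for x in inputs]
        results[name] = {m.__name__: m(preds, refs) for m in metrics}
    return results

results = evaluate(toy_model, scenarios, [accuracy])
```

Running the same grid over several models yields directly comparable per-scenario, per-metric scores, which is the standardized comparison the benchmark aims for.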

The results are organized along the proposed scenario, adaptation, and metric dimensions, providing a roadmap for how to evaluate language models. They should be read not as a complete evaluation but as a step toward the design of more sophisticated benchmarks, and they aim to raise awareness of the importance of developing benchmarks for AI models.

Keywords: language models, artificial intelligence, transformers, benchmark, natural language processing, transparency, neural networks.