Since the launch of ChatGPT, a succession of new large language models (LLMs) and updates have emerged, each claiming to offer unparalleled performance and capabilities. However, these claims can be subjective, as the results are often based on internal testing tailored to controlled environments. This has created a need for a standardized method to measure and compare the performance of different LLMs.
Anthropic, a leading AI safety and research company, is launching a program to fund the development of new benchmarks capable of independently evaluating the performance of AI models, including its own GenAI model Claude.
The Amazon-funded AI company is ready to offer funding and access to its domain experts to any third-party organization that develops a reliable method to measure advanced capabilities in AI models. To get started, Anthropic has appointed a full-time program coordinator. The company is also open to investing in or acquiring projects that it believes have the potential to scale.
The call for third-party benchmarks for AI models is not new. Several companies, including Patronus AI, are looking to fill the gap. However, no benchmark for AI models has yet gained industry-wide acceptance.
The existing benchmarks used for AI testing have been criticized for their lack of real-world relevance, as they often fail to evaluate how an average person would use a model in everyday situations.
Benchmarks can also be optimized for specific tasks, giving a poor picture of overall LLM performance, and the static datasets they rely on make it difficult to assess a model’s long-term performance and adaptability. Moreover, most benchmarks focus on LLM performance and lack the ability to evaluate the risks posed by AI.
“Our investment in these evaluations is intended to elevate the entire field of AI safety, providing valuable tools that benefit the whole ecosystem,” Anthropic wrote on its official blog. “We’re seeking evaluations that help us measure the AI Safety Levels (ASLs) defined in our Responsible Scaling Policy. These levels determine the safety and security requirements for models with specific capabilities.”
Anthropic’s announcement of its plans to create independent, third-party benchmark tests comes on the heels of the launch of the Claude 3.5 Sonnet LLM, which Anthropic claims beats other leading models on the market, including GPT-4o and Llama-400B.
However, Anthropic’s claims are based on its own internal evaluations rather than independent third-party testing. The company did collaborate with external experts during testing, but that does not equate to independent verification of its performance claims. This is a primary reason the startup wants a new generation of reliable benchmarks it can use to demonstrate that its LLMs are the best in the business.
According to Anthropic, one of its key objectives for the independent benchmarks is a method to assess an AI model’s capacity for malicious activity, such as carrying out cyber attacks, enabling social manipulation, and posing national security risks. It also wants to develop an “early warning system” for identifying and assessing such risks.
Additionally, the startup wants the new benchmarks to evaluate an AI model’s potential for scientific innovation and discovery, its ability to converse in multiple languages, to self-censor toxicity, and to mitigate the biases inherent in its system.
While Anthropic wants to facilitate the development of independent GenAI benchmarks, it remains to be seen whether other key AI players, such as Google and OpenAI, will be willing to join forces or accept the new benchmarks as an industry standard.
Anthropic shared in its blog that it wants the AI benchmarks to use certain AI safety classifications that it developed internally with some input from third-party researchers. This means that developers of the new benchmarks could be compelled to adopt definitions of AI safety that may not align with their own viewpoints.
However, Anthropic is adamant that there is a need to take the initiative to develop benchmarks that could at least serve as a starting point for more comprehensive and widely accepted GenAI benchmarks in the future.