Patronus AI Inc., a startup that builds tools for companies to detect and fix reliability issues in their large language models, today announced the launch of a small but mighty AI model that can evaluate and judge the accuracy of much larger models.
The company calls its model Glider, a 3.8 billion parameter open-source LLM designed to be a fast, flexible judge for AI language models. The company said it’s the smallest model to date to outperform competing models such as OpenAI’s GPT-4o-mini, which is commonly used as an evaluator.
Large language model evaluation is the process of assessing how well an LLM performs particular tasks, such as text generation, comprehension and question answering, by measuring accuracy, coherence and relevance against set standards. This helps AI developers and engineers understand how a model will behave in given circumstances and identify its strengths and weaknesses before it is released to the public.
“Our new model challenges the assumption that only large-scale models (30B+ parameters) can deliver robust and explainable evaluations,” said Rebecca Qian, co-founder and chief technology officer of Patronus. “By demonstrating that smaller models can achieve similar results, we’re setting a new benchmark for the community.”
Relying on proprietary LLMs such as GPT-4 to evaluate the performance of pre-trained LLMs comes with several issues, Patronus said, including high cost and a lack of transparency. According to the company, Glider addresses this by giving developers and engineers a small, explainable “LLM-as-a-judge” solution that delivers real-time evaluation scores while walking through its reasoning.
Glider’s small size also means it can run on-premises or on-device, so companies do not need to send their sensitive data to any third party. That matters at a time when companies are increasingly aware of the potential privacy implications of cloud-hosted models.
During evaluations, Glider provides high-quality reasoning chains in addition to scores for each of its criteria, presented as understandable bullet-point lists that explain its process. As a result, each score comes with the “why” behind it, allowing developers to see the context the model weighed and what caught its attention.
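The judge workflow the article describes can be sketched as a prompt that bundles a criterion, a scoring rubric and the text under review, then asks the model to reason in bullet points before scoring. The field names, layout and rubric below are hypothetical illustrations, not Glider’s actual prompt format:

```python
# Illustrative "LLM-as-a-judge" prompt builder. The prompt layout, field
# names and rubric values here are hypothetical, not Glider's real format.

def build_judge_prompt(criterion: str, rubric: dict[int, str],
                       user_input: str, model_output: str) -> str:
    """Assemble a prompt asking a judge model to score one response on one
    criterion and to explain its reasoning as bullet points first."""
    rubric_lines = "\n".join(
        f"{score}: {desc}" for score, desc in sorted(rubric.items())
    )
    return (
        f"You are an evaluator. Criterion: {criterion}\n"
        f"Scoring rubric:\n{rubric_lines}\n"
        f"User input:\n{user_input}\n"
        f"Model output to evaluate:\n{model_output}\n"
        "First list your reasoning as bullet points, then give a final score."
    )

prompt = build_judge_prompt(
    criterion="fluency",
    rubric={1: "Disfluent or garbled.", 3: "Mostly fluent.", 5: "Fully fluent."},
    user_input="Summarize the report in one sentence.",
    model_output="The report finds revenue grew 12% year over year.",
)
```

The same prompt skeleton works for any criterion, which is how a single judge model can cover both factual and subjective metrics.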
The company said the model is trained on 183 real-world evaluation criteria across 685 domains, which enables it to handle tasks that require factual accuracy as well as subjective, human-like metrics such as fluency and coherence, making the model versatile across creative and business applications.
Its judgment system evaluates not just model outputs, but also user inputs, context, metadata and more.
“By combining speed, versatility and explainability with an open-source approach, we’re enabling organizations to deploy powerful guardrail systems without sacrificing cost-efficiency or privacy,” said co-founder and Chief Executive Anand Kannappan. “It’s a significant contribution to the AI community, proving that smaller models can drive big innovations.”
Patronus said that by providing an open-source model that supports on-premises deployment, Glider can serve multiple evaluation use cases, including acting as an LLM guardrail that evaluates and catches bad model behavior, or providing real-time subjective text analysis.
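The guardrail use case amounts to gating a response on a judge’s score. A minimal sketch, where `judge` stands in for any evaluator (such as Glider) that returns a score and a reasoning string, and the threshold is an illustrative value:

```python
# Minimal guardrail sketch: release a response only if an evaluator's
# score clears a threshold. The judge interface and threshold here are
# hypothetical, not Patronus' actual API.
from typing import Callable, Tuple

def guardrail(response: str,
              judge: Callable[[str], Tuple[float, str]],
              threshold: float = 0.7) -> str:
    """Pass the response through if the judge's score meets the threshold;
    otherwise return a blocked notice carrying the judge's reasoning."""
    score, reasoning = judge(response)
    if score >= threshold:
        return response
    return f"[blocked: score {score:.2f} below threshold] {reasoning}"

# Toy judge that flags an overconfident, unverifiable claim.
def toy_judge(text: str) -> Tuple[float, str]:
    if "guaranteed" in text.lower():
        return 0.2, "Overconfident, unverifiable claim."
    return 0.9, "No issues found."

blocked = guardrail("Returns are guaranteed every year.", toy_judge)
allowed = guardrail("Past performance does not predict returns.", toy_judge)
```

Because the judge also returns its reasoning, a blocked response carries an explanation rather than a bare rejection.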