As companies race to implement generative AI, concerns about the accuracy and safety of large language models (LLMs) threaten to derail widespread enterprise adoption. Stepping into the fray is Patronus AI, a San Francisco startup that just raised $17 million in Series A funding to automatically detect costly — and potentially dangerous — LLM mistakes at scale.
The round, which brings Patronus AI’s total funding to $20 million, was led by Glenn Solomon at Notable Capital, with participation from Lightspeed Venture Partners, former DoorDash executive Gokul Rajaram, Factorial Capital, Datadog, and several unnamed tech executives.
Founded by former Meta machine learning (ML) experts Anand Kannappan and Rebecca Qian, Patronus AI has developed a first-of-its-kind automated evaluation platform that promises to identify errors like hallucinations, copyright infringement and safety violations in LLM outputs. Using proprietary AI, the system scores model performance, stress-tests models with adversarial examples and enables granular benchmarking — all without the manual effort required by most enterprises today.
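For readers unfamiliar with how such evaluation pipelines tend to be structured, here is a minimal sketch of the general pattern: run each model output through a battery of automated checks and record pass/fail scores. Every name and check below is a hypothetical illustration, not Patronus AI's actual product or API.

```python
# Illustrative sketch of an automated LLM evaluation loop.
# NOT Patronus AI's product or API; names and checks are hypothetical.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    prompt: str
    output: str
    check_name: str
    passed: bool

def evaluate(model: Callable[[str], str], test_cases: list[dict],
             checks: dict[str, Callable[[dict, str], bool]]) -> list[EvalResult]:
    """Run a model over test cases and score each output with every check."""
    results = []
    for case in test_cases:
        output = model(case["prompt"])
        for name, check in checks.items():
            results.append(EvalResult(case["prompt"], output, name, check(case, output)))
    return results

# Example check: flag numeric claims that never appear in the grounding document.
def grounded_numbers(case: dict, output: str) -> bool:
    return all(tok in case["source"] for tok in output.split() if tok.isdigit())
```

In practice, the interesting engineering lives inside the checks; the value a vendor adds is in how reliably those checks catch real failures like hallucinations or off-brand tone.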
Exposing the dark side of generative AI: hallucinations, copyright violations and safety risks
“There’s a range of things that our product is actually really good at being able to catch, in terms of mistakes,” said Kannappan, CEO of Patronus AI, in an interview with VentureBeat. “It includes things like hallucinations, and copyright and safety related risks, as well as a lot of enterprise-specific capabilities around things like style and tone of voice of the brand.”
The emergence of powerful LLMs like OpenAI’s GPT-4o and Meta’s Llama 3 has set off an arms race in Silicon Valley to capitalize on the technology’s generative abilities. But as hype cycles accelerate, so too have high-profile model failures, from news site CNET publishing error-riddled AI-generated articles to drug discovery startups retracting research papers based on LLM-hallucinated molecules.
These public missteps only scratch the surface of broader issues endemic to the current crop of LLMs, Patronus AI claims. The company’s previously published research, including the “CopyrightCatcher” API released three months ago and the “FinanceBench” benchmark unveiled six months ago, reveals startling deficiencies in leading models’ ability to accurately answer questions grounded in fact.
FinanceBench and CopyrightCatcher: Patronus AI’s groundbreaking research reveals LLM deficiencies
For its “FinanceBench” benchmark, Patronus tasked models like GPT-4 with answering financial queries based on public SEC filings. Shockingly, the best-performing model answered only 19% of questions correctly after ingesting an entire annual report. A separate experiment with Patronus’s new “CopyrightCatcher” API found open-source LLMs reproducing copyrighted text verbatim in 44% of outputs.
“Even state-of-the-art models were hallucinating and only got like 90% of responses correct in finance settings,” explained Qian, who serves as CTO. “Our research has shown that open source models had over 20% unsafe responses in many high priority areas of harm. And copyright infringement is a huge risk — large publishers, media companies, or anyone using LLMs needs to be concerned.”
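Patronus AI has not published how CopyrightCatcher works internally, but the underlying task of spotting verbatim reproduction can be approximated with a simple n-gram overlap check. The sketch below is a hedged illustration, assuming an 8-word window and a corpus of protected passages; it is not the company's method.

```python
# Illustrative n-gram overlap check for verbatim reproduction.
# NOT CopyrightCatcher's actual method; the 8-word window is an assumption.
def ngrams(text: str, n: int = 8) -> set[str]:
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def reproduces_verbatim(output: str, protected_texts: list[str], n: int = 8) -> bool:
    """True if any n-word sequence in the output appears verbatim in a protected text."""
    out_grams = ngrams(output, n)
    return any(out_grams & ngrams(doc, n) for doc in protected_texts)
```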
(Editor’s note: Coincidentally, Patronus’s CTO Rebecca Qian will be speaking at our AI Impact Tour event in New York City on June 5. Come learn the latest strategies and technologies for model evaluation and auditing, and network with your peers.)
While a handful of other startups like Credo AI, Weights & Biases and Robust Intelligence are building tools for LLM evaluation, Patronus believes its research-first approach leveraging the founders’ deep expertise sets it apart. The core technology is based on training dedicated evaluation models that reliably surface edge cases where a given LLM is likely to fail.
“No other company right now has the research and technology at the level of depth that we have as a company,” Kannappan said. “What’s really unique about how we’ve approached everything is our research-first approach — that’s in the form of training evaluation models, developing new alignment techniques, publishing research papers.”
This strategy has already gained traction: several Fortune 500 companies spanning automotive, education, finance and software are using Patronus AI to deploy LLMs “safely within their organizations,” per the startup, though it declined to name specific customers. With the fresh capital, Patronus plans to scale up its research, engineering and sales teams while developing additional industry benchmarks.
If Patronus achieves its vision, rigorous automated evaluation of LLMs could become table stakes for enterprises looking to deploy the technology, in the same way security audits paved the way for widespread cloud adoption. Qian sees a future where testing models with Patronus is as commonplace as unit-testing code.
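Taken literally, that analogy would put model checks alongside ordinary tests in a CI pipeline. A minimal sketch of what that could look like with pytest follows; the call_model stub and the assertions are hypothetical stand-ins, not Patronus AI's interface.

```python
# Hypothetical pytest-style model tests; call_model stands in for any LLM client.
import pytest

def call_model(prompt: str) -> str:
    # Placeholder: swap in a real model call.
    return f"Stub response to: {prompt}"

@pytest.mark.parametrize("prompt", [
    "Summarize our Q3 earnings call.",
    "Draft a refund email in our brand voice.",
])
def test_output_is_nonempty_and_on_policy(prompt):
    output = call_model(prompt)
    assert output.strip(), "model returned an empty response"
    assert "as an ai language model" not in output.lower()  # toy safety/style check
```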
“Our platform is domain-agnostic and so the evaluation technology that we build can be extended to any domain, whether that’s legal, healthcare or others,” she said. “We want to enable enterprises across every industry to leverage the power of LLMs while having assurance the models are safe and aligned with their specific use case requirements.”
Still, given the black-box nature of foundation models and near-endless space of possible outputs, conclusively validating an LLM’s performance remains an open challenge. By advancing the state-of-the-art in AI evaluation, Patronus aims to accelerate the path to accountable real-world deployment.
“Measuring LLM performance in an automated way is really difficult and that’s just because there’s such a wide space of behavior, given that these models are generative by nature,” acknowledged Kannappan. “But through a research-driven approach, we’re able to catch mistakes in a very reliable and scalable way that manual testing fundamentally cannot.”