Doesn’t it seem like there’s a new machine learning model introduced every week? That’s probably because there is.
From Sora to LLaMA-3 and Claude 2, models today come in all shapes and sizes, open source and off the shelf, with varying performance characteristics, cost implications, and rate limits. Each provider makes big promises to revolutionize the industry, and your business in particular.
But the reality is that model fatigue is setting in. Choosing a model today is like walking down the cereal aisle at the grocery store. We’re spoiled for choice, and choice is good. But unlike cereal, you can’t just throw a model away if you don’t like it. Investing in a technology takes resources and experimentation, and any mistakes could result in significant cost to your business.
This prompts a central question: how does any business know how a model is going to perform? Even if standard benchmarks are high, how do they know it’s right for their business? Well, they don’t. And herein lies the problem.
The Exhaustion of Having Endless Choices
We’re overwhelmed by the sheer number of options and what goes into choosing the right model for the job. This takes hard work. A business has to:
- Define Criteria: Understand your business needs and objectives. Identify the specific tasks and outcomes you intend to achieve with the model. Clearly define what successful model performance looks like for each task, and establish parameters for acceptable results and behaviors to ensure the model aligns with your expectations.
- Narrow Down Your Model Options: Filter models based on their function, complexity, and suitability for your specific tasks. Consider models that have established track records for tasks similar to yours, such as coding-specific models for software development.
- Gather/Curate Data: Collect data that simulates the typical interactions your model will handle. If necessary, generate synthetic data to ensure it aligns with your evaluation criteria.
- Run Evaluations: Test each shortlisted model against your defined criteria. Experiment with different model and prompt combinations to obtain the most comprehensive results. (A minimal harness for this step is sketched just after this list.)
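To make that last step concrete, here is a minimal evaluation-harness sketch in Python. Everything in it is a placeholder rather than any particular provider's API: the `call_model()` stub stands in for your SDK of choice, and the single test case stands in for your curated data and success criteria.

```python
# Minimal model-evaluation harness sketch. All names are placeholders:
# call_model() stands in for your provider's SDK, and the test data
# stands in for your curated, task-specific examples.
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestCase:
    prompt: str                    # input simulating real traffic
    passes: Callable[[str], bool]  # your "acceptable result" criterion

def evaluate(model_id: str,
             call_model: Callable[[str, str], str],
             cases: list[TestCase]) -> float:
    """Return the fraction of curated test cases a model passes."""
    hits = sum(1 for c in cases if c.passes(call_model(model_id, c.prompt)))
    return hits / len(cases)

def call_model(model_id: str, prompt: str) -> str:
    # Stub standing in for a real API call; replace with your own client.
    return "negative"

cases = [
    TestCase(
        prompt="Classify the sentiment of: 'The checkout flow keeps failing.'",
        passes=lambda out: "negative" in out.lower(),
    ),
]

# Score every model on your shortlist against the same data.
for model_id in ["model-a", "model-b"]:  # hypothetical shortlist
    print(model_id, evaluate(model_id, call_model, cases))
```

The point is less the code than the discipline: every shortlisted model gets scored on the same curated cases against the same pass criteria, so the comparison stays apples to apples.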
And that's just scratching the surface. There's quite a bit more that goes into making the right choice.
The Evaluation Dilemma
Evaluating new models is no simple task. It requires a deep understanding of the model’s architecture, the data it was trained on, and its performance on relevant benchmarks. But even with this knowledge, there’s no guarantee that a model will seamlessly integrate into your existing infrastructure or meet your business needs.
The process is time-consuming and resource-intensive, and if not approached systematically, can easily lead to dead ends. For example:
- What if none of these models meet my success criteria?
- What if the prompt I perfected for model A turns out to be useless for model B? (Not every prompt is successful for every LLM; see the sketch after this list.)
- Do I now need to fine-tune my own model to get the results I want?
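On the prompt question, one way to keep that failure mode visible is to treat prompts as per-model artifacts rather than shared ones. A minimal sketch, where the task names, model IDs, and prompt text are all hypothetical:

```python
# Sketch of a per-(task, model) prompt registry. Task names, model IDs,
# and prompt text are all hypothetical.
PROMPTS: dict[tuple[str, str], str] = {
    ("summarize", "model-a"): "Summarize the following text in two sentences:\n{text}",
    ("summarize", "model-b"): (
        "You are a concise editor. Reply with exactly two sentences "
        "covering the key points of:\n{text}"
    ),
}

def prompt_for(task: str, model: str) -> str:
    """Look up the prompt that was actually evaluated for this pair."""
    try:
        return PROMPTS[(task, model)]
    except KeyError:
        raise KeyError(f"no evaluated prompt for {task!r} on {model!r}")
```

Failing loudly on an unevaluated pair is the design choice here: swapping in a new model forces a fresh evaluation pass instead of quietly reusing a prompt tuned for something else.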
At this point, it’d be easy to understand if a company regrets having gone down this path at all.
It’s Not About the Model; It’s About Your Data
While it's easy to be dazzled by the latest and greatest, the newest model isn't always the most effective solution for your unique use case.
Bottom line: customizability is more important than raw capability. Just because benchmarks (which are not based on your organization's data) show that a model performs better than its predecessor doesn't mean it will actually perform well for you.
Novelty doesn't guarantee compatibility with your data, nor does it ensure the model will scale and actually drive meaningful business outcomes.
That’s why it is absolutely critical to follow the steps outlined above before making any significant investment. You need to understand what the objective is first and go from there. Failing to lay the groundwork could render the model evaluation phase meaningless.
In the end, what really matters is the outcomes for your app and your customers; work backwards from there. Curate the best data specific to your task and measure success against that alone. Generic benchmarks won't give you the answers you need to make the right choice.
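As a sketch of what "measure success against that alone" can look like in practice (the golden examples, the predictor stub, and exact-match scoring are placeholder assumptions; your real criteria may be fuzzier):

```python
# Sketch of scoring a model against your own curated golden set instead
# of a generic benchmark. Examples and the normalize() rule are
# placeholders for your task-specific data and criteria.
golden = [
    # (input drawn from real traffic, expected output)
    ("Order #123 arrived broken", "escalate"),
    ("How do I reset my password?", "self_serve"),
]

def normalize(text: str) -> str:
    return text.strip().lower()

def task_accuracy(predict, examples) -> float:
    """Exact-match accuracy on your curated golden set."""
    hits = sum(normalize(predict(x)) == normalize(y) for x, y in examples)
    return hits / len(examples)

# Usage: predict() wraps whatever model you are evaluating.
print(task_accuracy(lambda x: "escalate", golden))  # stub predictor -> 0.5
```

No leaderboard number enters the picture; the only score that moves the decision comes from your own data.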
About the author: Luis Ceze is the CEO and co-founder of OctoAI and a computer science professor at the University of Washington.
Related Items:
Coming to Grips with Prompt Lock-In
Birds Aren’t Real. And Neither Is MLOps