In the age of large language models (LLMs), enterprises are racing to deploy the best possible model for their applications. The task sounds simple, but most organizations hit one key roadblock: how do they identify what works best for their highly specific use cases when the ecosystem is evolving so rapidly?
Well, Not Diamond, a new startup emerging from stealth today, claims the answer lies in smart routing.
The San Francisco-based startup has developed a novel LLM router, which allows enterprises to have multiple models in play and direct queries to the best one, improving not only the quality of outputs but also other usage-critical aspects such as overall latency and associated costs.
“Our fundamental bet is that the future won’t have one single, giant model or company that everyone sends everything to—instead, there will be many foundation models, millions of fine-tuned variants of those models, and countless custom inference engines running on top of them. We started Not Diamond to enable this multi-model future, starting with the world’s most powerful infrastructure for routing between models,” Tomás Hernando Kofman, the CEO and co-founder of Not Diamond, said in a statement.
Importantly, even though the company is very new, it is garnering significant attention. It has raised $2.3 million in initial funding from defy.vc and several leading names in the AI industry, including Google DeepMind chief scientist Jeff Dean, Hugging Face’s Julien Chaumond, OpenAI’s Zack Kass, Databricks chairman Ion Stoica, GitHub’s Tom Preston-Werner and LinkedIn’s Jeff Weiner.
The LLM cost vs. task-specific performance dilemma
The current ecosystem of large language models is a complex one. Every model, whether open source or proprietary, has its own strengths and weaknesses.
If you go for a model with a massive context length and high performance, there’s a good chance it will cost too much.
On the other hand, an affordable model might lack a relevant capability, or its latency might be too high.
On top of that, new models are being added to the mix every passing day, and old ones are being updated with significant improvements (Llama 3.1 just showed how capable open-source AI can be).
How Not Diamond helps enterprises
Kofman, who was building a no-code AI product, struggled with the LLM dilemma himself. He envisioned the solution in an interface that could help enterprises tap into a network of different specialized models — rather than relying on one single model.
This led him to team up with fellow ML colleagues Tze-Yang Tung and Jeffrey Akiki and launch Not Diamond with the mission of building the infrastructure for intelligently routing queries between models.
“Robust routing infrastructure will be critical to maximizing the effectiveness of AI systems… Small, specialized models can outperform larger models on narrow domains, and routing gives special models the robustness of general ones. This is not only more computationally efficient—we get huge interpretability and safety benefits as a free bonus,” Kofman told VentureBeat.
At its core, Not Diamond’s router inspects each query going into an application and uses a ‘meta-model’ to automatically direct it to the model that can handle it most accurately, while also delivering cost and latency benefits.
According to Kofman, this saves teams from the hassle of calling the same large model every time, even when a query is not complicated enough to need it.
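The shape of that decision can be sketched in a few lines of Python. Everything here (the model names, prices, capability scores and the length-based difficulty heuristic) is an illustrative assumption, not Not Diamond’s actual meta-model:

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    cost_per_1k_tokens: float  # assumed pricing in dollars, for illustration
    capability: float          # assumed quality score in [0, 1]

# Hypothetical model pool -- names and numbers are illustrative, not real offerings.
MODELS = [
    Model("small-fast", 0.0005, 0.6),
    Model("mid-tier", 0.003, 0.8),
    Model("large-frontier", 0.03, 0.95),
]

def estimate_difficulty(query: str) -> float:
    """Stand-in for the 'meta-model': a trivial length heuristic mapped to [0, 1]."""
    return min(len(query.split()) / 100, 1.0)

def route(query: str) -> Model:
    """Send each query to the cheapest model capable enough to handle it."""
    difficulty = estimate_difficulty(query)
    eligible = [m for m in MODELS if m.capability >= difficulty]
    if not eligible:  # nothing clears the bar: fall back to the strongest model
        return max(MODELS, key=lambda m: m.capability)
    return min(eligible, key=lambda m: m.cost_per_1k_tokens)
```

The point is the decision structure: cheap models absorb the easy queries, and the expensive model is reserved for queries that actually need it.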
In the benchmark results the company shared, the Not Diamond router, working across multiple LLMs, delivered much better results than any individual model, including Llama 3.1 and GPT-4o.
To bring the offering to life, Not Diamond first constructed a large evaluation dataset that assessed the performance of different LLMs on everything from question answering to coding to reasoning.
Then, using this dataset, the company trained a ranking algorithm that determines which LLM is best suited to respond to a given query. This decision ultimately powers the routing action.
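As a rough illustration of that idea, the sketch below ranks two hypothetical models against a tiny, made-up evaluation set using a crude word-overlap similarity. The dataset, model names and nearest-neighbour approach are assumptions for illustration; Not Diamond’s actual features, data and algorithm are not public:

```python
# Evaluation data: each entry maps a query to hypothetical per-model quality scores.
EVAL_SET = [
    ("write a python function to sort a list", {"code-model": 0.9, "chat-model": 0.6}),
    ("fix this segfault in my C program",      {"code-model": 0.85, "chat-model": 0.5}),
    ("summarize this news article",            {"code-model": 0.4, "chat-model": 0.9}),
    ("draft a friendly email to a customer",   {"code-model": 0.3, "chat-model": 0.88}),
]

def similarity(a: str, b: str) -> float:
    """Jaccard overlap of word sets -- a crude stand-in for real query features."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def rank_models(query: str, k: int = 2) -> list[str]:
    """Rank models by similarity-weighted scores on the k nearest evaluated queries."""
    neighbours = sorted(EVAL_SET, key=lambda e: similarity(query, e[0]), reverse=True)[:k]
    totals: dict[str, float] = {}
    for text, scores in neighbours:
        weight = similarity(query, text)
        for model, score in scores.items():
            totals[model] = totals.get(model, 0.0) + weight * score
    return sorted(totals, key=totals.get, reverse=True)
```

A production router would learn from far richer features and far more data, but the interface is the same: given a query, return an ordering over models, and route to the top one.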
It first open-sourced a lightweight preview of its router in December 2023, allowing enterprises to automatically route queries between GPT-3.5 and GPT-4, and then expanded to other models on the market.
More importantly, if a team wants to use the router in internal workflows for select use cases, it can provide internal evaluation datasets to train a custom router that chooses the best-suited model. The offering also includes the option to hash all data sent to the API, as well as prompt translation, which optimizes the prompt for the model it is routed to.
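Both options are straightforward to picture in code. The sketch below shows a client-side SHA-256 hash and a simple template-based prompt translation; the function names and templates are assumptions for illustration, not Not Diamond’s API:

```python
import hashlib

def hash_prompt(prompt: str) -> str:
    """Digest a prompt so a routing service never needs to see the raw text."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

# Hypothetical per-model prompt formats; real models document their own.
PROMPT_TEMPLATES = {
    "code-model": "### Instruction:\n{query}\n\n### Response:",
    "chat-model": "You are a helpful assistant.\n\nUser: {query}\nAssistant:",
}

def translate_prompt(query: str, model: str) -> str:
    """Rewrap a query in the format the target model expects."""
    return PROMPT_TEMPLATES[model].format(query=query)
```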
Goal to accelerate developer adoption
Despite being in the early stages, Not Diamond is seeing significant adoption, especially from early and growth-stage companies and independent developers. The CEO did not share the exact count of these early users, but he did confirm that one enterprise customer, Samwell AI, saw a 10% improvement in LLM output quality with a 10% reduction in inference costs and latency with the company’s technology.
With the funding from industry leaders, the company hopes to build on this work by further accelerating product development and driving up adoption. Kofman confirmed the company has a “host of additional product features” in the pipeline, but cannot share much about them just yet.
In the space of smart query routing, the company is competing with a few notable startups, including Martian and Unify. However, the CEO says it differentiates itself from these players with its blazing-fast routing speed and its prompt optimization and privacy features.