From gen AI 1.5 to 2.0: Moving from RAG to agent systems

We are now more than a year into developing solutions based on generative AI foundation models. While most applications use large language models (LLMs), the recent arrival of multi-modal models that can understand and generate images and video makes foundation model (FM) the more accurate term.

The world has started to develop patterns that can be leveraged to bring these solutions into production and produce real impact by sifting through information and adapting it to people’s diverse needs. Additionally, there are transformative opportunities on the horizon that will unlock significantly more complex uses of LLMs (and significantly more value). However, both of these opportunities come with increased costs that must be managed.

Gen AI 1.0: LLMs and emergent behavior from next-token generation

It is critical to gain a better understanding of how FMs work. Under the hood, these models convert our words, images, numbers and sounds into tokens, then simply predict the ‘best-next-token’ that is likely to make the person interacting with the model like the response. By learning from feedback for over a year, the core models (from Anthropic, OpenAI, Mistral, Meta and elsewhere) have become much more in tune with what people want out of them.
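
To make the ‘best-next-token’ idea concrete, here is a minimal sketch in Python. It is a toy, not how any production FM works: real models use learned subword tokenizers and neural networks, while this example assumes whitespace-separated tokens and simple frequency counts. The function names and the tiny corpus are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each token, which tokens follow it in the training text."""
    tokens = text.split()  # whitespace split stands in for a real tokenizer
    follows = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        follows[current][nxt] += 1
    return follows

def generate(follows, start, max_tokens=5):
    """Greedily append the most frequent next token until none is known."""
    out = [start]
    for _ in range(max_tokens):
        candidates = follows.get(out[-1])
        if not candidates:
            break  # no observed continuation for the last token
        out.append(candidates.most_common(1)[0][0])
    return " ".join(out)

# Hypothetical training text, chosen only to keep the example small
corpus = "the model predicts the next token and the next token follows"
model = train_bigram(corpus)
print(generate(model, "the", max_tokens=2))
```

Even at this scale the loop has the same shape as the real thing: look at what came before, pick a likely continuation, append it and repeat.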

By understanding the way that language is converted to tokens, we have learned that formatting is important (for example, YAML tends to perform better than JSON). By better understanding the models themselves, the generative AI community has developed “prompt-engineering” techniques to get the models to respond effectively.

For example, by providing a few examples (few-shot prompt), we can coach a model towards the answer style we want. Or, by asking the model to break down the problem (chain of thought prompt), we can get it to generate more tokens, increasing the likelihood that it will arrive at the correct answer to complex questions. If you’ve been an active user of consumer gen AI chat services over the past year, you have likely noticed these improvements.
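
As a rough illustration of both techniques, the sketch below builds the prompt text only; it does not call any model. The `Q:`/`A:` layout, the function names and the sample questions are assumptions made for this example, and the exact wording that works best varies by model.

```python
def few_shot_prompt(examples, question):
    """Prepend worked question/answer pairs so the model can imitate their style."""
    lines = []
    for q, a in examples:
        lines.append(f"Q: {q}")
        lines.append(f"A: {a}")
    lines.append(f"Q: {question}")
    lines.append("A:")  # left open for the model to complete
    return "\n".join(lines)

def chain_of_thought_prompt(question):
    """Ask the model to reason through the problem before answering."""
    return f"Q: {question}\nA: Let's think step by step."

# Hypothetical examples that demonstrate a terse answer style
examples = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Tokyo"),
]
print(few_shot_prompt(examples, "What is the capital of Canada?"))
print(chain_of_thought_prompt("A train leaves at 3pm and travels 2 hours. When does it arrive?"))
```

The resulting strings would be sent as the user message to whichever chat or completion API is in use.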

Gen AI 1.5: Retrieval augmented generation, embedding models and vector databases

Another foundation for progress is expanding the amount of information that an LLM can process. State-of-the-art models can now process up to 1M tokens (a full-length college textbook), enabling users to control the context with which these systems answer questions in ways that weren’t previously possible.

It is now quite simple to take an entire complex legal, medical or scientific text and ask an LLM questions about it, with performance at 85% accuracy on the relevant entrance exams for the field. I was recently working with a physician on answering questions over a complex 700-page guidance document, and was able to set this up with no infrastructure at all using Anthropic’s Claude.

Adding to this, the continued development of technology that leverages LLMs to store and retrieve similar text based on concepts instead of keywords further expands the available information.

New embedding models (with obscure names like titan-v2, gte, or cohere-embed) enable similar text to be retrieved by converting text from diverse sources into “vectors” learned from correlations in very large datasets. Vector query is being added to existing database systems (vector functionality across the suite of AWS database solutions), and special-purpose vector databases like turbopuffer, LanceDB and Qdrant help scale these up. These systems are successfully scaling to 100 million multi-page documents with limited drops in performance.
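
The retrieval step itself can be sketched in a few lines of Python. This is a minimal illustration, assuming tiny hand-written three-dimensional vectors in place of real embeddings (which have hundreds or thousands of dimensions and come from an embedding model) and a plain dictionary in place of a vector database. The document names and numbers are invented for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical "embeddings": two equipment documents and one unrelated contract
index = {
    "hvac_manual": [0.9, 0.1, 0.0],
    "lease_contract": [0.0, 0.2, 0.9],
    "boiler_faq": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]  # stands in for the embedding of an equipment question
print(top_k(query, index, k=2))
```

A vector database performs the same ranking, but uses approximate nearest-neighbor indexes so that it does not have to compare the query against every stored vector.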

Scaling these solutions in production is still a complex endeavor, bringing together teams from multiple backgrounds to optimize a complex system. Security, scaling, latency, cost optimization and data/response quality are all emerging topics that don’t have standard solutions in the space of LLM-based applications.

Gen AI 2.0 and agent systems

While improvements in model and system performance are incrementally raising the accuracy of solutions to the point where they are viable for nearly every organization, both are still evolutions (gen AI 1.5 maybe). The next evolution is in creatively chaining multiple forms of gen AI functionality together.

The first steps in this direction will be in manually developing chains of action. One example is BrainBox.ai ARIA, a gen AI-powered virtual building manager that understands a picture of a malfunctioning piece of equipment, looks up relevant context from a knowledge base, generates an API query to pull relevant structured information from an IoT data feed and ultimately suggests a course of action. The limitation of these systems is in defining the logic to solve a given problem, which must either be hard coded by a development team or kept to only 1-2 steps deep.

The next phase of gen AI (2.0) will create agent-based systems that use multi-modal models in multiple ways, powered by a ‘reasoning engine’ (typically just an LLM today) that can help break down problems into steps, then select from a set of AI-enabled tools to execute each step, taking the results of each step as context to feed into the next step while also re-thinking the overall solution plan.
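
The control loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not a real agent framework: `plan_next_step` is a stub standing in for the LLM ‘reasoning engine’, and the two tools return canned strings where a real system would call a knowledge base or an IoT API. All names here are hypothetical.

```python
def lookup_manual(history):
    """Hypothetical tool: a real one would query a knowledge base."""
    return "manual: error E4 means a clogged filter"

def query_sensor(history):
    """Hypothetical tool: a real one would call an IoT data feed."""
    return "sensor: airflow 40% below normal"

TOOLS = {"lookup_manual": lookup_manual, "query_sensor": query_sensor}

def plan_next_step(goal, history):
    """Stub reasoning engine: picks the first tool not yet used.

    A real system would send the goal and the full history to an LLM here,
    letting it revise the plan after every step.
    """
    used = {name for name, _ in history}
    for name in TOOLS:
        if name not in used:
            return name
    return None  # planner decides the task is finished

def run_agent(goal, max_steps=5):
    """Plan, act, record the result, then plan again with the new context."""
    history = []
    for _ in range(max_steps):
        tool_name = plan_next_step(goal, history)
        if tool_name is None:
            break
        result = TOOLS[tool_name](history)  # earlier results are passed as context
        history.append((tool_name, result))
    return history

for step in run_agent("diagnose the malfunctioning air handler"):
    print(step)
```

The `max_steps` cap matters in practice: without it, a planner that never decides it is done would keep making calls indefinitely.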

By separating the data gathering, reasoning and action taking components, these agent-based systems enable a much more flexible set of solutions and make much more complex tasks feasible. Tools like devin.ai from Cognition Labs for programming can go beyond simple code-generation, performing end-to-end tasks like a programming language change or design pattern refactor in 90 minutes with almost no human intervention. Similarly, the Amazon Q Developer service enables end-to-end Java version upgrades with little-to-no human intervention.

In another example, imagine a medical agent system working out a course of action for a patient with end-stage chronic obstructive pulmonary disease. It can access the patient’s electronic health records (from AWS HealthLake), imaging data (from AWS HealthImaging), genetic data (from AWS HealthOmics), and other relevant information to generate a detailed response. The agent can also search for clinical trials, medications and biomedical literature using an index built on Amazon Kendra to provide the most accurate and relevant information for the clinician to make informed decisions.

Additionally, multiple purpose-specific agents can work in synchronization to execute even more complex workflows, such as creating a detailed patient profile. These agents can autonomously implement multi-step knowledge generation processes, which would have otherwise required human intervention.

However, without extensive tuning, these systems will be extremely expensive to run, with thousands of LLM calls passing large numbers of tokens to the API. Therefore, parallel development in LLM optimization techniques including hardware (NVIDIA Blackwell, AWS Inferentia), framework (Mojo), cloud (AWS Spot Instances), models (parameter size, quantization) and hosting (NVIDIA Triton) must continue to be integrated with these solutions to optimize costs.
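
A back-of-the-envelope calculation shows how quickly these calls add up. The sketch below assumes per-token pricing billed separately for input and output; the prices and token counts are placeholder values chosen for illustration, not any vendor's actual rates.

```python
def run_cost(calls, input_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """Estimated cost of one agent run, assuming every call is the same size."""
    per_call = (
        input_tokens / 1000 * price_in_per_1k
        + output_tokens / 1000 * price_out_per_1k
    )
    return calls * per_call

# Hypothetical run: 2,000 LLM calls, each with 8,000 input and 500 output tokens
cost = run_cost(
    calls=2000,
    input_tokens=8000,
    output_tokens=500,
    price_in_per_1k=0.003,   # assumed price per 1,000 input tokens, in dollars
    price_out_per_1k=0.015,  # assumed price per 1,000 output tokens, in dollars
)
print(f"${cost:.2f} per run")
```

Under these assumptions a single run costs roughly $63, most of it from input tokens, which is why trimming the context passed between steps is often the first optimization to try.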

Conclusion

As organizations mature in their use of LLMs over the next year, the game will be about obtaining the highest quality outputs (tokens), as quickly as possible, at the lowest possible price. This is a fast-moving target, so it is best to find a partner who is continuously learning from real-world experience running and optimizing gen AI-backed solutions in production.

Ryan Gross is senior director of data and applications at Caylent.
