How to manage a team of AI agents

Hey all, it’s been a while. I took time off for my honeymoon and then worked through the resulting backlog. To those who joined the enterprise-ready AI event, thank you for making it awesome! The interest was staggering. Ashish and I will share the key takeaways soon. If you’d like to be the first to know of future events, consider subscribing!

The user experience of ChatGPT and similar products is that they require a human to pilot them. They work as collaborators that need live instructions. A different way to experience AI is through fully autonomous AI agents: large language model (LLM)-driven programs that can seemingly operate on their own. Give an agent a goal, and it will figure things out from there. The first version of the AutoGPT project was pioneering, though a bit chaotic. The agent doesn’t know when the assigned goal is achieved. It can be unpredictable (at one point it started researching OnlyFans as a way to grow this newsletter), and it only stops working when it reaches an arbitrarily set number of API calls.

BabyAGI’s framework was a major step in making agents practical because it incorporated a task management system to impose structure and direction. The human analogy is that an AI agent on its own is like a human reacting to whatever comes to mind, which is not reliable. But pair a human with a task management system, even a simple Kanban board, and work becomes much more productive. The March 2023 launch of AutoGPT and BabyAGI marked the start of the agentic AI wave. Since then developers have been improving what agents can do, equipping them with long-term memory to remember what they’ve done along with a portfolio of skills to browse the web, write software programs, and much more.
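To make the task-management idea concrete, here is a minimal sketch of an agent loop driven by a task queue. It is not BabyAGI’s actual code: `run_agent`, `fake_execute`, and `fake_plan` are hypothetical names, and the two stub functions stand in for what would be LLM calls in a real agent. The `max_steps` budget mirrors the arbitrary API-call cap described above.

```python
from collections import deque
from typing import Callable, List, Tuple


def run_agent(
    goal: str,
    execute: Callable[[str], str],
    plan_next: Callable[[str, str], List[str]],
    max_steps: int = 5,
) -> List[Tuple[str, str]]:
    """Work through a task queue until it is empty or the step budget runs out."""
    tasks = deque([goal])  # the queue starts with the goal itself
    done: List[Tuple[str, str]] = []
    while tasks and len(done) < max_steps:
        task = tasks.popleft()                    # take the oldest pending task
        result = execute(task)                    # an LLM call in a real agent
        done.append((task, result))
        tasks.extend(plan_next(task, result))     # queue any follow-up tasks
    return done


# Hypothetical stubs so the sketch runs without an LLM.
def fake_execute(task: str) -> str:
    return f"done: {task}"


def fake_plan(task: str, result: str) -> List[str]:
    # Spawn one subtask per task, but stop after two levels of nesting.
    return [f"{task} / subtask"] if task.count("/") < 2 else []
```

The queue is what gives the agent a stopping condition other than the budget: when no follow-up tasks are planned, the loop ends by itself.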

If we could build one agent, why not several? The multi-agent scene first came to prominence with the work of Google and Stanford researchers, who ran the Smallville experiment with 25 AI agents, each simulated with its own personality and memories. What’s fascinating from the research is that social behaviors such as gossiping and event planning ‘naturally’ emerged from the simulation. That brought multi-agent projects to the Silicon Valley hive mind. Character.ai and other social bot platforms soon enabled chatrooms to host multiple agents chatting with each other. Those are all social-conversational applications. What about getting agents to do work?

Just a couple of months ago, I wrote in Command line to conversations:

This might be a vision of the future: copilot agents from different software vendors working together. Microsoft Copilot will extract the technical requirements from the product requirements Word document, then instruct Atlassian Intelligence to create Jira tickets for each technical requirement. These tickets will then be handed off to Github Copilot to write, test, and ship code.

That future just got closer. In the past month, practical multi-agent projects have started to arrive. MetaGPT is the most popular one so far because it is open-source, achieved state-of-the-art (SOTA) performance, and provides an intuitive framework to build a team of AI agents.

So, what is MetaGPT?

A framework that efficiently incorporates human workflows into LLM-based multi-agent collaboration. By encoding Standard Operating Procedures (SOPs) into prompts, MetaGPT enables structured coordination and modular outputs. It leverages an assembly line paradigm to assign diverse roles to various agents, allowing for the effective deconstruction of complex multi-agent collaborative problems. Experiments on collaborative software engineering benchmarks show promising results.

In the paper, the creators of MetaGPT configured it to be a software development team composed of a product manager, systems architect, program manager, software engineer, and a QA engineer. To have an objective measure of the AI team’s performance, they tested against two popular coding benchmarks, HumanEval and MBPP, and achieved SOTA results. Both benchmarks score generated code by running it against test cases, so LLMs that perform well on them are considered more capable of generating functionally correct code.

However, a more holistic evaluation would encompass other aspects such as ease of use, quality of the documentation, and the effort it took to build the application. MetaGPT is compelling because it produces work artifacts along with the code base, such as a product requirements document (PRD), system architecture and data flow diagrams, API specs, and a comparison of competing products. All these documents are in the team’s shared workspace and referenced as needed by the agents to continue working. In the paper, the example software the authors created was the 2048 game. The goal of the game is to slide the numbers around to combine them, creating larger and larger numbers until a tile reaches 2048. They were able to create a working game along with work artifacts from a single line of instruction: “Make the 2048 sliding tile number puzzle.”

That’s amazing. I had to see it for myself. So I installed MetaGPT and asked it to write task management software. Out came an almost working codebase. I still had to do some debugging but hey, it worked. It was complete with a SQL database, HTML templates, and a Flask-based Python script to stitch everything together. For $1.50 worth of API calls and 10 minutes, I got an entire AI team to build an app with detailed documentation. I could have cut the time and cost in half if I hadn’t instructed the agents to do code reviews.

See snippets of the generated PRD and system diagram below. If you’re interested in browsing through all the other files, here’s the link.

User Stories
1. I want to create, edit and delete tasks so that I can manage my work
2. I want to categorize tasks into projects so that I can organize my work 
3. I want to set due dates for tasks so that I can prioritize my work
4. I want to mark tasks as complete so that I can track my progress
5. I want to view all tasks in a single dashboard so that I can get an overview

Competitive Analysis
1. Asana: A comprehensive project management tool. It is more suitable for teams and may be overwhelming for individual users
2. Trello: A kanban-style task management app. It is easy to use but lacks some advanced task management features
3. Microsoft To Do: A simple and straightforward task management app. It is fully integrated with other Microsoft apps
4. Google Tasks: A basic task management tool. It is integrated with Google Calendar and Gmail but lacks advanced features

The insight from this is that agents will become more helpful collaborators to humans if they adopt how humans manage and produce work. Specifically:

Give agents specialized roles — Research has shown that an LLM reasons better if it assumes a cast of experts to reason about a topic, performing better than the chain-of-thought technique. The MetaGPT paper also showed that removing roles in the team led to worse performance. The intuition is that an LLM is a general brain that can be prompt-hypnotized into role playing as experts to tap into specialized knowledge hidden in the model parameters.

The process of specialization isn’t limited to prompt engineering. It can also come from equipping agents with role-relevant skills. For example, a product manager agent needs to be able to browse the web to do competitive research, craft user stories, and synthesize them into a PRD. The creators of MetaGPT programmed these skills for each agent. It will be interesting to see if there will be a marketplace of AI skills so users can craft their AI agents in a modular fashion. We’re seeing the beginnings of this with ChatGPT plugins and Zapier’s marketplace of zappable APIs.
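Here is one way the two halves of specialization, a role prompt plus a set of skills, could be modeled. This is a rough sketch under my own assumptions rather than MetaGPT’s API: the `Role` class, its fields, and the `web_search` skill are all hypothetical, and the skill is a stub that returns a canned string instead of browsing the web.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Role:
    """A hypothetical agent role: a persona prompt plus named skills."""
    name: str
    system_prompt: str
    skills: Dict[str, Callable[[str], str]] = field(default_factory=dict)

    def build_prompt(self, task: str) -> str:
        # Tell the model who it is, what tools it has, and what to do.
        skill_list = ", ".join(sorted(self.skills)) or "none"
        return (
            f"{self.system_prompt}\n"
            f"Available skills: {skill_list}\n"
            f"Task: {task}"
        )

    def use(self, skill: str, argument: str) -> str:
        # Refuse skills that were not assigned to this role.
        if skill not in self.skills:
            raise KeyError(f"{self.name} has no skill named {skill!r}")
        return self.skills[skill](argument)


# Stub skill: a real one would call a search API.
product_manager = Role(
    name="Product Manager",
    system_prompt="You are a product manager who writes clear PRDs.",
    skills={"web_search": lambda query: f"results for {query}"},
)
```

Keeping skills in a plain dictionary is what would make the modular, marketplace-style assembly possible: adding a capability to a role is just adding another entry.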

Teach agents how to collaborate on a shared workspace — Knowledge workers have a repository of work artifacts (docs, slides, numbers, chat logs) and processes (waterfall, agile, weekly syncs) to coordinate work at scale. These are templates to model how AI agents can work together. In the MetaGPT example, the collaboration was modeled as a waterfall development process in which each agent passes the baton of artifacts to the next one.
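The waterfall hand-off can be sketched as a list of stages that each read from and write to one shared workspace. This is an illustration under my own assumptions, not MetaGPT’s implementation: `run_waterfall`, `STAGES`, and the artifact names are hypothetical, and each stage is a stub where a real agent would call an LLM.

```python
from typing import Callable, Dict, List, Tuple

Workspace = Dict[str, str]                       # artifact name -> content
Stage = Tuple[str, Callable[[Workspace], str]]   # (artifact to produce, producer)


def run_waterfall(idea: str, stages: List[Stage]) -> Workspace:
    """Run stages in order; each one sees every artifact produced before it."""
    workspace: Workspace = {"idea": idea}
    for artifact_name, produce in stages:
        workspace[artifact_name] = produce(workspace)
    return workspace


# Stub stages standing in for the product manager, architect, and engineer.
STAGES: List[Stage] = [
    ("prd", lambda ws: f"PRD for: {ws['idea']}"),
    ("design", lambda ws: f"Design based on [{ws['prd']}]"),
    ("code", lambda ws: f"Code implementing [{ws['design']}]"),
]
```

Because every artifact stays in the workspace, a later stage can reference any earlier document, not only the one handed to it directly.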

Sidebar: When I was studying macroeconomics in university, one concept I never fully appreciated was how abstract management practices are considered part of the “technological” growth factor along with the likes of tangible computer chips. But seeing that programming how agents work together affects the end output gave me a better appreciation of that concept. Economics, after all, is about managing scarce resources. In this case, it’s coordinating and allocating the skills and memory of AI agents.

Manage information flow — When we go about our work, we communicate a lot to make decisions and keep stakeholders aligned. Whether it’s written or verbal communication, we share only the relevant information (at least we try to) to complete our jobs. MetaGPT employs a similar idea with different personas subscribing to certain chat logs. The agents don’t digest all of the information arbitrarily. For example, the system architect doesn’t follow the competitive analysis work. But when the PRD is done, it’ll get notified so it can start designing an architecture. This is similar to how we don’t attend every meeting, read every email, or even read every message in the Slack channels we’re in. Vector-based retrieval and a large context window can help agents digest more information, but curating information flow through intentional communication patterns still matters. Research has shown that while LLMs can handle 100,000 tokens, they’re similar to humans in that the more concise and relevant the context, the better.
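The subscription idea is essentially a publish-subscribe pattern. Below is a small sketch of it, assuming a hypothetical `MessageBoard` class and made-up topic names. It is not how MetaGPT is actually written, only an illustration of agents receiving just the messages they signed up for.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple


class MessageBoard:
    """Hypothetical shared channel: agents only receive topics they subscribe to."""

    def __init__(self) -> None:
        # topic -> names of subscribed agents
        self._subscribers: DefaultDict[str, List[str]] = defaultdict(list)
        # agent name -> list of (topic, message) pairs received
        self.inboxes: DefaultDict[str, List[Tuple[str, str]]] = defaultdict(list)

    def subscribe(self, agent: str, topic: str) -> None:
        self._subscribers[topic].append(agent)

    def publish(self, topic: str, message: str) -> None:
        # Deliver only to agents subscribed to this topic.
        for agent in self._subscribers[topic]:
            self.inboxes[agent].append((topic, message))


board = MessageBoard()
board.subscribe("architect", "prd")
board.subscribe("product_manager", "competitive_analysis")

board.publish("competitive_analysis", "Trello lacks advanced features")
board.publish("prd", "PRD v1 is ready")
```

In this run the architect’s inbox contains only the PRD notice, so its context stays short and relevant without any retrieval step.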

Granted, MetaGPT may still seem like a shiny demo limited to simpler Python-oriented programs, but the trajectory of progress is staggering. Just a few weeks after MetaGPT’s release, Microsoft open-sourced AutoGen, a flexible multi-agent framework in which humans can step in as an individual contributor or give iterative feedback to agents as the manager. These projects set the stage for a future where AI does more than just assist: it also collaborates and manages. If we got here within 10 months of ChatGPT’s release, where will we be by 2024?

Curated reads

Technical: MetaGPT: The Multi-Agent Framework

Commercial: You can now finetune GPT-3.5

Social: Nations carve different paths for tech regulation
