A new AI agent has emerged from the parent company of TikTok to take control of your computer and perform complex workflows.
Much like Anthropic’s Computer Use, ByteDance’s new UI-TARS understands graphical user interfaces (GUIs), applies reasoning and takes autonomous, step-by-step action.
Trained on roughly 50B tokens and offered in 7B and 72B parameter versions, the PC/MacOS agent achieves state-of-the-art (SOTA) performance on 10-plus GUI benchmarks spanning perception, grounding and overall agent capabilities, consistently beating out OpenAI’s GPT-4o, Claude and Google’s Gemini.
“Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention,” researchers from ByteDance and Tsinghua University write in a new research paper.
How UI-TARS explains its thinking
UI-TARS works across desktop, mobile and web applications, using multimodal inputs (text, images, interactions) to understand visual environments.
Its UI features two tabs — one to the left showing its step-by-step “thinking,” and a larger one to the right where it pulls up files, websites and apps and automatically takes action.
For example, in a demo video released today, the model is prompted to “Find round trip flights from SEA to NYC on the 5th and return on the 10th next month and filter by price in ascending order.”
In response, UI-TARS navigates to the Delta Air Lines website, fills in the “from” and “to” fields, clicks the relevant dates, and sorts and filters by price, explaining each step in its thinking box before taking action.
In another scenario, it is instructed to install the autoDocstring extension in VS Code. Here is its step-by-step thinking as it completes that task (a simplified sketch of this observe-think-act loop follows the list):
- It first reports that “to proceed with installing the autoDocstring extension in VS Code, I first need to open the VS Code application.”
- Recognizing that the VS Code window has not fully loaded, the model notes that “it is necessary to wait for the application to finish initializing before proceeding with the installation of the autoDocstring extension. This ensures that all required functionalities are available for the next steps.”
- It then explains that it must access the Extensions view in VS Code, which can be done by clicking on a tab in the left sidebar.
- Running into a small glitch, it reasons: “The previous attempt to click on the ‘Extensions’ tab may not have been precise enough, so I will try clicking on it again to ensure the Extensions view opens successfully.”
- Once in the Extensions view, it types in ‘autoDocstring’ before concluding: “Since the installation process has already been initiated, there is no further action required at this moment. I will wait for the installation to complete and confirm its success once the process is finished.”
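To make that loop concrete, here is a minimal sketch of an observe-think-act agent cycle. It is not UI-TARS code: `model.plan_next_step` is a hypothetical stand-in for the agent’s policy, and the action format is invented for illustration, while the screen capture and clicking use the real pyautogui library.

```python
# Illustrative sketch, not UI-TARS code: an observe -> think -> act loop like the
# one the demo's "thinking" pane suggests. `model.plan_next_step` and the action
# dictionary format are hypothetical stand-ins; the GUI I/O uses pyautogui.
import time
from dataclasses import dataclass

import pyautogui  # pip install pyautogui


@dataclass
class Step:
    thought: str          # reasoning surfaced in the left-hand "thinking" pane
    action: dict | None   # e.g. {"type": "click", "x": 412, "y": 305}; None = wait


def execute(action: dict) -> None:
    """Translate an abstract action into a real GUI event."""
    if action["type"] == "click":
        pyautogui.click(action["x"], action["y"])
    elif action["type"] == "type":
        pyautogui.write(action["text"], interval=0.02)


def run_task(model, instruction: str, max_steps: int = 30) -> None:
    history: list[Step] = []
    for _ in range(max_steps):
        screenshot = pyautogui.screenshot()           # current GUI state
        step = model.plan_next_step(instruction, screenshot, history)  # hypothetical policy call
        history.append(step)
        print("THOUGHT:", step.thought)
        if step.action is None:                       # e.g. waiting for VS Code to finish loading
            time.sleep(1.0)
            continue
        if step.action["type"] == "finish":
            break
        execute(step.action)
```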
Outperforming its rivals
Across a variety of benchmarks, researchers report that UI-TARS consistently outranked OpenAI’s GPT-4o; Anthropic’s Claude-3.5-Sonnet; Gemini-1.5-Pro and Gemini-2.0; four Qwen models; and numerous academic models.
For instance, on VisualWebBench, which measures a model’s ability to understand and ground web elements across tasks such as webpage question answering and optical character recognition, UI-TARS 72B scored 82.8%, outperforming GPT-4o (78.5%) and Claude 3.5 (78.2%).
It also did significantly better on WebSRC (understanding of semantic content and layout in web contexts) and ScreenQA-short (comprehension of complex mobile screen layouts and web structure). UI-TARS-7B achieved a leading 93.6% on WebSRC, while UI-TARS-72B led ScreenQA-short with 88.6%, outperforming Qwen, Gemini, Claude 3.5 and GPT-4o.
“These results demonstrate the superior perception and comprehension capabilities of UI-TARS in web and mobile environments,” the researchers write. “Such perceptual ability lays the foundation for agent tasks, where accurate environmental understanding is crucial for task execution and decision-making.”
UI-TARS also showed impressive results in ScreenSpot Pro and ScreenSpot v2, which assess a model’s ability to understand and localize elements in GUIs. Further, researchers tested its capabilities in planning multi-step actions and low-level tasks in mobile environments, and benchmarked it on OSWorld (which assesses open-ended computer tasks) and AndroidWorld (which scores autonomous agents on 116 programmatic tasks across 20 mobile apps).
Under the hood
To help it take step-by-step actions and recognize what it’s seeing, UI-TARS was trained on a large-scale dataset of screenshots that parsed metadata including element description and type, visual description, bounding boxes (position information), element function and text from various websites, applications and operating systems. This allows the model to provide a comprehensive, detailed description of a screenshot, capturing not only elements but spatial relationships and overall layout.
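To make that concrete, here is a hypothetical sketch of what a single annotated screenshot record in such a dataset could look like. The field names mirror the metadata the paper describes (element type, description, bounding box, function, text) but are illustrative, not ByteDance’s actual schema.

```python
# Hypothetical sketch of an annotated screenshot record for screenshot-parsing
# training data. Field names are illustrative, not ByteDance's actual schema.
from dataclasses import dataclass


@dataclass
class ElementAnnotation:
    element_type: str                  # e.g. "button", "text_field", "tab"
    description: str                   # visual description of the element
    bbox: tuple[int, int, int, int]    # (x0, y0, x1, y1) position in pixels
    function: str                      # what interacting with the element does
    text: str                          # visible text content, if any


@dataclass
class ScreenshotSample:
    image_path: str
    platform: str                      # "web", "windows", "macos", "android", ...
    elements: list[ElementAnnotation]
    layout_caption: str                # overall layout and spatial relationships


sample = ScreenshotSample(
    image_path="screens/flight_search.png",   # illustrative path
    platform="web",
    elements=[
        ElementAnnotation(
            element_type="text_field",
            description="departure airport input at the top-left of the search form",
            bbox=(120, 240, 360, 280),
            function="sets the 'from' airport for the flight search",
            text="SEA",
        ),
    ],
    layout_caption="Flight search form with origin/destination fields above "
                   "date pickers and a sort-by-price control.",
)
```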
The model also uses state transition captioning to identify and describe the differences between two consecutive screenshots and determine whether an action — such as a mouse click or keyboard input — has occurred. Meanwhile, set-of-mark (SoM) prompting allows it to overlay distinct marks (letters, numbers) on specific regions of an image.
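Here is a minimal sketch of those two ideas using the Pillow imaging library: numbered marks overlaid on candidate regions (SoM-style), plus a naive pixel-diff check for whether consecutive screenshots differ. Both the marking scheme and the change threshold are illustrative assumptions, not UI-TARS’s actual implementation.

```python
# Minimal sketch of set-of-mark (SoM) overlays and a naive state-transition
# check, using Pillow. The marking scheme and the diff threshold are
# illustrative choices, not UI-TARS's implementation.
from PIL import Image, ImageChops, ImageDraw


def overlay_marks(image_path: str, boxes: list[tuple[int, int, int, int]]) -> Image.Image:
    """Draw a numbered mark on each candidate region so a model can refer to
    elements as 'mark 1', 'mark 2', ... instead of raw pixel coordinates."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for i, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        draw.rectangle((x0, y0, x1, y1), outline="red", width=3)
        draw.text((x0 + 4, y0 + 4), str(i), fill="red")
    return img


def screen_changed(before_path: str, after_path: str, threshold: int = 10) -> bool:
    """Crude check of whether an action (click, keystroke) visibly changed the
    screen: compare two consecutive screenshots pixel by pixel."""
    before = Image.open(before_path).convert("L")
    after = Image.open(after_path).convert("L")
    diff = ImageChops.difference(before, after)
    return diff.getbbox() is not None and max(diff.getextrema()) > threshold
```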
The model is equipped with both short-term and long-term memory to handle tasks at hand while also retaining historical interactions to improve later decision-making. Researchers trained the model to perform both System 1 (fast, automatic and intuitive) and System 2 (slow and deliberate) reasoning. This allows for multi-step decision-making, “reflection” thinking, milestone recognition and error correction.
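A rough sketch of how that short-term/long-term split might be organized follows; the structure and sizes are assumptions for illustration, not the paper’s design.

```python
# Hedged sketch of a short-term / long-term memory split for a GUI agent.
# The structure and sizes are illustrative assumptions, not the paper's design.
from collections import deque


class AgentMemory:
    def __init__(self, short_term_size: int = 5):
        # Short-term: the last few (observation, thought, action) records,
        # fed back to the model at every step for immediate context.
        self.short_term = deque(maxlen=short_term_size)
        # Long-term: a persistent log of past episodes the agent can draw on
        # for later decision-making.
        self.long_term: list[dict] = []

    def record_step(self, observation: str, thought: str, action: dict) -> None:
        self.short_term.append({"obs": observation, "thought": thought, "action": action})

    def archive_episode(self, task: str, outcome: str) -> None:
        self.long_term.append({"task": task, "steps": list(self.short_term), "outcome": outcome})
        self.short_term.clear()
```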
Researchers emphasized that it is critical that the model be able to maintain consistent goals and engage in trial and error to hypothesize, test and evaluate potential actions before completing a task. They introduced two types of data to support this: error correction and post-reflection data. For error correction, they identified mistakes and labeled corrective actions; for post-reflection, they simulated recovery steps.
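Conceptually, those two data types might look something like the following; the field names and values are illustrative assumptions rather than samples from the actual training set.

```python
# Illustrative sketch of the two data types described above: an error-correction
# sample (a mistaken step paired with its labeled corrective action) and a
# post-reflection sample (a simulated recovery after the error has occurred).
# All fields and values are assumptions for illustration.
error_correction_sample = {
    "context": "Extensions view should be open, but the click missed the sidebar tab.",
    "mistaken_action": {"type": "click", "x": 18, "y": 300},
    "corrective_action": {"type": "click", "x": 24, "y": 312},  # re-click precisely
    "label": "error_correction",
}

post_reflection_sample = {
    "context": "The wrong extension was installed after searching a misspelled name.",
    "reflection": "The search term was misspelled; uninstall it and search again.",
    "recovery_steps": [
        {"type": "click", "target": "Uninstall"},
        {"type": "type", "text": "autoDocstring"},
    ],
    "label": "post_reflection",
}
```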
“This strategy ensures that the agent not only learns to avoid errors but also adapts dynamically when they occur,” the researchers write.
Researchers point out that Claude Computer Use “performs strongly in web-based tasks but significantly struggles with mobile scenarios, indicating that the GUI operation ability of Claude has not been well transferred to the mobile domain.”
By contrast, “UI-TARS exhibits excellent performance in both website and mobile domain.”
Clearly, UI-TARS exhibits impressive capabilities, and it’ll be interesting to see its evolving use cases in the increasingly competitive AI agents space. As the researchers note: “Looking ahead, while native agents represent a significant leap forward, the future lies in the integration of active and lifelong learning, where agents autonomously drive their own learning through continuous, real-world interactions.”