It can automatically perform a series of complex tasks by understanding the user’s natural language instructions and the visual content of the screen.
For example, "Delete all pictures in a Word document" or "Add a new slide to a PowerPoint presentation."
It builds on GPT-4V to understand and operate the graphical user interfaces (GUIs) of Windows applications.
UFO can perform various operations in Windows applications, such as clicking buttons, filling out forms, and browsing files, just as a person would with a mouse and keyboard.
The demo video shows UFO deleting all comments in a PowerPoint presentation.
Main abilities:
1. Cross-application operation: UFO can seamlessly navigate and operate across multiple applications in the Windows operating system. This means it can perform a sequence of actions in different applications as the task requires, such as extracting information from a Word document and then using that information to compose and send an email in Outlook.
2. Natural language command execution: Users tell UFO what tasks need to be completed through natural language commands. UFO understands these instructions and converts them into concrete GUI operations without manual user intervention.
3. Automated control interaction: UFO includes a control interaction module that transforms the actions recognized by the visual model into actual operations on application controls. This allows UFO to automatically click buttons, enter text, and so on inside the application.
4. Application selection: UFO uses the application selection agent (AppAgent) of its dual-agent framework to determine which application is best suited to complete the user's request, including switching to a different application when the task requires it.
5. Action selection and execution: The action selection agent (ActAgent) is responsible for selecting and executing specific actions in the chosen application until the task is completed. It uses screenshots and control information to determine the best next step.
6. Multimodal input processing: UFO can process and parse both images (screenshots) and text to understand the current GUI state and make decisions.
7. Custom tasks and controls: UFO is highly extensible, allowing users to design custom actions and controls for specific tasks, which increases its versatility and flexibility across applications and usage scenarios.
Working principle:
UFO (UI-Focused Agent) is built on an advanced visual language model, GPT-4V, and a dual-agent framework that allows it to understand and perform graphical user interface (GUI) tasks in the Windows operating system. The following is a detailed explanation of how UFO works:
1. Dual agent framework
Dual-agent architecture: UFO consists of two main agents, AppAgent and ActAgent, responsible respectively for selecting and switching between applications and for performing specific actions within those applications.
AppAgent: Responsible for deciding which application needs to be launched or switched to in order to complete a user request. It makes this choice by analyzing the user's natural language command and a screenshot of the current desktop. Once the most suitable application is determined, AppAgent develops a global plan to guide the execution of the task.
Action Selection Agent (ActAgent): Once an application is selected, the ActAgent performs specific operations in the application, such as clicking buttons, entering text, etc. ActAgent uses screenshots and control information of the application to determine the most appropriate next action, and converts these actions into actual actions on the application controls through the control interaction module.
2. Control interaction module
UFO's control interaction module is the key component that transforms actions recognized by the agents into actual execution in the application. This module enables UFO to interact directly with the application's GUI elements, performing operations such as clicking, dragging, and text entry without human intervention.
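A minimal sketch of such a dispatch table, assuming a simple dictionary-based action format; UFO's real module drives Windows controls (e.g., via UI Automation), while the helpers here only return strings for illustration:

```python
# Hypothetical helpers standing in for real GUI calls (e.g., a click or
# keystroke issued through Windows UI Automation).
def click(control: str) -> str:
    return f"clicked {control}"

def set_text(control: str, text: str) -> str:
    return f"typed {text!r} into {control}"

# The module's core: a table mapping recognized action names to handlers.
ACTION_TABLE = {"click": click, "set_text": set_text}

def execute(action: dict) -> str:
    """Dispatch one recognized action, e.g. {'op': 'click', 'args': {...}}."""
    handler = ACTION_TABLE[action["op"]]
    return handler(**action["args"])
```

Keeping the table as data makes the module easy to extend, which is also how UFO supports custom actions for specific applications.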
3. Multimodal input processing
UFO can process multiple types of input, including text (the user's natural language instructions) and images (application screenshots). This allows UFO to understand the current GUI state, the available controls and their properties, and to make accurate operational decisions.
4. Resolution of user request
When it receives a natural language instruction from a user, UFO first parses the instruction to determine the user's intent and the tasks to be completed. It then breaks the task down into a series of subtasks or action steps, which AppAgent and ActAgent execute in sequence.
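The parse-then-execute-in-sequence structure can be sketched as follows. The toy planner splits the request at connectives purely for illustration; UFO derives its global plan from GPT-4V instead:

```python
from collections import deque

def plan(request: str) -> deque[str]:
    """Toy planner: split a request into subtasks at 'then'/'and'.
    A stand-in for UFO's GPT-4V-generated plan, used only to show the
    queue-of-subtasks structure."""
    parts = request.replace(" and ", " then ").split(" then ")
    return deque(p.strip() for p in parts if p.strip())

def execute_plan(request: str) -> list[str]:
    queue, done = plan(request), []
    while queue:
        subtask = queue.popleft()  # AppAgent/ActAgent would handle each step
        done.append(f"done: {subtask}")
    return done
```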
5. Seamless switching between applications
If completing a user request requires operating multiple applications, UFO can switch between them seamlessly. It uses AppAgent to decide when and how to switch applications, and ActAgent to perform the specific actions in each one.
6. Mapping natural language commands to GUI operations
One of the core functions of UFO is to map the user’s natural language commands to specific GUI operations. This process involves understanding the intent of the command, identifying the relevant GUI elements, and generating and executing actions to manipulate those elements.
In this way, UFO can automatically complete complex tasks ranging from document editing and information extraction to composing and sending email, greatly improving users' efficiency and convenience when working in Windows.
GitHub: https://github.com/microsoft/UFO
Paper: https://arxiv.org/abs/2402.07939
Case study: Use text from multiple sources to write an email.
In this example, UFO seamlessly switches and operates between multiple applications, such as Word, an image viewer, and an email client, demonstrating its ability to handle cross-application tasks.
This demonstrates the versatility and efficiency of UFO's cross-application workflow.