AppAgent can perform various tasks on mobile phones by autonomously learning and mimicking human tap and swipe gestures.
It can post on social media, write and send emails, navigate maps, shop online, and even perform complex image edits.
AppAgent was extensively tested on 50 tasks, covering 10 different applications.
The project was developed by a research team from Tencent and the University of Texas at Dallas.
Key features:
- Multimodal Agent: AppAgent is a multimodal agent based on large language models that can process and understand multiple types of information, such as text, images, and touch operations. This allows it to understand complex tasks and carry them out across a variety of applications.
- Intuitive Interaction: It interacts with smartphone apps by mimicking intuitive human actions, such as tapping and swiping the screen, just like a real user.
- Autonomous Learning: AppAgent observes and analyzes user interface interactions in different applications, learns these interaction patterns, and compiles the knowledge gained into documentation.
- Build a Knowledge Base: Through these interactions, AppAgent builds a knowledge base that records the operation methods and interface layouts of different applications. This knowledge base is then used to guide the agent through tasks in those applications.
- Perform Complex Tasks: Once it has learned how to operate an application, AppAgent is able to perform complex tasks across applications, such as sending emails, editing pictures, or making online purchases.
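The workflow described above can be sketched in miniature: an exploration phase compiles observed interactions into per-app documentation, and a deployment phase consults that documentation before acting. This is an illustrative sketch only; the class and function names (`KnowledgeBase`, `explore`, `act`) are assumptions, not AppAgent's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeBase:
    """Maps (app, UI element) pairs to natural-language usage notes.

    Hypothetical structure; AppAgent stores its learned documentation
    in its own format.
    """
    docs: dict = field(default_factory=dict)

    def record(self, app: str, element: str, note: str) -> None:
        self.docs[(app, element)] = note

    def lookup(self, app: str, element: str) -> str:
        return self.docs.get((app, element), "no documentation yet")


def explore(kb: KnowledgeBase, app: str, observed: list) -> None:
    """Exploration phase: turn observed (element, effect) pairs
    into documentation entries."""
    for element, effect in observed:
        kb.record(app, element, f"tapping '{element}' {effect}")


def act(kb: KnowledgeBase, app: str, element: str) -> str:
    """Deployment phase: consult the knowledge base before
    choosing an action on a UI element."""
    return kb.lookup(app, element)


kb = KnowledgeBase()
explore(kb, "Gmail", [("compose_button", "opens a new email draft")])
print(act(kb, "Gmail", "compose_button"))
```

In the real system, the "effect" descriptions come from an LLM summarizing before/after screenshots, and actions are executed as taps and swipes on the device rather than dictionary lookups; the sketch only shows how learned documentation separates exploration from task execution.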
Project page and demos: https://appagent-official.github.io
Paper: https://arxiv.org/abs/2312.13771
GitHub: https://github.com/mnotgod96/AppAgent