Adept Fuyu-Heavy is a new multimodal model designed for digital agents.

Adept claims it is the third-most-capable multimodal model in the world, behind GPT-4V and Gemini Ultra.
It is particularly adept at understanding user interfaces, which means it can interpret and operate the interfaces of various software and applications.
It can assist users in performing various tasks, such as automating processes, responding to queries, and providing information.

Adept Fuyu-Heavy has demonstrated outstanding performance in multiple evaluations and benchmarks.

1. Multimodal benchmarking: On the MMMU (Massive Multi-discipline Multimodal Understanding) benchmark, Fuyu-Heavy outperformed Gemini Pro, highlighting its capabilities on multimodal tasks.
2. Text benchmarking: Although Fuyu-Heavy must devote some of its capacity to processing image data, it performs roughly on par with Gemini Pro in standard text-only evaluations and even outperforms it on the MMLU (Massive Multitask Language Understanding) benchmark.
3. Long-form conversation performance: After a supervised fine-tuning and direct preference optimization phase, Fuyu-Heavy performs comparably to Claude 2.0 on the most commonly used chat evaluations, MT-Bench and AlpacaEval 1.0, despite being a smaller model that spends part of its capacity on image modeling.
4. Additional multimodal benchmarks: Adept also reports results on VQAv2 (a visual question answering benchmark) and AI2D (a diagram understanding dataset).

The main capabilities of Adept Fuyu-Heavy include:

1. Multimodal understanding and generation: Fuyu-Heavy is capable of processing and understanding multiple types of data, such as text and images, and is able to generate corresponding output based on this data. This makes it excellent at multimodal tasks.
2. Efficient image and text processing: Despite devoting some capacity to image modeling, Fuyu-Heavy matches or outperforms comparably sized models on standard text benchmarks.
3. Optimized model architecture: Fuyu-Heavy extends and optimizes the Fuyu architecture to efficiently handle images of arbitrary size and shape while still benefiting from existing transformer optimizations.
4. Long-form dialogue performance: After being optimized for a specific training phase, Fuyu-Heavy performs well in long-form dialogues and interactions.
5. User interface understanding: Fuyu-Heavy is particularly skilled at understanding digital user interfaces (UIs), such as websites and applications, enabling effective automation. It was developed with a focus on digital agents: improving UI understanding, strengthening automated decision-making, and delivering more accurate information retrieval and content generation.
6. Cross-modal content generation: Capable of generating content that spans text and images, suitable for a variety of application scenarios.
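Adept has not published Fuyu-Heavy's internals, but the original Fuyu architecture handles images of arbitrary size and shape by linearly projecting flattened image patches directly into the decoder's token stream, with no separate vision encoder. A minimal NumPy sketch of that patchify-and-project idea (the patch size, embedding width, and weights here are illustrative, not Adept's actual values):

```python
import numpy as np

PATCH = 4    # patch side length (illustrative; the real model uses larger patches)
D_MODEL = 8  # embedding width (illustrative)

rng = np.random.default_rng(0)
# A single linear projection stands in for a vision encoder: it maps each
# flattened RGB patch straight into the decoder's embedding space.
W = rng.standard_normal((PATCH * PATCH * 3, D_MODEL))

def patchify(image: np.ndarray) -> np.ndarray:
    """Split an HxWx3 image (H, W multiples of PATCH) into flattened patches,
    row by row in raster order, forming Fuyu-style image 'tokens'."""
    h, w, c = image.shape
    rows, cols = h // PATCH, w // PATCH
    return (image.reshape(rows, PATCH, cols, PATCH, c)
                 .transpose(0, 2, 1, 3, 4)
                 .reshape(rows * cols, PATCH * PATCH * c))

def embed_image(image: np.ndarray) -> np.ndarray:
    """Project patches into token embeddings, ready to interleave with text tokens."""
    return patchify(image) @ W

# Arbitrary aspect ratio: an 8x12 image yields a 2x3 grid of patches.
img = rng.random((8, 12, 3))
tokens = embed_image(img)
print(tokens.shape)  # (6, 8): six image tokens, each D_MODEL wide
```

Because the image is tokenized on the fly rather than resized to a fixed resolution, any height and width that divide evenly into patches produce a valid token sequence, which is what lets the architecture scale to arbitrary image shapes.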

Details: https://adept.ai/blog/adept-fuyu-heavy
