UMI (Universal Manipulation Interface): A robot data collection and policy learning framework developed by Stanford

Data is collected with handheld grippers and a carefully designed interface.
UMI transfers human manipulation skills demonstrated in complex environments directly to robots, without requiring anyone to write detailed task-specific programs.
In other words, a person demonstrates the task by hand, the demonstration is recorded, and the resulting data is used to train the robot, so the robot can learn new tasks quickly.
UMI includes a carefully designed policy interface, with inference-time latency matching and a relative-trajectory action representation, so that learned policies are not tied to specific hardware and can be deployed across multiple robot platforms.
UMI provides a portable, intuitive, low-cost framework for data collection and policy learning that turns diverse human demonstrations directly into effective visuomotor policies. It is particularly suited to tasks that are difficult to demonstrate with traditional teleoperation, such as dynamic, precise, bimanual, and long-horizon tasks.

Main features and functions of UMI:

1. Skill transfer: Transfer human manipulation skills demonstrated in complex environments directly to robots, without requiring anyone to write detailed task-specific programs.
2. Data collection: Collect the dynamic manipulation data needed for robot learning, including visual observations and action sequences, through direct human demonstration.
3. Multi-platform deployment: Policies learned with UMI can be deployed across different robot hardware platforms, which makes them hardware-independent.
4. Improved manipulation capability: With UMI, robots can learn more complex and precise manipulation tasks, such as bimanual coordination and fine control.
5. Fast adaptation to new tasks: Robots learn new tasks from human demonstrations instead of being programmed from scratch, which shortens the time needed to adapt to a new task.
6. Lower learning cost: UMI reduces the time and resources needed to teach a robot a new task and deploy it.
7. Broader application of robotics: UMI widens the range of tasks robots can take on in home, service, manufacturing, and other settings.

UMI key technologies and design concepts:

1. Hardware design: UMI uses a handheld gripper fitted with a high-quality camera (a GoPro) to capture visual data while a task is performed. This makes data collection simple and intuitive: operators demonstrate tasks naturally while rich visual and motion information is recorded.
Handheld gripper: A 3D-printed parallel-jaw gripper with soft fingertips improves dexterity and safety. The GoPro camera is mounted on the gripper and serves as the only sensor and recording device for capturing visual information during a demonstration.
Fisheye lens: A 155-degree fisheye lens mounted on the gripper widens the field of view so that enough visual context and key depth cues are captured, which is crucial for learning effective robot policies.
Side mirrors: Because a monocular camera cannot measure depth directly, the UMI design adds side mirrors that provide implicit stereo views to assist depth estimation.
IMU-aware tracking: By using the GoPro's built-in IMU (inertial measurement unit) data, UMI can maintain stable tracking during fast motion, even when motion blur or a lack of visual features would otherwise cause visual tracking to fail.

2. Hardware-independent data collection:
Because it uses a universal handheld gripper and vision system, UMI collects data without depending on any specific robot hardware. The collected data can therefore be used with a variety of robot systems, which improves its reusability and flexibility.
Latency matching: UMI handles the differing latencies of different hardware (such as streaming cameras, robot controllers, and industrial grippers) through inference-time latency matching, which keeps the observation streams and action execution aligned in time.
Action representation: Representing actions as relative trajectories removes the need for actions expressed precisely in a global coordinate frame, which simplifies the transfer from human motion to robot execution.
Diffusion Policy model: UMI uses a Diffusion Policy model to handle multimodal action distributions, which strengthens the policy's ability to learn from complex and diverse human demonstration data.
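
The latency (delay) matching idea described above can be illustrated with a small sketch. This is not UMI's actual code: the function names, the use of linear interpolation, and the single `execution_latency` value are assumptions made for illustration. It shows two steps: resampling a low-latency stream (for example gripper width) at the capture time of a camera frame, and dropping predicted actions whose target time has already passed by the time inference finishes.

```python
import numpy as np


def align_observation(target_time, stream_times, stream_values):
    """Resample a sensor stream at `target_time` by linear interpolation.

    Assumption: the values are quantities that can be interpolated linearly
    (positions, gripper width). Rotations would need a spherical method.
    """
    stream_times = np.asarray(stream_times, dtype=float)
    stream_values = np.asarray(stream_values, dtype=float)
    return np.array([
        np.interp(target_time, stream_times, stream_values[:, d])
        for d in range(stream_values.shape[1])
    ])


def drop_stale_actions(action_times, actions, now, execution_latency):
    """Keep only the actions that can still take effect on time.

    action_times:      (N,) target timestamps in seconds, ascending.
    actions:           (N, D) action vectors aligned with action_times.
    now:               current time in seconds, measured after inference.
    execution_latency: assumed delay between sending a command and the
                       hardware acting on it (hypothetical constant).
    Returns (send_times, kept_actions).
    """
    action_times = np.asarray(action_times, dtype=float)
    actions = np.asarray(actions, dtype=float)
    # An action is usable only if sending it right now is not already too late.
    keep = action_times - execution_latency >= now
    # Send each remaining command early so it takes effect at its target time.
    send_times = action_times[keep] - execution_latency
    return send_times, actions[keep]


# Example: 8 actions at 10 Hz were predicted for t = 10.0 s onward, but
# inference only finished at t = 10.25 s.
action_times = 10.0 + 0.1 * np.arange(8)
actions = np.arange(8, dtype=float).reshape(8, 1)
send_times, kept = drop_stale_actions(
    action_times, actions, now=10.25, execution_latency=0.1
)
```

In a real system each device would have its own measured latency, and the values would come from calibration rather than constants.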

3. Inference-time latency matching and relative-trajectory action representation: UMI implements both in its policy interface, which keeps actions accurate and correctly aligned in time. This is crucial for precise and time-sensitive tasks.

4. Zero-shot generalization: When trained on diverse human demonstrations, policies learned with UMI can generalize zero-shot to new environments and objects. This means the robot can perform tasks in situations it has never seen before, showing a high degree of adaptability and flexibility.
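
The relative-trajectory action representation mentioned in items 2 and 3 can be sketched as follows. This is a minimal illustration, not UMI's implementation: it assumes poses are stored as 4x4 homogeneous transforms, and the helper names (`make_pose`, `to_relative`, `to_absolute`) are hypothetical. Each future end-effector pose is expressed in the frame of the current pose, so the same action sequence is obtained wherever the demonstration took place in the world frame.

```python
import numpy as np


def make_pose(yaw_rad, translation):
    """Build a 4x4 homogeneous transform from a rotation about z and a translation."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    pose = np.eye(4)
    pose[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    pose[:3, 3] = translation
    return pose


def to_relative(current_pose, future_poses):
    """Express each future pose in the frame of the current pose."""
    inv_current = np.linalg.inv(current_pose)
    return np.array([inv_current @ pose for pose in future_poses])


def to_absolute(current_pose, relative_poses):
    """Map relative poses back to the world frame of the executing robot."""
    return np.array([current_pose @ pose for pose in relative_poses])


# A short demonstrated trajectory in some world frame.
current = make_pose(0.3, [0.5, 0.0, 0.2])
future = [make_pose(0.3 + 0.1 * k, [0.5 + 0.05 * k, 0.0, 0.2]) for k in range(1, 4)]
relative = to_relative(current, future)

# The same motion recorded in a different world frame (rigidly shifted).
offset = make_pose(0.7, [1.0, 2.0, 3.0])
relative_shifted = to_relative(offset @ current, [offset @ pose for pose in future])
```

Because the world-frame offset cancels out, `relative` and `relative_shifted` agree up to floating-point error. That cancellation is why no precise global calibration between the demonstration and the robot is needed under this representation.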

Real-world application verification:

The effectiveness of UMI was validated through a series of experiments, including dynamic tossing, precise placement, and bimanual coordination tasks. These experiments demonstrate both the generalization ability of UMI policies and their potential for use in real-world environments.

Project page and demos: https://umi-gripper.github.io
Paper: https://umi-gripper.github.io/umi.pdf
GitHub: https://github.com/real-stanford/universal_manipulation_interface
Hardware guide: https://docs.google.com/document/d/1TPYwV9sNVPAi0ZlAupDMkXZ4CA1hsZx7YDMSmcEy6EU/edit?usp=sharing
Data collection instructions: https://swanky-sphere-ad1.notion.site/UMI-Data-Collection-Tutorial-4db1a1f0f2aa4a2e84d9742720428b4c?pvs=4
