An AI system that controls the movements of 3D models with language

An open-source project based on large language models that generates the poses and animations of MMD characters through natural language descriptions.
A semantic pose description language called MPL (MMD Pose Language) is used, allowing AI to understand and generate anatomically consistent poses.
Simply enter a natural language command, such as “Wave your right hand with a smile and invite me to dinner,” and the system will generate the corresponding MMD pose code in real time and render smooth skeletal animation.

1. Project overview

  • Name: PoPo — “Pose and animate MMD model with LLM”
  • Function: converting natural language → (via a fine-tuned LLM) → a semantic pose language (MPL) → MMD-usable skeletal animation or pose control
  • Example use case: If you say “wave right hand with big laugh, inviting me for dinner”, the system generates the corresponding skeletal movement, making the model perform that pose/action
  • Real-time rendering: Supports instant pose creation + smooth bone animation
  • Model adaptation: targets MMD/anime-style character models, taking bone constraints, physics, and motion plausibility into account ([GitHub][1])

2. Key technologies and design

Here are some key technical points and architectural elements of the project:

  • MPL (MMD Pose Language): A semantic pose description language defined by PoPo, used as the output target of the LLM instead of having the LLM directly emit low-level data such as raw quaternions or transformation matrices. This has several benefits: the output is more structured, more readable, and more stable; the model can learn “action syntax” better; and it is easier to debug.
  • LLM fine-tuning: A large language model (the project mentions a fine-tuned GPT-4o-mini) is trained on the mapping from natural language to MPL.
  • Front-end / visualization: Next.js + TypeScript build the interface; Babylon.js + babylon-mmd handle 3D rendering and skeletal animation.
  • Training data: The project should include a dataset directory containing natural-language descriptions paired with the corresponding MPL output.
  • Bone / physics / constraint processing: Since the target is an MMD model (typically anime-style, with a specific bone structure and motion constraints), rendering and animation generation must account for bone constraints, prevention of implausible distortion, and motion coherence. The project README mentions “anatomically correct – built-in constraints prevent impossible movements”.
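The source does not show the concrete MPL grammar, so the sketch below invents a hypothetical MPL-like syntax (bone blocks with Euler angles in degrees) and a small TypeScript parser, purely to illustrate why a structured pose language is easier to validate and debug than raw quaternion output:

```typescript
// Hypothetical MPL-like syntax (the real MPL grammar is not documented here):
//   bone <name> { pitch: <deg>, yaw: <deg>, roll: <deg> }
interface BoneCommand {
  bone: string;
  pitch: number; // degrees
  yaw: number;
  roll: number;
}

function parseMpl(src: string): BoneCommand[] {
  const re =
    /bone\s+(\w+)\s*\{\s*pitch:\s*(-?\d+(?:\.\d+)?),\s*yaw:\s*(-?\d+(?:\.\d+)?),\s*roll:\s*(-?\d+(?:\.\d+)?)\s*\}/g;
  const out: BoneCommand[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(src)) !== null) {
    out.push({ bone: m[1], pitch: +m[2], yaw: +m[3], roll: +m[4] });
  }
  return out;
}

const mpl = `
bone right_arm { pitch: -45, yaw: 0, roll: 10 }
bone head { pitch: 0, yaw: 15, roll: 0 }
`;
const cmds = parseMpl(mpl); // two structured bone commands, easy to inspect
```

Because every field is named and bounded to a known bone, a malformed LLM output fails parsing visibly rather than silently producing a distorted pose.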

3. Workflow (from input to output)

The PoPo process can be roughly summarized into the following steps:

  1. User input: Describe an action or pose in natural language (e.g., “raise left arm, look left, smile”).
  2. LLM inference: The fine-tuned language model takes the natural-language input and outputs the corresponding MPL code.
  3. Parsing/transforming MPL: The MPL representation is converted into skeletal animation commands: joint rotations, position transforms, and so on.
  4. Rendering/playback: Babylon.js with an MMD engine (babylon-mmd) applies the skeletal motion to the character model for real-time visualization and playback.
  5. Optional adjustment/editing: Users can fine-tune or manually edit the MPL code (or the corresponding bone commands) to correct or refine the action.
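Step 3 is where semantic rotations become data a skeletal animation layer can consume. A minimal sketch of one such transform, assuming MPL-style Euler angles in degrees must become quaternions for the renderer (the YXZ rotation order used here is an assumption, not the project's documented convention):

```typescript
interface Quat {
  x: number;
  y: number;
  z: number;
  w: number;
}

// Convert (pitch, yaw, roll) in degrees to a unit quaternion,
// composing as q = qYaw * qPitch * qRoll (YXZ order — an assumption).
function eulerDegToQuat(pitchDeg: number, yawDeg: number, rollDeg: number): Quat {
  const d = Math.PI / 180;
  const hp = (pitchDeg * d) / 2; // half-angles
  const hy = (yawDeg * d) / 2;
  const hr = (rollDeg * d) / 2;
  const cp = Math.cos(hp), sp = Math.sin(hp);
  const cy = Math.cos(hy), sy = Math.sin(hy);
  const cr = Math.cos(hr), sr = Math.sin(hr);
  return {
    x: cy * sp * cr + sy * cp * sr,
    y: sy * cp * cr - cy * sp * sr,
    z: cy * cp * sr - sy * sp * cr,
    w: cy * cp * cr + sy * sp * sr,
  };
}
```

In a Babylon.js scene, the resulting quaternion would typically be assigned to a bone's rotation quaternion per frame; the conversion itself is the engine-independent core of the step.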

4. Advantages & Challenges / Limitations

Pros:

  • Natural language control: Users do not need to adjust individual joint parameters manually; they can simply “speak” the action and get a result, improving creative efficiency.
  • Structured pose expression (MPL): Expressing actions in a semantic language is more readable and easier to debug than numerical parameters, which make the model's output hard to interpret.
  • Optimized for MMD: The project is designed specifically for MMD models, with attention to bone constraints and plausibility.
  • Real-time visualization: Front-end rendering shows results instantly, which makes for a friendly interactive experience.
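The README's “anatomically correct – built-in constraints” claim suggests per-bone rotation limits applied to whatever the LLM produces. A minimal sketch of such constraint clamping, with made-up limit values (real limits would come from the MMD rig, not these numbers):

```typescript
// Hypothetical per-bone rotation limits in degrees, as [pitch, yaw, roll].
const LIMITS: Record<string, { min: [number, number, number]; max: [number, number, number] }> = {
  elbow: { min: [0, -10, 0], max: [145, 10, 0] }, // elbows must not bend backwards
  head: { min: [-40, -70, -30], max: [40, 70, 30] },
};

function clampNum(v: number, lo: number, hi: number): number {
  return Math.min(hi, Math.max(lo, v));
}

// Clamp a requested (pitch, yaw, roll) to the bone's allowed range, so an
// LLM-generated pose cannot produce an anatomically impossible joint angle.
function constrain(bone: string, rot: [number, number, number]): [number, number, number] {
  const lim = LIMITS[bone];
  if (!lim) return rot; // unknown bones pass through unchanged
  return [
    clampNum(rot[0], lim.min[0], lim.max[0]),
    clampNum(rot[1], lim.min[1], lim.max[1]),
    clampNum(rot[2], lim.min[2], lim.max[2]),
  ];
}
```

A request to bend the elbow backwards, e.g. `constrain("elbow", [-30, 0, 5])`, is silently pulled back inside the legal range instead of distorting the mesh.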

Challenges / Limitations:

  • Accuracy/detail control: Natural language is inherently vague; a “wave of the hand” has many variations, and the LLM may output an action that differs from what the user expects.
  • Extreme poses / complex movements: Complex dances and coherent animation sequences (multi-frame continuous motion) may be difficult to support.
  • Generality: The system is currently designed specifically for MMD models; the skeletal system may not be compatible with other types of 3D models.
  • Physics / collision / cloth / dynamic simulation: MMD models may also involve hair, clothing, and physical collisions, which go beyond pose control and require additional simulation support.
  • Training data dependency: Effectiveness is limited by the quality and diversity of the natural-language ↔ MPL training data.
  • Inference cost/latency: LLM inference (especially with larger models) can introduce delays, limiting real-time performance.

5. Comparison with similar projects & application prospects

PoPo belongs to the class of language-driven 3D/animation/virtual-human applications. Similar directions are widely discussed in industry and academia, such as converting natural language into motion trajectories, controlling character performances, and automating character animation.

Compared with a conventional animation production pipeline, it has the potential to save substantial manual motion-tuning cost in scenarios such as concept exploration, rapid prototyping, and virtual anchor/avatar demonstrations.

If the project continues to develop, it may be expanded in the future to support coherent motion sequences (motion clips), multimodal input (natural language + gestures/sketches/audio rhythms), etc.

GitHub: https://github.com/AmyangXYZ/PoPo
YouTube:
