Tencent Unveils HY-Embodied-0.5 for Real-World Robotics
- HY-Embodied-0.5 introduces a flexible foundation model family for physical, real-world robotic agents.
- The 2B edge model is optimized for spatial reasoning, while the 32B variant rivals frontier-level performance.
- Tencent adopts a Mixture-of-Transformers architecture to enhance fine-grained visual perception and planning.
The landscape of artificial intelligence is rapidly shifting from the screen to the physical world, and the latest release from Tencent's Robotics X team marks a significant step forward in this evolution. Their new model, HY-Embodied-0.5, is designed specifically to bridge the gap between general-purpose language models and the concrete requirements of physical robots. Unlike standard chatbots that operate solely in text, these models are built to understand the spatial and temporal nuances of our physical environment, enabling robots to predict, interact, and plan in real-world scenarios.
At the heart of this innovation is a unique architectural choice: a Mixture-of-Transformers (MoT) design. Think of this as a specialized brain structure that allows the model to dedicate specific computational pathways to different types of sensory input, rather than trying to process visual and language data through a single, congested pipeline. By integrating latent tokens—compact mathematical representations that summarize key visual features—the model can achieve the high-resolution perception needed for delicate robotic tasks without causing a massive spike in computational demands.
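Tencent has not published implementation details beyond the high-level description above, but the two ideas can be illustrated conceptually. The numpy sketch below is purely illustrative: learned latent queries compress a long stream of visual patch embeddings into a handful of latent tokens (so the joint sequence stays short), attention is shared across modalities, and the "mixture" comes from routing each token through modality-specific feed-forward weights. All dimensions, weights, and names here are invented for the example and random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # embedding dimension (illustrative; real models use far larger)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Standard scaled dot-product attention.
    return softmax(q @ k.T / np.sqrt(d)) @ v

# Latent tokens: 256 visual patch embeddings are summarized into
# 8 compact latent vectors via cross-attention with learned queries.
patches = rng.normal(size=(256, d))
latent_queries = rng.normal(size=(8, d))  # learned in a real model
latents = attend(latent_queries, patches, patches)

# The joint sequence is text tokens plus latents: 20 tokens, not 268,
# which is where the compute savings come from.
text = rng.normal(size=(12, d))
seq = np.concatenate([text, latents])

# Mixture-of-Transformers flavor: shared attention mixes information
# across modalities, but each modality gets its own feed-forward weights.
W_text = rng.normal(size=(d, d))
W_vis = rng.normal(size=(d, d))
is_text = np.array([True] * 12 + [False] * 8)

h = attend(seq, seq, seq)  # shared computational pathway
out = np.where(is_text[:, None], h @ W_text, h @ W_vis)  # modality routing
print(out.shape)  # one hidden state per token in the compressed sequence
```

The key trade-off the sketch captures: attention cost grows with the square of sequence length, so replacing 256 patch tokens with 8 latents keeps high-level visual information available to the planner without the quadratic blow-up.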
This approach offers versatility across different hardware constraints. The suite includes a lightweight 2B parameter version, optimized for deployment on edge devices (hardware located directly on the robot, rather than in a remote server farm), alongside a more powerful 32B model intended for heavy-duty, high-reasoning tasks. The team further refined these models through a self-evolving post-training process, where the larger model's advanced reasoning capabilities are effectively distilled into the smaller variant.
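The article does not specify how the distillation step works; a common recipe for compressing a large model's reasoning into a smaller one is to train the student to match the teacher's softened output distribution. The snippet below is a generic sketch of that loss under assumed details (temperature-scaled KL divergence), not Tencent's actual training objective.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature T > 1 softens the distribution, exposing the
    # teacher's relative preferences over non-top answers.
    e = np.exp((z - z.max()) / T)
    return e / e.sum()

def distill_loss(teacher_logits, student_logits, T=2.0):
    # KL(teacher || student) at temperature T, scaled by T^2
    # (the standard correction so gradients stay comparable across T).
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(T * T * np.sum(p * (np.log(p) - np.log(q))))

# Hypothetical logits from a large teacher and a small student
# over the same three candidate actions.
teacher = np.array([4.0, 1.0, 0.5])
student = np.array([2.0, 1.5, 1.0])
loss = distill_loss(teacher, student)
print(loss > 0)  # nonzero penalty until the student matches the teacher
```

Minimizing this loss pushes the 2B-scale student's predictions toward the 32B-scale teacher's, which is why the edge model can inherit reasoning behavior it would struggle to learn from raw data alone.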
Evaluation data suggests that this strategy is paying off, with the MoT-2B model outperforming its peers on a battery of visual and spatial benchmarks. The 32B version, meanwhile, shows performance comparable to some of the most capable models currently available on the market. By open-sourcing the code and models, Tencent is providing a new toolkit for researchers looking to move robotics beyond simple, repetitive automation and toward intelligent, agentic behavior in the physical world.