World Models Are Shifting from Prediction to Planning: HWM and the Challenge of Long-Term Control

Introduction

Over the past year, research on world models initially centered on representation learning and future prediction: the model first learns to understand the world, then simulates future states internally. This approach has already produced a series of representative results. V-JEPA 2 (Video Joint Embedding Predictive Architecture 2, Meta's video world model released in 2025) was pre-trained on over 1 million hours of internet video and then combined with a small amount of robot interaction data, demonstrating the potential of world models for understanding, prediction, and zero-shot robot planning.

However, prediction does not equal the ability to handle long tasks. In multi-stage control, systems typically face two pressures. First, prediction errors accumulate during long rollouts, so the planned path drifts further and further from the target. Second, the action search space expands rapidly as the horizon grows, which drives up planning costs. HWM does not rewrite the underlying learning route of world models. Instead, it adds a hierarchical planning structure on top of existing action-conditioned world models, allowing the system to first organize a stage-wise path and then handle local actions.

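Technically, V-JEPA 2 (more focused on world representation and basic prediction), HWM (more focused on long-term planning), and WAV (World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry; more focused on verification) address three complementary parts of this route: representation, planning, and verification.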

  1. Why Long-Term Control Remains a Bottleneck for World Models

The difficulty of long-term control becomes clearer in robot tasks. Take a robotic arm as an example: picking up a cup and placing it into a drawer is not a single action but a sequence of continuous steps. The system must approach the object, adjust its posture, grasp it, move to the target location, then handle the drawer and place the object. When the chain is long, two problems occur simultaneously. One is that prediction errors continue to accumulate along the rollout; the other is that the action search space expands rapidly.
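Both pressures can be made concrete with a toy calculation. The sketch below is illustrative only: the one-dimensional dynamics, the 5% model bias, and the function names are assumptions chosen for the example, not anything taken from the HWM or V-JEPA 2 papers.

```python
def true_step(x: float, a: float) -> float:
    """Ground-truth toy dynamics: the state moves by exactly the action."""
    return x + a


def model_step(x: float, a: float) -> float:
    """Hypothetical learned model with a small 5% bias on every step."""
    return x + 1.05 * a


def rollout_gap(horizon: int, action: float = 1.0) -> float:
    """Gap between the imagined and the real state after `horizon` steps."""
    x_true = x_model = 0.0
    for _ in range(horizon):
        x_true = true_step(x_true, action)
        x_model = model_step(x_model, action)
    return abs(x_model - x_true)


def num_action_sequences(num_actions: int, horizon: int) -> int:
    """Number of distinct action sequences a flat planner could consider."""
    return num_actions ** horizon


if __name__ == "__main__":
    for h in (5, 10, 50):
        print(h, round(rollout_gap(h), 3), num_action_sequences(4, h))
```

Even with a small per-step bias, the gap grows with the horizon, while the number of candidate sequences grows exponentially. Real learned models can drift faster than this linear toy case, because errors feed back into later predictions.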

What the system usually lacks is not local prediction ability but the capacity to organize distant goals into stage-wise paths. Many actions seem, from a local perspective, to deviate from the target, yet they are intermediate steps necessary to achieve the goal. For example, the arm may need to lift higher before grasping, or step back slightly and adjust its angle before opening the drawer.

In demonstration tasks, world models can already produce coherent predictions. Once they enter real control scenarios, however, performance declines. The pressure comes not only from the representation itself but also from the immature planning layer.

  2. How HWM Reconstructs the Planning Process

HWM splits the original single-layer planning process into two layers. The upper layer handles stage directions on a longer time scale, while the lower layer manages local execution on a shorter time scale. Rather than planning at a single tempo, the model plans at two time scales at once.

For long tasks, single-layer methods typically have to search the entire action chain directly in the low-level action space. The longer the task, the higher the search cost, and the more easily prediction errors propagate along multi-step rollouts. After the split, the high level only handles route selection on a longer time scale, and the low level only executes the current segment of actions. The long task is divided into multiple shorter tasks, which reduces planning complexity.
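The reduction in search cost can be sketched by counting candidates under exhaustive enumeration. This is a back-of-the-envelope model, not HWM's actual optimizer: the function names, the number of high-level options, and the assumption that each segment is planned independently are all hypothetical.

```python
def flat_candidates(num_actions: int, horizon: int) -> int:
    """Sequences a single-layer planner would enumerate over the full horizon."""
    return num_actions ** horizon


def hierarchical_candidates(
    num_actions: int,
    horizon: int,
    segment_len: int,
    num_high_level_options: int,
) -> int:
    """Candidates when the horizon is split into equal segments.

    The high level chooses one option per segment; the low level then
    searches each short segment on its own.
    """
    if horizon % segment_len != 0:
        raise ValueError("horizon must be a multiple of segment_len")
    num_segments = horizon // segment_len
    high_level = num_high_level_options ** num_segments
    low_level = num_segments * num_actions ** segment_len
    return high_level + low_level


if __name__ == "__main__":
    print(flat_candidates(4, 12))                # 16777216
    print(hierarchical_candidates(4, 12, 4, 4))  # 832
```

In this toy setting, 4 low-level actions over 12 steps give about 16.8 million flat sequences, while three 4-step segments with 4 high-level options each give 832 candidates. Practical planners sample rather than enumerate, so the real savings depend on the optimizer and the environment.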

A key design choice is that a high-level action is not simply the difference between two states. Instead, an encoder compresses a segment of low-level actions into a higher-level action representation. For long tasks, what matters is not only how much the start and end points differ but also how the intermediate steps are organized. If the high level only considers displacement differences, it risks losing the path information contained in the action chain.
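A small example shows why a pure state difference can lose path information. The position-weighted sum below is a hand-written stand-in for a learned encoder, used only because it is sensitive to step order; HWM's actual encoder is learned, and every name here is an assumption for illustration.

```python
from typing import List, Tuple

Action = Tuple[float, float]  # a 2D displacement per low-level step


def displacement_code(actions: List[Action]) -> Tuple[float, float]:
    """State-difference view: only the net displacement of the segment."""
    return (sum(a[0] for a in actions), sum(a[1] for a in actions))


def segment_code(actions: List[Action]) -> Tuple[float, float]:
    """Toy order-sensitive encoding: later steps receive larger weights."""
    return (
        sum((t + 1) * a[0] for t, a in enumerate(actions)),
        sum((t + 1) * a[1] for t, a in enumerate(actions)),
    )


# Two segments with the same start and end points but different routes.
right_then_up = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 1.0)]
up_then_right = [(0.0, 1.0), (0.0, 1.0), (1.0, 0.0), (1.0, 0.0)]

if __name__ == "__main__":
    print(displacement_code(right_then_up), displacement_code(up_then_right))
    print(segment_code(right_then_up), segment_code(up_then_right))
```

Both segments collapse to the same displacement `(2.0, 2.0)`, yet the order-sensitive code separates them as `(3.0, 7.0)` and `(7.0, 3.0)`. If an obstacle blocks one of the two routes, only the second kind of representation lets the high level tell them apart.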

HWM embodies a hierarchical approach to task organization. For multi-stage work, the system no longer unfolds all actions at once but first forms a coarse stage path, then executes and refines it segment by segment. Once this hierarchical relationship is integrated into the world model, predictive ability translates more reliably into planning capability.
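The "coarse stage path first, then segment-by-segment execution" pattern can be sketched on a one-dimensional toy world. This is a minimal illustration under strong assumptions: the integer-line environment, the perfect `step` model, the exhaustive low-level search, and all function names are hypothetical and do not reproduce HWM's implementation.

```python
from itertools import product
from typing import List, Tuple

ACTIONS = (-1, 0, 1)  # hypothetical low-level action set


def step(state: int, action: int) -> int:
    """Stand-in world model: the state moves by the chosen action."""
    return state + action


def plan_high_level(state: int, goal: int, segment_len: int) -> List[int]:
    """Coarse stage path: subgoals spaced at most `segment_len` apart."""
    subgoals = []
    s = state
    while s != goal:
        s += max(-segment_len, min(segment_len, goal - s))
        subgoals.append(s)
    return subgoals


def plan_low_level(state: int, subgoal: int, horizon: int) -> List[int]:
    """Search short action sequences; keep the one ending nearest the subgoal."""
    best_seq: Tuple[int, ...] = ()
    best_cost = float("inf")
    for seq in product(ACTIONS, repeat=horizon):
        s = state
        for a in seq:
            s = step(s, a)
        cost = abs(s - subgoal)
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return list(best_seq)


def hierarchical_plan(state: int, goal: int, segment_len: int = 3):
    """Plan stage by stage and return (all actions, final state)."""
    actions: List[int] = []
    for subgoal in plan_high_level(state, goal, segment_len):
        segment = plan_low_level(state, subgoal, segment_len)
        for a in segment:
            state = step(state, a)
        actions.extend(segment)
    return actions, state


if __name__ == "__main__":
    print(plan_high_level(0, 7, 3))
    print(hierarchical_plan(0, 7))
```

Each low-level search only covers `3 ** 3 = 27` sequences, regardless of how far away the final goal is. A real system would also re-plan after executing each segment, since the observed state rarely matches the imagined one exactly.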

  3. From 0% to 70%, What Do the Experimental Results Show?

In the paper’s setup for a real-world grasp-and-place task, the system receives only the final goal condition and no manually crafted intermediate goals. Under these conditions, HWM achieves a success rate of 70%, while a single-layer world model achieves 0%. A task that the single-layer planner could not complete at all becomes feasible once hierarchical planning is introduced.

The paper also tested simulated tasks such as object pushing and maze navigation. The results show that hierarchical planning not only improves success rates but also reduces the computational cost of the planning phase. In some environments, that cost drops to about a quarter of the original while success rates are maintained or even improved.

  4. From V-JEPA 2 to HWM to WAV

V-JEPA 2 represents the world representation route. It pre-trains on over 1 million hours of internet videos, then fine-tunes with less than 62 hours of robot videos to produce a latent action-conditioned world model capable of understanding, predicting, and planning physical interactions. This demonstrates that models can acquire world representations through large-scale observation and transfer these representations to robot planning.

HWM is the next step. The model already possesses world representation and basic prediction capabilities, but when it comes to multi-stage control, error accumulation and search space expansion become problematic. HWM does not change the underlying representation learning route but adds multi-time-scale planning structures on top of the existing action-conditioned world model. Its focus is on how the model organizes distant goals into intermediate steps and advances incrementally.

WAV further emphasizes verification. For world models to enter policy optimization and deployment, they must do more than predict: they need to identify regions where their predictions are prone to distortion and correct themselves accordingly. WAV focuses on how the model checks its own predictions.
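The text above does not describe WAV's mechanism beyond its title, so the following is not a reproduction of WAV. It is a generic sketch of one way a model could check its own predictions: run a forward model, ask an inverse model which action would explain the predicted transition, and flag cases where the two disagree. The toy models, the deliberately planted flaw, and the tolerance are all assumptions.

```python
from typing import List, Tuple


def forward_model(state: int, action: int) -> int:
    """Hypothetical learned forward model.

    It has a planted flaw: for states >= 5 (imagined to be poorly covered
    by training data) it wrongly predicts that nothing moves.
    """
    if state >= 5:
        return state
    return state + action


def inverse_model(state: int, next_state: int) -> int:
    """Hypothetical inverse dynamics: infer the action from a transition."""
    return next_state - state


def flag_unreliable(
    transitions: List[Tuple[int, int]], tol: float = 0.5
) -> List[Tuple[int, int]]:
    """Return (state, action) pairs where forward and inverse models disagree."""
    flagged = []
    for state, action in transitions:
        predicted_next = forward_model(state, action)
        recovered_action = inverse_model(state, predicted_next)
        if abs(recovered_action - action) > tol:
            flagged.append((state, action))
    return flagged


if __name__ == "__main__":
    probes = [(0, 1), (2, -1), (5, 1), (7, -1)]
    print(flag_unreliable(probes))
```

On these probes the check flags `(5, 1)` and `(7, -1)`, which fall in the region where the toy forward model is wrong. Flagged regions are natural candidates for collecting more data or for down-weighting imagined rollouts during planning.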

V-JEPA 2 emphasizes world representation, HWM emphasizes task planning, and WAV emphasizes result verification. Although their focus points differ, they share an overarching goal. The next stage of world models will not only involve internal prediction but will also integrate prediction, planning, and verification into a cohesive system.

  5. From Internal Prediction to an Executable System

Much past work on world models focused on improving the continuity of future state prediction or the stability of internal world representations. Current research is shifting toward systems that not only form judgments about the environment but also convert those judgments into actions and then keep refining them based on outcomes. To approach real deployment, such systems need to control error propagation in long-term tasks, compress search spaces, and reduce inference costs.

This shift also impacts AI agents. Many agent systems can already perform short-horizon tasks, such as tool use, file reading, or executing multi-step instructions. But when tasks become long-horizon, multi-stage, and require mid-course re-planning, performance drops. This is not fundamentally different from challenges in robot control—both stem from insufficient high-level path organization, leading to disconnection between local execution and overall goals.

The hierarchical approach offered by HWM—high-level path and stage goal management, low-level action and feedback handling, plus result verification—will likely continue to appear in more systems. The next phase of world models will focus less on mere prediction and more on organizing prediction, execution, and correction into a runnable, integrated pathway.
