oai:arXiv.org:2406.10667
Computer Science
2024
19.06.2024
Learning predictive world models is essential for enhancing the planning capabilities of reinforcement learning agents.
Notably, MuZero-style algorithms, which build on the value equivalence principle and Monte Carlo Tree Search (MCTS), have achieved superhuman performance in various domains.
However, in environments that require capturing long-term dependencies, MuZero's performance deteriorates rapidly.
We identify that this is partially due to the \textit{entanglement} of latent representations with historical information, which results in incompatibility with the auxiliary self-supervised state regularization.
To overcome this limitation, we present \textit{UniZero}, a novel approach that \textit{disentangles} latent states from implicit latent history using a transformer-based latent world model.
By concurrently predicting latent dynamics and decision-oriented quantities conditioned on the learned latent history, UniZero enables joint optimization of the long-horizon world model and policy, facilitating broader and more efficient planning in latent space.
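The following is a minimal, hypothetical sketch of the kind of transformer-based latent world model described above, not the actual UniZero/LightZero implementation: latent states and actions are embedded as an interleaved token sequence, a causal transformer encodes the implicit latent history, and separate heads predict the next latent state, reward, value, and policy conditioned on that history. All module names, dimensions, and the token layout are illustrative assumptions.

```python
# Hypothetical sketch (not the official UniZero code): a causal transformer over
# interleaved (latent, action) tokens with heads for latent dynamics and
# decision-oriented quantities, as described in the abstract.
import torch
import torch.nn as nn


class LatentWorldModelSketch(nn.Module):
    def __init__(self, latent_dim=256, num_actions=18, n_layer=4, n_head=8, max_len=64):
        super().__init__()
        self.latent_embed = nn.Linear(latent_dim, latent_dim)
        self.action_embed = nn.Embedding(num_actions, latent_dim)
        self.pos_embed = nn.Embedding(2 * max_len, latent_dim)
        layer = nn.TransformerEncoderLayer(latent_dim, n_head, 4 * latent_dim, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layer)
        # Heads for latent dynamics and decision-oriented quantities.
        self.next_latent_head = nn.Linear(latent_dim, latent_dim)
        self.reward_head = nn.Linear(latent_dim, 1)
        self.value_head = nn.Linear(latent_dim, 1)
        self.policy_head = nn.Linear(latent_dim, num_actions)

    def forward(self, latents, actions):
        # latents: (B, T, latent_dim) latent states; actions: (B, T) discrete actions.
        B, T, D = latents.shape
        # Interleave tokens as (z_1, a_1, z_2, a_2, ...).
        tokens = torch.stack(
            [self.latent_embed(latents), self.action_embed(actions)], dim=2
        ).reshape(B, 2 * T, D)
        tokens = tokens + self.pos_embed(torch.arange(2 * T, device=tokens.device))
        # Causal mask so each token only attends to its latent history.
        causal_mask = torch.triu(
            torch.full((2 * T, 2 * T), float("-inf"), device=tokens.device), diagonal=1
        )
        h = self.transformer(tokens, mask=causal_mask)
        h_action = h[:, 1::2]   # features after each (z_t, a_t) pair
        h_latent = h[:, 0::2]   # features after each z_t
        return (
            self.next_latent_head(h_action),  # predicted next latent state
            self.reward_head(h_action),       # predicted reward
            self.value_head(h_latent),        # value conditioned on latent history
            self.policy_head(h_latent),       # policy logits conditioned on latent history
        )
```

In a setup like this, the dynamics, reward, value, and policy losses would be optimized jointly over the same latent history, which is the joint long-horizon optimization the abstract refers to; the exact heads, losses, and tokenization in UniZero may differ.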
We demonstrate that UniZero, even with single-frame inputs, matches or surpasses the performance of MuZero-style algorithms on the Atari 100k benchmark.
Furthermore, it significantly outperforms prior baselines in benchmarks that require long-term memory.
Lastly, we validate the effectiveness and scalability of our design choices through extensive ablation studies, visual analyses, and multi-task learning results.
The code is available at https://github.com/opendilab/LightZero.
Comment: 32 pages, 16 figures
Pu, Yuan; Niu, Yazhe; Ren, Jiyuan; Yang, Zhenjie; Li, Hongsheng; Liu, Yu, 2024, UniZero: Generalized and Efficient Planning with Scalable Latent World Models