Tweet Overview
View this X/Twitter post from @GoogleDeepMind, published January 22, 2026 at 3:01 PM. This post contains 1 video and 1 image.
We're helping AI to see the 3D world in motion as humans do. Enter D4RT: a unified model that turns video into 4D representations faster than previous methods, enabling it to understand space and time. This is how it works:
To perceive a 3D scene captured on 2D video, an AI must track every pixel of every object as it moves. Capturing this level of geometry and motion requires computationally intensive processing, leading to slow and fragmented reconstructions. But D4RT takes a different approach.
D4RT encodes the input video into a compressed representation, then queries that representation with a lightweight decoder that answers many queries in parallel. This makes it extremely fast and scalable, whether tracking just a few points or reconstructing an entire scene.
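The encode-once, query-many pattern can be sketched in a few lines. This is a toy illustration, not the real D4RT architecture: the class names, shapes, and the pooling/decoding math are all placeholder assumptions; the point is only that the expensive encoder runs once per clip, while the cheap decoder answers an arbitrary batch of queries in one vectorized pass.

```python
# Hypothetical sketch of D4RT's encode-once, query-many pattern.
# All names, shapes, and operations here are stand-in assumptions.
import numpy as np

class Encoder:
    """Stand-in for the heavy video encoder: runs ONCE per clip."""
    def encode(self, video):  # video: (frames, H, W, 3)
        # Toy compression: pool each frame down to one latent value.
        return video.reshape(video.shape[0], -1).mean(axis=1, keepdims=True)

class LightweightDecoder:
    """Stand-in for the cheap decoder: answers many queries in parallel."""
    def query(self, latents, queries):
        # queries: (N, 3) rows of (frame_index, u, v). Each row is answered
        # independently, so N queries are a single vectorized pass.
        frames = queries[:, 0].astype(int)
        return latents[frames] * queries[:, 1:3]  # toy per-query output

video = np.random.rand(8, 4, 4, 3)            # tiny 8-frame clip
latents = Encoder().encode(video)             # encoded once
queries = np.array([[0, 0.2, 0.5], [7, 0.9, 0.1]])
out = LightweightDecoder().query(latents, queries)
print(out.shape)                              # one answer row per query
```

Whether you issue two queries or two million, the encoder cost is paid only once; that is what makes the scheme scale from tracking a few points to reconstructing a whole scene.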
Many 4D tasks can now be solved with one model, enabling us to:
- Predict a pixel's 3D trajectory by querying its location across different time steps.
- Freeze time and the camera viewpoint to generate a scene's complete 3D structure.
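Both tasks above reduce to the same query primitive. In this toy sketch (an assumption about the interface, not the actual API), a single function maps a pixel plus source and target times to a 3D point; varying the target time gives a trajectory, while fixing it and sweeping over all pixels gives a full 3D snapshot.

```python
# Toy unified query primitive (hypothetical interface, not the real model):
# f(pixel, t_src, t_tgt) -> 3D point.
import numpy as np

def query_point(pixel, t_src, t_tgt):
    """Stand-in that returns a deterministic fake 3D point."""
    u, v = pixel
    return np.array([u + 0.01 * t_tgt, v, 1.0 + 0.1 * (t_tgt - t_src)])

# Task 1: a pixel's 3D trajectory = the same pixel queried across
# different target times.
trajectory = [query_point((10, 20), t_src=0, t_tgt=t) for t in range(5)]

# Task 2: the scene's full 3D structure at one frozen moment = every
# pixel queried at the same target time.
H, W = 4, 4
structure = [query_point((u, v), t_src=0, t_tgt=0)
             for u in range(H) for v in range(W)]
print(len(trajectory), len(structure))  # 5 16
```

One primitive, two tasks: which of the query's arguments you sweep determines whether you get motion through time or geometry at an instant.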
D4RT can even create and align 3D snapshots of a single moment from different viewpoints, easily recovering the camera's trajectory.
4D reconstruction often fails on dynamic objects, producing ghosting artifacts or processing lag. D4RT continuously understands what's moving while running 18x to 300x faster than previous methods, processing a 1-minute video in roughly 5 seconds on a single TPU chip.
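As a back-of-envelope check of that speed claim (assuming a 30 fps clip, which the post does not specify), a 1-minute video processed in 5 seconds implies a throughput of several hundred frames per second:

```python
# Rough throughput implied by the post's numbers.
fps = 30                  # ASSUMED frame rate; not stated in the post
frames = 60 * fps         # 1800 frames in a 1-minute clip
seconds = 5               # stated processing time
throughput = frames / seconds
print(throughput)         # 360 frames reconstructed per second
```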
We believe this research could have wide-ranging real-world applications: providing spatial awareness in robotics, improving efficiency in AR, and expanding the capabilities of world models. D4RT is a necessary step on the path towards AGI. Find out more below.