We're helping AI to see the 3D world in motion as humans do. 🌐
Enter D4RT: a unified model that turns video into 4D representations faster than previous methods - enabling it to understand space and time. This is how it works 🧵
To perceive a 3D scene from 2D video, an AI must track every pixel of every object as it moves. 🔍
Capturing this level of geometry and motion usually requires computationally intensive pipelines, leading to slow, fragmented reconstructions. But D4RT takes a different approach.
D4RT encodes an input video once into a compressed representation, then answers questions about it with a lightweight decoder that processes many queries in parallel.
This makes it extremely fast and scalable - whether tracking just a few points or reconstructing an entire scene. 🖼️
Many 4D tasks can now be solved with one model, enabling us to:
👉 Predict a pixel’s 3D trajectory by querying its location across different time steps.
⏱️ Freeze time and the camera viewpoint to reconstruct a scene's complete 3D structure (see the sketch below).
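A minimal sketch of this encode-once, query-in-parallel pattern. Every name and the toy maths here are placeholders we made up for illustration - this is not the real D4RT architecture or API:

```python
import numpy as np

# Toy illustration of the "encode once, query many times" pattern.
# Class, method, and query names are invented for this sketch.
class EncodeOnceQueryMany:
    def encode(self, video):
        # Expensive step, run once per clip: compress the video (T, H, W, 3)
        # into a compact latent state. A real encoder is a neural network;
        # here a per-frame average stands in for it.
        self.latent = video.mean(axis=(1, 2))          # (T, 3)

    def decode(self, queries):
        # Cheap step, run on any batch of queries in parallel.
        # Each query asks: "where is pixel (u, v) of frame t_src,
        # at target time t_tgt, as a 3D point (x, y, z)?"
        q = np.asarray(queries, dtype=float)           # (N, 4): u, v, t_src, t_tgt
        t_tgt = q[:, 3].astype(int)
        # Dummy answer mixing pixel coords with the target frame's latent.
        return np.stack([q[:, 0], q[:, 1], self.latent[t_tgt, 0]], axis=1)

T, H, W = 8, 4, 4
video = np.random.rand(T, H, W, 3)

model = EncodeOnceQueryMany()
model.encode(video)

# Point tracking: one pixel queried at every time step -> a 3D trajectory.
trajectory = model.decode([(2, 3, 0, t) for t in range(T)])                   # (T, 3)

# "Freeze time": every pixel of frame 0 queried at time 0 -> full 3D structure.
pointmap = model.decode([(u, v, 0, 0) for v in range(H) for u in range(W)])   # (H*W, 3)
```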
D4RT can even create and align 3D snapshots of a single moment from different viewpoints - easily recovering the camera's trajectory. 🎥
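D4RT learns this alignment end to end. As a rough illustration of the geometry only (not D4RT's actual mechanism), here is the classical Kabsch/Procrustes alignment, which recovers the rigid transform - i.e. the relative camera pose - between two 3D snapshots of the same instant; the point clouds below are synthetic:

```python
import numpy as np

def kabsch(p, q):
    """Best rigid transform (R, t) mapping points p onto q, both shaped (N, 3)."""
    pc, qc = p.mean(axis=0), q.mean(axis=0)
    h = (p - pc).T @ (q - qc)                      # 3x3 cross-covariance
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))         # guard against reflections
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return r, qc - r @ pc

# Synthetic example: the same instant seen from two camera poses.
rng = np.random.default_rng(0)
cloud_a = rng.random((100, 3))
theta = 0.3
true_r = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
true_t = np.array([0.5, -0.2, 1.0])
cloud_b = cloud_a @ true_r.T + true_t

r, t = kabsch(cloud_a, cloud_b)                    # recovers true_r, true_t
# Chaining such relative transforms over time yields a camera trajectory.
```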
4D reconstruction often fails on dynamic objects, resulting in ghosting artifacts or processing lag. ⏳
D4RT can continuously understand what's moving while running 18x–300x faster than previous methods - processing a 1-minute video in roughly 5 seconds on a single TPU chip.
We believe this research could have wide-ranging applications in the real world.
From spatial awareness in robotics 🤖 to more efficient AR 🕹️ and more capable world models 🌐, we see D4RT as a necessary step on the path towards AGI.
Find out more →
