Egocentric video from Meta Quest 3, showing SceneScript's predictions of room layout, objects, and object decompositions

Meta Reality Labs Research

SceneScript: an AI model and method
to understand and describe 3D spaces

SceneScript is a method for representing and inferring scene geometry using an autoregressive structured language model and end-to-end learning.

What is it for?

SceneScript allows AR & AI devices to understand the geometry of physical spaces

Shown here on footage captured by Project Aria glasses, SceneScript takes visual input and estimates scene elements such as walls, doors, and windows.

Shown on Meta Quest, scene elements predicted by SceneScript can be arbitrarily extended to include new architectural features, objects, and even object decompositions.

How does it work?

SceneScript jointly estimates room layouts and objects from visual data using end-to-end learning

End-to-end learning avoids the need for fragile ‘hard-coded’ rules.

An icon showing SceneScript input

SceneScript is given visual information in the form of images or point clouds from an egocentric device.

An icon showing SceneScript CODEC network

SceneScript encodes the visual information into a latent representation, which describes the physical space.

An icon showing SceneScript language representation

SceneScript decodes the latent representation to a concise, parametric, and interpretable language, similar to CAD.

An icon showing SceneScript output interpreted in 3D

A 3D interpreter can convert the language to a geometric representation of the physical space.
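To make the last two steps concrete, below is a minimal Python sketch of what a SceneScript-style command sequence and a toy 3D interpreter could look like. The command names and parameters (make_wall, make_door, a_x, height, and so on) are illustrative stand-ins rather than the model's exact grammar, and the interpreter here only extrudes wall segments into 3D quads.

# Illustrative only: a toy SceneScript-style command string and a minimal
# interpreter that turns wall commands into 3D quads. The real grammar and
# parameterization used by the model may differ.

import numpy as np

SCENE = """
make_wall, a_x=0.0, a_y=0.0, b_x=4.0, b_y=0.0, height=2.7
make_wall, a_x=4.0, a_y=0.0, b_x=4.0, b_y=3.0, height=2.7
make_door, wall_id=0, position_x=1.0, width=0.9, height=2.1
"""

def parse(script: str):
    """Parse each line into (command_name, {parameter: value})."""
    commands = []
    for line in script.strip().splitlines():
        name, *params = [p.strip() for p in line.split(",")]
        kwargs = {}
        for p in params:
            key, value = p.split("=")
            kwargs[key.strip()] = float(value)
        commands.append((name, kwargs))
    return commands

def wall_to_quad(cmd: dict) -> np.ndarray:
    """Extrude a wall segment (a -> b) upward by `height` into a 4-vertex quad."""
    a = np.array([cmd["a_x"], cmd["a_y"], 0.0])
    b = np.array([cmd["b_x"], cmd["b_y"], 0.0])
    up = np.array([0.0, 0.0, cmd["height"]])
    return np.stack([a, b, b + up, a + up])

for name, kwargs in parse(SCENE):
    if name == "make_wall":
        print(name, wall_to_quad(kwargs))
    else:
        print(name, kwargs)  # doors, windows, and objects would be handled analogously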

What makes it special?

Flexible and extensible model design

Trained in Simulation

SceneScript is trained using Aria Synthetic Environments, a fully simulated dataset consisting of 100,000 unique interior environments, each simulated with the same camera characteristics as Project Aria.

The Aria Synthetic Environments dataset was made available to academic researchers last year, along with a public research challenge, to accelerate open research in this area.

Learn more about the dataset
Learn more about the challenge

Adaptable to new scene representations

Because SceneScript both represents and predicts scenes using pure language, the set of scene elements can be extended simply by expanding the language used to describe the simulated data.

This means that, unlike traditional rule-based approaches to scene reconstruction, there is no need to train new 3D detectors for different objects or scene elements.
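As a hedged illustration of that extensibility, the sketch below treats the scene language as a small command vocabulary: supporting a new element such as an object primitive amounts to registering one more command and regenerating the simulated training scripts. The make_prim entry and all parameter names here are hypothetical, not the model's actual vocabulary.

# Hypothetical sketch: extending a SceneScript-style vocabulary with a new
# command for object primitives. Only the command vocabulary and the synthetic
# training scripts need to change; no new detector or architecture is required.

COMMANDS = {
    "make_wall": ["a_x", "a_y", "b_x", "b_y", "height"],
    "make_door": ["wall_id", "position_x", "width", "height"],
    "make_bbox": ["class", "center_x", "center_y", "center_z",
                  "size_x", "size_y", "size_z", "yaw"],
}

# Extending the representation = extending the language.
COMMANDS["make_prim"] = ["bbox_id", "prim_type",
                         "center_x", "center_y", "center_z",
                         "size_x", "size_y", "size_z"]

def vocabulary(commands: dict) -> list[str]:
    """Flatten command names and parameter slots into a token vocabulary."""
    tokens = list(commands)
    for params in commands.values():
        for p in params:
            if p not in tokens:
                tokens.append(p)
    return tokens

print(vocabulary(COMMANDS))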

A screenshot from SceneScript application, showing the decomposition of object primitives
Examples showing how SceneScript could be used to allow LLMs to reason about physical spaces

Enables LLMs to reason about physical spaces

SceneScript uses the same next-token-prediction approach as large language models. This gives AI models the vocabulary needed to reason about physical spaces.

This advancement could ultimately unlock the next generation of digital assistants by providing the real-world context needed to answer complex spatial queries.
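For readers who want to see the mechanism, here is a minimal, self-contained sketch of the next-token decoding loop described above. The next_token_logits function is a dummy stand-in for SceneScript's transformer decoder, and the tiny vocabulary is illustrative only.

# Sketch of autoregressive, LLM-style decoding of a scene-language sequence.
# The dummy logits function stands in for the real transformer decoder.

import random

START, STOP = "<start>", "<stop>"
VOCAB = [STOP, "make_wall", "make_door", "make_bbox"] + [str(i) for i in range(10)]

def next_token_logits(scene_features, prefix):
    """Dummy stand-in for the transformer decoder: one score per vocabulary token."""
    random.seed(len(prefix))                       # deterministic placeholder
    return [random.random() for _ in VOCAB]

def decode(scene_features, max_tokens=32):
    """Greedy next-token prediction, stopping when the <stop> token is produced."""
    tokens = [START]
    for _ in range(max_tokens):
        logits = next_token_logits(scene_features, tokens)
        token = VOCAB[logits.index(max(logits))]   # greedy argmax over the vocabulary
        if token == STOP:
            break
        tokens.append(token)
    return tokens[1:]                              # drop the start token

print(decode(scene_features=None))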

Read the blog post on AI.Meta.com

Learn more about SceneScript

For more information about the SceneScript research, read our paper on arXiv and watch the supplementary video.

SCENESCRIPT RESEARCH PAPER
A screenshot from the SceneScript research paper.

Frequently Asked Questions

Because SceneScript is trained in simulation, no real-world data was used for training the model. To ensure that SceneScript works as expected for real-world scenes, the model was validated in fully consented environments.

The base point cloud encoder and decoder each comprise approximately 35M parameters, for a total of around 70M parameters.

The model is trained to convergence over about 200k iterations, which takes roughly three days in total.

No, SceneScript is a research project from Reality Labs Research.

The model has been trained exclusively on synthetic indoor scenes, so inference on outdoor scenes may produce unpredictable outputs.

At a high level, the model consists of an encoder and a decoder. The encoder uses a series of 3D sparse convolution blocks to pool a large point cloud into a small number of features. A transformer decoder then autoregressively generates tokens, using the encoder's features as context for cross-attention.
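The following PyTorch sketch shows the rough shape of that encoder-decoder split. It is not the published implementation: the sparse 3D convolutions are replaced by a point-wise MLP plus pooling, and all layer sizes and the vocabulary size are placeholders rather than the real configuration.

# Structural sketch only: a point-cloud encoder pooled to a few feature
# vectors, and a transformer decoder that cross-attends to them while
# generating tokens autoregressively.

import torch
import torch.nn as nn

class PointCloudEncoder(nn.Module):
    """Stand-in encoder: point-wise MLP + pooling instead of sparse 3D convolutions."""
    def __init__(self, d_model=512, n_latents=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, d_model))
        self.pool = nn.AdaptiveAvgPool1d(n_latents)   # pools many points down to a few feature vectors

    def forward(self, points):                        # points: (B, N, 3)
        feats = self.mlp(points)                      # (B, N, d_model)
        return self.pool(feats.transpose(1, 2)).transpose(1, 2)  # (B, n_latents, d_model)

class SceneTokenDecoder(nn.Module):
    """Transformer decoder that cross-attends to encoder features while generating tokens."""
    def __init__(self, vocab_size=1024, d_model=512, n_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, memory):                # tokens: (B, T); memory: (B, n_latents, d_model)
        T = tokens.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)  # autoregressive mask
        x = self.decoder(self.embed(tokens), memory, tgt_mask=causal)
        return self.head(x)                           # next-token logits, shape (B, T, vocab_size)

points = torch.randn(1, 8192, 3)                      # one unstructured point cloud
tokens = torch.randint(0, 1024, (1, 16))              # partially decoded scene-language sequence
logits = SceneTokenDecoder()(tokens, PointCloudEncoder()(points))
print(logits.shape)                                   # torch.Size([1, 16, 1024])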

Using a vanilla, non-optimized transformer implemented directly in PyTorch, decoding 256 tokens, equivalent to a medium-sized scene containing walls, doors, windows, and object bounding boxes, takes approximately 2-3 seconds.

No, not at the moment: the current model is trained on sequences that simulate what would be captured on Project Aria glasses. However, the model could be fine-tuned for a different camera model with a different kind of lens.

Currently SceneScript is only available to internal research teams at Meta.

Acknowledgements

Research authors

Armen Avetisyan, Chris Xie, Henry Howard-Jenkins, Tsun-Yi Yang,

Samir Aroudj, Suvam Patra, Fuyang Zhang, Duncan Frost, Luke Holland, Campbell Orme,

Edward Miller, Jakob Engel, Richard Newcombe, Vasileios Balntas.
