Research

Sonata: Advancing Self-Supervised Learning for 3D Point Representations

June 4, 2025 · 5 min read

Key Takeaways

    • We developed a novel self-supervised learning (SSL) approach for point cloud representation learning.
    • Key to this strong performance is identifying and overcoming the “geometric shortcut”, a point-cloud-specific form of representation collapse.
    • The strong point cloud representations learned by Sonata enable zero-shot correspondence, strong and efficient linear probing for semantic segmentation (more than tripling performance from 21.8% to 72.5% on ScanNet), and state-of-the-art performance across indoor and outdoor segmentation tasks when fully finetuned.
    • We open-source our model, its weights, and training code to facilitate open research.

We are excited to introduce Sonata, a groundbreaking approach to self-supervised learning (SSL) for point clouds. SSL is a powerful way of training machine learning models because it doesn’t require any ground truth labels, which lets a model learn from a large variety and quantity of data during training. Most SSL frameworks focus on learning from text, images, or video. Sonata focuses on 3D point cloud data from sources such as RGB-D sensors, Structure-from-Motion (SfM) point clouds like those from Project Aria, and LiDAR scanners. Point clouds can also be extracted from monocular-depth or multi-view reconstruction systems based on Neural Radiance Fields, Gaussian Splatting, or end-to-end learned models like DUSt3R and VGGT, enabling 3D reasoning directly from video.

3D point clouds organize information in a fundamentally different way than images

Unlike images (left), point cloud color values are not organized in a regular 2D grid, but are instead associated with x, y, z positions (top). The point coordinates alone (bottom), which are used as positional encodings for the model, contain valuable information about the scene, which can lead to representation collapse if not treated properly.

The principal difference between image and point cloud data is how the information (i.e., color values) is organized. In images, color values are arranged in a consistent, structured 2D grid. Point clouds have no such regular, uniform structure. Instead, as visualized above, the information is organized via the point coordinates themselves, whose local structure and statistics can vary widely depending on the 3D data source (e.g., RGB-D, SfM, LiDAR, as described above). This means that a model that uses the point coordinates in its operations has access to additional information. Unlike typical text/image/video SSL formulations, we cannot easily mask out this spatial information, since the coordinates serve as the model’s input positional encodings. This leads to what we call the “geometric shortcut”: the model can “cheat” by exploiting low-level statistics of the point coordinates rather than reasoning holistically about the entire scene using the color information, as a good 3D representation should. As shown below, the geometric shortcut leads to poor representations in related work such as CSC and MSC, which collapse to point height or surface normal information.

We select a point on the sofa arm and compute pairwise similarity heatmaps with other points. This reveals that prior methods (Contrastive Scene Contexts (CSC), Masked Scene Contrast (MSC)) collapse to low-level spatial features like point height or surface normals. In contrast, Sonata can extract more discriminative features as demonstrated by highlighting all sofa arms in the scene.
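As a concrete illustration of these heatmaps, the sketch below computes cosine similarity between a query point’s feature and all other points. The array name and shapes are placeholders for per-point features from a pretrained encoder, not part of any released API.

```python
# Minimal sketch of the pairwise similarity heatmaps shown above (assumed
# inputs: an (N, C) array of per-point features and a query point index).
import numpy as np

def similarity_heatmap(feats: np.ndarray, query_idx: int) -> np.ndarray:
    """Cosine similarity between one query point and all N points."""
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    return f @ f[query_idx]  # (N,) values in [-1, 1]

# Features that collapse to low-level geometry produce heatmaps that follow
# point height or normals; discriminative features highlight semantically
# related regions, e.g., all sofa arms in the scene.
```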

Sonata Self-Supervised Framework

The Sonata framework is built on a point-based self-distillation approach similar to that used in DINOv2, leveraging point clouds from 140k scenes. We introduce two innovative strategies to address the fundamental challenges of point clouds and the “geometric shortcut”:

  • Spatial Obscuration: By applying SSL losses at coarser spatial scales and disturbing the spatial information of masked point features, Sonata reduces the model's dependence on easily accessible geometric cues, limiting the impact of the “geometric shortcut”.
  • Decoder-Free Architecture: Sonata moves beyond the traditional U-Net structures used in point cloud SSL. This is similar to the patch-based output of 2D SSL techniques built on Vision Transformers, such as DINOv2. We focus SSL exclusively on the encoder and leave decoding to be trained on downstream tasks like semantic segmentation. To get full-resolution output during pretraining (i.e., one feature per input point), we deterministically upcast the features from the coarser point clouds in deeper layers of the encoder to the full-resolution point cloud and concatenate them (a minimal sketch of this upcasting follows the list below). This approach allows for training more flexible and generalizable multi-scale representations.
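To make the decoder-free readout concrete, here is a minimal sketch of how features from coarser encoder stages can be deterministically broadcast back to the full-resolution point cloud and concatenated. It assumes each pooling step records a parent index per point; the function and variable names are illustrative, not the exact Sonata implementation.

```python
import torch

def upcast_and_concat(feats: list[torch.Tensor], parents: list[torch.Tensor]) -> torch.Tensor:
    """Broadcast coarse-stage features to full resolution and concatenate.

    feats[0]: (N0, C0) full-resolution features; feats[i]: (N_i, C_i) coarser stages.
    parents[i]: LongTensor of length N_i mapping each stage-i point to its parent
    point at stage i+1 (the pooling assignment recorded during encoding).
    """
    out = [feats[0]]
    idx = torch.arange(feats[0].shape[0])   # identity map at full resolution
    for i in range(1, len(feats)):
        idx = parents[i - 1][idx]           # compose maps: full resolution -> stage-i ancestor
        out.append(feats[i][idx])           # copy the coarse feature to every descendant point
    return torch.cat(out, dim=-1)           # (N0, C0 + C1 + ...), one multi-scale feature per point

# Example with random data:
# feats = [torch.randn(1000, 32), torch.randn(250, 64), torch.randn(60, 128)]
# parents = [torch.randint(0, 250, (1000,)), torch.randint(0, 60, (250,))]
# multi_scale = upcast_and_concat(feats, parents)  # shape (1000, 224)
```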

Scalable 3D Point Cloud Machine Learning Architecture

Sonata uses a state-of-the-art 3D backbone called PointTransformerV3 (PTv3), which is capable of operating on large scenes. At the heart of this architecture is the transformer layer. While it is common to use image patches as input tokens for 2D images with architectures like ViT, it is less clear how to formulate tokens or 3D patches for unstructured 3D data. We therefore chose PTv3, which uses an efficient serialization mechanism to build higher-level tokens. This enables the model to scale to very large point clouds with hundreds of thousands of points while keeping the computation tractable.
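As a rough illustration of serialization (not the PTv3 code itself), the sketch below orders points along a z-order (Morton) space-filling curve, so that contiguous chunks of the resulting sequence are mostly spatial neighbors and can be grouped into tokens for attention.

```python
import numpy as np

def interleave_bits(x: np.ndarray, bits: int = 10) -> np.ndarray:
    """Spread the low `bits` bits of each integer so consecutive bits land 3 apart."""
    out = np.zeros_like(x)
    for b in range(bits):
        out |= ((x >> b) & 1) << (3 * b)
    return out

def z_order(points: np.ndarray, bits: int = 10) -> np.ndarray:
    """Return an index that sorts (N, 3) points along a Morton curve."""
    mins, maxs = points.min(0), points.max(0)
    grid = ((points - mins) / (maxs - mins + 1e-9) * (2**bits - 1)).astype(np.int64)
    code = (interleave_bits(grid[:, 0], bits)
            | (interleave_bits(grid[:, 1], bits) << 1)
            | (interleave_bits(grid[:, 2], bits) << 2))
    return np.argsort(code)

order = z_order(np.random.rand(100_000, 3))  # serialized point order for token grouping
```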

Sonata runs efficiently on large point clouds, such as an entire home with over 500k points, thanks to the PointTransformerV3 architecture. A query point is selected on a bed, and its feature similarity to other beds on the same floor of the home is shown in a jet colormap, where red corresponds to highly similar.

Powerful Point Cloud Representations Lead to Exceptional Parameter & Data Efficiency

For the first time, the Sonata SSL framework enables learning point cloud representations strong enough to be used zero-shot or linear-probed for segmentation without finetuning the model. The ability to probe zero-shot, or to train just a tiny linear layer (with <0.2% of the parameters of the original model) for segmenting a point cloud, shows that the features output by the self-supervised pretrained model already contain the information needed to distinguish different classes of objects. While this is common practice in image and video SSL, representations learned by prior point cloud SSL work lead to poor linear probing performance.

This is shown in the figure above for the PointContrast (PC) and Masked Scene Contrast (MSC) SSL approaches. As can be seen, the Sonata pretrained model achieves a breakthrough 3.3x higher linear probing performance. This performance is close to the new full-finetuning state of the art of 79.4%, also set by the Sonata pretrained model. As a reference, linear probing of Sonata features outperforms linear probing of features from DINOv2, a state-of-the-art vision foundation model, unprojected and aggregated onto the point cloud.
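For reference, linear probing amounts to training a single linear layer on top of frozen per-point features. The sketch below uses random tensors as stand-ins for frozen Sonata features and ScanNet-style labels; the dimensions are placeholders, not the actual model widths.

```python
import torch
import torch.nn as nn

feat_dim, num_classes = 512, 20              # placeholders: encoder width, label set size
feats = torch.randn(10_000, feat_dim)        # stand-in for frozen per-point features
labels = torch.randint(0, num_classes, (10_000,))

probe = nn.Linear(feat_dim, num_classes)     # the only trainable parameters (<0.2% of the backbone)
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)

for _ in range(100):                         # the encoder stays frozen; only the probe is updated
    loss = nn.functional.cross_entropy(probe(feats), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```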

Going a little deeper, we compare the PCA visualizations of DINOv2, Sonata, and their combined feature representation in the figure below. DINOv2 excels at capturing photometric details, while Sonata better distinguishes spatial information. One example of this limitation of Sonata alone is the poster on the wall, shown on the left side of the rightmost column in the figure below: the poster is flat and has little distinctive 3D structure, yet its photometric detail is easily captured by DINOv2. The combined model demonstrates improved coherence and detail, showcasing the complementary strengths of both models. The linear probing results of lifted DINOv2, Sonata, and the combined representation are 63.1%, 72.5%, and 75.9%, respectively.

This means that linear probing, which is very parameter efficient, is starting to approach full-finetuning performance and already outperforms baseline 2D feature-lifting approaches. Combining Sonata and DINOv2 features leads to the best performance, which indicates that the two representations carry complementary information.
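The PCA visualizations above can be reproduced with a simple projection of per-point features onto their top three principal components, mapped to RGB. The sketch below is illustrative; concatenating lifted DINOv2 and Sonata features before the projection gives the "combined" view, and the variable names in the usage comment are assumptions.

```python
import numpy as np

def pca_rgb(feats: np.ndarray) -> np.ndarray:
    """Project (N, C) per-point features to (N, 3) colors in [0, 1] via PCA."""
    centered = feats - feats.mean(0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:3].T                        # top-3 principal components
    proj -= proj.min(0, keepdims=True)
    return proj / (proj.max(0, keepdims=True) + 1e-8)

# colors = pca_rgb(np.concatenate([sonata_feats, lifted_dino_feats], axis=1))
```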

In additional experiments we also find that, with only 1% of the training data, Sonata nearly doubles the performance of previous approaches. This highlights the data efficiency of the learned representation and points the way towards powerful 3D perception models supervised with little data.

State-of-the-Art Perception Performance With Full Finetuning

To unlock the full potential of the Sonata pretrained model, we fully finetune all the model parameters. We find that the Sonata point cloud model demonstrates exceptional performance across a wide range of 3D perception tasks, achieving state-of-the-art results on both indoor and outdoor semantic segmentation and instance segmentation.

Open-Sourcing the High-Performance Sonata-Pretrained 3D Point Cloud Model

Sonata represents a significant leap forward in the field of 3D self-supervised learning. By identifying and addressing the geometric shortcut and introducing a flexible, efficient framework, Sonata learns an exceptionally robust 3D point representation. This work advances the state of the art and sets the stage for future innovations in 3D perception and its applications. We release the training code on GitHub and the model weights on Hugging Face.

Paper: https://arxiv.org/pdf/2503.16429
GitHub: https://github.com/facebookresearch/sonata
Weights: https://huggingface.co/facebook/sonata
Project page: https://xywu.me/sonata/
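As a hypothetical sketch of fetching the released checkpoint from the Hugging Face repo linked above (see the GitHub README for the supported loading API), the file name below is an assumption.

```python
from huggingface_hub import hf_hub_download
import torch

# "sonata.pth" is an assumed file name; check the repo for the actual checkpoint files.
ckpt_path = hf_hub_download(repo_id="facebook/sonata", filename="sonata.pth")
state_dict = torch.load(ckpt_path, map_location="cpu")
```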
