Bridging the Gap: Introducing Boxer for Open-World 2D-to-3D Lifting 🥊

April 21, 2026 · 5 min read

Boxer in action on an egocentric sequence captured from smart glasses.

In computer vision, detecting objects in 3D space is a fundamental challenge. Today's foundation models and Vision Language Models (VLMs) are impressively accurate at basic 2D detection tasks, like identifying a set of keys on a coffee table. However, they struggle to pinpoint where things are located in 3D space. This makes it hard to determine, for example, how furniture is arranged in a room or where a specific item is located for navigation, especially from an egocentric (first-person) view such as the one captured by Project Aria or other wearables. We are excited to announce Boxer: a new, lightweight, modular model designed to lift open-world 2D bounding boxes into high-precision, metric 3D oriented bounding boxes (3DBBs), enabling more detailed annotation of objects in 3D space.

What is Boxer?

Boxer is a perception engine designed to work with existing 2D foundation models. Boxer acts as a specialized "lifting" layer that takes 2D detections and transforms them into 3D objects with physical dimensions and global coordinates.
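To make "3D objects with physical dimensions and global coordinates" concrete, here is a minimal sketch of one common way to represent an oriented 3D bounding box. It is an illustration only, not Boxer's actual data structure: the `OrientedBox3D` class, its field names, and the convention that the box rotates only about a gravity-aligned z axis are all assumptions made for this example.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class OrientedBox3D:
    """Hypothetical 3DBB: metric center and size, plus yaw about the gravity (z) axis."""
    center: Vec3   # (x, y, z) in meters, in a gravity-aligned world frame
    size: Vec3     # (length along x, width along y, height along z) in meters
    yaw: float     # rotation about the z axis, in radians

    def corners(self) -> List[Vec3]:
        """Return the 8 corners of the box in world coordinates."""
        cx, cy, cz = self.center
        length, width, height = self.size
        c, s = math.cos(self.yaw), math.sin(self.yaw)
        out = []
        for sx in (-0.5, 0.5):
            for sy in (-0.5, 0.5):
                for sz in (-0.5, 0.5):
                    # Corner offset in the box's own frame.
                    x, y, z = sx * length, sy * width, sz * height
                    # Rotate about z by yaw, then translate to the center.
                    out.append((cx + c * x - s * y, cy + s * x + c * y, cz + z))
        return out


# A 2 m x 1 m x 1 m box, rotated 90 degrees about the gravity axis.
box = OrientedBox3D(center=(1.0, 2.0, 0.5), size=(2.0, 1.0, 1.0), yaw=math.pi / 2)
box_corners = box.corners()
```

Restricting rotation to a single yaw angle is one way a known gravity direction can simplify the problem: the box has seven free parameters (three for position, three for size, one for heading) instead of nine.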

While Boxer was optimized for egocentric data (like the output from Project Aria or Quest headsets), it is designed to generalize. It works well on any calibrated video, such as footage from a phone or tablet, RGB-D sensors on robotics platforms, and stereo cameras.

A diagram showing the inputs, algorithm, and output flow for Boxer.

How It Works: The Three-Stage Pipeline

Boxer’s modular architecture consists of a three-stage pipeline designed for seamless integration into existing AI workflows. The process begins with 2D Detection, which uses open-vocabulary models like OWLv2 to identify objects via text prompts or manually drawn boxes. The detections are then processed by BoxerNet, the core innovation, which encodes images using DINOv3 and leverages camera intrinsics and gravity direction to predict precise 3D oriented bounding boxes. Finally, the system employs Fusion and Tracking to maintain temporal consistency, allowing for either real-time online tracking or offline multi-frame fusion to create a stable, global 3D map of the scene.
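As a rough illustration of what offline multi-frame fusion can mean, the sketch below merges several per-frame estimates of the same object into one box. This is a generic robust-averaging approach chosen for the example, not a description of Boxer's fusion stage: the `fuse_boxes` function, the `Box` tuple layout, and the assumption that all estimates already share a gravity-aligned world frame are hypothetical.

```python
import math
from statistics import median
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
# Hypothetical per-frame estimate: (center in meters, size in meters, yaw in radians).
Box = Tuple[Vec3, Vec3, float]


def fuse_boxes(estimates: List[Box]) -> Box:
    """Fuse per-frame estimates of one object into a single box."""
    if not estimates:
        raise ValueError("need at least one estimate")
    # Per-axis median is robust to an occasional bad frame.
    center = tuple(median(e[0][i] for e in estimates) for i in range(3))
    size = tuple(median(e[1][i] for e in estimates) for i in range(3))
    # Circular mean of yaw avoids artifacts at the -pi/pi wrap-around.
    yaw = math.atan2(
        sum(math.sin(e[2]) for e in estimates),
        sum(math.cos(e[2]) for e in estimates),
    )
    return center, size, yaw


# Three observations of the same object; the third has an outlier center.
observations = [
    ((1.0, 0.0, 0.0), (2.0, 1.0, 1.0), 3.1),
    ((1.2, 0.0, 0.0), (2.0, 1.0, 1.0), -3.1),
    ((5.0, 0.0, 0.0), (2.0, 1.0, 1.0), 3.14),
]
fused = fuse_boxes(observations)
```

The median keeps the outlier center from dragging the fused box away, and the circular mean correctly treats yaws of 3.1 and -3.1 radians as nearly identical headings. A real system would also need to associate detections across frames and handle the fact that a box's heading is ambiguous under 180-degree flips, both of which this sketch ignores.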

A diagram showing the flow for BoxerNet.

Performance

Beyond its versatility, Boxer is highly accurate and significantly outperforms existing state-of-the-art alternatives on public benchmarks. It has demonstrated superior performance on both Project Aria datasets, such as ADT and NymeriaPlus, and external benchmarks like CA-1M and Omni3DSUN. Whether evaluated frame by frame or with aggregated per-scene metrics, Boxer’s consistent precision makes it a robust solution for demanding real-world applications in robotics and augmented reality.

Open Sourced and Ready for Research

To foster community collaboration and advance 3D perception, the inference code, model checkpoints, and datasets have been made publicly available. Boxer's lightweight, efficient design makes it accessible to researchers without extensive compute clusters, and it includes out-of-the-box support for Project Aria. Additionally, the codebase provides interactive demos and a 3D viewer, allowing users to draw 2D boxes and watch Boxer lift them into 3D space in real time.

Get Started Today

Ready to bring 3D awareness to your vision models?

Give it a try, and let’s build a more spatially aware future together.