Abstract
Estimating the 6D pose of arbitrary objects from a single reference image is a critical yet challenging task in robotics, especially considering the long-tail distribution of real-world instances. While category-level and model-based approaches have achieved notable progress, they remain limited in generalizing to unseen objects under one-shot settings. In this work, we propose a novel pipeline for fast and accurate one-shot 6D pose and scale estimation. Leveraging recent advances in single-view 3D generation, we first build high-fidelity textured meshes without requiring known object poses. To resolve scale ambiguity, we introduce a coarse-to-fine alignment module that estimates both object size and initial pose by matching 2D-3D features with depth information. We then generate a diversified set of plausible 3D models using text-guided generative augmentation and render them with Blender to synthesize large-scale, domain-randomized training data for pose estimation. This synthetic data bridges the domain gap and enables robust fine-tuning of pose estimators. Our method achieves state-of-the-art results on several 6D pose benchmarks, and we further validate its effectiveness on a newly collected in-the-wild dataset. Finally, we integrate our system with a dexterous hand, demonstrating its robustness in real-world robotic grasping tasks. All code, data, and models will be released to foster future research.
Interactive Examples
Method Overview
Figure 2 illustrates the overall pipeline of our method. Given an anchor RGB-D image IA containing an object of interest, our primary challenge is to estimate its 6D pose without a pre-existing 3D model, a common limitation for novel objects. To address this, as shown in the top-left of Figure 2, we first leverage recent advancements in single-view 3D generation to create a textured 3D model with a standardized orientation and scale (see Section 3.3). However, this generated model exists in a normalized space and lacks real-world scale.
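As a concrete illustration of this normalization step, the sketch below (our own simplification using trimesh; the helper name normalize_mesh is hypothetical and not part of the released code) centers a generated mesh and rescales it into a unit cube, keeping the offset and extent so the model can later be mapped back to metric space.

```python
# Minimal sketch (assumption, not the released code): normalize a generated mesh
# into a canonical unit-scale space, since the generated model carries no metric scale.
import trimesh

def normalize_mesh(mesh):
    """Center the mesh at the origin and scale it to fit inside a unit cube."""
    lo, hi = mesh.bounds                      # (2, 3) axis-aligned bounding box corners
    center = (lo + hi) / 2.0                  # bounding-box center
    extent = float((hi - lo).max())           # longest side of the bounding box
    normalized = mesh.copy()
    normalized.apply_translation(-center)     # move the center to the origin
    normalized.apply_scale(1.0 / extent)      # rescale so the longest side is 1
    return normalized, center, extent         # keep center/extent to map back later

# Example usage (assumes a single mesh; trimesh.load may also return a Scene):
# mesh = trimesh.load("generated_object.obj", force="mesh")
# unit_mesh, center, extent = normalize_mesh(mesh)
```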
To recover the object's true size and location in the anchor image frame, we introduce a coarse-to-fine alignment module (see Section 3.4). This module aligns the normalized generated model with the partial object observation in IA, simultaneously estimating the object's metric scale and initial 6D pose. Once the metric-scale model in the anchor view is established, we can efficiently estimate the object's pose in subsequent query RGB-D images IQ (top-right of Figure 2) using the aligned model and a robust pose estimation framework, including a pose selection module to handle potential object symmetries. The final relative transformation TA→Q is then computed from the absolute poses in both views.
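To make the last two steps concrete, the minimal sketch below (our own illustration; T_A, T_Q, and the helper names are assumptions, not the released implementation) shows how the relative transformation TA→Q follows from the two absolute object poses, together with a crude scale estimate obtained by comparing the depth back-projected observation against the normalized model; the coarse-to-fine module would refine such an estimate further.

```python
# Minimal sketch (assumptions: T_A and T_Q are 4x4 object-to-camera poses in the
# anchor and query views; point sets are Nx3 arrays). Illustration only.
import numpy as np

def relative_transform(T_A, T_Q):
    """Relative transformation T_{A->Q}: maps the object's anchor-view pose to the query view."""
    return T_Q @ np.linalg.inv(T_A)

def coarse_metric_scale(model_points, observed_points):
    """Coarse scale factor: ratio of the observed extent (points back-projected from
    depth in the anchor view) to the extent of the normalized generated model."""
    model_extent = model_points.max(axis=0) - model_points.min(axis=0)
    observed_extent = observed_points.max(axis=0) - observed_points.min(axis=0)
    return float(np.median(observed_extent / np.maximum(model_extent, 1e-8)))
```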

Figure 2: Overview of OnePoseviaGen
Online Demo
Try our online demo below! You can upload your own images to see OnePoseviaGen's 6D pose estimation in action. (If loading is slow, please be patient or refresh the page and try again.)
Experiments
Public datasets. We evaluated our method on three challenging public datasets: YCBInEOAT (robotic interaction), Toyota-Light (TOYL) (challenging lighting), and LINEMOD Occlusion (LM-O) (cluttered, occluded, textureless objects).
Real-world evaluation. We performed two experiments in real-world settings: (1) 6D pose estimation for uncommon objects, where we generate synthetic training data with our domain randomization pipeline and test on a calibrated real-world set, and (2) robotic manipulation, where we set up grasping experiments with a ROKAE robot arm equipped with an XHAND1 dexterous hand as well as two AgileX PiPER arms, and measure success rates against baselines.
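For reference, the minimal bpy sketch below illustrates the kind of domain randomization used when rendering synthetic training images in Blender; it assumes an object and at least one light are already in the scene, and the function name randomize_and_render is ours, not part of the released pipeline.

```python
# Minimal Blender (bpy) sketch, illustration only: randomize camera pose and lighting
# around an already-imported object, then render one synthetic training image.
import math
import random
import bpy

def randomize_and_render(out_path):
    scene = bpy.context.scene
    cam = scene.camera

    # Place the camera at a random position on a sphere around the object (at the origin).
    radius = random.uniform(0.4, 1.0)
    theta = random.uniform(0.0, 2.0 * math.pi)
    phi = random.uniform(0.2, 1.3)
    cam.location = (radius * math.sin(phi) * math.cos(theta),
                    radius * math.sin(phi) * math.sin(theta),
                    radius * math.cos(phi))
    # (A Track-To constraint pointing the camera at the object is assumed to be set up.)

    # Randomize light intensity and color for illumination diversity.
    for obj in scene.objects:
        if obj.type == 'LIGHT':
            obj.data.energy = random.uniform(100.0, 1000.0)
            obj.data.color = (random.uniform(0.8, 1.0),
                              random.uniform(0.8, 1.0),
                              random.uniform(0.8, 1.0))

    scene.render.filepath = out_path
    bpy.ops.render.render(write_still=True)
```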
The two examples shown below are randomly selected from our collection to keep the page loading fast. To explore further, click the 'Load More' button to view additional examples.

Figure 3: Qualitative comparison on the YCBInEOAT, LM-O, and TOYL datasets

Figure 4: Qualitative comparison on the YCBInEOAT dataset