Abstract
6D pose estimation is a key task in computer vision and embodied AI. Existing methods regress the pose directly in a high-dimensional continuous space and face two key challenges: limited accuracy due to noise and local optima, and inefficient search over an infinite space that hinders real-time performance. We propose Flow6D, a hierarchical flow matching framework with a two-stage strategy that proceeds from discrete latent-space localization to continuous pose regression. A discrete flow matching model first localizes the region of the latent space around the true pose to reduce search complexity; a continuous flow matching model then predicts local residuals to regress an accurate pose. The framework extends naturally to articulated objects and outperforms state-of-the-art methods on synthetic and real datasets, with real-time inference at 70 FPS.
Discrete-to-Continuous
Discrete flow matching localizes the true-pose latent space, while continuous flow matching refines residuals, reducing the local-optimum and noise sensitivity of direct regression.
Real-Time at 70 FPS
A compact structured search replaces costly brute-force candidate ranking, running in ~11 ms per frame on a single RTX 4090.
Rigid & Articulated
One unified framework for both rigid and multi-part articulated objects, generalizing robustly under occlusion, clutter, and illumination changes.
Method
Flow6D turns 6D pose estimation into a two-stage, discrete-to-continuous process:
- Stage I — Latent localization. Rotation (Euler angles) and translation are discretized into bins. A discrete flow matching model predicts the high-probability bin per dimension, collapsing an infinite continuous search into an efficient, stable classification that locks onto the true pose region.
- Anchor–Probs sampling. The top-1 bin becomes the anchor pose; the top-k bins define an uncertainty-aware initialization for refinement, preserving multi-modality while focusing on promising regions.
- Stage II — Continuous regression. The point cloud is canonicalized by the anchor pose, then a continuous flow matching model regresses fine residuals through deterministic ODE solving — correcting discretization error to a precise 6D pose.
- Articulated objects. A joint-centric extension predicts joint axis, origin, and state to propagate poses of child parts along the kinematic chain.
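The discrete-to-continuous steps above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the bin range, the Euler-angle convention (here `Rz @ Ry @ Rx`), the fixed-step Euler solver, and all function names (`bin_centers`, `anchor_and_topk`, `euler_to_matrix`, `canonicalize`, `integrate_residual`) are assumptions made for illustration, and the `velocity` callable is a stand-in for the learned continuous flow matching network.

```python
import numpy as np

def bin_centers(lo, hi, n_bins):
    """Centers of n_bins equal-width bins covering [lo, hi] (assumed binning)."""
    width = (hi - lo) / n_bins
    return lo + width * (np.arange(n_bins) + 0.5)

def anchor_and_topk(probs, centers, k):
    """Pick the top-1 anchor and top-k candidate values per pose dimension.

    probs, centers: arrays of shape (D, n_bins); probs stands in for the
    per-dimension bin probabilities a Stage I model would output.
    """
    order = np.argsort(-probs, axis=1)                        # descending probability
    topk = np.take_along_axis(centers, order[:, :k], axis=1)  # (D, k) candidate values
    return topk[:, 0], topk

def euler_to_matrix(rx, ry, rz):
    """Rotation matrix from Euler angles, assuming R = Rz @ Ry @ Rx."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def canonicalize(points, R, t):
    """Map camera-frame points (N, 3) into the anchor frame: R^T (p - t)."""
    return (points - t) @ R

def integrate_residual(velocity, x0, n_steps=10):
    """Deterministic fixed-step Euler integration of dx/ds = velocity(x, s), s in [0, 1)."""
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / n_steps
    for step in range(n_steps):
        x = x + dt * velocity(x, step * dt)
    return x

# Toy usage: one rotation dimension with 4 bins and made-up probabilities.
centers = np.stack([bin_centers(-np.pi, np.pi, 4)])
probs = np.array([[0.1, 0.2, 0.6, 0.1]])
anchor, candidates = anchor_and_topk(probs, centers, k=2)

# Toy velocity field that transports any start point to a fixed residual.
target = np.array([0.02, -0.01, 0.03])
residual = integrate_residual(lambda x, s: (target - x) / (1.0 - s), np.zeros(3))
```

The design point this illustrates is that Stage I reduces the search to a classification over a finite set of bin centers, so the continuous model only has to cover a residual no larger than the bin width around the anchor.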
Results
Rigid objects — REAL275
Using only depth input (no category prior), Flow6D sets a new state of the art while running ~5× faster than the diffusion-based GenPose.
Articulated objects — ArtImage
Flow6D achieves the lowest per-part rotation and translation errors while running two-to-three orders of magnitude faster than optimization-based baselines.
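For readers unfamiliar with these metrics, the sketch below shows one common way to compute a rotation error (geodesic angle between rotation matrices) and a translation error (Euclidean distance) for a single part. This page does not state the exact definitions used in the evaluation, so the formulas and the function names `rotation_error_deg` and `translation_error` are assumptions; handling of symmetric objects is deliberately omitted.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    cos_angle = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

def translation_error(t_pred, t_gt):
    """Euclidean distance between predicted and ground-truth translations."""
    return float(np.linalg.norm(np.asarray(t_pred, dtype=float) - np.asarray(t_gt, dtype=float)))
```

For an articulated object, these would be evaluated once per part and reported separately.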
Real-world experiments
We evaluate Flow6D on rigid and articulated objects in real-world scenes. The videos show stable pose tracking through object motion, interaction, and articulation under changing viewpoints and partial occlusion.
Rigid objects
Tasks: mug pick-and-place and cross-container pouring.
Articulated objects
Tasks: laptop opening and closing.
BibTeX
@article{mei2026flow6d,
  title={Flow6D: Discrete-to-Continuous Flow Matching for Efficient and Accurate Category-Level 6D Pose Estimation},
  author={Mei, Mingyu and Zhang, Li and Dai, Zibo and Sun, Han and Zhao, Xinyue and Shen, Huiliang and He, Zaixing},
  journal={IEEE Robotics and Automation Letters},
  year={2026},
  publisher={IEEE}
}