Flow6D: Discrete-to-Continuous Flow Matching for Efficient and Accurate Category-Level 6D Pose Estimation

Mingyu Mei ¹,*

Li Zhang ²,*

Zibo Dai ¹

Han Sun ³

Xinyue Zhao ¹

Huiliang Shen ¹

Zaixing He ¹,†

IEEE Robotics and Automation Letters (RA-L)

¹Zhejiang University, ²University of Science and Technology of China, ³Shanghai Jiao Tong University

*Equal contribution, †Corresponding author


**Prior candidate-based pose pipelines versus Flow6D.** Prior methods rely on brute-force candidate sampling and ranking **(a)**. Flow6D **(b)** first localizes a discrete latent space and then regresses a continuous pose, which gives higher accuracy at a much lower runtime.

Abstract

6D pose estimation is a key task in computer vision and embodied AI. Existing methods regress poses directly in a high-dimensional continuous space and face two key challenges: limited accuracy due to noise and local optima, and inefficient search over an infinite space, which hinders real-time performance. We propose Flow6D, a hierarchical flow matching framework built on a two-stage strategy: discrete latent space localization followed by continuous pose regression. A discrete flow matching model first locks onto the latent space around the true pose to reduce search complexity; a continuous flow matching model then predicts local residuals to regress an accurate pose. The framework extends naturally to articulated objects and outperforms state-of-the-art methods on synthetic and real datasets, with real-time inference at 70 FPS.

Discrete-to-Continuous

Discrete flow matching localizes the true-pose latent space, while continuous flow matching refines residuals, reducing the local-optimum and noise sensitivity of direct regression.

Real-Time at 70 FPS

A compact structured search replaces costly brute-force candidate ranking, running in ~11 ms per frame on a single RTX 4090.

Rigid & Articulated

One unified framework for both rigid and multi-part articulated objects, generalizing robustly under occlusion, clutter, and illumination changes.

Method

**Two-stage framework.** Stage I selects an anchor pose via discrete flow matching over uniformly sampled rotation/translation bins. Stage II refines it via continuous flow matching with adaptive latent pose sampling.

Flow6D turns 6D pose estimation into a two-stage, discrete-to-continuous process:

Stage I — Latent localization. Rotation (Euler angles) and translation are discretized into bins. A discrete flow matching model predicts the high-probability bin per dimension, collapsing an infinite continuous search into an efficient, stable classification that locks onto the true pose region.
Anchor–Probs sampling. The top-1 bin becomes the anchor pose; the top-$N$ bins define an uncertainty-aware initialization for refinement, preserving multi-modality while focusing on promising regions.
Stage II — Continuous regression. The point cloud is canonicalized by the anchor pose, then a continuous flow matching model regresses fine residuals through deterministic ODE solving — correcting discretization error to a precise 6D pose.
Articulated objects. A joint-centric extension predicts joint axis, origin, and state to propagate poses of child parts along the kinematic chain.
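
The two stages above can be illustrated with a toy NumPy sketch. It is not the authors' implementation: the function names (`bin_centers`, `anchor_and_samples`, `euler_refine`), the bin count, the pose ranges, and the hand-written logits and velocity field are all assumptions that stand in for the learned discrete and continuous flow matching models. The sketch also ignores angle wrap-around and the point cloud input.

```python
import numpy as np

def bin_centers(low, high, num_bins):
    """Centers of num_bins uniform bins covering [low, high)."""
    width = (high - low) / num_bins
    return low + width * (np.arange(num_bins) + 0.5)

def anchor_and_samples(logits, centers, top_n, num_samples, rng):
    """Stage I stand-in: top-1 bin -> anchor, top-N bins -> initial samples.

    logits, centers: arrays of shape (D, B) for D pose dimensions and B bins.
    """
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    dims = np.arange(logits.shape[0])
    anchor = centers[dims, probs.argmax(axis=1)]
    samples = np.empty((num_samples, logits.shape[0]))
    for d in dims:
        top = np.argsort(probs[d])[-top_n:]       # indices of the top-N bins
        p = probs[d, top] / probs[d, top].sum()   # renormalize over them
        samples[:, d] = centers[d, rng.choice(top, size=num_samples, p=p)]
    return anchor, samples

def euler_refine(x0, velocity, num_steps=10):
    """Stage II stand-in: Euler solve of dx/dt = velocity(x, t) for t in [0, 1]."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity(x, k * dt)
    return x

# Toy setup (all values are assumptions): 3 Euler angles and 3 translations.
rng = np.random.default_rng(0)
num_bins = 36
centers = np.stack([bin_centers(-np.pi, np.pi, num_bins)] * 3
                   + [bin_centers(-1.0, 1.0, num_bins)] * 3)
true_pose = np.array([0.3, -1.2, 2.0, 0.1, -0.4, 0.25])

# Fake Stage I output: logits peaked at the bin nearest the true pose.
logits = -((centers - true_pose[:, None]) ** 2) / 0.02
anchor, samples = anchor_and_samples(logits, centers, top_n=3,
                                     num_samples=8, rng=rng)

# Fake Stage II velocity field: straight-line flow toward the true residual.
# A trained network would predict this from the canonicalized point cloud.
residual = true_pose - anchor
velocity = lambda x, t: (residual - x) / (1.0 - t)
refined = anchor + euler_refine(samples[0] - anchor, velocity)

print("anchor error :", np.abs(anchor - true_pose).max())
print("refined error:", np.abs(refined - true_pose).max())
```

In this toy setting the anchor is off by at most half a bin width, and the continuous step removes that discretization error. With a learned velocity field the correction would only be approximate.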

Results

Rigid objects — REAL275

Using only depth input (no category prior), Flow6D sets a new state of the art while running ~5× faster than the diffusion-based GenPose.

**Quantitative comparison on REAL275** for category-level pose estimation.

**REAL275 qualitative results.** Red and green 3D boxes show the ground truth and our predictions, respectively.

Articulated objects — ArtImage

Flow6D achieves the lowest per-part rotation and translation errors while running two to three orders of magnitude faster than optimization-based baselines.
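
This page does not define the rotation and translation errors it reports. As a reference point, the sketch below shows a common way to compute them for a single part: the geodesic angle between the predicted and ground-truth rotation matrices, and the Euclidean distance between the translations. The function names are hypothetical, and the benchmark's actual evaluation protocol (for example, its handling of symmetric parts) may differ.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    cos_angle = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from round-off.
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

def translation_error(t_pred, t_gt):
    """Euclidean distance between two translations (same units as the input)."""
    diff = np.asarray(t_pred, dtype=float) - np.asarray(t_gt, dtype=float)
    return float(np.linalg.norm(diff))

def rot_z(deg):
    """Rotation about the z-axis by `deg` degrees (helper for the demo only)."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0,        0.0,       1.0]])

# Two rotations about z that differ by 10 degrees, and translations 0.5 apart.
print(rotation_error_deg(rot_z(30.0), rot_z(20.0)))
print(translation_error([0.0, 0.0, 0.5], [0.0, 0.3, 0.1]))
```

For an articulated object, these two quantities would be computed once per part and then averaged over the test set.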

**Comparison on ArtImage** with state-of-the-art methods across the Laptop, Eyeglasses, Dishwasher, Scissors, and Drawer categories.

**ArtImage qualitative results.** Per-part poses remain accurate even at joints and under partial occlusion.

Real-world experiments

We evaluate Flow6D on rigid and articulated objects in real-world scenes. The videos show stable pose tracking through object motion, interaction, and articulation under changing viewpoints and partial occlusion.

Rigid objects

Tasks: mug pick-and-place and cross-container pouring.

Black-mug pick-and-place

Yellow-mug pick-and-place

Bottle-to-bowl pouring

Bottle-to-mug pouring

Bottle-to-mug pouring · different height

Mug-to-bottle pouring

Articulated objects

Tasks: laptop opening and closing.

Laptop closing · sequence 1

Laptop closing · sequence 2

Laptop opening · sequence 1

Laptop opening · sequence 2

BibTeX

@article{mei2026flow6d,
  title={Flow6D: Discrete-to-Continuous Flow Matching for Efficient and Accurate Category-Level 6D Pose Estimation},
  author={Mei, Mingyu and Zhang, Li and Dai, Zibo and Sun, Han and Zhao, Xinyue and Shen, Huiliang and He, Zaixing},
  journal={IEEE Robotics and Automation Letters},
  year={2026},
  publisher={IEEE}
}