ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail



Technical University of Munich

Abstract

We present ExCap3D, an expressive 3D captioning model that takes a 3D scan as input and, for each detected object in the scan, generates a fine-grained collective description of the parts of the object, along with an object-level description conditioned on the part-level description. We design ExCap3D to encourage semantic consistency between the generated text descriptions, as well as textual similarity in the latent space, to further increase the quality of the generated captions. To enable this task, we generate the ExCap3D Dataset by leveraging a vision-language model (VLM) for multi-view captioning. The ExCap3D Dataset contains captions on the ScanNet++ dataset with varying levels of detail, comprising 190k text descriptions of 34k 3D objects in 947 indoor scenes. The object- and part-level captions generated by ExCap3D are of higher quality than those produced by state-of-the-art methods, with CIDEr score improvements of 17% and 124% for object- and part-level details, respectively.

Video

3D Captioning at Multiple Levels

We propose the task of expressive 3D captioning. Given an input 3D scene, the task is to describe each object at multiple levels of detail: a high-level object description and a low-level description of the properties of its parts.

Joint Captioning with Consistency Losses

We introduce ExCap3D, a 3D object captioning model that produces expressive captions at both the object and part levels of detail using two captioning heads. The part-level caption collectively describes the different parts of an object. Knowing the part descriptions enables the model to generate a more consistent caption describing the object as a whole. Hence, ExCap3D first produces the part-level details and then uses them to inform the object-level captioner. To further reduce inconsistencies between the two levels of detail in terms of object semantics and text content, ExCap3D employs semantic and textual consistency losses during training.
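As a rough illustration of this design, the PyTorch-style sketch below shows one way a part-level head could feed an object-level head, with cosine- and KL-based consistency terms on the two caption latents. This is a minimal sketch under our own assumptions, not the ExCap3D implementation: the decoder architecture, dimensions, conditioning mechanism, and exact loss formulations here are placeholders.

# Minimal PyTorch sketch of two-level captioning with consistency losses.
# NOT the ExCap3D implementation: decoder types, dimensions, and the exact
# consistency terms are assumptions based on the description above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLevelCaptioner(nn.Module):
    def __init__(self, feat_dim=256, hidden=256, vocab_size=10000, num_classes=100):
        super().__init__()
        self.part_head = nn.GRU(feat_dim, hidden, batch_first=True)  # part-level decoder
        self.part_out = nn.Linear(hidden, vocab_size)
        # The object-level decoder is conditioned on the part-level hidden state.
        self.obj_head = nn.GRU(feat_dim + hidden, hidden, batch_first=True)
        self.obj_out = nn.Linear(hidden, vocab_size)
        self.sem_head = nn.Linear(hidden, num_classes)  # shared semantic classifier

    def forward(self, obj_feat, part_len=32, obj_len=16):
        B = obj_feat.size(0)
        # Part-level caption: feed the detected object's feature at every step
        # (token embeddings / teacher forcing omitted for brevity).
        part_in = obj_feat.unsqueeze(1).repeat(1, part_len, 1)
        part_hid, part_state = self.part_head(part_in)
        part_logits = self.part_out(part_hid)
        # Object-level caption, conditioned on the final part-level state.
        cond = part_state[-1].unsqueeze(1).repeat(1, obj_len, 1)
        obj_in = torch.cat([obj_feat.unsqueeze(1).repeat(1, obj_len, 1), cond], dim=-1)
        obj_hid, _ = self.obj_head(obj_in)
        obj_logits = self.obj_out(obj_hid)
        # Pooled latents of the two captions, used by the consistency losses.
        return part_logits, obj_logits, part_hid.mean(dim=1), obj_hid.mean(dim=1)

    def consistency_losses(self, part_emb, obj_emb):
        # Textual consistency: pull the two caption latents together.
        textual = 1.0 - F.cosine_similarity(part_emb, obj_emb, dim=-1).mean()
        # Semantic consistency: both latents should imply the same object semantics.
        semantic = F.kl_div(F.log_softmax(self.sem_head(part_emb), dim=-1),
                            F.softmax(self.sem_head(obj_emb), dim=-1),
                            reduction="batchmean")
        return textual, semantic

In training, such terms would be added, with suitable weights, to the standard captioning cross-entropy losses of the two heads.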

ExCap3D Dataset

To enable the multi-level captioning task, we provide the ExCap3D Dataset with 190k object- and part-level captions of 34k 3D objects in the ScanNet++ dataset. The dataset is generated automatically, using a pretrained vision-language model (VLM) on high-resolution DSLR images along with ground-truth semantic labels. We employ multi-view aggregation to robustly capture key object- and part-level details.
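The sketch below gives a rough, hypothetical outline of such a multi-view pipeline: caption each view in which an object is sufficiently visible, then aggregate the per-view captions. It is not the actual ExCap3D generation code; vlm_caption, llm_summarize, and the visibility threshold are placeholder assumptions.

# Hypothetical outline of multi-view caption aggregation for one object.
# NOT the actual ExCap3D pipeline: `vlm_caption` and `llm_summarize` stand in
# for calls to a pretrained VLM / LLM, and the visibility threshold is arbitrary.
import numpy as np

def crop_to_mask(rgb, mask, pad=16):
    """Crop the image to the bounding box of the object's 2D mask, with padding."""
    ys, xs = np.nonzero(mask)
    y0, y1 = max(ys.min() - pad, 0), ys.max() + pad
    x0, x1 = max(xs.min() - pad, 0), xs.max() + pad
    return rgb[y0:y1, x0:x1]

def caption_object(views, vlm_caption, llm_summarize, min_visible_pixels=2000):
    """views: list of (rgb_image, binary_mask) pairs in which the object is visible."""
    per_view = []
    for rgb, mask in views:
        if mask.sum() < min_visible_pixels:  # skip views with poor visibility
            continue
        per_view.append(vlm_caption(crop_to_mask(rgb, mask)))
    # Aggregate the per-view descriptions into one object- and part-level caption.
    return llm_summarize(per_view)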

Comparison with Existing Datasets

[Figure: comparison of the ExCap3D Dataset with existing 3D scene captioning datasets]

Qualitative Results

ExCap3D produces complete and detailed descriptions at both the object and part levels of detail, while baseline methods often omit details at both levels.

Quantitative Results

ExCap3D outperforms all baselines on the CIDEr score, which aligns most closely with human perception, improving upon the highest-performing baseline, PQ3D, by margins of 17% and 124% for object- and part-level details, respectively.

BibTeX

      @misc{yeshwanth2025excap3d,
        title={ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail}, 
        author={Chandan Yeshwanth and David Rozenberszki and Angela Dai},
        year={2025},
        eprint={2503.17044},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2503.17044}, 
      }