OvarNet: Towards Open-vocabulary Object Attribute Recognition

Keyan Chen☆ 1
Xiaolong Jiang☆ 2
Yao Hu2
Xu Tang2
Yan Gao2
Jianqi Chen1
Weidi Xie ✉ 3

Beihang University1
Xiaohongshu Inc2
Shanghai Jiao Tong University3
equal contribution
corresponding author
Code [GitHub]
Paper [arXiv]
Cite [BibTeX]


The first row depicts object detection and attribute classification in a closed-set setting, i.e., training and testing on the same vocabulary set. The second row gives qualitative results from our proposed OvarNet, which simultaneously localizes, categorizes, and characterizes arbitrary objects in an open-vocabulary scenario. For ease of visualization, we show only one object per image. Red denotes a base category/attribute, i.e., one seen in the training set, while blue denotes a novel category/attribute unseen in the training set.


In this paper, we consider the problem of simultaneously detecting objects and inferring their visual attributes in an image, even for those with no manual annotations provided at the training stage, resembling an open-vocabulary scenario. To achieve this goal, we make the following contributions: (i) we start with a naive two-stage approach for open-vocabulary object detection and attribute classification, termed CLIP-Attr, in which candidate objects are first proposed with an offline RPN and later classified for semantic category and attributes; (ii) we combine all available datasets and train with a federated strategy to finetune the CLIP model, aligning the visual representation with attributes; additionally, we investigate the efficacy of leveraging freely available online image-caption pairs under weakly supervised learning; (iii) in pursuit of efficiency, we train a Faster-RCNN type model end-to-end with knowledge distillation, which performs class-agnostic object proposal and classification on semantic categories and attributes with classifiers generated from a text encoder; finally, (iv) we conduct extensive experiments on the VAW, MS-COCO, LSA, and OVAD datasets, and show that recognition of semantic category and attributes is complementary for visual scene understanding, i.e., jointly training object detection and attribute prediction largely outperforms existing approaches that treat the two tasks independently, demonstrating strong generalization ability to novel attributes and categories.
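The federated strategy in contribution (ii) can be illustrated with a loss that only counts labels actually annotated for a sample. The following is a minimal sketch under that assumption, not the authors' implementation: the function name `masked_bce` and the convention of marking unannotated attributes with `None` are hypothetical.

```python
import math


def masked_bce(logits, labels):
    """Binary cross-entropy averaged over annotated labels only.

    labels: 1 = positive, 0 = negative, None = not annotated (ignored).
    This masking is an assumed reading of "federated" training on merged
    datasets, where each sample carries only part of the label set.
    """
    total, count = 0.0, 0
    for z, y in zip(logits, labels):
        if y is None:
            continue  # unannotated attribute contributes nothing to the loss
        # Numerically stable BCE with logits:
        # max(z, 0) - z * y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
        count += 1
    return total / count if count else 0.0


# Toy usage: the second attribute is unannotated and is skipped.
loss = masked_bce([0.0, 5.0], [1, None])
```

With this masking, a dataset that annotates only categories and another that annotates only attributes can share one classifier head without penalizing the missing labels.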


An overview of the proposed method. Left: the two-step training procedure for finetuning the pre-trained CLIP to obtain CLIP-Attr, which better aligns regional visual features with attributes. Step-I: naive federated training with base attribute annotations. Step-II: training with image-caption pairs. We first run the RPN on the whole image to get box-level crops, parse the caption to get noun phrases, categories, and attributes, and then match these fine-grained concepts for weakly supervised training. Right: the proposed one-stage framework OvarNet. We inherit CLIP-Attr for open-vocabulary object attribute recognition. The regional visual feature is learned by attentional pooling of proposals, while the attribute concept embedding is extracted from the text encoder. Solid lines denote the standard federated training regime. Dashed lines denote training by knowledge distillation from CLIP-Attr.
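The pairing of a regional visual feature with attribute concept embeddings from a text encoder can be sketched as a similarity lookup. The snippet below is an illustration only: it assumes cosine similarity followed by a sigmoid (attributes are multi-label), and the `temperature` value and the toy 2-D embeddings are hypothetical stand-ins for real encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def attribute_scores(region_feature, text_embeddings, temperature=0.07):
    """Score each attribute by cosine similarity to the region feature.

    text_embeddings maps an attribute name to its text-encoder embedding.
    Because the classifier weights are just text embeddings, a novel
    attribute is added by inserting one more entry, without retraining.
    """
    region = l2_normalize(region_feature)
    scores = {}
    for name, embedding in text_embeddings.items():
        text = l2_normalize(embedding)
        cosine = sum(a * b for a, b in zip(region, text))
        # Independent sigmoid per attribute (multi-label prediction).
        scores[name] = 1.0 / (1.0 + math.exp(-cosine / temperature))
    return scores


# Toy usage with made-up 2-D embeddings.
scores = attribute_scores(
    [1.0, 0.0],
    {"red": [2.0, 0.0], "blue": [0.0, 3.0], "green": [-1.0, 0.0]},
)
```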

Quantitative Results

R1: Benchmark on COCO and VAW Datasets

In the table, we compare OvarNet to other attribute prediction methods and open-vocabulary object detectors on the VAW test set and the COCO validation set. As no open-vocabulary attribute prediction method has been developed on the VAW dataset, we re-train two models, SCoNE and TAP, on the full VAW dataset as an oracle comparison. Our best model achieves 68.52/67.62 AP across all attribute classes for the box-given and box-free settings, respectively. On COCO open-vocabulary object detection, we compare with OVR-RCNN, ViLD, RegionCLIP, PromptDet, and Detic. Our best model obtains 54.10/35.17 AP for novel categories, surpassing the recent state-of-the-art ViLD-ens and Detic by a large margin and showing that attribute understanding is beneficial for open-vocabulary object recognition.

R2: Cross-dataset Transfer on OVAD Benchmark

We compare with other state-of-the-art methods on the OVAD benchmark. Following the same evaluation protocol, we conduct zero-shot cross-dataset transfer evaluation with CLIP-Attr and OvarNet trained on the COCO Caption dataset. The metric is average precision (AP) over different attribute frequency distributions: 'head', 'medium', and 'tail'. As shown in the table, our proposed models outperform the other competitors by a noticeable margin.
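To make the metric concrete, the sketch below computes AP per attribute and averages it within each frequency group. It is a simplified illustration, not the official OVAD evaluation code: the function names are hypothetical, the group assignment is passed in as an input, and details such as unknown labels or tie handling may differ in the benchmark.

```python
def average_precision(scores, labels):
    """AP for one attribute: mean precision at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def grouped_map(per_attribute, groups):
    """Average the per-attribute AP within each frequency group.

    per_attribute: {attribute: (scores, labels)}
    groups: {attribute: "head" | "medium" | "tail"} (assumed to be given)
    """
    buckets = {}
    for attribute, (scores, labels) in per_attribute.items():
        ap = average_precision(scores, labels)
        buckets.setdefault(groups[attribute], []).append(ap)
    return {group: sum(aps) / len(aps) for group, aps in buckets.items()}


# Toy usage with made-up predictions for two attributes.
result = grouped_map(
    {"red": ([0.9, 0.1], [1, 0]), "fluffy": ([0.2, 0.8], [1, 0])},
    {"red": "head", "fluffy": "tail"},
)
```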

R3: Evaluation on LSA Benchmark

We evaluate the proposed OvarNet on the same benchmark proposed by Pham et al. As OpenTAP employs a Transformer-based architecture with the object category and object bounding box as additional prior inputs, we evaluate two settings. One is the original OvarNet without any additional input information; the other integrates the object category embedding as an extra token into the transformer encoder layer. As shown in the table, OvarNet outperforms prompt-based CLIP by a large margin and surpasses OpenTAP (proposed in the benchmark paper) under the same scenario, i.e., with the additional category embedding introduced. 'Attribute prompt' means a prompt with a format similar to "A photo of something that is [attribute]", while 'object-attribute prompt' denotes "A photo of [category] [attribute]". For the 'combined prompt', the outputs of the 'attribute prompt' and the 'object-attribute prompt' are combined by weighted averaging.
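The three prompt variants can be written down directly from the templates quoted above. This is a small sketch: the function names are hypothetical, and the default `weight` of 0.5 is an assumption, since the weighting used for the combined prompt is not stated here.

```python
def attribute_prompt(attribute):
    """Prompt that mentions only the attribute."""
    return f"A photo of something that is {attribute}"


def object_attribute_prompt(category, attribute):
    """Prompt that mentions both the category and the attribute."""
    return f"A photo of {category} {attribute}"


def combined_score(attr_score, obj_attr_score, weight=0.5):
    """Weighted average of the scores obtained from the two prompt types.

    weight is the share given to the attribute-prompt score (assumed value).
    """
    return weight * attr_score + (1.0 - weight) * obj_attr_score


# Toy usage with made-up scores for the attribute "red" on a "car".
prompts = (attribute_prompt("red"), object_attribute_prompt("car", "red"))
score = combined_score(0.8, 0.4)
```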


In the following figure, we show qualitative results of OvarNet on the VAW and MS-COCO benchmarks. OvarNet is capable of accurately localizing, recognizing, and characterizing objects across a broad variety of novel categories and attributes.


    title={OvarNet: Towards Open-vocabulary Object Attribute Recognition},
    author={Chen, Keyan and Jiang, Xiaolong and Hu, Yao and Tang, Xu and Gao, Yan and Chen, Jianqi and Xie, Weidi},

