Task driven object detection aims to detect object instances suitable for affording a task in an image. Its challenge is that the object categories available for the task are too diverse to be confined to the closed vocabulary of traditional object detection. Simply mapping the categories and visual features of common objects to the task cannot address this challenge. In this paper, we propose to explore fundamental affordances rather than object categories, i.e., common attributes that enable different objects to accomplish the same task. Moreover, we propose a novel multi-level chain-of-thought prompting (MLCoT) to extract affordance knowledge from large language models, which contains multi-level reasoning steps from the task to object examples to essential visual attributes with rationales. Furthermore, to fully exploit this knowledge for object recognition and localization, we propose a knowledge-conditional detection framework, namely CoTDet. It conditions the detector on the knowledge to generate object queries and regress boxes. Experimental results demonstrate that our CoTDet outperforms state-of-the-art methods consistently and significantly (+15.6 box AP and +14.8 mask AP) and can generate rationales for why objects are detected as affording the task.
Given a task (e.g., “open parcel”) and an image, task driven object detection requires detecting the set of objects most preferred to afford the task. Note that the target objects indicated by the task are not fixed in quantity or category and may vary with the image scene. In contrast, traditional object detection detects objects of fixed categories, while referring image grounding localizes unambiguous objects.
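To make the problem setting concrete, the following is a minimal sketch of the task-driven detection interface; the names (`Detection`, `detect_for_task`) are hypothetical and used only for illustration. Unlike closed-vocabulary detection, the output is a variable-sized set that may be empty when nothing in the image affords the task.

```python
from dataclasses import dataclass
from typing import List, Optional
import numpy as np

@dataclass
class Detection:
    box: List[float]            # [x1, y1, x2, y2] in image coordinates
    mask: Optional[np.ndarray]  # HxW binary segmentation mask
    score: float                # how preferred the object is for the task

def detect_for_task(image: np.ndarray, task: str) -> List[Detection]:
    """Return the objects in `image` most preferred to afford `task`.

    The returned set has no fixed size or category vocabulary and may be
    empty, e.g., when no object in the scene can be used to "open parcel".
    """
    raise NotImplementedError  # interface only; not an implementation
```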
The framework of our proposed CoTDet is shown in the figure. First, we introduce the problem definition and the image and text encoders. Second, we acquire visual affordance knowledge from large language models (LLMs) by leveraging multi-level chain-of-thought prompting and aggregation. Next, we present the knowledge-conditional decoder, which conditions on the acquired knowledge to detect and segment suitable objects. Finally, we introduce the loss functions.
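To illustrate the knowledge acquisition step, below is a hedged sketch of multi-level chain-of-thought prompting, assuming an OpenAI-style chat API. The model name, helper functions, and prompt wording are assumptions for illustration, not the exact templates used in the paper: the first level asks the LLM for object examples that afford the task, and the second level asks for the essential visual attributes, with rationales, that make those objects suitable.

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

def query_llm(prompt: str) -> str:
    # Single-turn query to a chat model; any capable LLM could be substituted.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice, not the paper's model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def mlcot_affordance_knowledge(task: str) -> str:
    # Level 1: task -> object examples that could afford the task.
    examples = query_llm(
        f"List common objects that a person could use to {task}."
    )
    # Level 2: object examples -> essential visual attributes with rationales.
    attributes = query_llm(
        f"For the task '{task}', these objects are suitable: {examples}\n"
        "What common visual attributes make them suitable for the task? "
        "Explain the rationale for each attribute."
    )
    return attributes

# Example usage: mlcot_affordance_knowledge("open bottle of beer")
```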
As shown in the tables below, we compare CoTDet with state-of-the-art methods on both the detection and segmentation benchmarks. CoTDet consistently outperforms them on all tasks.
Compared to TOIST, our CoTDet achieves a significant performance improvement (+15.6% mAPbox and +14.8% mAPmask), which demonstrates the effectiveness of our task-relevant knowledge acquisition and utilization. Compared to the two-stage method GGNN, we achieve gains of 24.3% mAPbox and 21.2% mAPmask, which demonstrates the importance of leveraging visual affordance knowledge rather than purely visual context information.
Our CoTDet significantly improves detection and segmentation performance on task4 (get potatoes out of fire), task6 (get lemon out of tea), and task7 (dig hole), achieving approximately 20% mAP improvement on both benchmarks. These tasks share the challenge that target objects vary widely in category and visual appearance, which is hard to handle for methods like [19, 41] that merely learn a mapping between tasks and objects’ categories and visual features. In contrast, our method explicitly acquires the visual affordance knowledge of tasks to detect rare objects and avoid overfitting to common objects, and thus outperforms prior methods significantly on these tasks. In addition, for less challenging tasks with only a few ground-truth object categories, we still achieve approximately 8% mAP improvement, demonstrating the effectiveness of conditioning on visual affordances for object localization.
Here we visualize several qualitative results. For (a), no objects in the image should be selected to “get lemon out of tea”. Our model correctly returns an empty set, while TOIST detects the french fries, one of the salient objects in the image, as the tool. Similarly, as knives are uncommon tools for “open bottle of beer”, the knife in (b) is challenging for TOIST to identify and localize. Guided by the visual affordance of “sharp blade with a pointed end”, our model correctly localizes and selects the sharp knife.
Examples (c) and (d) illustrate the effect of removing MLCoT or knowledge-conditional denoising training (KDN). With visual affordance knowledge obtained by directly asking LLMs, our model relies solely on matching with a single knowledge unit, which incorrectly detects the trunk in (c) and misses the knife in (d). The trunk is easily confused with objects that are “flat, broad with a handle”, while the knife is ignored because its straight shape does not match the single knowledge unit, which includes “curved or angled”. Furthermore, without KDN, our detector lacks explicit guidance, leading to inaccurate detections in challenging scenes. Specifically, the glove in (c) and the knife in (d) are not detected, and the packing line in (d) is mistakenly detected.
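To convey why multiple knowledge units help, below is a hedged sketch under assumed embeddings and function names, not the paper's exact formulation: each object query is scored against every knowledge-unit embedding and its best match is kept, so a straight knife can be selected via a “sharp blade with a pointed end” unit even if another unit describes “curved or angled” shapes, whereas a single aggregated unit would reject it.

```python
import torch
import torch.nn.functional as F

def knowledge_match_scores(query_feats: torch.Tensor,
                           knowledge_units: torch.Tensor) -> torch.Tensor:
    """Score each object query against a set of knowledge units.

    query_feats:     (num_queries, d) visual query embeddings
    knowledge_units: (num_units, d) text embeddings of visual affordances
    Returns a per-query relevance score, taken as the maximum cosine
    similarity over all knowledge units, so matching any one unit suffices.
    """
    q = F.normalize(query_feats, dim=-1)
    k = F.normalize(knowledge_units, dim=-1)
    sim = q @ k.t()                   # (num_queries, num_units)
    return sim.max(dim=-1).values     # best-matching unit per query
```

With a single knowledge unit (num_units = 1), the maximum collapses to one similarity, which mirrors the failure mode seen for the straight knife in (d).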