CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection
Figure 2. Overall framework of the proposed CoTDet with multi-level chain-of-thought prompting (MLCoT). We first generate
visual affordance knowledge from LLMs with the proposed novel MLCoT. Next, we perform knowledge-conditional object detection by
utilizing the knowledge to generate object queries for the scene as well as guide object localization through denoising training.
The pioneering work [42] adopts a two-stage framework that first detects objects and then compares among objects to select suitable ones via graph neural networks [43]. The follow-up work TOIST [19] distills target object names to pronouns such as "something" by conditioning on the language description of the task. However, these works fail to model the explicit requirements of tasks and objects' affordances for tasks, which limits their performance and generalization capabilities.

Knowledge Acquisition in Vision-Language Tasks. Integrating external knowledge into computer vision tasks [16, 35, 44, 46] and vision-language tasks [29, 49, 50, 63] has been found beneficial. Previous methods [16, 28, 30, 35, 53, 61] focus on acquiring structured knowledge (e.g., ConceptNet [22]), which usually includes commonsense concepts and relations and is presented in fixed data structures such as graphs or triples. Recently, large language models [2, 7, 36] have been demonstrated to learn open-world commonsense knowledge from large-scale corpora [32, 40]. Some works [9, 59] utilize language models to encode the representations of inputs or to directly generate answers conditioned on visual inputs, leveraging the latent knowledge in language models. Unlike previous works, we prompt the language model GPT-3 [8] to obtain external knowledge explicitly with the chain of thought (CoT) [25, 52, 64] for better interpretability. To the best of our knowledge, we are the first to explore CoT to acquire visual commonsense knowledge in text form, leveraging the reasoning ability of the language model to filter effective visual functional attributes that afford tasks.

External Knowledge Utilization. Incorporating knowledge reasoning has attracted growing interest in the computer vision [11, 26, 34, 35, 44, 63] and vision-language [49] fields, in tasks such as object detection [44, 46], visual relationship detection [16, 61], and visual question answering [9, 28, 30, 53]. For tasks that focus on capturing relations among objects, such as scene graph generation and visual relationship detection, extracting knowledge of the interactions between general concepts is natural [11, 16, 61, 63]. Other tasks like object detection and image classification rely on category-related knowledge that is retrieved from knowledge bases as definitions or attributes of general concepts [9, 26, 34, 44, 46, 53]. For knowledge utilization, these methods mainly take external knowledge directly as an expansion of the visual content or explicitly constrain the consistency between knowledge and visual content. Different from existing works, we leverage attribute-level commonsense knowledge about the requirements for completing tasks and use this external task knowledge to condition the detector for task driven object detection.

3. Method

The framework of our proposed CoTDet is shown in Figure 2. First, we introduce the problem definition and the image and text encoders in Section 3.1. Second, we acquire visual affordance knowledge from LLMs by leveraging the proposed multi-level chain-of-thought prompting and aggregation (Section 3.2). Next, we present the knowledge-conditional decoder that conditions on the acquired knowledge to detect and segment suitable objects (Section 3.3). Finally, we introduce the loss functions in Section 3.4.
3.1. Problem Definition and Encoder

Problem Definition. Given a task in text form S (e.g., "open parcel") and an image I, task driven object detection requires detecting the set of objects most preferred to afford the task. Note that the target objects indicated by the task are not fixed in quantity and category, and may vary with changes in the image scene. In contrast, traditional object detection [3, 10, 65] detects objects of fixed categories, while referring image grounding [6, 17, 27, 33] localizes unambiguous objects.

Encoders. For the task S, we leverage the linguistic information from two perspectives. First, the text S is preserved as input for extracting knowledge from LLMs. Besides, we employ RoBERTa [23] as the text encoder to obtain the text feature t_s for the task S, which will be used to query the task-relevant visual content in the image. For the image I, we adopt ResNet-101 [13] as the visual backbone to extract multi-scale visual feature maps and flatten the maps along the spatial dimension into features V.

3.2. Visual Affordance Knowledge from LLMs

To detect objects that afford a particular task, we naturally consider the task's requirements first and subsequently localize suitable objects that meet these requirements. Nevertheless, the task requirements are abstract and cannot directly correspond to the visual content in the image for localizing the objects. Motivated by this, we propose to explicitly extract the common visual attributes (i.e., visual affordances) that enable different objects to afford the task and use these visual affordances to bridge task requirements and object instances in the image. Furthermore, we generate visual affordances from LLMs because LLMs generally contain world knowledge learned from a vast amount of text data.

Specifically, we first design a novel multi-level chain-of-thought prompting to leverage LLMs to generate visual affordances (see Section 3.2.1) and then encode and aggregate them automatically to be utilized for detection (see Section 3.2.2).

3.2.1 Multi-Level Chain-of-Thought Prompting

Our multi-level chain-of-thought prompting (MLCoT) leverages LLMs to generate and summarize intermediate reasoning steps from object examples to essential visual attributes with rationales. MLCoT first brainstorms object examples that afford the task, then considers rationales for why the examples can afford the task, and finally summarizes the corresponding visual attributes from the rationales.

Object-level Prompting as Brainstorming. At the first level, we prompt LLMs to generate daily object examples that afford the input task S. Specifically, we design the following text prompt:

Prompt: What common objects in daily life can be used as a vehicle for the human to [task]? Please list the twenty most suitable objects.
Output: knife, pen, paper cutter, scissors, screwdriver, ...

where [task] is filled with the task text S. We denote the number of objects returned from LLMs as N_o. For simplicity, we present only the most critical parts of the prompts; for the complete prompts, please refer to the Appendix. One straightforward idea is to perform traditional category-specific object detection with respect to the categories of the object examples. However, it is not feasible due to the following observations: (1) Object examples are overly limited to partial object categories, resulting in a gap between object categories and the actual task requirements. For instance, the fork in Figure 2 is not among the objects returned by LLMs. (2) A few noisy, unsuitable objects may be output. Although the noisy objects are few, relying entirely on the object examples is risky. For example, for the task of extinguishing fire, LLMs return the fire axe, a common firefighting tool, but it cannot be directly used to extinguish the fire.

Affordance-level Prompting with Rationales. To address the above challenges and capture the essential visual affordances implied in representative object examples, we propose to generate rationales for why these objects can afford the task and to summarize visual affordances from the rationales. With the object examples, we prompt LLMs to generate rationales as follows:

Prompt: For each object, let's think about the rationales for why they afford the task from the perspective of visual features.
Output: Knife: They have a sharp blade which can easily cut through paper and plastic; They have a handle which provides a good grip for the user. Pen: ...

By the above prompting, we get a set of rationales for each object. Next, we further prompt LLMs to summarize the visual-related rationales to form visual affordances as follows:

Prompt: Summarize the corresponding visual features of the object for each rationale.
Output: {A sharp blade and a handle.}, {...}, ...

Finally, we obtain N_o sets of visual affordances, where each set contains several visual attributes relevant to why objects can afford the task S. We define each set of visual affordance knowledge as a knowledge unit. Note that although each knowledge unit is derived from the rationales of an object example, the affordance knowledge in that unit is not limited to that object example or its related categories. For example, the first visual affordance unit comprises "a sharp blade and a handle", which corresponds to the returned object "knife"; notably, this visual affordance unit also applies to "box cutter" and "paper cutter".
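To make the three prompting levels concrete, here is a minimal Python sketch that chains them over a generic text-in/text-out LLM backend. The `mlcot` and `ask_llm` names, the exact prompt wording beyond the fragments quoted above, and the line-based parsing of the summary are illustrative assumptions, not the released implementation.

```python
from typing import Callable

def mlcot(task: str, ask_llm: Callable[[str], str]) -> list[str]:
    """Multi-level chain-of-thought prompting (Section 3.2.1), returning one
    textual visual-affordance description (knowledge unit) per object example.
    `ask_llm` is any text-in/text-out LLM backend (e.g., a GPT-3 client)."""
    # Level 1: brainstorm daily object examples that afford the task.
    objects = ask_llm(
        f"What common objects in daily life can be used as a vehicle for the "
        f"human to {task}? Please list the twenty most suitable objects."
    )
    # Level 2: rationales for why each example affords the task,
    # from the perspective of visual features.
    rationales = ask_llm(
        f"Objects: {objects}\nFor each object, let's think about the rationales "
        f"for why they afford the task '{task}' from the perspective of visual features."
    )
    # Level 3: summarize each rationale into a visual affordance description;
    # treating every returned line as one knowledge unit is a parsing assumption.
    summary = ask_llm(
        f"{rationales}\nSummarize the corresponding visual features of the "
        f"object for each rationale, one object per line."
    )
    return [line.strip() for line in summary.splitlines() if line.strip()]

# e.g., mlcot("open parcel", my_llm) might return
# ["A sharp blade and a handle.", "A pointed tip.", ...]
```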
3.2.2 Knowledge Encoding and Aggregating

We further extract a refined knowledge base by filtering out the few knowledge units corresponding to the unsuitable object examples mentioned in Section 3.2.1. For each knowledge unit, we concatenate the textual descriptions of its visual affordances into a textual sequence and then utilize RoBERTa [23] to obtain its sentence feature. To filter out unsuitable units, we compute the cosine similarity between each pair of knowledge units and exclude outlier units whose maximum similarity to the other units falls below a predetermined threshold. Additionally, for each selected unit, we extract its word representations via RoBERTa. In summary, we aggregate N visual affordance knowledge units, denoted as K = \{(p^k_j, p^v_j)\}_{j=1}^{N}, where p^k_j and p^v_j are the sentence feature and word features of the j-th unit, respectively.

3.3. Knowledge-conditioned Decoder

We build on the detection architecture of Deformable-DETR [65], a DETR-like detector [3, 5, 18], which uses object queries to capture object-level information for detection (Section 3.3.1). Unlike randomly initializing the object queries, we leverage visual affordance knowledge to generate the object queries (Section 3.3.2) and to guide the bounding box regression with denoising training (Section 3.3.3).

3.3.1 Introduction to Deformable-DETR

Deformable-DETR contains a Transformer encoder and a Transformer decoder. The encoder takes the visual features V as input and outputs the refined visual features F = {f_1, f_2, ..., f_i} via multi-scale deformable attention. The decoder randomly initializes queries Q = {q_1, q_2, ..., q_k} and predicts a reference point l_k for each object query q_k; these reference points L = {l_1, l_2, ..., l_k} serve as the initial guess of the box centres. Next, the decoder searches for objects O for these queries Q with reference points L via multi-scale deformable cross-attention and self-attention, which is formulated as follows:

O = \mathrm{Deformable}([Q, L], F),    (1)

where Deformable(·, ·) denotes the Transformer decoder of Deformable-DETR.

3.3.2 Knowledge-conditional Query Generation

Instead of randomly initializing the queries, we generate the queries and their reference points based on the visual content of the image, the task, and the visual affordance knowledge. Specifically, we utilize the visual affordance knowledge to select visual features and combine them with the knowledge to generate queries; the spatial information of these visual features then naturally becomes the spatial priors of the reference points.

Given the visual features F = {f_1, f_2, ..., f_i}, we first fuse each feature f_i with the task's text feature t_s and then calculate its relevance to the task's visual affordance knowledge K = \{(p^k_j, p^v_j)\}_{j=1}^{N}. Since each knowledge unit (p^k_j, p^v_j) in the knowledge base K is a set of affordances that meet the task requirements, we use the largest similarity between the fused feature and the knowledge units in K as the feature's relevance score. The calculation is formulated as follows:

s_{i,j} = \mathrm{cos}(\mathrm{fc}(f_i) + \mathrm{fc}(t_s), p^k_j),
r_i = \mathrm{max}_j(s_{i,j}),  d_i = \mathrm{argmax}_j(s_{i,j}),    (2)

where cos(·, ·) computes the cosine similarity, fc(·) represents a fully connected layer, and s_{i,j} is the similarity between the i-th visual feature and the j-th knowledge unit. Then, r_i and d_i denote the i-th visual feature's relevance score and the index of its corresponding knowledge unit, respectively.

Next, we select the visual features with the top-k largest relevance scores {r_i} and incorporate their corresponding knowledge {(p^k_{d_i}, p^v_{d_i})} to generate the queries Q^kn as follows:

Q^{kn} = \mathrm{topk}_{r_i}\{f_i + \mathrm{AttentionPool}(f_i, p^v_{d_i})\},    (3)

where topk_{r_i} means selecting the corresponding features with the top-k largest relevance scores r_i. The attention pooling layer [48] AttentionPool(f_i, p^v_{d_i}) returns the features of p^v_{d_i} weighted by their similarities to f_i. Note that, for each knowledge unit (p^k_j, p^v_j), we use its global sentence feature p^k_j to compute its overall similarity to each visual feature in Eq. 2, while adopting the word-level features p^v_j to better enhance the query's fine-grained representations in Eq. 3. Similar to Deformable-DETR, we further predict the reference points L^kn from the queries Q^kn. In addition, to facilitate the learning of the top-k selection, the selected queries Q^kn are directly fed into the prediction heads and supervised during training using the same training loss as in Section 3.4.

3.3.3 Knowledge-conditional Decoding

With the queries Q^kn, the reference points L^kn, and the refined visual features F, we apply the Deformable decoder to search for objects O^kn as follows:

O^{kn} = \mathrm{Deformable}([Q^{kn}, L^{kn}], F).    (4)
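The following PyTorch-style sketch mirrors the knowledge reading and query generation of Eqs. (2)-(3); the reference-point prediction and the decoding of Eq. (4) are left to the standard Deformable-DETR heads. The function name, the tensor-shape conventions, the use of two separate fully connected layers, and the `attn_pool` module are assumptions for illustration rather than the exact released code.

```python
import torch
import torch.nn.functional as F

def knowledge_conditional_queries(feats, t_s, p_k, p_v, fc_v, fc_t, attn_pool, k):
    """Knowledge reading and query generation of Eqs. (2)-(3).

    feats: (L, d) refined visual features F from the encoder
    t_s:   (d,)   task text feature from RoBERTa
    p_k:   (N, d) sentence features of the N knowledge units
    p_v:   list of N tensors (W_j, d) with the word features of each unit
    fc_v, fc_t:   fully connected layers fusing visual and text features (Eq. 2)
    attn_pool:    attention-pooling module over word features (Eq. 3)
    k:            number of selected queries
    """
    fused = fc_v(feats) + fc_t(t_s)                    # fc(f_i) + fc(t_s)
    sim = F.cosine_similarity(fused.unsqueeze(1),      # s_{i,j}: (L, N)
                              p_k.unsqueeze(0), dim=-1)
    r, d_idx = sim.max(dim=1)                          # r_i and d_i of Eq. (2)

    _, top_i = r.topk(k)                               # top-k most relevant features
    queries = torch.stack([feats[i] + attn_pool(feats[i], p_v[d_idx[i]])
                           for i in top_i.tolist()])   # Q^kn of Eq. (3)
    return queries  # paired with predicted reference points L^kn and decoded via Eq. (4)
```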
Tasks: task1: step on; task2: sit comfortably; task3: place flowers; task4: get potatoes out of fire; task5: water plant; task6: get lemon out of tea; task7: dig hole; task8: open bottle of beer; task9: open parcel; task10: serve wine; task11: pour sugar; task12: smear butter; task13: extinguish fire; task14: pound carpet.

Method      | task1 | task2 | task3 | task4 | task5 | task6 | task7 | task8 | task9 | task10 | task11 | task12 | task13 | task14 | Mean
GGNN [42]   | 36.6  | 29.8  | 40.5  | 37.6  | 41.0  | 17.2  | 43.6  | 17.9  | 21.0  | 40.6   | 22.3   | 28.4   | 39.1   | 40.7   | 32.6
TOIST [19]  | 44.0  | 39.5  | 46.7  | 43.1  | 53.6  | 23.5  | 52.8  | 21.3  | 23.0  | 46.3   | 33.1   | 41.7   | 48.1   | 52.9   | 41.3
TOIST† [19] | 45.8  | 40.0  | 49.4  | 49.6  | 53.4  | 26.9  | 58.3  | 22.6  | 32.5  | 50.0   | 35.5   | 43.7   | 52.8   | 56.2   | 44.1
Ours        | 58.9  | 55.0  | 51.2  | 68.5  | 60.5  | 47.7  | 76.9  | 40.7  | 47.4  | 66.5   | 41.9   | 48.3   | 61.7   | 71.4   | 56.9

Table 1. Comparison with state-of-the-art models for task driven object detection on the COCO-Tasks dataset (AP^box@0.5 per task). † denotes the model trained with noun-pronoun distillation.
In addition to utilizing visual affordance knowledge for query generation and thereby providing the decoder with prior knowledge, we further improve the knowledge utilization by designing a knowledge-based denoising training [18]. As the visual affordance knowledge indicates the target objects' visual attributes, such as shape and size, the knowledge-based denoising guides the decoder in learning how to use this kind of visual knowledge to regress the targets' boxes.

Specifically, during the training stage, we first randomly add noise to the ground-truth boxes O^gt = \{o^gt_m\}_{m=1}^{M} to construct noised objects following DN-DETR [18], and then extract the noised boxes' visual features and centers as the noised queries F^noise = \{f^noise_m\}_{m=1}^{M} and the noised reference points L^noise. Notice that the previous denoising training method [18] adds noise to both boxes and category labels to better capture label-box relations, whereas we only add noise to boxes because we aim to utilize the noise-free knowledge to help denoise the boxes. Therefore, we extract the knowledge unit (p^k_{d_m}, p^v_{d_m}) for each ground-truth box o^gt_m through Eq. 2. Finally, the knowledge units \{(p^k_{d_m}, p^v_{d_m})\}_{m=1}^{M} guide the decoder to regress the ground-truth boxes O^gt from the noised queries F^noise, which is formulated as follows:

P^{kn} = \{\mathrm{AttentionPool}(f^{noise}_m, p^v_{d_m})\}_{m=1}^{M},
O^{denoise} = \mathrm{Deformable}([F^{noise} + P^{kn}, L^{noise}], F),    (5)

where P^kn is the visual affordance knowledge of the noised queries, and the Deformable(·, ·) in Eq. 4 and Eq. 5 shares the same parameters. The denoising branch is only used during the training stage.

3.4. Loss Functions

Following DETR [3], we use bipartite matching to find the unique predictions for the ground-truth objects and adopt the same bounding box regression loss L_box, consisting of the L1 loss and the GIoU [39] loss. Moreover, we use the binary cross entropy loss as the classification loss L_cl. The overall loss is represented as:

\mathcal{L}_{cost} = \lambda_{cl}\mathcal{L}_{cl} + \lambda_{box}\mathcal{L}_{box},    (6)

where λ_cl and λ_box are the hyperparameters of the weighted loss. Our method can be easily extended to instance segmentation by adding a segmentation head [5] and replacing the box regression loss with the Dice loss L_mask.
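As a concrete reading of Eq. (6), below is a sketch of the weighted loss for prediction-target pairs that have already been matched by bipartite matching (the matching itself and the denoising-branch supervision are omitted). The (x1, y1, x2, y2) box format and the equal weighting of the L1 and GIoU terms inside L_box are assumptions; λ_cl = 4 and λ_box = 5 follow Section 4.1.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou

def detection_loss(pred_logits, pred_boxes, gt_labels, gt_boxes,
                   lambda_cl=4.0, lambda_box=5.0):
    """Weighted loss of Eq. (6) for already-matched pairs.

    pred_logits: (M,) raw scores, gt_labels: (M,) binary targets,
    pred_boxes / gt_boxes: (M, 4) in (x1, y1, x2, y2) format.
    """
    # Classification: binary cross entropy (computed from logits here).
    loss_cl = F.binary_cross_entropy_with_logits(pred_logits, gt_labels.float())

    # Box regression: L1 + GIoU [39], as in DETR-style detectors;
    # equal weighting of the two terms is an assumption.
    loss_l1 = F.l1_loss(pred_boxes, gt_boxes)
    loss_giou = (1.0 - torch.diag(generalized_box_iou(pred_boxes, gt_boxes))).mean()
    loss_box = loss_l1 + loss_giou

    return lambda_cl * loss_cl + lambda_box * loss_box
```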
4. Experiment

4.1. Dataset and Implementation Details

Dataset. We conduct experiments on the COCO-Tasks dataset [42], which comprises 14 different tasks (see Table 1). This dataset is derived from the COCO dataset [21] but with customized annotations for task driven object detection. Each task contains 3600 training and 900 testing images. Besides, we follow [19] to incorporate mask annotations into the original COCO-Tasks dataset for the instance segmentation benchmark.

Implementation Details. Following previous works [19, 42], we use ResNet-101 [13] as the image encoder and RoBERTa [23] as the text encoder. The model is pre-trained on the COCO dataset, with images that are already part of COCO-Tasks removed. We train the model for 4000 iterations with an initial learning rate of 1e-4 and use AdamW [24] as the optimizer. The hyperparameters λ_cl and λ_box are 4 and 5, respectively. Following [19], we evaluate the segmentation and detection performance of each task using AP^mask@0.5 and AP^box@0.5, respectively, and denote their means across all tasks as mAP^mask and mAP^box. Unless otherwise specified, we leverage GPT-3 [2] to extract visual affordance knowledge due to its capability to generate rationales [52].

4.2. Comparison with State-of-the-Art Methods

Table 1 and Table 2 show the comparison of our CoTDet with state-of-the-art models (SOTAs) on the detection and segmentation benchmarks. Our model consistently outperforms the SOTAs [19, 42] on all benchmarks and tasks.

Comparison with SOTAs. Compared to TOIST [19], our CoTDet achieves significant performance improvements (15.6% mAP^box and 14.8% mAP^mask), which demonstrates the effectiveness of our task-relevant knowledge acquisition and utilization. Compared to the two-stage method GGNN [42], we achieve 24.3% mAP^box and 21.2% mAP^mask performance gains, which demonstrates the importance of leveraging visual affordance knowledge rather than purely visual context information.

Method      | task1 | task2 | task3 | task4 | task5 | task6 | task7 | task8 | task9 | task10 | task11 | task12 | task13 | task14 | Mean
GGNN [42]   | 31.8  | 28.6  | 45.4  | 33.7  | 46.8  | 16.6  | 37.8  | 15.1  | 15.0  | 49.9   | 24.9   | 18.9   | 49.8   | 39.7   | 32.4
TOIST [19]  | 37.0  | 34.4  | 44.7  | 34.2  | 51.3  | 18.6  | 40.5  | 17.1  | 23.4  | 43.8   | 29.3   | 39.9   | 46.6   | 42.4   | 35.2
TOIST† [19] | 40.8  | 36.5  | 48.9  | 37.8  | 43.4  | 22.1  | 44.4  | 20.3  | 26.9  | 48.1   | 31.8   | 34.8   | 51.5   | 46.3   | 38.8
Ours        | 55.0  | 51.6  | 51.2  | 57.7  | 60.1  | 43.1  | 65.9  | 40.4  | 45.4  | 64.8   | 40.4   | 48.7   | 61.7   | 64.4   | 53.6

Table 2. Comparison with state-of-the-art models for task driven instance segmentation on the COCO-Tasks dataset (AP^mask@0.5 per task). † denotes the model trained with noun-pronoun distillation.
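As a sanity check on the metric, the reported means are simply the arithmetic average of the 14 per-task APs; for example, averaging our per-task values from Table 1 and Table 2 reproduces the 56.9 and 53.6 means.

```python
# Per-task AP@0.5 of CoTDet from Table 1 (detection) and Table 2 (segmentation).
ap_box = [58.9, 55.0, 51.2, 68.5, 60.5, 47.7, 76.9,
          40.7, 47.4, 66.5, 41.9, 48.3, 61.7, 71.4]
ap_mask = [55.0, 51.6, 51.2, 57.7, 60.1, 43.1, 65.9,
           40.4, 45.4, 64.8, 40.4, 48.7, 61.7, 64.4]

mAP_box = sum(ap_box) / len(ap_box)     # 56.9, matching the "Mean" column of Table 1
mAP_mask = sum(ap_mask) / len(ap_mask)  # 53.6, matching the "Mean" column of Table 2
print(round(mAP_box, 1), round(mAP_mask, 1))
```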
Comparison on Sub-tasks. The following comparisons on sub-tasks further demonstrate that affordance-level knowledge is capable of bridging tasks and objects. Our CoTDet significantly improves the detection and segmentation performance on task4 (get potatoes out of fire), task6 (get lemon out of tea), and task7 (dig hole), achieving approximately 20% mAP improvement on both benchmarks. These tasks face the common challenge of a wide variety of target categories and visual appearances, which is hardly handled by methods like [19, 42] that merely learn the mapping between tasks and objects' categories and visual features. In contrast, our method explicitly acquires the visual affordance knowledge of tasks to detect rare objects and avoid overfitting to common objects, and thus outperforms significantly on these tasks. In addition, for the less challenging tasks with a few ground-truth object categories, we still achieve approximately 8% mAP improvement, demonstrating the effectiveness of conditioning object localization on visual affordances.

4.3. Ablation Study

We evaluate seven variants of our CoTDet and two SOTAs with the same backbones as ours to validate the effectiveness of the proposed knowledge acquisition and utilization. The results are shown in Table 3. In addition to mAP, we report AP^box@0.5 on the relatively easy task2 (sit comfortably) and the challenging task6 (get lemon out of tea) and task9 (open parcel) for reference; the full results and analysis on sub-tasks are provided in the Appendix.

Ablation          | task2 | task6 | task9 | mAP^box | mAP^mask
"objects"         | 25.4  | 16.5  | 21.0  | 31.9    | 31.3
"visual"          | 50.4  | 30.3  | 38.3  | 48.1    | 44.7
w/o rationales    | 52.0  | 40.7  | 41.2  | 52.4    | 49.0
MLCoT             | 55.0  | 47.7  | 47.5  | 56.9    | 53.6
MLCoT (ChatGPT)   | 50.6  | 48.1  | 50.3  | 57.0    | 54.0
Def+GGNN [42]     | 38.6  | 24.7  | 23.4  | 38.8    | 35.8
Def+TOIST [19]    | 43.4  | 21.0  | 29.0  | 40.3    | 37.6
Init w/ MLCoT     | 42.2  | 35.9  | 35.6  | 48.7    | 46.4
Fuse w/ MLCoT     | 44.0  | 42.3  | 41.2  | 50.6    | 47.7
Select w/ MLCoT   | 50.0  | 47.2  | 43.7  | 55.3    | 51.7
Full Decoder      | 55.0  | 47.7  | 47.5  | 56.9    | 53.6

Table 3. Ablation study on knowledge acquisition, detection framework, and knowledge utilization of our CoTDet (AP^box@0.5 on task2, task6, and task9, plus mAP^box and mAP^mask over all tasks).

MLCoT Prompting for Knowledge Acquisition. To evaluate the impact of the core designs in MLCoT, we replace the MLCoT pipeline with the following approaches and utilize the acquired knowledge as the condition to guide detection: (1) We encode the object categories returned by the object level of MLCoT as the knowledge to perform knowledge-conditional object detection. The results (31.9% mAP^box and 31.3% mAP^mask) demonstrate that simply extracting object categories from LLMs cannot achieve satisfactory performance. (2) We attempt to acquire affordance-level rather than object-level knowledge. Specifically, we prompt LLMs by asking "what visual features can we use to determine the suitability of an object for {TASK}?" to generate visual affordance knowledge directly. This attempt improves on the above object-level model by 16.2% mAP^box and 13.4% mAP^mask, showing the necessity of exploring the essential visual affordances behind the object categories. However, this model still underperforms our full model by approximately 9% mAP: it is difficult to summarize a unified description of widely varying objects without priors, resulting in only one set of visual attributes being returned from LLMs. (3) To increase the diversity of visual affordances, we prompt LLMs to generate visual features for each retrieved object, which leads to a significant improvement, reaching 52.4% mAP^box and 49.0% mAP^mask. (4) Finally, we further add rationales to filter out misleading and irrelevant attributes, achieving a 4.5% and 4.6% increase in mAP^box and mAP^mask, respectively. (5) We also evaluate the effect of using different LLMs to extract visual affordance knowledge: our MLCoT with ChatGPT [31] achieves a similar mAP to MLCoT with GPT-3.

Knowledge-conditional Object Detection. To validate the effectiveness of our proposed knowledge-conditioned decoder, we conduct ablation studies with two baselines and three variants based on the Deformable-DETR [65] framework: (1) We develop GGNN [42] on the Deformable-DETR detection framework. Def+GGNN simply learns the relations between objects and identifies objects based on their contexts, limiting its performance.
[Figure 3: qualitative examples; the left group compares GT, TOIST, and Ours, and the right group compares GT, "visual", w/o KDN, and Ours. Panel (b) shows "open bottle of beer" (TOIST: ∅) and panel (d) shows "get lemon out of tea".]
Rationale for (b): sharp blade with a pointed end to insert into the bottle cap.
Rationale for (d): the handle of the spatula is long enough to reach the bottom of the cup; the flat edge is designed to scoop up the lemon pieces.
Figure 3. Visualization of prediction results of our CoTDet, its variants, and the existing best-performing TOIST [19].
(2) Besides, similar to TOIST [19], we initialize the queries with the task's textual feature based on our framework. There is a performance gap of 16.6% mAP^box and 14.1% mAP^mask between Def+TOIST and our final model. (3) We introduce the visual affordance knowledge extracted by MLCoT but simply use it to initialize the queries of the decoder (Init w/ MLCoT). This model achieves significant performance gains compared to the two baselines. (4) We further fuse the knowledge with the image's visual feature map to construct a multi-modal feature map (Fuse w/ MLCoT), which jointly understands the two modalities and improves performance (1.9% mAP^box and 1.3% mAP^mask) compared to the last model. (5) Our proposed knowledge-conditional query generation, which generates queries based on the visual content of the image, the task, and the visual affordance knowledge, helps the decoder better localize the objects, resulting in average improvements of 4.7% mAP^box and 4.0% mAP^mask. (6) Finally, the knowledge-conditional denoising training improves AP^box and AP^mask by 1.6% and 1.9%, respectively.

4.4. Visualization

Figure 3 visualizes qualitative results for several examples. For (a), no objects in the image should be selected to "get lemon out of tea". Our model successfully returns the empty set, while TOIST detects the french fry, one of the salient objects in the image, as the tool. Similarly, as knives are uncommon for "open bottle of beer", the knife in (b) is challenging for TOIST to identify and locate. Guided by the visual affordance of "sharp blade with a pointed end", our model correctly localizes and selects the sharp knife. Examples (c) and (d) demonstrate the effectiveness of MLCoT and knowledge-conditional denoising training (KDN) by removing them. With visual affordance knowledge obtained by directly asking LLMs, our model relies solely on matching with a single knowledge unit, which incorrectly detects the trunk in (c) and misses the knife in (d): the trunk is easily confused with objects that are "flat, broad with a handle", while the knife is ignored because its visual attribute of being straight mismatches the single knowledge unit that includes "curved or angled". Furthermore, without KDN, our detector lacks explicit guidance, leading to inaccurate detection in challenging scenes: the glove in (c) and the knife in (d) are not detected successfully, and the packing line in (d) is mistakenly detected.

5. Conclusion

In this paper, we focus on challenging task driven object detection, which is practical in the real world yet under-explored. To bridge the gap between abstract task requirements and objects in the image, we propose to explicitly extract visual affordance knowledge for the task and to detect objects whose visual attributes are consistent with this knowledge. Furthermore, our CoTDet utilizes the visual affordance knowledge to condition the decoder in localizing and recognizing suitable objects.

Limitations: While acknowledging the disparity between the COCO-Tasks dataset and real-world application scenarios, attributed to its limited task variety and preferences in images and annotations, our approach has the potential to extend beyond these confines. Notably, our knowledge acquisition and utilization are flexible and generalizable, granting them the capacity to transcend specific datasets, tasks, object categories, or tools. We leave this to future work. Furthermore, with the incorporation of LLMs, our approach inherits potential social biases from LLMs, which could be reflected in a preference for selecting frequently used tools.

Acknowledgment: This work was supported by the National Natural Science Foundation of China (No. 62206174), the Shanghai Pujiang Program (No. 21PJ1410900), the Shanghai Frontiers Science Center of Human-centered Artificial Intelligence (ShangHAI), the MoE Key Laboratory of Intelligent Perception and Human-Machine Collaboration (ShanghaiTech University), and the Shanghai Engineering Research Center of Intelligent Vision and Imaging.
References

[1] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.
[2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
[3] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, pages 213–229, 2020.
[4] Dian Chen, Brady Zhou, Vladlen Koltun, and Philipp Krähenbühl. Learning by cheating. In Conference on Robot Learning, pages 66–75, 2020.
[5] Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems, 34:17864–17875, 2021.
[6] Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, and Houqiang Li. TransVG: End-to-end visual grounding with transformers. In ICCV, pages 1769–1779, 2021.
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[8] Luciano Floridi and Massimo Chiriatti. GPT-3: Its nature, scope, limits, and consequences. Minds and Machines, 30:681–694, 2020.
[9] Feng Gao, Qing Ping, Govind Thattai, Aishwarya Reganti, Ying Nian Wu, and Prem Natarajan. Transform-retrieve-generate: Natural language-centric outside-knowledge visual question answering. In CVPR, pages 5067–5077, 2022.
[10] Ross Girshick. Fast R-CNN. In ICCV, pages 1440–1448, 2015.
[11] Jiuxiang Gu, Handong Zhao, Zhe Lin, Sheng Li, Jianfei Cai, and Mingyang Ling. Scene graph generation with external knowledge and image reconstruction. In CVPR, pages 1969–1978, 2019.
[12] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, pages 2961–2969, 2017.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[14] Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. In ECCV, pages 108–124, 2016.
[15] Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. MDETR - modulated detection for end-to-end multi-modal understanding. In ICCV, pages 1780–1790, 2021.
[16] Keizo Kato, Yin Li, and Abhinav Gupta. Compositional learning for human object interaction. In ECCV, pages 234–251, 2018.
[17] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, pages 787–798, 2014.
[18] Feng Li, Hao Zhang, Shilong Liu, Jian Guo, Lionel M. Ni, and Lei Zhang. DN-DETR: Accelerate DETR training by introducing query denoising. In CVPR, pages 13619–13627, 2022.
[19] Pengfei Li, Beiwen Tian, Yongliang Shi, Xiaoxue Chen, Hao Zhao, Guyue Zhou, and Ya-Qin Zhang. TOIST: Task oriented instance segmentation transformer with noun-pronoun distillation. arXiv preprint arXiv:2210.10775, 2022.
[20] Liang Lin, Pengxiang Yan, Xiaoqian Xu, Sibei Yang, Kun Zeng, and Guanbin Li. Structured attention network for referring image segmentation. IEEE Transactions on Multimedia, 24:1922–1932, 2021.
[21] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014.
[22] Hugo Liu and Push Singh. ConceptNet—a practical commonsense reasoning tool-kit. BT Technology Journal, 22(4):211–226, 2004.
[23] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[24] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[25] Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpurohit, Peter Clark, and Ashwin Kalyan. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. In ICLR, 2023.
[26] Chengzhi Mao, Revant Teotia, Amrutha Sundar, Sachit Menon, Junfeng Yang, Xin Wang, and Carl Vondrick. Doubly right object recognition: A why prompt for visual rationales. arXiv preprint arXiv:2212.06202, 2022.
[27] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, pages 11–20, 2016.
[28] Kenneth Marino, Xinlei Chen, Devi Parikh, Abhinav Gupta, and Marcus Rohrbach. KRISP: Integrating implicit and symbolic knowledge for open-domain knowledge-based VQA. In CVPR, pages 14111–14121, 2021.
[29] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In CVPR, pages 3195–3204, 2019.
[30] Medhini Narasimhan, Svetlana Lazebnik, and Alexander Schwing. Out of the box: Reasoning with graph convolution nets for factual visual question answering. Advances in Neural Information Processing Systems, 31, 2018.
[31] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.
[32] Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. Language models as knowledge bases? arXiv preprint arXiv:1909.01066, 2019.
[33] Bryan A. Plummer, Arun Mallya, Christopher M. Cervantes, Julia Hockenmaier, and Svetlana Lazebnik. Phrase localization and visual relationship detection with comprehensive image-language cues. In ICCV, pages 1928–1937, 2017.
[34] Sarah Pratt, Rosanne Liu, and Ali Farhadi. What does a platypus look like? Generating customized prompts for zero-shot image classification. arXiv preprint arXiv:2209.03320, 2022.
[35] Mengshi Qi, Yunhong Wang, Jie Qin, and Annan Li. KE-GAN: Knowledge embedded generative adversarial networks for semi-supervised scene parsing. In CVPR, pages 5237–5246, 2019.
[36] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(1):5485–5551, 2020.
[37] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[38] Allen Z. Ren, Bharat Govil, Tsung-Yen Yang, Karthik R. Narasimhan, and Anirudha Majumdar. Leveraging language for accelerated learning of tool manipulation. In Conference on Robot Learning, pages 1531–1541, 2023.
[39] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In CVPR, pages 658–666, 2019.
[40] Adam Roberts, Colin Raffel, and Noam Shazeer. How much knowledge can you pack into the parameters of a language model? arXiv preprint arXiv:2002.08910, 2020.
[41] Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, and Bernt Schiele. Grounding of textual phrases in images by reconstruction. In ECCV, pages 817–834, 2016.
[42] Johann Sawatzky, Yaser Souri, Christian Grund, and Jurgen Gall. What object should I use? - Task driven object detection. In CVPR, pages 7605–7614, 2019.
[43] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.
[44] Sheng Shen, Chunyuan Li, Xiaowei Hu, Yujia Xie, Jianwei Yang, Pengchuan Zhang, Anna Rohrbach, Zhe Gan, Lijuan Wang, Lu Yuan, et al. K-LITE: Learning transferable visual models with external knowledge. arXiv preprint arXiv:2204.09222, 2022.
[45] Cheng Shi and Sibei Yang. Spatial and visual perspective-taking via view rotation and relation reasoning for embodied reference understanding. In ECCV, pages 201–218, 2022.
[46] Krishna Kumar Singh, Santosh Divvala, Ali Farhadi, and Yong Jae Lee. DOCK: Detecting objects by transferring common-sense knowledge. In ECCV, pages 492–508, 2018.
[47] Jiajin Tang, Ge Zheng, Cheng Shi, and Sibei Yang. Contrastive grouping with transformer for referring image segmentation. In CVPR, pages 23570–23580, 2023.
[48] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[49] Duc Minh Vo, Hong Chen, Akihiro Sugimoto, and Hideki Nakayama. NOC-REK: Novel object captioning with retrieved vocabulary from external knowledge. In CVPR, pages 17979–17987, 2022.
[50] Peng Wang, Qi Wu, Chunhua Shen, Anthony Dick, and Anton Van Den Hengel. FVQA: Fact-based visual question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(10):2413–2427, 2017.
[51] Zhaoqing Wang, Yu Lu, Qiang Li, Xunqiang Tao, Yandong Guo, Mingming Gong, and Tongliang Liu. CRIS: CLIP-driven referring image segmentation. In CVPR, pages 11686–11695, 2022.
[52] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022.
[53] Qi Wu, Peng Wang, Chunhua Shen, Anthony Dick, and Anton Van Den Hengel. Ask me anything: Free-form visual question answering based on knowledge from external sources. In CVPR, pages 4622–4630, 2016.
[54] Sibei Yang, Guanbin Li, and Yizhou Yu. Cross-modal relationship inference for grounding referring expressions. In CVPR, pages 4145–4154, 2019.
[55] Sibei Yang, Guanbin Li, and Yizhou Yu. Dynamic graph attention for referring expression comprehension. In ICCV, pages 4644–4653, 2019.
[56] Sibei Yang, Guanbin Li, and Yizhou Yu. Propagating over phrase relations for one-stage visual grounding. In ECCV, pages 589–605, 2020.
[57] Sibei Yang, Guanbin Li, and Yizhou Yu. Relationship-embedded representation learning for grounding referring expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(8):2765–2779, 2020.
[58] Sibei Yang, Meng Xia, Guanbin Li, Hong-Yu Zhou, and Yizhou Yu. Bottom-up shift and reasoning for referring image segmentation. In CVPR, pages 11266–11275, 2021.
[59] Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, and Lijuan Wang. An empirical study of GPT-3 for few-shot knowledge-based VQA. In AAAI, pages 3081–3089, 2022.
[60] Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, and Philip H. S. Torr. LAVT: Language-aware vision transformer for referring image segmentation. In CVPR, pages 18155–18165, 2022.
[61] Dongran Yu, Bo Yang, Qianhao Wei, Anchen Li, and Shirui Pan. A probabilistic graphical model based on neural-symbolic reasoning for visual relationship detection. In CVPR, pages 10609–10618, 2022.
[62] Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L. Berg. MAttNet: Modular attention network for referring expression comprehension. In CVPR, pages 1307–1315, 2018.
[63] Alireza Zareian, Svebor Karaman, and Shih-Fu Chang. Bridging knowledge graphs to generate scene graphs. In ECCV, pages 606–623, 2020.
[64] Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023.
[65] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.