CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection

Jiajin Tang* Ge Zheng* Jingyi Yu Sibei Yang†


School of Information Science and Technology, ShanghaiTech University
{tangjj,zhengge,yujingyi,yangsb}@shanghaitech.edu.cn
https://toneyaya.github.io/cotdet
arXiv:2309.01093v1 [cs.CV] 3 Sep 2023

*Both authors contributed equally. †Corresponding author.

Abstract

Task driven object detection aims to detect object instances suitable for affording a task in an image. Its challenge lies in the object categories available for the task being too diverse to be limited to a closed set of object vocabulary for traditional object detection. Simply mapping categories and visual features of common objects to the task cannot address the challenge. In this paper, we propose to explore fundamental affordances rather than object categories, i.e., common attributes that enable different objects to accomplish the same task. Moreover, we propose a novel multi-level chain-of-thought prompting (MLCoT) to extract the affordance knowledge from large language models, which contains multi-level reasoning steps from the task to object examples to essential visual attributes with rationales. Furthermore, to fully exploit the knowledge to benefit object recognition and localization, we propose a knowledge-conditional detection framework, namely CoTDet. It conditions the detector on the knowledge to generate object queries and regress boxes. Experimental results demonstrate that our CoTDet outperforms state-of-the-art methods consistently and significantly (+15.6 APbox@0.5 and +14.8 APmask@0.5) and can generate rationales for why objects are detected to afford the task.

Figure 1. An example of (a) traditional object detection, (b)-(e) task driven object detection, and (f) our multi-level chain-of-thought (MLCoT) prompting of large language models (LLMs) to generate visual affordance knowledge.

1. Introduction

The traditional object detection task [3, 12, 37, 65], as shown in Figure 1a, aims to detect object instances of given categories in an image, so objects of categories such as the cup, bottle, cake, and knife are detected. In contrast, detection tasks in real-world applications [1, 4], such as intelligent service robots, usually appear in the form of tasks rather than object categories [38]. For example, when an intelligent robot is asked to complete the task of "opening parcels", the robot needs to autonomously locate the tool shown in Figure 1b, i.e., a knife. So the core of this type of task is to detect the object instances in the image that are most suitable for the task [19, 42]. However, the challenges for task driven object detection are multi-fold. Previous methods that directly learn the mapping between objects and tasks from objects' visual context features or categories cannot achieve satisfactory results. As shown in Figure 1c, the context-based approach GGNN [42] wrongly chooses the vegetable stem for the task of "opening parcels" because it learns that visually slender objects can open parcels. Similarly, the category-based approach TOIST [19] considers that no object in the image can perform "opening parcels" since it cannot even recognize an instance of the knife in the image. In contrast, people will intelligently and naturally choose to use the knife to open parcels in the scene of Figure 1b.

Recently, Large Language Models (LLMs) like GPT-3 [2] and ChatGPT [31] have demonstrated impressive capabilities in encoding general world knowledge from a vast amount of text data [32, 40, 52]. A naive approach is to prompt LLMs directly to return "what objects should we use to open parcels" and leverage the returned object categories to simplify task driven object detection to the traditional one.
However, we observe that LLMs usually only return a few categories of commonly used objects, such as the knife, pen, paper cutter, and scissors shown in Figure 1f. According to these categories, although the knife in Figure 1b can be identified as the target object, the detection system will miss other objects that can also be used to open parcels, such as the fork in Figure 1d and the temperature probe next to the microwave oven in Figure 1e. In turn, we ask why people can easily identify the fork and temperature probe in Figures 1d and 1e as the target objects. We argue that the reason is that people are not restricted to using specific categories of objects to accomplish a task, but instead select objects based on the commonsense knowledge that objects with "a handle and sharp blade" or "a handle and sharp tip" can "open parcels".

In this paper, we propose to explicitly acquire visual affordance knowledge of the task (i.e., common visual attributes that enable different objects to afford the task) and utilize the knowledge to bridge the task and objects. Figure 1f shows two sets of visual affordance knowledge (marked inside the yellow box) for opening parcels. However, it is not trivial to acquire such task-specific visual affordance knowledge.

Furthermore, we propose a novel multi-level chain-of-thought prompting (MLCoT) to elicit visual affordance reasoning from LLMs. At the first level (object level), we prompt LLMs to return common objects via the above-mentioned approach. Unlike before, where the returned object categories are treated as target categories, we instead treat this querying process as brainstorming to obtain representative object examples. At the second level (affordance level), we generate rationales from LLMs for why the object examples can afford the task and use these rationales to facilitate LLMs to reason about and summarize the visual affordances beyond object examples. As shown in Figure 1f, the rationale and visual affordances that enable the knife to open parcels are "easily cut through paper and plastic..." and "a sharp blade and handle", respectively. Our MLCoT can capture the essence of visual affordances behind these object examples without being limited to object categories. Thus we can successfully detect the fork and temperature probe in Figures 1d and 1e as they meet the visual affordances required by the task.

Moreover, we claim that visual affordance knowledge not only helps recognize and identify objects suitable for the task but also helps localize objects more precisely, because visual attributes such as color and shape are useful in object localization. Therefore, unlike some methods [9, 30, 44, 46, 53] that take knowledge as complementary to the image's visual features, we condition the detector on the visual affordance knowledge to perform knowledge-conditional object detection. Specifically, we follow [19] to use an end-to-end query-based detection framework [3, 65]. But instead of randomly initializing queries, we generate knowledge-aware queries based on image features and visual affordance knowledge. In addition to generating queries, we use visual affordance knowledge to guide the bounding box regression explicitly, inspired by denoising training [18]. Unlike [18], which introduces denoising to accelerate training, our knowledge-conditional denoising training aims to teach the decoder how to utilize visual knowledge to regress the boxes for queries.

Finally, we propose the CoTDet network, which acquires visual affordance knowledge from LLMs via the proposed MLCoT and performs knowledge-conditional object detection to effectively utilize the knowledge. Moreover, our CoTDet can easily be extended to task driven instance segmentation by employing a segmentation head [5, 15].

In summary, our main contributions are:
• We are the first to propose to explicitly acquire visual affordance knowledge and utilize the knowledge to bridge the task and object instances.
• We propose a novel multi-level CoT prompting (MLCoT) to make abstract affordance knowledge concrete, which leverages LLMs to generate and summarize intermediate reasoning steps from object examples to essential visual attributes with rationales.
• We claim that visual affordance knowledge can benefit both object recognition and localization, and propose a knowledge-conditional detection framework to condition the detector to generate object queries and guide box regression through denoising training.
• Our CoTDet not only consistently outperforms state-of-the-art methods (+15.6 APbox@0.5 and +14.8 APmask@0.5) by a large margin but also can generate rationales for why objects are detected to afford tasks.

2. Related Work

Task Driven Object Detection/Instance Segmentation aims to detect or segment out the most suitable objects in an image to afford a given task, such as opening a parcel or getting a lemon out of tea. Different from traditional object detection or instance segmentation [3, 5, 10, 12, 37, 65], it requires modeling the preference for selecting objects based on a comprehensive understanding of the specific task and the image scene. Although referring expression grounding [6, 15, 41, 45, 54, 56, 57] and segmentation [14, 20, 47, 51, 55, 58, 60, 62] similarly locate the referent according to a natural language description, they rely on the object attributes and relations in the scene for localization, without considering the prior knowledge needed to afford the given task. These factors result in the challenge of task driven object detection, which requires complex and joint knowledge reasoning on the requirements of one specific task and on objects' functional attributes, beyond visual recognition and scene understanding.
The pioneering work [42] adopts a two-stage framework that first detects objects and then compares among objects to select suitable ones via graph neural networks [43]. The following work TOIST [19] distills target object names to pronouns such as "something" by conditioning on the language description of the task. However, these works fail to model the explicit requirements of tasks and objects' affordances for tasks, which limits their performance and generalization capabilities.

Knowledge Acquisition in Vision-Language Tasks. Integrating external knowledge into computer vision tasks [16, 35, 44, 46] and vision-language tasks [29, 49, 50, 63] has been found beneficial. Previous methods [16, 28, 30, 35, 53, 61] are interested in acquiring structured knowledge (e.g., ConceptNet [22]), which usually includes commonsense concepts and relations and is presented in fixed data structures such as graphs or triples. Recently, large language models [2, 7, 36] have been demonstrated to learn open-world commonsense knowledge from large-scale corpora [32, 40]. Some works [9, 59] utilize language models to encode the representations of inputs or to directly generate answers conditioned on visual inputs, leveraging the latent knowledge in language models. Unlike previous works, we prompt the language model GPT-3 [8] to obtain external knowledge explicitly with the chain of thought (CoT) [25, 52, 64] for better interpretability. To the best of our knowledge, we are the first to explore CoT to acquire visual commonsense knowledge in text form, leveraging the reasoning ability of the language model to filter effective visual functional attributes for affording tasks.

External Knowledge Utilization. Incorporating knowledge reasoning has attracted growing interest in the computer vision [11, 26, 34, 35, 44, 63] and vision-language [49] fields, such as object detection [44, 46], visual relationship detection [16, 61] and visual question answering [9, 28, 30, 53]. For tasks that focus on capturing relations among objects, such as scene graph generation and visual relationship detection, extracting knowledge of the interactions between general concepts becomes natural [11, 16, 61, 63]. Other tasks like object detection and image classification rely on category-related knowledge that is retrieved from a knowledge base as definitions or attributes of general concepts [9, 26, 34, 44, 46, 53]. For knowledge utilization, they mainly take external knowledge directly as an expansion of the visual content or explicitly constrain the consistency between knowledge and visual content. Different from existing works, we leverage attribute-level commonsense knowledge about the requirements for completing tasks and take the external knowledge of tasks as the condition for the detector in task driven object detection.

Figure 2. Overall framework of the proposed CoTDet with multi-level chain-of-thought prompting (MLCoT). We first generate visual affordance knowledge from LLMs with the proposed novel MLCoT. Next, we perform knowledge-conditional object detection by utilizing the knowledge to generate object queries for the scene as well as guide object localization through denoising training.

3. Method

The framework of our proposed CoTDet is shown in Figure 2. First, we introduce the problem definition and the image and text encoders in Section 3.1.
Second, we acquire visual affordance knowledge from LLMs by leveraging the multi-level chain-of-thought prompting and aggregation (Section 3.2). Next, we present the knowledge-conditional decoder that conditions on the acquired knowledge to detect and segment suitable objects (Section 3.3). Finally, we introduce the loss functions in Section 3.4.

3.1. Problem Definition and Encoder

Problem Definition. Given a task in text form S (e.g., "open parcel") and an image I, task driven object detection requires detecting the set of objects most preferred to afford the task. Note that the target objects indicated by the task are not fixed in quantity and category, and may vary with changes in the image scene. In contrast, traditional object detection [3, 10, 65] detects objects of fixed categories, while referring image grounding [6, 17, 27, 33] localizes unambiguous objects.

Encoders. For the task S, we leverage the linguistic information from two perspectives. First, the text S is preserved as input for extracting knowledge from LLMs. Besides, we employ RoBERTa [23] as the text encoder to obtain the text feature t_s for the task S, which will be used to query the task-relevant visual content in the image. For the image I, we adopt ResNet-101 [13] as the visual backbone to extract multi-scale visual feature maps and flatten the maps along the spatial dimension into features V.
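For illustration only, the following PyTorch sketch shows one way such dual encoding could be set up; the chosen ResNet stages, the 256-channel projection, and the mean-pooled task feature are assumptions for this sketch rather than the paper's exact configuration.

```python
# Minimal sketch of the image/text encoders of Section 3.1 (illustrative only).
import torch
import torch.nn as nn
from torchvision.models import resnet101
from torchvision.models.feature_extraction import create_feature_extractor
from transformers import RobertaTokenizer, RobertaModel

# Multi-scale visual features from ResNet-101 (stages C3-C5), flattened into V.
backbone = create_feature_extractor(
    resnet101(weights=None),  # pretrained weights would be loaded in practice
    return_nodes={"layer2": "c3", "layer3": "c4", "layer4": "c5"},
)
projs = nn.ModuleList(nn.Conv2d(c, 256, kernel_size=1) for c in (512, 1024, 2048))

image = torch.randn(1, 3, 800, 800)          # a dummy input image I
feats = backbone(image)                       # dict of multi-scale feature maps
V = torch.cat(
    [proj(f).flatten(2).transpose(1, 2) for proj, f in zip(projs, feats.values())],
    dim=1,
)                                             # (1, sum_l H_l*W_l, 256)

# Task text feature t_s from RoBERTa (mean-pooled over tokens here).
tok = RobertaTokenizer.from_pretrained("roberta-base")
roberta = RobertaModel.from_pretrained("roberta-base")
enc = tok("open parcel", return_tensors="pt")
t_s = roberta(**enc).last_hidden_state.mean(dim=1)   # (1, 768)
```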
3.2. Visual Affordance Knowledge from LLMs

To detect objects affording a particular task, we naturally consider the task's requirements first and subsequently localize suitable objects that meet the requirements. Nevertheless, the task requirements are abstract and cannot directly correspond to the visual content in the image for localizing the objects. Motivated by this, we propose to explicitly extract common visual attributes (i.e., visual affordances) that enable different objects to afford the task, and to use visual affordances to bridge task requirements and object instances in the image. Furthermore, we generate visual affordances from LLMs because LLMs generally contain world knowledge learned from a vast amount of text data.

Specifically, we first design a novel multi-level chain-of-thought prompting to leverage LLMs to generate visual affordances (see Section 3.2.1) and then encode and aggregate them automatically to be utilized for detection (see Section 3.2.2).

3.2.1 Multi-Level Chain-of-Thought Prompting

Our multi-level chain-of-thought prompting (MLCoT) leverages LLMs to generate and summarize intermediate reasoning steps from object examples to essential visual attributes with rationales. MLCoT first brainstorms object examples that afford the task, then considers rationales for why the examples can afford the task, and summarizes the corresponding visual attributes for the rationales.

Object-level Prompting as Brainstorming. At the first level, we prompt LLMs to generate daily object examples that afford the input task S. Specifically, we design the following text prompt:

Prompt: What common objects in daily life can be used as a vehicle for the human to [task]? Please list the twenty most suitable objects.
Output: knife, pen, paper cutter, scissors, screwdriver, ...

Here [task] is filled with the task text S. We denote the number of objects returned from LLMs as N_o. For simplicity, we present only the most critical parts of the prompts; for the complete prompts, please refer to the Appendix. One straightforward idea is to perform traditional category-specific object detection with respect to the categories of the object examples. However, this is not feasible due to the following observations: (1) Object examples are overly limited to partial object categories, resulting in a gap between object categories and the actual task requirements. For instance, the fork in Figure 2 is not among the objects returned by LLMs. (2) A few noisy, unsuitable objects may be output. Although the noisy objects are few, relying entirely on the object examples is risky. For example, for the task of extinguishing fire, LLMs return the fire axe, a common firefighting tool, but it cannot be directly used to extinguish the fire.

Affordance-level Prompting with Rationales. To address the above challenges and capture the essential visual affordances implied in the representative object examples, we propose to generate rationales for why these objects can afford the task and to summarize visual affordances from the rationales. With the object examples, we prompt LLMs to generate rationales as follows:

Prompt: For each object, let's think about the rationales for why they afford the task from the perspective of visual features.
Output: Knife: They have a sharp blade which can easily cut through paper and plastic; They have a handle which provides a good grip for the user. Pen: ...

By the above prompting, we get a set of rationales for each object. Next, we further prompt LLMs to summarize visual-related rationales to form visual affordances as follows:

Prompt: Summary corresponding visual features of the object for each rationale.
Output: {A sharp blade and a handle.}, {...}, ...
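To make the three querying steps concrete, a minimal sketch of driving MLCoT programmatically is given below; the `ask_llm` helper is a hypothetical stand-in for a GPT-3/ChatGPT completion call, and the prompt strings are condensed from the ones shown above.

```python
# Sketch of the MLCoT pipeline (Section 3.2.1), assuming an `ask_llm(prompt)`
# helper that wraps an LLM completion call and returns plain text.
from typing import Callable, List


def mlcot(task: str, ask_llm: Callable[[str], str]) -> List[str]:
    """Return one textual visual-affordance unit per object example."""
    # Level 1 (object level): brainstorm representative object examples.
    objects_text = ask_llm(
        f"What common objects in daily life can be used as a vehicle for the "
        f"human to {task}? Please list the twenty most suitable objects."
    )
    objects = [o.strip() for o in objects_text.split(",") if o.strip()]

    # Level 2 (affordance level): rationales first, then per-rationale
    # visual-feature summaries, following the chain-of-thought order.
    rationales = ask_llm(
        "For each object, let's think about the rationales for why they "
        f"afford the task '{task}' from the perspective of visual features.\n"
        f"Objects: {', '.join(objects)}"
    )
    affordance_units = ask_llm(
        "Summary corresponding visual features of the object for each "
        f"rationale.\nRationales: {rationales}"
    )
    # One affordance unit per line/object, e.g. "a sharp blade and a handle".
    return [u.strip() for u in affordance_units.splitlines() if u.strip()]
```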
Finally, we obtain N_o sets of visual affordances, where each set contains several visual attributes relevant to why objects can afford the task S. We define each set of visual affordance knowledge as a knowledge unit.
Note that although each knowledge unit is derived from the rationales of an object example, the affordance knowledge in that unit is not limited to that object example or its related categories. For example, the first visual affordance unit comprises "a sharp blade and a handle", which corresponds to the returned object "knife". Notably, this visual affordance unit also applies to "box cutter" and "paper cutter".

3.2.2 Knowledge Encoding and Aggregating

We further extract a refined knowledge base by filtering out the few knowledge units corresponding to the unsuitable object examples mentioned in Section 3.2.1. For each knowledge unit, we concatenate the textual descriptions of its visual affordances into a textual sequence and then utilize RoBERTa [23] to obtain the sentence feature. To filter out unsuitable units, we compute the cosine similarity between each pair of knowledge units and exclude outlier units whose maximum similarity to the other units falls below a predetermined threshold. Additionally, for each selected unit, we extract its word representations via RoBERTa. In summary, we aggregate N visual affordance knowledge units, denoted as K = {(p_j^k, p_j^v)}_{j=1}^N, where p_j^k and p_j^v are the sentence feature and word features of the j-th unit, respectively.
their maximum similarity to other units falls below a pre- \label {eq:read} \begin {aligned} &s_{i,j} = \mathrm {cos}(\mathrm {fc}(f_i) + \mathrm {fc}(t_s), p_j^k), \\ &r_i = \mathrm {max}_j(s_{i,j}), d_i = \mathrm {argmax}_j(s_{i,j}), \end {aligned}
determined threshold. Additionally, for each selected unit, (2)
we extract its word representations via RoBERTa. In sum-
mary, we aggregate N visual affordance knowledge units, where cos(·, ·) computes the cosine similarity, fc(·) repre-
denoted as K = {pkj , pvj }N k v
j=1 , where pj and pj are the sen- sents the fully connected layer, and si,j is the similarity be-
tence feature and word features of j-th unit, respectively. tween i-th visual feature and j-th knowledge unit. Then,
ri and di mean the i-th visual feature’s relevance score and
3.3. Knowledge-conditioned Decoder index of the corresponding knowledge unit, respectively.
We base on the detection architecture of Deformable- Next, we select the visual features with the top-k largest
DETR [65], a DETR-like detector [3, 5, 18], which uses relevance scores {ri } to incorporate their corresponding
object queries to capture object-level information for detec- knowledge {(pkdi , pvdi )} to generate queries Qkn as follows,
tion (Section 3.3.1). Unlike randomly initializing the object
queries, we leverage visual affordance knowledge to gener- \label {eq:query} Q^{\text {kn}} = \mathrm {topk}_{r_i}\{f_i + \mathrm {AttentionPool}(f_i, p^v_{d_i})\}, (3)
ate the object queries (Section 3.3.2) and guide the bound-
where topkri means to select the corresponding features
ing box regression with denoising training (Section 3.3.3).
with the top-k largest relevance scores ri . The atten-
tion pooling layer [48] AttentionPool(fi , pvdi ) returns the
3.3.1 Introduction to Deformable-DETR weighted features on pvdi based on their similarities to the
Deformable-DETR contains a Transformer encoder and a fi . Note that, for each knowledge unit (pkj , pvj ), we use its
Transformer decoder. The encoder inputs visual features V global sentence feature pkj to compute its overall similarity
and outputs the refined visual features F = {f1 , f2 , ..., fi } to each visual feature in Eq. 2 while adopting word-level
via multi-scale deformable attention. The decoder ran- features pvj to better enhance the query’s fine-grained rep-
domly initializes queries Q = {q1 , q2 , ..., qk } and predicts resentations in Eq. 3. Similar to Deformable-DETR, we
a reference point pk for each object query qk , and these ref- further predict the reference points Lkn from the queries
erence points L = {l1 , l2 , ..., lk } serve as the initial guess Qkn . In addition, to facilitate the learning of Top-k selec-
of the box centres. Next, the decoder searches for objects O tion, the selected queries Qkn are directly fed into the pre-
for these queries Q with reference points L via multi-scale diction heads and supervised during training using the same
deformable cross-attention and self-attention, which is for- training loss in Section 3.4.
mulated as follows,
3.3.3 Knowledge-conditional Decoding
O = \mathrm {Deformable}([Q, L], F), (1)
With queries Qkn , reference points Lkn , and the refined vi-
where Deformable(·, ·) denotes the Transformer decoder of sual features F , we apply the Deformable decoder to search
Deformable-DETR. objects Okn as follows,

3.3.2 Knowledge-conditional Query Generation \label {eq:normal} O^{\text {kn}}=\mathrm {Deformable}([Q^{\text {kn}},L^{\text {kn}}], F). (4)

Instead of randomly initializing the queries, we generate the In addition to utilizing visual affordance knowledge for
queries and their reference points based on the visual con- query generation and providing the decoder with prior
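As a rough illustration of the query generation in Eqs. (2)-(3), whose output Q^kn is what Eq. (4) feeds to the Deformable decoder, a PyTorch sketch is given below; the feature dimensions, projection layers, and k = 100 are assumptions for the sketch.

```python
# Sketch of knowledge-conditional query generation, Eqs. (2)-(3). Shapes and the
# number of selected queries (k=100) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class KnowledgeQueryGenerator(nn.Module):
    def __init__(self, vis_dim=256, txt_dim=768, k=100):
        super().__init__()
        self.fc_v = nn.Linear(vis_dim, txt_dim)   # fc(f_i)
        self.fc_t = nn.Linear(txt_dim, txt_dim)   # fc(t_s)
        self.proj = nn.Linear(txt_dim, vis_dim)   # map pooled knowledge back to f_i's dim
        self.k = k

    def forward(self, F_vis, t_s, p_k, p_v):
        # F_vis: (M, vis_dim) encoder features; t_s: (txt_dim,) task feature;
        # p_k: (N, txt_dim) unit sentence features; p_v: (N, T, txt_dim) word features.
        fused = self.fc_v(F_vis) + self.fc_t(t_s)                     # fc(f_i) + fc(t_s)
        s = F.normalize(fused, dim=-1) @ F.normalize(p_k, dim=-1).T   # (M, N) = s_ij
        r, d = s.max(dim=1)                                           # r_i, d_i in Eq. (2)

        top_r, top_idx = r.topk(min(self.k, r.numel()))               # top-k relevance
        f_sel, words = F_vis[top_idx], p_v[d[top_idx]]                # (k, C), (k, T, D)

        # AttentionPool(f_i, p^v_{d_i}): weight word features by similarity to f_i.
        attn = torch.softmax(
            torch.einsum("kd,ktd->kt", self.fc_v(f_sel), words), dim=-1
        )
        pooled = self.proj(torch.einsum("kt,ktd->kd", attn, words))   # (k, vis_dim)
        return f_sel + pooled                                         # queries Q^kn, Eq. (3)
```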
In addition to utilizing visual affordance knowledge for query generation and providing the decoder with prior knowledge, we further improve the knowledge utilization by designing a knowledge-based denoising training [18]. As the visual affordance knowledge indicates the target objects' visual attributes, such as shape and size, the knowledge-based denoising guides the decoder in learning how to use this kind of visual knowledge to regress the targets' boxes.

Specifically, during the training stage, we first randomly add noise to the ground-truth boxes O^gt = {o_m^gt}_{m=1}^M to construct noised objects following DN-DETR [18], and then extract the noised boxes' visual features and centers as the noised queries F^noise = {f_m^noise}_{m=1}^M. Notice that the previous denoising training method [18] adds noise to both boxes and category labels to better capture label-box relations. In contrast, we only add noise to boxes because we aim to utilize the noise-free knowledge to help denoise the boxes. Therefore, we extract the knowledge unit (p_{d_m}^k, p_{d_m}^v) for each ground-truth box o_m^gt through Eq. 2. Finally, the knowledge units {(p_{d_m}^k, p_{d_m}^v)}_{m=1}^M guide the decoder to regress the ground-truth boxes O^gt from the noised queries F^noise, which is formulated as follows,

P^{\mathrm{kn}} = \{ \mathrm{AttentionPool}(f_m^{\mathrm{noise}}, p_{d_m}^v) \}_{m=1}^{M},
O^{\mathrm{denoise}} = \mathrm{Deformable}([F^{\mathrm{noise}} + P^{\mathrm{kn}}, L^{\mathrm{noise}}], F),  (5)

where P^kn is the visual affordance knowledge of the noised queries, and the Deformable(·, ·) in Eq. 4 and Eq. 5 share the same parameters. The denoising is only applied during the training stage.
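A rough sketch of how one training iteration could prepare the knowledge-conditional denoising queries in Eq. (5) is shown below; the box-noise scale, the nearest-neighbor feature lookup at the noised centers, and the passed-in attention pooling are simplifying assumptions rather than the exact implementation.

```python
# Illustrative sketch of knowledge-conditional denoising training (Eq. 5).
import torch


def build_denoising_queries(gt_boxes, feat_map, knowledge_words, unit_index,
                            attention_pool, noise_scale=0.1):
    """gt_boxes: (M, 4) cx, cy, w, h in [0, 1]; feat_map: (C, H, W) refined features F;
    knowledge_words: (N, T, D) word features p^v; unit_index: (M,) d_m from Eq. (2)."""
    # Jitter ground-truth boxes to obtain noised boxes and their centers L^noise.
    noised = (gt_boxes + noise_scale * torch.randn_like(gt_boxes)).clamp(0, 1)
    centers = noised[:, :2]                                        # L^noise

    # Sample visual features at the noised centers as F^noise (nearest neighbor).
    C, H, W = feat_map.shape
    ix = (centers[:, 0] * (W - 1)).long()
    iy = (centers[:, 1] * (H - 1)).long()
    f_noise = feat_map[:, iy, ix].T                                # (M, C)

    # P^kn: pool each matched knowledge unit's word features against f_noise.
    p_kn = attention_pool(f_noise, knowledge_words[unit_index])    # (M, C)

    # The decoder then regresses the clean boxes from F^noise + P^kn (Eq. 5), e.g.
    # O_denoise = deformable_decoder(f_noise + p_kn, centers, memory)
    return f_noise + p_kn, centers
```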
3.4. Loss Functions

Following DETR [3], we use bipartite matching to find the unique predictions for the ground-truth objects and adopt the same bounding box regression loss L_box, consisting of the L1 loss and the GIoU [39] loss. Moreover, we use the binary cross entropy loss as the classification loss L_cl. The overall loss is represented as:

\mathcal{L}_{cost} = \lambda_{cl}\mathcal{L}_{cl} + \lambda_{box}\mathcal{L}_{box},  (6)

where λ_cl and λ_box are the hyperparameters of the weighted loss. Our method can be easily extended to instance segmentation by adding a segmentation head [5] and replacing the box regression loss with the Dice loss L_mask.
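A compact sketch of Eq. (6), using λ_cl = 4 and λ_box = 5 from Section 4.1, is given below; the equal weighting of the L1 and GIoU terms inside L_box and the assumption of already-matched prediction/target pairs are illustrative choices.

```python
# Sketch of the overall loss in Eq. (6); assumes matched prediction/target pairs.
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou


def detection_loss(pred_logits, pred_boxes, tgt_labels, tgt_boxes,
                   lambda_cl=4.0, lambda_box=5.0):
    """pred_logits: (Q, 1); pred_boxes/tgt_boxes: (Q, 4) as (x1, y1, x2, y2);
    tgt_labels: (Q,) binary targets for the matched queries."""
    l_cl = F.binary_cross_entropy_with_logits(pred_logits.squeeze(-1), tgt_labels)

    l1 = F.l1_loss(pred_boxes, tgt_boxes)
    giou = torch.diag(generalized_box_iou(pred_boxes, tgt_boxes))
    l_box = l1 + (1.0 - giou).mean()

    return lambda_cl * l_cl + lambda_box * l_box
```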
4. Experiment

4.1. Dataset and Implementation Details

Dataset. We conduct experiments on the COCO-Tasks dataset [42], which comprises 14 different tasks (see Table 1). This dataset is derived from the COCO dataset [21], but with customized annotations for task driven object detection. Each task contains 3600 training and 900 testing images. Besides, we follow [19] to incorporate mask annotations into the original COCO-Tasks dataset for the instance segmentation benchmark.

Implementation Details. Following previous works [19, 42], we use ResNet-101 [13] as the image encoder and RoBERTa [23] as the text encoder. The model is pre-trained on the COCO dataset, with images already part of COCO-Tasks removed. We train the model for 4000 iterations with an initial learning rate of 1e-4 and use AdamW [24] as the optimizer. The hyperparameters λ_cl and λ_box are 4 and 5, respectively. Following [19], we evaluate the segmentation and detection performance of each task using APmask@0.5 and APbox@0.5, respectively, and we denote their means across all tasks as mAPmask and mAPbox. Unless otherwise specified, we leverage GPT-3 [2] to extract visual affordance knowledge due to its capability to generate rationales [52].

Table 1. Comparison with state-of-the-art models for task driven object detection on the COCO-Tasks dataset (APbox@0.5). † means the model is with noun-pronoun distillation. Tasks: task1: step on; task2: sit comfortably; task3: place flowers; task4: get potatoes out of fire; task5: water plant; task6: get lemon out of tea; task7: dig hole; task8: open bottle of beer; task9: open parcel; task10: serve wine; task11: pour sugar; task12: smear butter; task13: extinguish fire; task14: pound carpet.

Method      | task1 task2 task3 task4 task5 task6 task7 task8 task9 task10 task11 task12 task13 task14 | Mean
GGNN [42]   | 36.6  29.8  40.5  37.6  41.0  17.2  43.6  17.9  21.0  40.6   22.3   28.4   39.1   40.7   | 32.6
TOIST [19]  | 44.0  39.5  46.7  43.1  53.6  23.5  52.8  21.3  23.0  46.3   33.1   41.7   48.1   52.9   | 41.3
TOIST† [19] | 45.8  40.0  49.4  49.6  53.4  26.9  58.3  22.6  32.5  50.0   35.5   43.7   52.8   56.2   | 44.1
Ours        | 58.9  55.0  51.2  68.5  60.5  47.7  76.9  40.7  47.4  66.5   41.9   48.3   61.7   71.4   | 56.9

4.2. Comparison with State-of-the-Art Methods

Table 1 and Table 2 show the comparison of our CoTDet with state-of-the-art models (SOTAs) on the detection and segmentation benchmarks. Our model consistently outperforms the SOTAs [19, 42] on all benchmarks and tasks.

Comparison with SOTAs. Compared to TOIST [19], our CoTDet achieves a significant performance improvement (15.6% mAPbox and 14.8% mAPmask), which demonstrates the effectiveness of our task-relevant knowledge acquisition and utilization. Compared to the two-stage method GGNN [42], we achieve 24.3% mAPbox and 21.2% mAPmask performance gains, which demonstrates the importance of leveraging the visual affordance knowledge rather than purely visual context information.

Comparison on Sub-tasks. The following comparisons on sub-tasks further demonstrate that the affordance-level knowledge is capable of bridging tasks and objects. Our CoTDet significantly improves the detection and segmentation performance on task4 (get potatoes out of fire), task6 (get lemon out of tea) and task7 (dig hole), achieving approximately 20% mAP improvement on both benchmarks. These tasks face the common challenge of a wide variety of target categories and visual appearances, which is hardly dealt with by methods like [19, 42] that merely learn the mapping between tasks and objects' categories and visual features. In contrast, our method explicitly acquires the visual affordance knowledge of tasks to detect rare objects and avoid overfitting to common objects, outperforming significantly on these tasks. In addition, for those less challenging tasks with a few ground-truth object categories, we still achieve approximately 8% mAP improvement, demonstrating the effectiveness of conditioning on visual affordances for object localization.

Table 2. Comparison with state-of-the-art models for task driven instance segmentation on the COCO-Tasks dataset (APmask@0.5). † means the model is with noun-pronoun distillation.

Method      | task1 task2 task3 task4 task5 task6 task7 task8 task9 task10 task11 task12 task13 task14 | Mean
GGNN [42]   | 31.8  28.6  45.4  33.7  46.8  16.6  37.8  15.1  15.0  49.9   24.9   18.9   49.8   39.7   | 32.4
TOIST [19]  | 37.0  34.4  44.7  34.2  51.3  18.6  40.5  17.1  23.4  43.8   29.3   39.9   46.6   42.4   | 35.2
TOIST† [19] | 40.8  36.5  48.9  37.8  43.4  22.1  44.4  20.3  26.9  48.1   31.8   34.8   51.5   46.3   | 38.8
Ours        | 55.0  51.6  51.2  57.7  60.1  43.1  65.9  40.4  45.4  64.8   40.4   48.7   61.7   64.4   | 53.6

4.3. Ablation Study

We evaluate seven variants of our CoTDet and two SOTAs with the same backbones as ours to validate the effectiveness of the proposed knowledge acquisition and utilization. The results are shown in Table 3. In addition to mAP, we report APbox@0.5 on the relatively easy task2 (sit comfortably) and the challenging task6 (get lemon out of tea) and task9 (open parcel) for reference; the full results and analysis on sub-tasks are provided in the Appendix.

Table 3. Ablation study on the knowledge acquisition, detection framework, and knowledge utilization of our CoTDet (task columns report APbox@0.5).

Ablation         | task2 task6 task9 | mAPbox mAPmask
"objects"        | 25.4  16.5  21.0  | 31.9   31.3
"visual"         | 50.4  30.3  38.3  | 48.1   44.7
w/o rationales   | 52.0  40.7  41.2  | 52.4   49.0
MLCoT            | 55.0  47.7  47.5  | 56.9   53.6
MLCoT (ChatGPT)  | 50.6  48.1  50.3  | 57.0   54.0
Def+GGNN [42]    | 38.6  24.7  23.4  | 38.8   35.8
Def+TOIST [19]   | 43.4  21.0  29.0  | 40.3   37.6
Init w/ MLCoT    | 42.2  35.9  35.6  | 48.7   46.4
Fuse w/ MLCoT    | 44.0  42.3  41.2  | 50.6   47.7
Select w/ MLCoT  | 50.0  47.2  43.7  | 55.3   51.7
Full Decoder     | 55.0  47.7  47.5  | 56.9   53.6

MLCoT Prompting for Knowledge Acquisition. To evaluate the impact of the core designs in MLCoT, we replace the MLCoT pipeline with the following approaches and utilize the acquired knowledge as the condition to guide detection: (1) We encode the object categories returned by the object level of MLCoT as the knowledge to perform knowledge-conditional object detection. The results (31.9% mAPbox and 31.3% mAPmask) demonstrate that simply extracting object categories from LLMs cannot achieve satisfactory performance. (2) We attempt to acquire affordance-level rather than object-level knowledge. Specifically, we prompt LLMs by asking "what visual features can we use to determine the suitability of an object for {TASK}?" to generate visual affordance knowledge directly. This attempt improves the above object-level model by 16.2% mAPbox and 13.4% mAPmask, showing the necessity of exploring the essential visual affordances behind the object categories. However, this model still underperforms our full model by approximately 9% mAP. It is difficult to summarize a unified description of widely varying objects without priors, resulting in only one set of visual attributes being returned from LLMs. (3) To increase the diversity of visual affordances, we prompt LLMs to generate visual features for each retrieved object, which leads to a significant improvement to 52.4% mAPbox and 49.0% mAPmask. (4) Finally, we further add rationales to filter out the misleading and irrelevant attributes, achieving a further 4.5% and 4.6% increase in mAPbox and mAPmask, respectively. (5) We also evaluate the effect of using different LLMs to extract visual affordance knowledge. Our MLCoT with ChatGPT [31] has a similar mAP to MLCoT with GPT-3.

Knowledge-conditional Object Detection. To validate the effectiveness of our proposed knowledge-conditioned decoder, we conduct ablation studies with two baselines and three variants based on the Deformable-DETR [65] framework: (1) We develop GGNN [42] on the Deformable-DETR detection framework.
Def+GGNN simply learns the relations between objects and identifies objects based on their contexts, limiting its performance. (2) Besides, similar to TOIST [19], we initialize queries with the task's textual feature based on our framework. There is a performance gap (16.6% mAPbox and 14.1% mAPmask) between Def+TOIST and our final model. (3) We introduce the visual affordance knowledge extracted by MLCoT but simply use it to initialize the queries of the decoder (Init w/ MLCoT). The model achieves significant performance gains compared to the two baselines. (4) We further fuse the knowledge with the image's visual feature map to construct a multi-modal feature map (Fuse w/ MLCoT), which jointly understands the two modalities and improves performance (1.9% mAPbox and 1.3% mAPmask) compared to the last model. (5) Our proposed knowledge-conditional query generation, which generates queries based on the visual content of the image, the task, and the visual affordance knowledge, helps the decoder better localize the objects, resulting in average improvements of 4.7% mAPbox and 4.0% mAPmask. (6) Finally, the knowledge-conditional denoising training improves APbox and APmask by 1.6% and 1.9%, respectively.

4.4. Visualization

Figure 3 visualizes qualitative results for several examples. For (a), no objects in the image should be selected to "get lemon out of tea". Our model can successfully return the empty set, while TOIST detects the french fry, one of the salient objects in the image, as the tool. Similarly, as knives are uncommon for "opening bottle of beer", the knife in (b) is challenging for TOIST to identify and locate. Guided by the visual affordance of a "sharp blade with a pointed end", our model correctly localizes and selects the sharp knife. Examples (c) and (d) demonstrate the effectiveness of MLCoT and knowledge-conditional denoising training (KDN) by removing them. With visual affordance knowledge obtained by directly asking LLMs, our model relies solely on matching with a single knowledge unit, which incorrectly detects the trunk in (c) and misses the knife in (d). The former trunk is easily confused with objects that are "flat, broad with a handle", while the latter knife is ignored because its visual attribute of being straight mismatches the single knowledge unit that includes "curved or angled". Furthermore, without KDN, our detector lacks explicit guidance, leading to inaccurate detection in challenging scenes. Specifically, the glove in (c) and the knife in (d) are not detected successfully, and the packing line in (d) is mistakenly detected.

Figure 3. Visualization of prediction results of our CoTDet, its variants, and the existing best-performing TOIST [19]. Examples: (a) "get lemon out of tea"; (b) "open bottle of beer"; (c) "pound carpet"; (d) "get lemon out of tea". Rationale for (b): sharp blade with a pointed end to insert into the bottle cap. Rationale for (d): handle of the spatula is long enough to reach the bottom of the cup; flat edge is designed to scoop up the lemon pieces.

5. Conclusion

In this paper, we focus on challenging task driven object detection, which is practical in the real world yet under-explored. To bridge the gap between abstract task requirements and objects in the image, we propose to explicitly extract visual affordance knowledge for the task and detect objects having visual attributes consistent with the visual knowledge. Furthermore, our CoTDet utilizes visual affordance knowledge to condition the decoder in localizing and recognizing suitable objects.

Limitations: While acknowledging the disparity between the COCO-Tasks dataset and real-world application scenarios, attributed to its limited task variety and its preference for images and annotations, our approach has the potential to extend beyond these confines. Notably, our knowledge acquisition and utilization are flexible and generalizable, granting the capacity to transcend specific datasets, specific tasks, object categories, or tools. We leave this to future works. Furthermore, with the incorporation of LLMs, our approach inherits potential social biases from LLMs, which could potentially be reflected in a preference for selecting frequently used tools.

Acknowledgment: This work was supported by the National Natural Science Foundation of China (No. 62206174), Shanghai Pujiang Program (No. 21PJ1410900), Shanghai Frontiers Science Center of Human-centered Artificial Intelligence (ShangHAI), MoE Key Laboratory of Intelligent Perception and Human-Machine Collaboration (ShanghaiTech University), and Shanghai Engineering Research Center of Intelligent Vision and Imaging.
References

[1] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.
[2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
[3] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In Computer Vision–ECCV 2020, pages 213–229. Springer, 2020.
[4] Dian Chen, Brady Zhou, Vladlen Koltun, and Philipp Krähenbühl. Learning by cheating. In Conference on Robot Learning, pages 66–75. PMLR, 2020.
[5] Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems, 34:17864–17875, 2021.
[6] Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, and Houqiang Li. TransVG: End-to-end visual grounding with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1769–1779, 2021.
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[8] Luciano Floridi and Massimo Chiriatti. GPT-3: Its nature, scope, limits, and consequences. Minds and Machines, 30:681–694, 2020.
[9] Feng Gao, Qing Ping, Govind Thattai, Aishwarya Reganti, Ying Nian Wu, and Prem Natarajan. Transform-retrieve-generate: Natural language-centric outside-knowledge visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5067–5077, 2022.
[10] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[11] Jiuxiang Gu, Handong Zhao, Zhe Lin, Sheng Li, Jianfei Cai, and Mingyang Ling. Scene graph generation with external knowledge and image reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1969–1978, 2019.
[12] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[14] Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. In Computer Vision–ECCV 2016, pages 108–124. Springer, 2016.
[15] Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. MDETR: Modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1780–1790, 2021.
[16] Keizo Kato, Yin Li, and Abhinav Gupta. Compositional learning for human object interaction. In Proceedings of the European Conference on Computer Vision (ECCV), pages 234–251, 2018.
[17] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798, 2014.
[18] Feng Li, Hao Zhang, Shilong Liu, Jian Guo, Lionel M Ni, and Lei Zhang. DN-DETR: Accelerate DETR training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13619–13627, 2022.
[19] Pengfei Li, Beiwen Tian, Yongliang Shi, Xiaoxue Chen, Hao Zhao, Guyue Zhou, and Ya-Qin Zhang. TOIST: Task oriented instance segmentation transformer with noun-pronoun distillation. arXiv preprint arXiv:2210.10775, 2022.
[20] Liang Lin, Pengxiang Yan, Xiaoqian Xu, Sibei Yang, Kun Zeng, and Guanbin Li. Structured attention network for referring image segmentation. IEEE Transactions on Multimedia, 24:1922–1932, 2021.
[21] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014, pages 740–755. Springer, 2014.
[22] Hugo Liu and Push Singh. ConceptNet—a practical commonsense reasoning tool-kit. BT Technology Journal, 22(4):211–226, 2004.
[23] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[24] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[25] Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpurohit, Peter Clark, and Ashwin Kalyan. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. In International Conference on Learning Representations (ICLR), 2023.
[26] Chengzhi Mao, Revant Teotia, Amrutha Sundar, Sachit Menon, Junfeng Yang, Xin Wang, and Carl Vondrick. Doubly right object recognition: A why prompt for visual rationales. arXiv preprint arXiv:2212.06202, 2022.
[27] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20, 2016.
[28] Kenneth Marino, Xinlei Chen, Devi Parikh, Abhinav Gupta, and Marcus Rohrbach. KRISP: Integrating implicit and symbolic knowledge for open-domain knowledge-based VQA. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14111–14121, 2021.
[29] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3195–3204, 2019.
[30] Medhini Narasimhan, Svetlana Lazebnik, and Alexander Schwing. Out of the box: Reasoning with graph convolution nets for factual visual question answering. Advances in Neural Information Processing Systems, 31, 2018.
[31] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.
[32] Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. Language models as knowledge bases? arXiv preprint arXiv:1909.01066, 2019.
[33] Bryan A Plummer, Arun Mallya, Christopher M Cervantes, Julia Hockenmaier, and Svetlana Lazebnik. Phrase localization and visual relationship detection with comprehensive image-language cues. In Proceedings of the IEEE International Conference on Computer Vision, pages 1928–1937, 2017.
[34] Sarah Pratt, Rosanne Liu, and Ali Farhadi. What does a platypus look like? Generating customized prompts for zero-shot image classification. arXiv preprint arXiv:2209.03320, 2022.
[35] Mengshi Qi, Yunhong Wang, Jie Qin, and Annan Li. KE-GAN: Knowledge embedded generative adversarial networks for semi-supervised scene parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5237–5246, 2019.
[36] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
[37] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[38] Allen Z Ren, Bharat Govil, Tsung-Yen Yang, Karthik R Narasimhan, and Anirudha Majumdar. Leveraging language for accelerated learning of tool manipulation. In Conference on Robot Learning, pages 1531–1541. PMLR, 2023.
[39] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 658–666, 2019.
[40] Adam Roberts, Colin Raffel, and Noam Shazeer. How much knowledge can you pack into the parameters of a language model? arXiv preprint arXiv:2002.08910, 2020.
[41] Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, and Bernt Schiele. Grounding of textual phrases in images by reconstruction. In Computer Vision–ECCV 2016, pages 817–834. Springer, 2016.
[42] Johann Sawatzky, Yaser Souri, Christian Grund, and Jurgen Gall. What object should I use? Task driven object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7605–7614, 2019.
[43] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.
[44] Sheng Shen, Chunyuan Li, Xiaowei Hu, Yujia Xie, Jianwei Yang, Pengchuan Zhang, Anna Rohrbach, Zhe Gan, Lijuan Wang, Lu Yuan, et al. K-LITE: Learning transferable visual models with external knowledge. arXiv preprint arXiv:2204.09222, 2022.
[45] Cheng Shi and Sibei Yang. Spatial and visual perspective-taking via view rotation and relation reasoning for embodied reference understanding. In European Conference on Computer Vision, pages 201–218. Springer, 2022.
[46] Krishna Kumar Singh, Santosh Divvala, Ali Farhadi, and Yong Jae Lee. DOCK: Detecting objects by transferring common-sense knowledge. In Proceedings of the European Conference on Computer Vision (ECCV), pages 492–508, 2018.
[47] Jiajin Tang, Ge Zheng, Cheng Shi, and Sibei Yang. Contrastive grouping with transformer for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23570–23580, 2023.
[48] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[49] Duc Minh Vo, Hong Chen, Akihiro Sugimoto, and Hideki Nakayama. NOC-REK: Novel object captioning with retrieved vocabulary from external knowledge. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17979–17987. IEEE, 2022.
[50] Peng Wang, Qi Wu, Chunhua Shen, Anthony Dick, and Anton Van Den Hengel. FVQA: Fact-based visual question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(10):2413–2427, 2017.
[51] Zhaoqing Wang, Yu Lu, Qiang Li, Xunqiang Tao, Yandong Guo, Mingming Gong, and Tongliang Liu. CRIS: CLIP-driven referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11686–11695, 2022.
[52] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022.
[53] Qi Wu, Peng Wang, Chunhua Shen, Anthony Dick, and Anton Van Den Hengel. Ask me anything: Free-form visual question answering based on knowledge from external sources. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4622–4630, 2016.
[54] Sibei Yang, Guanbin Li, and Yizhou Yu. Cross-modal relationship inference for grounding referring expressions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4145–4154, 2019.
[55] Sibei Yang, Guanbin Li, and Yizhou Yu. Dynamic graph attention for referring expression comprehension. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4644–4653, 2019.
[56] Sibei Yang, Guanbin Li, and Yizhou Yu. Propagating over phrase relations for one-stage visual grounding. In Computer Vision–ECCV 2020, pages 589–605. Springer, 2020.
[57] Sibei Yang, Guanbin Li, and Yizhou Yu. Relationship-embedded representation learning for grounding referring expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(8):2765–2779, 2020.
[58] Sibei Yang, Meng Xia, Guanbin Li, Hong-Yu Zhou, and Yizhou Yu. Bottom-up shift and reasoning for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11266–11275, 2021.
[59] Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, and Lijuan Wang. An empirical study of GPT-3 for few-shot knowledge-based VQA. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3081–3089, 2022.
[60] Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, and Philip HS Torr. LAVT: Language-aware vision transformer for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18155–18165, 2022.
[61] Dongran Yu, Bo Yang, Qianhao Wei, Anchen Li, and Shirui Pan. A probabilistic graphical model based on neural-symbolic reasoning for visual relationship detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10609–10618, 2022.
[62] Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L Berg. MAttNet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1307–1315, 2018.
[63] Alireza Zareian, Svebor Karaman, and Shih-Fu Chang. Bridging knowledge graphs to generate scene graphs. In Computer Vision–ECCV 2020, pages 606–623. Springer, 2020.
[64] Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023.
[65] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.