CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection
Figure 2. Overall framework of the proposed CoTDet with multi-level chain-of-thought prompting (MLCoT). We first generate
visual affordance knowledge from LLMs with the proposed novel MLCoT. Next, we perform knowledge-conditional object detection by
utilizing the knowledge to generate object queries for the scene as well as guide object localization through denoising training.
The pioneering work [42] adopts a two-stage framework that first detects objects and then compares among objects to select suitable ones via graph neural networks [43]. The follow-up work TOIST [19] distills target object names to pronouns such as "something" by conditioning on the language description of the task. However, these works fail to model the explicit requirements of tasks and objects' affordances for tasks, which limits their performance and generalization capabilities.

Knowledge Acquisition in Vision-Language Tasks. Integrating external knowledge into computer vision tasks [16, 35, 44, 46] and vision-language tasks [29, 49, 50, 63] has been found beneficial. Previous methods [16, 28, 30, 35, 53, 61] focus on acquiring structured knowledge (e.g., ConceptNet [22]), which usually includes commonsense concepts and relations and is presented in fixed data structures such as graphs or triples. Recently, large language models [2, 7, 36] have been demonstrated to learn open-world commonsense knowledge from large-scale corpora [32, 40]. Some works [9, 59] utilize language models to encode the representations of inputs or to directly generate answers conditioned on visual inputs, leveraging the latent knowledge in language models. Unlike previous works, we prompt the language model GPT-3 [8] to obtain external knowledge explicitly with the chain of thought (CoT) [25, 52, 64] for better interpretability. To the best of our knowledge, we are the first to explore CoT to acquire visual commonsense knowledge in text form, leveraging the reasoning ability of the language model to filter effective visual functional attributes that afford tasks.

External Knowledge Utilization. Incorporating knowledge reasoning has attracted growing interest in the computer vision [11, 26, 34, 35, 44, 63] and vision-language [49] fields, in tasks such as object detection [44, 46], visual relationship detection [16, 61], and visual question answering [9, 28, 30, 53]. For tasks that focus on capturing relations among objects, such as scene graph generation and visual relationship detection, extracting knowledge of the interactions between general concepts is natural [11, 16, 61, 63]. Other tasks like object detection and image classification rely on category-related knowledge that is retrieved from knowledge bases as definitions or attributes of general concepts [9, 26, 34, 44, 46, 53]. For knowledge utilization, these methods mainly take external knowledge directly as an expansion of the visual content or explicitly constrain the consistency between knowledge and visual content. Different from existing works, we leverage attribute-level commonsense knowledge about the requirements for completing tasks and use this external task knowledge to condition the detector for task driven object detection.

3. Method

The framework of our proposed CoTDet is shown in Figure 2. First, we introduce the problem definition and the image and text encoders in Section 3.1. Second, we acquire visual affordance knowledge from LLMs by leveraging the proposed multi-level chain-of-thought prompting and aggregation (Section 3.2). Next, we present the knowledge-conditional decoder that conditions on the acquired knowledge to detect and segment suitable objects (Section 3.3). Finally, we introduce the loss functions in Section 3.4.
3.1. Problem Definition and Encoder

Problem Definition. Given a task in text form S (e.g., "open parcel") and an image I, task driven object detection requires detecting the set of objects most preferred to afford the task. Note that the target objects indicated by the task are not fixed in quantity and category, and may vary with changes in the image scene. In contrast, traditional object detection [3, 10, 65] detects objects of fixed categories, while referring image grounding [6, 17, 27, 33] localizes unambiguous objects.

Encoders. For the task S, we leverage the linguistic information from two perspectives. First, the text S is preserved as input for extracting knowledge from LLMs. Besides, we employ RoBERTa [23] as the text encoder to obtain the text feature t_s for the task S, which will be used to query the task-relevant visual content in the image. For the image I, we adopt ResNet-101 [13] as the visual backbone to extract multi-scale visual feature maps and flatten the maps along the spatial dimension into features V.

3.2. Visual Affordance Knowledge from LLMs

To detect objects that afford a particular task, we naturally consider the task's requirements first and subsequently localize suitable objects that meet these requirements. Nevertheless, the task requirements are abstract and cannot directly correspond to the visual content in the image for localizing the objects. Motivated by this, we propose to explicitly extract the common visual attributes (i.e., visual affordances) that enable different objects to afford the task and use these visual affordances to bridge task requirements and object instances in the image. Furthermore, we generate visual affordances from LLMs because LLMs generally contain world knowledge learned from a vast amount of text data.

Specifically, we first design a novel multi-level chain-of-thought prompting to leverage LLMs to generate visual affordances (see Section 3.2.1) and then encode and aggregate them automatically to be utilized for detection (see Section 3.2.2).

3.2.1 Multi-Level Chain-of-Thought Prompting

Our multi-level chain-of-thought prompting (MLCoT) leverages LLMs to generate and summarize intermediate reasoning steps from object examples to essential visual attributes with rationales. MLCoT first brainstorms object examples that afford the task, then considers rationales for why the examples can afford the task, and finally summarizes the corresponding visual attributes from the rationales.

Object-level Prompting as Brainstorming. At the first level, we prompt LLMs to generate daily object examples that afford the input task S. Specifically, we design the following text prompt:

Prompt: What common objects in daily life can be used as a vehicle for the human to [task]? Please list the twenty most suitable objects.
Output: knife, pen, paper cutter, scissors, screwdriver, ...

where [task] is filled with the task text S. We denote the number of objects returned from LLMs as N_o. For simplicity, we present only the most critical parts of the prompts; for the complete prompts, please refer to the Appendix. One straightforward idea is to perform traditional category-specific object detection with respect to the categories of the object examples. However, it is not feasible due to the following observations: (1) Object examples are overly limited to partial object categories, resulting in a gap between object categories and the actual task requirements. For instance, the fork in Figure 2 is not among the objects returned by LLMs. (2) A few noisy, unsuitable objects may be output. Although the noisy objects are few, relying entirely on the object examples is risky. For example, for the task of extinguishing fire, LLMs return the fire axe, a common firefighting tool, but it cannot be directly used to extinguish the fire.

Affordance-level Prompting with Rationales. To address the above challenges and capture the essential visual affordances implied in representative object examples, we propose to generate rationales for why these objects can afford the task and to summarize visual affordances from the rationales. With the object examples, we prompt LLMs to generate rationales as follows:

Prompt: For each object, let's think about the rationales for why they afford the task from the perspective of visual features.
Output: Knife: They have a sharp blade which can easily cut through paper and plastic; They have a handle which provides a good grip for the user. Pen: ...

By the above prompting, we get a set of rationales for each object. Next, we further prompt LLMs to summarize the visual-related rationales to form visual affordances as follows:

Prompt: Summarize the corresponding visual features of the object for each rationale.
Output: {A sharp blade and a handle.}, {...}, ...

Finally, we obtain N_o sets of visual affordances, where each set contains several visual attributes relevant to why objects can afford the task S. We define each set of visual affordance knowledge as a knowledge unit. Note that although each knowledge unit is derived from the rationales of an object example, the affordance knowledge in that unit is not limited to that object example or its related categories. For example, the first visual affordance unit comprises "a sharp blade and a handle", which corresponds to the returned object "knife"; notably, this visual affordance unit also applies to "box cutter" and "paper cutter".
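To make the three prompting levels concrete, here is a minimal Python sketch that chains them over a generic text-in/text-out LLM backend. The `mlcot` and `ask_llm` names, the exact prompt wording beyond the fragments quoted above, and the line-based parsing of the summary are illustrative assumptions, not the released implementation.

```python
from typing import Callable

def mlcot(task: str, ask_llm: Callable[[str], str]) -> list[str]:
    """Multi-level chain-of-thought prompting (Section 3.2.1), returning one
    textual visual-affordance description (knowledge unit) per object example.
    `ask_llm` is any text-in/text-out LLM backend (e.g., a GPT-3 client)."""
    # Level 1: brainstorm daily object examples that afford the task.
    objects = ask_llm(
        f"What common objects in daily life can be used as a vehicle for the "
        f"human to {task}? Please list the twenty most suitable objects."
    )
    # Level 2: rationales for why each example affords the task,
    # from the perspective of visual features.
    rationales = ask_llm(
        f"Objects: {objects}\nFor each object, let's think about the rationales "
        f"for why they afford the task '{task}' from the perspective of visual features."
    )
    # Level 3: summarize each rationale into a visual affordance description;
    # treating every returned line as one knowledge unit is a parsing assumption.
    summary = ask_llm(
        f"{rationales}\nSummarize the corresponding visual features of the "
        f"object for each rationale, one object per line."
    )
    return [line.strip() for line in summary.splitlines() if line.strip()]

# e.g., mlcot("open parcel", my_llm) might return
# ["A sharp blade and a handle.", "A pointed tip.", ...]
```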
3.2.2 Knowledge Encoding and Aggregating

We further extract a refined knowledge base by filtering out the few knowledge units corresponding to the unsuitable object examples mentioned in Section 3.2.1. For each knowledge unit, we concatenate the textual descriptions of its visual affordances into a textual sequence and then utilize RoBERTa [23] to obtain its sentence feature. To filter out unsuitable units, we compute the cosine similarity between each pair of knowledge units and exclude outlier units whose maximum similarity to the other units falls below a predetermined threshold. Additionally, for each selected unit, we extract its word representations via RoBERTa. In summary, we aggregate N visual affordance knowledge units, denoted as K = \{(p^k_j, p^v_j)\}_{j=1}^{N}, where p^k_j and p^v_j are the sentence feature and word features of the j-th unit, respectively.

3.3. Knowledge-conditioned Decoder

We build on the detection architecture of Deformable-DETR [65], a DETR-like detector [3, 5, 18], which uses object queries to capture object-level information for detection (Section 3.3.1). Unlike randomly initializing the object queries, we leverage visual affordance knowledge to generate the object queries (Section 3.3.2) and to guide the bounding box regression with denoising training (Section 3.3.3).

3.3.1 Introduction to Deformable-DETR

Deformable-DETR contains a Transformer encoder and a Transformer decoder. The encoder takes the visual features V as input and outputs the refined visual features F = {f_1, f_2, ..., f_i} via multi-scale deformable attention. The decoder randomly initializes queries Q = {q_1, q_2, ..., q_k} and predicts a reference point l_k for each object query q_k; these reference points L = {l_1, l_2, ..., l_k} serve as the initial guess of the box centres. Next, the decoder searches for objects O for these queries Q with reference points L via multi-scale deformable cross-attention and self-attention, which is formulated as follows:

O = \mathrm{Deformable}([Q, L], F),    (1)

where Deformable(·, ·) denotes the Transformer decoder of Deformable-DETR.

3.3.2 Knowledge-conditional Query Generation

Instead of randomly initializing the queries, we generate the queries and their reference points based on the visual content of the image, the task, and the visual affordance knowledge. Specifically, we utilize the visual affordance knowledge to select visual features and combine them with the knowledge to generate queries; the spatial information of these visual features then naturally becomes the spatial priors of the reference points.

Given the visual features F = {f_1, f_2, ..., f_i}, we first fuse each feature f_i with the task's text feature t_s and then calculate its relevance to the task's visual affordance knowledge K = \{(p^k_j, p^v_j)\}_{j=1}^{N}. Since each knowledge unit (p^k_j, p^v_j) in the knowledge base K is a set of affordances that meet the task requirements, we use the largest similarity between the fused feature and the knowledge units in K as the feature's relevance score. The calculation is formulated as follows:

s_{i,j} = \mathrm{cos}(\mathrm{fc}(f_i) + \mathrm{fc}(t_s), p^k_j),
r_i = \mathrm{max}_j(s_{i,j}),  d_i = \mathrm{argmax}_j(s_{i,j}),    (2)

where cos(·, ·) computes the cosine similarity, fc(·) represents a fully connected layer, and s_{i,j} is the similarity between the i-th visual feature and the j-th knowledge unit. Then, r_i and d_i denote the i-th visual feature's relevance score and the index of its corresponding knowledge unit, respectively.

Next, we select the visual features with the top-k largest relevance scores {r_i} and incorporate their corresponding knowledge {(p^k_{d_i}, p^v_{d_i})} to generate the queries Q^kn as follows:

Q^{kn} = \mathrm{topk}_{r_i}\{f_i + \mathrm{AttentionPool}(f_i, p^v_{d_i})\},    (3)

where topk_{r_i} means selecting the corresponding features with the top-k largest relevance scores r_i. The attention pooling layer [48] AttentionPool(f_i, p^v_{d_i}) returns the features of p^v_{d_i} weighted by their similarities to f_i. Note that, for each knowledge unit (p^k_j, p^v_j), we use its global sentence feature p^k_j to compute its overall similarity to each visual feature in Eq. 2, while adopting the word-level features p^v_j to better enhance the query's fine-grained representations in Eq. 3. Similar to Deformable-DETR, we further predict the reference points L^kn from the queries Q^kn. In addition, to facilitate the learning of the top-k selection, the selected queries Q^kn are directly fed into the prediction heads and supervised during training using the same training loss as in Section 3.4.

3.3.3 Knowledge-conditional Decoding

With the queries Q^kn, the reference points L^kn, and the refined visual features F, we apply the Deformable decoder to search for objects O^kn as follows:

O^{kn} = \mathrm{Deformable}([Q^{kn}, L^{kn}], F).    (4)
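The following PyTorch-style sketch mirrors the knowledge reading and query generation of Eqs. (2)-(3); the reference-point prediction and the decoding of Eq. (4) are left to the standard Deformable-DETR heads. The function name, the tensor-shape conventions, the use of two separate fully connected layers, and the `attn_pool` module are assumptions for illustration rather than the exact released code.

```python
import torch
import torch.nn.functional as F

def knowledge_conditional_queries(feats, t_s, p_k, p_v, fc_v, fc_t, attn_pool, k):
    """Knowledge reading and query generation of Eqs. (2)-(3).

    feats: (L, d) refined visual features F from the encoder
    t_s:   (d,)   task text feature from RoBERTa
    p_k:   (N, d) sentence features of the N knowledge units
    p_v:   list of N tensors (W_j, d) with the word features of each unit
    fc_v, fc_t:   fully connected layers fusing visual and text features (Eq. 2)
    attn_pool:    attention-pooling module over word features (Eq. 3)
    k:            number of selected queries
    """
    fused = fc_v(feats) + fc_t(t_s)                    # fc(f_i) + fc(t_s)
    sim = F.cosine_similarity(fused.unsqueeze(1),      # s_{i,j}: (L, N)
                              p_k.unsqueeze(0), dim=-1)
    r, d_idx = sim.max(dim=1)                          # r_i and d_i of Eq. (2)

    _, top_i = r.topk(k)                               # top-k most relevant features
    queries = torch.stack([feats[i] + attn_pool(feats[i], p_v[d_idx[i]])
                           for i in top_i.tolist()])   # Q^kn of Eq. (3)
    return queries  # paired with predicted reference points L^kn and decoded via Eq. (4)
```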
Tasks: task1: step on; task2: sit comfortably; task3: place flowers; task4: get potatoes out of fire; task5: water plant; task6: get lemon out of tea; task7: dig hole; task8: open bottle of beer; task9: open parcel; task10: serve wine; task11: pour sugar; task12: smear butter; task13: extinguish fire; task14: pound carpet.

Method      | task1 | task2 | task3 | task4 | task5 | task6 | task7 | task8 | task9 | task10 | task11 | task12 | task13 | task14 | Mean
GGNN [42]   | 36.6  | 29.8  | 40.5  | 37.6  | 41.0  | 17.2  | 43.6  | 17.9  | 21.0  | 40.6   | 22.3   | 28.4   | 39.1   | 40.7   | 32.6
TOIST [19]  | 44.0  | 39.5  | 46.7  | 43.1  | 53.6  | 23.5  | 52.8  | 21.3  | 23.0  | 46.3   | 33.1   | 41.7   | 48.1   | 52.9   | 41.3
TOIST† [19] | 45.8  | 40.0  | 49.4  | 49.6  | 53.4  | 26.9  | 58.3  | 22.6  | 32.5  | 50.0   | 35.5   | 43.7   | 52.8   | 56.2   | 44.1
Ours        | 58.9  | 55.0  | 51.2  | 68.5  | 60.5  | 47.7  | 76.9  | 40.7  | 47.4  | 66.5   | 41.9   | 48.3   | 61.7   | 71.4   | 56.9

Table 1. Comparison with state-of-the-art models for task driven object detection on the COCO-Tasks dataset (AP^box@0.5 per task). † denotes the model trained with noun-pronoun distillation.
In addition to utilizing visual affordance knowledge for query generation and thereby providing the decoder with prior knowledge, we further improve the knowledge utilization by designing a knowledge-based denoising training [18]. As the visual affordance knowledge indicates the target objects' visual attributes, such as shape and size, the knowledge-based denoising guides the decoder in learning how to use this kind of visual knowledge to regress the targets' boxes.

Specifically, during the training stage, we first randomly add noise to the ground-truth boxes O^gt = \{o^gt_m\}_{m=1}^{M} to construct noised objects following DN-DETR [18], and then extract the noised boxes' visual features and centers as the noised queries F^noise = \{f^noise_m\}_{m=1}^{M} and the noised reference points L^noise. Notice that the previous denoising training method [18] adds noise to both boxes and category labels to better capture label-box relations, whereas we only add noise to boxes because we aim to utilize the noise-free knowledge to help denoise the boxes. Therefore, we extract the knowledge unit (p^k_{d_m}, p^v_{d_m}) for each ground-truth box o^gt_m through Eq. 2. Finally, the knowledge units \{(p^k_{d_m}, p^v_{d_m})\}_{m=1}^{M} guide the decoder to regress the ground-truth boxes O^gt from the noised queries F^noise, which is formulated as follows:

P^{kn} = \{\mathrm{AttentionPool}(f^{noise}_m, p^v_{d_m})\}_{m=1}^{M},
O^{denoise} = \mathrm{Deformable}([F^{noise} + P^{kn}, L^{noise}], F),    (5)

where P^kn is the visual affordance knowledge of the noised queries, and the Deformable(·, ·) in Eq. 4 and Eq. 5 shares the same parameters. The denoising branch is only used during the training stage.

3.4. Loss Functions

Following DETR [3], we use bipartite matching to find the unique predictions for the ground-truth objects and adopt the same bounding box regression loss L_box, consisting of the L1 loss and the GIoU [39] loss. Moreover, we use the binary cross entropy loss as the classification loss L_cl. The overall loss is represented as:

\mathcal{L}_{cost} = \lambda_{cl}\mathcal{L}_{cl} + \lambda_{box}\mathcal{L}_{box},    (6)

where λ_cl and λ_box are the hyperparameters of the weighted loss. Our method can be easily extended to instance segmentation by adding a segmentation head [5] and replacing the box regression loss with the Dice loss L_mask.
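As a concrete reading of Eq. (6), below is a sketch of the weighted loss for prediction-target pairs that have already been matched by bipartite matching (the matching itself and the denoising-branch supervision are omitted). The (x1, y1, x2, y2) box format and the equal weighting of the L1 and GIoU terms inside L_box are assumptions; λ_cl = 4 and λ_box = 5 follow Section 4.1.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou

def detection_loss(pred_logits, pred_boxes, gt_labels, gt_boxes,
                   lambda_cl=4.0, lambda_box=5.0):
    """Weighted loss of Eq. (6) for already-matched pairs.

    pred_logits: (M,) raw scores, gt_labels: (M,) binary targets,
    pred_boxes / gt_boxes: (M, 4) in (x1, y1, x2, y2) format.
    """
    # Classification: binary cross entropy (computed from logits here).
    loss_cl = F.binary_cross_entropy_with_logits(pred_logits, gt_labels.float())

    # Box regression: L1 + GIoU [39], as in DETR-style detectors;
    # equal weighting of the two terms is an assumption.
    loss_l1 = F.l1_loss(pred_boxes, gt_boxes)
    loss_giou = (1.0 - torch.diag(generalized_box_iou(pred_boxes, gt_boxes))).mean()
    loss_box = loss_l1 + loss_giou

    return lambda_cl * loss_cl + lambda_box * loss_box
```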
4. Experiment

4.1. Dataset and Implementation Details

Dataset. We conduct experiments on the COCO-Tasks dataset [42], which comprises 14 different tasks (see Table 1). This dataset is derived from the COCO dataset [21] but with customized annotations for task driven object detection. Each task contains 3600 training and 900 testing images. Besides, we follow [19] to incorporate mask annotations into the original COCO-Tasks dataset for the instance segmentation benchmark.

Implementation Details. Following previous works [19, 42], we use ResNet-101 [13] as the image encoder and RoBERTa [23] as the text encoder. The model is pre-trained on the COCO dataset, with images that are already part of COCO-Tasks removed. We train the model for 4000 iterations with an initial learning rate of 1e-4 and use AdamW [24] as the optimizer. The hyperparameters λ_cl and λ_box are 4 and 5, respectively. Following [19], we evaluate the segmentation and detection performance of each task using AP^mask@0.5 and AP^box@0.5, respectively, and denote their means across all tasks as mAP^mask and mAP^box. Unless otherwise specified, we leverage GPT-3 [2] to extract visual affordance knowledge due to its capability to generate rationales [52].

4.2. Comparison with State-of-the-Art Methods

Table 1 and Table 2 show the comparison of our CoTDet with state-of-the-art models (SOTAs) on the detection and segmentation benchmarks. Our model consistently outperforms the SOTAs [19, 42] on all benchmarks and tasks.

Comparison with SOTAs. Compared to TOIST [19], our CoTDet achieves significant performance improvements (15.6% mAP^box and 14.8% mAP^mask), which demonstrates the effectiveness of our task-relevant knowledge acquisition and utilization. Compared to the two-stage method GGNN [42], we achieve 24.3% mAP^box and 21.2% mAP^mask performance gains, which demonstrates the importance of leveraging visual affordance knowledge rather than purely visual context information.

Method      | task1 | task2 | task3 | task4 | task5 | task6 | task7 | task8 | task9 | task10 | task11 | task12 | task13 | task14 | Mean
GGNN [42]   | 31.8  | 28.6  | 45.4  | 33.7  | 46.8  | 16.6  | 37.8  | 15.1  | 15.0  | 49.9   | 24.9   | 18.9   | 49.8   | 39.7   | 32.4
TOIST [19]  | 37.0  | 34.4  | 44.7  | 34.2  | 51.3  | 18.6  | 40.5  | 17.1  | 23.4  | 43.8   | 29.3   | 39.9   | 46.6   | 42.4   | 35.2
TOIST† [19] | 40.8  | 36.5  | 48.9  | 37.8  | 43.4  | 22.1  | 44.4  | 20.3  | 26.9  | 48.1   | 31.8   | 34.8   | 51.5   | 46.3   | 38.8
Ours        | 55.0  | 51.6  | 51.2  | 57.7  | 60.1  | 43.1  | 65.9  | 40.4  | 45.4  | 64.8   | 40.4   | 48.7   | 61.7   | 64.4   | 53.6

Table 2. Comparison with state-of-the-art models for task driven instance segmentation on the COCO-Tasks dataset (AP^mask@0.5 per task). † denotes the model trained with noun-pronoun distillation.
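As a sanity check on the metric, the reported means are simply the arithmetic average of the 14 per-task APs; for example, averaging our per-task values from Table 1 and Table 2 reproduces the 56.9 and 53.6 means.

```python
# Per-task AP@0.5 of CoTDet from Table 1 (detection) and Table 2 (segmentation).
ap_box = [58.9, 55.0, 51.2, 68.5, 60.5, 47.7, 76.9,
          40.7, 47.4, 66.5, 41.9, 48.3, 61.7, 71.4]
ap_mask = [55.0, 51.6, 51.2, 57.7, 60.1, 43.1, 65.9,
           40.4, 45.4, 64.8, 40.4, 48.7, 61.7, 64.4]

mAP_box = sum(ap_box) / len(ap_box)     # 56.9, matching the "Mean" column of Table 1
mAP_mask = sum(ap_mask) / len(ap_mask)  # 53.6, matching the "Mean" column of Table 2
print(round(mAP_box, 1), round(mAP_mask, 1))
```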
Comparison on Sub-tasks. The following comparisons on sub-tasks further demonstrate that affordance-level knowledge is capable of bridging tasks and objects. Our CoTDet significantly improves the detection and segmentation performance on task4 (get potatoes out of fire), task6 (get lemon out of tea), and task7 (dig hole), achieving approximately 20% mAP improvement on both benchmarks. These tasks face the common challenge of a wide variety of target categories and visual appearances, which is hardly handled by methods like [19, 42] that merely learn the mapping between tasks and objects' categories and visual features. In contrast, our method explicitly acquires the visual affordance knowledge of tasks to detect rare objects and avoid overfitting to common objects, and thus outperforms significantly on these tasks. In addition, for the less challenging tasks with a few ground-truth object categories, we still achieve approximately 8% mAP improvement, demonstrating the effectiveness of conditioning object localization on visual affordances.

4.3. Ablation Study

We evaluate seven variants of our CoTDet and two SOTAs with the same backbones as ours to validate the effectiveness of the proposed knowledge acquisition and utilization. The results are shown in Table 3. In addition to mAP, we report AP^box@0.5 on the relatively easy task2 (sit comfortably) and the challenging task6 (get lemon out of tea) and task9 (open parcel) for reference; the full results and analysis on sub-tasks are provided in the Appendix.

Ablation          | task2 | task6 | task9 | mAP^box | mAP^mask
"objects"         | 25.4  | 16.5  | 21.0  | 31.9    | 31.3
"visual"          | 50.4  | 30.3  | 38.3  | 48.1    | 44.7
w/o rationales    | 52.0  | 40.7  | 41.2  | 52.4    | 49.0
MLCoT             | 55.0  | 47.7  | 47.5  | 56.9    | 53.6
MLCoT (ChatGPT)   | 50.6  | 48.1  | 50.3  | 57.0    | 54.0
Def+GGNN [42]     | 38.6  | 24.7  | 23.4  | 38.8    | 35.8
Def+TOIST [19]    | 43.4  | 21.0  | 29.0  | 40.3    | 37.6
Init w/ MLCoT     | 42.2  | 35.9  | 35.6  | 48.7    | 46.4
Fuse w/ MLCoT     | 44.0  | 42.3  | 41.2  | 50.6    | 47.7
Select w/ MLCoT   | 50.0  | 47.2  | 43.7  | 55.3    | 51.7
Full Decoder      | 55.0  | 47.7  | 47.5  | 56.9    | 53.6

Table 3. Ablation study on knowledge acquisition, detection framework, and knowledge utilization of our CoTDet (AP^box@0.5 on task2, task6, and task9, plus mAP^box and mAP^mask over all tasks).

MLCoT Prompting for Knowledge Acquisition. To evaluate the impact of the core designs in MLCoT, we replace the MLCoT pipeline with the following approaches and utilize the acquired knowledge as the condition to guide detection: (1) We encode the object categories returned by the object level of MLCoT as the knowledge to perform knowledge-conditional object detection. The results (31.9% mAP^box and 31.3% mAP^mask) demonstrate that simply extracting object categories from LLMs cannot achieve satisfactory performance. (2) We attempt to acquire affordance-level rather than object-level knowledge. Specifically, we prompt LLMs by asking "what visual features can we use to determine the suitability of an object for {TASK}?" to generate visual affordance knowledge directly. This attempt improves on the above object-level model by 16.2% mAP^box and 13.4% mAP^mask, showing the necessity of exploring the essential visual affordances behind the object categories. However, this model still underperforms our full model by approximately 9% mAP: it is difficult to summarize a unified description of widely varying objects without priors, resulting in only one set of visual attributes being returned from LLMs. (3) To increase the diversity of visual affordances, we prompt LLMs to generate visual features for each retrieved object, which leads to a significant improvement, reaching 52.4% mAP^box and 49.0% mAP^mask. (4) Finally, we further add rationales to filter out misleading and irrelevant attributes, achieving a 4.5% and 4.6% increase in mAP^box and mAP^mask, respectively. (5) We also evaluate the effect of using different LLMs to extract visual affordance knowledge: our MLCoT with ChatGPT [31] achieves a similar mAP to MLCoT with GPT-3.

Knowledge-conditional Object Detection. To validate the effectiveness of our proposed knowledge-conditioned decoder, we conduct ablation studies with two baselines and three variants based on the Deformable-DETR [65] framework: (1) We develop GGNN [42] on the Deformable-DETR detection framework. Def+GGNN simply learns the relations between objects and identifies objects based on their contexts, limiting its performance.
[Figure 3: qualitative examples; the left group compares GT, TOIST, and Ours, and the right group compares GT, "visual", w/o KDN, and Ours. Panel (b) shows "open bottle of beer" (TOIST: ∅) and panel (d) shows "get lemon out of tea".]
Rationale for (b): sharp blade with a pointed end to insert into the bottle cap.
Rationale for (d): the handle of the spatula is long enough to reach the bottom of the cup; the flat edge is designed to scoop up the lemon pieces.
Figure 3. Visualization of prediction results of our CoTDet, its variants, and the existing best-performing TOIST [19].
(2) Besides, similar to TOIST [19], we initialize the queries with the task's textual feature based on our framework. There is a performance gap of 16.6% mAP^box and 14.1% mAP^mask between Def+TOIST and our final model. (3) We introduce the visual affordance knowledge extracted by MLCoT but simply use it to initialize the queries of the decoder (Init w/ MLCoT). This model achieves significant performance gains compared to the two baselines. (4) We further fuse the knowledge with the image's visual feature map to construct a multi-modal feature map (Fuse w/ MLCoT), which jointly understands the two modalities and improves performance (1.9% mAP^box and 1.3% mAP^mask) compared to the last model. (5) Our proposed knowledge-conditional query generation, which generates queries based on the visual content of the image, the task, and the visual affordance knowledge, helps the decoder better localize the objects, resulting in average improvements of 4.7% mAP^box and 4.0% mAP^mask. (6) Finally, the knowledge-conditional denoising training improves AP^box and AP^mask by 1.6% and 1.9%, respectively.

4.4. Visualization

Figure 3 visualizes qualitative results for several examples. For (a), no objects in the image should be selected to "get lemon out of tea". Our model successfully returns the empty set, while TOIST detects the french fry, one of the salient objects in the image, as the tool. Similarly, as knives are uncommon for "open bottle of beer", the knife in (b) is challenging for TOIST to identify and locate. Guided by the visual affordance of "sharp blade with a pointed end", our model correctly localizes and selects the sharp knife. Examples (c) and (d) demonstrate the effectiveness of MLCoT and knowledge-conditional denoising training (KDN) by removing them. With visual affordance knowledge obtained by directly asking LLMs, our model relies solely on matching with a single knowledge unit, which incorrectly detects the trunk in (c) and misses the knife in (d): the trunk is easily confused with objects that are "flat, broad with a handle", while the knife is ignored because its visual attribute of being straight mismatches the single knowledge unit that includes "curved or angled". Furthermore, without KDN, our detector lacks explicit guidance, leading to inaccurate detection in challenging scenes: the glove in (c) and the knife in (d) are not detected successfully, and the packing line in (d) is mistakenly detected.

5. Conclusion

In this paper, we focus on challenging task driven object detection, which is practical in the real world yet under-explored. To bridge the gap between abstract task requirements and objects in the image, we propose to explicitly extract visual affordance knowledge for the task and to detect objects whose visual attributes are consistent with this knowledge. Furthermore, our CoTDet utilizes the visual affordance knowledge to condition the decoder in localizing and recognizing suitable objects.

Limitations: While acknowledging the disparity between the COCO-Tasks dataset and real-world application scenarios, attributed to its limited task variety and preferences in images and annotations, our approach has the potential to extend beyond these confines. Notably, our knowledge acquisition and utilization are flexible and generalizable, granting them the capacity to transcend specific datasets, tasks, object categories, or tools. We leave this to future work. Furthermore, with the incorporation of LLMs, our approach inherits potential social biases from LLMs, which could be reflected in a preference for selecting frequently used tools.

Acknowledgment: This work was supported by the National Natural Science Foundation of China (No. 62206174), the Shanghai Pujiang Program (No. 21PJ1410900), the Shanghai Frontiers Science Center of Human-centered Artificial Intelligence (ShangHAI), the MoE Key Laboratory of Intelligent Perception and Human-Machine Collaboration (ShanghaiTech University), and the Shanghai Engineering Research Center of Intelligent Vision and Imaging.
References

[1] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.
[2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
[3] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, pages 213–229, 2020.
[4] Dian Chen, Brady Zhou, Vladlen Koltun, and Philipp Krähenbühl. Learning by cheating. In Conference on Robot Learning, pages 66–75, 2020.
[5] Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems, 34:17864–17875, 2021.
[6] Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, and Houqiang Li. TransVG: End-to-end visual grounding with transformers. In ICCV, pages 1769–1779, 2021.
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[8] Luciano Floridi and Massimo Chiriatti. GPT-3: Its nature, scope, limits, and consequences. Minds and Machines, 30:681–694, 2020.
[9] Feng Gao, Qing Ping, Govind Thattai, Aishwarya Reganti, Ying Nian Wu, and Prem Natarajan. Transform-retrieve-generate: Natural language-centric outside-knowledge visual question answering. In CVPR, pages 5067–5077, 2022.
[10] Ross Girshick. Fast R-CNN. In ICCV, pages 1440–1448, 2015.
[11] Jiuxiang Gu, Handong Zhao, Zhe Lin, Sheng Li, Jianfei Cai, and Mingyang Ling. Scene graph generation with external knowledge and image reconstruction. In CVPR, pages 1969–1978, 2019.
[12] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, pages 2961–2969, 2017.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[14] Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. In ECCV, pages 108–124, 2016.
[15] Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. MDETR - modulated detection for end-to-end multi-modal understanding. In ICCV, pages 1780–1790, 2021.
[16] Keizo Kato, Yin Li, and Abhinav Gupta. Compositional learning for human object interaction. In ECCV, pages 234–251, 2018.
[17] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, pages 787–798, 2014.
[18] Feng Li, Hao Zhang, Shilong Liu, Jian Guo, Lionel M. Ni, and Lei Zhang. DN-DETR: Accelerate DETR training by introducing query denoising. In CVPR, pages 13619–13627, 2022.
[19] Pengfei Li, Beiwen Tian, Yongliang Shi, Xiaoxue Chen, Hao Zhao, Guyue Zhou, and Ya-Qin Zhang. TOIST: Task oriented instance segmentation transformer with noun-pronoun distillation. arXiv preprint arXiv:2210.10775, 2022.
[20] Liang Lin, Pengxiang Yan, Xiaoqian Xu, Sibei Yang, Kun Zeng, and Guanbin Li. Structured attention network for referring image segmentation. IEEE Transactions on Multimedia, 24:1922–1932, 2021.
[21] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014.
[22] Hugo Liu and Push Singh. ConceptNet—a practical commonsense reasoning tool-kit. BT Technology Journal, 22(4):211–226, 2004.
[23] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[24] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[25] Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpurohit, Peter Clark, and Ashwin Kalyan. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. In ICLR, 2023.
[26] Chengzhi Mao, Revant Teotia, Amrutha Sundar, Sachit Menon, Junfeng Yang, Xin Wang, and Carl Vondrick. Doubly right object recognition: A why prompt for visual rationales. arXiv preprint arXiv:2212.06202, 2022.
[27] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, pages 11–20, 2016.
[28] Kenneth Marino, Xinlei Chen, Devi Parikh, Abhinav Gupta, and Marcus Rohrbach. KRISP: Integrating implicit and symbolic knowledge for open-domain knowledge-based VQA. In CVPR, pages 14111–14121, 2021.
[29] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In CVPR, pages 3195–3204, 2019.
[30] Medhini Narasimhan, Svetlana Lazebnik, and Alexander Schwing. Out of the box: Reasoning with graph convolution nets for factual visual question answering. Advances in Neural Information Processing Systems, 31, 2018.
[31] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.
[32] Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. Language models as knowledge bases? arXiv preprint arXiv:1909.01066, 2019.
[33] Bryan A. Plummer, Arun Mallya, Christopher M. Cervantes, Julia Hockenmaier, and Svetlana Lazebnik. Phrase localization and visual relationship detection with comprehensive image-language cues. In ICCV, pages 1928–1937, 2017.
[34] Sarah Pratt, Rosanne Liu, and Ali Farhadi. What does a platypus look like? Generating customized prompts for zero-shot image classification. arXiv preprint arXiv:2209.03320, 2022.
[35] Mengshi Qi, Yunhong Wang, Jie Qin, and Annan Li. KE-GAN: Knowledge embedded generative adversarial networks for semi-supervised scene parsing. In CVPR, pages 5237–5246, 2019.
[36] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(1):5485–5551, 2020.
[37] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[38] Allen Z. Ren, Bharat Govil, Tsung-Yen Yang, Karthik R. Narasimhan, and Anirudha Majumdar. Leveraging language for accelerated learning of tool manipulation. In Conference on Robot Learning, pages 1531–1541, 2023.
[39] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In CVPR, pages 658–666, 2019.
[40] Adam Roberts, Colin Raffel, and Noam Shazeer. How much knowledge can you pack into the parameters of a language model? arXiv preprint arXiv:2002.08910, 2020.
[41] Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, and Bernt Schiele. Grounding of textual phrases in images by reconstruction. In ECCV, pages 817–834, 2016.
[42] Johann Sawatzky, Yaser Souri, Christian Grund, and Jurgen Gall. What object should I use? - Task driven object detection. In CVPR, pages 7605–7614, 2019.
[43] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.
[44] Sheng Shen, Chunyuan Li, Xiaowei Hu, Yujia Xie, Jianwei Yang, Pengchuan Zhang, Anna Rohrbach, Zhe Gan, Lijuan Wang, Lu Yuan, et al. K-LITE: Learning transferable visual models with external knowledge. arXiv preprint arXiv:2204.09222, 2022.
[45] Cheng Shi and Sibei Yang. Spatial and visual perspective-taking via view rotation and relation reasoning for embodied reference understanding. In ECCV, pages 201–218, 2022.
[46] Krishna Kumar Singh, Santosh Divvala, Ali Farhadi, and Yong Jae Lee. DOCK: Detecting objects by transferring common-sense knowledge. In ECCV, pages 492–508, 2018.
[47] Jiajin Tang, Ge Zheng, Cheng Shi, and Sibei Yang. Contrastive grouping with transformer for referring image segmentation. In CVPR, pages 23570–23580, 2023.
[48] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[49] Duc Minh Vo, Hong Chen, Akihiro Sugimoto, and Hideki Nakayama. NOC-REK: Novel object captioning with retrieved vocabulary from external knowledge. In CVPR, pages 17979–17987, 2022.
[50] Peng Wang, Qi Wu, Chunhua Shen, Anthony Dick, and Anton Van Den Hengel. FVQA: Fact-based visual question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(10):2413–2427, 2017.
[51] Zhaoqing Wang, Yu Lu, Qiang Li, Xunqiang Tao, Yandong Guo, Mingming Gong, and Tongliang Liu. CRIS: CLIP-driven referring image segmentation. In CVPR, pages 11686–11695, 2022.
[52] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022.
[53] Qi Wu, Peng Wang, Chunhua Shen, Anthony Dick, and Anton Van Den Hengel. Ask me anything: Free-form visual question answering based on knowledge from external sources. In CVPR, pages 4622–4630, 2016.
[54] Sibei Yang, Guanbin Li, and Yizhou Yu. Cross-modal relationship inference for grounding referring expressions. In CVPR, pages 4145–4154, 2019.
[55] Sibei Yang, Guanbin Li, and Yizhou Yu. Dynamic graph attention for referring expression comprehension. In ICCV, pages 4644–4653, 2019.
[56] Sibei Yang, Guanbin Li, and Yizhou Yu. Propagating over phrase relations for one-stage visual grounding. In ECCV, pages 589–605, 2020.
[57] Sibei Yang, Guanbin Li, and Yizhou Yu. Relationship-embedded representation learning for grounding referring expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(8):2765–2779, 2020.
[58] Sibei Yang, Meng Xia, Guanbin Li, Hong-Yu Zhou, and Yizhou Yu. Bottom-up shift and reasoning for referring image segmentation. In CVPR, pages 11266–11275, 2021.
[59] Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, and Lijuan Wang. An empirical study of GPT-3 for few-shot knowledge-based VQA. In AAAI, pages 3081–3089, 2022.
[60] Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, and Philip H. S. Torr. LAVT: Language-aware vision transformer for referring image segmentation. In CVPR, pages 18155–18165, 2022.
[61] Dongran Yu, Bo Yang, Qianhao Wei, Anchen Li, and Shirui Pan. A probabilistic graphical model based on neural-symbolic reasoning for visual relationship detection. In CVPR, pages 10609–10618, 2022.
[62] Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L. Berg. MAttNet: Modular attention network for referring expression comprehension. In CVPR, pages 1307–1315, 2018.
[63] Alireza Zareian, Svebor Karaman, and Shih-Fu Chang. Bridging knowledge graphs to generate scene graphs. In ECCV, pages 606–623, 2020.
[64] Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023.
[65] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.