Under review as a conference paper at ICLR 2025

GEOMATH: A BENCHMARK FOR MULTIMODAL MATHEMATICAL REASONING IN REMOTE SENSING

Anonymous authors
Paper under double-blind review

ABSTRACT
Vision-language models (VLMs) have demonstrated impressive performance in various Earth observation tasks, particularly in zero-shot settings. However, their mathematical reasoning skills in remote sensing (RS) remain unexplored due to the lack of relevant data. To close this gap, we introduce GEOMATH, a multimodal mathematical reasoning benchmark meticulously designed for the RS domain. It comprises 3,773 high-quality vehicle-related questions from aerial perspectives, spanning 6 mathematical subjects and 20 topics. All data used in this benchmark were collected by our drones from various altitudes and perspectives. Despite the limited geographical coverage, full access to all parameters of the RS images and detailed vehicle information ensures that the constructed mathematical problems are rigorous and diverse. With GEOMATH, we conduct a comprehensive and quantitative evaluation of 14 prominent VLMs. Solving these math problems requires high-resolution visual perception and domain-specific mathematical knowledge, which poses a challenge even for state-of-the-art VLMs. We further explore the impact of image resolution and zero-shot prompting strategies on the scores, and analyze the reasons behind GPT-4o's reasoning errors. By comparing the gap between InternVL2 and GPT-4o, we find that the latter exhibits some level of cross-view knowledge transfer capability.
1 INTRODUCTION

Deep learning has achieved significant success in remote sensing (RS), but it often faces safety concerns due to its black-box nature (Höhl et al., 2024). The advent of vision-language models (VLMs) (Yin et al., 2023), which exhibit strong mathematical reasoning capabilities, offers a new approach to developing reliable RS interpretation systems (Wang et al., 2024c). VLMs can emulate human-like visual reasoning by employing a visual encoder as the "eyes" for perception and leveraging a large language model (LLM) as the "brain" for analysis (Dasgupta et al., 2022), facilitating seamless information transfer between the visual and textual modalities. Unlike traditional deep learning models, VLMs can offer a transparent reasoning process. To ensure the development of trustworthy RS interpretation systems, it is crucial to rigorously assess the multimodal mathematical reasoning abilities of VLMs.

Numerous RS Visual Question Answering (VQA) datasets (Lobry et al., 2020; Zheng et al., 2021; Zhang et al., 2023a) have been created to evaluate the capabilities of multimodal question answering systems. However, most of these questions primarily assess a model's visual perception abilities, with math-related questions representing only a small fraction. These math questions are often limited to counting and 2D spatial relationships, leaving broader mathematical reasoning capabilities largely unexplored. Moreover, since these questions can be answered without domain-specific knowledge (e.g., metric geometry, imaging principles, perspective transformation), they inevitably lack specialization. Hence, there is a pressing need to (1) establish a new benchmark that requires domain-specific knowledge, to facilitate the development of RS VQA systems, and (2) assess the progress of vision-language geo-foundation models (VLGFMs) (Zhou et al., 2024), especially their mathematical reasoning capabilities.
Figure 1: Accuracies of four leading VLMs, one VLGFM, and random chance on our proposed GEOMATH across mathematical subjects and reasoning steps.

In this paper, we present GEOMATH, a multimodal mathematical reasoning benchmark within the context of remote sensing imagery. It encompasses six mathematical subjects: geometry, logic, statistics, arithmetic, counting, and algebra. The benchmark supports five potential application scenarios: surveying, surveillance, market research, entertainment, and military. Each question in the benchmark provides detailed reasoning steps, with the minimum number of reasoning steps being 2 and the maximum being 6. The benchmark covers 11 distinct 4K-resolution RS scenes, captured with varying combinations of the drone's above ground level (AGL) and pitch angle. In total, GEOMATH comprises 3,773 newly created problems (Table 1). For fine-grained evaluation, each example is annotated with metadata, including question type, answer type, rationale, reasoning steps, pitch angle, AGL, and necessary context. A detailed description of data collection can be found in §2.
We conducted extensive experiments on GEOMATH to evaluate the reasoning abilities of 14 foundation models that exhibit state-of-the-art performance on multimodal reasoning tasks. Among these models, GPT-4o (OpenAI, 2023) is a proprietary model and GeoChat (Kuckreja et al., 2024b) is fine-tuned on RS data. Furthermore, we explore several zero-shot prompting techniques to shift the model from a single-step reasoning paradigm to a multi-step reasoning mode, aligning more closely with human cognitive processes. These include Chain-of-Thought (CoT) (Wei et al., 2022) and Plan-and-Solve (PS) (Wang et al., 2023), designed for LLMs, as well as Description CoT (DespCoT) (Wu et al., 2023) and Compositional CoT (CCoT) (Mitra et al., 2024), tailored for VLMs.

To our knowledge, we have taken a meaningful first step towards multimodal mathematical reasoning in RS. This work selects vehicles as the main subject and provides a preliminary exploration of mathematical problems in remote sensing, without involving multisource RS images or complex sensor characteristics. As illustrated in Figure 1, GPT-4o demonstrates superior performance in five subjects. However, even the highest overall accuracy achieved is only 34.6%. We highlight the challenges that high-resolution RS images pose to VLMs. Our in-depth analysis in §3.3 and §E.6 reveals that the knowledge transfer capability of GPT-4o is another key factor contributing to its superior performance on GEOMATH. We hope that GEOMATH will serve as a valuable resource, providing a benchmark for the future development of trustworthy multimodal RS interpretation systems.
2 THE GEOMATH DATASET

As mentioned above, there is a noticeable gap in RS VQA benchmarks, which mainly focus on evaluating the perceptual capabilities of models while neglecting their mathematical capabilities. Our dataset, GEOMATH, aims to bridge this gap by providing a robust evaluation benchmark for mathematical reasoning intertwined with RS visual perception. In this section, we present GEOMATH, following the steps of data collection, metadata annotation, question design, and question generation. Finally, we perform data analysis on the dataset.
2.1 DATA COLLECTION

To the best of our knowledge, there is currently no dedicated mathematical dataset specifically designed for remote sensing. Existing open-source RS datasets (Xia et al., 2018; Li et al., 2020) often lack sensor metadata and provide limited target attributes. Consequently, these datasets can only support the formulation of simple mathematical problems, such as counting objects of a given color or judging relative positions in the image. To develop a more specialized and diverse mathematical dataset, we use unmanned aerial vehicles (UAVs) to collect data from scratch. This approach ensures comprehensive access to sensor parameters and detailed information about ground targets. To enhance the diversity of mathematical problems, we choose vehicles as the subject of drone photography. Compared to buildings or land cover (Yang & Newsam, 2010), vehicles have richer attributes and more fine-grained categories. Data collection is divided into two parts: aerial imagery and ground video.
Aerial Imagery. All aerial images in GEOMATH were collected with a small UAV platform, a DJI Mini 3, between 10 and 16 September 2023 in Shanghai. The dataset consists of 4K high-resolution RS images from 11 distinct scenes, captured at 9 different above ground levels (AGLs) and 3 pitch angles. These RS images therefore have different spatial resolutions and perspectives. In addition, the collected images cover a variety of weather conditions, such as sunny, cloudy, and rainy days, along with different lighting conditions. Details are provided in §B.1.
Ground Video. We record ground videos from the same areas to facilitate accurate annotation of vehicle brands and models. Specifically, we select time slots with relatively low vehicular mobility, avoiding rush hours and meal times. Additionally, to mitigate the vehicle mismatch between drone images and ground videos caused by vehicles entering and exiting, we capture two sets of ground videos, one before and one after the drone captures the aerial photos. This ensures that vehicles entering or exiting the scene halfway through the capture are recorded in the videos. However, there are instances where vehicles pass through the scene briefly, leading to cases where they are not captured in either video. In such situations, we cover these vehicles with a black mask in the images to ensure that all visible vehicles have fully known attributes. Due to privacy concerns, ground videos will not be released.
2.2 METADATA ANNOTATION

The metadata we use can be categorized into two main components. The first includes camera-related parameters, such as intrinsic parameters (focal length, pixel size, sensor dimensions) and extrinsic parameters (pitch angle, AGL). These are extracted from the raw data recorded by the drone. The second component pertains to fine-grained vehicle attributes, which require manual annotation. To accurately describe the length and width of vehicles, we use rotated bounding boxes to annotate their positions (Yang et al., 2022). A 360-degree angle representation is then used to depict each vehicle's orientation (Hu & Tong, 2023). Identifying specific vehicle brands from aerial imagery presents a significant challenge for human annotators, and as a result, existing publicly available RS vehicle datasets have not achieved brand-level annotations (Mundhenk et al., 2016; Zhu et al., 2021).

However, leveraging the previously mentioned ground videos, we successfully created the first RS vehicle dataset with fine-grained attributes, identifying vehicles down to the model level within each brand. Specifically, we match the vehicles in the aerial image with the vehicles in the ground video one by one according to their locations, and then call the DCD API (https://dcdapp.com) to identify the specific model based on each vehicle's appearance and logo in the ground image. Vehicles whose models could not be identified are covered with a black mask in the image. We then use the DCD car database to obtain detailed attributes, such as the size and price of each car. Vehicle prices were sourced during August 2024, and the average price is calculated as the mean of the maximum and minimum market values. With detailed vehicle attributes and sensor parameters (§B.2), GEOMATH can be established. In the next subsection, we list the metadata used for each type of question.
2.3 QUESTION DESIGN AND GENERATION

Recent works (Li et al., 2024; Xu et al., 2024) adopt GPT to automatically generate RS VQA data in order to reduce manual labor. Compared to template-based methods, model-generated questions exhibit greater diversity. However, in mathematical benchmarks, the rigor of the questions is paramount. Given the current performance of GPT on multimodal mathematical benchmarks (Lu et al., 2024b; Wang et al., 2024b), we cannot fully trust it. Therefore, we choose a template-based question generation approach, which offers more control over content than generative models. To compensate for the lack of diversity, we design more than 80 templates based on 20 topics (§A.3).
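As a rough, hypothetical sketch of this template-based generation (the template string, annotation fields, and function below are our own illustration, not the released pipeline), a counting template can be instantiated directly from per-image vehicle annotations:

```python
import random

# Hypothetical per-image vehicle annotations; field names are illustrative.
vehicles = [
    {"type": "suv",   "color": "white", "model": "audi a4l"},
    {"type": "sedan", "color": "black", "model": "byd song plus"},
    {"type": "suv",   "color": "white", "model": "aito m5"},
]

TEMPLATE = "How many {color} {vtype} vehicles are there in the image?"

def make_counting_question(vehicles, rng):
    """Fill one counting template and compute its ground-truth answer from annotations."""
    target = rng.choice(vehicles)                       # pick the attributes to ask about
    color, vtype = target["color"], target["type"]
    answer = sum(v["color"] == color and v["type"] == vtype for v in vehicles)
    return {"question": TEMPLATE.format(color=color, vtype=vtype.upper()),
            "answer": answer}

print(make_counting_question(vehicles, random.Random(0)))
```

Because the answer is computed from the annotations rather than generated by a model, each instantiated question remains verifiably correct, which is the rigor argument made above.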
Figure 2: Mathematical modeling of UAV scenes and an example geometric question. The figure depicts the pixel, image, and camera coordinate systems and the ground plane for a typical UAV shot, together with a worked example. Question: "What model is the vehicle closest to the drone?" Solving this problem requires 6 steps: Step 1, locate the pixel coordinates of each vehicle. Step 2, convert them from the pixel to the image coordinate system. Step 3, convert them from the image to the camera coordinate system. Step 4, calculate the distance of each vehicle in the camera coordinate system. Step 5, find the minimum distance to the drone: 127.6 m. Step 6, identify the model of this vehicle: Audi A4L. Follow-up question: "How long does it take for you to fly to this car?" Answer: "My maximum flying speed is 15 m/s, so the fastest it takes is 127.6 / 15 ≈ 8.5 s."

Geometry. The geometric questions extend the spatial relationships in RS VQA from the 2D pixel plane to the 3D real world. The related camera parameters include the pitch angle θ, AGL H, focal length f, and pixel size p. Relevant domain knowledge includes metric geometry and perspective geometry. Figure 2 illustrates a typical UAV reconnaissance scenario. Given the relatively flat terrain of the shooting area, we can assume that it satisfies the flat-surface assumption (Novak, 2017). We validated this assumption by placing normal vectors n on reference objects such as poles. Given the camera parameters, the pixel coordinates of a car can be used to compute its camera coordinates. The complete calculation formulas are detailed in §B.3. Based on these, the closest vehicle can be identified and the shortest flight time can be estimated from the speed of the drone. In addition, we can estimate the area of the captured region as well as the size and orientation of the vehicles.
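The exact formulas are given in §B.3 and are not reproduced here; the following sketch is only our reconstruction of the pixel-to-camera conversion under the flat-ground assumption, using the ground-plane equation -cos(θ)·y − sin(θ)·z + H = 0 that appears in the algebra example of §A.2 (the function and parameter names are ours):

```python
import math

def pixel_to_camera(u, v, focal_mm=12.0, pixel_mm=0.004325,
                    width=4000, height=2250, pitch_deg=90.0, agl_m=40.0):
    """Project pixel (u, v) onto the ground plane; return camera coordinates in meters.

    Assumes a pinhole camera and flat terrain, with the ground plane written in
    camera coordinates as -cos(theta)*y - sin(theta)*z + H = 0 (see the algebra
    example in Section A.2); theta is the pitch angle and H the AGL.
    """
    # Steps 1-2: pixel -> image coordinates (millimeters, origin at the image center).
    x_img = (u - width / 2.0) * pixel_mm
    y_img = (v - height / 2.0) * pixel_mm
    # Step 3: intersect the viewing ray t*(x_img, y_img, f) with the ground plane.
    theta = math.radians(pitch_deg)
    t = agl_m / (math.cos(theta) * y_img + math.sin(theta) * focal_mm)
    return (t * x_img, t * y_img, t * focal_mm)

# The appendix example: pixel [420, 534] at pitch 90 deg and AGL 40 m comes out near
# (-22.8, -8.5, 40.0), consistent with the listed answer [-22, -8, 40].
print(pixel_to_camera(420, 534))
```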
Arithmetic. We construct a series of arithmetic questions, including addition, subtraction, multiplication, and division, based on the prices of the vehicles. For example, questions may ask which of two cars is more expensive or how many of a certain type of car can be bought with 1 million RMB. Considering that vehicle prices can be unstable due to market fluctuations, we provide the vehicle models and their corresponding prices in the context field of each problem. We exclude questions that can be answered solely through text, ensuring that the model must rely on visual data to arrive at the correct answer. This approach ensures that the model can obtain all the necessary information to solve the mathematical problems in an offline environment, without the need for retrieval-augmented generation (RAG) (Gao et al., 2023) techniques.
Counting. By incorporating more fine-grained attributes of the cars, we are able to construct a wider variety of counting questions with varying levels of difficulty. Related attributes include vehicle type, brand, model, and price. The generated questions not only involve counting based on single-attribute constraints but also include comparative counting and counting based on multiple attribute constraints. For example, questions may ask for the number of cars priced above 100,000 RMB or the number of white SUVs. In GEOMATH, each image contains an average of 25.8 cars. The differences between vehicles are smaller than those between different object categories, making the task more challenging.
Algebra. The algebraic questions are primarily divided into two categories: single-variable algebra and multi-variable algebra. The model needs to use its visual perception capabilities to obtain certain variables and then solve equations to determine the target variable. The relevant domain knowledge includes spatial coordinate system transformations, such as determining the coordinates of a vehicle in the image or camera coordinate system based on its pixel coordinates obtained from the image. We also construct algebraic questions related to prices, such as calculating the price of the vehicle closer to the image bottom based on the total price of two cars and their price ratio.
Logic. In the design of logic problems, beyond incorporating image-based information, some common-sense knowledge from daily life is introduced. For example, electric vehicles do not need to visit gas stations regularly, and the number of passengers a taxi can accommodate is equal to the total number of seats in the vehicle minus one (excluding the driver).

Statistics. We design statistical questions based on vehicle prices and sizes, covering maximum, minimum, mean, and mode. The related domain knowledge is metric geometry.

Table 1: Key statistics of GEOMATH.

Statistic                          Number
Total questions                    3,773
 - Multiple-choice questions       1,352 (35.8%)
 - Free-form questions             2,181 (57.8%)
 - True/False questions            240 (6.4%)
Unique number of images            360
 - Pitch angle: 90                 117 (32.5%)
 - Pitch angle: 60                 126 (35.0%)
 - Pitch angle: 45                 117 (32.5%)
 - Above ground level: low         138 (38.3%)
 - Above ground level: medium      108 (30.0%)
 - Above ground level: high        114 (31.7%)
Unique number of questions         424
Unique number of answers           686
Maximum question length            236
Minimum question length            45
Average question length            101.5
Maximum reasoning steps            6
Minimum reasoning steps            2
Average reasoning steps            3.34

Figure 3: Question types covered by GEOMATH. There are 6 subjects and 20 topics in our benchmark. ARI: arithmetic, CNT: counting, ALG: algebra, STA: statistics, LOG: logic, GEO: geometry.
Existing RS VQA tasks focus mainly on single-step reasoning (Lobry et al., 2020), such as land cover and building classification. Our benchmark emphasizes multistep reasoning ability (Chen et al., 2024), with the minimum number of reasoning steps for all questions being 2 and the maximum being 6. As shown in Figure 1, longer reasoning chains place higher demands on a model's reasoning capabilities. Ours is the first RS VQA dataset that provides a multistep reasoning process for each question. Although each process offers a feasible solution approach, it is not necessarily the only one. For the sake of rigor, the reasoning steps are not used to calculate model scores; however, they can serve as a reference to help analyze the reasons behind a model's reasoning errors (see §3.4).

During the question generation phase, we prioritize images in which vehicles are not significantly occluded by buildings or trees. The process consists of three steps: 1) generating image-level questions without modifying the images; 2) generating single-instance questions by randomly selecting a vehicle and drawing a rotated bounding box around it as a visual prompt; and 3) generating two-instance questions by randomly selecting two vehicles and drawing their rotated bounding boxes in different colors (e.g., red and blue). Vehicles near the edge of the image are excluded to avoid difficulties due to incomplete visual information. Finally, the generated questions are manually reviewed for accuracy.
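As a rough sketch of how such rotated-box visual prompts can be rendered (our own OpenCV illustration, not the paper's released tooling; the box sizes, angles, and file names are made up, while the two center points are taken from the appendix example):

```python
import cv2
import numpy as np

def draw_rotated_box(image, cx, cy, w, h, angle_deg, color=(0, 0, 255), thickness=6):
    """Draw a rotated bounding box (pixel units) onto the image in-place."""
    # cv2.boxPoints expects ((cx, cy), (w, h), angle) and returns the 4 corner points.
    corners = cv2.boxPoints(((cx, cy), (w, h), angle_deg))
    corners = np.round(corners).astype(np.int32)
    cv2.polylines(image, [corners], isClosed=True, color=color, thickness=thickness)
    return image

# Example: mark one vehicle in red and another in blue on a 4K frame (BGR colors).
frame = cv2.imread("scene_001.jpg")                                      # hypothetical file
draw_rotated_box(frame, 3554, 1051, 180, 90, 35.0, color=(0, 0, 255))    # red box
draw_rotated_box(frame, 2583, 1974, 180, 90, 120.0, color=(255, 0, 0))   # blue box
cv2.imwrite("scene_001_prompted.jpg", frame)
```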
2.4 DATA ANALYSIS

The main statistics of GEOMATH are presented in Table 1. There are three types of questions: multiple-choice, free-form, and True/False. The answers to free-form questions are categorized as integers, floating-point numbers, lists, or strings. Variations in pitch angle and AGL ensure the diversity of observation patterns in GEOMATH. The examples in §A.2 illustrate the various types of math problems. The comparison of reasoning steps in Figure 9 with other RS VQA datasets highlights the complexity of the problems in GEOMATH. More details on the data analysis are available in §C.
3 EXPERIMENTS

GeoChat (Kuckreja et al., 2024a) has shown that fine-tuning VLMs on RS datasets enhances their generalization capabilities across various multimodal RS tasks. Our objective is to perform qualitative and quantitative analyses using GEOMATH to assess whether this generalization extends to multimodal RS tasks that require specialized knowledge. §3.1 outlines our evaluation strategy, while §3.2 details the VLMs evaluated. Quantitative results are presented in §3.3, followed by a qualitative analysis in §3.4.

Model | LLM | ALL | Subject: ALG ARI CNT GEO LOG STA | AGL: Low Med High | Pitch: 45 60 90 | Type: FRE CHO T/F
Random chance | - | 11.7 | 11.5 8.4 10.6 8.0 22.6 9.2 | 10.7 13.2 12.3 | 11.1 12.5 12.2 | 0.0 24.3 51.3
Small-scale VLMs (LLM's parameters < 10 billion)
GeoChat | Vicuna-7B | 12.6 | 15.4 5.4 16.7 9.4 24.1 4.5 | 13.9 12.7 10.7 | 11.5 12.3 13.7 | 4.6 19.9 42.1
XComposer2 | InternLM2-7B | 13.6 | 4.0 2.9 20.0 2.3 28.7 23.4 | 15.4 13.5 12.0 | 12.0 14.4 14.7 | 11.0 11.8 49.6
Qwen-VL-Chat | Qwen-7B | 16.5 | 6.7 10.7 15.8 11.4 30.2 24.1 | 18.8 18.1 17.8 | 18.1 18.4 18.3 | 9.5 26.9 49.2
LLaVA-v1.5-7B | Vicuna-7B | 18.3 | 12.5 11.6 15.8 12.2 35.8 22.2 | 19.1 22.7 18.3 | 19.9 18.5 21.4 | 10.9 28.3 54.2
XComposer2.5 | InternLM2-7B | 18.5 | 4.0 16.4 26.1 7.2 26.8 30.3 | 20.4 18.6 17.6 | 18.4 20.0 18.5 | 7.7 30.8 55.0
DeepSeek-VL | DeepSeek-7B-Base | 18.5 | 9.2 18.5 18.9 8.8 28.7 27.2 | 22.1 18.6 17.4 | 19.4 20.2 19.0 | 8.8 30.9 53.8
MiniCPM-V 2.5 | Llama3-8B | 20.0 | 21.5 11.8 24.2 9.2 29.9 23.2 | 20.4 20.1 17.7 | 19.0 19.6 19.8 | 6.5 33.3 59.6
MiniCPM-V 2.6 | Qwen2-7B | 21.6 | 16.3 19.3 29.2 10.0 30.1 24.5 | 24.3 18.6 20.1 | 19.2 23.3 21.0 | 9.6 34.5 51.3
InternVL2-8B | InternLM2.5-7B-Chat | 23.7 | 7.3 22.6 24.4 13.9 34.8 39.1 | 27.2 25.0 22.0 | 23.7 24.1 26.9 | 12.0 38.2 66.3
Large-scale VLMs (LLM's parameters > 10 billion)
LLaVA-v1.5-13B | Vicuna-13B | 17.2 | 10.2 18.7 19.2 11.8 22.8 20.3 | 17.2 18.0 19.5 | 18.3 18.4 17.8 | 5.7 33.1 47.5
InternVL-Chat-V1.5 | InternLM2-Chat-20B | 18.8 | 17.1 13.4 18.9 9.6 30.4 23.4 | 20.7 17.2 19.0 | 16.3 19.7 21.3 | 8.1 29.7 59.2
LLaVA-v1.6-34B | Hermes-Yi-34B | 23.9 | 12.1 17.7 31.7 15.1 37.0 29.6 | 26.1 24.6 21.6 | 22.9 25.6 23.9 | 10.6 39.4 61.7
InternVL2-40B | Nous-Hermes-2-Yi-34B | 26.8 | 20.1 24.7 23.6 12.0 47.4 33.3 | 30.1 27.4 24.5 | 25.3 29.8 27.2 | 15.9 40.5 59.6
GPT-4o | - | 33.5 | 35.7 24.2 33.6 15.5 48.2 43.6 | 36.8 31.7 29.9 | 30.7 34.1 34.2 | 18.8 50.8 62.5
Zero-shot prompting techniques
CoT (LLaVA-v1.6-34B) | Hermes-Yi-34B | 20.7 | 14.2 14.6 28.3 10.9 34.1 22.1 | 22.0 18.9 20.9 | 19.2 20.5 22.4 | 9.3 34.0 49.2
CoT (InternVL2-40B) | Nous-Hermes-2-Yi-34B | 30.2 | 22.8 25.5 35.6 11.1 49.4 36.8 | 32.0 28.5 28.2 | 29.0 31.4 28.6 | 16.6 44.2 68.3
CoT (GPT-4o) | - | 34.1 | 32.8 23.9 34.4 14.9 51.3 47.1 | 36.6 33.0 31.5 | 32.0 33.5 36.3 | 20.7 49.0 69.2
PS (InternVL2-40B) | Nous-Hermes-2-Yi-34B | 28.4 | 21.7 22.6 29.2 12.6 48.1 36.2 | 31.7 27.7 26.0 | 28.5 30.8 26.5 | 16.1 42.9 62.1
PS (GPT-4o) | - | 34.6 | 35.3 24.2 32.5 14.5 55.1 45.8 | 38.4 31.9 32.0 | 33.3 35.1 34.8 | 20.5 50.7 68.8
CCoT (InternVL2-40B) | Nous-Hermes-2-Yi-34B | 24.8 | 19.8 19.5 20.8 12.5 44.2 32.0 | 26.5 24.9 24.9 | 24.7 25.6 26.2 | 13.8 39.6 52.9
DCoT (InternVL2-40B) | Nous-Hermes-2-Yi-34B | 25.0 | 20.1 19.2 23.3 12.5 41.9 33.0 | 27.3 25.5 24.0 | 25.5 25.0 26.7 | 14.9 37.1 60.4

Table 2: Accuracy scores on GEOMATH. ALL: average accuracy over the six subjects. Mathematical subjects: ALG: algebra, ARI: arithmetic, CNT: counting, GEO: geometry, LOG: logic, STA: statistics. FRE: free-form question, CHO: multiple-choice question, T/F: true or false question. The highest scores among models in each section and overall are highlighted in blue and red, respectively.
3.1 EVALUATION PROTOCOLS

In existing multimodal mathematical reasoning benchmarks such as MathVista (Lu et al., 2024b), GPT is used to derive answers from the responses of the evaluated models. However, frequent OpenAI API calls for each evaluation can incur substantial costs, which is challenging for independent researchers. Another reason for not using GPT to extract answers is that most RS interpretation systems are typically deployed in offline environments. To lower this barrier, we design a two-stage answer generation-extraction strategy. In the first stage, the model freely generates answers, focusing solely on reasoning without format constraints. In the second stage, the model extracts content in the specified format from its own response, improving format accuracy. This decoupling of reasoning and formatting allows us to extract the final answer in an offline environment using regular expressions. During question generation, the data type of each answer is stored in the "eva" field. In the extraction phase, regular expressions are applied based on the answer type to retrieve the answer from the model's response. GEOMATH includes multiple-choice, free-form, and true/false questions, with free-form answers being strings, integers, floats, or lists, so we use accuracy as the evaluation metric. This allows users to efficiently assess their model's performance on GEOMATH locally using the evaluation function we provide. For details on the evaluation prompts and parameters, refer to §D.
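A minimal sketch of the regex-based extraction stage, keyed on the answer type stored in the "eva" field (the patterns and function are our own illustration, not the released evaluation function):

```python
import re

def extract_answer(response: str, answer_type: str):
    """Pull the final answer out of a model response, by expected answer type."""
    text = response.strip()
    if answer_type == "choice":                      # multiple-choice: last option letter
        m = re.findall(r"\b([A-D])\b", text)
        return m[-1] if m else None
    if answer_type == "bool":                        # true/false
        m = re.search(r"\b(true|false)\b", text, re.IGNORECASE)
        return m.group(1).lower() if m else None
    if answer_type in ("integer", "float"):          # last number in the response
        m = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
        if not m:
            return None
        return float(m[-1]) if answer_type == "float" else int(float(m[-1]))
    if answer_type == "list":                        # bracketed list of numbers
        m = re.search(r"\[([^\]]+)\]", text)
        return [float(x) for x in m.group(1).split(",")] if m else None
    return text                                      # free-form string: keep as-is

print(extract_answer("Thus, the best option is: (A)", "choice"))              # A
print(extract_answer("The distance is approximately 23 meters.", "integer"))  # 23
```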
3.2 EXPERIMENTAL SETUP

We evaluated models on GEOMATH under the following setups: (a) vision-language foundation models, including general models such as LLaVA (Liu et al., 2023), Qwen-VL-Chat (Bai et al., 2023), XComposer2 (Zhang et al., 2023b), DeepSeek-VL (Lu et al., 2024a), InternVL (Chen et al., 2023), MiniCPM-V (Hu et al., 2024), and GPT-4o (OpenAI, 2023), as well as the remote sensing VLM GeoChat (Kuckreja et al., 2024b); (b) a zero-shot prompting setting with CoT (Wei et al., 2022), PS (Wang et al., 2023), DCoT (Wu et al., 2023), and CCoT (Mitra et al., 2024).

Figure 4: Impact of image resolution and AGL on accuracy scores for InternVL2-40B.

Figure 5: Distribution of pixel area occupied by vehicles under different AGLs.
3.3 EXPERIMENTAL RESULTS

Table 2 reports the performance of the evaluated models, including prominent open-source VLMs and the leading proprietary model. In light of literature (Chen et al., 2024) indicating that LLMs exceeding 10 billion parameters exhibit emergent thinking and reasoning capabilities, we categorize these models into two groups based on the size of their embedded LLMs to facilitate comparison. We also create a random-chance baseline as a reference: a random option is selected for multiple-choice and true/false questions, while free-form questions are left blank. We generate the random-chance results three times, average them, and record them in Table 2.
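A minimal sketch of how such a random-chance baseline can be computed (the question fields and the loader are our assumptions):

```python
import random

def random_chance_run(questions, seed=0):
    """Score one random-chance run: guess on choice/boolean items, leave free-form blank."""
    rng = random.Random(seed)
    correct = 0
    for q in questions:
        if q["type"] == "choice":
            guess = rng.choice(q["options"])           # e.g. ["A", "B", "C", "D"]
        elif q["type"] == "bool":
            guess = rng.choice(["true", "false"])
        else:                                          # free-form: blank, never correct
            guess = ""
        correct += (guess == q["answer"])
    return correct / len(questions)

# Average three runs, as done for the baseline row in Table 2.
# questions = load_geomath()                           # hypothetical loader
# print(sum(random_chance_run(questions, s) for s in range(3)) / 3)
```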
Among the VLMs evaluated, all models outperform random chance. Notably, InternVL2-8B achieves the highest score of 23.7 among small-scale models. Among the models that do not use zero-shot prompting, GPT-4o consistently achieves the highest overall score of 33.5. Although it falls behind InternVL2-40B in the arithmetic category, it retains a leading position in all other dimensions. Surprisingly, GeoChat (Kuckreja et al., 2024b), fine-tuned from LLaVA-v1.5-7B on RS data, exhibits a performance decline (for more details, see §E.4). To gain deeper insight into the models' reasoning capabilities, we categorize the reasoning steps in Figure 1 into three groups: short (2 steps), medium (3-4 steps), and long (5-6 steps). The results indicate that accuracy decreases sharply as the number of reasoning steps increases.

For multiple-choice and true/false questions, models often do not require a full understanding of the domain-specific knowledge being tested. Instead, they can rely on logical reasoning and mathematical intuition to arrive at the correct answer. This can lead to a superficial understanding, where the model produces the correct answer without truly understanding the underlying concepts. To more accurately assess how well models grasp RS expertise, we include 57.8% free-form questions in GEOMATH, as shown in Table 1. These questions require the model first to extract the correct visual cues from the images and then to apply professional remote sensing knowledge to calculate the precise answer, which makes them considerably more challenging. Among the models that do not use zero-shot prompting, GPT-4o achieves a free-form question score of 18.8, demonstrating its superior capability.
Figure 6: Error analysis of GPT-4o results. (a) Errors in answers and rationales; (b) types of rationales for incorrect answers. Notations: "Answer" is "Ans.", "Rationale" is "Rat.", "Partially Correct" is "Partial".

Figure 7: Two examples from GPT-4o. (a) Correct answer and explanation. Question: "There were 5 cars in the drone monitoring area two hours ago; how many cars entered this area at least within two hours?" GPT-4o output: "There are currently 7 cars in the area. If there were 5 cars two hours ago, then at least 2 cars have entered the area within the last two hours. Answer: 2." Correct output: 2. (b) Correct answer but wrong explanation. Question: "What are the coordinates of the projection point of the point 1 meter above the ground plane at the center of the vehicle in the pixel coordinate system?" Choices: (A) [2847, 1986] (B) [1153, 1986] (C) [1153, 264] (D) [2847, 264]. GPT-4o output: "From the image, the center of the vehicle appears to be around the coordinates [2847, 1986]. Since the point is 1 meter above the ground, the projection point will be displaced upwards by approximately 69 pixels. New y-coordinate = 1986 − 69 ≈ 1917. Therefore, the new coordinates are approximately [2847, 1917]. Thus, the best option is: [2847, 1986]." Correct output: (A) [2847, 1986]. In (b), GPT-4o is unable to accurately estimate the projection point; a precise calculation should take the imaging geometry into account.

Impact of Image Resolution. RS images in GEOMATH have a high native resolution of 4000×2250 pixels. In the previous experiments, the original 4K images were fed directly into the models without pre-processing. However, when these images are resized to the default resolution used by a model, such as 336×336 in LLaVA-v1.5, locating and counting vehicles becomes more challenging. To quantitatively examine the impact of image resolution on model performance, we performed a comparison experiment with InternVL2-40B, which supports dynamic resolution (Liu et al., 2024a). Figure 4 illustrates how image resolution and AGL affect accuracy scores. We group AGL into three classes in ascending order: low (20-40 m), medium (60-80 m), and high (100-120 m). Surprisingly, the large increase in resolution has only a modest impact on the score: at lower altitudes, model performance improves with increasing resolution, but the gains from higher resolutions are less pronounced than expected. This could be due to two main factors: first, high-resolution images represent a smaller portion of the training samples of the foundation model; second, the visual encoder's limited number of output tokens requires compressing high-resolution visual data. Figure 5 shows the distribution of the pixel area occupied by vehicles under different AGLs, obtained by multiplying the length by the width of each vehicle's rotated bounding box. When the UAV is at a high AGL, vehicles occupy only a small area in the image, and the visual tokens obtained after downsampling in the visual encoder lose some of the detailed features of these small targets. We present more examples in §E.3 to further illustrate the impact of image resolution.
413 the impact of image resolution.
414
415 Impact of zero-shot prompting. We attempt to enhance the reasoning performance of VLM, by
416 applying zero-shot prompting strategies. The CoT method, simply by appending ”Let’s think step
417 by step”, significantly improves performance on InternVL and GPT-4o. Specifically, it improves
418
InternVL2-40B by 3.4 points. However, in LLaVA-v1.6-34B, the performance decreases by 3.2
points. This reflects its deficiency in multi-step reasoning capability. This highlights the disparity
419
between the two models in terms of multi-step reasoning capability. The PS method, based on
420
the ”plan-then-execute” approach, improves the performance of GPT-4o by 1.1 points, placing it
421 first among all models. This indicates that GPT-4o has the ability to design a plan for specific
422 problems and can eliminate some of the original errors through this planning process. Recently,
423 several zero-shot prompting methods tailored for VLMs have emerged, such as DCoT and CCoT,
424 but experimental results indicate that their performance remains suboptimal. We attribute this to
425 RS images that contain numerous small objects, making it difficult to fully describe or relate them
426 compared to natural images. More examples of zero-shot reasoning techniques are provided in §E.5.
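A minimal sketch of how such zero-shot triggers can be appended to a GEOMATH question (the PS wording below is abbreviated from the cited paper, and all strings are illustrative placeholders rather than the exact prompts we used):

```python
# Illustrative only: append a zero-shot reasoning trigger to the question text.
PROMPTS = {
    "plain": "",
    "cot":   "Let's think step by step.",
    "ps":    "Let's first devise a plan, then carry it out step by step.",  # abbreviated
}

def build_prompt(context: str, question: str, strategy: str = "cot") -> str:
    suffix = PROMPTS[strategy]
    return f"{context}\n{question}\n{suffix}".strip()
```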
3.4 QUALITATIVE ANALYSIS

Success and failure analysis of GPT-4o. As shown in §3.3, GPT-4o is currently the top-ranked model on GEOMATH. To understand its successes and failures, we perform a two-stage manual analysis of the model's output. In the first stage, we assess the correctness of the rationale provided by the model and then evaluate the precision of the results based on the answers extracted with regular expressions. Figure 6 (a) illustrates the eight patterns of GPT-4o outputs judged manually. We find that 54.3% of the outputs are incorrect answers with a wrong rationale, indicating the model's deficiency in reasoning capabilities within the RS domain. Even among the correct answers, there is a 2.3% chance of being accompanied by an incorrect rationale. In the second stage, we summarize four common types of reasoning errors through observation: reliance on common sense, lack of domain-specific knowledge, computational errors, and incorrect visual cues. Figure 6 (b) shows the classification of reasons for erroneous rationales. The primary cause of reasoning errors is the model's lack of domain-specific knowledge in remote sensing, which also explains why GEOMATH presents a greater challenge than existing multimodal mathematical reasoning benchmarks. The second most common cause is the failure to accurately extract key visual clues, which accounts for 21.9%, highlighting the model's deficiency in perception capabilities for RS images. We also perform a qualitative analysis of representative examples generated by GPT-4o. In Figure 7 (a), GPT-4o not only produces the correct answer but also provides accurate reasoning, including the correct way to count the cars. However, in Figure 7 (b), while the model predicts the correct answer, it fails to give the correct reasoning: its logic is sound, but it lacks the imaging geometry knowledge needed to perform precise calculations.

Figure 8: Comparison of cross-view knowledge transfer ability between InternVL2 and GPT-4o.
Comparison of InternVL and GPT-4o. Interestingly, we observe that GPT-4o demonstrates the ability to infer vehicle prices based on visual attributes observed from an aerial view. As shown in Figure 8, GPT-4o correctly assessed the price of each vehicle, while InternVL2, despite arriving at the correct answer by chance, provided an incorrect analysis. Even for humans, determining fine-grained details of a vehicle from aerial images is highly challenging. To our knowledge, no existing RS dataset provides vehicle price information for training, which validates the cross-view knowledge transfer ability of GPT-4o. Further analysis in §E.6 reveals that GPT-4o outperforms other models in answering price-related questions. This suggests that GPT-4o is able to estimate vehicle prices more accurately from an aerial perspective based on its existing knowledge. By revealing the potential gap between the two best-performing VLMs on GEOMATH, we hope to provide some guidance for future research. More comparisons of various VLMs can be found in §E.7.
4 RELATED WORK

Several benchmarks (Lu et al., 2024b; Wang et al., 2024b; Liu et al., 2024b) have been proposed to evaluate the multimodal mathematical reasoning capabilities of VLMs, but most focus on pure mathematical theory and computation, without involving remote sensing expertise. Existing benchmarks, such as MathVista (Lu et al., 2024b), rely primarily on small figures, charts, and a few natural images to provide visual context. This work presents a domain-specific multimodal mathematical reasoning benchmark that leverages high-resolution RS images as visual contexts.

The strong performance of LLMs enables VLGFMs to transparently present their entire reasoning process, offering a new pathway to develop trustworthy RS interpretation systems (Wang et al., 2024c). However, existing VLGFM benchmarks (Hu et al., 2023; Li et al., 2024) provide only final answers, omitting intermediate reasoning steps, which hinders the evaluation of the validity of the reasoning and the reliability of the answers (Chen et al., 2024). To address this gap, we introduce the first VLGFM benchmark that incorporates multistep reasoning processes and features longer reasoning steps than existing RS VQA datasets.
5 CONCLUSION

In this work, we propose GEOMATH, a novel benchmark designed to evaluate the mathematical reasoning capabilities of VLMs in the context of RS imagery. We evaluated 14 prominent models and observed that even advanced models like GPT-4o struggle due to a lack of domain-specific mathematical knowledge. Furthermore, we highlight the detrimental effect of low-resolution input on model performance, emphasizing that fully utilizing the visual clues in high-resolution RS imagery with many small objects is crucial. Moreover, our analysis of the reasons behind GPT-4o's reasoning errors offers valuable insights for future investigations.
REFERENCES

Mohamad M Al Rahhal, Yakoub Bazi, Sara O Alsaleh, Muna Al-Razgan, Mohamed Lamine Mekhalfi, Mansour Al Zuair, and Naif Alajlan. Open-ended remote sensing visual question answering with transformers. International Journal of Remote Sensing, 43(18):6809-6823, 2022.

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.

Qiguang Chen, Libo Qin, Jin Zhang, Zhi Chen, Xiao Xu, and Wanxiang Che. M3CoT: A novel benchmark for multi-domain multi-step multi-modal chain-of-thought. arXiv preprint arXiv:2405.16473, 2024.

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.

Ishita Dasgupta, Andrew K Lampinen, Stephanie CY Chan, Antonia Creswell, Dharshan Kumaran, James L McClelland, and Felix Hill. Language models show human-like content effects on reasoning. arXiv preprint arXiv:2207.07051, 2022.

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2023.

Adrian Höhl, Ivica Obadic, Miguel Ángel Fernández Torres, Hiba Najjar, Dario Oliveira, Zeynep Akata, Andreas Dengel, and Xiao Xiang Zhu. Opening the black-box: A systematic review on explainable ai in remote sensing. arXiv preprint arXiv:2402.13791, 2024.

Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024.

Wenxing Hu and Minglei Tong. Trr360d: A dataset for 360 degree rotated rectangular box table detection. arXiv preprint arXiv:2303.01894, 2023.

Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, and Xiang Li. Rsgpt: A remote sensing vision language model and benchmark. arXiv preprint arXiv:2307.15266, 2023.

Kartik Kuckreja, Muhammad S. Danish, Muzammal Naseer, Abhijit Das, Salman Khan, and Fahad S. Khan. Geochat: Grounded large vision-language model for remote sensing. The IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024a.

Kartik Kuckreja, Muhammad Sohail Danish, Muzammal Naseer, Abhijit Das, Salman Khan, and Fahad Shahbaz Khan. Geochat: Grounded large vision-language model for remote sensing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 27831-27840, 2024b.

Ke Li, Gang Wan, Gong Cheng, Liqiu Meng, and Junwei Han. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS Journal of Photogrammetry and Remote Sensing, 159:296-307, 2020.

Xiang Li, Jian Ding, and Mohamed Elhoseiny. Vrsbench: A versatile vision-language benchmark dataset for remote sensing image understanding. arXiv preprint arXiv:2406.12384, 2024.

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 26296-26306. IEEE, 2024a.

Wentao Liu, Qianjun Pan, Yi Zhang, Zhuo Liu, Ji Wu, Jie Zhou, Aimin Zhou, Qin Chen, Bo Jiang, and Liang He. Cmm-math: A chinese multimodal math dataset to evaluate and enhance the mathematics reasoning of large multimodal models. arXiv preprint arXiv:2409.02834, 2024b.

Sylvain Lobry, Diego Marcos, Jesse Murray, and Devis Tuia. Rsvqa: Visual question answering for remote sensing data. IEEE Transactions on Geoscience and Remote Sensing, 58(12):8555-8566, 2020.

Sylvain Lobry, Begüm Demir, and Devis Tuia. Rsvqa meets bigearthnet: A new, large-scale, visual question answering dataset for remote sensing. In 2021 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pp. 1218-1221. IEEE, 2021.

Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, et al. Deepseek-vl: Towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024a.

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In International Conference on Learning Representations, 2024b.

Chancharik Mitra, Brandon Huang, Trevor Darrell, and Roei Herzig. Compositional chain-of-thought prompting for large multimodal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14420-14431. IEEE, 2024.

T Nathan Mundhenk, Goran Konjevod, Wesam A Sakla, and Kofi Boakye. A large contextual dataset for classification, detection and counting of cars with deep learning. In Computer Vision - ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III, pp. 785-800. Springer, 2016.

Libor Novak. Vehicle detection and pose estimation for autonomous driving. Master's thesis, 2017.

OpenAI. Gpt-4 technical report, 2023.

Maryam Rahnemoonfar, Tashnim Chowdhury, Argho Sarkar, Debvrat Varshney, Masoud Yari, and Robin Roberson Murphy. Floodnet: A high resolution aerial imagery dataset for post flood scene understanding. IEEE Access, 9:89644-89654, 2021.

Junjue Wang, Zhuo Zheng, Zihang Chen, Ailong Ma, and Yanfei Zhong. Earthvqa: Towards queryable earth via relational reasoning-based remote sensing visual question answering. In Thirtieth AAAI Conference on Artificial Intelligence, 2024a.

Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. arXiv preprint arXiv:2402.14804, 2024b.

Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. arXiv preprint arXiv:2305.04091, 2023.

Sheng Wang, Wei Han, Xiaohui Huang, Xiaohan Zhang, Lizhe Wang, and Jun Li. Trustworthy remote sensing interpretation: Concepts, technologies, and applications. ISPRS Journal of Photogrammetry and Remote Sensing, 209:150-172, 2024c.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824-24837, 2022.

Yifan Wu, Pengchuan Zhang, Wenhan Xiong, Barlas Oguz, James C Gee, and Yixin Nie. The role of chain-of-thought in complex vision-language reasoning task, 2023.

Gui-Song Xia, Xiang Bai, Jian Ding, Zhen Zhu, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, and Liangpei Zhang. Dota: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3974-3983, 2018.

Linrui Xu, Ling Zhao, Wang Guo, Qiujun Li, Kewang Long, Kaiqi Zou, Yuhan Wang, and Haifeng Li. Rs-gpt4v: A unified multimodal instruction-following dataset for remote sensing image understanding. arXiv preprint arXiv:2406.12479, 2024.

Xue Yang, Gefan Zhang, Xiaojiang Yang, Yue Zhou, Wentao Wang, Jin Tang, Tao He, and Junchi Yan. Detecting rotated objects as gaussian distributions and its 3-d generalization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4335-4354, 2022.

Yi Yang and Shawn Newsam. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 270-279, 2010.

Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549, 2023.

Zhenghang Yuan, Lichao Mou, Zhitong Xiong, and Xiao Xiang Zhu. Change detection meets visual question answering. IEEE Transactions on Geoscience and Remote Sensing, 60:1-13, 2022.

Meimei Zhang, Fang Chen, and Bin Li. Multistep question-driven visual question answering for remote sensing. IEEE Transactions on Geoscience and Remote Sensing, 61:1-12, 2023a.

Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Shuangrui Ding, Songyang Zhang, Haodong Duan, Hang Yan, et al. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112, 2023b.

Xiangtao Zheng, Binqiang Wang, Xingqian Du, and Xiaoqiang Lu. Mutual attention inception network for remote sensing visual question answering. IEEE Transactions on Geoscience and Remote Sensing, 60:1-14, 2021.

Yue Zhou, Litong Feng, Yiping Ke, Xue Jiang, Junchi Yan, Xue Yang, and Wayne Zhang. Towards vision-language geo-foundation model: A survey. arXiv preprint arXiv:2406.09385, 2024.

Pengfei Zhu, Longyin Wen, Dawei Du, Xiao Bian, Heng Fan, Qinghua Hu, and Haibin Ling. Detection and tracking meet drones challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7380-7399, 2021.

CONTENTS

A Problem Design
  A.1 Mathematical Reasoning Definition
  A.2 Mathematical Reasoning Examples
  A.3 Topic Summary

B Data Collection Details
  B.1 UAV Data Collection Information
  B.2 Details of Metadata
  B.3 Details of Coordinate System Transformation

C More Dataset Analysis

D More Details on the Setup
  D.1 Prompts for Response Generation
  D.2 Model Hyperparameters

E More Experimental Results
  E.1 Analysis of Pitch Angle
  E.2 Analysis of Response Length
  E.3 Impact of Image Resolution
  E.4 Analysis of GeoChat
  E.5 More Examples of Zero-Shot Prompting Techniques
  E.6 Cross-view Knowledge Transfer Ability of GPT-4o
  E.7 Comparisons of Different Models

A PROBLEM DESIGN

A.1 MATHEMATICAL REASONING DEFINITION

The six subjects of mathematical reasoning in remote sensing are defined in Table 3.

Geometry (28.6%): Emphasizes spatial understanding, analysis of 2D and 3D coordinate systems, and reasoning about their relationships; measuring distance, size, area, and angle based on imaging principles and perspective transformation.
Logic (20.3%): Focuses on critical thinking, induction, and deductive reasoning from provided information. The key components include premises, conclusions, and the use of abstract reasoning.
Statistics (17.4%): Focuses on data interpretation and analysis, such as measuring the maximum, minimum, median, mean, and mode.
Arithmetic (14.6%): Covers fundamental operations such as addition, subtraction, multiplication, and division.
Counting (9.5%): Involves determining the number of specific objects based on single or multiple constraints.
Algebra (9.5%): Encompasses understanding variables and equations, such as solving univariate and multivariate equations.

Table 3: Definitions and proportions of the six mathematical subjects in GEOMATH.

A.2 MATHEMATICAL REASONING EXAMPLES

GEO
Context: The sensor parameters that may be used are as follows: Focal Length: 12 millimeters. Pixel Size: 0.004325 millimeters. Image Width: 4000 pixels. Image Height: 2250 pixels.
Question: How many meters are the two vehicles in the red and blue boxes apart?
Rationale: Step 1, locate the center points of the two vehicles in the red and blue boxes: [3554, 1051] and [2583, 1974]. Step 2, convert them from the pixel to the image coordinate system: ... Step 3, convert them from the image to the camera coordinate system: ... Step 4, calculate the distance in the camera coordinate system: √((24.4 − 6.8)² + (−1.2 − 9.9)² + (43.6 − 32.5)²) ≈ 23.
Answer: 23

LOG
Question: There were 27 cars in this area an hour ago; how many cars have entered this area at least within an hour?
Solution: Step 1, count all current vehicles: 29. Step 2, the number of cars entering the area is at least equal to the increase in the number of cars in this area: 29 − 27 = 2.
Answer: 2

STA
Question: What color of vehicle is most common in the image?
Rationale: Step 1, identify the color of all vehicles: ['white', 'brown', ...]. Step 2, count vehicles of each color: {'white': 7, 'brown': 3, ...}. Step 3, sort to get the most common color: white.
Answer: white

ARI
Context: The vehicle price dictionary that may be used is as follows: {'nio ec6': 385000, 'byd dolphin': ...}
Question: What is the price difference between the car in the red box and the car in the blue box? (Unit: RMB)
Rationale: Step 1, identify the models of the two cars: byd song plus and aito m5. Step 2, query the prices of the two vehicles: 155000 and 265000. Step 3, calculate the price difference: 265000 − 155000 = 110000.
Answer: 110000

CNT
Question: How many SUV vehicles are there in the image?
Rationale: Step 1, identify the type of all vehicles: ['suv', 'suv', ...]. Step 2, count all SUV vehicles: 17.
Answer: 17

ALG
Context: The sensor parameters that may be used are as follows: Focal Length: 12 millimeters. Pixel Size: 0.004325 millimeters. Image Width: 4000 pixels. Image Height: 2250 pixels.
Question: The equation of the ground plane in the camera coordinate system is: -cos(90)*y - sin(90)*z + 40 = 0. What are the coordinates of the center point of the vehicle in the red box in the camera coordinate system? (Unit: meter)
Rationale: Step 1, locate the center point of the vehicle: [420, 534]. Step 2, convert the center point of the vehicle from the pixel coordinate system to the image coordinate system: [-6, -2]. Step 3, convert the center point of the vehicle from the image coordinate system to the camera coordinate system: [-22, -8, 40].
Answer: [-22, -8, 40]

Table 4: Examples of the six mathematical reasoning subjects in GEOMATH.

A.3 TOPIC SUMMARY

The topics are summarized in Table 5.
Topic                                    Subject   Visual Skill                    Application
Perspective Geometry                     GEO       Location                        Surveying
Metric Geometry                          GEO       Location                        Surveying & Military
Spatial Relation                         GEO       FG Recognition, Location        Surveying & Military
Comparison                               LOG       FG Recognition, Visual Prompt   Entertainment
Deduction                                LOG       FG Recognition, Visual Prompt   Surveillance
Induction                                LOG       FG Recognition                  Surveillance
Maximum                                  STA       FG Recognition, Location        Market Research
Minimum                                  STA       FG Recognition, Location        Market Research
Mean                                     STA       FG Recognition, Location        Market Research
Median                                   STA       FG Recognition, Location        Market Research
Mode                                     STA       FG Recognition, Location        Market Research
Addition                                 ARI       FG Recognition                  Market Research
Subtraction                              ARI       FG Recognition, Visual Prompt   Market Research
Multiplication                           ARI       FG Recognition, Visual Prompt   Market Research
Division                                 ARI       FG Recognition, Visual Prompt   Market Research
Counting based on a single property      CNT       FG Recognition                  Market Research
Counting based on multiple properties    CNT       FG Recognition                  Market Research
Counting based on comparison             CNT       FG Recognition                  Market Research
Univariate Equation                      ALG       Location, Visual Prompt         Surveying
Multivariate Equations                   ALG       Location, Visual Prompt         Surveying

Table 5: Summary of the 20 different topics in GEOMATH. The table provides details on their subject and visual skill types. Location represents the ability to provide the pixel coordinates of key points. FG Recognition, short for fine-grained recognition, refers to the ability to identify critical visual cues in RS images, including the specific properties and models of vehicles. Visual Prompt indicates the capability to determine the referenced target based on various colored boxes added to the image. Surveying suggests that remote sensing professionals can leverage this capability to enhance the efficiency of geological surveys and obtain interpretable and reliable results. Military indicates that it can be used in unmanned warfare to improve the intelligence level of drones. Entertainment indicates that users can utilize this capability to satisfy their curiosity. Surveillance indicates that this capability can be used to monitor activities within a specific area. Market Research indicates that automotive companies can leverage this capability to conduct fine-grained analysis of customer preferences within a specific region.
B DATA COLLECTION DETAILS

B.1 UAV DATA COLLECTION INFORMATION
                                               AGL (m)                              Pitch Angle (°)
Scenario  Date  Time       Weather   20  30  40  60  70  80  100  110  120        45  60  90
A         0910  Noon       Sunny     17   7  10   6   9   5    4    8    5        24  27  20
B         0911  Noon       Sunny      9   7   4   2   4   3    2    2    2        13  13   9
C         0912  Morning    Sunny     12  14   8   6   8   4    3    7    3        29  21  15
D         0912  Afternoon  Sunny     13  10   7   5   7   4    2    5    1        22  20  12
E         0913  Morning    Cloudy    11   7   4   4   4   3    4    4    2        15  19   9
F         0913  Morning    Cloudy     7  15   6   9   6   4   10    1    2        22  26  12
G         0914  Noon       Cloudy    13   6   3   3   4   3    5    6    3        16  21   9
H         0914  Noon       Cloudy     8   5   6   4   3   3    4    9    2        17  16  11
I         0915  Noon       Rainy     13  15  11   5   8   3    3    7    2        30  21  16
J         0915  Afternoon  Cloudy    18  14  10   7   4   7    6    3    3        21  31  20
K         0916  Noon       Cloudy    11  11   7  10   4   4    4   13    4        28  25  15

Table 6: The data collected by the drone covers multiple weather conditions, AGLs, and pitch angles.
B.2 DETAILS OF METADATA
Type                 Details

Camera parameters    Focal length, ISO, pixel size, shutter speed, aperture, sensor size, image resolution, pitch angle, AGL, latitude, longitude, timestamp.
Vehicle attributes   Location in the pixel coordinate system, rotated bounding box, front direction, brand, model, color, type, powertrain, length, width, height, sunroof, roof rack, max price, min price, number of doors / seats.

Table 7: Details of metadata, where most vehicle attributes are obtained from the ground video.
B.3 DETAILS OF COORDINATE SYSTEM TRANSFORMATION

The complete derivation processes for the two coordinate system transformations are provided here.

The transformation between the pixel coordinate system and the image coordinate system can be represented by an affine matrix, as follows:

\[
\begin{bmatrix} x_P \\ y_P \\ 1 \end{bmatrix}
=
\begin{bmatrix} \tfrac{1}{p} & 0 & \tfrac{w}{2} \\ 0 & \tfrac{1}{p} & \tfrac{h}{2} \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} x_I \\ y_I \\ 1 \end{bmatrix}
\tag{1}
\]
where p represents the pixel size of the sensor, and w/2 and h/2 denote the origin offsets, which account for the origin of the pixel coordinate system being located at the image's top-left corner. Given the pixel coordinates of a certain point, its corresponding image coordinates can be calculated as follows:

\[
\begin{cases}
x_I = (x_P - w/2) \cdot p \\
y_I = (y_P - h/2) \cdot p
\end{cases}
\tag{2}
\]
The transformation from the camera coordinate system to the image coordinate system is a conversion from three-dimensional to two-dimensional coordinates. Assuming the focal length of the camera is f, then we have

\[
Z_C
\begin{bmatrix} x_I \\ y_I \\ 1 \end{bmatrix}
=
\begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} X_C \\ Y_C \\ Z_C \\ 1 \end{bmatrix}
\tag{3}
\]

where Z_C denotes the depth of the point, which can be obtained by a depth camera (binocular or structured light). Because the drone camera we use cannot provide depth information, we need another way to recover it.
When the ground satisfies the ground plane assumption, given the AGL H of the drone and the pitch angle θ of the camera, the ground plane equation in the camera coordinate system is as follows:

\[
-\cos\theta \cdot Y_C - \sin\theta \cdot Z_C + H = 0
\tag{4}
\]
The equation of the line connecting the camera origin to the projection point on the image plane, expressed in the camera coordinate system, is given by:

\[
\begin{cases}
X_C = x_I \cdot t \\
Y_C = y_I \cdot t \\
Z_C = f \cdot t
\end{cases}
\tag{5}
\]
Substituting the line equation into the ground plane equation yields:

\[
t = \frac{H}{y_I \cos\theta + f \sin\theta}
\tag{6}
\]
Substituting t back into the line equation yields the coordinates of the ground point in the camera coordinate system:

\[
\left( \frac{x_I H}{y_I \cos\theta + f \sin\theta}, \;
       \frac{y_I H}{y_I \cos\theta + f \sin\theta}, \;
       \frac{f H}{y_I \cos\theta + f \sin\theta} \right)
\tag{7}
\]
To preserve the spatial mapping between camera coordinates and pixel coordinates, we refrained from cropping the 4K images to increase the dataset size, as is commonly done in most remote sensing datasets.
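
To make the derivation concrete, the following is a minimal Python sketch of the pixel-to-camera projection defined by Eqs. (2), (6), and (7), applied to the vehicle-distance computation illustrated in the GEO example of Table 4. This is not the authors' released code: the function name is illustrative, the sensor parameters are taken from Table 4, and the AGL and pitch angle used in the example call are assumptions.

import math

def pixel_to_camera(x_p, y_p, f_mm, pixel_mm, width, height, agl_m, pitch_deg):
    """Project a pixel onto the ground plane, following Eqs. (2), (6), and (7)."""
    # Eq. (2): pixel -> image coordinates (in millimeters).
    x_i = (x_p - width / 2) * pixel_mm
    y_i = (y_p - height / 2) * pixel_mm
    # Eq. (6): scale factor t from the intersection with the ground plane.
    theta = math.radians(pitch_deg)
    t = agl_m / (y_i * math.cos(theta) + f_mm * math.sin(theta))
    # Eq. (7): 3D point in the camera coordinate system (in meters).
    return (x_i * t, y_i * t, f_mm * t)

# Sensor parameters from Table 4; the AGL (40 m) and pitch angle (60 deg) are illustrative.
p1 = pixel_to_camera(3554, 1051, f_mm=12, pixel_mm=0.004325,
                     width=4000, height=2250, agl_m=40, pitch_deg=60)
p2 = pixel_to_camera(2583, 1974, f_mm=12, pixel_mm=0.004325,
                     width=4000, height=2250, agl_m=40, pitch_deg=60)
print(round(math.dist(p1, p2), 1))  # approximate ground distance between the two vehicles, in meters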
C MORE DATASET ANALYSIS
Dataset                                  #Images   Image Size       #VQAs       CNT  GEO  LOG  ARI  ALG  STA  Rationale
RSVQA-LR (Lobry et al., 2020)                772   512              77,232       ✓    ✓    ✗    ✗    ✗    ✗      ✗
RSVQA-HR (Lobry et al., 2020)            100,659   512           1,066,316       ✓    ✓    ✗    ✗    ✗    ✗      ✗
RSVQAxBEN (Lobry et al., 2021)           590,325   20 to 120    14,758,150       ✗    ✗    ✗    ✗    ✗    ✗      ✗
FloodNet (Rahnemoonfar et al., 2021)       4,056   4,000            11,000       ✓    ✗    ✗    ✗    ✗    ✗      ✗
RSIVQA (Zheng et al., 2021)               37,264   256 to 4,000    111,134       ✓    ✓    ✗    ✗    ✗    ✗      ✗
CDVQA (Yuan et al., 2022)                  2,968   512             122,000       ✗    ✗    ✗    ✓    ✗    ✗      ✗
VQA-TextRS (Al Rahhal et al., 2022)        2,144   256 to 600        6,245       ✗    ✓    ✗    ✗    ✗    ✗      ✗
CRSVQA (Zhang et al., 2023a)               4,639   600               4,644       ✓    ✓    ✗    ✗    ✗    ✗      ✗
RSIEval (Hu et al., 2023)                    100   512                 936       ✓    ✓    ✗    ✗    ✗    ✗      ✗
EarthVQA (Wang et al., 2024a)              6,000   1024            208,593       ✓    ✓    ✗    ✗    ✗    ✗      ✗
VRSBench (Li et al., 2024)                29,614   512             123,221       ✓    ✓    ✗    ✗    ✗    ✗      ✗
GEOMATH                                      360   4,000             3,773       ✓    ✓    ✓    ✓    ✓    ✓      ✓

Table 8: Comparison between existing remote sensing vision-language datasets and our GEOMATH dataset (✓ = covered, ✗ = not covered). GEOMATH provides a more comprehensive coverage of mathematical problems. Additionally, it is the first RS VQA dataset to provide the rationale, i.e., the reasoning process for each question.
Previous datasets offer counting-type VQAs based on a single condition, with relatively few object attributes, primarily focused on color. GEOMATH not only enriches the attributes of the objects, but also introduces object counting under multiple constraints, significantly increasing the difficulty. Moreover, GEOMATH is the first to extend spatial relationships from the plane to three-dimensional space, substantially increasing task complexity, whereas previous datasets provided geometric problems restricted to planar spatial relationships.
RSVQA
Q: How many buildings on the left of a road are there in the image?
A: 38
Rationale: Step 1, locate the road and all buildings. Step 2, count buildings with x-coordinates smaller than the road.
Max reasoning steps: 2

FloodNet
Q: How many buildings are flooded?
A: 3
Rationale: Step 1, identify the status of all buildings. Step 2, count the buildings that are flooded.
Max reasoning steps: 2

RSIVQA
Q: How many large vehicles are there in this picture?
A: 258
Rationale: Step 1, identify the size of all vehicles. Step 2, count the vehicles with large size.
Max reasoning steps: 2

CRSVQA
Q: What color is the car between the purple and the red car?
A: Gray
Rationale: Step 1, locate the purple and the red car. Step 2, locate the car between them. Step 3, identify the color of the car.
Max reasoning steps: 3

VQA-TextRS
Q: What is the size of the tree in the corner?
A: Large tree
Rationale: Step 1, locate the tree in the corner. Step 2, identify the size of the tree.
Max reasoning steps: 2

RSIEval
Q: Where is the parking lot with only three cars located in the image?
A: Upper left corner
Rationale: Step 1, locate all parking lots. Step 2, count cars in each parking lot. Step 3, determine if the number of cars in each parking lot is equal to three.
Max reasoning steps: 3

EarthVQA
Q: What are the road types around the residential area?
A: Two-way single lanes, and wide lanes.
Rationale: Step 1, locate the roads around the residential area. Step 2, identify the attributes of the roads.
Max reasoning steps: 2

GeoChat-Bench
Q: What is the color of the ship anchored at the harbor?
A: Grey
Rationale: Step 1, locate the ship anchored at the harbor. Step 2, identify the color of the ship.
Max reasoning steps: 2

RS-GPT4V
Q: What is the relative position of the chimney emitting smoke to the other chimney?
A: On the right
Rationale: Step 1, locate the two chimneys. Step 2, compare their coordinates.
Max reasoning steps: 2

VRSBench
Q: What is the object located closest to the top?
A: Small vehicle
Rationale: Step 1, locate all objects. Step 2, find the object with the smallest y-axis coordinate value. Step 3, identify the attributes of the target.
Max reasoning steps: 3

GeoMath
Q: Assuming the flight speed of the drone is 10 meters per second, how long does it take for the drone to reach the closest vehicle? (Unit: second)
A: 12.9
Rationale: Step 1, locate all vehicles. Step 2, convert the center point of each vehicle from the pixel to the image coordinate system. Step 3, convert the center point of each vehicle from the image to the camera coordinate system. Step 4, calculate the distance of each vehicle to the drone. Step 5, sort these distances and find the closest vehicle. Step 6, calculate the flight time of the drone.
Max reasoning steps: 6

Figure 9: Examples of mathematical problems requiring the maximum reasoning steps across various RS VQA benchmarks. Except for GeoMath, these benchmarks do not explicitly provide reasoning steps; the examples shown are the results of manual analysis. GEOMATH currently requires the longest reasoning chains among RS VQA benchmarks.
D MORE DETAILS ON THE SETUP

D.1 PROMPTS FOR RESPONSE GENERATION

The prompt used to instruct the foundation models to generate responses is illustrated in Table 9.
Question type     Stage        Task instruction

Multiple-choice   Generation   Observe this image captured by a drone and answer the question by choosing the best option. Question: {question} Choices: {choices}
Multiple-choice   Extraction   Based on the question ({question}) and reasoning provided in the output, conclude the final answer in the format 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD.
True/False        Generation   Observe this image captured by a drone and answer the question. Question: {question}
True/False        Extraction   Based on the question ({question}) and reasoning provided in the output, conclude the final answer in the format 'Answer: Yes' or 'Answer: No' (without quotes).
Free-form         Generation   Observe this image captured by a drone and answer the question. Question: {question}
Free-form         Extraction   Based on the question ({question}) and reasoning provided in the output, conclude the final answer in the format 'Answer: XX' (without quotes).

Table 9: The task instructions for different question types.
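
As an illustration of how these templates can be applied, the following is a minimal Python sketch of the two-stage pipeline (generation followed by answer extraction). Only the instruction strings come from Table 9; the function and variable names are illustrative, and query_vlm stands in for whichever model API is being evaluated, so this is a sketch rather than the benchmark's actual implementation.

GENERATION_TEMPLATES = {
    "multiple-choice": ("Observe this image captured by a drone and answer the question "
                        "by choosing the best option. Question: {question} Choices: {choices}"),
    "true/false": "Observe this image captured by a drone and answer the question. Question: {question}",
    "free-form": "Observe this image captured by a drone and answer the question. Question: {question}",
}

EXTRACTION_TEMPLATES = {
    "multiple-choice": ("Based on the question ({question}) and reasoning provided in the output, "
                        "conclude the final answer in the format 'Answer: $LETTER' (without quotes) "
                        "where LETTER is one of ABCD."),
    "true/false": ("Based on the question ({question}) and reasoning provided in the output, "
                   "conclude the final answer in the format 'Answer: Yes' or 'Answer: No' (without quotes)."),
    "free-form": ("Based on the question ({question}) and reasoning provided in the output, "
                  "conclude the final answer in the format 'Answer: XX' (without quotes)."),
}

def answer_question(image, question, qtype, choices="", query_vlm=None):
    """Two-stage evaluation: generate reasoning first, then extract a parsable answer."""
    gen_prompt = GENERATION_TEMPLATES[qtype].format(question=question, choices=choices)
    reasoning = query_vlm(image=image, prompt=gen_prompt)                  # stage 1: generation
    ext_prompt = EXTRACTION_TEMPLATES[qtype].format(question=question)
    return query_vlm(image=image, prompt=ext_prompt + "\n" + reasoning)    # stage 2: extraction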
D.2 MODEL HYPERPARAMETERS

The hyperparameters for the experiments in §3.2 are set to their default values unless otherwise specified. Table 10 details the specific generation parameters for the various VLMs we evaluated.

Model     Generation Setup

GPT-4o    Official API, model = gpt-4o, temperature = 0, max tokens = 1000; evaluation dates range from Sep 12 to 18, 2024.
GeoChat   do_sample = False, temperature = 0.0, max_new_tokens = 1000
Others    Framework: https://github.com/InternLM/lmdeploy; session_len = 8192, temperature = 0.0, max_tokens = 1000

Table 10: Generation parameters for the various VLMs.
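
For concreteness, a single GPT-4o query with the settings listed above might look like the sketch below, which uses the OpenAI Python SDK. The base64 helper and the image path are illustrative assumptions; the exact client-side code used for the benchmark is not shown in the paper.

import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_gpt4o(image_path: str, prompt: str) -> str:
    # Encode the drone image so it can be passed inline with the prompt.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,        # deterministic decoding, as in Table 10
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content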
E MORE EXPERIMENTAL RESULTS

E.1 ANALYSIS OF PITCH ANGLE
Among the 14 models, we find that half of them, represented by GPT-4o, achieve their highest scores at a camera pitch angle of 90 degrees, while the other half, e.g., InternVL2-40B, perform better at 60 degrees. All models show their poorest performance at a pitch angle of 45 degrees, which can be attributed to the lack of low-angle samples during training. Therefore, enhancing the generalizability of RS VLMs under different viewing angles is a potential research direction.
Figure 10: Impact of camera pitch angle on model performance.
E.2 ANALYSIS OF RESPONSE LENGTH

We analyze the accuracy scores and average response lengths of the 14 models and find a clear positive correlation between them: the longer a model's responses, the more likely it is to achieve higher accuracy on GEOMATH.
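
A minimal sketch of how such a correlation can be quantified is shown below; the response-length and accuracy arrays are placeholders for illustration only, not the paper's reported numbers.

import numpy as np
from scipy.stats import pearsonr

# Placeholder values, one entry per evaluated VLM (illustrative only).
avg_response_length = np.array([120, 180, 210, 260, 300, 340, 410, 450,
                                480, 520, 560, 600, 640, 700])
accuracy = np.array([18.0, 20.5, 21.0, 24.0, 25.5, 27.0, 30.0, 31.5,
                     33.0, 34.0, 36.5, 38.0, 40.0, 43.5])

r, p_value = pearsonr(avg_response_length, accuracy)
print(f"Pearson r = {r:.2f} (p = {p_value:.3g})")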
Figure 11: The relationship between the model's response length and accuracy score.

E.3 IMPACT OF IMAGE RESOLUTION
Figure 12 illustrates the detrimental impact of lower image resolution on InternVL2-40B. The AGL of this photo is 100 meters and the pitch angle is 60 degrees. As the image resolution decreases from 4000×2250 to 336×189, the number of pixels representing the car in the red box drops from 6926 to just 44 (calculated from the length and width of its rotated bounding box), causing the model to lose its ability to accurately recognize vehicle types. Earlier VLMs typically used fixed input resolutions (Liu et al., 2023), often significantly lower than 4K. To fully exploit the visual clues available in the GEOMATH images, supporting 4K input resolution would be a promising direction for future VLM research.
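
The reported drop in pixel count follows directly from downscaling the rotated bounding box, as the short sketch below shows. The box side lengths used here are illustrative values chosen so their product roughly matches the reported 6926 pixels; the exact figure of 44 pixels at low resolution depends on the true box dimensions and on rounding, which are not given in the text.

def scaled_pixel_count(box_length_px, box_width_px, src_w, src_h, dst_w, dst_h):
    """Approximate pixel footprint of a rotated bounding box after image resizing."""
    sx, sy = dst_w / src_w, dst_h / src_h
    return (box_length_px * sx) * (box_width_px * sy)

# Illustrative box dimensions whose product is close to 6926 pixels at 4000x2250.
length, width = 110.0, 63.0
full = length * width                                           # roughly 6930 pixels at 4000x2250
low = scaled_pixel_count(length, width, 4000, 2250, 336, 189)   # roughly 49 pixels at 336x189
print(round(full), round(low))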
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156 Question: How many mid-size vehicles are there in the image?
1157
Output when image resolution equals to 4000×2250:
1158 To determine the number of mid-size vehicles in the image, let’s carefully examine the parking area. 1.
1159 Identify the Vehicles: Look for vehicles that are neither too small (such as compact cars) nor too large
1160 (such as SUVs or trucks). Mid-size vehicles typically include sedans and some small SUVs. 2. Count the
1161 Vehicles: The parking area has a total of 18 vehicles. By visually inspecting each vehicle, we can estimate
its size. 3. Estimate Mid-Size Vehicles: Upon close inspection, it appears that 10 vehicles are mid-size.
1162 Therefore, based on visual inspection, there are 10 mid-size vehicles in the image.
1163 Answer: 10
1164
1165 Output when image resolution equals to 1000×562:
To determine the number of mid-size vehicles in the image, let’s carefully examine the parking area. 1.
1166
Identify the Vehicles: Look for vehicles that are neither too small (such as compact cars) nor too large
1167 (such as SUVs or trucks). Mid-size vehicles typically include sedans and some small SUVs. 2. Count the
1168 Vehicles: Starting from the left, there are several rows of parked vehicles. In the first row, there are two
1169 vehicles that appear to be mid-size. In the second row, there are three vehicles that appear to be mid-size. In
1170 the third row, there are two vehicles that appear to be mid-size. In the fourth row, there are three vehicles
that appear to be mid-size. 3. Summarize the Count: First row: 2 mid-size vehicles. Second row: 3
1171
mid-size vehicles. Third row: 2 mid-size vehicles. Fourth row: 3 mid-size vehicles 4. Total Count: 2 + 3 +
1172 2 + 3 = 10. Therefore, there are 10 mid-size vehicles in the image.
1173 Answer: 10
1174
1175 Output when image resolution equals to 336×189:
To determine the number of mid-size vehicles in the image, let’s carefully examine the scene. The image
1176
shows a parking lot adjacent to a building and a road. The vehicles are parked in a row, and we need to
1177 identify the mid-size vehicles among them. Mid-size vehicles typically have a length between 180 inches
1178 (4.57 meters) and 200 inches (5.08 meters). They are larger than compact cars but smaller than full-size
1179 vehicles. Upon close inspection, it appears that all the vehicles in the parking lot are mid-size vehicles.
1180
They are consistent in size and shape, fitting the typical characteristics of mid-size cars. Therefore, based
on common knowledge and information available in the image, I estimate that there are approximately 15
1181 mid-size vehicles.
1182 Answer: 15
1183
1184
Figure 12: Example illustrating the impact of image resolution on the InternVL2-40B. As the resolu-
1185
tion decreases, reasoning vehicle types from image information becomes increasingly challenging.
1186
1187

E.4 ANALYSIS OF GEOCHAT
We found that fine-tuning models on task-specific RS image-text instruction sets may impair their capabilities on some multimodal reasoning tasks. This indicates that existing VLGFMs are fundamentally still engaged in data fitting, which does not lead to the emergence of true RS intelligence. This raises an important question: what kind of image-text instruction set can lead us toward a "GPT-4V moment" in remote sensing?
Figure 13: Performance difference analysis between LLaVA-v1.5-7B and GeoChat-7B. The latter can be roughly regarded as the former finetuned on RS data.
E.5 MORE EXAMPLES OF ZERO-SHOT PROMPTING TECHNIQUES
Question: Assuming the flight speed of the drone is 16 meters per second, how long does it take for the drone to reach the closest vehicle? (Unit: second)
Ground truth: 9.4
Solution:
Step 1, locate the center point of each vehicle. Step 2, convert the center point of each vehicle from the pixel coordinate system to the image coordinate system. Step 3, convert the center point of each vehicle from the image coordinate system to the camera coordinate system. Step 4, calculate the distance of each vehicle in the camera coordinate system. Step 5, find the minimum distance to the drone: 150.9. Step 6, calculate the flight time of the drone: 150.9/16 = 9.4

Model Response:
To determine the time it takes for the drone to reach the farthest vehicle, we need to calculate the distance it needs to travel and then divide that by the drone's speed.
1. Determine the distance to the farthest vehicle:
- The farthest vehicle is located at the edge of the parking lot, which is horizontally aligned with the drone's position.
- The distance can be approximated by considering the width of the parking lot and the distance from the drone's position to the farthest vehicle.
2. Estimate the width of the parking lot:
- The parking lot appears to be approximately 100 meters wide.
3. Calculate the distance to the farthest vehicle:
- The drone is positioned at an altitude of 120 meters.
- The farthest vehicle is at the edge of the parking lot, which is approximately 100 meters horizontally from the drone's position.
- Therefore, the total horizontal distance to the farthest vehicle is 120 meters (altitude) + 100 meters (horizontal distance) = 220 meters.
4. Calculate the time taken:
- The drone's speed is 16 meters per second.
- Time = Distance / Speed
- Time = 220 meters / 16 meters per second
- Time = 13.75 seconds
So, it takes approximately 13.75 seconds for the drone to reach the farthest vehicle.

Figure 14: The response of the InternVL2-40B model without CoT.
Question: Assuming the flight speed of the drone is 16 meters per second, how long does it take for the drone to reach the closest vehicle? (Unit: second)
Ground truth: 9.4
Solution:
Step 1, locate the center point of each vehicle. Step 2, convert the center point of each vehicle from the pixel coordinate system to the image coordinate system. Step 3, convert the center point of each vehicle from the image coordinate system to the camera coordinate system. Step 4, calculate the distance of each vehicle in the camera coordinate system. Step 5, find the minimum distance to the drone: 150.9. Step 6, calculate the flight time of the drone: 150.9/16 = 9.4

Model Response:
To determine the time it takes for the drone to reach the farthest vehicle, we need to calculate the distance it needs to travel and then divide that by the drone's speed.
1. Determine the distance to the farthest vehicle:
- The farthest vehicle is located at the edge of the parking lot, which is horizontally aligned with the drone's position.
- The distance can be approximated by considering the width of the parking lot and the distance from the drone's position to the farthest vehicle.
2. Estimate the width of the parking lot:
- The parking lot appears to be approximately 100 meters wide.
3. Calculate the distance to the farthest vehicle:
- The drone is positioned at an altitude of 120 meters.
- The farthest vehicle is at the edge of the parking lot, which is approximately 100 meters horizontally from the drone's position.
- Therefore, the total distance to the farthest vehicle is the sum of the horizontal distance (100 meters) and the vertical distance (120 meters).
4. Calculate the total distance:
Total Distance = √((100 m)² + (120 m)²) = √(10000 + 14400) = √24400 ≈ 156.2 meters
5. Calculate the time taken:
- The drone's speed is 16 meters per second.
- Time taken = Distance / Speed
- Time = 156.2 meters / 16 meters/second ≈ 9.76 seconds
Therefore, it takes approximately 9.76 seconds for the drone to reach the farthest vehicle.

Figure 15: The response of the InternVL2-40B model based on CoT. The answer is very close to the ground truth.
Question: Assuming the flight speed of the drone is 16 meters per second, how long does it take for the drone to reach the closest vehicle? (Unit: second)
Ground truth: 9.4
Solution:
Step 1, locate the center point of each vehicle. Step 2, convert the center point of each vehicle from the pixel coordinate system to the image coordinate system. Step 3, convert the center point of each vehicle from the image coordinate system to the camera coordinate system. Step 4, calculate the distance of each vehicle in the camera coordinate system. Step 5, find the minimum distance to the drone: 150.9. Step 6, calculate the flight time of the drone: 150.9/16 = 9.4

Model Response:
To determine how long it takes for the drone to reach the farthest vehicle, we need to calculate the distance between the drone and the farthest vehicle and then divide that distance by the drone's speed.
### Step-by-Step Solution:
1. Determine the distance from the drone to the farthest vehicle:
- The image is captured from a height of 120 meters.
- The farthest vehicle is located at the edge of the parking lot, which is approximately 1/3 of the image width from the center.
- The image width is 4000 pixels.
- The pixel size is 0.004325 millimeters.
First, convert the image width from pixels to meters:
Image Width (meters) = 4000 pixels × 0.004325 mm/pixel = 17.3 meters
Next, calculate the distance from the center of the image to the farthest vehicle:
Distance to farthest vehicle = 17.3 meters / 2 = 8.65 meters
2. Calculate the time taken by the drone to reach the farthest vehicle:
- The drone's speed is 16 meters per second.
Now, divide the distance by the speed:
Time = Distance / Speed = 8.65 meters / 16 meters/second = 0.540625 seconds
### Conclusion:
The drone takes approximately 0.540625 seconds to reach the farthest vehicle.
This calculation assumes a straight-line distance from the drone to the farthest vehicle and does not account for any potential obstacles or changes in flight path.

Figure 16: The response of the InternVL2-40B model based on PS.
Question: Assuming the flight speed of the drone is 16 meters per second, how long does it take for the drone to reach the closest vehicle? (Unit: second)
Ground truth: 9.4
Solution:
Step 1, locate the center point of each vehicle. Step 2, convert the center point of each vehicle from the pixel coordinate system to the image coordinate system. Step 3, convert the center point of each vehicle from the image coordinate system to the camera coordinate system. Step 4, calculate the distance of each vehicle in the camera coordinate system. Step 5, find the minimum distance to the drone: 150.9. Step 6, calculate the flight time of the drone: 150.9/16 = 9.4

Model Response:
To determine how long it takes for the drone to reach the farthest vehicle, we need to calculate the distance between the drone and the farthest vehicle and then divide that distance by the drone's speed.
### Relevant Image Information:
1. Drone's Starting Point: The drone is positioned at the top center of the image.
2. Farthest Vehicle: The farthest vehicle is located at the bottom right corner of the image, near the parking lot.
3. Drone's Speed: The drone's speed is 16 meters per second.
### Steps to Calculate the Time:
1. Estimate the Distance:
- The drone appears to be at a height above the farthest vehicle.
- The distance from the drone to the farthest vehicle can be approximated by the diagonal distance from the top center to the bottom right of the image.
2. Approximate Distance Calculation:
- The image dimensions are not provided, but we can estimate the distance by considering the relative sizes of the objects in the image.
- Let's assume the image dimensions are approximately 100 meters by 100 meters (this is a rough estimate based on the size of the buildings and the parking lot).
- The diagonal distance in a square with side length s is given by s√2.
3. Calculate the Distance:
- d ≈ 100 × √2 ≈ 141.4 meters.
4. Calculate the Time:
- t = Distance / Speed = 141.4 meters / 16 meters per second ≈ 8.84 seconds.
### Conclusion:
The drone takes approximately 8.84 seconds to reach the farthest vehicle, assuming the drone's speed is 16 meters per second and the image dimensions are roughly 100 meters by 100 meters.

Figure 17: The response of the InternVL2-40B model based on DCoT.
Question: Assuming the flight speed of the drone is 16 meters per second, how long does it take for the drone to reach the closest vehicle? (Unit: second)
Ground truth: 9.4
Solution:
Step 1, locate the center point of each vehicle. Step 2, convert the center point of each vehicle from the pixel coordinate system to the image coordinate system. Step 3, convert the center point of each vehicle from the image coordinate system to the camera coordinate system. Step 4, calculate the distance of each vehicle in the camera coordinate system. Step 5, find the minimum distance to the drone: 150.9. Step 6, calculate the flight time of the drone: 150.9/16 = 9.4

Model Response:
To answer the question, we need to determine the distance between the drone and the farthest vehicle and then calculate the time it takes for the drone to cover that distance at the given speed.
1. Objects and Attributes:
- Drone: The drone is the object of interest with a specified speed of 16 meters per second.
- Vehicle: The farthest vehicle is the object we need to reach.
- Building: The building is near the drone.
- Parking Lot: The parking lot is where the farthest vehicle is located.
- Trees: Trees surround the building and parking lot.
- Path: A path connects the parking lot to the building.
2. Object Relationships:
- The drone is near the building.
- The farthest vehicle is in the parking lot.
- The building is adjacent to the parking lot.
- Trees surround both the building and the parking lot.
- A path connects the parking lot to the building.
### Step-by-Step Calculation
1. Estimate the Distance:
- From the image, visually estimate the distance between the drone and the farthest vehicle.
- Let's assume the distance is approximately 50 meters.
2. Calculate Time:
- Time = Distance / Speed
- Time = 50 meters / 16 meters per second
- Time = 3.125 seconds
Therefore, it takes approximately 3.125 seconds for the drone to reach the farthest vehicle.

Figure 18: The response of the InternVL2-40B model based on CCoT.
E.6 CROSS-VIEW KNOWLEDGE TRANSFER ABILITY OF GPT-4O
Previous RS VQA datasets focused primarily on obvious attributes such as color and relative size, and did not require domain-specific automotive knowledge to answer. In contrast, our benchmark captures a variety of fine-grained attributes of ground vehicles, including color, model, brand, price, power type, etc., by aligning drone-captured aerial images with corresponding ground videos. Unlike the default experimental setting, we do not provide any context on vehicle prices. We further filter price-related questions from GEOMATH to quantitatively assess the models' ability to estimate car prices based on the visual appearance of cars in remote sensing images. This specific capability partially reflects the models' cross-view knowledge transfer ability. As expected, GPT-4o achieves an accuracy of 53.2% on price-related questions, a dominant performance. InternVL2-40B also performs well, trailing GPT-4o by 12.9 points but still significantly outperforming the other models.
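
A minimal sketch of this evaluation protocol (filtering price-related questions and scoring exact-match accuracy) is given below; the field names and the keyword-based filter are illustrative assumptions rather than the benchmark's actual implementation.

def is_price_question(sample: dict) -> bool:
    # Illustrative filter: treat any question mentioning price as price-related.
    return "price" in sample["question"].lower()

def price_accuracy(samples: list[dict], predictions: dict[str, str]) -> float:
    """Exact-match accuracy on the price-related subset of the benchmark."""
    subset = [s for s in samples if is_price_question(s)]
    correct = sum(
        str(predictions[s["id"]]).strip() == str(s["answer"]).strip() for s in subset
    )
    return 100.0 * correct / max(len(subset), 1)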
Figure 19: Accuracy scores of VLMs on price-related questions.
E.7 COMPARISONS OF DIFFERENT MODELS
Question: What color of SUV vehicles is most common in the image?
Choices: A) black. B) pink. C) orange. D) blue.
Ground truth: D) blue
Solution:
Step 1, identify the type of all vehicles.
Step 2, identify the color of the SUV vehicles.
Step 3, count SUV vehicles for each color: {'gray': 2, 'blue': 3, 'brown': 1, 'black': 2}
Step 4, sort to get the most common color of SUV vehicles: blue

XComposer2: The answer is A.
Answer: A

GeoChat: A) black.
Answer: A

LLaVA-v1.5-7B: A) black.
Answer: A

InternVL-Chat-V1.5: A) black.
Answer: A

DeepSeek-VL: A) black.
Answer: A

InternVL2-8B: A) black.
Answer: A

MiniCPM-V 2.5: The question pertains to identifying the most common color of SUV vehicles in the image. However, the image provided does not contain any SUV vehicles; it is an aerial view of a building complex with solar panels and surrounding greenery. Therefore, none of the choices (A) black, B) pink, C) orange, or D) blue apply to the content of the image. The question cannot be answered based on the provided image.
Answer: None

MiniCPM-V 2.6: The most common color of SUV vehicles in the image is black. This can be determined by observing the parking lot where several SUVs are visible, and the majority of them are black. There are no pink, orange, or blue SUVs that can be clearly identified in the image. Therefore, the best option is A) black.
Answer: A

LLaVA-v1.6-34B: A) black.
Answer: A

InternVL2-40B: A) black.
Answer: A

GPT-4o: A) black.
Answer: A

Figure 20: All VLM baselines predicted the wrong answer to this complex problem.
Question: How many sports vehicles are there in the image?
Ground truth: 1
Solution:
Step 1, identify the type of all vehicles: ['sports', 'mpv', 'mid-size', 'subcompact']
Step 2, count all sports vehicles: 1

InternLM-XComposer2: There are two sports vehicles in the image.
Answer: 2

GeoChat: There are a total of 2 sports vehicles in the image.
Answer: 1

LLaVA-v1.5-7B: There are two sports vehicles in the image.
Answer: 2

InternVL-Chat-V1.5: There are two sports vehicles in the image. One is a red sports car parked on the left side of the parking lot, and the other is a black sports car parked on the right side of the parking lot.
Answer: 2

DeepSeek-VL: There is one sports vehicle in the image.
Answer: 1

InternVL2-8B: There are a total of four sports vehicles visible in the image.
Answer: 4

MiniCPM-V 2.5: In the image, there is one sports vehicle visible. It is a red sports car parked on the brick pavement. The car's design, with its sleek body and low profile, is characteristic of sports cars, which are typically designed for high performance and speed. The other vehicles in the image do not exhibit the same design features and are therefore not classified as sports vehicles.
Answer: 1

MiniCPM-V 2.6: To determine the number of sports vehicles in the image, one would need to identify vehicles that are typically associated with sports, such as those with a sleek design, low profile, and possibly high-performance features. In this image, the red car in the center appears to be a sports vehicle due to its vibrant color and sporty design. There are no other vehicles in the image that clearly fit the description of a sports vehicle. Therefore, based on the visible evidence, there is one sports vehicle in the image.
Answer: 1

LLaVA-v1.6-34B: There is one sports vehicle in the image, which is the red car parked on the left side of the street.
Answer: 1

InternVL2-40B: There is only a single sports vehicle visible in the image.
Answer: 1

GPT-4o: The image shows one sports vehicle, which is the red car located near the center-left of the image.
Answer: 1

Figure 21: Most VLMs have the ability to identify the type of a car from RS images. The MiniCPM series models compare the common features of sports cars with the vehicles in the image during the reasoning process.