fMoW-mm: Measuring and Mitigating Hallucinations in Vision-Language Dataset Generation for Remote Sensing

¹MIT ²MIT Lincoln Laboratory ³DAF MIT AI Accelerator
*Correspondence to mloui [at] mit.edu.
Leveraging multimodal LLMs to enhance vision-language remote sensing datasets enables models like CLIP to excel in complex, domain-specific tasks such as few-shot object detection, showcasing the utility of fMoW-mm.

Abstract

Vision-language models have achieved impressive results across various fields. However, adoption in remote sensing remains limited, largely due to the scarcity of paired image-text data. To bridge this gap, synthetic caption generation has gained interest, traditionally relying on rule-based methods that use metadata or bounding boxes. While these approaches provide some description, they often lack the depth needed to capture complex wide-area scenes. Large language models (LLMs) offer a promising alternative for generating more descriptive captions, yet they can produce generic outputs and are prone to hallucination. In this paper, we propose a new method to enhance vision-language datasets for remote sensing by integrating maps as an external data source, enabling the generation of detailed, context-rich captions. Additionally, we present methods to measure and mitigate hallucinations in LLM-generated text. We introduce fMoW-mm, a multimodal dataset incorporating satellite imagery, maps, metadata, and text annotations. We demonstrate its effectiveness for automatic target recognition in few-shot settings, achieving superior performance compared to other vision-language remote sensing datasets.

Rule-based captions built from metadata and bounding boxes are limited in detail, while unimodal LLM captions are fluent but often generic. We leverage LLMs and the semantic density of maps to generate comprehensive, detailed captions for complex wide-area scenes.

Multimodal Dataset Curation

The multimodal LLM-based curation process involves four main steps:

  1. Gather satellite images and metadata: The Functional Map of the World (fMoW-rgb) dataset consists of remote sensing images across 83,412 unique locations, featuring objects from 63 categories. Each image comes with corresponding metadata such as category label, latitude, longitude, timestamp, ground sampling distance (GSD), and bounding box.
  2. Perform OSM Mapbox query to retrieve map tiles: We use the bounding box coordinates from the fMoW-rgb metadata to query the corresponding OSM Static Image tiles through the Mapbox API, which allows map style customization.
  3. Generate captions with GPT-4o: To generate captions, we use the GPT-4o API from OpenAI, which accepts visual and text inputs. For each sample, we input the fMoW-rgb satellite image, metadata and OSM tile. The input metadata includes the category label, location, latitude, longitude, and GSD. We prompt GPT-4o to describe the remote sensing scene and to include landmarks, relative positions, sizes, colors, and quantities, while leveraging the metadata and map for context. Other LLMs, including open-source options, can be substituted for GPT-4o, as long as they accept visual inputs.
  4. Combine elements to create multimodal dataset: We combine the satellite image, metadata, OSM tile, and the GPT-4o generated caption to create 83,412 tuples of (satellite, metadata, map, text). The full dataset is available here.
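Steps 2 and 3 above can be sketched as follows. This is an illustrative sketch, not the authors' code: the Mapbox style ID, default token placeholder, prompt wording, and helper names are assumptions.

```python
# Sketch of the map-tile query (step 2) and prompt assembly (step 3).
# The style ID, token placeholder, and prompt wording are illustrative assumptions.

def mapbox_static_url(bbox, style="mapbox/streets-v12",
                      size=(1024, 1024), token="YOUR_MAPBOX_TOKEN"):
    """Build a Mapbox Static Images API URL for a lon/lat bounding box.

    bbox: (min_lon, min_lat, max_lon, max_lat), as in the fMoW-rgb metadata.
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    w, h = size
    return (
        f"https://api.mapbox.com/styles/v1/{style}/static/"
        f"[{min_lon},{min_lat},{max_lon},{max_lat}]/{w}x{h}"
        f"?access_token={token}"
    )

def build_caption_prompt(metadata):
    """Assemble a GPT-4o prompt from fMoW-rgb metadata (wording is illustrative)."""
    return (
        f"Describe this remote sensing scene of a {metadata['category']} "
        f"near ({metadata['lat']}, {metadata['lon']}) at {metadata['gsd']} m GSD. "
        "Mention landmarks, relative positions, sizes, colors, and quantities, "
        "using the attached map tile for context."
    )
```

The returned URL and prompt would then be passed to an HTTP client and to the GPT-4o API, respectively, together with the satellite image.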

Measuring Hallucinations

In our approach, hallucinations often occur when the LLM infers incorrect landmarks during caption generation. To quantify these hallucinations, we compute the false discovery rate (FDR), inspired by BLEU precision, which measures the proportion of false positives in the generated text. Unlike BLEU, which evaluates n-gram overlaps, we calculate precision over variable-length proper nouns and define FDR as 1 − precision:

FDR(C, R) = 1 − (1/K) Σ_{c ∈ C} 𝟙[c ∈ R]

where C is the candidate list of size K, R is the reference list, and the indicator function counts the number of elements in C that are found in R. FDR reflects the proportion of false positives among all predicted positives, effectively quantifying the rate of hallucinations in the generated (candidate) captions. Low FDR scores are desirable.
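The metric above can be sketched directly from its definition; the proper-noun extraction from captions and map tiles is assumed to happen upstream.

```python
# Sketch of the FDR computation over proper-noun lists. Candidate and
# reference proper nouns are assumed to be extracted upstream.

def false_discovery_rate(candidates, references):
    """FDR = 1 - precision over variable-length proper nouns.

    candidates: proper nouns found in the generated caption (list C, size K).
    references: proper nouns known to be in the scene (list R).
    """
    if not candidates:
        return 0.0  # no predicted positives, so no false discoveries
    ref = set(references)
    true_positives = sum(1 for c in candidates if c in ref)
    return 1.0 - true_positives / len(candidates)
```

For example, a caption naming "Charles River" and "Fenway Park" against a reference containing only "Charles River" has FDR 0.5: one of the two predicted landmarks is a hallucination.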

Ablations

We perform ablations to evaluate how components of our curation pipeline affect hallucination rates (FDR) and measure the percentage of uncertain words as a proxy for LLM uncertainty.

We use 1024-pixel resolution, the landmarks-only map type, and an ensemble of 3 prompts for the final dataset. This configuration strikes the best balance, reducing hallucinations while limiting uncertainty.
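The uncertainty proxy can be sketched as a simple hedge-word count; the hedge lexicon below is an illustrative assumption, not the exact word list used in our ablations.

```python
# Illustrative proxy for LLM uncertainty: the percentage of hedge words in a
# caption. The hedge lexicon here is an assumption, not the paper's list.
HEDGE_WORDS = {"possibly", "likely", "appears", "seems", "perhaps", "may", "might"}

def uncertain_word_pct(caption):
    """Return the percentage of words in `caption` drawn from the hedge lexicon."""
    words = caption.lower().split()
    if not words:
        return 0.0
    return 100.0 * sum(w.strip(".,") in HEDGE_WORDS for w in words) / len(words)
```

A caption such as "appears possibly new here" scores 50%, since two of its four words are hedges.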

Few-Shot Object Detection with CLIP

We continually pretrain the CLIP ViT-L/14 model on the fMoW-mm dataset for 50 epochs with a batch size of 125. We compare performance against vision-language baselines: CLIP, OpenCLIP, GeoRSCLIP, and RemoteCLIP.
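Continual pretraining optimizes the standard CLIP contrastive objective over (image, caption) pairs. A minimal NumPy sketch of that symmetric InfoNCE loss (not the authors' training code, which operates on ViT-L/14 features) is:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over L2-normalized image/text embeddings.

    img_emb, txt_emb: (batch, dim) arrays; matching rows are positive pairs.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch) cosine similarities

    def log_softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

    # Diagonal entries are the positive (matched) pairs.
    loss_i = -np.mean(np.diag(log_softmax(logits, axis=1)))  # image -> text
    loss_t = -np.mean(np.diag(log_softmax(logits, axis=0)))  # text -> image
    return (loss_i + loss_t) / 2
```

When matched image and text embeddings coincide, the loss is near zero; shuffling the pairing drives it up, which is what pushes the encoder toward caption-aligned representations.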

We evaluate the learned visual representation on few-shot object detection based on OVDSAT, using the DIOR dataset and averaging over 5 splits. Our model improves performance across all n-shot settings, showing its viability for data-scarce scenarios. Although fMoW-mm (~83k pairs) is much smaller than the datasets used to train GeoRSCLIP (~5M) and RemoteCLIP (~150k), it achieves superior performance, highlighting the benefits of increased semantic density in the generated captions.

mAP50 scores for 5-, 10-, and 30-shot object detection on the DIOR dataset with various visual backbones show superior performance when using fMoW-mm.

BibTeX

@misc{anderson2025measuringmitigatinghallucinationsvisionlanguage,
      title={Measuring and Mitigating Hallucinations in Vision-Language Dataset Generation for Remote Sensing},
      author={Madeline Anderson and Miriam Cha and William T. Freeman and J. Taylor Perron and Nathaniel Maidel and Kerri Cahoy},
      year={2025},
      eprint={2501.14905},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.14905}
}