
DIVSCENE: Towards Open-Vocabulary Object Navigation with Large Vision Language Models in Diverse Scenes

  • Zhaowei Wang
  • Hongming Zhang
  • Tianqing Fang
  • Ye Tian
  • Yue Yang
  • Kaixin Ma
  • Xiaoman Pan
  • Yangqiu Song
  • Dong Yu

Research output: Chapter in Book/Conference Proceeding/Report › Conference Paper published in a book › peer-review

Abstract

Large Vision-Language Models (LVLMs) have achieved significant progress in tasks like visual question answering and document understanding. However, their potential to comprehend embodied environments and navigate within them remains underexplored. In this work, we first study the challenge of open-vocabulary object navigation by introducing DIVSCENE, a large-scale dataset with 4,614 houses across 81 scene types and 5,707 kinds of target objects. Our dataset provides a much greater diversity of target objects and scene types than existing datasets, enabling a comprehensive task evaluation. We evaluated various methods with LVLMs and LLMs on our dataset and found that current models still fall short of open-vocab object navigation ability. Then, we fine-tuned LVLMs to predict the next action with CoT explanations. We observe that LVLMs’ navigation ability can be improved substantially with only BFS-generated shortest paths without any human supervision, surpassing GPT-4o by over 20% in success rates.

Original language: English
Title of host publication: EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Publisher: Association for Computational Linguistics (ACL)
Pages: 9666-9686
Number of pages: 21
ISBN (Electronic): 9798891763357
Publication status: Published - 2025
Event: 30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025 - Suzhou, China
Duration: 4 Nov 2025 to 9 Nov 2025
https://aclanthology.org/volumes/2025.emnlp-main/

Publication series

Name: EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025

Conference

Conference: 30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025
Abbreviated title: EMNLP 2025
Country/Territory: China
City: Suzhou
Period: 4/11/25 to 9/11/25

Bibliographical note

Publisher Copyright:
©2025 Association for Computational Linguistics.
