Abstract
Large Vision-Language Models (LVLMs) have achieved significant progress in tasks like visual question answering and document understanding. However, their potential to comprehend embodied environments and navigate within them remains underexplored. In this work, we first study the challenge of open-vocabulary object navigation by introducing DIVSCENE, a large-scale dataset with 4,614 houses across 81 scene types and 5,707 kinds of target objects. Our dataset provides a much greater diversity of target objects and scene types than existing datasets, enabling a comprehensive task evaluation. We evaluated various methods with LVLMs and LLMs on our dataset and found that current models still fall short of open-vocab object navigation ability. Then, we fine-tuned LVLMs1 to predict the next action with CoT explanations. We observe that LVLM’s navigation ability can be improved substantially with only BFS-generated shortest paths without any human supervision, surpassing GPT-4o by over 20% in success rates.
| Original language | English |
|---|---|
| Title of host publication | EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025 |
| Editors | Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng |
| Publisher | Association for Computational Linguistics (ACL) |
| Pages | 9666-9686 |
| Number of pages | 21 |
| ISBN (Electronic) | 9798891763357 |
| DOIs | |
| Publication status | Published - 2025 |
| Event | 30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025 - Suzhou, China Duration: 4 Nov 2025 → 9 Nov 2025 https://aclanthology.org/volumes/2025.emnlp-main/ |
Publication series
| Name | EMNLP 2025 - 2025 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2025 |
|---|
Conference
| Conference | 30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025 |
|---|---|
| Abbreviated title | EMNLP 2025 |
| Country/Territory | China |
| City | Suzhou |
| Period | 4/11/25 → 9/11/25 |
| Internet address |
Bibliographical note
Publisher Copyright:©2025 Association for Computational Linguistics.
Fingerprint
Dive into the research topics of 'DIVSCENE: Towards Open-Vocabulary Object Navigation with Large Vision Language Models in Diverse Scenes'. Together they form a unique fingerprint.Cite this
- APA
- Author
- BIBTEX
- Harvard
- Standard
- RIS
- Vancouver