Peter Stone's Selected Publications



SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation

SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation.
Michael J. Munje, Chen Tang, Shuijing Liu, Zichao Hu, Yifeng Zhu, Jiaxun Cui, Garrett Warnell, Joydeep Biswas, and Peter Stone.
In Conference on Robot Learning, September 2025.

Download

[PDF] 22.5MB  

Abstract

Robot navigation in dynamic, human-centered environments requires socially-compliant decisions grounded in robust scene understanding. Recent Vision-Language Models (VLMs) exhibit promising capabilities such as object recognition, common-sense reasoning, and contextual understanding—capabilities that align with the nuanced requirements of social robot navigation. However, it remains unclear whether VLMs can accurately understand complex social navigation scenes (e.g., inferring the spatial-temporal relations among agents and human intentions), which is essential for safe and socially compliant robot navigation. While some recent works have explored the use of VLMs in social robot navigation, no existing work systematically evaluates their ability to meet these necessary conditions. In this paper, we introduce the Social Navigation Scene Understanding Benchmark (SocialNav-SUB), a Visual Question Answering (VQA) dataset and benchmark designed to evaluate VLMs for scene understanding in real-world social robot navigation scenarios. SocialNav-SUB provides a unified framework for evaluating VLMs against human and rule-based baselines across VQA tasks requiring spatial, spatiotemporal, and social reasoning in social robot navigation. Through experiments with state-of-the-art VLMs, we find that while the best-performing VLM achieves an encouraging probability of agreeing with human answers, it still underperforms a simpler rule-based approach and human consensus baselines, indicating critical gaps in social scene understanding of current VLMs. Our benchmark sets the stage for further research on foundation models for social robot navigation, offering a framework to explore how VLMs can be tailored to meet real-world social robot navigation needs.
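
The abstract's headline metric is the probability that a VLM's multiple-choice VQA answer agrees with human annotators. As a rough illustration of that kind of scoring (not the paper's actual code; the data layout and function names below are assumptions made for the sketch), an agreement computation might look like this in Python:

    # Hypothetical sketch of agreement scoring against human VQA annotations.
    # The question layout ('model_answer', 'human_answers') is an assumed
    # structure for illustration, not the SocialNav-SUB API.
    from collections import Counter

    def human_consensus(human_answers):
        """Majority-vote answer among human annotators for one question."""
        return Counter(human_answers).most_common(1)[0][0]

    def agreement_rate(questions):
        """Fraction of questions where the model matches any human answer."""
        hits = sum(q["model_answer"] in q["human_answers"] for q in questions)
        return hits / len(questions)

    if __name__ == "__main__":
        demo = [
            {"model_answer": "B", "human_answers": ["B", "B", "A"]},  # agrees
            {"model_answer": "C", "human_answers": ["A", "A", "B"]},  # disagrees
        ]
        print(f"agreement with human answers: {agreement_rate(demo):.2f}")
        print(f"consensus on Q1: {human_consensus(demo[0]['human_answers'])}")

The same scoring can be applied to a rule-based baseline or to the human-consensus answer itself, which is how the paper's comparison between VLMs and baselines is framed.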

BibTeX Entry

@InProceedings{michael_munje_corl2025,
  author   = {Michael J. Munje and Chen Tang and Shuijing Liu and Zichao Hu and Yifeng Zhu and Jiaxun Cui and Garrett Warnell and Joydeep Biswas and Peter Stone},
  title    = {SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation},
  booktitle = {Conference on Robot Learning},
  year     = {2025},
  month    = {September},
  location = {Seoul, Korea},
  abstract = {Robot navigation in dynamic, human-centered environments requires
socially-compliant decisions grounded in robust scene understanding. Recent
Vision-Language Models (VLMs) exhibit promising capabilities such as object
recognition, common-sense reasoning, and contextual understanding—capabilities
that align with the nuanced requirements of social robot navigation. However, it
remains unclear whether VLMs can accurately understand complex social navigation
scenes (e.g., inferring the spatial-temporal relations among agents and human
intentions), which is essential for safe and socially compliant robot navigation.
While some recent works have explored the use of VLMs in social robot navigation,
no existing work systematically evaluates their ability to meet these necessary
conditions. In this paper, we introduce the Social Navigation Scene Understanding
Benchmark (SocialNav-SUB), a Visual Question Answering (VQA) dataset and
benchmark designed to evaluate VLMs for scene understanding in real-world social
robot navigation scenarios. SocialNav-SUB provides a unified framework for
evaluating VLMs against human and rule-based baselines across VQA tasks requiring
spatial, spatiotemporal, and social reasoning in social robot navigation. Through
experiments with state-of-the-art VLMs, we find that while the best-performing
VLM achieves an encouraging probability of agreeing with human answers, it still
underperforms a simpler rule-based approach and human consensus baselines,
indicating critical gaps in social scene understanding of current VLMs. Our
benchmark sets the stage for further research on foundation models for social
robot navigation, offering a framework to explore how VLMs can be tailored to
meet real-world social robot navigation needs.
  },
}
