multimodal LLM · top_conference (ACL)
Safety evaluation and red-teaming of large language models remain predominantly text-centric, and existing frameworks lack the infrastructure to systematically test whether alignment generalizes to audio, image, and video inputs. We presen…

vision-language model · top_conference (ICRA)
During surgery, scrub nurses are required to frequently deliver surgical instruments to surgeons, which can lead to physical fatigue and decreased focus. Robotic scrub nurses provide a promising solution that can replace repetitive tasks a…

vision-language model · top_conference (ICLR)
Reasoning has emerged as a key capability of large language models. In linguistic tasks, this capability can be enhanced by self-improving techniques that refine reasoning paths for subsequent finetuning. However, extending these language-…
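
For context, the self-improving technique alluded to above is usually a rejection-sampling loop: sample several reasoning paths per question, keep only those that reach the known answer, and finetune on the survivors. A minimal generic sketch (not this paper's method; sample_fn is a hypothetical stand-in for model decoding):

    # STaR-style self-improvement loop (generic sketch, not this paper's method).
    from typing import Callable, Dict, List, Tuple

    def collect_finetuning_data(
        sample_fn: Callable[[str], Tuple[str, str]],  # question -> (path, answer)
        dataset: List[Tuple[str, str]],               # (question, gold answer)
        samples_per_question: int = 8,
    ) -> List[Dict[str, str]]:
        """Keep only sampled reasoning paths whose final answer is correct."""
        kept = []
        for question, gold in dataset:
            for _ in range(samples_per_question):
                path, answer = sample_fn(question)
                if answer.strip() == gold.strip():
                    kept.append({"prompt": question, "completion": path})
        return kept  # finetune on `kept`, then resample with the updated model

    # Toy usage with a stub sampler:
    stub = lambda q: ("think step by step ...", "4")
    print(len(collect_finetuning_data(stub, [("2+2?", "4")], 2)))  # 2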

vision-language model
Vision-language models like CLIP have achieved remarkable progress in cross-modal representation learning, yet suffer from systematic misclassifications among visually and semantically similar categories. We observe that such confusion pat…
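
The confusion pattern described above is easy to reproduce with off-the-shelf CLIP, since zero-shot classification reduces to cosine similarity between one image embedding and a handful of prompt embeddings. A minimal sketch using the Hugging Face transformers CLIP API; the checkpoint name and label set are illustrative:

    # Zero-shot CLIP scoring over a set of text prompts.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    labels = ["a photo of a wolf", "a photo of a husky", "a photo of a coyote"]
    image = Image.new("RGB", (224, 224))  # placeholder; use a real photo

    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # image-to-text similarities
    # Class probabilities; visually similar labels often land close together.
    print(logits.softmax(dim=-1))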

vision-language model · VLM
This paper introduces VLMFusionOcc3D, a robust multimodal framework for dense 3D semantic occupancy prediction in autonomous driving. Current voxel-based occupancy models often struggle with semantic ambiguity in sparse geometric grids and…

vision-language model · top_conference (CVPR)
Out-of-distribution (OOD) detection seeks to identify samples from unknown classes, a critical capability for deploying machine learning models in open-world scenarios. Recent research has demonstrated that Vision-Language Models (VLMs) ca…
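
A common zero-shot baseline in this line of work scores OOD-ness MCM-style: softmax the image's similarities to the in-distribution class prompts and treat a low maximum probability as evidence that the sample is OOD. A numpy sketch over precomputed cosine similarities; the temperature value is illustrative:

    # MCM-style OOD score from precomputed image-to-prompt cosine similarities.
    import numpy as np

    def mcm_score(sims: np.ndarray, tau: float = 0.01) -> np.ndarray:
        """sims: (n_images, n_id_classes) cosine similarities.
        Returns the max softmax probability per image; low => likely OOD."""
        z = sims / tau
        z -= z.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
        return probs.max(axis=1)

    sims = np.array([[0.31, 0.30, 0.29],   # flat: looks OOD
                     [0.45, 0.22, 0.20]])  # peaked: looks in-distribution
    print(mcm_score(sims))  # smaller score for the first row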

vision-language model
Fashion intelligence spans multiple tasks, i.e., retrieval, recommendation, recognition, and dialogue, yet remains hindered by fragmented supervision and incomplete fashion annotations. These limitations jointly restrict the formation of c…

multimodal LLM
Multimodal large language models (MLLMs) suffer from pronounced hallucinations in remote sensing visual question-answering (RS-VQA), primarily caused by visual grounding failures in large-scale scenes or misinterpretation of fine-grained s…

VLM
Recent advances in vision-language models (VLMs) have transformed end-to-end document understanding. However, their ability to interpret the complex layout semantics of historical scholarly texts remains limited. This paper investigates st…

vision-language model
Large vision-language models (LVLMs) demonstrate strong performance on diagram understanding benchmarks, yet they still struggle with understanding relationships between elements, particularly those represented by nodes and directed edges…

vision-language model
Large Vision Language Models (LVLMs) exhibit strong Chain-of-Thought (CoT) capabilities, yet most existing paradigms assume full-video availability before inference, a batch-style process misaligned with real-world video streams where info…

vision-language model
Vision-language models (VLMs) pre-trained on large, heterogeneous data sources are becoming increasingly popular, providing rich multi-modal embeddings that enable efficient transfer to new tasks. A particularly relevant application is few…
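
The few-shot transfer setting mentioned above typically reduces to a linear probe: freeze the VLM image encoder, embed the few labeled examples, and fit a lightweight classifier on those embeddings. A sketch with scikit-learn, using random vectors as stand-ins for real VLM features:

    # Linear probe on frozen VLM embeddings (features are synthetic stand-ins).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n_shots, n_classes, dim = 16, 5, 512             # 16 examples per class
    X = rng.normal(size=(n_shots * n_classes, dim))  # stand-in image features
    y = np.repeat(np.arange(n_classes), n_shots)

    probe = LogisticRegression(max_iter=1000).fit(X, y)
    query = rng.normal(size=(1, dim))                # embedding of a test image
    print(probe.predict(query))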

vision-language model · VLM
Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to an inherent architectural mismatch: VLMs are primarily pretrained on static, disembodied vision-language tasks, which fundamenta…

vision-language-action · VLA
Vision-Language-Action (VLA) models are a promising path toward embodied intelligence, yet they often overlook the predictive and temporal-causal structure underlying visual dynamics. World-model VLAs address this by predicting future fram…
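
The future-frame prediction these world-model VLAs rely on can be reduced, in its simplest form, to a latent dynamics head: given the current visual embedding and the action, regress the next embedding. A minimal PyTorch sketch; the dimensions and MLP head are illustrative, not any particular paper's architecture:

    # Latent next-frame prediction head for a world-model-style VLA (sketch).
    import torch
    import torch.nn as nn

    class LatentDynamics(nn.Module):
        def __init__(self, state_dim: int = 256, action_dim: int = 7):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim + action_dim, 512),
                nn.GELU(),
                nn.Linear(512, state_dim),
            )

        def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
            return self.net(torch.cat([state, action], dim=-1))

    model = LatentDynamics()
    s_t = torch.randn(8, 256)     # current frame embeddings (batch of 8)
    a_t = torch.randn(8, 7)       # e.g., 7-DoF end-effector actions
    s_next = torch.randn(8, 256)  # target: next frame embeddings
    loss = nn.functional.mse_loss(model(s_t, a_t), s_next)
    print(loss.item())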

vision-language model
Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. Existing benchmarks lack a systematic exploration of the specific tasks where gene…

vision-language model · top_conference (ICLR)
Conducting and learning from interaction and experience is a central challenge in robotics, offering a scalable alternative to labor-intensive human demonstrations. However, realizing such "play" requires (1) a policy robust to di…

vision-language-action · vision-language model
We dream of a future where point clouds from all domains can come together to shape a single model that benefits them all. Toward this goal, we present Utonia, a first step toward training a single self-supervised point transformer encoder…