(a) Example of outputs from a reasoning model and a non-reasoning model on a perception task. Red highlights indicate visual hallucination. Multimodal reasoning models are generally more prone to amplifying hallucinations during the reasoning process than their non-reasoning counterparts. (b) Performance of different models on reasoning and perception tasks in the RH-Bench dataset. Better-performing models are positioned in the upper right corner. Baseline non-reasoning models of varying scales typically exhibit weaker reasoning capabilities and fewer hallucinations, whereas reasoning models display the opposite trend. Credit: Liu et al.
Over the past few decades, computer scientists have introduced increasingly sophisticated machine learning-based models that perform remarkably well on a wide range of tasks. These include multimodal large language models (MLLMs), systems that can process and generate different types of data, predominantly text, images and video.
Some of these models, such as OpenAI's GPT-4 with Vision (GPT-4V), DeepSeek-R1 and Google Gemini, are now widely used worldwide to create multimodal content, including images for social media posts or articles, as well as text tailored to particular purposes.
While the reasoning abilities of these models have improved considerably in recent years, allowing them to solve mathematical and other reasoning problems, studies have shown that they sometimes produce responses that are not grounded in the input data, for instance by describing details that do not actually exist in an input image.
These hallucinations have been linked to language priors and internal biases that a model may have acquired while being trained on large text datasets. These biases can override the visual information fed to the model (i.e., input images), causing it to complete its assigned tasks incorrectly.
Researchers at UC Santa Cruz, Stanford
University and UC Santa Barbara have recently developed a metric and a
diagnostic benchmark that could help to study these hallucinations,
specifically focusing on the relationship between the reasoning of MLLMs and
their tendency to hallucinate when asked to describe what is portrayed in an
input image. These new research tools, presented in a paper on
the arXiv preprint server,
could contribute to the assessment and advancement of MLLMs.
"Test-time compute has empowered
multimodal large language models to generate extended reasoning chains,
yielding strong performance on tasks such as multimodal math reasoning,"
wrote Chengzhi Liu, Zhongxing Xu and their colleagues in their paper.
"However, this improved reasoning ability often comes with increased hallucination: as generations become longer, models tend to drift away from image-grounded content and rely more heavily on language priors."
The
researchers first assessed the performance of MLLMs on complex reasoning tasks
and found that as reasoning chains (i.e., sequences of logical steps required
to solve a problem) grew in length, the models' tendency to hallucinate also
increased. They suggested that these hallucinations emerged due to reduced
attention to visual stimuli and a greater reliance on language priors.
"Attention analysis shows that
longer reasoning chains lead to reduced focus on visual inputs, which
contributes to hallucination," wrote Liu, Xu and their colleagues.
"To systematically study this
phenomenon, we introduce RH-AUC, a metric that quantifies how a model's
perception accuracy changes with reasoning length, allowing us to evaluate
whether the model preserves visual grounding during reasoning. We also release
RH-Bench, a diagnostic benchmark that spans a variety of multimodal tasks,
designed to assess the trade-off between reasoning ability and
hallucination."
RH-AUC and RH-Bench, the metric and benchmark developed by Liu, Xu and their colleagues, could soon be used by other researchers to evaluate the interplay between the reasoning abilities of specific MLLMs and their risk of hallucinating. Moreover, the observations presented in the team's paper could guide future efforts to develop models that can reliably tackle complex reasoning tasks without becoming prone to hallucination.
"Our analysis reveals that larger
models typically achieve a better balance between reasoning and perception and that this
balance is influenced more by the types and domains of training data than by
its overall volume," wrote Liu, Xu and their colleagues. "These
findings underscore the importance of evaluation frameworks that jointly
consider both reasoning quality and perceptual fidelity."
Written by Ingrid Fadelli, edited by Gaby Clark, and fact-checked and reviewed by Robert Egan.
Source: Benchmarking hallucinations: New metric tracks where multimodal reasoning models go wrong