Even with recent advances in the ability of large language models (like ChatGPT) to help us think, research, summarize, and learn from complex technical texts, how do they fare at understanding storytelling and literature? Questions around interpretive nuance remain open.
Columbia Engineering researchers
are addressing these issues through a novel, ethically grounded evaluation framework. Their work, published on the arXiv preprint server, was recognized
with a 2025 Best Paper Award from the Transactions of the Association for
Computational Linguistics (TACL), highlighting its methodological rigor and
contribution to the field.
"Before we can place real
trust in LLMs' analytical abilities, we need careful evidence of what they can
and cannot do," said Kathleen McKeown, the Henry and Gertrude Rothschild
Professor of Computer Science at Columbia Engineering. She and Associate
Professor Lydia Chilton led the research team.
"If LLMs are to serve as tools
for human inquiry, we must first understand the depth and the limits of their
analytical capabilities, including in domains like narrative and
literature."
A new evaluation framework
The study evaluated the performance
of state-of-the-art language models—GPT-4, Claude-2.1, and LLaMA-2-70B—on the
task of summarizing short fiction. Unlike many prior evaluations that relied on
publicly available texts that may be included in model training data, this
project introduced a controlled, original dataset.
The researchers collaborated
directly with published authors, who contributed their previously unpublished
short stories. These authors then assessed the quality of the summaries
generated by the models.
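To make the setup concrete, here is a minimal sketch of what such an evaluation loop might look like. The model names mirror those compared in the study, but everything else, including the `generate` placeholder, the `AuthorRating` rubric fields, and the data layout, is an illustrative assumption rather than the authors' actual code.

```python
# Illustrative sketch of a summarization-evaluation loop (not the study's code).
# `generate(model, prompt)` stands in for whatever API client each model needs.

from dataclasses import dataclass

MODELS = ["gpt-4", "claude-2.1", "llama-2-70b"]  # models compared in the study


@dataclass
class AuthorRating:
    """One author's judgment of a model-generated summary of their own story."""
    model: str
    faithful: bool   # does the summary avoid factual errors about the plot?
    specific: bool   # does it name concrete events rather than generalities?
    notes: str = ""  # free-text comments on subtext, structure, etc.


def generate(model: str, prompt: str) -> str:
    """Placeholder for a real API call to the named model."""
    raise NotImplementedError("wire up the appropriate client here")


def summarize_story(model: str, story_text: str) -> str:
    prompt = f"Summarize the following short story:\n\n{story_text}"
    return generate(model, prompt)


def error_rate(ratings: list[AuthorRating], model: str) -> float:
    """Fraction of a model's summaries the author flagged as unfaithful."""
    mine = [r for r in ratings if r.model == model]
    return sum(not r.faithful for r in mine) / len(mine) if mine else 0.0
```

Keeping the stories unpublished, as the study does, guarantees none of them appear in the models' training data, so any loop like this measures genuine interpretation rather than recall.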
Using both quantitative and
qualitative methods informed by narrative theory, the analysis revealed that
all three models generated faithfulness errors in over 50% of cases and
consistently struggled with specificity and the interpretation of complex subtext
or nonlinear narrative structures.
"Models can seem like they
understand a story, but their outputs are ultimately unpredictable because they
rely on probabilities," said Melanie Subbiah, the lead author of the
paper and a sixth-year Ph.D. student in the McKeown lab at Columbia.
"A trained human literary
analyst would produce consistently strong insights, but even the best model is
only about 50/50—essentially a coin flip—in giving a reliable analysis for any
given story."
The findings underscore the
limitations of current LLMs in intellectual and creative contexts that demand
close reading and interpretive sensitivity.
While such systems can serve as
useful tools, the researchers caution against relying on them for nuanced
literary analysis or other tasks requiring deep contextual understanding.
Subbiah believes their work "reinforces the value of human-centered, expert-informed
evaluation."
Beyond the empirical findings
Ethical considerations were
integral to the study. Participating authors were provided full transparency
regarding the use of their work and feedback, were compensated for their
contributions, and had their intellectual property carefully protected. The project
deliberately focused on narrative understanding and analysis rather than text
generation, reflecting "a commitment to responsible and respectful
research practices."
The project presents a novel
methodology for evaluating language models on content that is guaranteed to be
absent from their training data.
By working directly with domain experts (in this case, professional authors), the study
demonstrates an approach that enables more reliable assessment of a model's
interpretive and analytical capabilities. This framework offers a replicable
model for future research on narrative understanding and other forms of
expert-driven evaluation.
"The hope is that expert human insight will guide how we evaluate LLMs, keeping people at the center of technology development," said Subbiah.
Source: Can AI understand literature? Researchers put it to the test
