Even with recent advances in the ability of large language models (like ChatGPT) to help us think, research, summarize, and learn from complex technical texts, how do they fare at understanding storytelling and literature? Questions around interpretive nuance remain open.
Columbia Engineering researchers
are addressing these issues through a novel, ethically grounded evaluation framework. Their work, published on the arXiv preprint server, was recognized
with a 2025 Best Paper Award from the Transactions of the Association for
Computational Linguistics (TACL), highlighting its methodological rigor and
contribution to the field.
"Before we can place real
trust in LLMs' analytical abilities, we need careful evidence of what they can
and cannot do," said Kathleen McKeown, the Henry and Gertrude Rothschild
Professor of Computer Science at Columbia Engineering. She and Associate
Professor Lydia Chilton led the research team.
"If LLMs are to serve as tools
for human inquiry, we must first understand the depth and the limits of their
analytical capabilities, including in domains like narrative and
literature."
A new evaluation framework
The study evaluated the performance
of state-of-the-art language models—GPT-4, Claude-2.1, and LLaMA-2-70B—on the
task of summarizing short fiction. Unlike many prior evaluations that relied on
publicly available texts that may be included in model training data, this
project introduced a controlled, original dataset.
The researchers collaborated
directly with published authors, who contributed their previously unpublished
short stories. These authors then assessed the quality of the summaries
generated by the models.
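To make the setup concrete, here is a minimal sketch of what such an evaluation loop might look like. The model names mirror those compared in the study, but everything else, including the `generate` placeholder, the `AuthorRating` rubric fields, and the data layout, is an illustrative assumption rather than the authors' actual code.

```python
# Illustrative sketch of a summarization-evaluation loop (not the study's code).
# `generate(model, prompt)` stands in for whatever API client each model needs.

from dataclasses import dataclass

MODELS = ["gpt-4", "claude-2.1", "llama-2-70b"]  # models compared in the study


@dataclass
class AuthorRating:
    """One author's judgment of a model-generated summary of their own story."""
    model: str
    faithful: bool   # does the summary avoid factual errors about the plot?
    specific: bool   # does it name concrete events rather than generalities?
    notes: str = ""  # free-text comments on subtext, structure, etc.


def generate(model: str, prompt: str) -> str:
    """Placeholder for a real API call to the named model."""
    raise NotImplementedError("wire up the appropriate client here")


def summarize_story(model: str, story_text: str) -> str:
    prompt = f"Summarize the following short story:\n\n{story_text}"
    return generate(model, prompt)


def error_rate(ratings: list[AuthorRating], model: str) -> float:
    """Fraction of a model's summaries the author flagged as unfaithful."""
    mine = [r for r in ratings if r.model == model]
    return sum(not r.faithful for r in mine) / len(mine) if mine else 0.0
```

Keeping the stories unpublished, as the study does, guarantees none of them appear in the models' training data, so any loop like this measures genuine interpretation rather than recall.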
Using both quantitative and
qualitative methods informed by narrative theory, the analysis revealed that
all three models generated faithfulness errors in over 50% of cases and
consistently struggled with specificity and the interpretation of complex subtext
or nonlinear narrative structures.
"Models can seem like they
understand a story, but their outputs are ultimately unpredictable because they
rely on probabilities," said Melanie Subbiah, the lead author of the
paper and a sixth-year Ph.D. student in the McKeown lab at Columbia.
"A trained human literary
analyst would produce consistently strong insights, but even the best model is
only about 50/50—essentially a coin flip—in giving a reliable analysis for any
given story."
The findings underscore the
limitations of current LLMs in intellectual and creative contexts that demand
close reading and interpretive sensitivity.
While such systems can serve as
useful tools, the researchers caution against relying on them for nuanced
literary analysis or other tasks requiring deep contextual understanding.
Subbiah believes their work "reinforces the value of human-centered, expert-informed
evaluation."
Beyond the empirical findings
Ethical considerations were
integral to the study. Participating authors were provided full transparency
regarding the use of their work and feedback, were compensated for their
contributions, and had their intellectual property carefully protected. The project
deliberately focused on narrative understanding and analysis rather than text
generation, reflecting "a commitment to responsible and respectful
research practices."
The project presents a novel
methodology for evaluating language models on content that is guaranteed to be
absent from their training data.
By working directly with domain experts (in this case, professional authors), the study
demonstrates an approach that enables more reliable assessment of a model's
interpretive and analytical capabilities. This framework offers a replicable
model for future research on narrative understanding and other forms of
expert-driven evaluation.
"The hope is that expert human insight will guide how we evaluate LLMs, keeping people at the center of technology development," said Subbiah.
Source: Can AI understand literature? Researchers put it to the test
