
New publication

01.06.2026

AI agents deliver results – but do they reason scientifically?

A research team co-led by Kevin Maik Jablonka from the Helmholtz Institute for Polymers in Energy Applications Jena (HIPOLE Jena) and N. M. Anoop Krishnan from the Indian Institute of Technology Delhi has developed Corral, a new benchmark for AI agents in science. The preprint “AI scientists produce results without reasoning scientifically” has been published on arXiv (https://doi.org/10.48550/arXiv.2604.18805). The analysis shows that current systems can execute scientific workflows and deliver results; however, they often do not follow the basic principles of scientific testing and reasoning.

Artificial intelligence is expected not only to write texts or analyse data, but also to plan scientific experiments, analyse results and generate new knowledge. But when can an AI system truly be said to be doing science? Is it enough for the final result to be correct – or must the path to that result also meet scientific standards? These questions are addressed in a new preprint by Jablonka’s team.

With Corral, the researchers developed a benchmark that evaluates AI-based scientific agents not only by their results, but also by how they arrive at them. To do this, the team analysed more than 25,000 agent runs across eight scientific domains – ranging from molecular simulations and materials data analysis to spectroscopic structure elucidation and hypothesis-driven chemical tests. The evaluation examined not only whether a task was solved, but also whether the systems took evidence into account, generated and tested hypotheses, and revised their assumptions when confronted with contradictory results.

“We need to be clearer about what kind of scientific reasoning we expect from such AI systems,” says Jablonka. “When it comes to epistemic rigor, better training procedures may help. But in areas where we need reliable guarantees about the reasoning process, we will probably need different systems – for example, systems with symbolic and formally verifiable components.”

Helmholtz AI has already published a detailed background article on the preprint: https://www.helmholtz.ai/detail/do-ai-scientists-actually-do-science-new-benchmark-probes-the-reasoning-behind-the-results-featuring-dr-kevin-maik-jablonka-helmholtz-ai-associate/