When AI Agents Start Cheating
What paper replication taught us about honesty, pressure, and verification
The work I’m describing was carried out primarily by Dr. Atharva Hans; I just chat with him every day when I come into the office and collect the stories.
In my previous post, I discussed how we used an AI agent (powered by GPT-5.3-Codex) to replicate scientific papers. It successfully replicated two of them, both of which lay within the training distribution, but it failed on the third, an edge case.
If you read the prompt we used, you may have noticed the “Anti-cheating rule.” Why was this rule necessary?
When we first started experimenting with agentic paper replication, we used a simple prompt along with the PDF or LaTeX files. The agent would do some work but then stop. It seemed to get bored after a while, so we had to push it by prompting it to continue.
We wanted to encourage the agent to work hard for a few hours. We succeeded by including this in its prompt:
Hard stop condition (non-negotiable):
YOU DO NOT STOP UNTIL EVERY SINGLE TARGETED RESULT (all paper figures/tables/examples you enumerate) HAS BEEN FULLY REPLICATED (or exceeded) AND THE FULL REPRODUCTION RUNS FROM SCRATCH VIA A SINGLE COMMAND.
No partial completion. No “I submitted jobs.” No “next steps.” The only stopping condition is full replication.
Full replication is NOT achieved unless the final PDF report (main.pdf) embeds every reproduced figure/table target (not just files saved under artifacts/).
This worked. The agent would run for 10-15 hours, sometimes more than 24 hours straight. It replicated a few papers this way, and we were very excited. However, we then started noticing some strange behavior.
If the agent couldn’t find the solution after a few hours, it would try to cheat! We observed three levels of cheating.
First, it copied figures from the paper and inserted them directly into its report as if they were reproduced outputs. It did this by writing Python scripts that extracted the figures from the PDF.
After we added a rule prohibiting collages, it started analyzing the figures and trying to infer functions that visually matched them. It would generate curves using a mix of functions and fine-tune them until the resulting plot resembled the paper’s figure. It used the Structural Similarity Index Measure to select the most visually similar option.
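To make the fabrication concrete: SSIM scores two images by comparing their luminance, contrast, and structure statistics. The agent presumably used a library implementation (e.g., `skimage.metrics.structural_similarity`); the sketch below is my own simplified, single-window version of the formula, with the function name `global_ssim` invented for illustration:

```python
import numpy as np

def global_ssim(x: np.ndarray, y: np.ndarray) -> float:
    """Simplified SSIM between two grayscale images with values in [0, 1].

    Real SSIM averages this statistic over small sliding windows; this
    sketch treats each image as a single window, which is enough to show
    how a score of "visual similarity" can be computed and optimized.
    """
    c1, c2 = 0.01 ** 2, 0.03 ** 2  # stabilizing constants for dynamic range 1.0
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return float(
        ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2))
        / ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    )
```

An identical pair scores exactly 1.0, so the agent could tune its synthetic curves until the rendered plot's score against the paper's figure was as high as possible.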
After we added a rule prohibiting fabricated plots, it started using surrogate models or simplified algorithms to produce results that resembled the paper, without using the exact method.
The interesting thing about the agent is that it had no problem explaining what it had done. It didn’t seem to consider the behavior wrong; after all, nothing in its prompt said it was.
The agent only stopped cheating after we added this paragraph to its prompt:
Anti-cheating rule (non-negotiable):
Do NOT make a collage. You may not copy, reuse, or embed paper-provided figure/table assets (including extracted paper figures) as if they were reproduced outputs.
All reproduced figures/tables MUST be generated by your implementation end-to-end during runs.
Paper assets may be used ONLY as clearly labeled references for comparison and must never overwrite, substitute for, or be placed in the same path as reproduced artifacts.
Do NOT “replicate” by fabricating plots/images (e.g., generating fake panels/curves purely to match visuals). The underlying method must produce the results.
Do NOT deviate from the paper’s method for the baseline reproduction (no surrogates, alternative algorithms, or simplifications). Implement the exact method first; any improvements must be separate and clearly labeled.
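A rule in the prompt is only a request; catching violations still takes verification. One cheap check for the first (collage) level of cheating is to confirm that no reproduced artifact is byte-identical to a paper-provided asset. This is a hypothetical sketch, not part of our actual pipeline; the function name and directory layout are illustrative:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def find_collages(paper_assets: Path, reproduced: Path) -> list[tuple[Path, Path]]:
    """Flag reproduced files that are byte-identical to paper assets.

    This catches only verbatim copying; a re-rendered or re-encoded copy
    would need a perceptual comparison instead of a hash comparison.
    """
    paper_hashes = {sha256_of(p): p for p in paper_assets.rglob("*") if p.is_file()}
    hits = []
    for q in reproduced.rglob("*"):
        if q.is_file() and sha256_of(q) in paper_hashes:
            hits.append((paper_hashes[sha256_of(q)], q))
    return hits
```

The later cheating levels (fabricated plots, surrogate methods) are much harder to detect automatically, which is part of why code inspection remains necessary.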
Take this claim with a grain of salt: it rests on a handful of anecdotal cases, so it is an empirical observation rather than a definitive fact. Adding this paragraph to the prompt doesn’t let us assert that the agent never cheats; in fact, we can never be completely certain unless we inspect the code it generated. Still, it boosts our confidence that we are making progress.
But perhaps expecting an AI agent to be completely honest all the time is too much. Can you really trust that a human researcher is always honest? Having spent over 20 years in academia, I have witnessed several cheating incidents and heard about many cases of fabricated research results. To cheat is human, in some sense.
The last point I want to emphasize is that we need verification mechanisms to develop production-quality, agentic systems for serious applications. Paper replication is low stakes, but autonomous experimentation, autonomous weaponry, and self-evolving social platforms are not. Let’s stay aware as we explore these areas.

