AI Detection Startup Uncovers 100 Fake Citations in Prestigious NeurIPS Conference Papers
By admin | Jan 21, 2026 | 2 min read
AI detection firm GPTZero recently analyzed all 4,841 papers accepted at the prestigious Conference on Neural Information Processing Systems (NeurIPS), held last month in San Diego. Acceptance at NeurIPS is a notable career milestone in AI research, and because the authors are leading AI experts, it is hardly surprising that some would lean on large language models for tedious tasks like formatting citations. Several important caveats, however, apply to GPTZero’s findings.
The scan identified 100 confirmed hallucinated citations spread across 51 papers, a small number in context. Each paper includes dozens of references, so among the tens of thousands of citations in the full set, the fabricated ones amount to a tiny fraction. It is also crucial to recognize that an inaccurate citation does not automatically invalidate a paper’s core research. As NeurIPS stated, “Even if 1.1% of the papers have one or more incorrect references due to the use of LLMs, the content of the papers themselves [is] not necessarily invalidated.”
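To put those numbers in perspective, here is a minimal back-of-the-envelope sketch in Python. The figure of roughly 40 references per paper is an illustrative assumption on our part, not a statistic reported by GPTZero or NeurIPS.

```python
# Rough check of the rates discussed above.
papers_scanned = 4841          # papers accepted at NeurIPS, per GPTZero's scan
papers_flagged = 51            # papers with at least one hallucinated citation
hallucinated_citations = 100   # confirmed hallucinated citations found
refs_per_paper_estimate = 40   # ASSUMPTION: illustrative references-per-paper figure

paper_rate = papers_flagged / papers_scanned
estimated_total_refs = papers_scanned * refs_per_paper_estimate
citation_rate = hallucinated_citations / estimated_total_refs

print(f"Share of papers affected:            {paper_rate:.1%}")    # ~1.1%
print(f"Estimated share of citations affected: {citation_rate:.3%}")  # ~0.05%
```

The roughly 1.1% share of affected papers matches the figure NeurIPS cites in its statement, and under the assumed reference count the share of individual citations affected is far smaller still.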
That said, a fabricated citation is not without consequence. NeurIPS emphasizes its commitment to “rigorous scholarly publishing in machine learning and artificial intelligence,” and each paper undergoes peer review with explicit instructions to flag hallucinations. Citations also function as a form of academic currency, reflecting a researcher’s influence and contributing to career advancement. When AI generates false references, it diminishes their value.
Peer reviewers can hardly be blamed for missing a handful of AI-generated citations given the overwhelming volume of material. GPTZero highlights this point, explaining that the analysis aimed to provide concrete data on how AI-generated inaccuracies infiltrate research through what it calls “a submission tsunami,” which has “strained these conferences’ review pipelines to the breaking point.” The startup also references a May 2025 paper titled “The AI Conference Peer Review Crisis,” which examined this issue at top conferences like NeurIPS.
Still, a question remains: why didn’t the researchers themselves verify the accuracy of the LLM’s output? Presumably, they know which sources they actually used. Ultimately, this situation reveals a striking irony: if the world’s foremost AI experts—with their reputations on the line—cannot guarantee precision in their own use of language models, what does that imply for everyone else?