I’ve been running an informal study for the past month using a set of 40 images, a mix of AI-generated and human-created, across different styles and subjects. I ran this set through four different AI image detection tools and then also asked a group of colleagues (academics, not imaging specialists) to make their best guess with no tool assistance.
The results were humbling.
The detection tools disagreed with each other on 31% of the images. That’s not a small number. On those contested images, the tools’ confidence scores also varied significantly. One tool flagging an image as 94% likely AI-generated while another returns 38% on the same image is not useful information. It’s noise that looks like signal.
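For anyone wanting to run a similar comparison, here is a minimal sketch of one way a disagreement rate could be computed. It assumes each tool returns a 0-to-1 "AI likelihood" score per image and that an image counts as contested when the tools do not all land on the same side of a cutoff. The 0.5 cutoff, the tool names, and the scores are illustrative placeholders, not data from the study, and other definitions of disagreement (for example, pairwise) would give different numbers.

```python
from typing import Dict, List


def disagreement_rate(scores: Dict[str, List[float]], threshold: float = 0.5) -> float:
    """Fraction of images on which the tools do not all give the same label.

    `scores` maps a tool name to its per-image AI-likelihood scores (0 to 1).
    The threshold is an assumption; real tools may document their own cutoffs.
    """
    per_tool = list(scores.values())
    n_images = len(per_tool[0])
    if any(len(s) != n_images for s in per_tool):
        raise ValueError("every tool must score the same set of images")

    contested = 0
    for i in range(n_images):
        # Collapse each tool's score to a binary label for image i.
        labels = {tool_scores[i] >= threshold for tool_scores in per_tool}
        if len(labels) > 1:  # at least one tool disagrees with the rest
            contested += 1
    return contested / n_images


def score_spread(scores: Dict[str, List[float]]) -> List[float]:
    """Per-image gap between the highest and lowest score across tools."""
    return [max(column) - min(column) for column in zip(*scores.values())]


# Hypothetical scores for five images from four made-up tools.
example_scores = {
    "tool_a": [0.94, 0.10, 0.80, 0.55, 0.20],
    "tool_b": [0.38, 0.15, 0.90, 0.60, 0.30],
    "tool_c": [0.70, 0.05, 0.85, 0.45, 0.25],
    "tool_d": [0.60, 0.20, 0.75, 0.70, 0.10],
}

rate = disagreement_rate(example_scores)
spreads = score_spread(example_scores)
```

Reporting the per-image spread alongside the rate helps separate near-threshold disagreements from cases like the 94% versus 38% split.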
My colleagues, the human group, performed at roughly chance level on photographic images but noticeably better on illustrations and digital art, where stylistic cues were apparently more recognizable. The tools showed the inverse pattern, performing somewhat better on photographs than on stylized images.
What this suggests to me is that neither human judgment nor current detection tools are reliable enough to use as sole arbiters of authenticity. They should probably be used together, and even then with significant caution about what claims you’re willing to make based on the output.
The implications for publishing and academic contexts are real. If you’re using an image detection tool to make decisions about submitted work, you should understand what the tool actually tells you and what it doesn’t. A detection result is a probability estimate, not a determination.
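One way to keep a score from being read as a verdict is to reserve an explicit "inconclusive" band. The sketch below is only an illustration of that idea: the function name `triage` and the 0.2 and 0.8 cutoffs are arbitrary assumptions, not values recommended by any tool.

```python
def triage(score: float, low: float = 0.2, high: float = 0.8) -> str:
    """Map an AI-likelihood score to a coarse, non-final category.

    The cutoffs are illustrative assumptions; anything between them is
    treated as inconclusive rather than forced into a yes/no answer.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if not low < high:
        raise ValueError("low must be smaller than high")

    if score >= high:
        return "likely_ai"
    if score <= low:
        return "likely_human"
    return "inconclusive"
```

Under these assumed cutoffs, a 94% score would be flagged for follow-up, while 38% and 72% would both fall into the inconclusive band and go to human review.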
Has anyone else been running systematic comparisons? Curious whether the disagreement rates I’m seeing hold up across different datasets.
This matches what I’d expect and what I’ve seen anecdotally. The honest answer is that detection across all modalities is a probabilistic problem, and people consistently treat it as a binary one. A tool that returns 72% AI likelihood is being asked to do the work of a judgment it was never designed to deliver on its own. Your framing here is exactly right.
The disagreement rate across tools is interesting from a technical angle. These models are trained on different datasets with different labeling approaches, so disagreement isn’t surprising. What would concern me more is if any of the tools is systematically wrong in a consistent direction. Consistently underdetecting or consistently overdetecting is arguably worse than random noise.
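Since the test set has known labels, directional bias can be checked per tool by comparing how often it flags human-made images against how often it misses AI-generated ones. This is a rough sketch under assumptions: the 0.5 cutoff, the function name `error_rates`, and the example numbers are all hypothetical.

```python
from typing import List, Tuple


def error_rates(
    scores: List[float], is_ai: List[bool], threshold: float = 0.5
) -> Tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate) for one tool.

    False positive: a human-made image scored at or above the threshold.
    False negative: an AI-generated image scored below the threshold.
    """
    if len(scores) != len(is_ai):
        raise ValueError("scores and labels must have the same length")

    false_pos = sum(1 for s, ai in zip(scores, is_ai) if s >= threshold and not ai)
    false_neg = sum(1 for s, ai in zip(scores, is_ai) if s < threshold and ai)
    n_human = sum(1 for ai in is_ai if not ai)
    n_ai = sum(1 for ai in is_ai if ai)

    # Guard against a test set that lacks one of the two classes.
    fpr = false_pos / n_human if n_human else 0.0
    fnr = false_neg / n_ai if n_ai else 0.0
    return fpr, fnr


# Hypothetical tool that leans toward overdetecting.
labels = [True, True, True, False, False, False]
tool_scores = [0.90, 0.80, 0.70, 0.60, 0.55, 0.20]
fpr, fnr = error_rates(tool_scores, labels)
```

A large gap between the two rates, in either direction, would point to the kind of consistent skew described above rather than random noise. With a set of 40 images the estimates would be rough, so small gaps should not be over-read.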
The stylized versus photographic split you found is interesting. I wonder if part of what’s happening is that AI-generated art has visually colonized certain aesthetic spaces so thoroughly that human-created art in those styles now ‘reads’ as AI even when it isn’t. The training data problem goes both directions.
For content work, the practical implication is that I can’t rely on any single tool result and can’t stake a professional claim on the output. I’ve stopped trying to definitively identify AI images and moved to disclosure-first workflows where the provenance question is answered before publication rather than investigated after the fact.
I’m in real estate, and AI-generated property images are becoming a real issue in listing photography. The detection tools I’ve tried don’t perform well enough to flag listings reliably, which means it mostly falls on agents with good pattern recognition. Your study makes me feel less alone about how inconsistent detection is.