AI art detectors: tested five of them on the same 50 images

For a class project I put together a test set of 50 images: 25 AI-generated across different models and styles, 25 human-created across photography, illustration, and digital art. I ran all 50 through five different AI art detectors and recorded the results. Here’s what the data showed.

Overall accuracy: the five tools ranged from 61% to 79% correct classification. That’s meaningfully better than chance (50% on this balanced set) but not reliable enough to treat any tool’s output as a verdict. Even the best-performing tool was wrong 21% of the time.

Agreement between tools: the five tools agreed on 58% of the images. For the remaining 42%, at least one tool diverged from the majority classification. On some images, tools were evenly split.
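For anyone replicating, the agreement figure can be computed along these lines. This is a minimal sketch, not the code used for this post: the function name, the 'ai'/'human' label strings, and the toy votes are all made up for illustration, and it assumes every tool returns a hard label for every image.

```python
def agreement_summary(votes_per_image):
    """Summarize how often all tools gave the same label.

    votes_per_image: one list per image, holding one label per tool,
    e.g. ["ai", "ai", "human", "ai", "human"]. Labels are hypothetical.
    """
    total = len(votes_per_image)
    # An image counts as "agreed" only when every tool gave the same label.
    unanimous = sum(1 for votes in votes_per_image if len(set(votes)) == 1)
    return {
        "unanimous_rate": unanimous / total,
        "split_rate": (total - unanimous) / total,
    }


# Toy data for four images and five tools (NOT the results from the post).
toy_votes = [
    ["ai", "ai", "ai", "ai", "ai"],
    ["human", "human", "human", "human", "human"],
    ["ai", "ai", "ai", "human", "human"],
    ["ai", "human", "human", "human", "human"],
]
summary = agreement_summary(toy_votes)
print(summary)
```

Note that this measures agreement between tools, not correctness: five tools can agree and all be wrong.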

Style effects: all tools performed substantially better on photorealistic AI images than on stylized or illustrative content. For the photography subset, average accuracy was around 82%. For stylized digital art, it dropped to 67%. Human-created digital art in styles that overlap with common AI aesthetics was the hardest category for every tool.

False positive rate: 18% of human-created images were flagged as AI by at least one tool. 11% were flagged by a majority of tools. For images in styles that AI generators commonly produce, that rate went up significantly.
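The two false positive numbers use different thresholds, which is easy to blur. Here is a rough sketch of the distinction, assuming each human-created image has one hard 'ai'/'human' label per tool. The function name and the toy votes are hypothetical, not the data from this test.

```python
def false_positive_rates(human_image_votes):
    """Return (any_tool_rate, majority_rate) over human-created images.

    human_image_votes: one list per human-created image, with one
    'ai' or 'human' label per tool. Every 'ai' label here is a false positive.
    """
    n = len(human_image_votes)
    # Flagged by at least one tool.
    any_flag = sum(1 for votes in human_image_votes if "ai" in votes)
    # Flagged by more than half of the tools.
    majority_flag = sum(
        1 for votes in human_image_votes if votes.count("ai") > len(votes) / 2
    )
    return any_flag / n, majority_flag / n


# Toy votes for four human-created images and five tools (illustrative only).
toy_human_votes = [
    ["human", "human", "human", "human", "human"],
    ["ai", "human", "human", "human", "human"],
    ["ai", "ai", "ai", "human", "human"],
    ["human", "human", "human", "human", "human"],
]
any_rate, majority_rate = false_positive_rates(toy_human_votes)
print(any_rate, majority_rate)
```

The "at least one tool" rate can only be equal to or higher than the majority rate, so which one matters depends on whether a single flag is enough to trigger a review.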

My takeaway: these tools are useful for generating hypotheses, not conclusions. A classification from any single tool should be treated as ‘worth looking at more carefully,’ not ‘confirmed AI.’ Anyone using them to make consequential decisions about authorship needs to understand that the error rates are high enough to matter.

Happy to share the full methodology if anyone wants to replicate.

This is methodologically sound and the results are consistent with what I’ve seen in text detection. The false positive rate on human-created work in AI-adjacent styles is the finding that institutions should be paying attention to. Any policy that relies on these tools as determinative has to account for an error rate in the 11-18% range on plausible human work.

The 79% accuracy ceiling on the best tool means you’re looking at roughly one wrong classification in five under optimal conditions. At the scale most content operations run, that’s a meaningful error volume. The tools are useful as one input into a workflow. They’re not useful as a final answer.
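To put the scale point in numbers, here is a back-of-the-envelope sketch. The 79% figure is from the post. The 10,000-item volume and the function name are hypothetical, and it assumes the accuracy measured on 50 images carries over to a real content mix, which it may not.

```python
def expected_misclassifications(n_items, accuracy):
    """Expected number of wrong calls if accuracy holds across n_items."""
    # Rounded to a whole item count; this is an expectation, not a guarantee.
    return round(n_items * (1 - accuracy))


# 79% is the best tool's accuracy from the post; 10_000 is an assumed volume.
errors = expected_misclassifications(10_000, 0.79)
print(errors)
```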

The ‘generates hypotheses not conclusions’ framing is exactly right and it applies to text detection tools too. The confusion happens when people treat a probability estimate as a verdict. Your data makes the gap between those two things very concrete.

The stylized art result tracks with what I’d expect from a creative standpoint. Certain aesthetics have been colonized by AI generation to the point where human artists working in those styles are going to get flagged systematically. That’s a real harm to human creators in specific communities and it’s not being discussed enough.

The 42% disagreement between tools on individual images is striking. That’s not a small margin. In my editorial context, a panel of five experts that failed to agree on 42% of cases would not be considered a usable instrument. It’s worth being explicit about what that number means for anyone using these tools to make decisions.