For a class project I put together a test set of 50 images: 25 AI-generated across different models and styles, 25 human-created across photography, illustration, and digital art. I ran all 50 through five different AI art detectors and recorded the results. Here’s what the data showed.
Overall accuracy: across the five tools, correct classification ranged from 61% to 79%. That’s meaningfully better than chance, but not reliable enough to use as a binary classifier. Even the best-performing tool was wrong 21% of the time.
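For anyone who wants to compute the same kind of per-tool accuracy on their own test set, here is a minimal sketch. It is not the script used for this project: the tool names, the "ai"/"human" label strings, and the four-image toy data are all made up for illustration.

```python
def accuracy(predictions, labels):
    """Fraction of images where a tool's verdict matches the true label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one image")
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical toy data: true labels and each tool's verdict per image.
labels = ["ai", "ai", "human", "human"]
tool_outputs = {
    "tool_a": ["ai", "ai", "human", "ai"],
    "tool_b": ["ai", "human", "human", "ai"],
}

# One accuracy figure per tool, so the range across tools can be reported.
per_tool = {name: accuracy(preds, labels) for name, preds in tool_outputs.items()}
print(per_tool)
print("range:", min(per_tool.values()), "to", max(per_tool.values()))
```

Reporting the min and max across tools, rather than one pooled number, keeps a weak tool from hiding behind a strong one.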
Agreement between tools: all five tools agreed on 58% of the images. For the remaining 42%, at least one tool diverged from the majority classification. On some images, the tools were split almost evenly.
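Agreement can be measured in more than one way, so it helps to pin down the definition. The sketch below assumes "agreed" means all five tools returned the same verdict, and that each tool returns exactly one of two labels. The function names and toy votes are hypothetical, not taken from the project data.

```python
from collections import Counter


def vote_summary(votes):
    """Summarize one image's verdicts: majority label and number of dissenters."""
    counts = Counter(votes)
    majority_label, majority_count = counts.most_common(1)[0]
    return {
        "majority": majority_label,
        "unanimous": majority_count == len(votes),
        "dissenters": len(votes) - majority_count,
    }


def unanimous_rate(votes_per_image):
    """Fraction of images on which every tool gave the same verdict."""
    unanimous = sum(len(set(votes)) == 1 for votes in votes_per_image)
    return unanimous / len(votes_per_image)


# Hypothetical toy data: one list of five verdicts per image.
votes_per_image = [
    ["ai", "ai", "ai", "ai", "ai"],
    ["ai", "ai", "ai", "human", "human"],
    ["human", "human", "human", "human", "human"],
    ["human", "human", "human", "human", "ai"],
]

print(unanimous_rate(votes_per_image))
print(vote_summary(votes_per_image[1]))
```

With an odd number of tools and two possible labels there is always a strict majority, so the closest possible split is three against two.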
Style effects: all tools performed substantially better on photorealistic AI images than on stylized or illustrative content. For the photography subset, average accuracy was around 82%. For stylized digital art, it dropped to 67%. Human-created digital art in styles that overlap with common AI aesthetics was the hardest category for every tool.
False positive rate: 18% of human-created images were flagged as AI by at least one tool, and 11% were flagged by a majority of tools. For images in styles that AI generators commonly produce, that rate was noticeably higher.
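The two false positive figures use different thresholds (any tool versus a majority of tools), so they are worth computing separately. This is a rough sketch under the same assumptions as before: two label strings, one verdict per tool per image, and made-up toy data rather than the project's results.

```python
def false_positive_rates(votes_per_image, labels):
    """Return (any_tool_rate, majority_rate) over human-created images only."""
    human_votes = [v for v, t in zip(votes_per_image, labels) if t == "human"]
    if not human_votes:
        raise ValueError("need at least one human-created image")
    # Flagged by at least one tool.
    any_flag = sum("ai" in votes for votes in human_votes)
    # Flagged by more than half of the tools.
    majority_flag = sum(votes.count("ai") > len(votes) / 2 for votes in human_votes)
    n = len(human_votes)
    return any_flag / n, majority_flag / n


# Hypothetical toy data: four human-created images and one AI image.
labels = ["human", "human", "human", "human", "ai"]
votes_per_image = [
    ["human", "human", "human", "human", "human"],
    ["human", "ai", "human", "human", "human"],
    ["ai", "ai", "ai", "human", "human"],
    ["human", "human", "human", "human", "human"],
    ["ai", "ai", "ai", "ai", "ai"],
]

any_rate, majority_rate = false_positive_rates(votes_per_image, labels)
print(any_rate, majority_rate)
```

AI-generated images are excluded from the denominator, since a false positive is only defined for human-created work.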
My takeaway: these tools are useful for generating hypotheses, not conclusions. A classification from any single tool should be treated as ‘worth looking at more carefully,’ not ‘confirmed AI.’ Anyone using them to make consequential decisions about authorship needs to understand that the error rates are high enough to matter.
Happy to share the full methodology if anyone wants to replicate.