Blind Test With Faculty Panel: 5 Human Essays vs 5 ChatGPT Essays — Could We Reliably Tell Before Checking Originality.ai?

We ran a small internal experiment last month.

Ten short argumentative essays.
Five written by actual undergraduates.
Five generated using ChatGPT from the same prompts.

We removed names, formatted everything identically, and asked six faculty members to label each as “Human” or “AI” before running anything through Originality.ai.

Before revealing results, we asked them to explain what they were looking for.

Common answers:

  • Overly balanced paragraph structure
  • Repetitive transition phrases
  • Safe, non-controversial claims
  • Lack of specific lived detail

After the blind vote, we checked the results.

Confidence was high. Accuracy wasn’t.

Two human essays were labeled AI by a majority.
One ChatGPT essay was confidently labeled human by four out of six reviewers.

When we later ran the essays through Originality.ai, one of the human essays also scored an unexpectedly high AI probability.

The takeaway wasn’t that detectors are useless. It was that both human intuition and detection software have real limitations.

The human vs AI writing comparison is getting harder, not easier.

Has anyone else run structured blind tests like this?
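
If anyone wants to try a similar tally, below is a minimal sketch of the scoring we did by hand: per-reviewer accuracy and the panel's majority label for each essay. Every essay ID and vote in it is a placeholder value, not our actual data.

```python
from collections import Counter

# Placeholder data only; our real labels and votes are not reproduced here.
essays = ["E1", "E2", "E3", "E4", "E5", "E6", "E7", "E8", "E9", "E10"]
truth = ["human"] * 5 + ["ai"] * 5  # E1-E5 human, E6-E10 AI (illustrative ordering)

# One list of blind-vote labels per reviewer, aligned with `essays`.
votes = {
    "R1": ["human", "ai", "human", "human", "ai", "ai", "human", "ai", "ai", "ai"],
    "R2": ["ai", "ai", "human", "human", "human", "ai", "human", "ai", "ai", "human"],
    "R3": ["human", "ai", "ai", "human", "human", "ai", "human", "ai", "human", "ai"],
    "R4": ["human", "human", "ai", "human", "ai", "ai", "human", "ai", "ai", "ai"],
    "R5": ["ai", "ai", "human", "ai", "human", "human", "human", "ai", "ai", "ai"],
    "R6": ["human", "ai", "human", "human", "human", "ai", "human", "human", "ai", "ai"],
}

# Per-reviewer accuracy against the placeholder ground truth.
for reviewer, labels in votes.items():
    correct = sum(lab == t for lab, t in zip(labels, truth))
    print(f"{reviewer}: {correct}/{len(essays)} correct")

# Majority label per essay, flagging where the panel as a group gets it wrong.
for i, essay in enumerate(essays):
    tally = Counter(labels[i] for labels in votes.values())
    majority, _ = tally.most_common(1)[0]
    flag = "" if majority == truth[i] else "  <-- panel majority wrong"
    print(f"{essay}: truth={truth[i]}, majority={majority}, votes={dict(tally)}{flag}")
```

Nothing fancy, but putting the votes in a table like this made the gap between stated confidence and actual accuracy much harder to ignore.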

What stands out to me is the confidence gap.

People often believe they can “just tell” based on AI writing patterns. But perception isn’t proof.

As generative tools improve, the stylistic markers of AI text become less obvious, especially when the prompts are detailed.

The risk is overconfidence leading to premature accusations.

There’s also convergence happening.

Students are subconsciously adopting AI-like structure — clean symmetry, clear topic sentences, controlled tone — because that’s what high-scoring writing often looks like.

So when we perform a human vs AI writing comparison, we’re not comparing two distant categories anymore. The distributions overlap.

That makes detector limitations inevitable.

As a teacher, this is exactly why I avoid relying on a single signal.

If both humans and Originality.ai can misclassify texts, then no single metric should trigger disciplinary action.

Process matters. Prior writing samples matter. Context matters.

Blind tests like this are valuable because they expose our assumptions.

From an editorial perspective, I look for friction.

AI-generated essays often read smoothly from start to finish. Very little hesitation. Very few rough edges.

But I’ll admit — some of the strongest junior writers I’ve worked with also produce that kind of controlled prose.

Which makes me cautious about treating smoothness as evidence.