How to check if writing is AI-generated -- what methods are people actually using beyond the obvious tools?

I’ve been on the Academic Integrity Committee at my school for two semesters, and I want to share what I’ve learned about detection approaches, because I think a lot of people are over-relying on tools when the human methods are sometimes more reliable.

Automated tools are the starting point for most people, but they have real limitations – high false-positive rates for certain writing styles, inconsistency across platforms, and a tendency to produce a score without explaining what triggered it. I use them as a first pass, not a conclusion.

What I’ve found more useful:

Comparison against in-class writing. If a student has submitted three pieces in my class, I have a baseline for their voice, vocabulary range, and structural habits. A sudden, unexplained shift in any of these is a signal worth examining.

Asking the student to talk about their work. This is the most reliable check I have. A student who wrote something can talk about it – the choices they made, what they were trying to say, where they got stuck. A student who generated something often can’t go deeper than the surface of the text.

Looking for the absence of specific examples. AI tends to make claims with generic support. Real student writing, even when it’s bad, usually has specific examples that are particular to their experience, their class discussion, their reading. The generic-example problem is one of the most consistent tells.

None of these are perfect. But unlike the automated tools, they don’t produce false positives just because a student is an international student or writes in a formal style.

The baseline comparison approach is the one I keep coming back to as the most defensible. If you have prior work from a student, you have something to compare against. The shift matters more than the score: a single piece scoring 40% tells me less than a student who has never scored above 15% suddenly submitting something at 60%.
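
To make the "shift, not score" idea concrete, here is a minimal sketch of how you might track it if you keep a simple history of per-student detector scores. The 20-point jump threshold and the data shapes are illustrative assumptions, not a recommendation for an actual policy.

```python
# Flag the *change* from a student's own baseline, not the raw score.
# The jump_threshold value is an illustrative assumption.

def baseline_shift(prior_scores, new_score, jump_threshold=20):
    """Return True when new_score is unusually high relative to this student's history."""
    if not prior_scores:          # no baseline yet, so nothing to compare against
        return False
    baseline = max(prior_scores)  # highest detector score seen from this student before
    return new_score - baseline >= jump_threshold

# The example from above: a student who has never scored above 15% suddenly at 60%
print(baseline_shift([8, 12, 15], 60))   # True  -> worth a closer human look
print(baseline_shift([35, 42, 38], 40))  # False -> 40% is within this student's normal range
```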

The conversation approach is what I’d use for any case where I was considering a formal process. You can’t action a number. You can action a conversation where a student cannot demonstrate understanding of their own submission.

The generic-example tell is something I’ve observed in my own field. A literature review written by a researcher contains specific papers, specific debates, specific methodological disputes. A literature review that summarizes the general state of a field without engaging any specific sources at depth reads differently even before you run a tool on it.

The tool score and the reading intuition often point the same direction but not always. When they diverge, I trust the reading.

the conversation approach is the cleanest and the least scalable. at 30 students it’s feasible. at 300 it isn’t. that’s the real structural problem – the methods that actually work require time and a relationship with the student that most instructors don’t have for every submission.

that’s not an argument against using those methods. it’s an argument that the institutions deploying detection tools as a solution to an at-scale problem aren’t actually solving it – they’re outsourcing a judgment call to a classifier and calling it a system.

from a student perspective the methods you’re describing are much less stressful than automated detection because they’re actually about understanding, not pattern matching. if i wrote something i can talk about it. if the test is “can you engage with your own work in a conversation” that’s a test i’m not afraid of.

automated detection is stressful specifically because you can be flagged for something you can’t control – your writing style, your writing register – and the appeal process is opaque. the human methods at least feel like a fair test of the actual question.

The scale problem is real and I don’t have a complete answer for it. What I’ve done in my own class is triage: use tools as a quick sort to identify submissions that might warrant closer attention, then apply human methods to that subset rather than to everything. It’s not a perfect system but it’s more proportionate than running every submission through a full review.
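
For what it’s worth, the triage pass can be as simple as sorting by score and only spending human time on the top of the pile. This is just a sketch of that idea under stated assumptions – the data shape and the fraction reviewed are made up for illustration, and the score never decides the outcome, only the order of attention.

```python
# Sketch of the triage workflow: tool scores only order the pile;
# the human methods (baseline comparison, a conversation) decide.

submissions = [
    {"student": "A", "tool_score": 12},
    {"student": "B", "tool_score": 62},
    {"student": "C", "tool_score": 45},
    {"student": "D", "tool_score": 18},
]

REVIEW_FRACTION = 0.25  # assumption: closer review for roughly the top quarter

by_score = sorted(submissions, key=lambda s: s["tool_score"], reverse=True)
closer_look = by_score[:max(1, int(len(by_score) * REVIEW_FRACTION))]

for s in closer_look:
    print(f"Student {s['student']}: compare against prior work, then talk to them")
```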

The honest answer is that there’s no low-effort, high-reliability detection method. The methods that are reliable require time. Institutions should be honest about that rather than implying that a tool score is sufficient.