I tested how to reduce AI detection scores in my writing -- here's what actually moved the needle

i want to be clear about what this is and isn't. this is not a thread about gaming a system. it's a record of which editing changes actually affected detection scores on my own content, with the goal of understanding what the tools are measuring.

i write B2B content. i use AI for drafts. my clients’ internal review processes sometimes include running content through detection tools, so i have a practical need to understand this.

i ran the same 300-word draft through a detection tool before and after each of the following single edits, resetting to the original between each test (a sketch of that loop follows the results):

  1. Broke 4 compound sentences into 6 shorter ones: score dropped 18 points
  2. Added 2 contractions where none existed: score dropped 4 points
  3. Replaced 3 generic transition phrases (“furthermore,” “in addition”) with no transition or a shorter one: score dropped 11 points
  4. Added one specific example with a real number: score dropped 9 points
  5. Cut the opening meta-sentence (“In this piece, I will…”): score dropped 7 points
  6. Added one intentional fragment for emphasis: score dropped 6 points

Combined edit applying all of the above: score dropped from 71% to 22%.
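
for anyone who wants to replicate this, here's the shape of the loop, as a rough sketch. `detector_score` is a hypothetical placeholder, not a real API -- swap in whatever tool you use (or a manual copy-paste step). each edit is modeled as a text-to-text function:

```python
# sketch of the single-variable loop: each edit is applied to a fresh
# copy of the original, so effects never stack across tests.

def detector_score(text: str) -> float:
    """Hypothetical placeholder: return the tool's 0-100 AI-likelihood score."""
    raise NotImplementedError("plug your detection tool in here")

def run_tests(original: str, edits: dict) -> dict:
    """edits maps a label to a text -> text function applying one edit."""
    baseline = detector_score(original)
    drops = {}
    for label, apply_edit in edits.items():
        # reset: always start from the original, never the last edited version
        drops[label] = baseline - detector_score(apply_edit(original))
    # combined pass: all edits applied in sequence
    combined = original
    for apply_edit in edits.values():
        combined = apply_edit(combined)
    drops["combined"] = baseline - detector_score(combined)
    return drops
```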

the most impactful single change was breaking compound sentences into shorter ones. the least impactful was contractions. the specific example mattered more than i expected.

none of this tells you anything definitive about all tools or all content types. it’s one test on one tool on one piece. but it suggests the tools are heavily weighted toward sentence length variance and transition pattern recognition.

This is the most systematic breakdown of this question I have seen shared publicly. The sentence-splitting result makes sense: sentence length variance is one of the cleaner signals these tools can use, and it's relatively easy to measure. The specific-example result is interesting because it suggests the tools are picking up on something semantic, not just syntactic.
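
If you want to eyeball that signal on your own text, a crude proxy is just the spread of sentence lengths; human prose tends to be burstier. This is a naive punctuation split of my own, not what any detector actually computes:

```python
import re
from statistics import mean, pstdev

def sentence_length_stats(text: str) -> tuple[float, float]:
    """Crude burstiness proxy: mean and stdev of sentence lengths in words."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    return mean(lengths), pstdev(lengths)

flat = "The tool works well. The tool is fast too. The tool saves time daily."
bursty = "It works. Fast, too, which surprised me given how much it does in one pass."
print(sentence_length_stats(flat))    # low stdev: uniform sentence lengths
print(sentence_length_stats(bursty))  # higher stdev: varied lengths
```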

The combined edit result is the important one. A 49-point drop from editing changes that also improve the writing quality is not a hack. It's a demonstration that good editing and lower scores move in the same direction. Worth noting that the six individual drops sum to 55 points while the combined edit moved 49, so the edits overlap in what they fix rather than stacking cleanly.

The transition phrase result stands out to me. "Furthermore" and "in addition" are such strong AI tells at this point that removing them is almost table stakes for anyone doing this seriously. The question is what you replace them with: nothing, a shorter connector, a structural break, the next thought starting where the last one ended. Each of those choices affects both the score and the quality differently.
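
Flagging these in your own drafts is easy to script. The phrase list below is my own off-the-cuff guess, not anything pulled from a detection tool:

```python
import re

# stock transitions that read as AI tells; list is a personal guess
STOCK_TRANSITIONS = [
    "furthermore", "in addition", "moreover", "additionally",
    "it is important to note", "in conclusion", "overall",
]

def flag_transitions(text: str) -> list[str]:
    """Return each stock transition found at the start of a sentence."""
    hits = []
    for phrase in STOCK_TRANSITIONS:
        # sentence-initial: start of text or right after ., !, ?
        pattern = r"(?:^|[.!?]\s+)(" + re.escape(phrase) + r")\b"
        hits += re.findall(pattern, text, flags=re.IGNORECASE)
    return hits

draft = "Furthermore, the rollout was smooth. In addition, costs fell."
print(flag_transitions(draft))  # ['Furthermore', 'In addition']
```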

the contractions result being the weakest is interesting because that’s often the first thing people recommend for humanizing AI writing. your data suggests it matters much less than sentence structure changes.

to be fair, contractions might matter more for some content types or some tools. but the hierarchy you found (structure first, then specificity, then surface-level word choices) seems intuitively right to me.
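
it's also the easiest edit to mechanize, which may be why it gets recommended first. a minimal sketch, with an obviously partial mapping of my own; blind replacement can misfire, so treat it as a starting point for a manual pass:

```python
import re

# partial mapping, my own; extend as needed
CONTRACTIONS = {
    "do not": "don't", "does not": "doesn't", "it is": "it's",
    "that is": "that's", "you are": "you're", "cannot": "can't",
}

def contract(text: str) -> str:
    for full, short in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(full) + r"\b", short, text,
                      flags=re.IGNORECASE)
    return text

print(contract("It is clear the team does not need this."))
# -> "it's clear the team doesn't need this."
# note: the naive replace also drops the capital I, so review by hand
```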

the methodological approach here is actually clean: isolating single variables rather than making multiple changes at once. that's how you'd design this as an experiment, and it's more rigorous than most of the "here's what I did" posts on this topic.

the obvious caveat is single tool, single piece, single pass. would be interesting to replicate across different tools and content types. the sentence-structure findings might not hold the same way for technical or academic writing, where fragments and very short sentences are stylistically unusual.

good caveat on the single-tool problem. i'd be curious whether the hierarchy holds across tools or whether different tools are weighted toward different signals. the tools i've tested informally do seem to vary in which edits move them most.

anyway. do with this what you will. the main takeaway i'd stand behind: the edits that moved the score most were also the edits that made the writing better. if those two goals are mostly aligned, "detection vs. quality" might be the wrong frame.