Study reveals AI detection tools may unfairly target non-native English writers in academic publishing. (CREDIT: CC BY-SA 4.0)
Artificial intelligence has transformed how scholars write, think, and share knowledge. In late 2022, OpenAI released ChatGPT, quickly followed by Google’s Bard, now known as Gemini. Within months, these large language models (LLMs) became everyday tools. People used them to brainstorm ideas, edit drafts, clean data, and even write full paragraphs for academic papers.
Many researchers embraced this technology. For non-native English speakers, LLMs offered a lifeline. English dominates the academic world. Journals demand polished writing, often forcing authors to pay for costly editing services. LLMs became a cheaper, faster alternative, helping scholars improve clarity and style while saving money.
However, this rapid adoption raised ethical questions. Some authors copied and pasted AI text without saying so. Others listed AI as a co-author, sparking heated debates about responsibility and originality. Eventually, journals agreed that LLMs can’t be authors but can assist with language editing if used transparently.
Despite this clarity, not everyone discloses AI use. Some think it’s unnecessary for grammar edits. Others fear stigma, worrying their work will seem unoriginal if AI is involved.
Examples of AI text detection results by GPTZero. (CREDIT: PeerJ Computer Science)
The Problem With AI Detection Tools
As AI-generated writing spread, detection tools emerged to catch undisclosed use. Schools, publishers, and reviewers wanted to ensure academic honesty. Tools like GPTZero, ZeroGPT, and DetectGPT claim to spot AI-written text with high accuracy.
Yet a new study published in PeerJ Computer Science reveals a darker side to these tools. Titled "The Accuracy-Bias Trade-Offs in AI Text Detection Tools and Their Impact on Fairness in Scholarly Publication," the paper shows that these tools often misclassify human writing, especially writing that has been polished with AI assistance.
Researchers found that high accuracy doesn’t mean fairness. Ironically, the tool with the best overall accuracy showed the greatest bias against certain groups. Non-native English speakers were hit hardest. Their abstracts were flagged as AI-generated more often, despite being original or only lightly edited.
“This study highlights the limitations of detection-focused approaches and urges a shift toward ethical, responsible, and transparent use of LLMs in scholarly publication,” the research team noted.
Inside the Study
The team wanted to answer three questions:
- How accurate are AI detection tools with human, AI, and AI-assisted texts?
- Is there a trade-off between accuracy and fairness?
- Do certain groups face disadvantages?
They tested popular tools using abstracts from peer-reviewed articles. The dataset included 72 abstracts from three fields: technology and engineering, social sciences, and interdisciplinary studies. Authors came from native English-speaking countries like the US, UK, and Australia, as well as countries where English is neither official nor widely spoken.
Examples of AI text detection results by ZeroGPT. (CREDIT: PeerJ Computer Science)
The researchers generated AI versions of these abstracts using ChatGPT o1 and Gemini 2.0 Pro Experimental. They also created AI-assisted versions by running original abstracts through these models to improve readability without changing meaning.
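The paper does not reproduce its exact prompts, but the general workflow is simple to picture. The sketch below is a hypothetical illustration of producing an "AI-assisted" abstract using the official openai Python client; the prompt wording and the model ID are assumptions for demonstration, not the study's actual configuration.

```python
# Hypothetical sketch of producing an "AI-assisted" abstract.
# The prompt text and model ID are illustrative assumptions,
# not the configuration used in the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ai_assist(abstract: str, model: str = "o1") -> str:
    """Ask the model to improve readability without changing the meaning."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": (
                "Improve the readability and clarity of the following abstract "
                "without changing its meaning, claims, or technical content:\n\n"
                + abstract
            ),
        }],
    )
    return response.choices[0].message.content
```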
Key Findings
The first test compared human-written abstracts with AI-generated ones. Here, detection tools performed best, because the distinction between the two categories is sharpest. Metrics included:
- Accuracy: How often the tool classified abstracts correctly.
- False positive rate: How often human abstracts were wrongly labeled as AI.
- False negative rate: How often AI texts were missed.
- False accusation rate: Percentage of human-written abstracts flagged as AI-generated at least once.
- Majority false accusation rate: Percentage of human-written abstracts flagged as AI-generated more often than they were classified correctly.
Even in this clear-cut test, non-native speakers faced higher false accusation rates.
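To make these definitions concrete, here is a minimal Python sketch of how such metrics could be computed from a detector's per-abstract scores. The 0.5 decision threshold and the idea of multiple detection runs per abstract are illustrative assumptions, not the study's exact setup.

```python
# Minimal sketch: computing the classification metrics described above from
# binary detector verdicts. The 0.5 threshold and repeated runs per abstract
# are illustrative assumptions, not the study's actual configuration.

def detection_metrics(human_scores, ai_scores, threshold=0.5):
    """human_scores / ai_scores: one list per abstract, each holding the
    AI-probability scores (0..1) returned across repeated detection runs."""
    # A score above the threshold counts as "flagged as AI".
    human_flags = [[s > threshold for s in runs] for runs in human_scores]
    ai_flags = [[s > threshold for s in runs] for runs in ai_scores]

    total = sum(len(r) for r in human_flags) + sum(len(r) for r in ai_flags)
    correct = (sum(not f for r in human_flags for f in r)
               + sum(f for r in ai_flags for f in r))

    accuracy = correct / total
    fpr = sum(f for r in human_flags for f in r) / sum(len(r) for r in human_flags)
    fnr = sum(not f for r in ai_flags for f in r) / sum(len(r) for r in ai_flags)

    # False accusation rate: share of human abstracts flagged at least once.
    far = sum(any(r) for r in human_flags) / len(human_flags)
    # Majority false accusation rate: flagged in more than half of its runs.
    mfar = sum(sum(r) > len(r) / 2 for r in human_flags) / len(human_flags)

    return {"accuracy": accuracy, "false_positive_rate": fpr,
            "false_negative_rate": fnr, "false_accusation_rate": far,
            "majority_false_accusation_rate": mfar}

# Example with made-up scores for three human and three AI abstracts:
print(detection_metrics(
    human_scores=[[0.1, 0.2], [0.7, 0.4], [0.9, 0.8]],
    ai_scores=[[0.95, 0.85], [0.6, 0.3], [0.8, 0.9]],
))
```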
Examples of AI text detection results by DetectGPT. (CREDIT: PeerJ Computer Science)
The second test examined AI-assisted texts, where human writing was enhanced by AI. This hybrid text is common in real life but poses challenges for detectors. Metrics included:
- Summary statistics: Distribution of AI detection scores.
- Under-Detection Rate (UDR): How often AI-assisted texts were marked as purely human.
- Over-Detection Rate (ODR): How often they were flagged as fully AI-written.
Detection tools struggled here. Many AI-assisted texts were labeled as 100% AI-generated, disregarding human effort. This creates real risks for scholars using AI responsibly.
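As a rough illustration, the sketch below computes UDR and ODR from a list of per-abstract AI-probability scores. The cut-offs for "purely human" and "fully AI-written" are assumptions chosen for demonstration, not values taken from the paper.

```python
# Minimal sketch: Under-Detection Rate (UDR) and Over-Detection Rate (ODR)
# for AI-assisted abstracts, based on each detector's AI-probability score.
# The cut-offs (<= 0.1 for "purely human", >= 0.9 for "fully AI-written")
# are illustrative assumptions, not thresholds reported in the study.

def assisted_text_rates(scores, human_cutoff=0.1, ai_cutoff=0.9):
    """scores: one AI-probability score (0..1) per AI-assisted abstract."""
    n = len(scores)
    udr = sum(s <= human_cutoff for s in scores) / n   # marked as purely human
    odr = sum(s >= ai_cutoff for s in scores) / n      # flagged as fully AI-written
    return {"UDR": udr, "ODR": odr}

# Example with made-up scores for eight AI-assisted abstracts:
print(assisted_text_rates([0.05, 0.95, 1.0, 0.4, 0.08, 0.92, 0.7, 1.0]))
```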
The Impact on Non-Native Authors
Historically, non-native English speakers have faced barriers in academic publishing. Professional editing costs are high. LLMs help bridge this gap, offering near-instant language improvement at minimal cost.
Density plots of AI-generated probability scores for AI-assisted abstracts from each AI text detection tool by author status. (CREDIT: PeerJ Computer Science)
However, if journals use AI detectors to police writing, these same authors may be unfairly targeted. Their improved writing style, aided by AI, looks “too perfect,” triggering false positives. This could mean more rejections or accusations of dishonesty, harming their careers.
Different academic disciplines also face risks. Humanities and social sciences rely on nuanced, interpretive language. AI models and detection tools, trained largely on more straightforward prose, may misread such texts, reinforcing biases against certain fields.
Furthermore, LLMs tend to reproduce patterns in their training data. This risks amplifying existing inequalities by promoting uniform language and ideas while silencing diverse voices.
Beyond Detection: A Call for Change
The study emphasizes that detection tools alone can’t solve ethical issues around AI in writing. The tools operate as black boxes. They don’t explain why they classify a text as AI or human. This lack of transparency makes it hard to challenge their decisions.
Density plots of AI-generated probability scores for AI-assisted abstracts from each AI text detection tool by discipline. (CREDIT: PeerJ Computer Science)
Moreover, the line between human and AI writing is becoming blurred. Researchers may write drafts themselves, use AI for edits, then revise again manually. Others co-write entire sections with AI input. Detection tools struggle to assess these real-world practices accurately.
The team urges journals, universities, and policymakers to rethink their reliance on AI detectors. Ethical guidelines should encourage honest disclosure while recognizing the benefits of AI, especially for non-native speakers. Blanket bans or harsh detection policies may do more harm than good.
Moving Forward
AI tools will continue evolving. The study used the most advanced models available in late 2024, but newer versions will emerge. Detection tools must adapt, but fairness should remain central.
The authors call for more research into biases in AI detection and how they affect underrepresented groups. They also recommend creating standards for responsible AI use in academia, balancing integrity with equity.
For now, it’s clear that AI detection is not a magic solution. It is another tool, with strengths and flaws. To build a fair academic system, human judgment, transparency, and inclusivity matter just as much as technology.