Outrageous! Latest research: 61% of English essays written by Chinese speakers are judged AI-generated by ChatGPT detectors

Shared by Xi Xiaoyao Technology Talk
Source | Xinzhiyuan

After ChatGPT became popular, people found all sorts of uses for it. Some ask it for life advice, some simply treat it as a search engine, and some use it to write papers. Papers... who actually enjoys writing those?

Several universities in the United States have banned students from using ChatGPT on assignments and have rolled out a pile of software to detect whether a submitted essay was generated by GPT. And herein lies the problem: some people write badly enough that the grading AI decides the text was written by a fellow AI. Worse still, English essays written by Chinese speakers are judged AI-generated as often as 61% of the time.


What... what is this supposed to mean? Trembling in fear!

Non-native speakers are not worthy?

Generative language models are developing rapidly and have genuinely advanced digital communication, but they also invite plenty of abuse. Researchers have proposed many detection methods to distinguish AI-generated content from human writing, yet the fairness and robustness of these detectors leave much to be desired.

To probe this, the researchers evaluated several widely used GPT detectors on writing by native and non-native English authors. The detectors consistently misclassified samples written by non-native speakers as AI-generated, while samples by native speakers were mostly identified correctly. Furthermore, the researchers showed that a few simple prompting strategies can both mitigate this bias and let AI-generated content slip past the detectors entirely.

What does this mean? It means GPT detectors look down on authors with limited language proficiency, which is infuriating. It reminds me of those "human or AI?" guessing games: if your opponent is a real person but you guess AI, the system scolds you with "the other party might find that offensive."

Not complex enough = AI generated?

The researchers collected 91 TOEFL essays from a Chinese education forum and pulled 88 essays written by American eighth graders from a dataset of the Hewlett Foundation in the United States, then ran both sets through seven widely used GPT detectors.

[Figure: false positive rates of the seven GPT detectors on US eighth-grade essays versus TOEFL essays from the Chinese forum]

The percentages in the chart are "false positive" rates: essays written by humans that the detection software nonetheless labels AI-generated. The gap between the two groups is stark. Across the seven detectors, the highest misjudgment rate on the American eighth graders' essays is only 12%, and two detectors misjudged none at all. For the TOEFL essays from the Chinese forum, the misjudgment rates are mostly above half, peaking at 76%. Eighteen of the 91 TOEFL essays were unanimously labeled AI-generated by all seven detectors, and 89 of the 91 were misjudged by at least one.
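For clarity, the "false positive rate" here is just the share of human-written essays that a detector flags as AI. A toy sketch in Python, with invented flag values rather than the study's data:

```python
# Toy illustration of the false-positive-rate metric quoted above.
# The flag values are invented for this example, not the study's data.

def false_positive_rate(flags: list[bool]) -> float:
    """Share of human-written texts that a detector flagged as AI-generated."""
    return sum(flags) / len(flags)

# Suppose a detector flagged 6 of 8 human-written TOEFL essays:
toefl_flags = [True, True, False, True, True, True, True, False]
print(f"FPR: {false_positive_rate(toefl_flags):.0%}")  # FPR: 75%
```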

[Figure: text complexity of the 18 unanimously misjudged TOEFL essays versus the remaining essays]

The figure above shows that the TOEFL essays misjudged by all seven detectors are significantly lower in complexity than the rest. This confirms the conclusion from the start: GPT detectors carry a bias against authors with limited linguistic range. The researchers therefore argue that GPT detectors should be fed more writing by non-native speakers; broader samples can wear the bias down.

Next, the researchers threw the non-native speakers' TOEFL essays into ChatGPT with instructions to enrich the language and imitate the word choices of native speakers. As a control group, the American eighth graders' essays were also fed to ChatGPT, this time to simplify the language and imitate non-native writing. The figure below shows the verdicts after the rewrite.

[Figure: detector false positive rates after the ChatGPT rewrites]

The situation flips dramatically. The misjudgment rate on the language-enriched TOEFL essays plummets, bottoming out at 3% and topping out at 19%, nothing like the 50%-plus rates before. Conversely, the misjudgment rate on the eighth graders' essays skyrockets; one detector gets as many as 99% of them wrong, essentially all of them. Why? Because the complexity of the writing, the variable in play, has changed.
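What might that rewrite step look like in code? A minimal sketch, assuming the v1-style openai Python client; the model name is a guess and the prompts are paraphrased from the paper's description, not exact quotes:

```python
# Minimal sketch of the ChatGPT rewrite step. Model name and prompt
# wording are assumptions, not the study's exact setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rewrite(text: str, instruction: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed; the study used ChatGPT (GPT-3.5)
        messages=[{"role": "user", "content": f"{instruction}:\n\n{text}"}],
    )
    return resp.choices[0].message.content

toefl_essay = "..."  # placeholder: a human-written TOEFL essay
us_essay = "..."     # placeholder: a human-written eighth-grade essay

# Enrich a non-native essay to sound like a native speaker:
enriched = rewrite(toefl_essay, "Enhance the word choices to sound more like that of a native speaker")
# Control group: simplify a native essay to mimic non-native word choices:
simplified = rewrite(us_essay, "Simplify word choices as if written by a non-native speaker")
```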


Here the researchers reach a conclusion: what non-native speakers write tends to be less idiomatic and lower in complexity, and is therefore easily misjudged. That exposes a problem that is technical but also a question of values: is it reasonable, comprehensive, or rigorous to decide between AI and human authorship on the basis of complexity? Clearly not. Judged by complexity, non-native speakers lose badly, simply because they are non-native speakers (well, obviously).
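To see why "low complexity" gets a text flagged, here is a bare-bones sketch of a perplexity-style detector, the kind of signal such tools are commonly said to rely on. GPT-2 as the scoring model and the threshold value are illustrative assumptions, not any vendor's actual setup:

```python
# Bare-bones perplexity-style detector. Model choice and threshold are
# illustrative assumptions, not a real product's implementation.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Average cross-entropy of predicting each token from its left
        # context; exponentiating gives the perplexity of the text.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def looks_ai_generated(text: str, threshold: float = 40.0) -> bool:
    # Low perplexity means "too predictable", so the text gets flagged
    # as AI. Plainly worded non-native essays fall on exactly this side.
    return perplexity(text) < threshold
```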

AI polish = written by humans??

The researchers argue that boosting linguistic diversity not only reduces the bias against non-native speakers but also lets GPT-generated content slip past GPT detectors. To prove the point, they took US college application essay prompts from the 2022-2023 season, fed them into ChatGPT-3.5, and generated 31 fake application essays. The detectors did fine on the first round but collapsed on the second: the researchers threw the essays back into ChatGPT for a polish, elevating the text with some literary language, and detector accuracy dropped from 100% to 0%. The figure after the sketch below shows the result.
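Reusing the hypothetical rewrite() helper from earlier, the two-round pipeline might look like this; the polishing instruction is paraphrased from the paper's description, and the essay topic is a stand-in:

```python
# Two-round pipeline, reusing the rewrite() helper sketched earlier.
essay_prompt = "Describe a challenge you overcame and what it taught you."  # stand-in topic

# Round 1: a plain ChatGPT-3.5 essay; the detectors caught these reliably.
draft = rewrite(essay_prompt, "Write a US college application essay responding to this prompt")

# Round 2: ChatGPT polishes its own output with literary language. After
# this step, detection accuracy in the study fell from 100% to 0%.
polished = rewrite(draft, "Elevate the provided text by employing literary language")
```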

[Figure: detector accuracy on the 31 generated essays, before and after polishing]

The complexity of the polished essays rose accordingly.

[Figure: text complexity of the generated essays, before and after polishing]

At the same time, the researchers took the topics of 145 final project reports from Stanford University and asked ChatGPT to generate abstracts for them.

[Figure: detector accuracy on the generated abstracts across polishing rounds]

Once the abstracts were polished, detector accuracy kept sliding. The researchers again concluded that polished text, despite being entirely AI-generated, is easily misjudged as human-written, and that two rounds of polish beat one.
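Looping the idea gives the "more rounds, fewer catches" pattern. A sketch using the hypothetical rewrite() and perplexity() helpers from above, with the topic and round count as stand-ins:

```python
# Repeated self-polish rounds; perplexity is expected to climb each time,
# pushing the text past a perplexity-threshold detector.
abstract = rewrite("Detecting AI-generated text under domain shift",  # stand-in topic
                   "Write a 150-word project abstract on this topic")

for round_no in (1, 2):
    abstract = rewrite(abstract, "Elevate the provided text by employing literary language")
    print(f"round {round_no}: perplexity = {perplexity(abstract):.1f}")
```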

GPT detectors? Still need more training


All in all, the various GPT detectors still seem to miss the most essential difference between AI generation and human writing. Human writing comes in many grades as well, and judging by complexity alone is simply not sound. Bias aside, the technology itself is in dire need of improvement.


References

[1] https://arxiv.org/pdf/2304.02819


Originally published at blog.csdn.net/xixiaoyaoww/article/details/130538082