A Beginner's Notes: Simple Text Similarity Detection and Plagiarism Identification

Foreword

The purpose of this article is to record my experience, as a beginner, with a hands-on practice project. It is aimed mainly at other beginners: the concepts and techniques introduced are fairly basic and are meant to provide some ideas for solving a practical problem (don't get too attached to the particular concepts and algorithms used here; the details can be done better and swapped for more advanced, cutting-edge techniques). I will focus on the technical details that I think deserve attention.

Notes

  1. The sample data in this article cannot be shared. If necessary, please crawl it yourself.
  2. Basic operations are not discussed here; if necessary, please refer to the relevant documentation.
  3. Related concepts: TF-IDF, Naive Bayes, k-means clustering.

Problem Description

Suppose you work at a news agency in China and you discover that other media outlets are plagiarizing articles from your platform. You have been given a task: find the articles by other media that are suspected of plagiarism, compare them with the originals, and locate the plagiarized passages.

Solution process

1. Data cleaning

We first read the data into a DataFrame named news; the fields are as follows:

id author source content feature title url
89617 NaN Fast technology In addition, since this week (June 12), except for 15 models such as Xiaomi Mi 6, the rest of the models have been suspended for update release (including development version / {"type":"科技","site":"cnbeta","commentNum":"37"... The first batch of Xiaomi MIUI 9 models exposed: a total of 15 models http://www.cnbeta.com/articles/tech/623597.htm
89616 NaN Fast technology The Snapdragon 835 is the only ARM processor certified by the Windows 10 desktop platform. {"type":"科技","site":"cnbeta","commentNum":"15"... Snapdragon 835's performance on Windows 10 is expected to improve http://www.cnbeta.com/articles/tech/623599.htm
89613 Hu Shuli_MN7479 Shenzhen event (Original title: A 44-year-old woman was refused a date by a netizen in Shenzhen, and ran naked in a rainstorm...)\r\n@Shenzhen Traffic Police Weibo said: Yesterday the clear .. {"type":"News","site":"Netease Popular","commentNum":"978",.. The 44-year-old woman asked a netizen to be rejected by the traffic police in the torrential rain http://news.163.com/17/0618/00/CN617P3Q0001875...

We will train the model on the content field, so first look at the samples where content is NaN. There are not many of them, so they can simply be dropped.

#show nans in the dataset
news[news.content.isna()].head(5)
#drop the nans
news = news.dropna(subset=['content'])

Then define a simple function (using jieba for word segmentation) to segment content. Before segmentation it removes whitespace and Chinese punctuation; after segmentation it filters out stop words. Here punctuation is a string containing Chinese punctuation marks, and stopwords is a list of stop words (ready-made lists are easy to find online and can be edited as needed). This is just one feasible approach; if you see room for improvement, you don't have to do it this way. For example, you could use part-of-speech tagging to keep only the words you want, apply phrase detection, or even represent the text with word2vec.

import re
import jieba

def split_text(text):
    # Strip whitespace and punctuation, segment with jieba, then drop stop words
    text = re.sub(r'\s|[%s]' % punctuation, '', text)
    return ' '.join(w for w in jieba.cut(text) if w not in stopwords)
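For the function above to run, punctuation and stopwords need to exist already. A minimal sketch of one way to prepare them (the exact punctuation string and the stopwords.txt file name are assumptions; use whatever list you downloaded):

# Common Chinese punctuation to strip before segmentation (extend as needed)
punctuation = '。，、；：？！“”‘’（）《》〈〉【】—…·～'
# Load a stop word list, one word per line (the file name is an assumption)
with open('stopwords.txt', encoding='utf-8') as f:
    stopwords = set(line.strip() for line in f if line.strip())

Using a set instead of a list also makes the "w not in stopwords" check faster.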

A quick test of the function looks like this:

split_text(news.iloc[1].content)
#out:
'''骁龙 835 唯一 Windows10 桌面 平台 认证 ARM 处理器 高通 强调 不会 只 考虑 性能 屏蔽掉 小 核心 相反 正 联手 微软 找到 一种 适合 桌面 平台 兼顾 性能 功耗 完美 方案 报道 微软 已经 拿到 一些 源码 Windows10 更好 理解 big little 架构 资料 显示 骁龙 835 一款 集成 CPUGPU 基带 蓝牙 Wi Fi SoC 传统 Wintel 方案 节省 至少 30% PCB 空间 按计划 今年 Q4 华硕 惠普 联想 首发 骁龙 835Win10 电脑 预计 均 二合一 形态 产品 当然 高通 骁龙 未来 也许 见到 三星 Exynos 联发科 华为 麒麟 小米 澎湃 进入 Windows10 桌面 平台'''

Now you can apply the function to the entire content column. The pandas method is shown here; in the full code example I use a more pythonic approach.

news['content_split'] = news['content'].apply(split_text)
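As a simple alternative to .apply (just an illustration, not necessarily what the full code example does), a plain list comprehension gives the same result:

news['content_split'] = [split_text(text) for text in news['content']]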

We can create labels in a similar way (here I treat articles whose source contains the word 新华 (Xinhua) as positive examples):

news['is_xinhua'] = np.where(news['source'].str.contains('新华'), 1, 0)

At this point, our data cleaning work is complete! :D

2. Data preprocessing

To use a machine learning algorithm, we have to convert the text into a form the algorithm can understand. Here we use sklearn to build a TF-IDF matrix to represent the text. TF-IDF is a simple and effective text representation; if you are not familiar with it, please follow the link.
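In its classic form (sklearn's implementation adds smoothing and length normalization, so the exact numbers differ slightly), the weight of a term t in a document d is

tfidf(t, d) = tf(t, d) * log(N / df(t))

where tf(t, d) is how often t appears in d, N is the total number of documents, and df(t) is the number of documents containing t. Terms that are frequent in one document but rare across the corpus get the highest weights.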

from sklearn.feature_extraction.text import TfidfVectorizer

tfidfVectorizer = TfidfVectorizer(encoding='gb18030', min_df=0.015)
tfidf = tfidfVectorizer.fit_transform(news['content_split'])

When creating the TfidfVectorizer, note the encoding parameter (the default is utf-8). Here min_df=0.015 means that, when building the vocabulary, terms whose document frequency is below this threshold are ignored. I set this because my machine cannot handle too many features; if you have enough computing resources you can instead set max_features=30000, which keeps the 30,000 most frequent terms as features (the columns of the TF-IDF matrix) and gives a better model.
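For reference, the max_features variant mentioned above would look like this (assuming you have the memory for it):

tfidfVectorizer = TfidfVectorizer(encoding='gb18030', max_features=30000)
tfidf = tfidfVectorizer.fit_transform(news['content_split'])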

3. Train the prediction model

Before training the model we need to split the data into a training set (70%) and a test set (30%).

#split the data
from sklearn.model_selection import train_test_split

label = news['is_xinhua'].values
X_train, X_test, y_train, y_test = train_test_split(tfidf.toarray(), label, test_size=0.3, random_state=42)

Now you can train your model with Naive Bayes!

from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(X=X_train, y=y_train)

Now, how do we know whether the model fits well? You can use cross-validation and output the metrics you care about. Here I compute precision, recall, accuracy, and F1 with 3-fold cross-validation (in practice you should choose metrics according to the problem at hand; if you don't know these metrics, be sure to look them up), and then compare the results with the performance on the test set.

from sklearn.model_selection import cross_validate

scores = cross_validate(clf, X_train, y_train,
                        scoring=('precision', 'recall', 'accuracy', 'f1'),
                        cv=3, return_train_score=True)
print(scores)
#out:
'''{'fit_time': array([0.51344204, 0.43621135, 0.40280986]),
 'score_time': array([0.15626907, 0.15601063, 0.14357495]),
 'test_precision': array([0.9599404 , 0.96233543, 0.96181975]),
 'train_precision': array([0.96242476, 0.96172716, 0.96269257]),
 'test_recall': array([0.91072205, 0.91409308, 0.90811222]),
 'train_recall': array([0.91286973, 0.91129295, 0.91055894]),
 'test_accuracy': array([0.88475361, 0.88981883, 0.88415715]),
 'train_accuracy': array([0.88883419, 0.88684308, 0.88706462]),
 'test_f1': array([0.93468374, 0.93759411, 0.9341947 ]),
 'train_f1': array([0.93699249, 0.93583104, 0.9359003 ])}'''
 
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_predict = clf.predict(X_test)

def show_test_result(y_true, y_pred):
    print('accuracy:', accuracy_score(y_true, y_pred))
    print('precision:', precision_score(y_true, y_pred))
    print('recall:', recall_score(y_true, y_pred))
    print('f1_score:', f1_score(y_true, y_pred))

show_test_result(y_test, y_predict)
#out:
'''
accuracy: 0.8904162040050542
precision: 0.9624150339864055
recall: 0.9148612694792855
f1_score: 0.9380358534684333
'''

First, look at the cross-validation results. The metrics across the 3 folds are close to each other and fairly stable, and the test-set results are very similar to the CV results, which suggests the model fits reasonably well. With more features, the accuracy on this data can be pushed close to 1.

So far, we have built a model that, given a text, predicts whether it comes from our news platform. Now we can use it to locate plagiarized articles.

4. Locate plagiarized articles

At this point, we can run the model over the full corpus (or over newly added text; in production you may want to wrap the steps in a pipeline, sketched briefly below). Articles that are predicted positive but are actually labeled negative are texts whose style is so similar to our platform's writing that the model misclassifies them; these are likely to be plagiarized texts or direct quotations of the originals. First, take out these "candidates".
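One way to bundle vectorization and classification for new text is a sklearn Pipeline. The sketch below is only an illustration under the same settings as above (new_text is a placeholder), not the pipeline from the full code example:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# fit the vectorizer and classifier together so new text goes through the same transform
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(encoding='gb18030', min_df=0.015)),
    ('clf', MultinomialNB()),
])
pipe.fit(news['content_split'], news['is_xinhua'])

new_text = '...'  # placeholder for a newly crawled article
print(pipe.predict([split_text(new_text)]))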

prediction = clf.predict(tfidf.toarray())

labels = np.array(label)

compare_news_index = pd.DataFrame({'prediction': prediction, 'labels': labels})

# predicted as our platform's style but actually from another source: plagiarism candidates
copy_news_index = compare_news_index[(compare_news_index['prediction'] == 1) & (compare_news_index['labels'] == 0)].index

# articles that really are from our platform (the originals)
xinhuashe_news_index = compare_news_index[(compare_news_index['labels'] == 1)].index

Now we need to compare these suspected plagiarized texts with the originals and pull out the most similar pairs for further analysis. A brute-force search with two nested loops is already O(n^2), which is far too inefficient.

So we need a more efficient way to search for similar text; here I use k-means clustering (there are certainly better methods, and you can improve on this). First cluster all texts with k-means. That gives an id-to-cluster mapping, from which we build a cluster-to-id mapping. Then, given a specific text, we know which cluster it belongs to and only compare it with the other texts in that cluster to find the top-n most similar ones, which greatly reduces the search space.

from collections import defaultdict
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans

# scale each row of the TF-IDF matrix to unit length so Euclidean k-means behaves like cosine
normalizer = Normalizer()
scaled_array = normalizer.fit_transform(tfidf.toarray())

kmeans = KMeans(n_clusters=25, random_state=42, n_jobs=-1)
k_labels = kmeans.fit_predict(scaled_array)

# id -> cluster for every article
id_class = {index: class_ for index, class_ in enumerate(k_labels)}

# cluster -> ids, keeping only the original (Xinhua) articles
class_id = defaultdict(set)
for index, class_ in id_class.items():
    if index in xinhuashe_news_index.tolist():
        class_id[class_].add(index)

Note that sklearn's k-means only supports Euclidean distance, while for text-to-text similarity we generally use cosine distance. So before running k-means we normalize each row of the TF-IDF matrix to unit length (unit norm). After normalization, Euclidean distance becomes a monotonic function of cosine distance (why? see here), so clustering with Euclidean distance effectively measures similarity by cosine distance.
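Concretely, for unit-length vectors a and b,

||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b = 2 (1 - cos(a, b))

so the squared Euclidean distance grows exactly as the cosine similarity falls, and k-means on the normalized rows ranks points the same way cosine distance would.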

Another point worth mentioning is the choice of the number of clusters (n_clusters); here I simply cluster into 25 classes. In practice you can choose the number based on your knowledge of the data: for example, if you know your data covers sports, military, entertainment and so on, you can pick the number from experience, provided you really are familiar with the data. Another approach is to look for the elbow in indicators such as SSE (inertia) or the silhouette score and choose the number accordingly; there are detailed examples here.
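A rough sketch of that second approach, reusing the scaled_array from above (the candidate k values and the silhouette sample size are arbitrary choices to keep it fast):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# compare SSE (inertia) and silhouette score for a few candidate cluster counts
for k in (10, 15, 20, 25, 30):
    km = KMeans(n_clusters=k, random_state=42).fit(scaled_array)
    sil = silhouette_score(scaled_array, km.labels_, sample_size=2000, random_state=42)
    print(k, km.inertia_, sil)

The k where the inertia curve bends (the "elbow"), or where the silhouette peaks, is a reasonable choice.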

Now we can use the clustering results to search for similar texts:

from sklearn.metrics.pairwise import cosine_similarity

def find_similar_text(cpindex, top=10):
    # compare the candidate only with the originals in the same cluster
    dist_dict = {i: cosine_similarity(tfidf[cpindex], tfidf[i]) for i in class_id[id_class[cpindex]]}
    # cosine_similarity returns a 1x1 array, hence the x[1][0] sort key
    return sorted(dist_dict.items(), key=lambda x: x[1][0], reverse=True)[:top]
    
print(copy_news_index.tolist())

#randomly choose a candidate to show some results
fst = find_similar_text(3352)
print(fst)
#out:
'''
 id   , cosine_similarity 
[(3134, array([[0.96849349]])),
 (63511, array([[0.94619604]])),
 (29441, array([[0.94281928]])),
 (3218, array([[0.87620818]])),
 (980, array([[0.87535143]])),
 (29615, array([[0.86922775]])),
 (29888, array([[0.86194742]])),
 (64046, array([[0.85277668]])),
 (29777, array([[0.84882241]])),
 (64758, array([[0.73406445]]))]
'''

After finding similar texts, we can go one level finer: split each text into sentences according to some rule (a fixed length, a specific delimiter; here I simply split on the Chinese full stop "。"), compute the edit distance between each sentence of the suspect text and each sentence of the original, then sort by distance to locate the specific plagiarized passages.

import editdistance

def find_similar_sentence(candidate, raw):
    similist = []
    cl = candidate.strip().split('。')
    ra = raw.strip().split('。')
    # compute edit distance for every sentence pair, then sort ascending
    for c in cl:
        for r in ra:
            similist.append([c, r, editdistance.eval(c, r)])
    sort = sorted(similist, key=lambda x: x[2])
    for c, r, ed in sort:
        if c != '' and r != '':
            print('Suspected plagiarized sentence: {0}\nSimilar original sentence: {1}\nEdit distance: {2}\n'.format(c, r, ed))

find_similar_sentence(news.iloc[3352].content, news.iloc[3134].content)

Summary

This article mainly provides a framework of ideas for solving a practical problem: it decomposes a real plagiarism detection task into a text classification problem plus a similar-text search problem. This way of combining machine learning with a practical problem is worth borrowing.

At the same time, many parts of this article use only simple methods; readers who feel inspired are welcome to keep optimizing them. I will continue to update my own further optimization ideas and experience.

The full sample code is available here.

Thanks

Thank you for your patience in reading my article. Comments and corrections are welcome. I hope to communicate with you and make progress together.

Thanks to my advisor, Mr. Gao, and to the classmates and friends who actively discussed and worked through the problem with me!
