A Little Research on Impurity Filtration

1. Problem description

The online content collected by crawlers contains a lot of useless information (impurities), which needs to be filtered out automatically so that only the genuinely useful content is kept. Filtering is itself a classification task: for each item, decide whether it is an impurity.

2. Solutions

In general, the main sources of impurities are:

  • Keyword misjudgment: content that hits a collection keyword but is actually about something else;
  • Useless and spam websites: information sites such as "Kunshan News Network";
  • Pornography/gambling/drug/game information: pornography/gambling/drug/game posts published in some forums;
  • Others: specific unwanted sites, such as postgraduate entrance examination training sites and recruitment sites.

In practice, inspect the labeled corpus, summarize the main categories of impurities and their proportions, and deal first with the categories that account for the largest share.

There are generally two options for impurity filtration:

  • Rule method: With the help of domain experts, define a large number of inference rules for each category. If a document satisfies these rules, it is judged to belong to that category.
  • Statistical method: Train a model on a labeled training corpus and use the model to predict the category of new documents. Commonly used techniques include naive Bayes, word2vec, random forest, and SVM.

3. Method selection

Choose different filtering methods for different types of impurities, and pay attention to the cost-effectiveness of each method.

Statistical machine learning methods generally need a large training corpus, so they come with a certain cost threshold. Prefer simple, practical rule filtering first. After several rounds of rule iteration, consider machine learning algorithms depending on how well the rules perform.

4. Specific implementation

4.1 Misjudgment of keywords

Description: The text hits a collection keyword, for example "CIIC Technology Co., Ltd." or "CIIC (China and Chile)", but the keyword does not refer to the CIIC of Guanaitong.

Ideas:

  1. Add exclusion words, for example: CIIC | CIIC Technology Co., Ltd.;
  2. Add regular expression rules, for example: "China-Chile.*?relations between the two countries", "China-Chile.*?economy and trade";
  3. Use Chinese word segmentation: for a string such as "Hanzhong Zhizhi", the hit can be filtered out if the segmentation results ("Hanzhong" and "Smartness") do not contain the keyword "Zhongzhi" as a token.
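
The three ideas above can be sketched roughly as follows. This is a minimal illustration, not the original implementation: the function names are hypothetical, the phrases and patterns are just the examples from the list, and the segmentation step assumes the text has already been split into tokens by some word segmenter.

```python
import re
from typing import List

# Hypothetical exclusion list and rules, taken from the examples in the text.
EXCLUSION_PHRASES = ["CIIC Technology Co., Ltd."]
EXCLUSION_PATTERNS = [
    re.compile(r"China-Chile.*?relations between the two countries", re.IGNORECASE),
    re.compile(r"China-Chile.*?economy and trade", re.IGNORECASE),
]


def is_keyword_misjudgment(text: str) -> bool:
    """Ideas 1 and 2: True if an exclusion phrase or rule matches the text."""
    if any(phrase in text for phrase in EXCLUSION_PHRASES):
        return True
    return any(pattern.search(text) for pattern in EXCLUSION_PATTERNS)


def keyword_survives_segmentation(keyword: str, tokens: List[str]) -> bool:
    """Idea 3: the hit is kept only if the keyword is a token in its own right.

    `tokens` is assumed to be the output of a word segmenter.
    """
    return keyword in tokens
```

A document would be treated as an impurity if `is_keyword_misjudgment` returns True, or if the collection keyword does not survive segmentation.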

4.2 Useless and spam filtering

Description: While inspecting the impurities, we found that some of them come from specific spam websites, such as "Kunshan News Network", and that everything collected from these websites is impurity. Such impurities can be excluded by domain name filtering.

Ideas:

  1. Identify useless and spam websites: if all or most of the information captured from a domain name is impurity (for example, more than 95%; agree on the exact threshold with the data department), then the website is a useless or spam website;
  2. Collect the domain names of these websites and use them as impurity features.

Domain name filtering is simple and direct, with good flexibility and scalability.
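
A minimal sketch of this idea is shown below. It assumes the labeled corpus is available as (URL, is_impurity) pairs; the function names, the `min_samples` guard against judging a domain from too few items, and the example domains in the usage note are all assumptions added for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def find_spam_domains(labeled_urls, threshold=0.95, min_samples=20):
    """Return domains whose share of impurity items exceeds `threshold`.

    labeled_urls: iterable of (url, is_impurity) pairs from the labeled corpus.
    min_samples:  assumed lower bound on items per domain before judging it.
    """
    total, impure = Counter(), Counter()
    for url, is_impurity in labeled_urls:
        domain = urlparse(url).netloc.lower()
        total[domain] += 1
        if is_impurity:
            impure[domain] += 1
    return {
        domain
        for domain, count in total.items()
        if count >= min_samples and impure[domain] / count > threshold
    }


def is_from_spam_domain(url, spam_domains):
    """Filter rule: True if the URL belongs to a known spam domain."""
    return urlparse(url).netloc.lower() in spam_domains
```

For example, if every labeled item from `spam.example.com` is an impurity and only half of the items from `news.example.org` are, only the first domain ends up in the filter list.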

4.3 Pornography/gambling/drug/game impurities

Description: This type of information generally appears in forums, post bars, etc., and cannot be removed by domain name filtering.

Idea: This type of impurity generally contains obvious characteristic words, such as the game titles "Breaking the Sky" and "Hot Blood Rivers and Lakes". Because such a keyword may appear in the title as well as in the content, the exclusion keyword filter needs to combine title and content checks.

Steps:

  1. Add a title check: based on the list of exclusion keywords, any item whose title contains an exclusion keyword is directly judged to be an impurity;
  2. For each exclusion keyword, measure its impurity identification accuracy on the labeled corpus, to make sure the keyword does not cause non-impurity items to be filtered out.
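
The two steps can be sketched as follows. This is an illustration under assumptions: the labeled corpus is taken to be a list of (title, content, is_impurity) tuples, the function names are hypothetical, and the 0.99 precision bar is an arbitrary placeholder rather than a value from the experiment.

```python
def title_rule_precision(keyword, labeled_docs):
    """Share of items with `keyword` in the title that are labeled impurity.

    labeled_docs: list of (title, content, is_impurity) tuples.
    Returns None when the keyword never appears in a title.
    """
    hits = [is_impurity for title, _content, is_impurity in labeled_docs
            if keyword in title]
    if not hits:
        return None
    return sum(hits) / len(hits)


def select_safe_keywords(candidates, labeled_docs, min_precision=0.99):
    """Step 2: keep only keywords that almost never hit non-impurity items."""
    safe = set()
    for keyword in candidates:
        precision = title_rule_precision(keyword, labeled_docs)
        if precision is not None and precision >= min_precision:
            safe.add(keyword)
    return safe


def is_impurity_by_title(title, safe_keywords):
    """Step 1: an item is an impurity if its title contains a safe keyword."""
    return any(keyword in title for keyword in safe_keywords)
```

Keywords that fail the precision check are not necessarily useless; they may still work as content-level features in combination with other signals.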

4.4 Other impurities

Description: Some impurities can be filtered by other features contained in the junk content itself. Take this poetry post as an example:

Can the frosty wind not shine, Zhong Zhisheng is full of birds in the sky.

Although I didn't go down the road from Diwangshan, the most worrying thing is the slow strings.

——Han Lu "Han Lu Poems"

#原创诗词# #七绝#

It can be filtered out by features such as the hashtags "#原创诗词# #七绝#" (original poetry, seven-character quatrain) in the text.

What remains are the impurities that are hard to capture with hand-written rules. For these, machine learning algorithms need to be introduced.

4.5 Machine Learning Algorithms

There are many classification algorithms available, such as Naive Bayes, Logistic Regression, SVM, and Random Forest, all of which are well established in spam filtering. Let's use "Word2vec + Random Forest" as an illustration:

Idea: Based on word2vec, select document feature words; vectorize documents according to the extracted feature words; use random forest classification algorithm to train model and classify.

Steps:

  1. Train word vectors with word2vec to obtain a word vector space in which each word is a point. A document is made up of words, so after segmentation the document can also be projected into the same space by averaging the vectors of its words. The cosine similarity between a word and the document is then used as the contribution of that word to the document (the weight of the word in the document);
  2. As preparation for random forest training, each document is thus associated with a set of weighted words, which are the candidate feature words;
  3. Aggregate and normalize the weighted words by category, and extract a certain number of keywords that best represent each category (the exact number has to be determined experimentally; too many or too few both hurt);
  4. Each document can then be represented as a vector over the feature words and fed into random forest training and prediction.
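
The weighting in step 1 and the vectorization in step 4 can be sketched in plain Python as below. This is only an illustration: `embeddings` stands in for vectors from a trained word2vec model (here it is just a dict from word to vector), the function names are hypothetical, feature words are picked per document rather than aggregated per category as in step 3, and the random forest training itself is left to a library classifier.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def word_weights(tokens, embeddings):
    """Step 1: weight of each word = cosine(word vector, document vector)."""
    known = [t for t in tokens if t in embeddings]
    if not known:
        return {}
    doc_vector = mean_vector([embeddings[t] for t in known])
    return {t: cosine(embeddings[t], doc_vector) for t in set(known)}


def top_feature_words(weights, k):
    """Pick the k highest-weighted words (ties broken alphabetically)."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _weight in ranked[:k]]


def to_feature_vector(weights, feature_words):
    """Step 4: represent a document as weights over a fixed feature word list."""
    return [weights.get(word, 0.0) for word in feature_words]
```

The resulting fixed-length vectors, one per document, are what would be passed to the random forest together with the impurity labels.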

4.6 Summary

  1. Consider cost-effectiveness when choosing a method;
  2. Observe and summarize first: if simple rules can be derived, try filtering with rules before anything else;
  3. Rules can be based on the title, domain name, content, and so on; as long as they test well, there is no need to stick to one form;
  4. If too many impurities still remain, combine the rules with a machine learning classification algorithm.

5. Experiment

This experiment mainly deals with the following types of impurities:

  • Useless and spam websites: information sites such as "Kunshan News Network"
  • Pornography/gambling/drug/game information: pornography/gambling/drug/game posts published in some forums
  • Redundancy problem: For example, the source code in the web page http://www.scxiantan.com/bjpp/48819.html contains the following:

CollegeWuxi Vocational and Technical CollegeChengdu University of TechnologyCollege of Engineering TechnologyShijiazhuang Vocational and Technical CollegeZhengzhou Railway Vocational and Technical CollegeXi’an Railway Vocational and Technical CollegeXingtai Vocational and Technical CollegeQingdao Harbor Vocational and Technical CollegeChangsha Civil Affairs Vocational and Technical CollegeZhengzhou Industrial Applied Technology CollegeQingdao Hotel Management Vocational and Technical College Shandong Information Vocational and Technical College Wuhan Shipbuilding Vocational and Technical College Wuhan Engineering Vocational and Technical College Qinhuangdao Vocational and Technical College Taishan Vocational and Technical College Dezhou Vocational and Technical College Hubei Communication Vocational and Technical College Yunnan Communication Vocational and Technical College Jinan Engineering Vocational and Technical College Shanxi Engineering and Technical College Zhengzhou Engineering Technology College Zhangzhou Vocational and Technical College Shandong Electronic Vocational and Technical College Foshan Science and Technology College Tianjin Bohai Vocational and Technical College Tianjin Electronic Information Vocational and Technical College Shaanxi Industrial Vocational and Technical College Jiangxi Industrial Vocational and Technical College Nanchang University Science and Technology College Yantai Engineering Vocational and Technical College Xi'an Aviation Vocational and Technical College Shijiazhuang Vocational and Technical College of Posts and Telecommunications Huanggang Vocational and Technical College Ningbo Vocational and Technical College Dalian Vocational and Technical College Nanjing Information Vocational and Technical College Hebei Communications Vocational and Technical College Guangzhou Civil Aviation Vocational and Technical College Jiangsu Engineering Vocational and Technical College Hunan Applied Technical College Guangdong Communications 
Vocational and Technical College Shaanxi Vocational and Technical College College Guangdong Light Industry Vocational and Technical College Hangzhou Vocational and Technical College Jiangxi Communications Vocational and Technical College Ningbo University Science and Technology College Anhui Vocational and Technical College Zhejiang Industry Vocational and Technical College Yangling Vocational and Technical College Shanghai Electronic Information Vocational and Technical College Xiangyang Vocational and Technical College Chengdu Aviation Vocational and Technical College Harbin Vocational and Technical College Laiwu Vocational and Technical College Shenzhen Information Vocational and Technical College

Methods adopted:

  • Useless and spam websites: URL filtering
  • Pornography/gambling/drug/game information: rule filtering + URL filtering
  • Redundancy (repeated words) problem: machine learning algorithm (word density filtering). Count the words that appear multiple times and introduce a penalty mechanism: when occurrences of a word are close together in the article, the word frequency count accumulates, that is, the first occurrence adds 1, a second occurrence nearby adds 2, a third adds 3, and so on. When the next occurrence is far away, the increment is reset to 1. Finally, the maximum word density is calculated as word frequency / number of sentences.
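
The word density calculation described in the last item can be sketched as follows. This is one reading of the description, not the original code: the function name is hypothetical, the input is assumed to be an already segmented token list, and `max_gap` (how many tokens apart still counts as "not far apart") is an assumed parameter that the text does not specify.

```python
from collections import defaultdict


def max_word_density(tokens, num_sentences, max_gap=5):
    """Maximum penalized word frequency divided by the number of sentences.

    Each occurrence of a word adds an increment to its score. The increment
    starts at 1, grows by 1 for every further occurrence within `max_gap`
    tokens of the previous one, and is reset to 1 after a longer gap.
    """
    last_position = {}          # word -> index of its previous occurrence
    increment = {}              # word -> increment used for its last occurrence
    score = defaultdict(int)    # word -> accumulated penalized frequency

    for position, word in enumerate(tokens):
        if word in last_position and position - last_position[word] <= max_gap:
            increment[word] += 1
        else:
            increment[word] = 1
        score[word] += increment[word]
        last_position[word] = position

    if not score or num_sentences <= 0:
        return 0.0
    return max(score.values()) / num_sentences
```

On a token list like the college name dump above, where "College" recurs every few tokens inside what is effectively a single sentence, the score grows roughly quadratically with the number of repetitions, so the density is far higher than for ordinary prose. The cut-off density above which a page is filtered would have to be tuned on the labeled corpus.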

Results:

  1. The following experimental results cover only the three types of impurities above. For other types of impurities, either outcome (filtered or not filtered) is counted as correct.
  2. The experiment used one month of data, 4443 items in total, of which the program judged 3259 as to be filtered and 1184 as not to be filtered.
  3. Runtime performance: processing one item takes 26.88 ms on average, CPU usage is about 30%, and memory usage is about 100 MB.
  4. Checked against manual labels, the filtering process removes most impurities, and the overall accuracy is over 95%.

The annotation statistics are as follows:

Category     Program judgment    Correct    Correct rate (%)
Filter       1184                1145       96.71
No filter    3259                3078       94.45
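
As a quick check of the arithmetic, the per-category correct rates and the overall accuracy can be recomputed from the counts in the annotation statistics. The variable and function names below are only for illustration.

```python
# Counts copied from the annotation statistics: (program judgment, correct).
rows = {
    "filter": (1184, 1145),
    "no filter": (3259, 3078),
}


def correct_rate(judged, correct):
    """Correct rate in percent, rounded to two decimals."""
    return round(100 * correct / judged, 2)


per_category = {name: correct_rate(*counts) for name, counts in rows.items()}
total_judged = sum(judged for judged, _ in rows.values())
total_correct = sum(correct for _, correct in rows.values())
overall = correct_rate(total_judged, total_correct)

print(per_category)   # {'filter': 96.71, 'no filter': 94.45}
print(total_judged)   # 4443
print(overall)        # 95.05
```

The overall figure, 4223 correct out of 4443, is consistent with the "over 95%" accuracy stated in the results.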

Origin blog.csdn.net/u012998680/article/details/117366331