How can you extract content from crawled HTML accurately, efficiently, and quickly?

Some people have used crawlers to scrape pages from news-style websites but never found a good way to extract the data. From my own earlier crawler work, I remembered an extraction method that was accurate, efficient, and fast: "General Webpage Text Extraction Algorithm Based on Line Block Distribution Function: cx-extractor". Its author, Chen Xin, was at the time a researcher at the Information Retrieval Research Center of Harbin Institute of Technology, and the paper describing the algorithm was written during that period.

I could not find the original article, but related material is available in the Google Code archive: https://code.google.com/archive/p/cx-extractor/ .
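
To give a rough idea of what "line block distribution" means, here is a minimal Python sketch. It is not the cx-extractor code and is not taken from the paper. It assumes the general idea is: strip the markup, measure how much text each group of consecutive lines holds, and keep the densest contiguous region. The names `strip_markup`, `text_len`, and `extract_main_text`, the parameters `block_width`, `threshold`, and `min_line_chars`, and the final trimming step are all my own assumptions for illustration.

```python
import re
from html import unescape


def strip_markup(page: str) -> list[str]:
    """Drop comments, scripts, styles and tags; return whitespace-normalised lines."""
    page = re.sub(r"(?is)<!--.*?-->", "", page)
    page = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1\s*>", "", page)
    page = re.sub(r"<[^>]*>", "", page)
    return [re.sub(r"\s+", " ", unescape(line)).strip() for line in page.split("\n")]


def text_len(line: str) -> int:
    """Number of non-space characters on a line."""
    return len(line.replace(" ", ""))


def extract_main_text(page: str, block_width: int = 3, threshold: int = 40,
                      min_line_chars: int = 20) -> str:
    """Return the contiguous run of lines that holds the most text.

    All parameter defaults are illustrative guesses, not values from the paper.
    """
    lines = strip_markup(page)
    sizes = [text_len(line) for line in lines]
    # Block i is the amount of text on lines i .. i + block_width - 1.
    blocks = [sum(sizes[i:i + block_width]) for i in range(len(sizes))]

    best: list[str] = []
    best_score = 0
    i = 0
    while i < len(blocks):
        if blocks[i] <= threshold:
            i += 1
            continue
        start = i
        while i < len(blocks) and blocks[i] > threshold:
            i += 1
        run = lines[start:i]
        # Assumption of this sketch: trim short lines (menus, footers) that
        # were pulled in only because a neighbouring line was long.
        while run and text_len(run[0]) < min_line_chars:
            run.pop(0)
        while run and text_len(run[-1]) < min_line_chars:
            run.pop()
        score = sum(text_len(line) for line in run)
        if score > best_score:
            best, best_score = run, score
    return "\n".join(line for line in best if line)
```

Calling `extract_main_text(html_string)` on a downloaded news page should return the article body as plain text, one line per source line. The thresholds will likely need tuning per site, and pages that put the whole body on a single line need to be split first.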

CSDN download: https://download.csdn.net/download/yilovexing/85064917

Reposted from: blog.csdn.net/yilovexing/article/details/123904077