Pitfalls I hit while crawling CNKI data with Python

Preface

A friend asked me to help crawl some information from CNKI, and after agreeing I set out to crawl CNKI's papers. Before starting, I googled whether a ready-made CNKI crawler already existed (I was hoping to be lazy) and found that the CNKI crawlers on GitHub were all several years old, so I decided to write one myself.

Main text

The Python tool I used for crawling is selenium. The script simulates browser behavior: it opens the browser, opens the CNKI search page, ticks the relevant column on the left, enters the keywords, sets the search time period, and clicks "Search". After that, it gets the browser's cookie values, and the URL of each search result can be read from the front-end page by pressing F12. By the way, CNKI does have an anti-crawler mechanism: if the script runs its steps one after another too quickly, the connection is dropped. So after each interactive step, call time.sleep() with a randomly chosen argument, for example 1, 2, or 5 seconds, and the connection will not be dropped.
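The steps above can be sketched roughly as follows. This is only a minimal sketch, not the script I actually ran: search_url, input_css and button_css are hypothetical placeholders that you would have to read off the page yourself with F12, and the helper names are made up for illustration. The Chrome driver is also an assumption; any browser that selenium supports would do.

```python
import random
import time

# Delays (in seconds) mentioned in the text; one is picked at random after each step.
DELAY_CHOICES = (1, 2, 5)


def polite_pause(sleep=time.sleep, choices=DELAY_CHOICES):
    """Sleep for a randomly chosen delay and return the delay used."""
    delay = random.choice(choices)
    sleep(delay)
    return delay


def run_steps(steps, sleep=time.sleep):
    """Run each zero-argument step in order, pausing after every one."""
    delays = []
    for step in steps:
        step()
        delays.append(polite_pause(sleep))
    return delays


def search_and_get_cookies(search_url, keyword, input_css, button_css):
    """Open the search page, submit a keyword, and return the browser cookies.

    search_url, input_css and button_css are hypothetical placeholders:
    take the real values from the page with the developer tools (F12).
    """
    # Imported here so the helpers above work without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        run_steps([
            lambda: driver.get(search_url),
            lambda: driver.find_element(By.CSS_SELECTOR, input_css).send_keys(keyword),
            lambda: driver.find_element(By.CSS_SELECTOR, button_css).click(),
        ])
        # get_cookies() returns a list of dicts with "name" and "value" keys.
        return {c["name"]: c["value"] for c in driver.get_cookies()}
    finally:
        driver.quit()
```

Passing the sleep function in as a parameter keeps the pacing logic separate from the browser code, so the delays can be checked without opening a browser at all.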

Next, here comes the pitfall!

        This is when I found out that the URL scraped from a search result entry on the front-end page is different from the URL that actually opens! The URL in the screenshot below is JS-encoded (you could also call it encrypted), and the encoding is done in the background. A front-end script can only see the URL before encoding, and if you visit that URL directly you are sent to the CNKI homepage! The people online who say they crawled the links to the papers never tried opening those links. Claiming to have crawled the URLs without checking them is really funny.

        Therefore, it is impossible to retrieve a paper's URL by crawling the website. If you only crawl what the front-end page shows (paper title, database, citation count, and source information), that works fine; but if you want to crawl the paper's URL and then use it to crawl the paper's detailed information (abstract, institution, keywords, references, etc.), that is impossible.
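If you want to check a scraped link yourself before trusting it, something like the following sketch could help. It assumes the bounce to the homepage is an ordinary HTTP redirect; if the site does it with a script inside the page, urllib will not see it and you would need to check in the browser instead. The function names and the "homepage" test (no path and no query string) are my own assumptions for illustration.

```python
from urllib.parse import urlsplit
from urllib.request import Request, urlopen


def is_homepage(url):
    """Return True if the URL has no path beyond "/" and no query string."""
    parts = urlsplit(url)
    return parts.path in ("", "/") and not parts.query


def bounced_to_homepage(requested_url, final_url):
    """Return True if a request for a specific page ended on a homepage."""
    return not is_homepage(requested_url) and is_homepage(final_url)


def fetch_final_url(url, cookie_header="", timeout=10):
    """Follow HTTP redirects and return the URL that was finally served."""
    headers = {"User-Agent": "Mozilla/5.0"}
    if cookie_header:
        # Optionally reuse the cookies collected from the browser session.
        headers["Cookie"] = cookie_header
    with urlopen(Request(url, headers=headers), timeout=timeout) as response:
        return response.geturl()
```

To use it, pass a scraped link to fetch_final_url() and then call bounced_to_homepage(link, final_url); a True result means the link did not lead to the paper.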

 

I guess this is why people only crawl the simple information shown on the CNKI front-end (paper title, author, source, date, citation count, etc.), instead of crawling each paper's URL and using it to retrieve more information.

I hope nobody else wastes time on the pitfalls I ran into. If some expert out there can solve the problem I described above, please leave a reply in the comments so we can discuss it.


Organizing is not easy...


Origin blog.csdn.net/Ryan_lee9410/article/details/95584907