[selenium (Chrome) + Python] n-level reference/cited-by literature crawling & an introduction to crawler libraries

Crawler libraries

The crawler libraries I have worked with so far are scrapy, requests + BeautifulSoup (or etree), and now this new one, selenium. I don't know their essential differences, but judging by how they are called, they can be classified as follows:

  • scrapy: crawling is implemented through class inheritance, by overriding its methods. Advantages: fast, very fast. Disadvantages: hard to debug, not easy to write. Suitable for: large crawler projects.
  • selenium: the idea is to drive the browser itself, which loads and renders every script, so its disadvantage is very obvious: slow, very slow, really slow. But precisely because all transmitted information is handled and the whole browser machinery is driven, selenium can perform any human operation, including clicking and typing to load dynamic pages. Suitable for: pages whose structure varies and where some information can only be obtained through interaction.
  • requests + BeautifulSoup/etree: often used for small experiments; its syntax is relatively simple and the logic is easy to follow. Compared with the two above, requests suits static pages: it is less flexible than selenium but faster, and slower than scrapy but easier to write. Overall, a library suited to small experiments. A short sketch comparing the calling styles follows this list.
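A minimal sketch of the two calling styles, just to make the contrast concrete. The URL is a placeholder (not the site used in this post), and it assumes chromedriver/Chrome are available for selenium.

```python
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

URL = "https://example.org"  # placeholder, not the literature site crawled here

# requests + BeautifulSoup: one HTTP request, parse whatever HTML comes back.
html = requests.get(URL, timeout=10).text
soup = BeautifulSoup(html, "html.parser")
print(soup.title.string)

# selenium: drive a real Chrome instance, which also runs the page's scripts,
# so dynamically rendered content is visible but everything is much slower.
driver = webdriver.Chrome()
driver.get(URL)
print(driver.find_element(By.TAG_NAME, "h1").text)
driver.quit()
```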

For concrete experiments, refer to this page.

This task

  • Root paper: Highly accurate protein structure prediction with AlphaFold, a Nature article with a relatively large impact in recent years; its core is the application of deep learning to protein structure.
  • Task description: take the root paper as the origin and search upward (papers that cite the root paper) for several generations, and downward (its References) for several generations. It is essentially a tree-like structure, but since the root paper is the starting point, papers at the same level may cite each other and form closed loops, so in graph-theoretic terms it is not a strict tree. This is engineering, though, so the mathematics need not be rigorous; expressing the idea is what matters. A traversal sketch follows this list.
  • Crawling websites: I do not recommend choosing Web of Science for crawling, because its anti-crawling is very good and gaps get patched quickly: a port that still worked for me on March 17th was blocked by April 9th. I suggest you find some other websites; after all, there are only a handful of large literature-search sites, so try for yourself and you will find out.
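A sketch of the n-level traversal described above. The helpers get_references(paper_id) and get_cited_by(paper_id) are hypothetical placeholders for whatever scraping routine returns the linked paper ids; the visited set is what handles the closed loops that keep this from being a strict tree.

```python
from collections import deque

def crawl_n_levels(root_id, n_levels, get_neighbors):
    """Breadth-first traversal up to n_levels away from the root paper."""
    visited = {root_id}
    queue = deque([(root_id, 0)])   # (paper id, depth from the root)
    edges = []                      # (source, target) citation pairs found so far
    while queue:
        paper_id, depth = queue.popleft()
        if depth >= n_levels:
            continue
        for neighbor in get_neighbors(paper_id):
            edges.append((paper_id, neighbor))
            if neighbor not in visited:   # already-seen papers close a loop; don't re-expand
                visited.add(neighbor)
                queue.append((neighbor, depth + 1))
    return edges

# Downward for references, upward for citing papers (helpers are placeholders):
# reference_edges = crawl_n_levels(root_id, 3, get_references)
# citation_edges  = crawl_n_levels(root_id, 3, get_cited_by)
```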

Specific situation

The site I crawled has one particular characteristic: not every article has its references or citations listed on the page. If an article is lucky, or famous, the site may put its Reference and Cited by lists on the page for viewing; most little-known small articles get no such treatment. This is also the drawback of a small website: it is easy to crawl, but relatively poorly maintained. Some of the problems I ran into while crawling are summarized here:

  1. Be sure to wrap every request (get) and every lookup (find_element) in try...except when using selenium's webdriver. Otherwise a momentary hiccup, or a page that fails to update in time because of network speed, will crash the crawl more than halfway through and leave you nowhere to cry. See the first sketch after this list.
  2. When locating a piece of data, driver supports both XPath and CSS, but there is one thing I have to point out, and which I think was badly named in the design: there are two methods, find_element and find_elements, which can be understood like the difference between extract_first and extract on a response in scrapy shell. The former returns the first element matching the query; the latter returns all matching elements. Distinguishing them by a single "s" is really not great, and you must pay attention when debugging. See the second sketch after this list.
  3. !! Important !! When locating by XPath, if you obtain the XPath by right-clicking in F12 -> Copy -> Copy XPath, it usually contains positional indexes such as the [3] and [1] in //*[@id="b_results"]/li[3]/div[1]/h2/a. I do not recommend indexing this way, especially when your pages are dynamic and different pages differ slightly in structure; it is very likely that you will fail to retrieve the elements you want, or retrieve the wrong ones. So I recommend stripping these indexes as a first step and re-filtering on the class or id attributes of the information you want, as the second sketch below also shows.
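A minimal sketch of point 1, wrapping every request and every lookup in try...except so one flaky page does not kill a long crawl. The URL and XPath here are placeholders, not the actual site's.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException, WebDriverException

driver = webdriver.Chrome()

def safe_get(url):
    try:
        driver.get(url)
        return True
    except WebDriverException:
        return False          # momentary network hiccup; caller can retry or skip

def safe_find(xpath):
    try:
        return driver.find_element(By.XPATH, xpath)
    except NoSuchElementException:
        return None           # page structure differs from what we expected

if safe_get("https://example.org/paper/123"):
    title = safe_find('//a[@class="title"]')
    if title is not None:
        print(title.text)
```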
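And a sketch of points 2 and 3: find_element returns only the first match while find_elements returns a list, and filtering on class/id attributes is more robust than the positional indexes that Copy XPath produces. The search URL and selectors are illustrative, not taken from the site crawled in this post.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.bing.com/search?q=alphafold")  # illustrative results page

# Brittle: positional indexes copied from DevTools break when the layout shifts.
# first_link = driver.find_element(By.XPATH, '//*[@id="b_results"]/li[3]/div[1]/h2/a')

# More robust: drop the indexes and select on class/id attributes instead.
xpath = '//*[@id="b_results"]//li[contains(@class, "b_algo")]//h2/a'

# find_element -> first match only (like extract_first); raises if nothing matches.
first_link = driver.find_element(By.XPATH, xpath)

# find_elements (note the s) -> every match, possibly an empty list (like extract).
all_links = driver.find_elements(By.XPATH, xpath)

print(first_link.text, len(all_links))
driver.quit()
```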

Some methods

driver has several retrieval methods such as find_element_by_xpath or find_element_by_css_selector, but calling them triggers a warning that they should no longer be used; you are supposed to use find_element(by=, value=) with By.XPATH instead, which calls the same underlying function. My evaluation: taking your pants off to fart, in other words pointless extra steps. A hassle.
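For reference, a small sketch of the API change being complained about: both forms locate the same element, but only the second avoids the deprecation warning in Selenium 4 (the old find_element_by_* methods were later removed entirely). The URL and XPath are placeholders.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.org")

# Old style (deprecated, and removed in newer Selenium 4 releases):
# element = driver.find_element_by_xpath("//h1")

# New style, same behaviour:
element = driver.find_element(By.XPATH, "//h1")
print(element.text)
driver.quit()
```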
I will write up some new ideas later, once the source code is submitted.

The code

Our group assignment will be presented this Tuesday, so please bear with me for the time being; I will share the source code after handing in the assignment \doge

Origin blog.csdn.net/Petersburg/article/details/124069580