Python Crawler FAQs

Question 1: How to break JS encryption

(1) Master each panel of the Chrome developer tools: Elements, Network, Sources.

(2) Observe carefully and think it through. In the Network panel, watch how the page loads and look for suspicious XHR requests; set XHR breakpoints, trace back through the JS with the Call Stack, and read the surrounding code as you go. You need to be able to read JS and know the relevant JS concepts, for example the variables hung on window.

(3) The approach above is to locate the JS encryption/decryption code by debugging and then re-implement it in Python. This process can be very long and may cost you several days, and once the site changes its JS algorithm, your Python implementation stops working.

(4) Selenium can simply break through this, and the site basically cannot tell. Its only drawback is poor efficiency. However, for a site that uses JS encryption to protect its data, the throughput of a single Selenium instance should already be enough to hit the site's access-frequency limits. At that point, think instead about how to add resources (IPs, accounts) to raise overall crawl throughput.
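
A minimal Selenium sketch of this approach, assuming Chrome and a matching driver are installed; the URL is a placeholder, not a site named in the article:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)

try:
    driver.get("https://example.com/page-protected-by-js")
    # The browser runs the site's JS itself, so the rendered page source
    # already contains the data we would otherwise have to decrypt by hand.
    html = driver.page_source
    print(len(html))
finally:
    driver.quit()
```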


Question 2: Choosing between multithreading, coroutines, and multiprocessing

(1) Crawling is an IO-bound task: most of the time is spent waiting on the network, so multiprocessing is not a good fit for a web crawler. Multithreading and asynchronous IO (coroutines) are both suitable, and asynchronous IO fits best: compared with threads, switching between coroutines is cheaper, so we recommend asynchronous IO over multithreading. The main asynchronous IO modules are asyncio, aiohttp, aiomysql, and so on.
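
A minimal sketch of asynchronous downloading with asyncio and aiohttp; the URLs are placeholders:

```python
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        return url, resp.status, await resp.text()

async def main(urls):
    # One coroutine per URL; the event loop switches between them while waiting on IO.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

if __name__ == "__main__":
    for url, status, html in asyncio.run(main(["https://example.com/a",
                                               "https://example.com/b"])):
        print(url, status, len(html))
```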

(2) Extracting the desired data from the downloaded pages is CPU-bound; that step can be parallelized with multiprocessing.
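
A minimal multiprocessing sketch of the CPU-bound extraction step; extract_title and the inline HTML snippets are stand-ins for an extractor running over pages read from your crawl store:

```python
import re
from multiprocessing import Pool

def extract_title(html):
    # A stand-in CPU-bound extractor: pull the <title> out of a page.
    m = re.search(r"<title>(.*?)</title>", html, re.S | re.I)
    return m.group(1).strip() if m else ""

if __name__ == "__main__":
    pages = ["<title>Page A</title>", "<title>Page B</title>"]  # stand-in HTML
    with Pool(processes=4) as pool:
        print(pool.map(extract_title, pages))
```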

(3) Our recommended crawler strategy: the crawler only crawls, saving the downloaded HTML to a database. Then write a separate extractor that runs on its own. The benefit is that extraction does not slow down crawling, so crawling is more efficient, and the extractor can be modified at any time; when a new extraction requirement comes up there is no need to re-crawl. For example, suppose the crawler was originally written to extract two fields from a page; after running for a while you find that three more fields are also useful. If you saved the HTML, you only need to change the extractor and re-run it.
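
A minimal sketch of this crawl/extract split using sqlite3; the table name and columns are assumptions for illustration, not a schema from the article:

```python
import sqlite3

conn = sqlite3.connect("crawl.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT, status INTEGER)"
)

def save_page(url, html, status):
    # The crawler only stores raw HTML; no extraction happens here.
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, html, status) VALUES (?, ?, ?)",
        (url, html, status),
    )
    conn.commit()

def iter_pages():
    # A separate extractor reads the stored HTML later, so new extraction
    # rules never require re-crawling.
    for url, html in conn.execute("SELECT url, html FROM pages WHERE status = 200"):
        yield url, html
```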


Question 3: If you want to keep the original images or the positions of bold text, is the only option to study each page's structure and write targeted regular expressions?

There are two main ways to extract data from web pages: regular expressions and XPath. XPath can select a node by its HTML tag. For example, on a blog page the body of the post sits inside some tag, probably a div. Select that div with XPath, serialize it back to HTML, and save that HTML, formatting and images included, instead of the plain text.
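
A minimal lxml/XPath sketch of saving the content div as HTML rather than plain text; the class name and markup are made up for illustration:

```python
from lxml import html as lxml_html

page = """
<html><body>
  <div class="post-body"><p>Hello <b>world</b></p><img src="a.png"/></div>
</body></html>
"""

tree = lxml_html.fromstring(page)
nodes = tree.xpath('//div[@class="post-body"]')
if nodes:
    # Serialize the node back to HTML so images and bold text keep their places.
    body_html = lxml_html.tostring(nodes[0], encoding="unicode")
    print(body_html)
```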


Question 4: Incremental crawling, resumable crawling, deduplication, and so on

(1) Manage all URLs through the concept of a URL pool.

(2) Incremental crawling means not re-downloading what has already been downloaded, so let the URL pool remember which URLs have already been downloaded;

(3) Resumable crawling means the next run does not re-crawl the last run's URLs but continues from where it stopped, so let the URL pool remember which URLs have not yet been crawled;

(4) For deduplication, the URL pool records each URL's state so that nothing is crawled twice; see the sketch below.
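
A minimal URL-pool sketch covering these four points; the state names and methods are my own, not any particular library's API:

```python
class UrlPool:
    """A tiny in-memory URL pool; a real one would persist state to a database."""

    def __init__(self):
        self.status = {}  # url -> 'pending' | 'downloading' | 'done' | 'failed'

    def add(self, url):
        # Only accept URLs the pool has never seen, so nothing is downloaded twice.
        if url not in self.status:
            self.status[url] = 'pending'

    def pop(self):
        # Hand out a URL that has not been crawled yet and mark it in-flight.
        for url, state in self.status.items():
            if state == 'pending':
                self.status[url] = 'downloading'
                return url
        return None

    def set_result(self, url, ok):
        self.status[url] = 'done' if ok else 'failed'
```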


Question 5: Crawler deployment; at a company, is it the distributed crawler systems that involve most of the deployment work?

Deploying a crawler does not necessarily mean going distributed. Only massive crawls that have to break through the target site's rate limits call for a distributed setup; the benefit is a higher crawl rate, but management becomes more complicated.


Question 6: Automatically parsing web pages. This topic contains many sub-tasks: how to automatically extract article content, how to handle the various time formats, and how to handle pagination

(1) Extracting article content. One approach is to build an extraction template (regular expressions) for each site; the advantage is precise extraction, the downsides are the heavy workload and that even a small site redesign breaks the template (a per-site template could look like the sketch below). The other approach is to build a single generic extraction algorithm, which can handle basically every page, although the result may contain noise such as the "related reading" block at the end of the article; the benefit is that it is a once-and-for-all solution, unaffected by redesigns.
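
A minimal sketch of the per-site template approach; the host names and regex patterns are placeholders:

```python
import re

# One regex "template" per site, keyed by host name (all placeholders).
TEMPLATES = {
    "siteA.example.com": re.compile(r'<div class="article">(.*?)</div>', re.S),
    "siteB.example.com": re.compile(r'<div id="content">(.*?)</div>', re.S),
}

def extract_body(host, html):
    pattern = TEMPLATES.get(host)
    if pattern is None:
        return None  # unknown site: fall back to a generic extraction algorithm
    m = pattern.search(html)
    return m.group(1) if m else None
```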

(2) Extracting the publish time. Beyond regular expressions, there does not seem to be a particularly effective method.
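
A minimal regex sketch covering a few common publish-time formats; the pattern list is illustrative, not exhaustive:

```python
import re

# A few common publish-time formats (illustrative only).
TIME_PATTERNS = [
    r"\d{4}-\d{1,2}-\d{1,2}\s+\d{1,2}:\d{2}(?::\d{2})?",  # 2019-08-09 10:30[:00]
    r"\d{4}/\d{1,2}/\d{1,2}\s+\d{1,2}:\d{2}",              # 2019/08/09 10:30
    r"\d{4}年\d{1,2}月\d{1,2}日(?:\s*\d{1,2}:\d{2})?",     # 2019年8月9日 10:30
]

def find_publish_time(html):
    # Return the first match of any known pattern, or None if nothing matches.
    for pattern in TIME_PATTERNS:
        m = re.search(pattern, html)
        if m:
            return m.group(0)
    return None
```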

(3) Pagination. If you are only crawling, just extract the next page's URL and keep crawling; if you need to merge the content of multiple pages into a single page at extraction time, that requires special handling.
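
A minimal sketch of following next-page links and concatenating the pages; the XPath looking for a "下一页" (next page) link is an assumption about the site's markup:

```python
from urllib.parse import urljoin

from lxml import html as lxml_html

def crawl_all_pages(fetch, first_url, max_pages=20):
    """fetch(url) -> HTML string; follows "next page" links and concatenates the pages."""
    pages, url, seen = [], first_url, set()
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)
        tree = lxml_html.fromstring(fetch(url))
        pages.append(lxml_html.tostring(tree, encoding="unicode"))
        # Look for a link whose text says "next page"; adjust the XPath per site.
        nxt = tree.xpath('//a[contains(text(), "下一页")]/@href')
        url = urljoin(url, nxt[0]) if nxt else None
    return "\n".join(pages)
```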


Question 7: When crawling news sites, how do you deduplicate articles by their text, given that the same news story is reproduced across many sites?

The best-known algorithm is Google's simhash, but it is fairly complicated to use in practice. A method attributed to Baidu that circulates online is to hash the longest sentence (or sentences) of the article and use that hash as the article's unique fingerprint. This method has very high precision but relatively low recall: if even one word of that longest sentence is changed, the duplicate is not recalled. I improved on the method by hashing each of the n longest sentences separately and using those n fingerprints together to decide uniqueness, much like identifying a person by the prints of several fingers. Both precision and recall turn out to be pretty good.
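
A minimal sketch of the improved fingerprint described above; the sentence splitting and the choice of n = 3 are my own simplifications:

```python
import hashlib
import re

def fingerprints(text, n=3):
    # Split on end-of-sentence punctuation and hash the n longest sentences.
    sentences = [s.strip() for s in re.split(r"[。！？.!?]", text) if s.strip()]
    longest = sorted(sentences, key=len, reverse=True)[:n]
    return {hashlib.md5(s.encode("utf-8")).hexdigest() for s in longest}

def looks_duplicate(text_a, text_b, threshold=1):
    # Treat two articles as the same news item if they share at least
    # `threshold` of their longest-sentence fingerprints.
    return len(fingerprints(text_a) & fingerprints(text_b)) >= threshold
```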


Question 8: Asynchronous crawler design

(1) A good URL management strategy; see the related article on the URL pool on the "ape school" website.

The URL pool is a producer-consumer model: the crawler takes URLs from the pool to download, extracts new URLs from the downloaded HTML and puts them back into the pool, tells the pool whether each URL it took was downloaded successfully, and then takes more URLs from the pool to download, and so on. The URL pool is a core component; it records the different states of a URL:

(a) downloaded successfully

(b) download failed n times

(c) downloading

Every time a URL is added to the pool, check its state in the pool first to avoid downloading it twice.

(2) A good strategy for managing asynchronous coroutines; see the article on the asynchronous massive news crawler on the "ape school" website.

Each time, take n URLs from the UrlPool and create n coroutines to download them asynchronously; track the number of coroutines (i.e., the number of pages currently being downloaded) with a counter variable, as in the sketch below.
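
A minimal asyncio/aiohttp sketch of this pattern, reusing the UrlPool sketch from Question 4; a semaphore caps the number of pages being downloaded at once:

```python
import asyncio
import aiohttp

MAX_WORKERS = 10  # cap on pages being downloaded at the same time

async def download(session, pool, url, sem):
    async with sem:  # the semaphore plays the role of the coroutine counter
        try:
            async with session.get(url) as resp:
                html = await resp.text()
            pool.set_result(url, ok=True)
            return url, html
        except aiohttp.ClientError:
            pool.set_result(url, ok=False)
            return url, None

async def crawl(pool):
    sem = asyncio.Semaphore(MAX_WORKERS)
    async with aiohttp.ClientSession() as session:
        tasks = []
        while True:
            url = pool.pop()  # take URLs from the pool until it is empty
            if url is None:
                break
            tasks.append(asyncio.create_task(download(session, pool, url, sem)))
        return await asyncio.gather(*tasks)
```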

Asynchronous massive news crawler: implementing a powerful, easy-to-use URL pool

Asynchronous massive news crawler: an asynchronous crawler with asyncio

See these two articles for the asynchronous URL management and asynchronous downloading mentioned above.


