Scrapy from Getting Started to Giving Up 01: Opening the Crawler 2.0 Era

Foreword

Scrapy is coming!!

After writing seven basic articles on crawlers, I have finally gotten to Scrapy, which I had been planning to write about for a long time. Scrapy opened the crawler 2.0 era, presenting crawlers to developers in a brand-new form.

I first came into contact with Scrapy during my internship in 2018, and it took me about a month of combining theory with practice to learn it. This article contains no code walkthroughs, only background and theory; I hope it helps you understand Scrapy.

Problems faced by hand-written crawlers

Whether you use Java's Jsoup or Python's requests, crawler development runs into the following problems:

1. Distributed crawling

Normally a crawler program runs on a single host. If you deploy the same crawler program on different hosts, they are just independent crawlers. To build a distributed crawler, the usual idea is to split the crawler into two parts: URL collection and data collection.

First crawl the URLs and put them into a database, then restrict them with a WHERE condition, or simply use a Redis list, so that the crawler programs on different hosts read different URLs and then crawl the data.
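For illustration, here is a minimal sketch of the Redis-list idea described above (not the author's actual code): one process pushes URLs onto a shared list, and worker processes on different hosts pop them. It assumes a local Redis instance, and parse_page is a hypothetical placeholder.

```python
import redis
import requests

# Minimal sketch of the Redis-list idea above.
# Assumes a local Redis instance; parse_page() is a hypothetical placeholder.
r = redis.Redis(host="localhost", port=6379, db=0)

def producer(urls):
    # URL-collection side: push every discovered URL onto a shared list
    for url in urls:
        r.lpush("crawl:urls", url)

def worker():
    # Data-collection side: each host pops its own URLs,
    # so no two hosts ever fetch the same one
    while True:
        url = r.rpop("crawl:urls")
        if url is None:
            break  # queue drained
        resp = requests.get(url.decode(), timeout=10)
        parse_page(resp.text)  # hypothetical parsing function

def parse_page(html):
    pass  # extract and store the data here
```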

2. URL deduplication

When crawling data, you often run into duplicate URLs, and crawling them repeatedly is a waste of time. The idea behind URL deduplication is to put every crawled URL into a set and, before each request, check whether the URL is already in the set. But if the program stops halfway, the in-memory set is gone, and when you start the program again you can no longer tell which URLs have already been crawled.

So you use a database instead and insert every crawled URL into it; even if the program restarts, the crawled URLs are not lost. But if I just want to start crawling from scratch again, do I have to clear the URL table by hand? And the time spent querying the database for every URL also has to be taken into account.
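To make the trade-off concrete, here is a small sketch of both deduplication approaches, using an in-memory set and SQLite as a stand-in for "the database" (the article does not name one); the table and function names are made up.

```python
import sqlite3

# In-memory version: lost as soon as the program stops
seen = set()

def seen_in_memory(url):
    if url in seen:
        return True
    seen.add(url)
    return False

# Persistent version: survives restarts, at the cost of a query per URL
conn = sqlite3.connect("crawled.db")
conn.execute("CREATE TABLE IF NOT EXISTS crawled (url TEXT PRIMARY KEY)")

def seen_in_db(url):
    try:
        conn.execute("INSERT INTO crawled (url) VALUES (?)", (url,))
        conn.commit()
        return False          # first time we see this URL
    except sqlite3.IntegrityError:
        return True           # already crawled
```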

3. Resumable crawling

Suppose there are 1,000 pages to crawl and the program dies on page 999, just as the progress bar is about to fill up; so close, yet still not finished. If I restart the program, how do I make it continue directly from page 999?

Let me tell the story of the first crawler I ever wrote: crawling POI information for more than ten cities.

It was during my internship, the first crawler I ever developed. I did not know the Amap (Gaode) POI API existed, so I found a website to crawl POI information from. That site was presumably still in its early days: the server bandwidth could not have been high, access was really slow, and it kept being taken down for maintenance, so my program had to stop along with it. If every restart meant crawling from the beginning, it probably would not have finished in years, so I came up with a workaround.

I first entered the number of data entries for every district and county under every city (available on the website) into a database table by hand. Every time the crawler restarted, it first counted, for each district and county, how many entries had already been crawled into the result table and compared that with the total. If the count was smaller, that district was not finished; then, by dividing the number of crawled entries by the number of entries shown per page, I could work out which page I had reached, and use the remainder to locate the position within that page. With this method I eventually crawled 1.63 million records without losing any.
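The arithmetic above might look roughly like the following sketch; the per-page count and the numbers in the example are made up.

```python
# Sketch of the resume calculation described above; the numbers are illustrative.
PER_PAGE = 20                     # items shown per page on the site

def resume_point(crawled_count, total_count):
    """Return (next_page, offset_on_page), or None if the district is done."""
    if crawled_count >= total_count:
        return None               # this district/county is finished
    page = crawled_count // PER_PAGE + 1   # 1-based page to continue from
    offset = crawled_count % PER_PAGE      # items already taken from that page
    return page, offset

# e.g. 163 items already crawled out of 500 -> continue from page 9, skipping 3 items
print(resume_point(163, 500))     # (9, 3)
```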

Another approach is to store crawled URLs in a table; when the program restarts and is about to crawl a URL, it first checks whether that URL already exists in the table and skips it if so. This also achieves resumable crawling, and it follows the same URL deduplication idea as above.

4. Dynamic loading

In the sixth article (on fund data) I covered JSONP dynamic loading, which is relatively simple: you just find the API the data comes from and process it. The seventh article covered the eval()-based JS encryption on TVmao, a much more complicated case of dynamic loading: the request parameters are encrypted, and it takes a lot of time analyzing dense JS to work out the 186-character parameter.

So, is there a way to avoid reading and analyzing JS entirely and still get past dynamic loading?

Of course there is! First, dynamic loading can be understood as the browser kernel executing JS to render data on the front end. So if we embed a browser kernel in our program, can't we just grab the page data after the JS has rendered it?

Usually selenium + Chrome, PhantomJS, or pyvirtualdisplay are used to handle dynamic loading, but all of them come with some degree of performance overhead.
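For reference, a minimal selenium + Chrome sketch of the "embed a browser kernel" idea; it assumes a chromedriver is available and uses a placeholder URL.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Minimal selenium + Chrome sketch; requires a matching chromedriver.
options = Options()
options.add_argument("--headless")      # no visible browser window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")   # placeholder URL
    html = driver.page_source           # DOM after JS has rendered the page
    print(html[:200])
finally:
    driver.quit()
```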

Having said all that, following the usual routine, you can probably guess what I am going to say next.

About Scrapy

The impression Scrapy leaves on me is: clear modules, structured encapsulation, and powerful functionality.

WHAT

Scrapy is a distributed crawler framework, which I compare to Spring in the crawler world. requests is more like a servlet: all kinds of functional logic have to be implemented by yourself, whereas Spring has them built in and the underlying details are transparent to the user.

Just as Spring initializes beans from the application configuration file and defines database operations in mappers, and users do not have to care how Spring reads those configuration files to perform the various operations, Scrapy provides the same kind of configuration-driven functionality.

So: Scrapy is a crawler framework, while requests is a crawler library; the two are different things.

WHY

My politics teacher used to say: there is no love without reason, and no hate without reason. Based on my own experience, let me tell you why I recommend Scrapy so strongly.

  1. Performance: asynchronous requests based on Twisted, so it is fast!
  2. Configurable: request concurrency, delay, retry count and so on are defined in a configuration file (see the settings sketch after this list)
  3. Rich plugins: dynamic loading, resumable crawling and distributed crawling all have ready-made solutions that need only a few lines of configuration
  4. Command-line operation: generate, start, stop and monitor crawlers from the command line
  5. Web UI operation: a web interface can be integrated to start, stop and monitor crawlers
  6. Test environment: provides an interactive shell for testing
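As an example of point 2, this is the kind of excerpt you might put in a Scrapy settings.py; the setting names are standard Scrapy options, but the values are only illustrative.

```python
# settings.py (excerpt): the kind of knobs point 2 refers to.
# Values are illustrative, not recommendations.
CONCURRENT_REQUESTS = 16      # how many requests run in parallel
DOWNLOAD_DELAY = 0.5          # seconds to wait between requests to the same site
RETRY_ENABLED = True
RETRY_TIMES = 3               # retry a failed request up to 3 times
```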

HOW

Scrapy is a framework and its functionality is this powerful, so is it hard to learn?

No need to worry. Installing Scrapy is the same as installing any ordinary Python module, and once you understand what its four modules do, getting started is extremely simple. A Scrapy crawler needs less code and has a clearer structure than one written with requests.

Application Scenarios

Because Scrapy is a framework, some people feel it is too heavyweight and not as convenient as requests. All I can say is that the application scenarios and emphases are different.

Developing with Scrapy feels more like an engineering project. It is usually used to integrate crawled data from multiple sources, for example pulling video, novel, music and comic information into one data table. Developers only need to agree on the data fields in advance to work on it collaboratively, because Scrapy can hand data over to the database through the yield keyword, without explicitly calling any method.
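Here is a minimal sketch of what "handing data to the pipeline through yield" looks like; the site, CSS selectors, fields and pipeline class are placeholders, not a real project.

```python
import scrapy

class VideoSpider(scrapy.Spider):
    name = "video"
    start_urls = ["https://example.com/videos"]   # placeholder URL

    def parse(self, response):
        for row in response.css("div.item"):
            # Yielding an item is all the spider does; Scrapy routes it
            # to whatever pipelines are enabled in settings.py.
            yield {
                "title": row.css("a::text").get(),
                "url": row.css("a::attr(href)").get(),
            }

# pipelines.py: every yielded item arrives here once the pipeline is
# enabled via ITEM_PIPELINES in settings.py
class SaveToDbPipeline:
    def process_item(self, item, spider):
        # write the item to the agreed data table here
        return item
```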

requests is better suited to developing single crawler programs that do not need unified management or distributed deployment.

Epilogue

Logically the first article should have covered Scrapy's architecture and installation, but I believe that before using a technology you need to understand what it does and where it applies, so I wrote this theory-only article first.

I wrote this article twice. After the first draft, it somehow got overwritten in the editor and I had to write it again; fortunately I had sent screenshots of the middle part to a friend, so there was a bit less to rewrite. I finally understood that feeling people joke about online: finishing your homework, having the dog tear it up, and not wanting to write it again.

I hope this article gives you a deeper understanding of the theory behind crawlers, and I look forward to our next encounter.



A post-95 young programmer, writing about day-to-day practice from a beginner's perspective, from 0 to 1, in detail and with care.
Articles are published on the WeChat public account [Getting Started to Giving Up Road]; your follow is appreciated.

Thanks for every follow
