Summary of a small Scrapy crawler demo

1. Getting started with Scrapy.

a) Installation of Scrapy.

There is not much to say about this; there are plenty of installation guides online.

 

One thing to watch out for: I had downloaded 32-bit Python and ran into a situation where pywin32 could not be used. The fix is simply pip install pypiwin32.

b) Installation verification.

scrapy genspider baidu www.baidu.com (this generates a spider that crawls Baidu).

 

scrapy crawl baidu

 

Data crawled successfully.

c) Scrapy project creation:

First enter the directory where you want to create the project: cd xxx

Then create: scrapy startproject demo1 (the project name is demo1)

Enter the project directory and set the crawl target: cd demo1

Then generate a spider for it: scrapy genspider demo1demo quotes.toscrape.com

After the above two steps, the project has the following structure:
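Roughly, a freshly generated project looks like this (the spider file name comes from the genspider command above):

    demo1/
        scrapy.cfg
        demo1/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py
                demo1demo.py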

 

 

Then, since I use the PyCharm IDE, I run these commands from its built-in terminal.

 

Enter the crawl command: scrapy crawl demo1demo

 

 

Note: I used Chinese comments in the generated files and got an encoding error, so in short, keep the files in an English-only environment.
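If you do want to keep non-ASCII comments, one common fix (relevant for Python 2 era source files; Python 3 already treats source as UTF-8) is to declare the encoding at the top of the file:

    # -*- coding: utf-8 -*-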

2. The basic use of Scrapy.

After all that tossing around, Scrapy can finally run, so let's see how to use it (emmmm, to be honest, I don't use it well; I'm just recording this annoying little demo).

Well, the first step of actually using it is to set the URL to crawl:

 

(The code is so beautiful, ^_<)
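As a rough sketch (using the names from the genspider command above, not necessarily the exact original code), a spider with its start URL set looks something like this:

    import scrapy

    class Demo1demoSpider(scrapy.Spider):
        name = "demo1demo"
        allowed_domains = ["quotes.toscrape.com"]
        start_urls = ["http://quotes.toscrape.com/"]

        def parse(self, response):
            # response.text is the raw HTML of the returned page
            print(response.text)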

Then there is the callback mechanism: the scrapy.Request() method. This method is the core of a Scrapy crawler. Its first parameter is the URL to visit, and its second is the callback, i.e. the function that gets called once the response comes back. You write that function yourself; the only requirement is that it takes a response parameter. That response is the returned page, so printing response.text gives you the raw page data, and all that's left to do is parse it. So how do you parse the page data? Talking about it in the abstract doesn't help much; the response object exposes selector methods (xpath(), css(), and so on) for exactly this.

 

These selector methods all work against the DOM tree, and once the parsing is done you can follow the extracted links and keep crawling. Hey, there's not much more to say about this part.
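Continuing the sketch above, a parse callback that extracts fields with CSS selectors and then follows the next-page link might look roughly like this (the selectors assume the markup of quotes.toscrape.com):

        # replaces the parse() method in the spider sketched earlier
        def parse(self, response):
            # extract fields from the DOM with CSS selectors
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
            # queue the next page with scrapy.Request, using the same callback
            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield scrapy.Request(response.urljoin(next_page), callback=self.parse)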

3. Scrapy file description.

 

These are the files of a Scrapy project. __init__.py needs no explanation. So what is middlewares.py? It holds the middleware, i.e. classes that hook into the middle of the request/response cycle; I won't force an explanation here, since I only know the general idea and couldn't debug it if something went wrong. pipelines.py is where you write items to a database and so on. The last one, settings.py, is obviously the settings: things like the download delay go in there, and you can also put your own constants in it.

 

Note that the request headers can also be written in there, as in the sketch below. Emmm, that's about it.
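A sketch of the kind of settings mentioned above (the values are just examples, not the ones used in this demo):

    # settings.py (excerpt)
    BOT_NAME = "demo1"

    # be polite: wait a couple of seconds between requests
    DOWNLOAD_DELAY = 2

    # identify ourselves with a custom User-Agent
    USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

    # request headers sent with every request by default
    DEFAULT_REQUEST_HEADERS = {
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en",
    }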

 

4. Notes on using Scrapy. (Just the things I personally need to pay attention to; after all, I am a novice.)

By the way, I copied this crawler from a tutorial and made small modifications for my own needs, so I roughly understand it (bah, probably not at all). The tutorial maintains a cookie pool: the author bought hundreds of throwaway accounts and rotates through them while crawling. I'm not that rich and didn't buy any accounts, so I only have one cookie, but I still had to imitate the tutorial and fetch it from a "cookie pool", otherwise the code wouldn't run at all and I couldn't fix it properly. So I used Spring Boot (a Java microservice framework) to stand up a simple server, and every time it is requested it returns my cookie as a JSON string; then I could use that cookie as much as I wanted. One cookie got banned after about 200 items of data, which was embarrassing, so to be a civilized crawler I added a delay, and after that I didn't get banned again. It was genuinely thrilling. time.sleep(), thread sleep. A rough sketch of the idea is below.
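This sketch assumes a hypothetical local endpoint that returns the cookie as JSON; the URL, field names, and spider details are illustrative, not the original code:

    import time
    import requests
    import scrapy

    COOKIE_SERVER = "http://localhost:8080/cookie"  # hypothetical Spring Boot endpoint

    def fetch_cookie():
        # the server is assumed to return the cookie as a JSON object,
        # e.g. {"sessionid": "abc123"}
        return requests.get(COOKIE_SERVER).json()

    class CookieDemoSpider(scrapy.Spider):
        name = "cookie_demo"  # illustrative name
        start_urls = ["http://quotes.toscrape.com/"]

        def start_requests(self):
            for url in self.start_urls:
                time.sleep(2)  # crude delay; DOWNLOAD_DELAY in settings.py is the cleaner way
                yield scrapy.Request(url, cookies=fetch_cookie(), callback=self.parse)

        def parse(self, response):
            self.logger.info("fetched %s", response.url)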

5. Crawling data processing.

There's not much to say here, and it isn't really Scrapy, but since this is a demo I'll record it anyway. When crawling, I save the output to a .json file, because JSON is the tidiest and has no garbled characters. I like it very much.
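One easy way to get that .json dump (using Scrapy's built-in feed export; the file name is just an example) is the -o option:

scrapy crawl demo1demo -o quotes.json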

Processing the JSON file is simple: read it in directly, then use dictionaries to do the counting and draw a small chart. I won't waste words on it; a sketch is below.
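A minimal sketch of that step, assuming the file name and field names from the earlier examples and using matplotlib for the small chart (any plotting tool would do):

    import json
    from collections import Counter
    import matplotlib.pyplot as plt

    # load the crawled data (a JSON array of items)
    with open("quotes.json", encoding="utf-8") as f:
        items = json.load(f)

    # count items per author with a dict-like Counter
    counts = Counter(item["author"] for item in items)

    # draw a small bar chart and save it
    plt.bar(list(counts.keys()), list(counts.values()))
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.savefig("authors.png")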

 

Hmm, then it's alright, it's been a tiring few days.
