Python Web Crawler in Action (1): Quick Start

This series describes how to write a Python web crawler from scratch: both the easy parts and the problems you will run into, such as anti-crawling measures, encrypted sites, crawlers that cannot get at the data, and login authentication. It comes with a large number of hands-on examples against real websites.

The main purpose of writing a web crawler is to fetch the data we want and to use the crawler to automate the things we want to do on a site.

Starting today, I will explain how to get from zero to accomplishing what you want through a web crawler.

First look at a simple piece of code.

import requests  # import the requests package
url = 'https://www.cnblogs.com/LexMoon/'
strhtml = requests.get(url)  # fetch the page data with a GET request
print(strhtml.text)

The first line, import requests, imports the package for making network requests. Then we define a string url, the address of the target page, and finally use the imported requests package to request the content of that page.

Here we used requests.get(url). This get is not get as in "take something"; it is one of the methods for making network requests.

There are many request methods. The most common are GET and POST; others such as PUT and DELETE you will almost never see.

requests.get(url) sends a GET request to url and then returns a result: the response to requesting that page.

A response message is divided into response headers and a response body.

The response headers tell you whether this visit succeeded, what type of data you got back, and many other details, much of it expressed as numeric codes.

The response body is the source code of the page you requested.
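
To make that concrete, here is a minimal sketch (using the same blog URL as above) that prints the status code, one header field, and the start of the body:

import requests
resp = requests.get('https://www.cnblogs.com/LexMoon/')
print(resp.status_code)  # e.g. 200 means the request succeeded
print(resp.headers.get('Content-Type'))  # what type of data came back
print(resp.text[:200])  # the first 200 characters of the page source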

Okay, with that you can count as having gotten started with Python crawlers, but there are still many questions:

1. What is the difference between GET and POST requests?

2. Why does the page I crawl sometimes not contain the data I want?

3. Why is the content I crawl from some sites different from what I actually see in the browser?

What is the difference between GET and POST requests?

The main difference between GET and POST is where the parameters are placed. Take a website that requires login as an example: when we click "log in", where should the account and password go?

The most visible characteristic of a GET request is that its parameters are placed directly in the URL.

For example, if you search Baidu for the keyword Python, you can spot it in the following URL:

https://www.baidu.com/s?wd=Python&rsv_spt=1

Here wd=Python is one of the parameters. GET request parameters start after a ? and are separated by &.
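
In code you do not have to build the query string by hand; requests can assemble it from a dict. A small sketch of the same Baidu search:

import requests
params = {'wd': 'Python', 'rsv_spt': '1'}  # the same parameters as in the URL above
resp = requests.get('https://www.baidu.com/s', params=params)
print(resp.url)  # requests joined them as ?wd=Python&rsv_spt=1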

If a website that needs you to enter a password used GET requests, our personal information would be all too easy to expose, so a POST request is needed instead.

In a POST request, the parameters are carried in the request body.

The following example is a request captured when I logged in to the W3C website; you can see that the Request Method is POST.

[Screenshot: the captured W3C login request, showing Request Method: POST]

Further down in the request is our login information: the account and password, encrypted and then sent to the other side's server for verification.

[Screenshot: the login request's form data, with the account and password encrypted]
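
As a sketch of what such a login POST looks like in code (the URL and field names here are made up for illustration; a real site defines its own, which you can read from the captured request):

import requests
login_url = 'https://example.com/login'  # hypothetical login endpoint
payload = {'account': 'user@example.com', 'password': 'secret'}  # hypothetical field names
resp = requests.post(login_url, data=payload)  # data= puts the parameters in the request body
print(resp.status_code)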

Why does the page I crawl sometimes not contain the data I want?

Sometimes our crawler manages to fetch a website, but when we inspect the data we find that although the target page came down, the data we wanted is not in it.

This problem mostly occurs when the target data is a list on the page. For example, a few days ago a student asked me a question: he was crawling Ctrip flight information, and apart from the flight information everything else on the page came down fine.

Website link: https://flights.ctrip.com/itinerary/oneway/cgq-bjs?date=2019-09-14

As shown below:

[Screenshot: the Ctrip flight search page]

This is a very common problem. When he called requests.get, he used the URL above; but although the browser shows that address, the flight data is not actually served from that address.

That sounds puzzling, but look at it from the perspective of Ctrip's designers: the flight-list part of the page may be very heavy to load. If it were placed directly in this page, users might have to wait a long time for the page to open, lose patience, and close it. So the designers put only the main frame at this URL, letting users enter the page quickly, and load the main flight data afterwards, so that users do not give up after a long wait.

[Screenshot: the page frame, which loads before the flight data]

In the end it all comes down to user experience. So how should we solve this problem?

If you have studied front-end development, you will recognize this as an Ajax asynchronous request; if you don't know it, that's fine, since we are not here to talk about front-end technology.

We only need to know that the page we originally requested, https://flights.ctrip.com/itinerary/oneway/cgq-bjs?date=2019-09-14, contains a js script which, once executed, sends a further request, and the purpose of that script is to request exactly the flight information we want to crawl.

At this point we can open the browser's console (Google Chrome or Firefox recommended). Press F to enter the tank... no, press F12 to enter the browser console, then click Network.

Here we can see all the network requests and responses that happen on this page.

[Screenshot: the Network panel listing this page's requests and responses]

In there we can find that the flight information is actually requested from this URL: https://flights.ctrip.com/itinerary/api/12808/products.

[Screenshot: the products API request in the Network panel]
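
So instead of crawling the page itself, we can request this endpoint directly. Below is a sketch assuming the endpoint takes a POST with a JSON body; the payload fields are illustrative guesses, and the exact method, headers, and body should be copied from the captured request in the Network panel:

import requests
api_url = 'https://flights.ctrip.com/itinerary/api/12808/products'
payload = {'dcity': 'CGQ', 'acity': 'BJS', 'date': '2019-09-14'}  # assumed fields; copy the real ones from the Network panel
resp = requests.post(api_url, json=payload)  # json= sends the payload as a JSON request body
print(resp.status_code)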

Why is the content I crawl from some sites different from what I actually see?

The last question is why the content crawled down from some sites is not the same as what I actually see in the browser.

The main reason is that your crawler has not logged in.

Just as when we browse the web normally, some information requires logging in before it can be accessed, and it is the same for crawlers.

This involves a very important concept: our everyday page views are made over HTTP requests, and HTTP requests are stateless.

What does stateless mean? You can understand it as "not recognizing people": your request arrives at the other party's server, but that server has no idea who you actually are.

If that is the case, why can we keep accessing a page for a long time after logging in?

Because although HTTP is stateless, the other side's server issues us a kind of ID card: the cookie.

When we first visit a page we have never been to, the server gives us a cookie. After that, every request we make on this page must carry the cookie, so that the server can identify who we are from it.

[Screenshot: the cookie carried by a request]

For example, we can find the relevant cookie on Zhihu.

For such sites, we can either take the cookie straight from the browser and use it in code, passing it to requests as a dict of cookies (sketched below), or let the crawler simulate logging in to the site to obtain the cookie.
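
A minimal sketch, assuming a made-up cookie name (the value reuses the placeholder string from the original text); copy the real name/value pair from your browser's developer tools after logging in:

import requests
url = 'https://www.zhihu.com'
cookies = {'session_id': 'aidnwinfawinf'}  # hypothetical cookie name; requests expects a dict, not a bare string
resp = requests.get(url, cookies=cookies)  # the cookie is sent with the request so the server recognizes us
print(resp.status_code)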
