Python crawler from entry to abandonment (2): the principles of crawlers in depth

Last time I covered the basic principles of crawlers; this time I will dig deeper into how they work.

Since I am a beginner who has only worked with ETL, data warehouses, and BI over the past two years, I am reprinting this article here as a professional introduction to crawler principles; the original address is at the end.

A crawler is an automated program that requests a website and extracts data. Requesting, extraction, and automation are the keys to a crawler!

The basic process of crawler

Initiate a request
Send a Request to the target site through an HTTP library. The request can carry additional information such as headers; then wait for the server to respond.

Get the response content
If the server responds normally, you get a Response. The content of the Response is the page to be obtained; it may be HTML, a JSON string, binary data (such as an image or video), or other types.

Parse the content
The content may be HTML, which can be parsed with regular expressions or a page-parsing library; JSON, which can be converted directly to a JSON object for parsing; or binary data, which can be saved or processed further.

Save the data
The data can be saved in various forms: as plain text, to a database, or to a file in a specific format.
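The four steps above can be sketched in a few lines of Python. This is a minimal sketch only: the request step is stubbed with a canned HTML string (a real crawler would call an HTTP library such as requests against a live URL), and the regular expression and file name are made up for illustration.

```python
import json
import re

# A minimal sketch of the four steps. The HTTP request is stubbed with a
# canned HTML string; a real crawler would fetch the page over the network.
SAMPLE_HTML = "<html><head><title>Example Page</title></head><body><p>hello</p></body></html>"

def fetch(url):
    # Steps 1-2: initiate the request and get the response content.
    # With the requests library this would be: requests.get(url).text
    return SAMPLE_HTML

def parse(html):
    # Step 3: parse the obtained content, here with a regular expression.
    match = re.search(r"<title>(.*?)</title>", html)
    return {"title": match.group(1) if match else None}

def save(record, path="result.json"):
    # Step 4: save the data, here as a JSON text file.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False)

data = parse(fetch("http://example.com"))
save(data)
print(data["title"])  # Example Page
```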

What are Request and Response?

The browser sends a message to the server where the URL is located; this process is called an HTTP Request.

After the server receives the message sent by the browser, it processes it according to the message's content and then sends a message back to the browser; this process is the HTTP Response.

After the browser receives the Response from the server, it processes the information accordingly and then displays it.

What is included in the Request?

request method

GET and POST are the two most commonly used; there are also HEAD/PUT/DELETE/OPTIONS.
The difference between GET and POST: GET carries the requested data in the URL, while POST carries it in the request body.

GET: requests a representation of the specified resource. The GET method should only be used to read data and should not be used in operations that produce "side effects", such as in a web application. One reason is that GET requests may be triggered arbitrarily by web spiders and the like.

POST: submits data to the specified resource and asks the server to process it (such as submitting a form or uploading a file). The data is included in the request body. This request may create new resources, modify existing ones, or both.

HEAD: like the GET method, a request to the server for the specified resource, except that the server does not return the body of the resource. Its advantage is that you can obtain "information about the resource" (meta-information, or metadata) without transmitting the entire content.

PUT: uploads the latest content to the specified resource location.

OPTIONS: makes the server return all HTTP request methods supported by the resource. Sending an OPTIONS request with '*' in place of the resource name is a way to test whether the server is functioning properly.

DELETE: Requests the server to delete the resource identified by the Request-URI.
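The difference in where GET and POST carry their data can be seen with the standard library alone. In this small sketch (the URL and parameters are hypothetical), the same key-value pairs end up in the query string for GET but in the request body for POST:

```python
from urllib.parse import urlencode

params = {"q": "python", "page": "1"}

# GET: the data travels in the URL itself, as a query string.
get_url = "http://example.com/search?" + urlencode(params)

# POST: the same data is encoded into the request body instead.
post_body = urlencode(params).encode("utf-8")

print(get_url)    # http://example.com/search?q=python&page=1
print(post_body)  # b'q=python&page=1'
```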

request URL

URL, that is, Uniform Resource Locator, is what we commonly call a web address. It is a concise representation of the location of a resource available on the Internet and the method for accessing it, and it is the address of a standard resource on the Internet. Every file on the Internet has a unique URL that contains information indicating the file's location and what the browser should do with it.

The format of the URL consists of three parts:
The first part is the protocol (or service mode).
The second part is the IP address (and sometimes the port number) of the host where the resource is stored.
The third part is the specific address of the host resource, such as directory and file name.

A crawler must have a target URL before it can obtain any data, so the URL is the fundamental basis on which a crawler fetches data.
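The three parts of a URL map directly onto what the standard library's `urllib.parse.urlparse` returns; a quick illustration with a made-up address:

```python
from urllib.parse import urlparse

parts = urlparse("https://www.example.com:8080/docs/index.html")
print(parts.scheme)  # https -> part 1: the protocol (or service mode)
print(parts.netloc)  # www.example.com:8080 -> part 2: host (and port)
print(parts.path)    # /docs/index.html -> part 3: the resource's address on the host
```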

request header

Contains the header information of the request, such as User-Agent, Host, and Cookies. The figure below shows all the request-header parameters sent when requesting Baidu.

The request body
The request body is the data carried by the request, such as form data when a form is submitted (POST).
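Headers and a body can be attached to a request before it is ever sent. A sketch with the standard library (the URL, user name, and cookie value are all made up), showing that giving a request a body turns it into a POST:

```python
from urllib.parse import urlencode
from urllib.request import Request

req = Request(
    "http://example.com/login",
    data=urlencode({"user": "alice"}).encode("utf-8"),  # the request body
    headers={"User-Agent": "Mozilla/5.0", "Cookie": "session=abc123"},
)

# urllib normalizes header names to capitalized form internally.
print(req.get_method())              # POST (a request with a body defaults to POST)
print(req.get_header("User-agent"))  # Mozilla/5.0
```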

What is included in the Response

The first line of every HTTP response is the status line, which contains, in order, the HTTP version, a three-digit status code, and a phrase describing the status, separated by spaces.

Response status

There are many response statuses; for example, 200 means success, 301 a redirect, 404 page not found, 502 a server error.

  • 1xx Informational: the request has been received by the server and processing continues
  • 2xx Success: the request was successfully received, understood, and accepted by the server
  • 3xx Redirection: further action is required to complete the request
  • 4xx Client Error: the request contains a syntax error or cannot be fulfilled
  • 5xx Server Error: the server encountered an error while processing a valid request

Common codes:
200 OK: the request succeeded
400 Bad Request: the client request has a syntax error and cannot be understood by the server
401 Unauthorized: the request is unauthorized; this status code must be used together with the WWW-Authenticate header field
403 Forbidden: the server received the request but refuses to provide the service
404 Not Found: the requested resource does not exist, e.g. a wrong URL was entered
500 Internal Server Error: the server encountered an unexpected error
503 Server Unavailable: the server cannot currently handle the client's request; it may recover after some time
301 Moved Permanently: the target has moved permanently
302 Found: the target has moved temporarily

Response headers

Such as the content type, content length, server information, cookie settings, etc., as shown in the figure below.

Response body

The most important part; it contains the content of the requested resource, such as the web page's HTML, an image, or other binary data.
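All three parts of a response (status line, headers, body) can be picked apart by hand. A sketch over a hard-coded raw response string, with no network involved:

```python
# A raw HTTP response: status line, header lines, a blank line, then the body.
raw = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "Content-Length: 13\r\n"
    "\r\n"
    "<html></html>"
)

head, _, body = raw.partition("\r\n\r\n")
status_line, *header_lines = head.split("\r\n")
version, code, reason = status_line.split(" ", 2)  # version, 3-digit code, phrase
headers = dict(line.split(": ", 1) for line in header_lines)

print(version, code, reason)    # HTTP/1.1 200 OK
print(headers["Content-Type"])  # text/html; charset=utf-8
print(body)                     # <html></html>
```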

What kinds of data can be crawled?

Web page text: e.g. HTML documents, JSON-formatted text, etc.
Images: what you get is a binary file, saved in an image format.
Video: likewise a binary file.
Other: anything that can be requested can be obtained.

How to parse the data

  1. Direct processing
  2. JSON parsing
  3. Regular expressions
  4. BeautifulSoup
  5. PyQuery
  6. XPath
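Two of the approaches above, JSON parsing and regular expressions, need nothing beyond the standard library. The JSON string and HTML snippet below are invented for illustration:

```python
import json
import re

# A JSON response converts directly into a Python object.
obj = json.loads('{"title": "Example", "tags": ["python", "crawler"]}')

# An HTML response can be mined with a regular expression
# (BeautifulSoup, PyQuery, or XPath would be sturdier choices).
html = '<a href="http://example.com/1">first</a><a href="http://example.com/2">second</a>'
links = re.findall(r'href="(.*?)"', html)

print(obj["title"])  # Example
print(links)         # ['http://example.com/1', 'http://example.com/2']
```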

Why the crawled page data differs from what you see in the browser

This happens because much of the data on many websites is loaded dynamically via JS and Ajax, so the page obtained by a plain GET request differs from what the browser displays.

How to solve the JS-rendering problem?

Analyze the Ajax requests
Selenium/WebDriver
Splash
PyV8, Ghost.py
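The first approach, analyzing the Ajax calls, often makes browser rendering unnecessary: find the XHR endpoint in the browser's DevTools Network tab and request it directly, since it usually returns clean JSON. A sketch with the network call stubbed out (the endpoint URL and the payload are hypothetical):

```python
import json

ajax_url = "http://example.com/api/comments?page=1"

def fetch_ajax(url):
    # Stubbed; a real crawler would do requests.get(url).text against
    # the endpoint discovered in the DevTools Network tab.
    return '{"comments": [{"user": "tom", "text": "nice"}], "has_more": false}'

data = json.loads(fetch_ajax(ajax_url))
print(len(data["comments"]))  # 1
print(data["has_more"])       # False
```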

How to save the data

Text: plain text, JSON, XML, etc.

Relational databases: structured databases such as MySQL, Oracle, SQL Server.

Non-relational databases: MongoDB, Redis, and other key-value stores.
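Saving to a relational store can be sketched with SQLite from the standard library, standing in here for MySQL/Oracle/SQL Server; the table and column names are made up:

```python
import sqlite3

# An in-memory database as a stand-in for a real relational server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, title TEXT)")
conn.execute("INSERT INTO pages VALUES (?, ?)",
             ("http://example.com", "Example Page"))
conn.commit()

row = conn.execute("SELECT title FROM pages WHERE url = ?",
                   ("http://example.com",)).fetchone()
print(row[0])  # Example Page
```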




Original article: http://www.cnblogs.com/zhaof/p/6898138.html
