Introduction to crawling web pages with Python

Basic knowledge of crawling web pages - HTTP protocol

1. HTTP is the standard protocol for requests and responses between a client (the user's browser) and a server (the website); it runs on top of TCP.
2. HTTP working process
(1) The client establishes a connection with the server.
(2) The client sends an HTTP request.
(3) After receiving the request, the server returns the response information.
(4) The TCP connection is released.
(5) The client receives the information returned by the server, and the browser parses and displays the web page.
3. Crawler process (a minimal code sketch of these four steps follows item 4 below)
(1) Initiate a request: send a request to the target site through an HTTP library; the request can carry additional headers and other information.
(2) Get the response content: if the server responds normally, you will get a response whose body may be HTML, a JSON string, or binary data.
(3) Parse the content: HTML can be parsed with regular expressions or a web page parsing library; JSON can be converted directly into a JSON object for analysis; binary data can be saved or processed further.
(4) Save the data: the data can be saved in various forms, for example as plain text, in a database, or in another specific format.
4. Request and Response
(1) The browser sends a message to the server where the website is located; this process is called an HTTP Request.
(2) After the server receives the message sent by the browser, it processes it according to the message content and sends a message back to the browser; this process is called an HTTP Response.
(3) After the browser receives the Response information from the server, it processes it accordingly and then displays the page.
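The four steps of the crawler process above can be strung together with nothing but the standard library. Below is a minimal sketch; the target URL and the regular expression are only illustrative assumptions.

import re
import urllib.request

url = 'http://www.python.org'                         # 1. initiate a request
with urllib.request.urlopen(url) as response:         # 2. get the response content
    html = response.read().decode('utf-8')

titles = re.findall(r'<title>(.*?)</title>', html)    # 3. parse the content with a regular expression

with open('result.txt', 'w', encoding='utf-8') as f:  # 4. save the data as text
    f.write('\n'.join(titles))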

Basic knowledge of crawling web pages - HTTP request methods

HTTP/1.0 defines three request methods: GET, POST, and HEAD.
HTTP/1.1 adds six new request methods: OPTIONS, PUT, PATCH, DELETE, TRACE, and CONNECT.
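Beyond plain GET and POST, urllib.request.Request accepts a method argument, so the other methods can be sent as well. Below is a small sketch of a HEAD request, which retrieves only the response headers; the URL is just an example.

import urllib.request

req = urllib.request.Request('http://www.python.org', method='HEAD')
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.reason)       # e.g. 200 OK
    print(resp.headers['Content-Type'])   # header fields only, no body is downloaded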

The data submitted by GET is placed after the URL, that is, in the request line. The URL and the transmitted data are separated by '?', and the parameters are joined by '&', for example EditBook?name=test1&id=123456. (How the Content-Type request header affects this parameter format will be discussed later.) The POST method puts the submitted data in the request body of the HTTP packet.
The size of the data submitted by GET is limited (because browsers restrict the length of the URL), while the data submitted by the POST method has no such limit.
GET and POST requests are also read differently on the server side; that is, the way the request data is retrieved on the server differs, which follows directly from where the data is placed.
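To make the difference concrete, here is a small sketch that builds the same parameters once as a GET query string and once as a POST body; the URL and parameter names are hypothetical.

import urllib.parse
import urllib.request

params = urllib.parse.urlencode({'name': 'test1', 'id': '123456'})

# GET: the data is appended to the URL after '?'
get_url = 'http://www.example.com/EditBook?' + params     # .../EditBook?name=test1&id=123456

# POST: the same data goes into the request body instead (request built here, not sent)
post_req = urllib.request.Request('http://www.example.com/EditBook',
                                  data=params.encode('utf-8'))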

The first line of every HTTP response is the status line. It consists of the HTTP version number, the 3-digit status code, and a phrase describing the status, separated by spaces, for example: HTTP/1.1 200 OK.

Basic knowledge of crawling web pages - URL

URL is the abbreviation of Uniform Resource Locator, commonly known as a web address.
A URL follows a standard syntax and consists of six parts: protocol, host name, domain name, port, path, and file name.
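The standard library's urllib.parse.urlparse() splits a URL into these components; a small sketch with a made-up URL:

from urllib.parse import urlparse

parts = urlparse('http://www.example.com:8080/docs/index.html?id=123#top')
print(parts.scheme)    # protocol: 'http'
print(parts.hostname)  # host name: 'www.example.com'
print(parts.port)      # port: 8080
print(parts.path)      # path and file name: '/docs/index.html'
print(parts.query)     # query string: 'id=123'
print(parts.fragment)  # fragment: 'top'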

HTML and JavaScript basics - web page structure

1. Web pages generally consist of three parts: HTML (HyperText Markup Language), CSS (Cascading Style Sheets), and JavaScript (an active scripting language).
2. HTML is the structure of the entire web page, equivalent to the frame of the entire website. Tags enclosed in "<" and ">" are HTML tags, and tags generally appear in pairs.
3. Common tags are as follows:

<html>..</html> marks the content in between as a web page
<body>..</body> marks the content visible to the user
<div>..</div> marks a division (a block of the page)
<p>..</p> marks a paragraph
<li>..</li> marks a list item
<img> marks an image (an empty tag with no closing counterpart)
<h1>..</h1> marks a heading
<a href="">..</a> marks a hyperlink


4. CSS defines the style. <style type="text/css"> indicates that CSS follows below, and the appearance of the page is defined in the CSS.
5. JavaScript provides the functionality. The interactive content and various special effects live in JavaScript, which describes the various functions of the website.
If you use the human body as a metaphor, HTML is the skeleton and defines where the mouth, eyes, and ears should be. CSS is the person's appearance, such as what the mouth looks like, whether the eyes have double or single eyelids, whether the eyes are big or small, and whether the skin is dark or fair. JavaScript represents the person's skills, such as dancing, singing, or playing an instrument.
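As a small illustration of how a crawler works with these tags, the sketch below uses the standard library's html.parser to pull the heading and the hyperlinks out of a tiny HTML fragment; the fragment itself is made up for the example.

from html.parser import HTMLParser

class LinkAndHeadingParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == 'a':                      # <a href="..."> is a hyperlink
            for name, value in attrs:
                if name == 'href':
                    print('link:', value)
        elif tag == 'h1':                   # <h1> is a heading
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == 'h1':
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            print('heading:', data)

page = '<html><body><h1>Example</h1><p>Some text.</p><a href="http://www.python.org">Python</a></body></html>'
LinkAndHeadingParser().feed(page)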

Every website has a document named robots.txt, although some websites do not provide one. For websites without robots.txt, any data that is not protected by a password can be obtained through a web crawler, that is, all page data of the website can be crawled. If the website has a robots.txt file, you need to determine whether it forbids visitors from fetching certain data.
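The standard library ships urllib.robotparser for exactly this check; a short sketch follows, where the site and page URLs are assumptions.

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://www.example.com/robots.txt')
rp.read()                                   # download and parse robots.txt
if rp.can_fetch('*', 'http://www.example.com/some/page.html'):
    print('Allowed to crawl this page')
else:
    print('robots.txt forbids crawling this page')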

5. The urllib library for crawling web pages

1. The urllib library
The Python 3.x standard library urllib provides four modules: urllib.request, urllib.error, urllib.parse, and urllib.robotparser, which together give good support for reading web content. Combined with Python string methods and regular expressions, it can complete some simple web content crawling, and it is also the basis for understanding and using other crawler libraries.

  1. Use the urllib library to get web page information
    Use the urllib.request.urlopen() function to open a website, then read and print the web page information.
    urllib.request.urlopen(url, data=None[, timeout])
    The urlopen() function returns a response object.
    The parameter url is the address of the remote resource; data is the data submitted to the url (if given, the request is sent as a POST); timeout is an optional timeout in seconds.
    Methods of the response object (see also the short example after the first snippet below):
    info() method: returns an http.client.HTTPMessage object containing the response headers.
    getcode() method: returns the HTTP status code. For an HTTP request, 200 means the request completed successfully, and 404 means the URL was not found.
    geturl(): returns the URL that was actually requested.
# 1. Read and display web page content

>>> import urllib.request
>>> fp = urllib.request.urlopen(r'http://www.python.org')
>>> print(fp.read(100))              # read 100 bytes
>>> print(fp.read(100).decode())     # read the next 100 bytes and decode them as UTF-8
>>> fp.close()                       # close the connection
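A short follow-up showing the response object methods listed above, on the same python.org URL:

>>> import urllib.request
>>> fp = urllib.request.urlopen('http://www.python.org')
>>> fp.getcode()        # HTTP status code, e.g. 200
>>> fp.geturl()         # the URL that was actually retrieved
>>> fp.info()           # response headers as an http.client.HTTPMessage object
>>> fp.close()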

# 2. Submit web page parameters
# 1) The following code demonstrates how to use the GET method to read and display the content of the specified URL.
>>> import urllib.request
>>> import urllib.parse
>>> params = urllib.parse.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
>>> url = "http://www.musi-cal.com/cgi-bin/query?%s" % params
>>> with urllib.request.urlopen(url) as f:
...     print(f.read().decode('utf-8'))
# 2) Use the POST method to submit parameters and read the content of the specified page.
>>> import urllib.request
>>> import urllib.parse
>>> data = urllib.parse.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
>>> data = data.encode('ascii')
>>> with urllib.request.urlopen("http://requestb.in/xrbl82xr", data) as f:
...     print(f.read().decode('utf-8'))
# 3. Access a page through an HTTP proxy

>>> import urllib.request
>>> proxies = {'http': 'http://proxy.example.com:8080/'}
>>> opener = urllib.request.FancyURLopener(proxies)
>>> with opener.open("http://www.python.org") as f:
...     f.read().decode('utf-8')
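FancyURLopener is a legacy interface that is deprecated in Python 3; a sketch of the same request with ProxyHandler and build_opener, where the proxy address is, as above, only an example:

>>> import urllib.request
>>> proxy_handler = urllib.request.ProxyHandler({'http': 'http://proxy.example.com:8080/'})
>>> opener = urllib.request.build_opener(proxy_handler)
>>> f = opener.open("http://www.python.org")
>>> print(f.read().decode('utf-8'))
>>> f.close()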
