Python web crawler 1

1 Prelude to web crawlers

1.1 Guide to the content of the web crawler course


1.2 Selection of Python language development tools

This course mainly uses four kinds of development tools. Among them, Python's built-in IDLE is the default, entry-level editor and the one used most here.

2 Rules of web crawlers


2.1 Getting started with the Requests library

2.1.1 How to install Requests

The Requests library (http://www.python-requests.org) is widely regarded as the best third-party library for fetching web pages. It has two characteristics:

  1. Simple
  2. Concise: even a single line of code can retrieve a resource from the web

Installation method: run the command pip install requests from a command prompt (as administrator on Windows), then test the installation in the Python shell.
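A minimal test of the installation; this sketch assumes Baidu's homepage as the target, but any reachable URL works:

import requests

r = requests.get("http://www.baidu.com")
print(r.status_code)   # 200 means the request succeeded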

2.1.2 The get() method of the Requests library

Note that Python is case-sensitive, so the library must be imported as requests and the method called as requests.get().
The Requests library provides seven methods in total. request() is the base method; the other six (get(), head(), post(), put(), patch(), delete()) are implemented by calling request().
r = requests.get(url) constructs a Request object to ask the server for resources and returns a Response object r containing everything the server sends back.
r.status_code: the HTTP status code of the request; 200 means the request succeeded.
r.headers: the header information of the page.
The Response object contains all the information returned by the server, as well as the request that was sent to it.
Resources on the network are encoded; read with the wrong encoding, the content is unreadable. r.encoding is parsed from the charset field of the HTTP header (defaulting to ISO-8859-1 when no charset is given), while r.apparent_encoding is inferred from the content itself and is usually more accurate.
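A short sketch exercising these attributes, again using Baidu's homepage as the example URL:

import requests

r = requests.get("http://www.baidu.com")
print(r.status_code)         # 200 on success
print(r.headers)             # response header information
print(r.encoding)            # guessed from the HTTP header, often ISO-8859-1
print(r.apparent_encoding)   # inferred from the content, e.g. 'utf-8'
r.encoding = r.apparent_encoding
print(r.text[:200])          # now the page text is readable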

2.1.3 General code framework for crawling web pages

requests.get() does not always succeed: network connections carry risk, so exception handling is essential for crawling accurately and reliably.
Timeout: raised when the whole process, from sending the request to receiving the content, times out.
ConnectTimeout: raised when just the connection to the remote server times out.
r.raise_for_status() raises an HTTPError whenever the status code is not 200, so a single try/except block can catch every kind of failure. This general code framework makes crawling more stable, effective, and reliable.
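A sketch of the framework as taught in the course (the function name getHTMLText follows the course's convention):

import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()               # raise HTTPError if status is not 200
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "an exception occurred"

if __name__ == "__main__":
    print(getHTMLText("http://www.baidu.com")[:300])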

2.1.4 HTTP protocol and Requests library method

HTTP protocol

HTTP is a protocol in which the user initiates a request and the server responds.
It is stateless: there is no connection between one request and the next.
It is an application-layer protocol, meaning it runs on top of TCP.
HTTP identifies resources with URLs. A URL is analogous to a file path, except that the resource it names lives on the Internet.
Each HTTP method corresponds to a function provided by the Requests library.

Insert picture description here
The HTTP protocol locates resources through URLs and manages them through six commonly used methods; each operation is independent and stateless.
In the world of HTTP, the network channel and the server are black boxes: all that is visible is the URL and the operations performed on it.

The one-to-one correspondence between HTTP methods and Requests library functions:

  • GET → requests.get(): request the resource at a URL
  • HEAD → requests.head(): request only the resource's header information
  • POST → requests.post(): append new data to the resource at a URL
  • PUT → requests.put(): store data at a URL, overwriting the existing resource
  • PATCH → requests.patch(): modify part of the resource at a URL
  • DELETE → requests.delete(): delete the resource stored at a URL

Requests library methods (post(), put())

head() obtains a resource's summary information (its headers) while costing very little network traffic.
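A sketch of head() in action, using httpbin.org (a public HTTP echo service) as the target:

import requests

r = requests.head('http://httpbin.org/get')
print(r.headers)   # the header information is returned
print(r.text)      # empty string: head() does not download the body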

What post() submits depends on the type of the payload, and the server records it differently:

  1. Posting a dict: the data is encoded into the form field of the resource.
  2. Posting a string: the data is stored in the data field.
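A sketch of both cases, again using httpbin.org so the response echoes back where the payload landed:

import requests

payload = {'key1': 'value1', 'key2': 'value2'}

# Case 1: posting a dict stores it as form data
r = requests.post('http://httpbin.org/post', data=payload)
print(r.json()['form'])   # {'key1': 'value1', 'key2': 'value2'}

# Case 2: posting a string stores it in the data field
r = requests.post('http://httpbin.org/post', data='ABC')
print(r.json()['data'])   # ABC

put() works the same way, except that it overwrites the resource at the URL.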

2.1.5 Analysis of the main methods of the Requests library

request

requests.request(method, url, **kwargs) is the base method. method is the request type and matches the seven HTTP operations: GET, HEAD, POST, PUT, PATCH, DELETE, and OPTIONS.
OPTIONS: asks the server which parameters it supports when interacting with a client; it does not retrieve a resource directly, so it is the least used.
**kwargs holds 13 optional keyword arguments that control access, including:
params: a dict or byte sequence appended to the URL as its query string.
data: a dict, byte sequence, or file object submitted to the server as the request body; a common way to submit data.
json: data in JSON format, also a common way to submit data.
headers: custom HTTP header fields. Setting user-agent to 'Chrome/10' (version 10 of Chrome), for example, lets a crawler simulate any browser when it accesses the server.
proxies: a dict of proxy servers to route requests through; this also hides the original IP address from which the page was crawled.
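A sketch combining several of these keyword arguments; httpbin.org echoes the request back, and the proxy addresses are made-up placeholders:

import requests

# Custom header: pretend to be version 10 of Chrome
hd = {'user-agent': 'Chrome/10'}
r = requests.request('POST', 'http://httpbin.org/post',
                     data={'key1': 'value1'},    # submitted as form data
                     headers=hd,
                     timeout=10)
print(r.json()['headers']['User-Agent'])         # Chrome/10

# Hypothetical proxies: requests routed through them hide the original IP
pxs = {'http': 'http://10.10.10.1:1234',
       'https': 'https://10.10.10.1:4321'}
# r = requests.request('GET', 'http://www.baidu.com', proxies=pxs)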

get

requests.get(url, params=None, **kwargs)

head

requests.head(url, **kwargs)

post

requests.post(url, data=None, json=None, **kwargs)

put

requests.put(url, data=None, **kwargs)

patch

requests.patch(url, data=None, **kwargs)

delete

requests.delete(url, **kwargs)

Summary

Why is the library designed this way? In the six derived methods, the parameters each method commonly needs are promoted to explicitly named parameters, while the rest remain optional keyword arguments.
get() is by far the most commonly used method, since crawling means fetching content from the server rather than submitting content to it.

2.1.6 Summary

Use get() to build crawlers.
Use head() to get just the summary information when a resource is very large.

import requests
import time

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()               # raise HTTPError if status is not 200
        r.encoding = r.apparent_encoding
        return r.text                      # return the page text
    except requests.RequestException:
        print("failed")


if __name__ == "__main__":
    start = time.perf_counter()
    for i in range(100):                   # fetch the same page 100 times
        getHTMLText("http://www.baidu.com")
    end = time.perf_counter()
    print("Elapsed time: %.5f seconds" % (end - start))

I did not check whether all 100 requests actually succeeded...

2.2 The "pirates can also be thefts" of web crawlers

2.2.1 Problems caused by web crawlers

Crawlers can impose performance burdens on web servers and raise legal and privacy risks, so websites place limits on them: checking the source of requests (for example, the user-agent header) and publishing a Robots protocol, described next.

2.2.2 Robots Protocol

The Robots (Robots Exclusion) protocol announces which parts of a site crawlers may visit, through a robots.txt file in the site's root directory.
In the lecture's example robots.txt, URLs beginning with "?" and html files under the pop directory are disallowed...
A site that publishes no robots.txt is signalling that it may be crawled freely.
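Fetching a robots.txt is itself a one-line crawl; a minimal sketch using JD's file (JD is the site crawled in section 2.3.1):

import requests

r = requests.get("https://www.jd.com/robots.txt", timeout=10)
print(r.text)   # the User-agent / Disallow rules that crawlers should honor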

2.2.3 How to comply with the Robots agreement

In other words, a crawler whose behavior resembles human browsing, i.e. whose number of visits is small, may not need to strictly follow the Robots protocol.

2.3 The Requests library in practice: web crawlers (5 examples)

These examples look at web content from a crawler's perspective.

2.3.1 Crawling of JD product pages

r.encoding: the encoding parsed from the HTTP header; JD.com supplies it, so the page is readable without guessing the encoding.

import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()               # fail on a non-200 status code
        r.encoding = r.apparent_encoding
        print(r.text[:1000])               # print only the first 1000 characters
    except requests.RequestException:
        print("failed")


if __name__ == "__main__":
    getHTMLText("https://item.jd.com/100006349587.html")

What actually came back, though, was JD's login page.

2.3.2 Crawling of Amazon product pages

Amazon does return a response, which means the failure is not a network problem: the website itself rejected the crawler's request.
The Response object r carries the request that was sent (r.request.headers), and those headers show the library announcing itself as python-requests in the user-agent field, so the header information needs to be modified.
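A sketch of the fix: inspect the outgoing headers, then resend with a browser-style user-agent. The product URL is a stand-in for any Amazon page, and the exact status codes may vary:

import requests

url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"   # example product page
r = requests.get(url)
print(r.status_code)          # e.g. 503: the default crawler identity is rejected
print(r.request.headers)      # 'User-Agent': 'python-requests/...'

# Resend, disguised as an ordinary browser
kv = {'user-agent': 'Mozilla/5.0'}
r = requests.get(url, headers=kv)
print(r.status_code)          # 200 if the disguise is accepted
print(r.text[:1000])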

2.3.3 Submitting search keywords to Baidu and 360

Baidu's keyword search interface has the form http://www.baidu.com/s?wd=keyword, and 360's has the form http://www.so.com/s?q=keyword, so a crawler can submit a search just by filling in the query string.
Parsing the returned content is a topic for later.
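A sketch of keyword submission via the params argument (Baidu shown; for 360, swap the URL and use the key 'q'):

import requests

keyword = "Python"
try:
    kv = {'wd': keyword}
    r = requests.get("http://www.baidu.com/s", params=kv)
    print(r.request.url)      # the full URL that was actually sent
    r.raise_for_status()
    print(len(r.text))        # size of the returned result page
except requests.RequestException:
    print("failed")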

2.3.4 Crawling and storage of network pictures

An image comes back in binary form, exposed as r.content. How do we save it as a file?

  • Save it under a custom file name
  • Save it under its original name, i.e. the part of the URL after the last '/' (see the sketch below)
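A sketch of saving under the original name; the image URL and the target directory are placeholders:

import requests
import os

url = "http://example.com/images/sample.jpg"   # placeholder image URL
root = "D:/pics/"                              # placeholder target directory
path = root + url.split('/')[-1]               # keep the image's original name
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        r.raise_for_status()
        with open(path, 'wb') as f:
            f.write(r.content)                 # r.content is the binary image data
        print("saved as", path)
    else:
        print("file already exists")
except requests.RequestException:
    print("crawl failed")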

2.3.5 Automated query of an IP address's location

Constrain the amount of text you print, for example r.text[-500:]; dumping a whole page can bog down IDLE.
As long as you know the form of the URL that the page submits to the backend when the button is pressed, you can simulate that submission in code.
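A sketch based on the IP-lookup interface used in the lecture, which takes the IP in the URL's query string; the site may have changed since, so treat the URL pattern as an assumption:

import requests

url = "http://m.ip138.com/ip.asp?ip="
try:
    r = requests.get(url + "202.204.80.112")
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-500:])      # only the tail, to avoid flooding IDLE
except requests.RequestException:
    print("failed")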


Origin blog.csdn.net/qq_42713936/article/details/105804104