Web Crawlers (1) -- Understand the most basic concepts of web crawlers and put them into practice in one article

1. Overview of web crawlers

1.1 Data extraction and acquisition

Definition: A web crawler is a program or script that automatically crawls information from the Internet according to certain rules. It simulates a person operating a browser to open web pages and obtain the specified data from them.

1.2 Types of crawlers

Type of crawler     | Effect
General crawler     | Crawls all the source data of a web page
Focused crawler     | Crawls only part of the data of a web page
Incremental crawler | Detects updates to website data and crawls only the newly updated data
Distributed crawler | Crawls with multiple machines to improve the speed of collecting website data

The first classification: according to the quantity crawled:
 ① General crawlers: usually refer to search engine crawlers.
   General crawlers are an important part of the crawling systems of search engines (Baidu, Google, Yahoo, etc.). Their main purpose is to download Internet web pages locally to form a mirror backup of Internet content. (One big problem is that they are very limited: much of the content is useless to a given user, because different search purposes return the same content!)

② Focused crawlers: crawlers for specific websites.
   A focused crawler is a web crawler program oriented toward the needs of a specific topic. The difference between it and a general search engine crawler is that a focused crawler processes and filters the content while crawling pages, and tries to ensure that only web page information related to the need is crawled!

2. Key point: analysis of the target website (the following is a fixed process)

2.1 Preface

Multi-page crawling: analyze the target website [use the developer tools] -> initiate a request [check the Network panel and the request headers, such as User-Agent, and find the corresponding URL] -> get a response -> continue to initiate requests -> get responses and parse them -> save the data.
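As an illustration only, here is a minimal sketch of that fixed process. The URL, the "page" parameter name and the save path are all made up for the example:

import requests

# Hypothetical paginated listing page; the "page" parameter is an assumption
BASE_URL = "https://example.com/list"
headers = {"User-Agent": "Mozilla/5.0"}   # pretend to be an ordinary browser

pages = []
for page in range(1, 4):                                   # analyze the site -> build the URLs
    resp = requests.get(BASE_URL, params={"page": page},
                        headers=headers, timeout=5)        # initiate a request
    resp.raise_for_status()                                # stop early on 4xx/5xx responses
    pages.append(resp.text)                                # parsing would happen here

with open("pages.txt", "w", encoding="utf-8") as f:        # save the data
    f.write("\n".join(pages))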

2.2 What is a cookie

In the browser, we often exchange data, for example when logging in to a mailbox or logging in to a page, and we often tick "remember me for 30 days" or an auto-login option. So how is that information recorded? The answer is cookies. Cookies are set by the HTTP server and stored in the browser. The HTTP protocol itself is a stateless protocol: once a data exchange is finished the connection is closed, and each new exchange of data has to establish a new connection.

A cookie is a small text file stored on a user's computer or mobile device to track the user's activity and status on a website. When you visit a website, the website sends a cookie to your computer.

Cookies contain information about your visit, such as your browser type, operating system, language preference and other visit information.

Cookies can be used to keep track of your login status, so that you don't have to enter your username and password again the next time you visit the site. They can also be used to track your browsing history, so that the site can provide more personalized content and advertising. However, cookies may also be used to track your private information and activities, which may raise privacy and security concerns. Therefore, most browsers provide options that allow you to control the use of cookies and to delete them.

Press F12 to open the browser developer tools, then follow the steps shown in the figure to find the Cookies:
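As a rough sketch (the cookie names and values below are made up), a cookie copied from the developer tools can be passed to requests through the cookies parameter:

import requests

url = "https://example.com/profile"                      # hypothetical logged-in page
cookies = {"sessionid": "abc123", "remember_me": "1"}    # made-up values copied from the browser

r = requests.get(url, cookies=cookies, timeout=5)
print(r.status_code)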

2.3 What are Request Headers

Request headers are the metadata included in the request when a client sends an HTTP request to a server. This metadata contains information about the request, such as the type of request (GET, POST, etc.), the source of the request, the content type of the request, the timestamp of the request, and so on.

Request headers are usually included in the header section of an HTTP request and are expressed as key-value pairs. For example, User-Agent indicates the type and version of the browser, Accept indicates the MIME types supported by the client, Referer indicates the URL address of the page the request came from, and so on.

Request headers are a very important part of the HTTP protocol. They provide additional information about the request, thereby helping the server understand the purpose and content of the request and respond accordingly. At the same time, request headers can also be used to control caching, authentication, and security operations.
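A minimal sketch of sending custom request headers with requests (the header values and the URL are just example assumptions):

import requests

headers = {
    "User-Agent": "Mozilla/5.0",    # example browser identification string
    "Accept": "text/html",          # MIME types the client says it accepts
}
r = requests.get("https://example.com/data", headers=headers, timeout=5)
print(r.request.headers)            # the headers that were actually sent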

2.4 What is User-Agent

User-Agent is an HTTP request header field that contains information about the client (usually a browser) that sends the request, such as browser type, version number, operating system, device type, and so on.

The Web server can use the User-Agent header to determine the software and hardware environment used by the client and provide a more appropriate response. For example, a website might use the User-Agent header to determine how to render pages, serve appropriate content, or respond to different types of devices.

Note: the User-Agent header can be read through JavaScript or back-end code, so some websites may use this information to analyze visitor behavior or compile statistics such as browser market share. This also gives crawlers an opportunity.

  Users can also hide their identity or deceive the server by modifying the User-Agent header, so the User-Agent cannot completely determine the real identity of the visitor.

Usually when we practice writing crawler code, we use the User-Agent of the browser we actually visit with; later on we can switch to a proxy or a randomized User-Agent.
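Since fake_useragent is imported later in this article, here is a rough sketch of using it to generate a random User-Agent (it assumes the third-party fake_useragent package is installed):

import requests
from fake_useragent import UserAgent

ua = UserAgent()                       # random User-Agent generator
headers = {"User-Agent": ua.random}    # a different browser string on each access
r = requests.get("https://example.com", headers=headers, timeout=5)
print(headers["User-Agent"], r.status_code)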

2.5 What is Referrer

Referrer refers to the URL address of the page from which the current page was linked, and is usually used to track and analyze the traffic sources of a website. When a user clicks on a link or visits a website through a search engine, the Referrer field contains the URL address of the source page.

Referrer information is usually sent by web browsers; it tells the web server from which page the visitor linked to the current page.

This information is useful for webmasters as it gives them insight into the traffic sources and user behavior of the website.

However, since the Referrer information can contain sensitive information, such as search keywords or the URL address of the last page visited, some browsers and network security software may refuse to send Referrer information, or filter or anonymize it before sending.
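Some sites check the Referer header before serving content (for example, hotlink protection), so a crawler may need to set it explicitly. A minimal sketch with made-up URLs:

import requests

headers = {
    "User-Agent": "Mozilla/5.0",
    # Pretend the request came from the site's own listing page (hypothetical URL)
    "Referer": "https://example.com/list",
}
r = requests.get("https://example.com/detail/1", headers=headers, timeout=5)
print(r.status_code)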

3. HTTP

3.1 Why do you need to know HTTP

When learning about crawling, it is very important to understand HTTP, because HTTP is the basic protocol used by web applications. HTTP (Hypertext Transfer Protocol) is an application layer protocol for transferring hypertext (including HTML files, images, audio, video, style sheets, etc.).

When using a crawler to crawl website data, the crawler first needs to send an HTTP request, and then receive an HTTP response from the server. Crawlers communicate with web servers using HTTP, which dictates how data is transmitted and presented. Therefore, understanding the basics of the HTTP protocol is crucial for learning to crawl.

3.2 HTTP response status code: (there is a familiar 404 here!)

The first digit of the HTTP status code defines the type of status code, usually in the following five types:

Status code | Explanation
1xx         | Informational: the request has been received and processing continues.
2xx         | Success: the request has been successfully received, understood, and accepted by the server.
3xx         | Redirection: further action is required from the client to complete the request.
4xx         | Client error: there is a problem with the request initiated by the client, and the server cannot understand or accept it.
5xx         | Server error: the server encountered an error while processing the request.
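A small sketch of checking the status code in a crawler; requests can also raise an exception for 4xx/5xx responses automatically:

import requests

r = requests.get("https://httpbin.org/status/404", timeout=5)
print(r.status_code)          # 404

try:
    r.raise_for_status()      # raises requests.HTTPError for 4xx/5xx status codes
except requests.HTTPError as e:
    print("request failed:", e)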

4. Acquisition of web data

4.1 The most basic library: using requests

requests is a concise and simple third-party library for handling HTTP requests. Its biggest advantage is that the programming process stays close to the normal process of accessing a URL.

The official teaching documents are as follows:
https://requests.readthedocs.io/projects/cn/zh_CN/latest/

Downloading a third-party library: open cmd and run pip install + the library name. Importing and using it: requests.get() constructs the request and returns a Response object:

import requests

r = requests.get(url)

The Response object contains all the information returned by the server, as well as the information of the Request that was sent.
Its properties are as follows:

Attribute     | Explanation
r.status_code | The status code returned by the HTTP request: 200 means success, 404 means failure
r.text        | The HTTP response content as a string: the page content corresponding to the url
r.content     | The HTTP response content in binary form (pictures, video, audio)

Notice:

  • If the data (for example, image links) is embedded in JS as JSON, you need to import the json library and decode it with json.loads() to get a usable data structure; otherwise what you have is just one long string and subsequent operations cannot be performed.
  • For URL encoding, use: from urllib import parse, then encode with parse.quote(). Both points are sketched below.
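A rough sketch of both points, using made-up data (the JSON string and the keyword are just examples):

import json
from urllib import parse

# Decode a JSON string (e.g. data embedded in a page's JS) into a Python object
raw = '{"title": "example", "imgs": ["a.jpg", "b.jpg"]}'   # made-up JSON
data = json.loads(raw)
print(data["imgs"])

# URL-encode a keyword before embedding it in a URL
keyword = "网络爬虫"
print(parse.quote(keyword))    # percent-encoded, safe to put in a URL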

4.2 Get header information

  1. Get cookies:
import requests
from fake_useragent import UserAgent

url = "http://www.baidu.com"
headers = {"User-Agent": UserAgent().random}   # random User-Agent for the request
r = requests.get(url, headers=headers)

# Method 1: items() yields (key, value) pairs; iterate and print them one by one
for key, value in r.cookies.items():
    print(key + "=" + value)

# Method 2: convert the cookie jar to a dictionary
cookie = requests.utils.dict_from_cookiejar(r.cookies)
print(cookie)
  2. To get the response headers, you can use r.headers.

  3. To get the final URL of the response after the request has been sent, you can use r.url.

Both are sketched below.
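A minimal sketch, using Baidu only as a familiar example URL:

import requests

r = requests.get("http://www.baidu.com", timeout=5)
print(r.headers)    # response headers sent back by the server
print(r.url)        # the final URL of the response (after any redirects)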

4.3 Detailed explanation of .get parameters

requests.get() and requests.post() are the two most commonly used request functions in Python's requests library; they send GET and POST requests respectively. Both functions accept multiple parameters; the following is a detailed explanation of the commonly used ones:

For the requests.get(url, params=None, **kwargs) function:

  • url: URL address of the request.
  • params: Optional dictionary or byte sequence to pass request parameters.
  • **kwargs: Optional keyword arguments, including:
    • headers: Dictionary type, HTTP request header information.
    • timeout: Set the timeout period.
    • proxies: dictionary type, proxy server address.

1. Setting cookies: the method of obtaining them has been described above.
2. Setting a proxy with proxies

Tip: whether the client knows the address of the target server is the criterion for judging: if it does, it is a forward proxy; if it does not, it is a reverse proxy.

Why do we use proxies?
(1) To make the server think the requests are not all coming from the same client;
(2) To prevent our real address from being leaked and traced back to us.

Usage:
  When we need to use a proxy, we likewise construct a proxy dictionary and pass it to the proxies parameter, as sketched below.
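A rough sketch; the proxy address and port are placeholders you would replace with a real proxy server:

import requests

# Placeholder proxy address; substitute a real proxy server here
proxies = {
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888",
}
r = requests.get("http://www.baidu.com", proxies=proxies, timeout=5)
print(r.status_code)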
  
3. Disabling certificate verification with verify
  Sometimes we use a packet capture tool. Since the certificate provided by the packet capture tool is not issued by a trusted certificate authority, certificate verification fails. In this case we need to turn verification off by setting verify=False, as sketched below.
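A minimal sketch of turning certificate verification off (requests then warns about unverified HTTPS requests, which can optionally be silenced):

import requests
import urllib3

urllib3.disable_warnings()    # optional: silence the InsecureRequestWarning

# verify=False skips TLS certificate verification (e.g. behind a capture proxy)
r = requests.get("https://example.com", verify=False, timeout=5)
print(r.status_code)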
4. Setting the timeout
  In fact, in most crawler development the timeout parameter is used together with the retrying module (to retry failed requests)!

The following code is adapted from another blogger's post (linked in the original article):

import requests
from retrying import retry

headers = {
    "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/532.2 (KHTML, like Gecko) Chrome/4.0.222.3"
}

@retry(stop_max_attempt_number=3)   # retry at most 3 times; if it still fails, raise the error
def _parse_url(url):                # the leading underscore marks this as an internal helper
    print("*" * 100)
    response = requests.get(url, headers=headers, timeout=3)    # timeout=3: give up if no response within 3 seconds
    assert response.status_code == 200                          # assert the status code is 200, otherwise raise
    return response.content.decode()


def parse_url(url):
    try:
        html_str = _parse_url(url)
    except Exception as e:
        print(e)
        html_str = None
    return html_str


if __name__ == '__main__':
    # url = "www.baidu.com"         # this would fail: the URL scheme (http://) is missing
    url = "http://www.baidu.com"
    print(parse_url(url))

4.4 Detailed explanation of .post parameters

For the requests.post(url, data=None, json=None, **kwargs) function:

  • url: URL address of the request.
  • data: A dictionary, list of tuples, or sequence of bytes for sending form data to the server.
  • json: data in JSON format, used to send JSON data.
  • **kwargs: Optional keyword arguments, including:
    • headers: Dictionary type, HTTP request header information.
    • timeout: Set the timeout period.
    • proxies: dictionary type, proxy server address.

The relationship between the Content-Type of the HTTP request body and the way POST data is submitted is as follows:

Content-Type                      | How the data is submitted
application/x-www-form-urlencoded | Form data
multipart/form-data               | Form file upload
application/json                  | Serialized JSON data
text/xml                          | XML data

When writing the crawler, be sure to find the actual HTTP request (in the developer tools) and construct the POST request in the same format, otherwise the server will not return the expected content.
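A rough sketch of the difference in practice, using httpbin.org, which simply echoes back what it receives:

import requests

# Sent as application/x-www-form-urlencoded (form data)
r1 = requests.post("https://httpbin.org/post", data={"q": "python"})
print(r1.json()["form"])

# Sent as application/json (serialized JSON in the request body)
r2 = requests.post("https://httpbin.org/post", json={"q": "python"})
print(r2.json()["json"])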

More detailed information and more parameters can be found in the official documentation linked above.

4.5 The difference between data and params

In Python's requests library, data and params are both parameters used to send data to the server, but their functions and usage scenarios are slightly different.

params is usually used to pass query string parameters when sending GET requests. The query string consists of key-value pairs that appear in the URL: they come after the question mark (?), and multiple key-value pairs are separated by & symbols. For example, suppose we want to send a GET request to https://example.com/search with two query string parameters, q and page; we can write it like this:

import requests

params = {'q': 'python', 'page': '2'}
response = requests.get('https://example.com/search', params=params)

After the request is sent, the requests library automatically encodes the parameters in params into the URL's query string; the full URL becomes https://example.com/search?q=python&page=2.

data is usually used to send request body parameters, such as form data, when sending POST requests. The values are passed to the data parameter as a dictionary, which the requests library then encodes into the appropriate format and sends to the server. For example, suppose we want to send a POST request to https://example.com/login containing two form fields, username and password; we can write it like this:

import requests

data = {'username': 'alice', 'password': 'secret'}
response = requests.post('https://example.com/login', data=data)

After sending the request, the requests library will automatically encode the parameters in data into application/x-www-form-urlencoded or multipart/form-data format, and send it to the server as the request body.

It should be noted that although params and data have slightly different functions and usage scenarios, both are optional parameters, and you can choose whether to use them according to the actual situation.

4.6 Get the response

r.text: obtains the content returned by the url as text.

r.json(): if the data we get back is in JSON format, we can call the json() method to obtain it directly as a dictionary; equivalently, r.text can be decoded with json.loads().
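A small sketch using httpbin.org, which returns a JSON body:

import json
import requests

r = requests.get("https://httpbin.org/get", params={"q": "python"})
data = r.json()               # parse the JSON body directly into a dict
same = json.loads(r.text)     # equivalent: decode the text body manually
print(data["args"], same["args"])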

4.7 Small tips

Knowledge point 1:
  Sometimes a URL repeats the same parameter name with different values. A Python dictionary does not allow duplicate keys, so we can represent the value of that key as a list:

import requests
params = {'key': 'value1', 'key2': ['value2', 'value3']}
resp = requests.get("http://httpbin.org/get", params=params)
print(resp.url)

Knowledge point 2:

Note: only one of json and data can be used at the same time.

5. Technical application of data visualization (framework, components, etc.) 【continuously updated】

Supplement: Difficulties in crawler development (advanced)

Difficulties of crawlers

The difficulty of crawlers mainly falls into two areas:

  1. Data acquisition
      Public resources on the network are prepared for human users. To avoid being harvested by crawlers, servers set up many Turing tests to prevent malicious crawling, that is, anti-crawling measures. In the process of developing crawlers, a large part of our work is dealing with these anti-crawling measures.
  2. Collection speed
       In the era of big data, a huge amount of data is required, often tens of millions or even hundreds of millions of records. If the collection speed cannot keep up and takes too long, it will not meet commercial requirements. Generally, we adopt concurrency and distribution to solve the speed problem; this is another focus of crawler development.

Note: the domain name determines which computer to reach, and the port number determines which application on that computer.

Regular expressions

What does this code mean: re.findall(b"From Inner Cluster \r\n\r\n(.*?)", first_data, re.S)?

Explanation:
1. The regular expression finds, in the byte string first_data, all substrings that start with "From Inner Cluster" followed by two carriage-return/line-feed pairs (\r\n\r\n) and then any characters, and returns them as a list.
2. The re.S flag passed as the third parameter of re.findall() means that the . in the regular expression pattern can match any character, including line breaks.
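A tiny runnable sketch of the same idea with made-up byte data; a greedy .* is used so the captured group is not empty (a trailing non-greedy .*? on its own would match the empty string):

import re

# Made-up byte response; the body after the blank line is invented for the example
first_data = b"From Inner Cluster \r\n\r\n{\"status\": \"ok\"}\r\nnext line"

# re.S lets "." also match newline characters
body = re.findall(b"From Inner Cluster \r\n\r\n(.*)", first_data, re.S)
print(body)    # [b'{"status": "ok"}\r\nnext line']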


Origin blog.csdn.net/qq_54015136/article/details/129728597