python - all-round crawler

Table of contents

1. urllib

1. Introduction to request

1.1 urlopen method

1.2 data parameter

1.3 timeout parameter

1.4 Request method

1.5 Advanced usage (Cookies, Proxies)

Authentication:

Proxy:

Cookie:

2. urllib——error module

2.1 URLError

2.2 HTTPError

3. urllib——parse module

3.1 urlparse method

3.2 urlunparse method

3.3 urlsplit method

3.4 urlunsplit method

3.5 urlencode method

3.6 parse_qs method

3.7 parse_qsl method

3.8 quote method

3.9 unquote method

3.10 urljoin method

4. urllib——robotparser module

4.1 Robots Protocol

4.2 Crawler names

4.3 robotparser

2. requests

1 Introduction to requests

2 GET requests

3 Capture web information

4 Capture binary data

5 Add request header

6 POST request

7 responses

8 Upload files

9 Cookie settings

10 Session maintenance

11 SSL certificate authentication-verify parameter

12 timeout settings

13 Authentication

14 Proxy settings - proxies

15 Prepared Request

3. Regular expressions

1 Matching rules for commonly used regular expressions

Regular expression - keep only Chinese/Hanzi characters (filter out non-Chinese characters)

2 re library

2.1 match

2.2 Matching targets

2.3 Universal matching

2.4 Greedy and non-greedy

2.5 Modifiers

2.6 Escape matching

2.7 search matching

2.8 findall method

2.9 sub method

2.10 compile method

4. Use of httpx

1 Basic usage of httpx

2 Client object

3 HTTP/2.0

4 Support asynchronous requests

5. Detailed explanation of the usage and parameters of logging.basicConfig in the Python logging module

1.1. Introduction to logging module

2 logging.basicConfig(**kwargs)

1.4 Use file (filename) to save log files

1.5 Set the time format in the log

6. python——json

1 Introduction to JSON

1.1 json introduction

1.2 json features

1.3 Processing of json files

1.4 json syntax rules

1.5 json

1.6 Writing json files

1.7 Method of reading json files (json.load)

2 json common functions

2.1 loads and dumps

7. python - os library

1 os library introduction

2 os use

3 Common methods of os library

4 Operation directory

5 Operation path

8. python - sys library

1 sys overview

2 sys use

2.1 sys view

2.2 Commonly used attributes of sys

2.3 Common methods of sys

9. Examples of crawling static web pages

1 request

2 yield usage

3 test code

4 Wallpaper crawling example (self-written example; brute-force regex matching, use break when testing the main function)

10. XPath library

1 XPath common rules

2 XPath use cases

3 matches all nodes

4 child nodes

5 parent node

6 attribute matching

7 Text acquisition

8 Attribute acquisition

9 Attribute multi-value matching

10 Multi-attribute matching

11 Select in order

12 node axis selection

11. Beautiful Soup

1 Introduction to Beautiful Soup

2 Parsers

3 Basic uses of Beautiful Soup

4 node selector

5 Extract information

5.1 Get name

5.2 Get attributes

5.3 Get content

5.4 Nested selection

5.5 Association selection

5.6 Parent nodes and ancestor nodes

5.7 Sibling nodes

5.8 Extract information

6 method selector

6.1 find_all method

7 CSS selectors

7.1 Nested selection

7.2 Get attributes

7.3 Get text

12. pyquery

1 initialization

1.1 URL initialization

1.2 File initialization

2 Basic CSS Selectors

3 Find nodes

4 Find the parent node

5 sibling nodes

6 Traverse nodes

7 Get information

7.1 Get attributes

7.2 Get text

8 node operations

8.1 add_class and remove_class methods

8.2 attr, text and html

8.3 remove method

9 Pseudo class selector

13. parsel library

1 Introduction to parsel

2 parsel initialization

3 matching nodes

4 Extract text

5 Extract attributes

6 Regular extraction

14. Summary


1. urllib

1. Introduction to request

The request module is the most basic HTTP request module. It can simulate sending a request: the process is the same as entering a URL in the browser and pressing Enter. As long as the URL and additional parameters are passed to the library's methods, the process of sending a request can be simulated.

1.1 urlopen method

The urllib.request module can simulate the browser's process of initiating a request, and it also provides functions such as handling authorization verification (authentication), redirection, and browser cookies.

The basic writing method is as follows:

import urllib.request

# Send a GET request to the Python official site and print the decoded HTML
response = urllib.request.urlopen("https://www.python.org/")
print(response.read().decode('utf-8'))

This is a GET request. Use the type() function to get the type of the response:

print(type(response))

Output: <class 'http.client.HTTPResponse'>

So the response is an object of type HTTPResponse.

Use the following methods to output the response status code and response header information:

print(response.status)               # get the response status code
print(response.getheaders())         # get all response headers
print(response.getheader("Server"))  # get the value of the Server response header

API usage of urlopen:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
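
For example, a minimal sketch of passing the optional timeout and context parameters; the target URL, the 5-second timeout and the default SSL context here are illustrative assumptions, not part of the original example:

import ssl
import urllib.request

# Build a default SSL context for certificate verification (illustrative)
context = ssl.create_default_context()

# timeout is given in seconds; an exception is raised if the request takes longer
response = urllib.request.urlopen("https://www.python.org/", timeout=5, context=context)
print(response.status)
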
1.2 data parameter

The data parameter is optional. When adding this parameter, you need to use the bytes() function to convert the parameter into byte-stream encoded content, that is, the bytes type. In addition, if the data parameter is passed, the request method is POST instead of GET.

Example:

import urllib.request
import urllib.parse

# Encode the form data with urlencode and convert it to bytes, as the data parameter requires
data = bytes(urllib.parse.urlencode({'name': 'germey'}), encoding='utf-8')
# Passing data makes urlopen send a POST request
response = urllib.request.urlopen('https://www.httpbin.org/post', data=data)
print(response.read().decode('utf-8'))

The output is as follows:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name": "germey"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "11", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "www.httpbin.org", 
    "User-Agent": "Python-urllib/3.11", 
    "X-Amzn-Trace-Id": "Root=1-64997bdd-011711375cc64ba54dba4056"
  }, 
  "json": null, 
  "origin": "1.202.187.118", 
  "ur
