Spider data mining 8: scrapy framework (4)

The Request class is the base class for requests in the scrapy module. The module also provides the FormRequest class, which inherits from and extends Request.

One, Request

scrapy.http.Request:

The scrapy.http.Request class is the base class for requests in the scrapy framework. Its parameters are as follows:
url (string) – the URL of this request
callback (callable) – the callback function that will be called with this request's response
method (string) – the HTTP method of this request. The default is 'GET'.
meta (dict) – the initial value of the Request.meta attribute.
body (str or unicode) – the request body. If not given, it defaults to an empty string.
headers (dict) – the request headers of this request.
cookies – the request cookies.
encoding (string) – the encoding of this request (the default is 'utf-8'). This encoding is used to percent-encode the URL and to convert the body to str (if it is given as unicode).
priority (int) – the priority of this request (the default is 0). Unlike pipeline order values, requests are scheduled from high priority to low.
dont_filter (boolean) – indicates that the scheduler should not filter (deduplicate) this request. Requests are filtered by default.
errback (callable) – a callback function that will be called if any exception is raised while processing the request.
flags (list) – flags sent to the request; they can be used for logging or similar purposes.

a = Request(url, ...)

cb_kwargs (dict) – a dict with arbitrary data that will be passed to the callback as keyword arguments

Properties and methods:

url – a string containing the URL of this request. This attribute is read-only; use replace() to change the URL.
method – a string representing the HTTP method of the request.
headers – a dictionary-like object that contains the request headers.
body – a str containing the request body. This attribute is read-only; use replace() to change the body.
meta – a dictionary containing arbitrary metadata for this request.
copy() – returns a new Request that is a copy of this request.
replace([url, method, headers, body, cookies, meta, encoding, dont_filter, callback, errback]) – returns an updated request with the given attributes changed.

print(a.url)

b = a.replace(...)  # replace() does not modify a; the generated b is the new, updated request
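A minimal sketch of how these pieces fit together (the URL, spider name, and callback names are placeholders, not from the original post):

import scrapy
from scrapy import Request

class ExampleSpider(scrapy.Spider):
    name = 'example'  # hypothetical spider name

    def start_requests(self):
        # build a request with an explicit callback, meta and errback
        a = Request(
            url='https://example.com/page',  # placeholder URL
            callback=self.parse_page,
            meta={'page': 1},                # initial Request.meta
            priority=10,                     # scheduled before priority-0 requests
            errback=self.on_error,
        )
        print(a.url, a.method)               # read-only attributes
        # replace() returns a new, updated request; a itself is unchanged
        b = a.replace(url='https://example.com/page2')
        yield b

    def parse_page(self, response):
        self.logger.info('got %s', response.url)

    def on_error(self, failure):
        self.logger.error(repr(failure))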

FormRequest:

FormRequest makes it more convenient to send POST requests; it wraps the POST workflow mainly to make form submission easier to use.

GET and POST are the most common request types. The scrapy framework has a built-in FormRequest class, which extends the base class Request and adds support for processing HTML forms.

It uses lxml.html forms to pre-populate the form fields with form data from the Response object.

For the specific API, see the source code.
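For example, a hedged sketch of both ways to use it (the URL, spider name, field names, and credentials are all placeholders):

import scrapy
from scrapy import FormRequest

class LoginSpider(scrapy.Spider):
    name = 'login_demo'  # hypothetical spider name
    start_urls = ['https://example.com/login']  # placeholder URL

    def parse(self, response):
        # from_response() pre-populates the fields of the form found in the
        # page, so only the fields to override need to be supplied
        yield FormRequest.from_response(
            response,
            formdata={'user': 'foo', 'password': 'bar'},  # placeholder fields
            callback=self.after_login,
        )
        # alternatively, build a plain POST directly from a formdata dict:
        # yield FormRequest(url='https://example.com/session',
        #                   formdata={'user': 'foo', 'password': 'bar'},
        #                   callback=self.after_login)

    def after_login(self, response):
        self.logger.info('status: %s', response.status)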

Two, Response

Import: from scrapy.http.response import Response

Parameters: when constructing a Response, most of the following parameters are optional; usually only request is given, since a Response mainly carries the content returned for its corresponding Request.

url (string) – the URL of this response
status (integer) – the HTTP status of the response. The default is 200.
headers (dict) – the response headers of this response. The dict values can be strings (for single-valued headers) or lists (for multi-valued headers).
body (bytes) – the response body. To access the decoded text as str (unicode in Python 2), use response.text on an encoding-aware Response subclass, such as TextResponse.

If it is binary data such as an image, it must be read with response.body. For text data, the TextResponse subclass (scrapy.http.response.text, where the decoding is implemented) exposes the decoded body, so response.text can be used. Generally, response.text is used to print out the response content.

flags (list) – a list containing the initial value of the Response.flags attribute. If given, the list will be shallow-copied.
request (Request object) – the initial value of the Response.request attribute. This is the Request that generated this response.

Attributes and methods:
url
status
headers
body
request
meta – if the response was constructed with a request, response.meta returns the meta of that request
flags
copy()
replace([url, status, headers, body, request, flags, cls])
urljoin(url) – merges a (possibly relative) url with the response's url
follow(url) – returns a Request to follow the given (possibly relative) url
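A short sketch of these attributes and methods inside a parse callback (the spider name, URL, and relative path are placeholders):

import scrapy

class ResponseDemoSpider(scrapy.Spider):
    name = 'response_demo'  # hypothetical spider name
    start_urls = ['https://example.com']

    def parse(self, response):
        print(response.status)              # HTTP status, e.g. 200
        print(response.headers)             # dictionary-like response headers
        print(response.meta)                # the meta of the generating request
        raw = response.body                 # raw bytes, e.g. for image data
        text = response.text                # decoded str (TextResponse subclasses)
        full = response.urljoin('page/2')   # merge a relative url with response.url
        # follow() accepts relative urls directly and returns a Request
        yield response.follow('page/2', callback=self.parse)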

Three, log usage

In settings.py, enter the following to configure the log:

LOG_FORMAT='%(asctime)s [%(name)s] %(levelname)s:%(message)s'
LOG_DATEFORMAT='%Y'

Generally, only the following two parameters are set in the project:
LOG_FILE ='logfile_name'
LOG_LEVEL ='INFO'

asctime is the human-readable timestamp and name is the component name; setting LOG_DATEFORMAT to '%Y' means only the year is shown in the date.

In addition to the two parameters above, the following settings can also be used to configure logging:

LOG_FILE – the log output file. If it is None (the default), logs are printed to the console.

When printing in the PyCharm window, the content appears on the console precisely because LOG_FILE is off.

Just set LOG_FILE = 'file name' to save the log to that file; the file is created automatically if it does not exist.

LOG_ENABLED – whether to enable logging; the default is True.
LOG_ENCODING – the log encoding; the default is utf-8.
LOG_LEVEL – the minimum log level; the default is DEBUG (debugging, the lowest level).

Log levels from low to high: DEBUG (debugging information) < INFO (general information) < WARNING < ERROR (general error) < CRITICAL (critical error)

LOG_FORMAT – the log format
LOG_DATEFORMAT – the log date format
LOG_STDOUT – whether to redirect standard output to the log; the default is False. If True, all standard output is written into the log, so even content printed with print() ends up in the log, which is useful for recording exactly what you want to express.
LOG_SHORT_NAMES – short log names; the default is False. If True, the component name is not output.
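Putting the settings together, one possible settings.py logging block (the values are illustrative, not from the original):

# settings.py
LOG_ENABLED = True
LOG_FILE = 'scrapy.log'       # None (the default) prints to the console instead
LOG_ENCODING = 'utf-8'
LOG_LEVEL = 'INFO'            # DEBUG < INFO < WARNING < ERROR < CRITICAL
LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
LOG_DATEFORMAT = '%Y-%m-%d %H:%M:%S'
LOG_STDOUT = False            # True would also redirect print() output into the log
LOG_SHORT_NAMES = False       # True would drop the component names from records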

Generally logs go to the console, but they are not kept for long and reading them back is troublesome, so it is better to record them from within the following components:

Recording in a spider:

Scrapy provides a logger instance, self.logger, that can be accessed and used in every Spider instance.
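For example (a minimal sketch; the spider name and URL are placeholders):

import scrapy

class LogDemoSpider(scrapy.Spider):
    name = 'log_demo'  # hypothetical spider name
    start_urls = ['https://example.com']

    def parse(self, response):
        # self.logger is a logger named after the spider
        self.logger.info('Parse function called on %s', response.url)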

Recording in other components:

Of course, logging can also be done through Python's logging module.
For example: logging.warning('This is a warning!')

Adding self.logger.warning('...') to an instance method of the spider file writes a marker into the log; the same kind of call can also be used in the instance methods of other components.

But for later maintenance, it is better to create different loggers to encapsulate the messages, named after the component or function, as in the pattern example below:
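A hedged sketch of that pattern, using a module-level logger inside a pipeline (the pipeline name and message are hypothetical):

import logging

# a logger named after the module, so records show which component emitted them
logger = logging.getLogger(__name__)

class DemoPipeline:
    def process_item(self, item, spider):
        logger.info('processing an item from spider %s', spider.name)
        return item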

Four, GitHub login

Request process:

First request the login page, then POST the username, password, and the other form data to the session url to log in.

The specific content to submit can be seen in the Form Data of the page's request (for example, in the browser's developer tools):

commit: Sign in
authenticity_token: JU/SqAIMuKu3rPzOQIpRTceVO31XU6Nw68DNPRqdvfoUTDVt1FBpR8lbsmqGLxhvtj4IloWy8EMvZshupa+9yw==
login: [email protected]
password: qweqqweqw
trusted_device: 
webauthn-support: supported
webauthn-iuvpaa-support: supported
return_to: 
allow_signup: 
client_id: 
integration: 
required_field_0a14: 
timestamp: 1611654568362
timestamp_secret: 376cea0b768fde0e5897b8995b2ab6ed62eabc56bd4b5f8ef01cb6a9a2e969a7

Submit a wrong password twice, capture the content above both times, and compare the two. Fields whose values stay the same can simply be reused; fields whose values differ must be constructed. If a key name differs but carries no data, it can be left alone. timestamp (timestamp: 1611654568362) and timestamp_secret need to be constructed.

How to obtain the parameters that need to be constructed:

1. Obtain them from the previously requested page

Visit the login page, view its source code, and extract the parameters that the session request requires.

2. Dynamically generated by js code

Not considered at this stage.

Once the parameters are constructed successfully, it is more convenient to use the FormRequest class to make the request.
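A possible sketch of the login flow described above (the credentials are the placeholders shown earlier; it assumes from_response() picks up authenticity_token, timestamp, and timestamp_secret from the hidden fields of the login form, which should be verified against the actual page):

import scrapy
from scrapy import FormRequest

class GithubLoginSpider(scrapy.Spider):
    name = 'github_login'  # hypothetical spider name
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # from_response() pre-populates the hidden fields found in the login
        # form (authenticity_token, timestamp, timestamp_secret, ...), so only
        # the credential fields need to be filled in
        yield FormRequest.from_response(
            response,
            formdata={
                'login': '[email protected]',
                'password': 'qweqqweqw',
            },
            callback=self.after_login,
        )

    def after_login(self, response):
        self.logger.info('login response status: %s', response.status)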

Side note: marking a directory as the source root (Mark Directory as > Sources Root) sets that directory as the base for the relative paths of the files under it.

Origin: blog.csdn.net/qwe863226687/article/details/114117142