[Notes] Python crawler: setting cookies to simulate user access (2)

These are study notes on the textbook "Python Network Data Collection" (the Chinese edition of "Web Scraping with Python"); most of the code comes from that book.

  With the request headers handled, the next topic is cookies: a cookie's value is one way a site distinguishes a real user from a machine. So we need to deal with cookies as well, and for that we need the requests module. Without further ado, let's start.

  Modifying cookies in Python

  1. The general case

First, get the cookie:

import requests

# A dictionary holding the username and password, built much like the request headers
params = {'username': 'Ryan', 'password': 'password'}
r = requests.post("http://pythonscraping.com/pages/cookies/welcome.php", params)
print("Cookie is set to:")
print(r.cookies.get_dict())  # get the cookies and print them
print("-----------")
print("Going to profile page...")
r = requests.get("http://pythonscraping.com/pages/cookies/profile.php", cookies=r.cookies)  # send the cookies back
print(r.text)
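To make the round trip above more concrete, here is a minimal offline sketch of what happens to the cookies between the two requests. It uses only the standard library's http.cookies module rather than requests, and the Set-Cookie values are hypothetical examples, not the real response from pythonscraping.com.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header values, as a server might send after login.
set_cookie_headers = [
    "loggedin=1; Path=/",
    "username=Ryan; Path=/",
]

# Step 1: collect the name/value pairs, roughly what r.cookies.get_dict() reports.
jar = {}
for header in set_cookie_headers:
    parsed = SimpleCookie()
    parsed.load(header)
    for name, morsel in parsed.items():
        jar[name] = morsel.value

# Step 2: build the Cookie header that goes out with the next request,
# roughly what passing cookies=... asks requests to do.
cookie_header = "; ".join(f"{name}={value}" for name, value in jar.items())

print(jar)
print(cookie_header)
```

Attributes such as Path are used by the client to decide where a cookie may be sent; only the name and value travel back to the server.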

  2. When cookies change

If the site you are dealing with is more complex and often quietly adjusts its cookies, or if you would rather not manage cookies by hand at all, the Session object of the requests library solves both problems:

import requests

session = requests.Session()
params = {'username': 'username', 'password': 'password'}
s = session.post("http://pythonscraping.com/pages/cookies/welcome.php", params)
print("Cookie is set to:")
print(s.cookies.get_dict())
print("-----------")
print("Going to profile page...")
s = session.get("http://pythonscraping.com/pages/cookies/profile.php")
print(s.text)

The code above never passes a cookie explicitly; that is the convenience of a session. The Session object (obtained by calling requests.Session()) keeps track of session information such as cookies, headers, and even information about the HTTP protocol in use, such as HTTPAdapter, which gives the session a uniform interface for HTTP and HTTPS connections.
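The idea of a session carrying cookies from one request to the next can be tried without any external site. The sketch below is not requests.Session itself: it swaps in the standard library's urllib and http.cookiejar as a stand-in, and talks to a tiny local server. The DemoHandler class, the /welcome and /profile paths, and the loggedin cookie are all hypothetical names made up for this illustration.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical two-page site: /welcome sets a cookie, other paths echo it."""

    def do_GET(self):
        if self.path == "/welcome":
            body = b"cookie set"
            self.send_response(200)
            self.send_header("Set-Cookie", "loggedin=1; Path=/")
        else:
            # Echo back whatever Cookie header the client sent, if any.
            body = (self.headers.get("Cookie") or "no cookie").encode()
            self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port

    # The jar and the opener together play the role of a session: cookies
    # from any response are stored and attached to later requests.
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({}),  # ignore any proxy environment settings
        urllib.request.HTTPCookieProcessor(jar),
    )

    opener.open(base + "/welcome").read()
    stored = {c.name: c.value for c in jar}
    profile = opener.open(base + "/profile").read().decode()

    server.shutdown()
    server.server_close()
    return stored, profile


stored_cookies, profile_body = run_demo()
print(stored_cookies)
print(profile_body)
```

The second request never mentions the cookie, yet the server sees it, which is the same behavior the session.get() call relies on in the code above.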

  3. Cookies generated by scripts in response to user actions

Because the requests module cannot execute JavaScript, it cannot handle the cookies produced by many modern tracking tools such as Google Analytics, which set a cookie only after a client-side script has run (or in response to page events while the user browses, such as clicking a button). To handle these, you need Selenium together with PhantomJS (PhantomJS is no longer actively maintained; Firefox or Google Chrome can be used instead).

(1) Getting cookies

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.bilibili.com/")
driver.implicitly_wait(1)
print(driver.get_cookies())

(2) Use the delete_cookie(), add_cookie() and delete_all_cookies() methods to manipulate cookies

In addition, cookies can be saved for other web crawlers to use later. The following example shows how to combine these functions:

from selenium import webdriver

# First browser: load the site, then print and save its cookies
driver = webdriver.Firefox()
driver.get("http://pythonscraping.com")
driver.implicitly_wait(1)
print(driver.get_cookies())
savedCookies = driver.get_cookies()

# Second browser: load the same site first so Selenium knows the cookie domain
driver2 = webdriver.Firefox()
driver2.get("http://pythonscraping.com")
driver2.delete_all_cookies()
for cookie in savedCookies:
    driver2.add_cookie(cookie)

# Reload the page with the copied cookies in place
driver2.get("http://pythonscraping.com")
driver2.implicitly_wait(1)
print(driver2.get_cookies())

In this example, the first webdriver loads a website, prints its cookies, and saves them in the variable savedCookies. The second webdriver loads the same website (technical note: the site must be loaded first so that Selenium knows which site the cookies belong to, even though that page load serves no other purpose), deletes all of its cookies, and replaces them with the cookies obtained by the first webdriver. When the page is loaded again, the timestamps, codes and other information in the two sets of cookies should be identical. From Google Analytics' point of view, the second webdriver is now exactly the same as the first.
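Saving cookies for a later crawler run, as mentioned above, can be sketched with a JSON file. The helper names save_cookies, load_cookies and to_requests_cookies are hypothetical, and the sample list only imitates the shape of what driver.get_cookies() returns (a list of dicts with keys such as name, value, domain and path); the values are made up.

```python
import json
import os
import tempfile


def save_cookies(cookies, path):
    """Write a list of Selenium-style cookie dicts to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cookies, f)


def load_cookies(path):
    """Read the list of cookie dicts back from the JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def to_requests_cookies(cookies):
    """Flatten cookie dicts to the name -> value mapping requests accepts."""
    return {c["name"]: c["value"] for c in cookies}


# Sample data shaped like driver.get_cookies() output; the values are made up.
saved = [
    {"name": "loggedin", "value": "1", "domain": "pythonscraping.com", "path": "/"},
    {"name": "username", "value": "Ryan", "domain": "pythonscraping.com", "path": "/"},
]

path = os.path.join(tempfile.mkdtemp(), "cookies.json")
save_cookies(saved, path)
restored = load_cookies(path)
print(restored == saved)
print(to_requests_cookies(restored))
```

In a later run, the restored list could be passed to driver2.add_cookie() one cookie at a time after loading the site, or the flattened dictionary could be handed to requests through its cookies= parameter.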

Origin www.cnblogs.com/dfy-blog/p/11518630.html