Sesame HTTP: Using Cookies in Python Crawlers

Why use cookies?

A cookie is data (usually encrypted) that a website stores on the user's local machine in order to identify the user and track the session.

For example, some websites require you to log in before a certain page can be accessed; before logging in, you are not allowed to crawl its content. We can use the urllib2 library to save the cookies from our login session and then send them along when crawling other pages, which achieves the goal.

Before that, we must first introduce the concept of an opener.

1. Opener

When you fetch a URL, you use an opener (an instance of urllib2.OpenerDirector). So far we have been using the default opener via urlopen. It is a special opener, which can be understood as a special instance of opener, and it accepts only the parameters url, data, and timeout.

If we need to handle cookies, this default opener is not enough, so we need to build a more general opener that supports setting cookies.
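
To make the relationship concrete, here is a minimal sketch. The extra HTTPHandler with a debug flag is purely illustrative and not part of the original article; with no extra handlers, build_opener produces an opener that behaves like the one behind urlopen.

import urllib2

# build_opener assembles an OpenerDirector from handler objects;
# the debug-enabled HTTPHandler here just prints the HTTP traffic
opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))
# open() takes the same parameters as urlopen: url, data, timeout
response = opener.open('http://www.baidu.com', None, 10)
print response.getcode()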

2. cookielib

The main function of the cookielib module is to provide objects that can store cookies, so that it can be used together with urllib2 to access Internet resources. The module is very powerful: with an object of its CookieJar class we can capture cookies and resend them on subsequent requests, which is what makes simulated login possible. The main classes of this module are CookieJar, FileCookieJar, MozillaCookieJar, and LWPCookieJar.

Their relationship: CookieJar --derives--> FileCookieJar --derives--> MozillaCookieJar and LWPCookieJar
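
The two leaf classes differ mainly in the on-disk file format they read and write: MozillaCookieJar uses the Netscape cookies.txt format, while LWPCookieJar uses the libwww-perl Set-Cookie3 format. A small sketch under that assumption (the filenames are illustrative):

import cookielib
import urllib2

# Save the same cookies in the two file formats (filenames are examples)
for jar in (cookielib.MozillaCookieJar('cookies_mozilla.txt'),
            cookielib.LWPCookieJar('cookies_lwp.txt')):
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
    opener.open('http://www.baidu.com')
    jar.save(ignore_discard=True, ignore_expires=True)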

1) Get the cookie and save it to a variable

First, let's use a CookieJar object to capture cookies and store them in a variable. Let's try it out:

import urllib2
import cookielib

# Declare a CookieJar instance to hold the cookies
cookie = cookielib.CookieJar()
# Use urllib2's HTTPCookieProcessor to create a cookie handler
handler = urllib2.HTTPCookieProcessor(cookie)
# Build an opener from the handler
opener = urllib2.build_opener(handler)
# The open method works like urllib2.urlopen; a Request can also be passed in
response = opener.open('http://www.baidu.com')
for item in cookie:
    print 'Name = ' + item.name
    print 'Value = ' + item.value

We use the method above to save the cookies into a variable and then print their values. The output looks like this:

Name = BAIDUID
Value = B07B663B645729F11F659C02AAE65B4C:FG=1
Name = BAIDUPSID
Value = B07B663B645729F11F659C02AAE65B4C
Name = H_PS_PSSID
Value = 12527_11076_1438_10633
Name = BDSVRTM
Value = 0
Name = BD_HOME
Value = 0


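If you would rather keep calling urllib2.urlopen elsewhere in the script instead of carrying the opener around, urllib2 also provides install_opener, which makes a custom opener the global default. A minimal sketch:

import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
# Install the opener globally so that plain urlopen() also carries the cookies
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')
print len(cookie)   # number of cookies captured by the jar
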
2) Save cookies to file

In the method above we saved the cookies into a variable. What if we want to save them to a file? That is where the FileCookieJar object comes in; here we use its subclass MozillaCookieJar to write the cookies out.

import cookielib
import urllib2

# Set the file used to save the cookies: cookie.txt in the current directory
filename = 'cookie.txt'
# Declare a MozillaCookieJar instance to hold the cookies and later write them to the file
cookie = cookielib.MozillaCookieJar(filename)
# Use urllib2's HTTPCookieProcessor to create a cookie handler
handler = urllib2.HTTPCookieProcessor(cookie)
# Build an opener from the handler
opener = urllib2.build_opener(handler)
# Make a request; the principle is the same as urllib2.urlopen
response = opener.open("http://www.baidu.com")
# Save the cookies to the file
cookie.save(ignore_discard=True, ignore_expires=True)

The two parameters of the final save call deserve some explanation.

The official explanation is as follows:

ignore_discard: save even cookies set to be discarded.

ignore_expires: save even cookies that have expired. The file is overwritten if it already exists.

In other words, ignore_discard means the cookies are saved even if they are marked to be discarded at the end of the session, and ignore_expires means the cookies are saved even if they have already expired; if the target file already exists, it is overwritten. Here we set both of them to True. After running, the cookies are saved to the cookie.txt file.
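
For reference, MozillaCookieJar writes the Netscape cookies.txt format: a short comment header followed by one tab-separated line per cookie (domain, include-subdomains flag, path, secure flag, expiry timestamp, name, value). The cookie line below is illustrative, with values borrowed from the earlier output rather than real saved data:

# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file!  Do not edit.

.baidu.com	TRUE	/	FALSE	3585255081	BAIDUID	B07B663B645729F11F659C02AAE65B4C:FG=1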

3) Read the cookies from the file and use them for a request

Now that the cookies are saved to a file, if we want to use them later we can read them back with the following method and visit the website:

import cookielib
import urllib2

# Create a MozillaCookieJar instance
cookie = cookielib.MozillaCookieJar()
# Read the cookie contents from the file into the jar
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
# Create the request
req = urllib2.Request("http://www.baidu.com")
# Use urllib2's build_opener method to create an opener
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open(req)
print response.read()

Imagine that our cookie.txt file holds the cookies of someone who is logged in to Baidu. By loading the contents of this cookie file, we can use the method above to simulate that person's account and access Baidu as if logged in.
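
Before relying on a saved cookie file like this, it can be worth checking whether its cookies are still valid. A small sketch using the is_expired() method of the Cookie objects in the jar (the filename matches the example above):

import time
import cookielib

cookie = cookielib.MozillaCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
now = time.time()
for c in cookie:
    # is_expired() reports whether the cookie's expiry time has passed
    print c.name, ('expired' if c.is_expired(now) else 'still valid')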


4) Use cookies to simulate a website login

Below we take a university's educational administration system as an example and use cookies to simulate a login, saving the cookie information to a text file. Let's see what cookies can do!

import urllib
import urllib2
import cookielib

filename = 'cookie.txt'
# Declare a MozillaCookieJar instance to hold the cookies and later write them to the file
cookie = cookielib.MozillaCookieJar(filename)
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
postdata = urllib.urlencode({
            'stuid': '201200131012',
            'pwd': '23342321'
        })
# URL for logging in to the educational administration system
loginUrl = 'http://jwxt.sdu.edu.cn:7890/pls/wwwbks/bks_login2.login'
# Simulate the login and capture the resulting cookies into the jar
result = opener.open(loginUrl, postdata)
# Save the cookies to cookie.txt
cookie.save(ignore_discard=True, ignore_expires=True)
# Use the cookies to request another URL: the grade query page
gradeUrl = 'http://jwxt.sdu.edu.cn:7890/pls/wwwbks/bkscjcx.curscopre'
# Request the grade query page
result = opener.open(gradeUrl)
print result.read()

The principle of the program above is as follows:

Create an opener with cookie handling. When visiting the login URL, save the cookies returned after logging in, and then use these cookies to access other URLs, such as the grade query page.
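
If you are on Python 3, the same technique still applies: cookielib was renamed http.cookiejar, and urllib2's functionality moved into urllib.request. A minimal sketch of the same login flow under those module names (note that POST data must be bytes in Python 3):

import urllib.parse
import urllib.request
import http.cookiejar

filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
postdata = urllib.parse.urlencode({
    'stuid': '201200131012',
    'pwd': '23342321'
}).encode('utf-8')   # Python 3 requires bytes for POST data
loginUrl = 'http://jwxt.sdu.edu.cn:7890/pls/wwwbks/bks_login2.login'
result = opener.open(loginUrl, postdata)
cookie.save(ignore_discard=True, ignore_expires=True)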
