Python crawler: simulating a login on Renren.com

Simulated login: crawl information that belongs to a specific, logged-in user.

Requirement 1: Perform a simulated login on Renren.

Clicking the login button sends a POST request.
The POST request carries the login information entered beforehand (username, password, verification code, ...).
Verification code: it changes on every request.
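To make the shape of that POST body concrete, here is a minimal sketch using only the standard library. The field names email, password, and icode are the ones used in the login code later in this post; the values are made-up placeholders, not real credentials.

```python
from urllib.parse import urlencode, parse_qs

# Placeholder values (assumptions for illustration only).
form = {
    "email": "user@example.com",
    "password": "secret-hash",
    "icode": "ab12",  # the text recognized from the verification code image
}

# A login form is normally sent as application/x-www-form-urlencoded.
body = urlencode(form)
print(body)

# The server decodes the body back into individual fields.
print(parse_qs(body))
```

Because the verification code changes on every request, the icode value has to be recognized fresh for each login attempt.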

Requirement 2: Crawl the current user's information (the user information shown on the personal homepage).

A feature of the HTTP/HTTPS protocol: it is stateless.

Why the expected page data is not returned:

When the second request, for the personal homepage, is sent, the server does not know that it comes from a client that has already logged in.

Cookie: lets the server record the state of the client.
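The effect of statelessness can be reproduced locally. The following is a minimal sketch, not Renren's actual behavior: a toy server (the handler, the token cookie, and the /login and /profile paths are all hypothetical) rejects a profile request that carries no cookie and accepts one that does. It uses only the standard library, with a cookie jar playing the role that a session object plays in requests.

```python
import threading
import urllib.error
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Toy server: POST /login sets a cookie, GET /profile requires it."""

    def do_POST(self):
        # Discard the request body, then hand out a (hypothetical) login cookie.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        self.send_response(200)
        self.send_header("Set-Cookie", "token=demo; Path=/")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        logged_in = "token=demo" in self.headers.get("Cookie", "")
        body = b"profile page" if logged_in else b"please log in"
        self.send_response(200 if logged_in else 403)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch(opener, url, data=None):
    """Return (status, body), also when the server answers with an error status."""
    try:
        with opener.open(url, data=data) as resp:
            return resp.status, resp.read().decode()
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode()


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_port}"
    no_proxy = urllib.request.ProxyHandler({})  # ignore system proxies locally
    try:
        # No cookie handling: every request looks like a brand-new visitor.
        plain = urllib.request.build_opener(no_proxy)
        without_cookie = fetch(plain, base + "/profile")

        # With a cookie jar: the cookie set at login is sent back automatically.
        with_jar = urllib.request.build_opener(
            no_proxy, urllib.request.HTTPCookieProcessor(CookieJar()))
        fetch(with_jar, base + "/login", data=b"user=demo")
        with_cookie = fetch(with_jar, base + "/profile")
    finally:
        server.shutdown()
        server.server_close()
    return without_cookie, with_cookie


print(run_demo())
```

The first request is refused because nothing links it to a login; the second succeeds only because the client stored the cookie and sent it back.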

Manual processing: obtain the cookie value with a packet capture tool and put it in headers. (not recommended)
Automatic processing:
- Where does the cookie value come from?
- The server creates it in response to the simulated login POST request.

session object:
Role:

- It can send requests.
- If a cookie is generated during a request, the cookie is automatically stored in the session object and carried on later requests.
- Create a session object: session = requests.Session()
- Use the session object to send the simulated login POST request (the cookie is stored in the session).
- Use the same session object to send the GET request for the personal homepage (the request carries the cookie).
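The session behavior listed above can be observed without touching the network. This is a small sketch under stated assumptions: the cookie name t and its value are placeholders, and the cookie is put into the jar by hand to stand in for what a Set-Cookie response header would do after a successful login. The request is only prepared, never sent.

```python
import requests

session = requests.Session()

# Stand-in for the cookie the server would set after the login POST.
# Name and value are hypothetical.
session.cookies.set("t", "demo-token", domain="www.renren.com", path="/")

# Build (but do not send) the follow-up GET to see what the session attaches.
request = requests.Request("GET", "http://www.renren.com/974713149/profile")
prepared = session.prepare_request(request)

print(prepared.headers.get("Cookie"))
```

A plain requests.get call has no such jar, which is why every request made that way starts without cookies unless they are supplied by hand.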

1. Send a GET request to http://www.renren.com/ to get the source code of the login page.


2. Locate the verification code image in the page, read the src attribute of its img tag, send a GET request to that URL, and save the image locally. Then use the Chaojiying (Super Eagle) captcha-recognition platform to recognize the saved image.
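As a rough illustration of the "read the src attribute" part, the sketch below pulls the src out of a sample page and resolves it against the page URL, in case the attribute is relative. It deliberately uses the standard library's html.parser instead of the lxml XPath used in the full code, so that it runs without extra packages. The id verifyPic_login is the one used in the XPath later in this post; the sample HTML and the /getcode.do path are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CaptchaSrcParser(HTMLParser):
    """Remember the src of the first <img> whose id matches target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.src = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("id") == self.target_id and self.src is None:
            self.src = attrs.get("src")


def find_captcha_url(page_url, html, target_id="verifyPic_login"):
    parser = CaptchaSrcParser(target_id)
    parser.feed(html)
    if parser.src is None:
        return None  # no captcha image on this page
    # urljoin leaves absolute URLs alone and resolves relative ones.
    return urljoin(page_url, parser.src)


# Hypothetical page fragment, for illustration only.
sample_html = '<html><body><img id="verifyPic_login" src="/getcode.do?t=web_login"></body></html>'
print(find_captcha_url("http://www.renren.com/", sample_html))
```

Returning None when the image is missing makes it easy to skip the recognition step on pages that show no verification code.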


3. Click the login button and capture the traffic in the browser. The browser sends a POST request to the server at http://www.renren.com/ajaxLogin/login?1=1&uniqueTimestamp=202112910495. Inspect the captured request and check whether the response headers contain Set-Cookie. If they do, the server created a session for the client when it accepted the login, and the cookie was created and returned to the client for storage.


Sure enough, Set-Cookie is present. Therefore, when we simulate the login with the requests module, the requests we send afterwards also need to carry the cookies. So how does requests carry cookies?

The requests module handles cookies in two ways:

Obtain the cookie manually from the packet capture tool, put it in the headers dictionary, and pass headers to the request method. (not recommended)

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
    'Cookie': 'xxxxxxxxx'
}
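A Cookie header copied from a packet capture is one long string of name=value pairs separated by semicolons. As a small sketch (the helper name cookie_string_to_dict is made up for this example), the string can be split into a dictionary, which is easier to inspect and can also be passed to requests through the cookies= argument instead of a raw header. The sample pairs are taken from the cookie string used later in this post.

```python
def cookie_string_to_dict(raw):
    """Split 'a=1; b=2' into {'a': '1', 'b': '2'}."""
    cookies = {}
    for pair in raw.split(";"):
        # partition splits on the first '=' only, so values may contain '='.
        name, sep, value = pair.strip().partition("=")
        if sep:
            cookies[name] = value
    return cookies


raw = "id=974713149; xnsid=364172ac; loginfrom=syshome"
print(cookie_string_to_dict(raw))
```

Either way the cookie is a snapshot: once the server expires it, the value has to be captured again, which is the main reason the manual approach is not recommended.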

Create a session object and send the requests through it, because the session stores and carries cookies automatically. (recommended)

# Create a session object; it can send requests with get and post
session = requests.Session()
page_text = session.get(url=url,headers=headers).text
......

4. The packet capture of the login shows that the next requested URL is http://www.renren.com/974713149, and the response we need is the home page shown after a successful login. So send a request to this URL, taking care to set the User-Agent, Referer, and Cookie request headers.


5. Send a GET request to http://www.renren.com/974713149/profile to get the source code of the personal homepage.


Code demo:

1. Obtain the cookie manually from the packet capture tool, put it in the headers dictionary, and pass headers to the request method. (not recommended)

# Workflow:
#     1. Recognize the verification code and get the text in the image
#     2. Send the GET requests
#     3. Persist the response data

import requests
from lxml import etree
from hashlib import md5


# Wrap captcha recognition in a function (based on the Chaojiying sample client)
def getCodeText(userName, password, appId, imgUrl):
    class Chaojiying_Client(object):

        def __init__(self, username, password, soft_id):
            self.username = username
            password = password.encode('utf8')

            # Send the MD5 hex digest of the password, not the plain text
            self.password = md5(password).hexdigest()
            self.soft_id = soft_id
            self.base_params = {
                'user': self.username,
                'pass2': self.password,
                'softid': self.soft_id,
            }
            self.headers = {
                'Connection': 'Keep-Alive',
                'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
            }

        def PostPic(self, im, codetype):
            """
            im: image bytes
            codetype: captcha type, see http://www.chaojiying.com/price.html
            """
            params = {
                'codetype': codetype,
            }
            params.update(self.base_params)
            files = {'userfile': ('ccc.jpg', im)}
            r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files,
                              headers=self.headers)
            return r.json()

        def ReportError(self, im_id):
            """
            im_id: ID of the image whose recognition result is reported as wrong
            """
            params = {
                'id': im_id,
            }
            params.update(self.base_params)
            r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
            return r.json()

    # The original sample wrapped the next lines in `if __name__ == '__main__':`,
    # which left `chaojiying` and `im` undefined when this file was imported.
    # appId: software ID, generated under User Center >> Software ID
    chaojiying = Chaojiying_Client(userName, password, appId)
    # imgUrl: path of the local image file (on Windows the path sometimes needs //)
    with open(imgUrl, 'rb') as fp:
        im = fp.read()
    # 1902 is the captcha type; see the price list on the official site
    return chaojiying.PostPic(im, 1902)



# 1. Capture and recognize the verification code image
headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
        'Referer': 'http://www.renren.com/SysHome.do',
        'Cookie': 'anonymid=klgdsqz5n7c6dn; depovince=ZGQT; _r01_=1; JSESSIONID=abcqWHDNhNOVf95ntfjFx; taihe_bi_sdk_uid=926da97ed7bdff5fc3ece47fdd554b0b; taihe_bi_sdk_session=ffa92a5a812142ba8dac302676d881cd; ick_login=426dff64-6952-4319-8c8f-96ea6f498550; first_login_flag=1; [email protected]; ln_hurl=http://hdn.xnimg.cn/photos/hdn421/205/2035/h_main_9aN0_0c1b00037b06195a.jpg; wp_fold=0; jebecookies=c2363801-e587-4f54-8566-24b86aa22659|||||; _de=B3D043F455F38852340E4CEC836F3769696BF75400CE19CC; p=2e69883207d99e253471f621d896037d9; t=1f917c44eaa1178b8bd357e96d7346fc9; societyguester=1f917c44eaa1178b8bd357e96d7346fc9; id=974713149; xnsid=364172ac; loginfrom=syshome'
}
url = 'http://www.renren.com/'
page_text = requests.get(url=url,headers=headers).text
tree = etree.HTML(page_text)
img_url = tree.xpath('//*[@id="verifyPic_login"]/@src')[0]
print(img_url)
img_data = requests.get(img_url,headers=headers).content
print(img_data)
with open('./code.jpg','wb') as fp:
    fp.write(img_data)

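# Recognize the captcha image with the sample client provided by Chaojiying
# Arguments: Chaojiying username, password, software ID, path of the saved captcha image
result = getCodeText('username', 'password', 'appid', './code.jpg')
print(result['pic_str'])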

# 2. Send the GET requests

login_url = 'http://www.renren.com/974713149'
login_page_text = requests.get(url=login_url, headers=headers).text
with open('renren.html','w',encoding='utf-8') as fp:
    fp.write(login_page_text)

# Crawl the page data of the current user's personal homepage
detail_url = 'http://www.renren.com/974713149/profile'
detail_page_text = requests.get(url=detail_url, headers=headers).text
with open('zep.html','w',encoding='utf-8') as fp:
    fp.write(detail_page_text)

The two responses are saved locally as renren.html and zep.html.

2. Create a session object and send the requests through it, because the session stores and carries cookies automatically. (recommended)

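# Workflow:
#     1. Recognize the verification code and get the text in the image
#     2. Send the login POST request and the GET requests
#     3. Persist the response data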

import requests
from lxml import etree
from hashlib import md5


# Wrap captcha recognition in a function (based on the Chaojiying sample client)
def getCodeText(userName, password, appId, imgUrl):
    class Chaojiying_Client(object):

        def __init__(self, username, password, soft_id):
            self.username = username
            password = password.encode('utf8')

            # Send the MD5 hex digest of the password, not the plain text
            self.password = md5(password).hexdigest()
            self.soft_id = soft_id
            self.base_params = {
                'user': self.username,
                'pass2': self.password,
                'softid': self.soft_id,
            }
            self.headers = {
                'Connection': 'Keep-Alive',
                'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
            }

        def PostPic(self, im, codetype):
            """
            im: image bytes
            codetype: captcha type, see http://www.chaojiying.com/price.html
            """
            params = {
                'codetype': codetype,
            }
            params.update(self.base_params)
            files = {'userfile': ('ccc.jpg', im)}
            r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files,
                              headers=self.headers)
            return r.json()

        def ReportError(self, im_id):
            """
            im_id: ID of the image whose recognition result is reported as wrong
            """
            params = {
                'id': im_id,
            }
            params.update(self.base_params)
            r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
            return r.json()

    # The original sample wrapped the next lines in `if __name__ == '__main__':`,
    # which left `chaojiying` and `im` undefined when this file was imported.
    # appId: software ID, generated under User Center >> Software ID
    chaojiying = Chaojiying_Client(userName, password, appId)
    # imgUrl: path of the local image file (on Windows the path sometimes needs //)
    with open(imgUrl, 'rb') as fp:
        im = fp.read()
    # 1902 is the captcha type; see the price list on the official site
    return chaojiying.PostPic(im, 1902)

# Create a session object; it can send requests with get and post
session = requests.Session()

# 1. Capture and recognize the verification code image
headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
        'Referer': 'http://www.renren.com/SysHome.do',
        # 'Cookie': 'anonymid=klgdsqz5n7c6dn; depovince=ZGQT; _r01_=1; JSESSIONID=abcqWHDNhNOVf95ntfjFx; taihe_bi_sdk_uid=926da97ed7bdff5fc3ece47fdd554b0b; taihe_bi_sdk_session=ffa92a5a812142ba8dac302676d881cd; ick_login=426dff64-6952-4319-8c8f-96ea6f498550; first_login_flag=1; [email protected]; ln_hurl=http://hdn.xnimg.cn/photos/hdn421/200705/235/h_main_9aN0_0c1b00b06195a.jpg; wp_fold=0; jebecookies=c2363801-e587-4f54-8566-24b86aa22659|||||; _de=B3D043F455F38852340E4CEC836F3769696BF75400CE19CC; p=2e69883207d99e253471f621d896037d9; t=1f917c44eaa1178b8bd357e96d7346fc9; societyguester=1f917c44eaa1b8bd357e96d7346fc9; id=974713149; xnsid=364172ac; loginfrom=syshome'
}
url = 'http://www.renren.com/'
page_text = session.get(url=url,headers=headers).text
tree = etree.HTML(page_text)
img_url = tree.xpath('//*[@id="verifyPic_login"]/@src')[0]
print(img_url)
img_data = session.get(img_url,headers=headers).content
print(img_data)
with open('./code.jpg','wb') as fp:
    fp.write(img_data)

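# Recognize the captcha image with the sample client provided by Chaojiying
# Arguments: Chaojiying username, password, software ID, path of the saved captcha image
result = getCodeText('username', 'password', 'appid', './code.jpg')
print(result['pic_str'])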


login_post_url = 'http://www.renren.com/ajaxLogin/login?1=1&uniqueTimestamp=202112910495'
data = {
    'email': '[email protected]',
    'icode': result['pic_str'],
    'origURL': 'http://www.renren.com/home',
    'domain': 'renren.com',
    'key_id': '1',
    'captcha_type': 'web_login',
    'password': '346d050fe82d3cfe090210864d73b65b5608bf90173371b3c10e7df6e533',
    'rkey': '3a7cdde0b042c1ba11169c3378fd5b',
    'f': 'http%3A%2F%2Fwww.renren.com%2F974713149%2Fnewsfeed%2Fphoto'
}
response = session.post(url=login_post_url, headers=headers,data=data)
print(response.text)


# 2. Send the GET requests

login_url = 'http://www.renren.com/974713149'
login_page_text = session.get(url=login_url, headers=headers).text
with open('renren.html','w',encoding='utf-8') as fp:
    fp.write(login_page_text)


# Crawl the page data of the current user's personal homepage
detail_url = 'http://www.renren.com/974713149/profile'
detail_page_text = session.get(url=detail_url, headers=headers).text
with open('zep.html','w',encoding='utf-8') as fp:
    fp.write(detail_page_text)

The personal homepage is saved locally as zep.html.
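The session-based flow can also be condensed into a small helper that takes the session as a parameter. This is only a sketch: the function name crawl_profile and its parameters are made up for illustration. Passing the session in makes the flow easy to try with a stand-in object before pointing it at the real site.

```python
def crawl_profile(session, login_url, login_data, profile_url, headers):
    """Log in with one session object, then fetch a page that requires login."""
    # Login POST: any Set-Cookie in the response is stored on the session.
    login_response = session.post(login_url, data=login_data, headers=headers)
    login_response.raise_for_status()

    # The same session sends the stored cookies back with this GET.
    profile_response = session.get(profile_url, headers=headers)
    profile_response.raise_for_status()
    return profile_response.text
```

With the variables from the script above, a call would look like crawl_profile(requests.Session(), login_post_url, data, detail_url, headers). Note that raise_for_status only catches HTTP error statuses; a rejected login that still answers with status 200 would need a check on the response body.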