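Python Web Scraping Basics, Day 03

May 17, 2019. It got late before I noticed, so here is a quick summary before bed.

Today covered two topics: captcha recognition and advanced use of the requests module.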

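Captcha recognition

How captchas relate to scraping:
    Captchas are an anti-scraping mechanism. To simulate a login, the scraper has to recognize the text in the captcha image.

Ways to recognize a captcha:
    - Manually, by eye (not recommended)
    - Automatically, through a third-party service (recommended)
        - Yundama: http://www.yundama.com/demo.html

Yundama workflow:
    - Register: a regular user account and a developer account
    - Log in:
        - As a regular user: check whether the account still has credits left
        - As a developer:
            - Create an application: My Software -> Add New Software -> enter the software name -> Submit (this yields the software ID and key)
            - Download the sample code: Developer Docs -> Click to download: Yundama API DLL -> PythonHTTP sample download

Hands-on: recognize the captcha on the gushiwen.org login page.

Coding steps when using a captcha-solving platform:
    - Download the captcha image to a local file
    - Call the platform's sample code to recognize the image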

# Author:K
import requests
from lxml import etree
from CodeClass import YDMHttp

def getCodeText(img_path, codeType):
    # Yundama user name
    username = 'KisInfinite'

    # Password
    password = 'KisInfinite'

    # Software ID, required for developer revenue sharing. Found under "My Software" in the developer console.
    appid = 7756

    # Software key, required for developer revenue sharing. Found under "My Software" in the developer console.
    appkey = 'bb37e0a2e219630c84e647ff44287869'

    # Image file
    filename = img_path

    # Captcha type, e.g. 1004 means 4 alphanumeric characters. Types are priced differently,
    # and a wrong type lowers the recognition rate. All types: http://www.yundama.com/price.html
    codetype = codeType

    # Timeout in seconds
    timeout = 20
    # Stays None if the parameters below were never filled in
    result = None
    # Sanity check
    if username == 'username':
        print('Please set the parameters before testing')
    else:
        # Initialize the client
        yundama = YDMHttp(username, password, appid, appkey)

        # Log in to Yundama
        uid = yundama.login()
        print('uid: %s' % uid)

        # Query the remaining balance
        balance = yundama.balance()
        print('balance: %s' % balance)

        # Recognize: image path, captcha type ID, timeout (seconds) -> captcha ID and result
        cid, result = yundama.decode(filename, codetype, timeout)
        print('cid: %s, result: %s' % (cid, result))
    # Return the recognized text so the caller can use it
    return result

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'
}

url = 'https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx'

# Fetch the login page and extract the captcha image address
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
img_src = tree.xpath('//*[@id="imgCode"]/@src')[0]
img_src = 'https://so.gushiwen.org' + img_src

# Download the captcha image to a local file
img_data = requests.get(url=img_src, headers=headers).content
with open('code.jpg', 'wb') as fp:
    fp.write(img_data)

# Send the image to the platform and print the recognized text
code_text = getCodeText('code.jpg', 1004)
print(code_text)

Example: recognizing the captcha on the gushiwen site
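
The example builds the image address by concatenating the site root with the src attribute, which only works when src starts with a slash. As a minimal sketch of a more forgiving alternative, the standard library's urljoin resolves the src against the page URL. The src values below are made up for illustration and are not taken from the real page.

```python
from urllib.parse import urljoin

# URL of the login page, as used in the example
login_page = 'https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx'

def resolve_img_src(page_url, src):
    """Resolve an <img> src attribute against the URL of the page it came from."""
    return urljoin(page_url, src)

# Hypothetical src values covering the three common cases
root_relative = resolve_img_src(login_page, '/RandCode.ashx')
page_relative = resolve_img_src(login_page, 'img/code.jpg')
absolute = resolve_img_src(login_page, 'https://cdn.example.com/code.jpg')

print(root_relative)
print(page_relative)
print(absolute)
```

With this helper the same line of code handles root-relative, page-relative and absolute addresses, so the scraper does not break if the site changes how it writes the src.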

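Advanced use of the requests module

Simulated login:
    - Needed to scrape data that is only visible to a logged-in user.

Task: simulate a login to renren.com.
    - Clicking the login button sends a POST request
    - The POST request carries the login data entered beforehand (user name, password, captcha, ...)
    - Captcha: changes on every request

Task: scrape the current user's information (as shown on the personal profile page)

A property of HTTP/HTTPS: the protocol is stateless.

Why the profile page data does not come back:
    When the second request (for the profile page) is sent, the server has no way of knowing that it comes from a logged-in client.

Cookies: let the server keep track of the client's state.
    - Manual handling: capture the cookie value with a packet-capture tool and put it into headers. (Not recommended)
    - Automatic handling:
        - Where does the cookie come from?
            - The server creates it in response to the login POST request.

        Session objects:
            - What they do:
                1. Send requests.
                2. If a cookie is produced during a request, it is stored in the session object automatically and sent with later requests.

            - Create a session object: session = requests.Session()
            - Send the login POST request through the session (the cookie is stored in the session)
            - Send the GET request for the profile page through the same session (the cookie is carried along)

Proxies: a way around anti-scraping measures that ban IP addresses.

What is a proxy:
    - A proxy server.

What a proxy does:
    - Gets around access limits tied to your own IP.
    - Hides your real IP.

Proxy listing sites:
    - Kuaidaili (快代理)
    - Xici Daili (西刺代理)
    - www.goubanjia.com

Proxy IP types:
    - http: used for URLs with the http scheme
    - https: used for URLs with the https scheme

Proxy anonymity levels:
    - Transparent: the server knows the request went through a proxy and also knows the real IP
    - Anonymous: the server knows a proxy was used but does not know the real IP
    - High anonymity: the server knows neither that a proxy was used nor the real IP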

# Author:K
import requests
from lxml import etree
from CodeClass import YDMHttp

def getCodeText(img_path, codeType):
    # Yundama user name
    username = 'KisInfinite'

    # Password
    password = 'KisInfinite'

    # Software ID, required for developer revenue sharing. Found under "My Software" in the developer console.
    appid = 7756

    # Software key, required for developer revenue sharing. Found under "My Software" in the developer console.
    appkey = 'bb37e0a2e219630c84e647ff44287869'

    # Image file
    filename = img_path

    # Captcha type, e.g. 1004 means 4 alphanumeric characters. Types are priced differently,
    # and a wrong type lowers the recognition rate. All types: http://www.yundama.com/price.html
    codetype = codeType

    # Timeout in seconds
    timeout = 20
    result = None
    # Sanity check
    if username == 'username':
        print('Please set the parameters before testing')
    else:
        # Initialize the client
        yundama = YDMHttp(username, password, appid, appkey)

        # Log in to Yundama
        uid = yundama.login()
        print('uid: %s' % uid)

        # Query the remaining balance
        balance = yundama.balance()
        print('balance: %s' % balance)

        # Recognize: image path, captcha type ID, timeout (seconds) -> captcha ID and result
        cid, result = yundama.decode(filename, codetype, timeout)
        print('cid: %s, result: %s' % (cid, result))
    return result

session = requests.Session()

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'
}

# Fetch the login page and the captcha through the session, so that any cookie
# issued together with the captcha is sent along with the login request
url = 'http://www.renren.com/'
page_text = session.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
img_src = tree.xpath('//*[@id="verifyPic_login"]/@src')[0]

img_data = session.get(url=img_src, headers=headers).content
with open('code.jpg', 'wb') as fp:
    fp.write(img_data)

result = getCodeText('code.jpg', 2004)

login_url = 'http://www.renren.com/ajaxLogin/login?1=1&uniqueTimestamp=201945230922'

# Form fields copied from the login request captured in the browser
data = {
    'email':'18394177504',
    'icode':result,
    'origURL':'http://www.renren.com/home',
    'domain':'renren.com',
    'key_id':'1',
    'captcha_type':'web_login',
    'password':'1fa94e046b3ecc8a8c9f734255788872c0db323ada85c50b58baa5608b6d36e9',
    'rkey':'f4a2f1e217358fa5267991c5ccf35e2a',
    'f':'https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3DN0gkqFVweEtA-vMIbp4Q-eISSN6PFpiSKfHVR6V0AkbF0zjjDhGGAk9Ih-4gENZ1%26wd%3D%26eqid%3Dceaf8cce0001cd08000000035cdda6b8',
}

# Use the session here because it stores the login cookie and carries it on later requests
response = session.post(url=login_url, data=data, headers=headers)
print(response.status_code)

detail_url = 'http://www.renren.com/970837645/profile'

# The same session sends the stored cookie, so the server treats this request as logged in
detail_page_text = session.get(url=detail_url, headers=headers).text

with open('detail_page.html', 'w', encoding='utf-8') as fp:
    fp.write(detail_page_text)

Example: simulated login to renren.com
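
The cookie handling that the session does can be seen in isolation, without touching the network. The sketch below is only an illustration: the cookie name, value and the example.com domain are made up, and the cookie is put into the jar by hand as a stand-in for what a login response would normally set. It compares a request prepared through the session with the same request prepared on its own.

```python
import requests

session = requests.Session()

# Stand-in for the Set-Cookie header a login response would deliver.
# The name, value and domain are hypothetical.
session.cookies.set('sessionid', 'abc123', domain='example.com', path='/')

profile_request = requests.Request('GET', 'http://example.com/profile')

# Prepared through the session: the stored cookie is attached automatically.
with_session = session.prepare_request(profile_request)

# Prepared on its own: no cookie jar is consulted, so there is no Cookie header.
without_session = profile_request.prepare()

print(with_session.headers.get('Cookie'))
print(without_session.headers.get('Cookie'))
```

The second request is what a plain requests.get amounts to after logging in: the server receives no cookie and cannot tell that the client has logged in.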

import json, time, requests


######################################################################

class YDMHttp:
    apiurl = 'http://api.yundama.com/api.php'
    username = ''
    password = ''
    appid = ''
    appkey = ''

    def __init__(self, username, password, appid, appkey):
        self.username = username
        self.password = password
        self.appid = str(appid)
        self.appkey = appkey

    def request(self, fields, files=None):
        # POST the form fields (and optional files) and decode the JSON reply
        response = self.post_url(self.apiurl, fields, files)
        response = json.loads(response)
        return response

    def balance(self):
        # Returns the remaining balance, or a negative error code
        data = {'method': 'balance', 'username': self.username, 'password': self.password, 'appid': self.appid,
                'appkey': self.appkey}
        response = self.request(data)
        if response:
            if response['ret'] and response['ret'] < 0:
                return response['ret']
            else:
                return response['balance']
        else:
            return -9001

    def login(self):
        # Returns the user ID, or a negative error code
        data = {'method': 'login', 'username': self.username, 'password': self.password, 'appid': self.appid,
                'appkey': self.appkey}
        response = self.request(data)
        if response:
            if response['ret'] and response['ret'] < 0:
                return response['ret']
            else:
                return response['uid']
        else:
            return -9001

    def upload(self, filename, codetype, timeout):
        # Uploads the image and returns the captcha ID, or a negative error code
        data = {'method': 'upload', 'username': self.username, 'password': self.password, 'appid': self.appid,
                'appkey': self.appkey, 'codetype': str(codetype), 'timeout': str(timeout)}
        file = {'file': filename}
        response = self.request(data, file)
        if response:
            if response['ret'] and response['ret'] < 0:
                return response['ret']
            else:
                return response['cid']
        else:
            return -9001

    def result(self, cid):
        # Returns the recognized text, or '' if it is not ready yet
        data = {'method': 'result', 'username': self.username, 'password': self.password, 'appid': self.appid,
                'appkey': self.appkey, 'cid': str(cid)}
        response = self.request(data)
        return response and response['text'] or ''

    def decode(self, filename, codetype, timeout):
        # Upload the image, then poll once per second until a result arrives or the timeout runs out
        cid = self.upload(filename, codetype, timeout)
        if cid > 0:
            for i in range(0, timeout):
                result = self.result(cid)
                if result != '':
                    return cid, result
                else:
                    time.sleep(1)
            return -3003, ''
        else:
            return cid, ''

    def report(self, cid):
        # Reports a wrong recognition result for the given captcha ID
        data = {'method': 'report', 'username': self.username, 'password': self.password, 'appid': self.appid,
                'appkey': self.appkey, 'cid': str(cid), 'flag': '0'}
        response = self.request(data)
        if response:
            return response['ret']
        else:
            return -9001

    def post_url(self, url, fields, files=None):
        # Open each file for upload and close it again once the request is done
        opened = {key: open(path, 'rb') for key, path in (files or {}).items()}
        try:
            res = requests.post(url, files=opened, data=fields)
        finally:
            for fp in opened.values():
                fp.close()
        return res.text

CodeClass
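
The decode method in the class relies on a simple polling loop: ask for the result, and if it is still empty, wait a second and ask again until the timeout is used up. A minimal sketch of that pattern on its own follows. The helper name poll_until, its parameters and the fake fetch function are all hypothetical and exist only to show the loop without calling the real service.

```python
import time

def poll_until(fetch, timeout, interval=1.0, sleep=time.sleep):
    """Call fetch() up to `timeout` times, sleeping `interval` seconds after each empty answer.

    Returns the first non-empty result, or '' if every attempt came back empty.
    """
    for _ in range(timeout):
        result = fetch()
        if result != '':
            return result
        sleep(interval)
    return ''

# Fake result source: empty twice, then the recognized text
answers = iter(['', '', 'ab12'])
calls = []

def fake_fetch():
    calls.append(1)
    return next(answers)

# Pass a no-op sleep so the demonstration does not actually wait
text = poll_until(fake_fetch, timeout=5, sleep=lambda seconds: None)
print(text, len(calls))

# A source that never answers runs out of attempts and yields ''
never = poll_until(lambda: '', timeout=3, sleep=lambda seconds: None)
print(repr(never))
```

Passing the sleep function in as a parameter is a design choice made for this sketch: it keeps the loop easy to try out without waiting for real seconds to pass.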

# Author:K
import requests

# Searching for "ip" on Baidu shows the IP address the request came from
url = 'https://www.baidu.com/s?wd=ip'

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'
}

# The key is the scheme of the target URL; the value is the address of the proxy server
page_text = requests.get(url=url, headers=headers, proxies={'https': 'http://120.234.63.196:3128'}).text

with open('ip.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)

Example: using a proxy IP
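
Since a proxy is applied according to the scheme of the target URL, the proxies argument can be built from the URL itself. The following is a rough sketch under stated assumptions: the pool addresses are placeholders rather than working proxies, and pick_proxies is a hypothetical helper, not part of requests.

```python
import random
from urllib.parse import urlsplit

# Hypothetical proxy pool, grouped by the scheme of the URLs each proxy serves
PROXY_POOL = {
    'http': ['http://10.0.0.1:8080', 'http://10.0.0.2:8080'],
    'https': ['http://10.0.0.3:3128'],
}

def pick_proxies(url, pool=PROXY_POOL):
    """Return a proxies dict for `url`, or {} if the pool has nothing for its scheme."""
    scheme = urlsplit(url).scheme
    candidates = pool.get(scheme, [])
    if not candidates:
        return {}
    # Choosing at random spreads requests across the available proxies
    return {scheme: random.choice(candidates)}

https_choice = pick_proxies('https://www.baidu.com/s?wd=ip')
http_choice = pick_proxies('http://www.renren.com/')
no_choice = pick_proxies('ftp://example.com/file')

print(https_choice)
print(http_choice)
print(no_choice)
```

The returned dict can be passed straight to requests, for example requests.get(url, headers=headers, proxies=pick_proxies(url)). An empty dict means the request goes out without a proxy.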

One more thing: I have a question about scraping the renren.com profile page. The code is below:

# Author:K
# Question: why does the XPath below fail to return the href?
import requests
from lxml import etree

url = 'http://www.renren.com/970837645'

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'
}

page_text = requests.get(url=url, headers=headers).text

tree = etree.HTML(page_text)
href = tree.xpath('//*[@id="nxHeader"]/div/div/div/dl/dt/a/@href')
print(href)

Open question: scraping the renren.com home page
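
An empty list from tree.xpath usually means the HTML that requests received is not the HTML the browser shows. One plausible cause, in line with the cookie notes above, is that this request carries no login cookie, so the server may return a different page. Another is that the header is built by JavaScript after the page loads. Neither has been verified against renren.com. As a first diagnostic step, a sketch like the one below checks whether the anchor element is in the raw response at all. The helper name and the sample HTML strings are hypothetical.

```python
def explain_empty_xpath(page_text, element_id):
    """Rough first check on why an XPath anchored at `element_id` came back empty."""
    if f'id="{element_id}"' not in page_text:
        # The server never sent the element: likely a login page or script-built content
        return 'element missing from the HTML the server returned'
    # The element is there, so the steps after it in the XPath are the suspect
    return 'element present; the path below it probably does not match'

# Made-up responses standing in for what the server might return
logged_out_html = '<html><body><form id="loginForm"></form></body></html>'
logged_in_html = '<html><body><div id="nxHeader"><a href="/home">me</a></div></body></html>'

missing = explain_empty_xpath(logged_out_html, 'nxHeader')
present = explain_empty_xpath(logged_in_html, 'nxHeader')

print(missing)
print(present)
```

If the element turns out to be missing, sending the request through a logged-in session, as in the login example, would be the next thing to try.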

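The material above follows teacher Bobo's videos, which are really well taught and strongly recommended. Course link: https://www.apeland.cn/python/8/449


All right, it's late. Time for bed; my brain has stopped working.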
