Disclaimer: This article is for study and research only and must not be used for illegal purposes; otherwise you bear the consequences yourself. If anything here infringes on your rights, please let me know and I will delete it. Thank you!
Collecting data on dishonest persons from a website
Project scenario:
Website: aHR0cDovL3p4Z2suY291cnQuZ292LmNuL3NoaXhpbi8=
Today I will show you how to collect data from a website that lists dishonest persons. The address is given above in Base64-encoded form; if you know, you know~
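The address above is a Base64 string. As a small aside, here is a minimal sketch for decoding it with Python's standard library; the variable names are only illustrative.

```python
import base64

# The encoded site address exactly as given in the article.
encoded = "aHR0cDovL3p4Z2suY291cnQuZ292LmNuL3NoaXhpbi8="

# b64decode returns bytes; the address is plain ASCII, so UTF-8 decoding is safe.
url = base64.b64decode(encoded).decode("utf-8")
print(url)  # the same base address that the request URLs later in the article start with
```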
Solution:
1. The main obstacle on this site is the verification code (CAPTCHA), so to recognize it we first need to get the CAPTCHA image.
Clicking the CAPTCHA to refresh it triggers a new request with two parameters, captchaId and random, so we need to find out how these two parameters are generated.
From the request's call stack we click straight into the familiar refresh function. Once inside, it is clear at a glance how the two parameters are generated, so we extract that JS.
function getNum() {
    var chars = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A',
        'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
        'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y',
        'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k',
        'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w',
        'x', 'y', 'z'];
    var nums = "";
    for (var i = 0; i < 32; i++) {
        // Indexes 0..60 only, so the last entry ('z') is never picked
        var id = parseInt(Math.random() * 61);
        nums += chars[id];
    }
    return nums;
}

// Refresh the CAPTCHA
function refresh() {
    var randomNumber = Math.random();
    var uuid = getNum();
    return {randomNumber: randomNumber, uuid: uuid}
}
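Since getNum() and refresh() only produce random values, the same parameters can probably be generated directly in Python without running the JS. The following is a minimal sketch of such a port, not the site's own method: the names CHARS, get_num and refresh are chosen here for illustration, and it assumes the server treats captchaId as an opaque random token and does not depend on the exact way JavaScript formats the random number.

```python
import random
import string

# Same alphabet as the page's getNum(): digits, then A-Z, then a-z (62 characters).
CHARS = string.digits + string.ascii_uppercase + string.ascii_lowercase


def get_num(length=32):
    """Python port of getNum(): a random string of `length` characters."""
    # The page's JS uses parseInt(Math.random() * 61), i.e. indexes 0..60,
    # so the last character ('z') is never picked. randrange(61) mirrors that.
    return "".join(CHARS[random.randrange(61)] for _ in range(length))


def refresh():
    """Python port of refresh(): the two parameters of the CAPTCHA request."""
    return {"randomNumber": random.random(), "uuid": get_num()}


if __name__ == "__main__":
    print(refresh())
```

With a port like this the execjs dependency would no longer be needed, although calling the original JS, as the article does, has the advantage of staying identical to what the page runs.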
2. Next we download the CAPTCHA and verify it. First, look at the request steps on the web page
As the captured request shows, the verification request must carry captchaId, the same value used when fetching the CAPTCHA image, together with the recognized code pCode. Let's write some code to try it.
3. The code for requesting and verifying the CAPTCHA
Here the CAPTCHA is recognized manually. Automatic recognition with machine learning requires labeling samples and training a model, and I am too lazy to do that step. There are many machine-learning recognition methods online; here is a [link](https://blog.csdn.net/qq_26079939/article/details/109050936) you can refer to.
Note: the CAPTCHA request and the CAPTCHA verification must be made in the same session
import execjs
import requests

# Note: `headers` is the request-header dict defined in the complete code in step 6.

def get_param():
    js_str = '''function getNum() {
        var chars = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A',
            'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
            'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y',
            'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k',
            'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w',
            'x', 'y', 'z'];
        var nums = "";
        for (var i = 0; i < 32; i++) {
            var id = parseInt(Math.random() * 61);
            nums += chars[id];
        }
        return nums;
    }
    // Refresh the CAPTCHA
    function refresh() {
        var randomNumber = Math.random();
        var uuid = getNum();
        return {randomNumber: randomNumber, uuid: uuid}
    }'''
    js = execjs.compile(js_str)
    return js.call('refresh')


def check_yzm(uuid, randomNumber):
    params = (
        ('captchaId', uuid),
        ('random', randomNumber),
    )
    # One session for both requests: fetching and checking must share it
    session = requests.Session()
    # Request the CAPTCHA image
    response = session.get('http://zxgk.court.gov.cn/shixin/captchaNew.do', headers=headers, params=params, verify=False)
    with open('yzm.png', 'wb') as f:
        f.write(response.content)
    print('Enter the CAPTCHA shown in yzm.png:')
    pCode = input()
    params = (
        ('captchaId', uuid),
        ('pCode', pCode),
    )
    # Verify the CAPTCHA
    response = session.get('http://zxgk.court.gov.cn/shixin/checkyzm.do', headers=headers, params=params, verify=False)
    if response.text.strip() == '1':
        print('CAPTCHA correct')
        return [1, pCode]
    else:
        print('CAPTCHA wrong')
        return [0, pCode]
Run it and type in the CAPTCHA~
4. Next we fetch the data. First, look at the request URL and parameters
As you can see, pCode and captchaId are the CAPTCHA text entered earlier and the ID used when requesting the CAPTCHA image
Then we make the request in code; the id in each result is used to request the details page
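To make the shape of the list response concrete, here is a minimal offline sketch of how it might be parsed. The field names (totalSize, totalPage, result, id, iname) are the ones the article's complete code reads; the sample values and the function name parse_list are made up for illustration, and the real response may contain more fields.

```python
import json

# Made-up sample shaped like the fields the article's code reads;
# the values are placeholders, not real data.
sample_text = json.dumps([{
    "totalSize": 2,
    "totalPage": 1,
    "result": [
        {"id": 1001, "iname": "Example Co. A"},
        {"id": 1002, "iname": "Example Co. B"},
    ],
}])


def parse_list(text):
    """Return (total_size, total_page, [(id, iname), ...]) from the response text."""
    # The article's code takes element 0 of a JSON array, so we do the same.
    payload = json.loads(text)[0]
    rows = [(item["id"], item["iname"]) for item in payload["result"]]
    return payload["totalSize"], payload["totalPage"], rows


total_size, total_page, rows = parse_list(sample_text)
print(total_size, total_page, rows)
```

Each id in rows is what would then be passed on to the details request.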
5. Next comes the details page data, which also requires the pCode and captchaId parameters
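As a rough illustration of what the details request looks like, the sketch below assembles its query string offline. The endpoint and the four parameter names come from the complete code in this article; the function name build_detail_url and the sample id, pCode and captchaId values are hypothetical. The caseCode value is the one hard-coded in the article's code.

```python
from urllib.parse import parse_qs, urlencode

# Endpoint taken from the article's complete code.
DETAIL_URL = "http://zxgk.court.gov.cn/shixin/disDetailNew"


def build_detail_url(record_id, case_code, p_code, captcha_id):
    """Build the details-page URL with its four query parameters."""
    params = {
        "id": str(record_id),
        "caseCode": case_code,
        "pCode": p_code,          # the CAPTCHA text that passed verification
        "captchaId": captcha_id,  # the ID used when the CAPTCHA was requested
    }
    # urlencode percent-encodes the non-ASCII case number as UTF-8.
    return DETAIL_URL + "?" + urlencode(params)


# Placeholder values for illustration only.
case_code = "\uFF082019\uFF09\u6D590108\u62672318\u53F7"
url = build_detail_url(1001, case_code, "abcd", "0" * 32)
print(url)
```

Passing the same parameters to requests via params=, as the article does, should produce an equivalent URL, so this helper is only useful for inspecting what is sent.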
6. Finally, we combine the whole process. The complete code follows. It fetches only part of the data; if you want more, modify it yourself
import requests
import execjs
import json


def get_param():
    js_str = '''function getNum() {
        var chars = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A',
            'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
            'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y',
            'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k',
            'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w',
            'x', 'y', 'z'];
        var nums = "";
        for (var i = 0; i < 32; i++) {
            var id = parseInt(Math.random() * 61);
            nums += chars[id];
        }
        return nums;
    }
    // Refresh the CAPTCHA
    function refresh() {
        var randomNumber = Math.random();
        var uuid = getNum();
        return {randomNumber: randomNumber, uuid: uuid}
    }'''
    js = execjs.compile(js_str)
    return js.call('refresh')


def check_yzm(uuid, randomNumber):
    params = (
        ('captchaId', uuid),
        ('random', randomNumber),
    )
    # One session for both requests: fetching and checking must share it
    session = requests.Session()
    # Request the CAPTCHA image
    response = session.get('http://zxgk.court.gov.cn/shixin/captchaNew.do', headers=headers, params=params, verify=False)
    with open('yzm.png', 'wb') as f:
        f.write(response.content)
    print('Enter the CAPTCHA shown in yzm.png:')
    pCode = input()
    params = (
        ('captchaId', uuid),
        ('pCode', pCode),
    )
    # Verify the CAPTCHA
    response = session.get('http://zxgk.court.gov.cn/shixin/checkyzm.do', headers=headers, params=params, verify=False)
    if response.text.strip() == '1':
        print('CAPTCHA correct')
        return [1, pCode]
    else:
        print('CAPTCHA wrong')
        return [0, pCode]


# Fetch the list data
def get_data(pCode, captchaId):
    data = {
        'pName': '杭州',  # search keyword (Hangzhou)
        'pCardNum': '',
        'pProvince': '0',
        'pCode': pCode,
        'captchaId': captchaId,
        'currentPage': '1'
    }
    response = requests.post('http://zxgk.court.gov.cn/shixin/searchSX.do', headers=headers, data=data, verify=False)
    json_data = json.loads(response.text)[0]
    print('\n', 'Total records:', json_data['totalSize'], 'Total pages:', json_data['totalPage'])
    for info in json_data['result']:
        print(info['id'], info['iname'])
        get_detail(info['id'], pCode, captchaId)
        # break


# Fetch the detail data
def get_detail(id, pCode, captchaId):
    params = (
        ('id', str(id)),
        # Case number hard-coded in the original article
        ('caseCode', '\uFF082019\uFF09\u6D590108\u62672318\u53F7'),
        ('pCode', pCode),
        ('captchaId', captchaId),
    )
    response = requests.get('http://zxgk.court.gov.cn/shixin/disDetailNew', headers=headers, params=params, verify=False)
    print(response.text)


if __name__ == '__main__':
    # Module-level, so the functions above can read it as a global
    headers = {
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Referer': 'http://zxgk.court.gov.cn/shixin/',
        'X-Requested-With': 'XMLHttpRequest',
    }
    # Keep trying until a CAPTCHA is entered correctly
    while True:
        data = get_param()  # Get the request parameters
        print(data)
        uuid = str(data['uuid'])
        randomNumber = str(data['randomNumber'])
        check_flag = check_yzm(uuid, randomNumber)  # Verify the CAPTCHA
        if check_flag[0] == 1:
            get_data(check_flag[1], uuid)
            break
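The get_data function above only requests currentPage 1. One possible way to walk through further pages is sketched below, under assumptions that the article does not confirm: that totalPage is an integer-like page count and that a verified pCode/captchaId pair stays valid for later pages. The names build_search_payload, iter_results and fake_fetch are hypothetical, and fake_fetch is an offline stand-in for the real request with made-up data.

```python
def build_search_payload(keyword, p_code, captcha_id, page):
    """Form fields for searchSX.do, mirroring get_data() with a variable page."""
    return {
        "pName": keyword,
        "pCardNum": "",
        "pProvince": "0",
        "pCode": p_code,
        "captchaId": captcha_id,
        "currentPage": str(page),
    }


def iter_results(fetch_page):
    """Yield every result entry, page by page.

    fetch_page(page) is a caller-supplied function that returns the parsed
    object for that page: a dict with 'totalPage' and 'result'.
    """
    first = fetch_page(1)
    yield from first["result"]
    for page in range(2, int(first["totalPage"]) + 1):
        yield from fetch_page(page)["result"]


# Offline stand-in for the real request, with made-up data.
def fake_fetch(page):
    data = {1: [{"id": 1, "iname": "A"}], 2: [{"id": 2, "iname": "B"}]}
    return {"totalPage": 2, "result": data[page]}


print([item["id"] for item in iter_results(fake_fetch)])
```

In real use, fetch_page would post build_search_payload(...) to the search endpoint and return element 0 of the parsed JSON. It would also be sensible to pause between pages and to request a fresh CAPTCHA if the server starts rejecting the old one.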