Crawling email addresses from web pages with Python3

1. Crawler analysis

Analyzing the target URLs:
http://xxx.com?method=getrequest&gesnum=00000001
http://xxx.com?method=getrequest&gesnum=00000002
http://xxx.com?method=getrequest&gesnum=00000003
The returned data of each page is crawled.
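The URLs differ only in the 8-digit, zero-padded gesnum parameter, so the list of pages can be produced by enumeration. A minimal sketch of that idea (xxx.com and the 300-page range are taken from the script further below):

#!/usr/bin/python3
#Generate the candidate URLs by enumerating the zero-padded gesnum parameter
base = 'http://xxx.com?method=getrequest&gesnum={num:08d}'

urls = [base.format(num=i) for i in range(1, 301)]
print(urls[0])    # http://xxx.com?method=getrequest&gesnum=00000001
print(urls[-1])   # http://xxx.com?method=getrequest&gesnum=00000300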

Because the returned JSON data contains a single escape character "\" that python3 cannot parse, the straightforward call fails:
req = requests.get(url=url, headers=headers, verify=False, timeout=60).json()

Instead, the response is handled as binary data (bytes) and decoded manually:
req = requests.get(url=url, headers=headers, verify=False, allow_redirects=False, timeout=60)
data = json.dumps(bytes.decode(req.content, 'UTF-8'))
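A minimal sketch of this workaround as a reusable helper (the function name fetch_text and the try/except structure are my own; the URL, headers, and timeout come from the snippets above):

#!/usr/bin/python3
import json
import requests

def fetch_text(url, headers):
    req = requests.get(url=url, headers=headers, verify=False,
                       allow_redirects=False, timeout=60)
    try:
        #normal case: the body parses as JSON
        return req.json()
    except ValueError:
        #fallback: the body contains a stray "\" that the JSON decoder rejects,
        #so return the raw bytes decoded as UTF-8 text instead
        return req.content.decode('UTF-8')

The except ValueError branch also covers json.JSONDecodeError, which is a subclass of ValueError.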

 

2. Writing the crawler in Python3

#!/usr/bin/python3
#-*- coding:utf-8 -*-

#Environment: Windows 7 x64, Notepad++ + Python 3.5.0

import urllib3
urllib3.disable_warnings()
import sys
import requests
import re
import json

cookie = '''JSESSIONID=1B7407076DE01727BC48DCD56FF9BA70; entsoft=entsoft; JSESSIONID=4877B5AC1DF6307E90CF1641D3863A6C; radId=45991FBF-0BC4-3BA4-08E2-00072022FB2C'''

headers ={
    'Accept': 'application/json, text/plain, */*',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cookie': cookie,
}

#Write 00000001-00000300 to num.txt
def getNum():
    filename='C:\\Users\\Administrator\\Desktop\\脚本\\num.txt'
    file = open(filename,'w')   
    for i in range(1,301):
        file.write(("%08d" % i)+'\n')
    file.close()
    
    
def main():
    #url ='http://xxx.com?method=getrequest&gesnum=00000001'
    
    getNum()
    
    filename='C:\\Users\\Administrator\\Desktop\\脚本\\num.txt'
    with open(filename,'r') as file:
        for line in file:
            url = 'http://xxx.com?method=getrequest&gesnum={line}'.format(line=line.strip())
            #print(url)
            
            #req = requests.get(url=url, headers=headers, verify=False, timeout=60).json()
            #problem encountered: the returned JSON contains a single escape character "\",
            #which python3 cannot parse directly, so the following approach is used instead
            req = requests.get(url=url, headers=headers, verify=False, allow_redirects=False, timeout=60)
            
            #json.dumps converts a Python object to a JSON string
            #print(req.content)
            #response.text returns text data (unicode)
            #response.content returns binary data (bytes)
            #decoding the unicode text raised an error, so the bytes from response.content are used instead
            data = json.dumps(bytes.decode(req.content, 'UTF-8'))
            #print(data)
            
            #regex to match email addresses
            emailRegex = r"[-_\w\.]{0,64}@([-\w]{1,63}\.)*[-\w]{1,63}"
            email = re.search(emailRegex,data)
            
            print(email)
       
if __name__ == '__main__':
    main()
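The script above passes the session cookie as a raw Cookie header. requests can also take the cookies as a dict through its cookies= parameter (the other method referenced in section 6). A minimal sketch using the values from the cookie string above; note that a dict keeps only one of the two JSESSIONID values:

#!/usr/bin/python3
#-*- coding:utf-8 -*-
import requests

#cookie values taken from the cookie string above
cookies = {
    'JSESSIONID': '4877B5AC1DF6307E90CF1641D3863A6C',
    'entsoft': 'entsoft',
    'radId': '45991FBF-0BC4-3BA4-08E2-00072022FB2C',
}

req = requests.get('http://xxx.com?method=getrequest&gesnum=00000001',
                   cookies=cookies, verify=False, timeout=60)
print(req.status_code)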

3. The output format is as follows:

<_sre.SRE_Match object; span=(158, 184), match='[email protected]'>
<_sre.SRE_Match object; span=(145, 170), match='[email protected]'>
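re.search() returns a match object, which is why the lines above carry the <_sre.SRE_Match ...> wrapper. A small sketch of pulling out only the address with .group() (the string someone@example.com is a hypothetical stand-in for the crawled text):

#!/usr/bin/python3
import re

emailRegex = r"[-_\w\.]{0,64}@([-\w]{1,63}\.)*[-\w]{1,63}"
data = '{"contact": "someone@example.com"}'   #hypothetical sample of the crawled JSON text

email = re.search(emailRegex, data)
if email:
    #.group() returns just the matched text instead of the whole match object
    print(email.group())                      #someone@example.com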

4. The matched email output is then post-processed as follows:

#!/usr/bin/python3
#-*- coding:utf-8 -*-

#Environment: Windows 7 x64, Notepad++ + Python 3.5.0
def main():
    
    filename = "C:\\Users\\Administrator\\Desktop\\脚本\\email_handle.txt"
    filename1 = "C:\\Users\\Administrator\\Desktop\\脚本\\email_handle_handle.txt"
    file1 = open(filename1,'w')
      
    with open(filename,'r') as file:
        for line in file:
            #slice off the leading "<_sre.SRE_Match object; span=(158, 184), match='" prefix (48 characters)
            data = line[48:]
            print(data)
            file1.write(data)
        
    file1.close()
   

if __name__ == '__main__':
    main()
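The fixed line[48:] slice matches the output shown in section 3, but it depends on the span offsets keeping the same number of digits. A variation that re-applies the email regex from section 2 to each saved line avoids that assumption (the file paths are the ones used above):

#!/usr/bin/python3
#-*- coding:utf-8 -*-
import re

emailRegex = r"[-_\w\.]{0,64}@([-\w]{1,63}\.)*[-\w]{1,63}"

filename = "C:\\Users\\Administrator\\Desktop\\脚本\\email_handle.txt"
filename1 = "C:\\Users\\Administrator\\Desktop\\脚本\\email_handle_handle.txt"

with open(filename, 'r') as src, open(filename1, 'w') as dst:
    for line in src:
        #re-extract the address itself instead of slicing at a fixed offset
        match = re.search(emailRegex, line)
        if match:
            dst.write(match.group() + '\n')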

5. After this processing, each line still ends with a stray '>; search for '> in the txt file and replace it with nothing (empty).
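If you would rather not do the replacement by hand in an editor, a small sketch of the same step in Python (the input path is the one above; email_clean.txt is a hypothetical output file):

#!/usr/bin/python3
#-*- coding:utf-8 -*-

filename1 = "C:\\Users\\Administrator\\Desktop\\脚本\\email_handle_handle.txt"
cleaned = "C:\\Users\\Administrator\\Desktop\\脚本\\email_clean.txt"   #hypothetical output file

with open(filename1, 'r') as src, open(cleaned, 'w') as dst:
    for line in src:
        #strip the trailing '> left over from the match-object text
        dst.write(line.replace("'>", ""))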

6. References

Two ways of using cookies in a Python crawler
https://blog.csdn.net/weixin_38706928/article/details/80376572
Python3 UnicodeDecodeError / UnicodeEncodeError: 'gbk' codec can not decode/encode bytes and similar text-encoding problems
https://www.cnblogs.com/worstprogrammer/p/5189758.html
Simulated login with Python (using the requests library)
https://blog.csdn.net/majianfei1023/article/details/49927969
Disabling urllib3 certificate verification warnings in Python
https://blog.csdn.net/taiyangdao/article/details/72825735
Online JSON parsing and format validation
https://www.json.cn/

 
