Please check, a Python crawler note that makes your annual salary exceed 20W

The main topics this time are requests, BeautifulSoup, scrapy, and re, and I have just finished all of them except scrapy. I also completed a few small projects, such as crawling 58.com rental listings and a Taobao product-search crawler. Below is a summary from three angles: basic crawler methods, hands-on practice, and problems encountered.

1. Basic methods

The first is the requests library, the simplest and most practical HTTP (request) library in Python. Its main methods are as follows. requests.request() is the most commonly used for constructing requests and underlies all the other methods. The others are get() to fetch an HTML page, head() to fetch only the response headers of a page, post() and put() to submit the corresponding requests, patch() to submit a partial modification, and delete() to submit a deletion request.
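As a quick illustration, here is a minimal sketch (using httpbin.org, a public request-testing service not mentioned in the original note) of how the shortcut methods relate to requests.request():

import requests

# get() is shorthand for request('GET', ...); both return a Response object
r1 = requests.request('GET', 'https://httpbin.org/get')
r2 = requests.get('https://httpbin.org/get')

# head() retrieves only the response headers
h = requests.head('https://httpbin.org/get')
print(h.headers.get('Content-Type'))

# post(), put(), patch() and delete() submit the corresponding requests
p = requests.post('https://httpbin.org/post', data={'key': 'value'})
print(p.status_code)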


Focus on the requests.get() method: requests.get(url, params=None, **kwargs)

Here url is the page link, params holds extra query parameters in dictionary format, and **kwargs covers 12 access-control parameters: data, json, headers, cookies, auth, files, timeout, proxies, allow_redirects, stream, verify, cert.

Usually we use the get() method to get the content of the page.

Next, the Response object returned by the request. Its commonly used attributes and methods are:

r.status_code: the HTTP status code of the response (200 means success)
r.text: the response body decoded as a string
r.encoding: the encoding guessed from the HTTP response headers
r.apparent_encoding: the encoding inferred from the content itself
r.content: the raw response body in bytes
r.raise_for_status(): raise an exception if the status code is not 200
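For example, a small sketch (the URL is just an arbitrary public page used for illustration, assuming network access):

import requests

r = requests.get('https://www.python.org')
print(r.status_code)         # 200 if the request succeeded
print(r.encoding)            # encoding taken from the HTTP headers
print(r.apparent_encoding)   # encoding inferred from the page content
print(r.text[:200])          # first 200 characters of the decoded page
print(len(r.content))        # size of the raw body in bytes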

Here are some common code examples.

(1) Crawling JD.com products

import requests

url = "https://item.jd.com/2967929.html"
try:
    r = requests.get(url)
    r.raise_for_status()
    # raise an exception if the request returned an error status
    r.encoding = r.apparent_encoding
    # decode the text with the encoding inferred from the content
    print(r.text[:1000])
except:
    print("Crawl failed!")

(2) To crawl Amazon, you need to modify the headers field to simulate a browser request

import requestsurl="https://www.amazon.cn/gp/product/B01M8L5Z3Y"try:    kv = {'user-agent':'Mozilla/5.0'}  
#模拟请求头    r=requests.get(url,headers=kv)    
r.raise_for_status()    
r.encoding=r.apparent_encoding    
print(r.status_code)    print(r.text[:1000])except:    
print("爬取失败")

(3) Baidu keyword search: submitting keywords with params

import requestsurl="http://www.baidu.com/s"try:    
kv={'wd':'Python'}    
r=requests.get(url,params=kv)    
print(r.request.url)    
r.raise_for_status()    
print(len(r.text))    
print(r.text[500:5000])except:    
print("爬取失败")

(4) Image crawling and storage

import requests
import os

url = "http://tc.sinaimg.cn/maxwidth.800/tc.service.weibo.com/p3_pstatp_com/6da229b421faf86ca9ba406190b6f06e.jpg"
root = "D://pics//"
path = root + url.split('/')[-1]
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        with open(path, 'wb') as f:
            f.write(r.content)
            # r.content holds the binary image data
        print("File saved successfully")
    else:
        print("File already exists")
except:
    print("Crawl failed")

The BeautifulSoup library is introduced below for parsing web page content.

The basic usage is BeautifulSoup(mk, 'html.parser'), where mk is the markup text. You can use html.parser, lxml, xml, or html5lib as the parser; html.parser is selected here.

The main elements are Tag, Name, Attributes, NavigableString, and Comment. A Tag is accessed as soup.a, its Name as soup.a.name, its attributes as a.attrs['class'], the NavigableString (the non-attribute string between the tags) as tag.string, and a Comment is a comment embedded in the markup.
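A minimal sketch of these elements on a small hand-written HTML snippet (the snippet and class names are illustrative assumptions, not from the original projects):

from bs4 import BeautifulSoup

html = '<html><body><p class="title"><a href="http://example.com" class="link">Example</a></p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

tag = soup.a                  # Tag: the first <a> element
print(tag.name)               # Name: 'a'
print(tag.attrs['class'])     # Attributes: ['link']
print(tag.string)             # NavigableString: 'Example'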

The tag tree can be traversed in three directions: downward (.contents, .children, .descendants), upward (.parent, .parents), and in parallel (.next_sibling, .previous_sibling, .next_siblings, .previous_siblings).


In addition, you can use soup.prettify() to print the document with hierarchical indentation.
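A short sketch of the traversal attributes and prettify(), again on a made-up snippet:

from bs4 import BeautifulSoup

html = '<html><body><p>first</p><p>second</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

p = soup.p
print(p.parent.name)                    # upward: 'body'
print([c for c in p.parent.children])   # downward: the two <p> tags
print(p.next_sibling)                   # parallel: the <p>second</p> tag
print(soup.prettify())                  # indented, hierarchical output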

The information extraction method most commonly used is find_all(): soup.find_all('a') searches by tag name, soup.find_all('p', class_='course') searches by attribute, soup.find_all(string='...') searches string content, and it can be combined with regular expressions, e.g. soup.find_all(re.compile('link')).

find() searches and returns a single result
find_parents() searches the ancestor nodes and returns a list
find_parent() returns a single result from the ancestor nodes
find_next_siblings() searches the following siblings and returns a list
find_next_sibling() returns a single result from the following siblings
find_previous_siblings() searches the preceding siblings and returns a list
find_previous_sibling() returns a single result from the preceding siblings
find_all(name, attrs, recursive, string, **kwargs) returns a list storing the search results
Parameters:
name: a string (or list of strings) matched against tag names; find_all(True) returns all tags
attrs: a string matched against tag attribute values, e.g. find_all('a', 'href')
recursive: whether to search all descendants; defaults to True, while False searches only the direct children of the current node
string: a string matched against the text between <> and </>
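A minimal sketch of these search methods on a toy document (the HTML, class names and links are made up for illustration):

import re
from bs4 import BeautifulSoup

html = '''<div>
<p class="course">Python basics</p>
<p class="course">Web crawling</p>
<a href="http://example.com/link1">link1</a>
<a href="http://example.com/link2">link2</a>
</div>'''
soup = BeautifulSoup(html, 'html.parser')

print(soup.find_all('a'))                            # search by tag name
print(soup.find_all('p', class_='course'))           # search by attribute
print(soup.find_all(string='link1'))                 # search by string content
print(soup.find_all('a', href=re.compile('link')))   # combine with a regular expression
print(soup.find('p').string)                         # find() returns only the first match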

Finally, the re regular expression library is introduced.

The regular expression quantifiers are as follows: * matches the preceding element zero or more times, + matches it one or more times, ? matches it zero or one time, {m} matches it exactly m times, and {m,n} matches it between m and n times.

Greedy matching means a quantifier grabs as many characters as it can, while non-greedy (minimal) matching grabs as few as possible. The * and + quantifiers are greedy by default, matching as much text as possible; only appending a ? to them (as in *? or +?) achieves non-greedy, or minimal, matching.
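For example, a quick sketch of greedy versus non-greedy matching:

import re

text = '<p>first</p><p>second</p>'
print(re.findall(r'<p>.*</p>', text))    # greedy: ['<p>first</p><p>second</p>']
print(re.findall(r'<p>.*?</p>', text))   # non-greedy: ['<p>first</p>', '<p>second</p>']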

In the re library, the raw string type, i.e. r'text', is generally used, so that backslashes in the pattern do not need to be escaped twice; special characters that should be matched literally still need to be escaped with \.
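A small sketch of the difference:

import re

# the raw string r'\d+' and the ordinary string '\\d+' describe the same pattern
print(re.findall(r'\d+', 'price: 128 yuan'))    # ['128']
print(re.findall('\\d+', 'price: 128 yuan'))    # ['128']
# a special character such as '.' must be escaped to be matched literally
print(re.findall(r'3\.14', 'pi is 3.14'))       # ['3.14']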

The main methods are listed below.

re.search(pattern, string, flags=0): search a string for the first position matching the regular expression and return a match object
re.match(): match the regular expression from the beginning of the string and return a match object (note that it returns None if the start of the string does not match)
re.findall(): search the string and return all matching substrings as a list
re.split(): split a string by the regex matches and return a list
re.finditer(): search the string and return an iterator of match results, each element being a match object
re.sub(): replace every substring matching the regex in a string and return the resulting string
re.compile(pattern, flags): compile a regular expression string into a regular expression (pattern) object
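A minimal sketch of the most common calls (the sample text is made up):

import re

text = 'ID: A123, backup ID: B456'
m = re.search(r'[A-Z]\d+', text)
print(m.group(0))                         # 'A123', the first match
print(re.findall(r'[A-Z]\d+', text))      # ['A123', 'B456']
print(re.split(r',\s*', text))            # ['ID: A123', 'backup ID: B456']
print(re.sub(r'\d+', '***', text))        # 'ID: A***, backup ID: B***'
for m in re.finditer(r'[A-Z]\d+', text):
    print(m.span())                       # (4, 8) then (21, 25)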

The flags argument has three common options: re.I ignores case, re.M makes ^ match at the beginning of each line, and re.S makes . match any character, including newlines.
The above are the functional-style usages; there is also an object-oriented usage based on compiled patterns.

pat = re.compile('')
pat.search(text)

Finally, the properties and methods of the match object are introduced; see below.

1. Attributes
1) string: the text that was searched
2) re: the pattern object (regular expression) used for the match
3) pos: the position in the text where the search started
4) endpos: the position in the text where the search ended
2. Methods
1) .group(0): the matched string
2) .start(): the start position of the match in the original string
3) .end(): the end position of the match in the original string
4) .span(): the tuple (.start(), .end())
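A small sketch (the sample string is made up):

import re

m = re.search(r'\d+', 'order number 20210423, shipped')
print(m.string)                       # 'order number 20210423, shipped'
print(m.re)                           # re.compile('\\d+')
print(m.group(0))                     # '20210423'
print(m.start(), m.end(), m.span())   # 13 21 (13, 21)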

2. Practical drills

Two examples are covered: Taobao product search and 58.com rental listings (reference: https://cloud.tencent.com/developer/article/1611414).

Taobao search

import requests
import re


def getHTMLText(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0'
    }
    # To get the cookies: open the browser developer tools, refresh under the Network tab,
    # and copy the Cookie value from the request headers
    usercookies = ''
    # the cookies of your own logged-in Taobao session go here
    cookies = {}
    for a in usercookies.split(';'):
        if '=' not in a:
            continue
        name, value = a.strip().split('=', 1)
        cookies[name] = value
    print(cookies)
    try:
        r = requests.get(url, headers=headers, cookies=cookies, timeout=60)
        r.raise_for_status()
        # raise an exception if the response indicates an error
        print(r.status_code)  # print the status code
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return 'failed'


def parsePage(ilt, html):
    try:
        plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)
        tlt = re.findall(r'\"raw_title\"\:\".*?\"', html)
        for i in range(len(plt)):
            price = eval(plt[i].split(':')[1])
            # split on the colon and keep the value part
            title = eval(tlt[i].split(':')[1])
            ilt.append([price, title])
    except:
        print("")


def printGoodsList(ilt):
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("No.", "Price", "Product name"))
    # print the header row
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))


def main():
    goods = '足球'  # search keyword ('football')
    depth = 3
    start_url = 'http://s.taobao.com/search?q={}&s='.format(goods)
    # URL of the first page of search results
    infoList = []
    for i in range(depth):
        # loop to crawl each result page
        try:
            url = start_url + str(44 * i)
            html = getHTMLText(url)
            parsePage(infoList, html)
        except:
            continue
    printGoodsList(infoList)


main()

The 58.com rental crawler involves more code, so only the most important parts are shown.

1. Decoding of encrypted fonts

import base64
import requests
from fontTools.ttLib import TTFont
from lxml import etree

# headers and proxies are the request-header and proxy dicts defined elsewhere in the full script


# Fetch the font file and convert it to an XML file
def get_font(page_url, page_num, proxies):
    response = requests.get(url=page_url, headers=headers, proxies=proxies)
    # Extract the base64-encoded, obfuscated font string from the page
    base64_string = response.text.split("base64,")[1].split("'")[0].strip()
    # print(base64_string)
    # Decode the base64 string into binary data
    bin_data = base64.decodebytes(base64_string.encode())
    # Save it as a font file
    with open('58font.woff', 'wb') as f:
        f.write(bin_data)
    print('Visit ' + str(page_num) + ': font file saved successfully!')
    # Load the font file and convert it to an XML file
    font = TTFont('58font.woff')
    font.saveXML('58font.xml')
    print('Font file converted to XML!')
    return response.text


# Match the obfuscated font codes to the real digits
def find_font():
    # Digits corresponding to the glyph names (names starting with 'glyph')
    glyph_list = {
        'glyph00001': '0',
        'glyph00002': '1',
        'glyph00003': '2',
        'glyph00004': '3',
        'glyph00005': '4',
        'glyph00006': '5',
        'glyph00007': '6',
        'glyph00008': '7',
        'glyph00009': '8',
        'glyph00010': '9'
    }
    # The ten obfuscated character codes
    unicode_list = ['0x9476', '0x958f', '0x993c', '0x9a4b', '0x9e3a', '0x9ea3', '0x9f64', '0x9f92', '0x9fa4', '0x9fa5']
    num_list = []
    # Parse the xml file content with XPath
    font_data = etree.parse('./58font.xml')
    for unicode in unicode_list:
        # Look up the name corresponding to each code in the xml file
        result = font_data.xpath("//cmap//map[@code='{}']/@name".format(unicode))[0]
        # print(result)
        # If the name matches a key of the dictionary, take that key's value
        for key in glyph_list.keys():
            if key == result:
                num_list.append(glyph_list[key])
    print('Successfully found the digits corresponding to the codes!')
    # print(num_list)
    # Return the list of digits
    return num_list


# Replace all obfuscated font codes in the page
def replace_font(num, page_response):
    # 9476 958F 993C 9A4B 9E3A 9EA3 9F64 9F92 9FA4 9FA5
    result = (page_response
              .replace('鑶', num[0])
              .replace('閏', num[1])
              .replace('餼', num[2])
              .replace('驋', num[3])
              .replace('鸺', num[4])
              .replace('麣', num[5])
              .replace('齤', num[6])
              .replace('龒', num[7])
              .replace('龤', num[8])
              .replace('龥', num[9]))
    print('Successfully replaced all obfuscated characters!')
    return result
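A hypothetical driver showing how these three functions could be chained together; the page URL and proxy below are placeholders, not from the original code, and headers is assumed to be the global header dict from the full script:

# hypothetical usage sketch: fetch a listing page, decode the font, then restore the text
page_url = 'https://xx.58.com/chuzu/'      # placeholder listing URL
proxies = {'http': 'http://1.2.3.4:8080'}  # placeholder proxy

page_text = get_font(page_url, 1, proxies)    # saves 58font.woff / 58font.xml and returns the raw HTML
digits = find_font()                          # maps the ten obfuscated codes to real digits
clean_text = replace_font(digits, page_text)  # HTML with the real digits restored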

2. Crawling rental information

import time
from bs4 import BeautifulSoup


# Extract the rental information
def parse_pages(pages):
    num = 0
    soup = BeautifulSoup(pages, 'lxml')
    # Find the li tags that contain the individual listings
    all_house = soup.find_all('li', class_='house-cell')
    for house in all_house:
        # Title
        # title = house.find('a', class_='strongbox').text.strip()
        # print(title)
        # Price
        price = house.find('div', class_='money').text.strip()
        price = str(price)
        print(price)
        # Layout and floor area
        layout = house.find('p', class_='room').text.replace(' ', '')
        layout = str(layout)
        print(layout)
        # Building name and address
        address = house.find('p', class_='infor').text.replace(' ', '').replace('\n', '')
        address = str(address)
        print(address)
        num += 1
        print('Listing ' + str(num) + ' crawled, pausing for 3 seconds!')
        time.sleep(3)
        with open('58.txt', 'a+', encoding='utf-8') as f:
            # encoding='utf-8' is needed because the text read from the web and the text being written
            # may use different encodings; 'a+' keeps appending to the end of the file
            f.write(price + '\t' + layout + '\t' + address + '\n')

3. Since 58.com blocks the crawler's IP address, proxy IPs have to be crawled so the address can be switched.

def getiplists(page_num):
    # Crawl IP addresses into a list, over page_num pages of the proxy site
    proxy_list = []
    for page in range(1, page_num):
        url = "  " + str(page)
        # the proxy-list page URL (left blank in the original)
        r = requests.get(url, headers=headers)
        soup = BeautifulSoup(r.text, 'lxml')
        ips = soup.findAll('tr')
        for x in range(5, len(ips)):
            ip = ips[x]
            tds = ip.findAll("td")
            # find the td tags in each row
            ip_temp = 'http://' + tds[1].contents[0] + ":" + tds[2].contents[0]
            # .contents returns the child nodes; the NavigableStrings between tags count as nodes too
            proxy_list.append(ip_temp)
    proxy_list = set(proxy_list)
    # remove duplicates
    proxy_list = list(proxy_list)
    print('Crawled ' + str(len(proxy_list)) + ' IP addresses')
    return proxy_list

By updating proxies and passing it to requests.get() as a parameter, the IP address can be refreshed continuously.

proxies = {
    'http': item,
    'https': item,
}
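A hedged sketch of how the rotation might be wired up; the page URL is a placeholder, headers is the global header dict, and ip_list is assumed to come from getiplists():

import random

ip_list = getiplists(5)
for page_url in ['https://xx.58.com/chuzu/pn1/']:
    # placeholder page URLs; pick a different proxy for each request
    item = random.choice(ip_list)
    proxies = {'http': item, 'https': item}
    r = requests.get(page_url, headers=headers, proxies=proxies, timeout=10)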

3. Experience summary

The problems encountered during the period are summarized as follows:

1. Most websites require simulating the request header (user-agent).

2. Taobao requires a simulated cookies login; the cookies can be found with the browser's inspect-element tools.

3. This approach can only crawl static web pages; dynamic pages whose data is rendered by JavaScript require additional techniques.

4. It is easy to get your IP blocked while crawling, so IP addresses need to be crawled from an IP proxy website and refreshed constantly; just add the proxies parameter to the get() method.

5. The price strings on 58.com are displayed with an obfuscated (encrypted) font and need to be decoded.

6. When writing text files, use encoding='utf-8' to avoid encoding errors.
