Preface:
This time we learn data parsing!
0x00: Learn data analysis
In part one we mentioned focused crawlers (crawlers that extract specified content from a page). Most crawlers are focused crawlers, but as beginners we always crawl the entire page first, so how do we locate the part of the data we actually want? That is what data parsing is for.
Data parsing is mainly done in one of three ways:
—— Regular expressions
—— BeautifulSoup
—— xpath
Principles of data parsing:
—— The local text we want to parse is stored either between tags or in a tag's attributes
1. Locate the specified tag
2. Extract (parse) the data stored in the tag or in its attributes
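Of the three methods, regular expressions operate directly on the raw HTML string rather than on a parsed tree. A minimal sketch of the two steps above, using a made-up HTML snippet for illustration:

```python
import re

# a made-up snippet for illustration
html = '<div class="pic"><img src="http://example.com/a.jpg" alt="demo"></div>'

# 1. locate the specified tag, 2. extract the data stored in its attribute
srcs = re.findall(r'<img src="(.*?)"', html)
print(srcs)  # ['http://example.com/a.jpg']
```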
First, let's practice by crawling a picture:
#crawl a picture
import requests

if __name__ == "__main__":
    url = 'http://gaoxiao.jokeji.cn/UpFilesnew/2011/8/20/201182002041961.jpg'
    #content returns the image data in binary form
    img_data = requests.get(url=url).content
    #text (string)  content (binary data)  json (object)
    with open('imag.jpg', 'wb') as fp:
        fp.write(img_data)
    print('Over')
Crawled successfully:
0x01: Parsing with bs4
How bs4 data parsing works:
—— 1. Instantiate a BeautifulSoup object and load the page source data into it
—— 2. Locate tags and extract data by calling the object's attributes and methods
How to instantiate a BeautifulSoup object:
—— from bs4 import BeautifulSoup
—— Instantiation:
—— 1. Load the data of a local html document into the object
fp = open('sogou.html','r',encoding='utf-8')
soup = BeautifulSoup(fp,'lxml')
—— 2. Load page source fetched from the internet into the object
page_text = response.text
soup = BeautifulSoup(page_text,'lxml')
—— Methods and attributes provided for data parsing:
soup.tagName: returns the first tag named tagName in the document (singular: returns only one)
soup.find('tagName'): equivalent to soup.tagName (singular)
—— Locating by attribute:
——soup.find('div',class_='bg-gj-w')   (class_ can also be any other attribute)
#the trailing underscore distinguishes it from the class keyword
——soup.find_all('a')   (plural)
#returns all matching tags (as a list)
—— select:
——soup.select('.dingpai > li > a')
# > denotes a single level of hierarchy
——soup.select('.dingpai a')
#a space denotes any number of levels
#hierarchy selectors
——Getting the text between tags:
——soup.a.text/string/get_text()
——The difference:
text/get_text(): returns all of the text inside a tag, including nested tags
string: returns only the text directly under the tag itself
——Getting an attribute value from a tag:
—— soup.a['href']
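The difference between text and string is easiest to see on a tiny nested snippet (the stdlib html.parser is used here so the sketch runs without lxml):

```python
from bs4 import BeautifulSoup

html = '<div>outer <a>inner</a></div>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.div.text)    # 'outer inner' -- all text, nested tags included
print(soup.div.string)  # None -- the div has more than one direct child
print(soup.a.string)    # 'inner' -- the a tag has exactly one text child
```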
Test document:
<div class="dingpai">
<li>
<a id="ding79" href="javascript:ding('79','http://so.gushiwen.org/shiwenv.aspx?id=8dd719a833f0')">有用</a>
<a id="pai79" style=" margin-left:10px;" href="javascript:pai('79','http://so.gushiwen.org/shiwenv.aspx?id=8dd719a833f0')">没用</a>
<a style="width:34px; height:18px; line-height:19px; margin-top:2px; float:right; color:#aeaeae;" href="/jiucuo.aspx?u=%e7%bf%bb%e8%af%9179%e3%80%8a%e8%af%91%e6%96%87%e5%8f%8a%e6%b3%a8%e9%87%8a%e3%80%8b" target="_blank">完善</a>
</li>
</div>
Practice code:
from bs4 import BeautifulSoup

if __name__ == "__main__":
    #load the data of the local html document into the object
    fp = open('test.html', 'r', encoding='utf-8')
    soup = BeautifulSoup(fp, 'lxml')
    # print(soup.a)
    # print(soup.find('div', class_='dingpai').get_text())
    # print(soup.find_all('a'))
    # print(soup.select('.dingpai > li > a')[0])
    # print(soup.select('.dingpai a')[0])
    # print(soup.select('.dingpai a')[0].text)
    # print(soup.select('.dingpai a')[0].get_text())
    # print(soup.select('.dingpai a')[0]['href'])
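To check what those calls actually return, here is a self-contained variant that embeds the test document as a string instead of reading test.html (html.parser is used so it runs without lxml; the third link is omitted for brevity):

```python
from bs4 import BeautifulSoup

# the test document from above, embedded as a string
html = '''
<div class="dingpai">
<li>
<a id="ding79" href="javascript:ding('79','http://so.gushiwen.org/shiwenv.aspx?id=8dd719a833f0')">有用</a>
<a id="pai79" style=" margin-left:10px;" href="javascript:pai('79','http://so.gushiwen.org/shiwenv.aspx?id=8dd719a833f0')">没用</a>
</li>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

print(len(soup.find_all('a')))                    # 2
print(soup.select('.dingpai > li > a')[0].text)   # 有用
print(soup.select('.dingpai a')[1]['id'])         # pai79
```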
0x02: Crawling the Romance of the Three Kingdoms
With the basic operations practiced, let's use this module to crawl the Romance of the Three Kingdoms novel:
the task is to crawl every chapter title and the content belonging to it, and save them locally.
First, crawl the data of the entire page (a general-purpose crawler):
import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    #URL
    url = 'http://shicimingju.com/book/sanguoyanyi.html'
    #UA spoofing
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    #fetch the page data
    content = requests.get(url=url, headers=headers).text
After crawling the whole page, inspect the data to find the tags that hold the chapter titles and their URLs:
div -> ul -> a
Now that we know the hierarchy, parse the titles first:
import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    #URL
    url = 'http://shicimingju.com/book/sanguoyanyi.html'
    #UA spoofing
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    #fetch the page data
    content = requests.get(url=url, headers=headers).text
    #instantiate the object
    soup = BeautifulSoup(content, 'lxml')
    # print(soup)
    #parse the chapter titles and the url belonging to each title
    li_list = soup.select('.book-mulu > ul > li')
    for li in li_list:
        title = li.a.string
        #join into a full url
        title_url = 'http://shicimingju.com' + li.a['href']
        print(title)
        print(title_url)
The result:
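A small aside: the string concatenation 'http://shicimingju.com' + href works here because every href is an absolute path, but urllib.parse.urljoin is the more robust way to build a full URL (the chapter path below is illustrative):

```python
from urllib.parse import urljoin

base = 'http://shicimingju.com/book/sanguoyanyi.html'
# the kind of href found in the chapter list (path is illustrative)
href = '/book/sanguoyanyi/1.html'
print(urljoin(base, href))  # http://shicimingju.com/book/sanguoyanyi/1.html
```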
Next, observe a chapter's detail page: all of the content sits in p tags, so can't we just extract all the p tags and process them?
For each chapter URL crawled here, we instantiate a new BeautifulSoup object on that chapter's detail page; with a loop we can crawl the content of every chapter:
import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    #URL
    url = 'http://shicimingju.com/book/sanguoyanyi.html'
    #UA spoofing
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    #fetch the page data
    content = requests.get(url=url, headers=headers).text
    #instantiate the object
    soup = BeautifulSoup(content, 'lxml')
    # print(soup)
    #parse the chapter titles and the url belonging to each title
    li_list = soup.select('.book-mulu > ul > li')
    fp = open('sanguo.txt', 'w', encoding='utf-8')
    for li in li_list:
        title = li.a.string
        title_url = 'http://shicimingju.com' + li.a['href']
        # print(title)
        # print(title_url)
        #request the detail page and parse its content
        details_text = requests.get(url=title_url, headers=headers).text
        #parse the content
        detail_soup = BeautifulSoup(details_text, 'lxml')
        detail_text = detail_soup.find('div', class_='chapter_content')
        contents = detail_text.text
        fp.write(title + ':' + contents + '\n')
        print(title + ': crawled successfully')
Crawled successfully!
To sum up:
Wow, learning crawlers really is getting more and more interesting. Next up: data parsing with xpath.