Python Crawler Journey (Data Parsing): bs4

Preface:

This time we learn data parsing!

0x00: Understanding data parsing

In part one, we mentioned focused crawlers (crawlers that extract specified content from a page). Most crawlers are focused crawlers, but when starting out we inevitably crawl the data of the entire page. Locating just the part of the data we want is exactly what data parsing is for.

Data parsing is done primarily in one of three ways:

—— Regular expressions
—— BeautifulSoup
—— xpath

Data parsing principles:

—— The local text content to be parsed is stored either between tags or in a tag's attributes
1. Locate the specified tag
2. Extract (parse) the data stored in the tag or in its attributes (a tiny regex sketch of these two steps follows)
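Before moving on to bs4, here is a minimal sketch of those two steps using the first method, regular expressions. The HTML string and the pattern are made up purely for illustration:

import re

# a made-up snippet of page source, for illustration only
html = '<div class="pic"><img src="http://example.com/demo.jpg" alt="demo"></div>'
# step 1: the pattern locates the <img> tag inside the target div
# step 2: the capture group extracts the value stored in its src attribute
src_list = re.findall('<div class="pic">.*?<img src="(.*?)" alt', html, re.S)
print(src_list)  # ['http://example.com/demo.jpg']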

First, let's practice by crawling a picture:

#crawl an image
import requests
if __name__ == "__main__":
    url = 'http://gaoxiao.jokeji.cn/UpFilesnew/2011/8/20/201182002041961.jpg'
    #content returns the image data in binary form
    img_data = requests.get(url=url).content
    #text (string)  content (binary data)  json (object)
    with open('imag.jpg','wb') as fp:
        fp.write(img_data)
    print('Over')

Crawling succeeds and the image is saved locally as imag.jpg.
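As an aside: for a large file, a streamed download avoids holding the whole body in memory at once. A minimal sketch of the same download (the filename imag_stream.jpg is made up):

import requests

url = 'http://gaoxiao.jokeji.cn/UpFilesnew/2011/8/20/201182002041961.jpg'
# stream=True defers downloading the body; iter_content yields it in chunks
with requests.get(url=url, stream=True) as resp:
    with open('imag_stream.jpg', 'wb') as fp:
        for chunk in resp.iter_content(chunk_size=8192):
            fp.write(chunk)
print('Over')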

0x01: Parsing with bs4

How bs4 data parsing works:
—— 1. Instantiate a BeautifulSoup object and load the page source data into it
—— 2. Call the BeautifulSoup object's attributes and methods to locate tags and extract data
How to instantiate a BeautifulSoup object:
—— from bs4 import BeautifulSoup
—— Object instantiation:
	—— 1. Load the data of a local html document into the object
		fp = open('sogou.html','r',encoding='utf-8')
    	soup = BeautifulSoup(fp,'lxml')
	—— 2. Load page source fetched from the internet into the object (a runnable sketch follows)
		page_text = response.text
		soup = BeautifulSoup(page_text,'lxml')
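The second path above assumes a response object already exists; a minimal runnable version, using example.com as a stand-in URL:

import requests
from bs4 import BeautifulSoup

# fetch the page source from the internet, then load it into the object
response = requests.get('http://www.example.com')
page_text = response.text
soup = BeautifulSoup(page_text, 'lxml')
print(soup.title)  # e.g. <title>Example Domain</title>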


—— Attributes and methods provided for data parsing (a short demo follows this block):
	soup.tagName: returns the first tag named tagName that appears in the document (singular, only one is returned)
	soup.find('tagName'): equivalent to soup.tagName (singular)
	—— Locating by attribute:
		——soup.find('div',class_='bg-gj-w')  (class_ can be any other attribute as well)
		#the trailing underscore distinguishes it from the Python keyword class
		——soup.find_all('a')  (plural)
		#returns all matching tags as a list
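A quick demo of both, on a made-up HTML snippet (the class name bg-gj-w is reused from the line above):

from bs4 import BeautifulSoup

html = '<div class="bg-gj-w">hello</div><a href="/a">1</a><a href="/b">2</a>'
soup = BeautifulSoup(html, 'lxml')
print(soup.find('div', class_='bg-gj-w'))  # the first <div> with that class
print(soup.find_all('a'))                  # a list containing every <a> tag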


—— select:
	——soup.select('.dingpai > li > a')
	# > denotes a single level of hierarchy
	——soup.select('.dingpai a')
	#a space denotes any number of levels
	#hierarchical (CSS) selectors

——Getting the text between tags (a small sketch follows):
	——soup.a.text / soup.a.string / soup.a.get_text()
	——The difference:
		text/get_text(): returns all of the text content inside a tag, nested tags included
		string: returns only the text sitting directly under the tag itself
——Getting a tag's attribute value:
	—— soup.a['href']
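The difference only shows up once tags are nested; a small sketch on made-up HTML:

from bs4 import BeautifulSoup

html = '<div>outer <a href="/x">inner</a></div>'
soup = BeautifulSoup(html, 'lxml')
print(soup.div.text)    # 'outer inner' -- all text inside the tag, nested tags included
print(soup.div.string)  # None -- the div has more than one direct child
print(soup.a.string)    # 'inner' -- the a tag holds exactly one direct string
print(soup.a['href'])   # '/x' -- attribute values are read like a dict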

Test document:

<div class="dingpai">
<li>
<a id="ding79" href="javascript:ding('79','http://so.gushiwen.org/shiwenv.aspx?id=8dd719a833f0')">有用</a>
<a id="pai79" style=" margin-left:10px;" href="javascript:pai('79','http://so.gushiwen.org/shiwenv.aspx?id=8dd719a833f0')">没用</a>
<a style="width:34px; height:18px; line-height:19px; margin-top:2px; float:right; color:#aeaeae;" href="/jiucuo.aspx?u=%e7%bf%bb%e8%af%9179%e3%80%8a%e8%af%91%e6%96%87%e5%8f%8a%e6%b3%a8%e9%87%8a%e3%80%8b" target="_blank">完善</a>
</li>
</div>

Practice Code:

from bs4 import BeautifulSoup
if __name__ == "__main__":
    #load the data of the local html document into the object
    fp = open('test.html','r',encoding='utf-8')
    soup = BeautifulSoup(fp,'lxml')
    # print(soup.a)
    # print(soup.find('div',class_='dingpai').get_text())
    # print(soup.find_all('a'))
    # print(soup.select('.dingpai > li > a')[0])
    # print(soup.select('.dingpai a')[0])
    # print(soup.select('.dingpai a')[0].text)
    # print(soup.select('.dingpai a')[0].get_text())
    # print(soup.select('.dingpai a')[0]['href'])
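For reference, against the test document above: soup.a, soup.select('.dingpai > li > a')[0] and soup.select('.dingpai a')[0] all return the first link (the <a id="ding79"> tag); .text or .get_text() on it returns 有用; and ['href'] returns the javascript:ding(...) URL stored in that tag.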

0x02: Crawling the Romance of the Three Kingdoms novel

With the basic operations practiced, let's use this module to crawl the Romance of the Three Kingdoms (sanguoyanyi) novel. The task is to crawl every chapter title and the content each title points to, and save them locally.

As before, start by crawling the data of the entire page (a general-purpose crawler):

import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    #URL
    url = 'http://shicimingju.com/book/sanguoyanyi.html'
    #UA spoofing
    headers = {
        'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    #fetch the page data
    content = requests.get(url=url,headers=headers).text

After crawling the whole page, inspect the data to find the tag that holds each chapter title and its URL. The hierarchy is:

div -> ul -> a

Now that the hierarchy is known, first parse out the titles:

import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    #URL
    url = 'http://shicimingju.com/book/sanguoyanyi.html'
    #UA spoofing
    headers = {
        'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    #fetch the page data
    content = requests.get(url=url,headers=headers).text
    #instantiate the object
    soup = BeautifulSoup(content,'lxml')
    # print(soup)
    #parse the chapter titles and their corresponding urls
    li_list = soup.select('.book-mulu > ul > li')
    for li in li_list:
        title = li.a.string
        #join into the full url
        title_url = 'http://shicimingju.com'+li.a['href']
        print(title)
        print(title_url)

The result: each chapter title prints together with its full URL.

Next, observe a chapter's detail page: all of the content sits under <p> tags inside the div with class chapter_content, so we can locate that div and extract all of its text at once.

For each chapter URL crawled above we request the detail page and re-instantiate a BeautifulSoup object for it; looping over the chapters then crawls the content of every section:

import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    #URL
    url = 'http://shicimingju.com/book/sanguoyanyi.html'
    #UA spoofing
    headers = {
        'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    #fetch the page data
    content = requests.get(url=url,headers=headers).text
    #instantiate the object
    soup = BeautifulSoup(content,'lxml')
    # print(soup)
    #parse the chapter titles and their corresponding urls
    li_list = soup.select('.book-mulu > ul > li')
    fp = open('sanguo.txt','w',encoding='utf-8')
    for li in li_list:
        title = li.a.string
        title_url = 'http://shicimingju.com'+li.a['href']
        # print(title)
        # print(title_url)
        #request the detail page and parse its content
        details_text = requests.get(url=title_url,headers=headers).text
        #parse the content
        detail_soup = BeautifulSoup(details_text,'lxml')
        detail_text = detail_soup.find('div',class_='chapter_content')
        contents = detail_text.text
        fp.write(title +':'+ contents+ '\n')
        print(title+': crawled successfully')
    fp.close()
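An optional tweak: pausing briefly between detail-page requests keeps the crawler polite to the site. A sketch, with an arbitrarily chosen delay:

# at the top of the script:
import time

# then, as the last statement inside the for loop:
time.sleep(0.5)  # wait half a second between chapter requests (arbitrary value)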

Crawling success! Every chapter is written to sanguo.txt.

To sum up:

Crawler learning really is getting more and more interesting. Next up in data parsing: xpath.
