1. Ways to parse data
- re (regular expressions)
- bs4
- xpath
2. The purpose of data parsing
Accurately extract the data we want from the web page
3. Parsing data with re (regular expressions)
1. Crawl all the picture data from the Qiushibaike image section
import os
import re
import requests
from urllib import request

# Create the output directory if it does not exist
if not os.path.exists('./qiutu'):
    os.mkdir('./qiutu')
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
url = 'https://www.qiushibaike.com/pic/'
page_text = requests.get(url=url, headers=headers).text
# Regex: wrap the part to extract in (); multiple groups would return tuples
ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
# re.S makes . match newlines as well; without it nothing matches
img_url = re.findall(ex, page_text, re.S)
for url in img_url:
    url = 'https:' + url
    img_name = url.split('/')[-1]
    img_path = './qiutu/' + img_name
    request.urlretrieve(url, img_path)
    print(img_name, 'downloaded successfully!')
2. Review of regex basics
Single characters:
. : any character except newline
[] : [aoe] [a-w] matches any single character in the set
\d : digit, same as [0-9]
\D : non-digit
\w : digit, letter, underscore, or Chinese character
\W : not \w
\s : any whitespace character, including space, tab, form feed, etc.; equivalent to [ \f\n\r\t\v]
\S : non-whitespace
Quantifiers:
* : any number of times, >=0
+ : at least once, >=1
? : optional, 0 or 1 time
{m} : exactly m times, e.g. o{3}
{m,} : at least m times
{m,n} : m to n times
Boundaries:
$ : ends with ...
^ : starts with ...
Grouping:
(ab)
Greedy mode: .*
Non-greedy (lazy) mode: .*?
re.I : ignore case
re.M : multi-line matching (^ and $ match at every line)
re.S : single-line (dot-all) mode: . also matches newline
re.sub(pattern, replacement, string)
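A runnable sketch of the rules above, using a made-up HTML snippet (the tags and URLs are illustrative, not from a real page), showing why re.S matters and how greedy differs from non-greedy:

```python
import re

# Hypothetical snippet: image tags spread across several lines
html = '''<div class="thumb">
<img src="//pic.example.com/a.jpg" alt="one">
</div>
<div class="thumb">
<img src="//pic.example.com/b.jpg" alt="two">
</div>'''

# Non-greedy (.*?) inside a group; re.S lets . match newlines too
urls = re.findall(r'<div class="thumb">.*?<img src="(.*?)" alt.*?</div>', html, re.S)
print(urls)  # ['//pic.example.com/a.jpg', '//pic.example.com/b.jpg']

# Without re.S, . stops at each newline, so the pattern matches nothing
no_s = re.findall(r'<div class="thumb">.*?<img src="(.*?)" alt.*?</div>', html)
print(no_s)  # []

# Greedy vs. non-greedy on the same text
line = '<img src="a.jpg"><img src="b.jpg">'
greedy = re.findall(r'src="(.*)"', line)    # .* runs to the last quote
lazy = re.findall(r'src="(.*?)"', line)     # .*? stops at the first quote
print(greedy)  # ['a.jpg"><img src="b.jpg']
print(lazy)    # ['a.jpg', 'b.jpg']
```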
4. Parsing with the bs4 module
1. Principle
- Instantiate a BeautifulSoup object and load the page source data into it
- Use the object's attributes and methods to locate tags and extract data
2. Installation
pip install bs4
pip install lxml
3. Instantiate a BeautifulSoup object
i. From a file
Need a file handle:
fp = open('a.html', 'r', encoding='utf8')
soup = BeautifulSoup(fp, 'lxml')
ii. From a string
soup = BeautifulSoup(page_text, 'lxml')
4. The find method
find returns only the first tag that matches; the return value is a single object
soup.find('a')
soup.find('a', class_='xxx')
soup.find('a', title='xxx')
soup.find('a', id='xxx')
soup.find('a', id=re.compile(r'xxx'))
5. The find_all method
Returns a list of all the tags that match
soup.find_all('a')
soup.find_all('a', class_='wang')
soup.find_all('a', id=re.compile(r'xxx'))
soup.find_all('a', limit=2) # only the first two matching a tags
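A small self-contained sketch of find vs. find_all, assuming bs4 and lxml are installed as in section 2 (the HTML here is made up for illustration):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page to parse, not from a real site
html = '''<div>
  <a href="/1" class="wang" id="link1">first</a>
  <a href="/2" class="wang" id="link2">second</a>
  <a href="/3" class="song" id="other">third</a>
</div>'''
soup = BeautifulSoup(html, 'lxml')

# find: only the first matching tag, returned as a single object
first = soup.find('a')                        # the a with id="link1"
by_class = soup.find('a', class_='song')      # class_ avoids the keyword "class"
by_regex = soup.find('a', id=re.compile(r'^link'))

# find_all: every matching tag, returned as a list
all_a = soup.find_all('a')                    # 3 tags
wang_a = soup.find_all('a', class_='wang')    # 2 tags
two_a = soup.find_all('a', limit=2)           # stop after the first 2 matches
print(first.text, by_class.text, len(all_a), len(wang_a), len(two_a))
```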
6. The select method
Locate elements with CSS-style selectors
i. Commonly used selectors
1. Hierarchy selectors
div h1 a : a space means any descendant of the previous node
div > h1 > a : > means a direct child of the previous node
2. Attribute selector: input[name='hehe']
select('selector') # returns a list of matching objects
# find, find_all and select work not only on the soup object but also on its sub-objects;
# calling select on a sub-object searches for matching tags only within that sub-object
ii. Pseudo-classes and pseudo-elements are not supported
To get the first li of a ul we cannot use :first-child; instead use
select("ul li")[0]
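The selector rules above, sketched on a hypothetical page (bs4 and lxml assumed installed; the tags are illustrative):

```python
from bs4 import BeautifulSoup

# Hypothetical page: a ul inside a div, plus an input tag
html = '''<div class="tang">
  <ul>
    <li><a href="/a">A</a></li>
    <li><a href="/b">B</a></li>
  </ul>
  <input name="hehe" value="x">
</div>'''
soup = BeautifulSoup(html, 'lxml')

# A space selects descendants at any depth; > selects direct children only
desc = soup.select('div a')           # 2 results
direct = soup.select('div > a')       # 0: the a tags are not direct children of div
chain = soup.select('div > ul > li')  # 2 results

# Attribute selector
inp = soup.select('input[name="hehe"]')[0]

# :first-child is not supported; index the result list instead
first_li = soup.select('ul li')[0]

# select also works on a sub-object: it searches only inside that tag
links_in_first_li = first_li.select('a')  # 1 result
print(len(desc), len(direct), len(chain), inp['value'], first_li.a.text)
```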
7. Get tag attributes and text
- Attributes
soup.a.attrs['href'] # attrs is a dict of all the tag's attributes and values
soup.a['href'] # get the href attribute directly, without going through attrs
- Text
soup.a.string # text of the current element only
soup.a.text # text of the current element and all its children
soup.a.get_text() # same as .text: includes children
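A quick sketch of the difference between .string, .text and attribute access, on a made-up tag (bs4 and lxml assumed installed):

```python
from bs4 import BeautifulSoup

# Hypothetical tag with attributes and a nested child element
html = '<div><a href="/x" title="t">hello <span>world</span></a></div>'
soup = BeautifulSoup(html, 'lxml')

attrs = soup.a.attrs   # dict of all attributes and values
href = soup.a['href']  # shortcut for a single attribute

# .string is None when the tag mixes text with child elements;
# .text and .get_text() concatenate the text of the tag and all children
s = soup.a.string
t = soup.a.text
g = soup.a.get_text()
print(attrs, href, s, t, g)
```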
8. Crawl the novel Romance of the Three Kingdoms from the Shicimingju poetry site
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
page_text = requests.get(url=url, headers=headers).text
# Parse out the chapter titles and their urls
soup = BeautifulSoup(page_text, 'lxml')
li_list = soup.select('.book-mulu > ul > li')
fp = open('./sanguo.txt', 'w', encoding='utf-8')
for li in li_list:
    title = li.a.string
    detail_url = 'http://www.shicimingju.com' + li.a['href']
    # Request each detail page separately to get its source
    detail_page_text = requests.get(url=detail_url, headers=headers).text
    detail_soup = BeautifulSoup(detail_page_text, 'lxml')
    content = detail_soup.find('div', class_='chapter_content').text
    fp.write(title + '\n' + content + '\n')
    print(title, ': downloaded successfully!')
fp.close()
5. Parsing with xpath
1. Features
- Relatively high parsing efficiency
- The most general method: xpath expressions also work in other languages
- The most commonly used method (most important)
2. Installation
pip install lxml
3. Parsing principle
- Instantiate an etree object and load the page source data to be parsed into it
- Call the etree object's xpath method with an xpath expression to locate tags and extract data
4. Instantiate an etree object
etree.parse('local_file_path')
etree.HTML(page_text)
5. xpath syntax
i. Tag relationships
r = tree.xpath('/html/body/div')  # / means a direct child
r = tree.xpath('/html//div')  # // means any descendant
r = tree.xpath('//div')
r = tree.xpath('//div | //section')  # join two expressions with |
ii. Attributes
r = tree.xpath('//div[@class="song"]')  # the div tags whose class is "song"
iii. Get text
r = tree.xpath('//div[@class="tang"]//li[5]/a/text()')[0]  # text of the a under the 5th li under the div with class "tang"; xpath returns a list
r = tree.xpath('//li[7]//text()')  # all text under the 7th li, including children
r = tree.xpath('//div[@class="tang"]//text()')
iv. Get attributes
r = tree.xpath('//div[@class="song"]/img/@src')  # the value of the src attribute
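The expressions above can be tried on a self-contained, made-up page (lxml assumed installed; tags and paths are illustrative):

```python
from lxml import etree

# Hypothetical page mirroring the expressions above
html = '''<html><body>
<div class="song"><img src="/pic/1.jpg"/></div>
<div class="tang">
  <ul>
    <li><a href="/a">Li Bai</a></li>
    <li><a href="/b">Du Fu</a></li>
  </ul>
</div>
</body></html>'''
tree = etree.HTML(html)

divs = tree.xpath('/html/body/div')  # / steps through direct children: 2 divs
any_divs = tree.xpath('//div')       # // matches at any depth: the same 2 divs
both = tree.xpath('//div | //li')    # | joins two expressions: 2 + 2 nodes

# Attribute filter + text(); note xpath indexes from 1, so li[2] is the second li
second_author = tree.xpath('//div[@class="tang"]//li[2]/a/text()')[0]
# @src extracts the attribute value
src = tree.xpath('//div[@class="song"]/img/@src')[0]
print(len(divs), len(both), second_author, src)
```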
v. Crawl listing titles from 58.com's shared-rental (hezu) section
import requests
from lxml import etree

if __name__ == '__main__':
    url = 'https://bj.58.com/hezu/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'
    }
    res = requests.get(url=url, headers=headers)
    res.encoding = 'utf8'
    htm = res.text
    et = etree.HTML(htm)
    # Title of each listing: the a under h2 under div.des under li.house-cell
    print(et.xpath('//li[@class="house-cell"]/div[@class="des"]/h2/a/text()'))