03. Python Crawler Data Parsing

1. Ways to parse data

  1. re (regular expressions)
  2. bs4
  3. xpath

2. The purpose of data parsing

Accurately extract the data we want from a webpage

3. Parsing data with re (regular expressions)

1. Crawl all the picture data from the Qiushibaike (糗事百科) picture section

import os
import requests
import re
from urllib import request
if not os.path.exists('./qiutu'):
    os.mkdir('./qiutu')
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}

url = 'https://www.qiushibaike.com/pic/'
page_text = requests.get(url=url,headers=headers).text
# a regex is used here; wrap the content to extract in (); with multiple groups findall returns tuples
ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
# re.S below makes . also match newlines and similar characters; without it the pattern will not match
img_url = re.findall(ex,page_text,re.S)
for url in img_url:
    url = 'https:'+url
    img_name = url.split('/')[-1]
    img_path = './qiutu/'+img_name
    request.urlretrieve(url,img_path)
    print(img_name,'downloaded successfully!!!')

2. Review of regex basics

Single characters:
        . : any character except a newline
        [] : [aoe] [a-w]  match any single character from the set
        \d : digit, same as [0-9]
        \D : non-digit
        \w : digit, letter, underscore, or Chinese character
        \W : the opposite of \w
        \s : any whitespace character, including space, tab, form feed, etc.; equivalent to [ \f\n\r\t\v]
        \S : non-whitespace
    Quantifiers:
        * : any number of times, >= 0
        + : at least once, >= 1
        ? : optional, 0 or 1 time
        {m} : exactly m times
        {m,} : at least m times
        {m,n} : m to n times
    Boundaries:
        $ : ends with ...
        ^ : starts with ...
    Grouping:
        (ab)
    Greedy mode: .*
    Non-greedy (lazy) mode: .*?

    re.I : ignore case
    re.M : multi-line matching
    re.S : single-line (dot-all) mode, . also matches newlines

    re.sub(pattern, replacement, string)
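
As a quick illustration of the rules above, here is a minimal sketch (the sample string is made up for the demo) showing grouping, greedy vs non-greedy matching, re.S, and re.sub:

import re

sample = ('<div class="thumb"><img src="/pic/a.jpg" alt="one"></div>\n'
          '<div class="thumb"><img src="/pic/b.jpg" alt="two"></div>')

# greedy .* grabs as much as possible; non-greedy .*? stops at the first possible match
print(re.findall(r'src="(.*)"', sample))   # greedy: runs up to the last quote on each line
print(re.findall(r'src="(.*?)"', sample))  # non-greedy: ['/pic/a.jpg', '/pic/b.jpg']

# re.S lets . match newlines as well, so .*? can span multiple lines of page source
print(re.findall(r'<div class="thumb">.*?src="(.*?)".*?</div>', sample, re.S))

# re.sub(pattern, replacement, string)
print(re.sub(r'\d+', '#', 'room 101, floor 3'))  # room #, floor #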

4. Parsing with the bs4 module

1. Principle

  • Instantiate a BeautifulSoup object and load the page source data into it
  • Use the object's attributes and methods to locate tags and extract data

2. Installation

pip install bs4
pip install lxml

3. Instantiate the BeautifulSoup object

i. File

You need a file handle:

from bs4 import BeautifulSoup

fp = open("a.html", "r", encoding="utf8")
soup = BeautifulSoup(fp, 'lxml')

ii. String

soup = BeautifulSoup(page_text, 'lxml')
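
In practice the string is usually the page source returned by requests; a minimal sketch (example.com is only a placeholder URL):

import requests
from bs4 import BeautifulSoup

page_text = requests.get('http://www.example.com').text  # placeholder URL
soup = BeautifulSoup(page_text, 'lxml')
print(soup.title)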

4. find method

find only returns the first tag that meets the requirements, as a single object

soup.find('a')
soup.find('a', class_='xxx')
soup.find('a', title='xxx')
soup.find('a', id='xxx')
soup.find('a', id=re.compile(r'xxx'))

5. find_all method

Returns a list of all the tags that meet the requirements

soup.find_all('a')
soup.find_all('a', class_='wang')
soup.find_all('a', id=re.compile(r'xxx'))
soup.find_all('a', limit=2)   # extract only the first two matching a tags
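
A minimal sketch of find vs find_all on a made-up snippet:

from bs4 import BeautifulSoup

html = '''
<div>
  <a class="wang" href="/1">first</a>
  <a class="wang" href="/2">second</a>
  <a id="last" href="/3">third</a>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.find('a'))                     # only the first <a> tag, as a single object
print(soup.find_all('a', class_='wang'))  # list with both class="wang" links
print(soup.find_all('a', limit=2))        # at most two results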

6. select method

Uses CSS-selector syntax to locate elements

i. Commonly used selectors

1. Hierarchical selectors
   div h1 a      each later tag only needs to be a descendant of the tag before it
   div > h1 > a  each later tag must be a direct child of the tag before it
2. Attribute selector: input[name='hehe']


soup.select('selector')  # returns a list, and every item in it is a tag object
# find, find_all and select work not only on the soup object but also on its child tag objects;
# calling select on a child object searches for matching tags only inside that object

ii. Pseudo-class selectors do not apply

When we want to get the first li of a ul, we cannot use :first-child; instead we should use

select("ul li")[0]

7. Get label attributes and text

  • Attributes

    soup.a.attrs           # returns a dict of all the tag's attributes and their values
    soup.a.attrs["href"]   # get href from that dict
    soup.a['href']         # get the href attribute directly, without going through attrs
    
  • text

    soup.a.string      # text of the current element only
    soup.a.text        # text of the current element and all of its children
    soup.a.get_text()  # same as .text: includes children's text
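
A minimal sketch of the difference (the snippet is made up; note that .string returns None once the tag contains a child tag):

from bs4 import BeautifulSoup

html = '<p><a href="/home" title="home">go <b>home</b></a></p>'
soup = BeautifulSoup(html, 'lxml')

print(soup.a.attrs)    # {'href': '/home', 'title': 'home'}
print(soup.a['href'])  # /home
print(soup.a.string)   # None, because <a> holds text plus a child <b> tag
print(soup.a.text)     # go home  (text of the element and all of its children)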
    

8. Crawl the novel Romance of the Three Kingdoms from the ancient poetry site (shicimingju.com)

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
page_text = requests.get(url=url,headers=headers).text
# parse out the chapter titles and their detail-page urls
soup = BeautifulSoup(page_text,'lxml')
li_list = soup.select('.book-mulu > ul > li')
fp = open('./sanguo.txt','w',encoding='utf-8')
for li in li_list:
    title = li.a.string
    detail_url = 'http://www.shicimingju.com'+li.a['href']
    # request each detail page separately to get its source
    detail_page_text = requests.get(url=detail_url,headers=headers).text
    soup = BeautifulSoup(detail_page_text,'lxml')
    content = soup.find('div',class_="chapter_content").text

    fp.write(title+'\n'+content+'\n')
    print(title,': downloaded successfully!')
    
fp.close()

5. Parsing with xpath

1. Features

  • Parsing efficiency is relatively high
  • The most general-purpose approach; it is also available in other languages
  • The most commonly used method (most important)

2. Installation

pip install lxml

3. Parsing principle

  • Instantiate an etree object and load the page source data to be parsed into it
  • Call the etree object's xpath method with an xpath expression to locate tags and extract data

4. Instantiate the etree object

etree.parse('local file path')
etree.HTML(page_text)
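
A minimal sketch of both instantiation paths (the file path and URL are placeholders; etree.parse is given an HTMLParser because local pages are rarely strict XML):

import requests
from lxml import etree

# from a local html file (placeholder path)
tree_local = etree.parse('./a.html', etree.HTMLParser())

# from page source fetched over the network (placeholder URL)
page_text = requests.get('http://www.example.com').text
tree_net = etree.HTML(page_text)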

5. xpath format

i. Tag hierarchy

r = tree.xpath('/html/body/div')     # / means a direct parent-child step
r = tree.xpath('/html//div')         # // means any descendant level
r = tree.xpath('//div')
r = tree.xpath('//div | //section')  # combine two expressions with |

ii. Attributes

r = tree.xpath('//div[@class="song"]')  # div tags whose class attribute is "song"

iii. Get the text

r = tree.xpath('//div[@class="tang"]//li[5]/a/text()')[0]#表示class 为tang的div标签下的第5个li下的a下的文本信息,返回一个列表
r = tree.xpath('//li[7]//text()') #第7个li下的所有文本信息,包括子元素
r = tree.xpath('//div[@class="tang"]//text()') 

iv. Get attributes

r = tree.xpath('//div[@class="song"]/img/@src') #获取src属性的值

v. Crawl house listing names from 58.com

import requests
from lxml import etree

if __name__ == '__main__':
    url = "https://bj.58.com/hezu/"
    headers = {
        'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'
    }
    res = requests.get(url=url, headers=headers)
    res.encoding = "utf8"
    htm = res.text
    et = etree.HTML(htm)
    print(et.xpath('//li[@class="house-cell"]/div[@class="des"]/h2/a/text()'))
