02 Crawlers / Data Parsing

1. Overview of data parsing

  • What is data parsing and what can it do?

    • Concept: extracting the desired subset of data from the data that has been fetched.
    • Role: used to implement a focused crawler.
  • General workflow of data parsing

    • Question: where can the target data live inside the page HTML?
      • Between tags (tag text)
      • In tag attributes
    • Step 1: locate the tag
    • Step 2: extract its text or an attribute value
  • Common data-parsing techniques

    • re

    • bs4

    • xpath

    • pyquery (not covered below; a short sketch follows this list)
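
    pyquery is only listed here; a minimal sketch of the same locate-then-extract workflow, assuming pyquery is installed (pip install pyquery) and reusing the local test.html sample file used in the bs4/xpath sections below:

    from pyquery import PyQuery as pq

    doc = pq(filename='./test.html')   # load a local HTML file
    doc('.tang li')                    # CSS-selector location, similar to bs4's select()
    doc('#feng').text()                # text of the tag whose id is "feng"
    doc('#feng').attr('href')          # value of its href attribute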

2. Data parsing with regular expressions

  • Requirement: crawl the picture data from http://duanziwang.com/category/%E6%90%9E%E7%AC%91%E5%9B%BE/

  • How to crawl picture (binary) data

    Method 1:

    import requests
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
    }
    url = 'http://duanziwang.com/usr/uploads/2019/02/3334500855.jpg'
    pic_data = requests.get(url=url,headers=headers).content 
    # .content returns the response data as bytes (binary)
    with open('1.jpg','wb') as fp:
        fp.write(pic_data)

    Method 2: urllib (an older, lower-level counterpart of requests)

    import urllib.request
    url = 'http://duanziwang.com/usr/uploads/2019/02/3334500855.jpg'
    urllib.request.urlretrieve(url=url,filename='./2.jpg')
  • What is the difference between the two picture-crawling methods?

    • Method 1 can disguise the UA via headers; Method 2 cannot, because urlretrieve accepts no headers argument (see the sketch below for a possible workaround).
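
    If a UA is still needed with urllib, one possible workaround is to install a global opener carrying the header before calling urlretrieve; a minimal sketch:

    import urllib.request

    # install a global opener so that urlretrieve requests carry a custom User-Agent
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64)')]
    urllib.request.install_opener(opener)

    url = 'http://duanziwang.com/usr/uploads/2019/02/3334500855.jpg'
    urllib.request.urlretrieve(url=url,filename='./2.jpg')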
  • What is the difference between the page source shown in the developer tools' Elements tab and the source shown in a request's Response tab?

    • Elements: shows all of the data currently loaded in the page, including dynamically loaded data.

    • Response: shows only the data returned by the current request, excluding dynamically loaded data.

  • The sample code

    • Requirement: crawl the picture data of a single page
    import re
    import os
    import requests
    import urllib.request

    url = 'http://duanziwang.com/category/%E6%90%9E%E7%AC%91%E5%9B%BE/'
    page_text = requests.get(url,headers=headers).text  # page source data (headers as defined above)

    # create a folder for the images
    dirName = 'imgLibs'
    if not os.path.exists(dirName):
        os.mkdir(dirName)

    # data parsing: the address of every picture
    ex = '<article.*?<img src="(.*?)" alt=.*?</article>'
    img_src_list = re.findall(ex,page_text,re.S)  # findall in a crawler must use re.S (see the summary)

    for src in img_src_list:
        imgName = src.split('/')[-1]
        imgPath = dirName+'/'+imgName
        urllib.request.urlretrieve(url=src,filename=imgPath)
        print(imgName,'downloaded successfully!!!')
    • Requirement: full-site crawling, i.e. crawl the picture data of every page
    # define a generic URL template (do not modify it)
    url = 'http://duanziwang.com/category/搞笑图/%d/'

    for page in range(1,4):
        new_url = format(url%page)
        page_text = requests.get(new_url,headers=headers).text  # page source data

        # create a folder for the images
        dirName = 'imgLibs'
        if not os.path.exists(dirName):
            os.mkdir(dirName)

        # data parsing: the address of every picture
        ex = '<article.*?<img src="(.*?)" alt=.*?</article>'
        img_src_list = re.findall(ex,page_text,re.S)  # findall in a crawler must use re.S (see the summary)

        for src in img_src_list:
            imgName = src.split('/')[-1]
            imgPath = dirName+'/'+imgName
            urllib.request.urlretrieve(url=src,filename=imgPath)
            print(imgName,'downloaded successfully!!!')

3. Data parsing with bs4

  • Environment setup:
    • pip install bs4
    • pip install lxml
  • Parsing principle
    • Instantiate a BeautifulSoup object and load the page source to be parsed into it.
    • Call the BeautifulSoup object's methods and properties to locate tags and extract their text or attribute data.
  • Ways to instantiate a BeautifulSoup object:
    • BeautifulSoup(fp,'lxml'): loads the contents of a local file into the object to be parsed
    • BeautifulSoup(page_text,'lxml'): loads data requested from the Internet into the object to be parsed
  • bs4 parsing operations

    • Tag location: the return value is always the located tag (or tags)

      • soup.tagName: locates the first tagName tag; returns a single tag.
      • Attribute location: soup.find('tagName',attrName='value'); returns a single tag.
      • find_all('tagName',attrName='value'): returns multiple tags (a list).
      • Selector location: select('selector'); returns a list.
        • Hierarchy selectors:
          • Greater-than sign (>): one level
          • Space: any number of levels
    • Extracting text

      • string: extracts only the tag's direct text (illustrated after the sample code below)
      • text: extracts all text content under the tag
    • Extracting attributes

      • tag['attrName']
    • The sample code

    from bs4 import BeautifulSoup
    fp = open('./test.html','r',encoding='utf-8')
    soup = BeautifulSoup(fp,'lxml')            # parse a local file
    soup.p                                     # first p tag
    soup.find('div',class_='tang')             # first div with class="tang"
    soup.find('a',id='feng')                   # first a with id="feng"
    soup.find_all('div',class_='tang')         # all matching divs (list)
    soup.select('#feng')                       # CSS selector, returns a list
    soup.select('.tang > ul > li')             # > means one level down
    soup.select('.tang li')                    # space means any number of levels
    tag = soup.title
    tag.text                                   # all text under the title tag
    li_list = soup.select('.tang > ul > li')
    li_list[6].text
    div_tag = soup.find('div',class_='tang')
    div_tag.text
    a_tag = soup.select('#feng')[0]
    a_tag['href']                              # attribute value
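
    A small illustration of string vs text from the list above (the nested div here is made up for the example, not taken from test.html):

    nested = BeautifulSoup('<div>outer <span>inner</span></div>','lxml').div
    nested.string   # None: the div has more than one direct child node
    nested.text     # 'outer inner': all text anywhere under the tag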
  • Requirement: crawl the full content of the novel at http://www.shicimingju.com/book/sanguoyanyi.html

  • Analysis:

    • Home page: parse out each chapter name and the URL of its detail page
    • Detail page: parse out the chapter content
  • The sample code

    import requests
    from bs4 import BeautifulSoup

    # crawl the page data of the home page (headers as defined above)
    main_url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
    page_text = requests.get(main_url,headers=headers).text

    fp = open('./sanguo.txt','a',encoding='utf-8')

    # parse the chapter titles + detail-page URLs
    soup = BeautifulSoup(page_text,'lxml')
    a_list = soup.select('.book-mulu > ul > li > a')
    for a in a_list:
        title = a.string   # chapter title
        detail_url = 'http://www.shicimingju.com'+a['href']

        # crawl the page source of the detail page
        detail_page_text = requests.get(url=detail_url,headers=headers).text
        # parse the chapter content
        detail_soup = BeautifulSoup(detail_page_text,'lxml')
        div_tag = detail_soup.find('div',class_="chapter_content")
        content = div_tag.text # chapter content
        fp.write(title+':'+content+'\n')
        print(title,'downloaded successfully!!!')
    fp.close()

4. Data parsing with xpath

  • Environment setup:
    • pip install lxml
  • Parsing principle (workflow)
    • Instantiate an etree object and load the data to be parsed into it.
    • Call the etree object's xpath method, combined with different xpath expressions, to locate tags and extract text or attribute data.
  • etree object instantiation
    • etree.parse('filePath'): loads data from a local file into the object
    • etree.HTML(page_text): loads data requested from the Internet into the object
  • All HTML tags follow a tree-like structure, which lets us traverse and look up (locate) nodes efficiently.
  • The return value of the xpath method is always a list (plural).

  • Tag location

    • Leftmost /: the expression must locate tags starting from the root tag
    • Non-leftmost /: one level
    • Leftmost //: locates a tag starting from anywhere in the document (most commonly used)
    • Non-leftmost //: any number of levels
    • //tagName: locates every tagName tag
    • Attribute location: //tagName[@attrName="value"]
    • Index location: //tagName[index], where the index starts from 1
    • Fuzzy matching (demonstrated after the sample code below):
      • //div[contains(@class, "ng")]
      • //div[starts-with(@class, "ta")]
  • Extracting text

    • /text(): takes the direct text content; the returned list has a single element
    • //text(): takes all text content; the returned list can have multiple elements
  • Extracting attributes

    • /@attrName
  • The sample code

    from lxml import etree
    tree = etree.parse('./test.html')          # parse a local file
    tree.xpath('/html/head/meta')              # locate from the root, level by level
    tree.xpath('/html//meta')                  # // stands for any number of levels
    tree.xpath('//meta')                       # locate from anywhere in the document
    tree.xpath('//div')
    tree.xpath('//div[@class="tang"]')         # attribute location
    tree.xpath('//li[1]')                      # index location, starting from 1
    tree.xpath('//a[@id="feng"]/text()')[0]    # direct text
    tree.xpath('//div[2]//text()')             # all text under the tag
    tree.xpath('//a[@id="feng"]/@href')        # attribute value
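
    The fuzzy-matching expressions listed above are not exercised in the snippet; against the same test.html they would look like this (assuming it contains div tags whose class value, e.g. "tang", contains "ng" or starts with "ta"):

    tree.xpath('//div[contains(@class,"ng")]')     # class attribute contains "ng"
    tree.xpath('//div[starts-with(@class,"ta")]')  # class attribute starts with "ta"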
  • Requirement: crawl the Huya (huya.com) live-streaming page and parse out each room's name, heat (popularity), and detail-page URL

    import requests
    from lxml import etree

    url = 'https://www.huya.com/g/lol'
    page_text = requests.get(url=url,headers=headers).text  # headers as defined above
    # data parsing
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//div[@class="box-bd"]/ul/li')
    for li in li_list:
        # local parsing: parse only the content under this specific tag
        # the leftmost ./ in a local xpath expression refers to the tag the xpath method is called on
        title = li.xpath('./a[2]/text()')[0]
        hot = li.xpath('./span/span[2]/i[2]/text()')[0]
        detail_url = li.xpath('./a[1]/@href')[0]
        print(title,hot,detail_url)
    • Crawling picture data with xpath + handling garbled text
    # URL template
    url = 'http://pic.netbian.com/4kmeinv/index_%d.html'
    for page in range(1,11):
        new_url = format(url%page)  # only valid for page numbers other than the first
        if page == 1:
            new_url = 'http://pic.netbian.com/4kmeinv/'
        page_text = requests.get(new_url,headers=headers).text
        tree = etree.HTML(page_text)
        li_list = tree.xpath('//*[@id="main"]/div[3]/ul/li')
        for li in li_list:
            img_name = li.xpath('./a/img/@alt')[0]+'.jpg'
            img_name = img_name.encode('iso-8859-1').decode('gbk')
            img_src = 'http://pic.netbian.com'+li.xpath('./a/img/@src')[0]
            print(img_name,img_src)
    • Using the pipe character (|) in xpath
    url = 'https://www.aqistudy.cn/historydata/'
    page_text = requests.get(url,headers=headers).text
    
    tree = etree.HTML(page_text)
    # hot_cities = tree.xpath('//div[@class="bottom"]/ul/li/a/text()')
    all_cities = tree.xpath('//div[@class="bottom"]/ul/div[2]/li/a/text() | //div[@class="bottom"]/ul/li/a/text()')
    all_cities
    • Application of the pipe character in xpath expressions
      • Objective: make one xpath expression more general-purpose, so it matches several cases at once

Summary:

  1. .content returns the response data as bytes (binary)

    data = requests.get(url=url,headers=headers).content

  2. Regular expressions: use re.S

    • img_src_list = re.findall(ex,page_text,re.S)

    • Without the re.S flag, matching is done within each line only: if a line does not match, matching restarts on the next line and never crosses line boundaries.

      With re.S, the regular expression treats the string as a whole, with '\n' included as an ordinary character, and matches against the entire string (a short sketch follows this summary).

  3. new_url = format(url%page)

    def format(value, format_spec='', /)
        Return value.__format__(format_spec)

    Since url % page is already a string, format() simply returns it unchanged here.

  4. urllib's urlretrieve persists the requested data straight to a file in one call; requests does not (the response content has to be written to a file manually)
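
    A minimal illustration of the re.S point above (the HTML string here is made up for illustration):

    import re
    html = '<div>\n<img src="a.jpg">\n</div>'
    re.findall('<div>.*?src="(.*?)"',html)       # []       : without re.S, . does not match '\n'
    re.findall('<div>.*?src="(.*?)"',html,re.S)  # ['a.jpg'] : the pattern can now span lines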
