Introduction to data analysis (1): XPath and BeautifulSoup

1. XPath

Note: Install the XPath plugin in advance:
(1) Open the Chrome browser
(2) Click the three-dot menu in the upper right corner
(3) More tools
(4) Extensions
(5) Drag the XPath plugin onto the Extensions page
(6) If the .crx file is reported as invalid, change its suffix to .zip
(7) Drag it in again
(8) Close the browser and reopen it
(9) Press Ctrl + Shift + X
(10) A small black box appears

If the box appears, the plugin has been installed successfully.

Basic syntax of XPath:

1. Path query
   //: find all descendant nodes, regardless of hierarchy
   /: find direct child nodes
2. Predicate query: //div[@id], //div[@id="maincontent"]
3. Attribute query: //@class
4. Fuzzy query: //div[contains(@id, "he")], //div[starts-with(@id, "he")]
5. Content query: //div/h1/text()
6. Logical operations: //div[@id="head" and @class="s_down"], //title | //price
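The queries above can be tried on a small inline document with lxml (the HTML snippet below is made up for illustration):

```python
from lxml import etree

html = """
<html><body>
  <div id="head" class="s_down"><h1>Title</h1></div>
  <div id="maincontent"><h1>Main</h1></div>
  <div id="hello"><h1>Hello</h1></div>
</body></html>
"""
tree = etree.HTML(html)

# Path query: // searches all descendants, / only direct children
all_h1 = tree.xpath('//div/h1/text()')  # ['Title', 'Main', 'Hello']

# Predicate query: divs that have an id attribute
ids = tree.xpath('//div[@id]/@id')      # ['head', 'maincontent', 'hello']

# Fuzzy query: starts-with on the id attribute
he_ids = tree.xpath('//div[starts-with(@id, "he")]/@id')  # ['head', 'hello']

# Logical operation: both conditions must hold
both = tree.xpath('//div[@id="head" and @class="s_down"]/h1/text()')  # ['Title']
```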

1. Install the lxml library:
   pip install lxml -i https://pypi.douban.com/simple
2. Import lxml.etree:
   from lxml import etree
3. etree.parse(): parse a local file
   html_tree = etree.parse('XX.html')
4. etree.HTML(): parse a server response
   html_tree = etree.HTML(response.read().decode('utf-8'))
5. html_tree.xpath(xpath_expression)
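Steps 3-5 can be sketched as follows; the inline HTML string stands in for a decoded server response, and local.html is a hypothetical file name:

```python
from lxml import etree

# etree.HTML() parses an HTML string, e.g. a decoded server response
html_tree = etree.HTML('<html><body><h1>demo</h1></body></html>')
titles = html_tree.xpath('//h1/text()')
print(titles)

# etree.parse() reads a local file instead; it expects well-formed XML
# by default, so pass an HTML parser for ordinary web pages:
# html_tree = etree.parse('local.html', etree.HTMLParser())
```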

A small code demo:

import urllib.request
from lxml import etree
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

def create_request(page):
    if (page == 1):
        url = 'https://sc.chinaz.com/tupian/qinglvtupian.html'
    else:
        url = 'https://sc.chinaz.com/tupian/qinglvtupian_' + str(page) + '.html'

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36',
    }
    request = urllib.request.Request(url=url,headers=headers)
    return request

def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')  # decode the page source read from the site as utf-8
    return content

def down_load(content):
    # Download the images.
    # urllib.request.urlretrieve("image URL", "file name")
    # Use etree.HTML() to parse a server response; use etree.parse() for local files.
    tree = etree.HTML(content)

    name_list = tree.xpath('//div[@class="container"]//div/img/@alt')
    # //div[@class="container"]//div/img/@src
    # Image sites generally lazy-load, so the real URL is in data-original rather than src.
    src_list = tree.xpath('//div[@class="container"]//div/img/@data-original')
    for i in range(len(name_list)):
        name = name_list[i]
        src = src_list[i]
        url = 'http:' + src

        urllib.request.urlretrieve(url=url,filename=name+'.jpg')

if __name__ == '__main__':
    start_page = 1
    end_page = 3
    for page in range(start_page, end_page + 1):
        # Build the request object
        request = create_request(page)
        # Fetch the page source
        content = get_content(request)
        # Download the images
        down_load(content)
    print('Finished')

2. BeautifulSoup

1. BeautifulSoup is abbreviated as bs4.
2. What is BeautifulSoup? Like lxml, it is an HTML parser whose main function is to parse and extract data.
3. Advantages and disadvantages. Disadvantage: it is not as efficient as lxml. Advantage: the interface design is user-friendly and easy to use.
Note: open() may default to a platform encoding such as gbk, so specify the encoding explicitly when opening a file.
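As a minimal sketch of the bs4 interface (the HTML string and file name are made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<div class="container"><a href="/a" title="first">A</a><a href="/b">B</a></div>'
soup = BeautifulSoup(html, 'html.parser')  # stdlib parser; 'lxml' is faster if installed

first_link = soup.find('a')               # first <a> tag
print(first_link['href'])                 # /a
all_links = soup.find_all('a')
print([a.get_text() for a in all_links])  # ['A', 'B']

# When parsing a local file, specify the encoding explicitly,
# since open() may default to a platform encoding such as gbk:
# soup = BeautifulSoup(open('page.html', encoding='utf-8'), 'html.parser')
```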

3. JsonPath

Origin blog.csdn.net/guoguozgw/article/details/128835764