【Crawler: Parsing】


1. XPath

How to use XPath:
First install the XPath plug-in as a browser extension. After installation, close and reopen the browser, then press Ctrl+Shift+X; a box should appear on the page, indicating that the plug-in was installed successfully.

  1. Install the lxml library

pip install lxml
Mirror installation: pip install -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com lxml


  2. Import lxml.etree

from lxml import etree

  3. etree.parse() — parse a local file

html_tree = etree.parse('xx.html')
First create an HTML file and write some sample data into it.
Note that etree.parse() uses a strict XML-style parser, so every tag must be closed: for void tags such as <meta>, add a trailing slash (/) to indicate that there is no end tag.

xpath parses local files:

Print the tree; the result is an lxml ElementTree object, which confirms that the parse succeeded.
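A minimal sketch of this step, assuming the xx.html file created above exists:

from lxml import etree

# Parse the local file; requires well-formed (XML-style) HTML
tree = etree.parse('xx.html')

# Prints something like <lxml.etree._ElementTree object at 0x...>
print(tree)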

Basic syntax of xpath
(1) path query

//: Find all descendant nodes, regardless of hierarchical relationship
/: Find direct child nodes

(2) Predicate query

//div[@id]
//div[@id="maincontent"]

(3) Attribute query

@class

(4) Fuzzy query

//div[contains(@id,"he")]
//div[starts-with(@id,"he")]

(5) Content query

//div/h1/text()

Putting this into a concrete code block:

from lxml import etree

# Parse the local file (strict XML-style parsing)
tree = etree.parse('解析_xpath的基本使用.html')

# tree.xpath('xpath expression')

# Find the li elements under ul
# li_list = tree.xpath('//body/ul/li')

# Find all li tags that have an id attribute
# text() extracts the text content of a tag
# li_list = tree.xpath('//ul/li[@id]/text()')


# Find the li tag whose id is "l1"
# li_list = tree.xpath('//ul/li[@id="l1"]/text()')

# Get the class attribute value of the li tag whose id is "l1"
# li = tree.xpath('//ul/li[@id="l1"]/@class')


# Find li tags whose id contains "l" (fuzzy query)
# li_list = tree.xpath('//ul/li[contains(@id,"l")]/text()')

# Find li tags whose id starts with "l"
# li_list = tree.xpath('//ul/li[starts-with(@id,"l")]/text()')


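The original post shows 解析_xpath的基本使用.html only as a screenshot; the following is a hypothetical minimal version, consistent with the queries above, that the code can parse:

<html lang="en">
<head>
    <!-- etree.parse() is strict: void tags like meta need a closing slash -->
    <meta charset="UTF-8"/>
    <title>Title</title>
</head>
<body>
    <ul>
        <li id="l1" class="c1">Beijing</li>
        <li id="l2">Shanghai</li>
        <li id="c3">Shenzhen</li>
        <li id="c4">Wuhan</li>
    </ul>
</body>
</html>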

  4. Server response file — etree.HTML()

html_tree = etree.HTML(response.read().decode('utf-8'))

2. Get the "百度一下" button text from the Baidu homepage

  1. Get the source code of the web page
  2. Parse it
  3. Print the result

In the browser inspector, get the XPath of the target element: the search button is the input with id "su", and its value attribute holds the text.

import urllib.request
from lxml import etree

url = 'https://www.baidu.com/'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
}

# Customize the request object
request = urllib.request.Request(url=url, headers=headers)

# Simulate a browser visiting the server
response = urllib.request.urlopen(request)

# Get the page source
content = response.read().decode('utf-8')

# Parse the server response to extract the data we want
tree = etree.HTML(content)

# The value attribute of the search button (prints 百度一下)
result = tree.xpath('//input[@id="su"]/@value')[0]

print(result)

3. Webmaster material (sc.chinaz.com)

Requirement: download the images from the first ten pages.
In the browser inspector, find the preview and response of the first page's request and confirm it really is the first page; if so, copy its URL: https://sc.chinaz.com/tupian/qinglvtupian.html
Find the URL of the second page the same way: https://sc.chinaz.com/tupian/qinglvtupian_2.html

Notice that the first page's URL pattern differs from that of the other pages: it has no page-number suffix.

The complete implementation (the downloaded images are saved into the loveImg folder, which must already exist, since urlretrieve does not create directories):

import urllib.request
from lxml import etree

def create_request(page):
    # The first page has no page-number suffix; later pages do
    if page == 1:
        url = 'https://sc.chinaz.com/tupian/qinglvtupian.html'
    else:
        url = 'https://sc.chinaz.com/tupian/qinglvtupian_' + str(page) + '.html'

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
    }

    request = urllib.request.Request(url=url, headers=headers)
    return request

def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    return content


def down_load(content):
    tree = etree.HTML(content)

    # Image names come from the alt attribute
    name_list = tree.xpath('//div[@class="container"]//img/@alt')

    # The site lazy-loads images, so the real URL sits in data-original rather than src
    data_list = tree.xpath('//div[@class="container"]//img/@data-original')

    for i in range(len(name_list)):
        name = name_list[i]
        src = data_list[i]
        url = 'https:' + src
        urllib.request.urlretrieve(url=url, filename='./loveImg/' + name + '.jpg')


if __name__ == '__main__':
    start_page = int(input('Enter the start page: '))
    end_page = int(input('Enter the end page: '))

    for page in range(start_page, end_page + 1):
        # Customize the request object
        request = create_request(page)
        # Get the page source
        content = get_content(request)
        # Download the images
        down_load(content)


4. JSONPath

JSONPath is suited to parsing JSON data.

As a small demo case, open the Taopiaopiao (淘票票) official website and grab its city list.

Note: jsonpath can only parse local files, not the file the server responds with; this is different from xpath. The response therefore has to be saved to a local file first.

Install the jsonpath module first: pip install jsonpath
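Before the full case, a minimal sketch of how jsonpath.jsonpath works on an ordinary Python object (the data below is made up for illustration):

import jsonpath

store = {'book': [{'title': 'A', 'price': 10}, {'title': 'B', 'price': 20}]}

# $..title matches every title field at any depth; a list of matches is returned
print(jsonpath.jsonpath(store, '$..title'))         # ['A', 'B']

# $.book[0].price matches the price of the first book
print(jsonpath.jsonpath(store, '$.book[0].price'))  # [10]

With that in hand, the Taopiaopiao case: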

import urllib.request

url = 'https://dianying.taobao.com/cityAction.json?activityId&_ksTS=1669306447473_102&jsoncallback=jsonp103&action=cityAction&n_s=new&event_submit_doGetAllRegion=true'

headers = {
    # Fill in the request headers copied from the browser here.
    # HTTP/2 pseudo-headers must be commented out or removed:
    # ':authority': 'dianying.taobao.com',
    # ':method': 'GET',
    # ':path': '/cityAction.json?activityId&_ksTS=1669306447473_102&jsoncallback=jsonp103&action=cityAction&n_s=new&event_submit_doGetAllRegion=true',
    # ':scheme': 'https',
    # 'accept-encoding' must also be removed so the response is not compressed:
    # 'accept-encoding': 'gzip, deflate, br',
}

request = urllib.request.Request(url=url, headers=headers)

response = urllib.request.urlopen(request)

content = response.read().decode('utf-8')

# Strip the jsonp103( ... ) wrapper to leave pure JSON
content = content.split('(')[1].split(')')[0]

with open('解析_jsonpath解析淘票票.json', 'w', encoding='utf-8') as fp:
    fp.write(content)


import json
import jsonpath

# Load the saved JSON file back into a Python object
obj = json.load(open('解析_jsonpath解析淘票票.json', 'r', encoding='utf-8'))

# $..regionName matches every regionName field anywhere in the document
city_list = jsonpath.jsonpath(obj, '$..regionName')

print(city_list)

5. Basic use of bs4

BeautifulSoup, imported from the bs4 package and therefore called bs4 for short, is an HTML parser like lxml; its main function is likewise to parse documents and extract data.

Install the bs4 module first: pip install bs4
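A minimal sketch of common bs4 operations before the real case (the HTML string below is made up for illustration):

from bs4 import BeautifulSoup

html = '<ul class="menu"><li id="l1">coffee</li><li id="l2">tea</li></ul>'
soup = BeautifulSoup(html, 'lxml')

# find() returns the first match; select() takes a CSS selector and returns a list
print(soup.find('li').string)                           # coffee
print([li.string for li in soup.select('ul.menu li')])  # ['coffee', 'tea']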

bs4 crawls Starbucks data

Open the Starbucks official website and select the menu.
Requirement: crawl the text data in the menu.

In the inspector, find the interface that returns the menu.

Here the xpath approach used earlier is replaced with bs4 and its CSS selectors:

import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.starbucks.com.cn/menu/'

response = urllib.request.urlopen(url)

content = response.read().decode('utf-8')

# Parse the page with the lxml parser
soup = BeautifulSoup(content, 'lxml')

# CSS selector: every <strong> inside the product grid <ul>
name_list = soup.select('ul[class="grid padded-3 product"] strong')

# .string gets the text content of each tag
for name in name_list:
    print(name.string)

Partial running result: the script prints the menu item names, one per line.
