1. XPath
How to use xpath:
First install the xpath plug-in as a browser extension.
After installation, restart the browser, then press Ctrl+Shift+X:
if the query box appears, the plug-in was installed successfully.
- Install the lxml library
pip install lxml
Mirror installation: pip install -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com lxml
- import lxml.etree
from lxml import etree
- etree.parse() — parse a local file
html_tree = etree.parse('xx.html')
First create an html file and write a little test data in it.
Note that etree.parse() uses a strict XML parser by default, so every tag in the file must be closed;
a tag without a separate end tag must be self-closed with a slash (e.g. <meta charset="utf-8"/>).
Parse the local file with xpath, print the tree, and check the output (screenshot omitted).
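If the local html is not strictly well-formed XML, etree.parse() raises an error; a minimal sketch (the file name xx.html is just a placeholder) shows how passing an HTMLParser makes it tolerate ordinary HTML:

```python
from lxml import etree

# Build a sample file so the snippet is self-contained; in practice
# you would already have your own xx.html on disk.
with open('xx.html', 'w', encoding='utf-8') as fp:
    fp.write('<html><body><ul><li id="l1">apple</li></ul></body></html>')

# etree.parse() uses a strict XML parser by default; passing an
# HTMLParser lets it accept ordinary, non-XHTML markup.
tree = etree.parse('xx.html', etree.HTMLParser())
print(tree.xpath('//li/text()'))  # ['apple']
```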
Basic syntax of xpath
(1) path query
//: Find all descendant nodes, regardless of hierarchical relationship
/: Find direct child nodes
(2) Predicate query
//div[@id]
//div[@id="maincontent"]
(3) Attribute query
@class
(4) Fuzzy query
div[contains(@id,"he")]
div[starts-with(@id,"he")]
(5) Content query
//div/h1/text()
The full example code:
from lxml import etree

tree = etree.parse('解析_xpath的基本使用.html')
# tree.xpath('xpath expression')

# find the li tags under ul
# li_list = tree.xpath('//body/ul/li')

# find all li tags that have an id attribute;
# text() extracts the text content of the tag
# li_list = tree.xpath('//ul/li[@id]/text()')

# find the li tag whose id is "l1"
# li_list = tree.xpath('//ul/li[@id="l1"]/text()')

# get the class attribute value of the li tag whose id is "l1"
# li = tree.xpath('//ul/li[@id="l1"]/@class')

# find li tags whose id contains "l" (fuzzy query)
# li_list = tree.xpath('//ul/li[contains(@id,"l")]/text()')

# find li tags whose id starts with "c"
# li_list = tree.xpath('//ul/li[starts-with(@id,"c")]/text()')
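Since the block above runs against a local file the reader does not have, here is a self-contained sketch of the same queries on an inline HTML string (the l1/l2/c3 ids and city names are made up to mirror the ids the file is assumed to contain):

```python
from lxml import etree

html = '''
<html><body>
  <ul>
    <li id="l1" class="c1">Beijing</li>
    <li id="l2">Shanghai</li>
    <li id="c3">Shenzhen</li>
    <li>Wuhan</li>
  </ul>
</body></html>'''

tree = etree.HTML(html)

# all li tags under ul that have an id attribute
print(tree.xpath('//ul/li[@id]/text()'))                   # ['Beijing', 'Shanghai', 'Shenzhen']
# the li whose id equals "l1"
print(tree.xpath('//ul/li[@id="l1"]/text()'))              # ['Beijing']
# the class attribute value of that li
print(tree.xpath('//ul/li[@id="l1"]/@class'))              # ['c1']
# fuzzy query: ids containing "l"
print(tree.xpath('//ul/li[contains(@id,"l")]/text()'))     # ['Beijing', 'Shanghai']
# ids starting with "c"
print(tree.xpath('//ul/li[starts-with(@id,"c")]/text()'))  # ['Shenzhen']
```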
- Server Response File — etree.HTML()
html_tree = etree.HTML(response.read().decode('utf-8'))
2. Get the "百度一下" (Baidu Search) button text from the Baidu homepage
- Get the source code of the web page
- Analyze it and work out the xpath:
import urllib.request
from lxml import etree

url = 'https://www.baidu.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
}
# customize the request object
request = urllib.request.Request(url=url, headers=headers)
# simulate a browser visiting the server
response = urllib.request.urlopen(request)
# get the page source
content = response.read().decode('utf-8')
# parse the server's response with etree.HTML
# to extract the data we want
tree = etree.HTML(content)
# get the desired data: the value attribute of the search button
result = tree.xpath('//input[@id="su"]/@value')[0]
print(result)
3. Webmaster material (chinaz.com) images
Requirement: download the pictures on the first ten pages.
First, in the browser's developer tools, find the request and response for the first page and confirm it really is the first page; if so, copy the url: https://sc.chinaz.com/tupian/qinglvtupian.html
Find the url of the second page the same way: https://sc.chinaz.com/tupian/qinglvtupian_2.html
Notice that the url of the first page follows a different pattern from the other pages.
The complete code (the downloaded pictures are saved in the loveImg folder, which must exist beforehand):
import urllib.request
from lxml import etree

def create_request(page):
    # the first page's url has no page-number suffix
    if page == 1:
        url = 'https://sc.chinaz.com/tupian/qinglvtupian.html'
    else:
        url = 'https://sc.chinaz.com/tupian/qinglvtupian_' + str(page) + '.html'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
    }
    request = urllib.request.Request(url=url, headers=headers)
    return request

def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    return content

def down_load(content):
    tree = etree.HTML(content)
    name_list = tree.xpath('//div[@class="container"]//img/@alt')
    # the images are lazy-loaded, so the real address lives in
    # data-original rather than src
    data_list = tree.xpath('//div[@class="container"]//img/@data-original')
    for i in range(len(name_list)):
        name = name_list[i]
        src = data_list[i]
        url = 'https:' + src
        urllib.request.urlretrieve(url=url, filename='./loveImg/' + name + '.jpg')

if __name__ == '__main__':
    start_page = int(input('Enter the start page: '))
    end_page = int(input('Enter the end page: '))
    for page in range(start_page, end_page + 1):
        # customize the request object
        request = create_request(page)
        # get the page source
        content = get_content(request)
        # download the pictures
        down_load(content)
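The script above assumes the loveImg folder already exists; otherwise urlretrieve raises FileNotFoundError. A small stdlib-only guard can be run once before the download loop:

```python
import os

# Create the target folder if it is missing; exist_ok avoids an
# error when the folder already exists from a previous run.
os.makedirs('./loveImg', exist_ok=True)
```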
4. Jsonpath
Jsonpath is suitable for parsing json data.
As a small demonstration, open the Tao Piao Piao (淘票票) official website.
Note that, unlike xpath, jsonpath can only parse local files, not the file the server responds with, so the response has to be saved to disk first.
Install the jsonpath module before starting.
import urllib.request

url = 'https://dianying.taobao.com/cityAction.json?activityId&_ksTS=1669306447473_102&jsoncallback=jsonp103&action=cityAction&n_s=new&event_submit_doGetAllRegion=true'
headers = {
    # fill in the request headers copied from the browser;
    # the following lines must stay commented out: the ':'-prefixed
    # pseudo-headers are not real headers, and 'accept-encoding'
    # would make the response come back compressed
    # ':authority': 'dianying.taobao.com',
    # ':method': 'GET',
    # ':path': '/cityAction.json?activityId&_ksTS=1669306447473_102&jsoncallback=jsonp103&action=cityAction&n_s=new&event_submit_doGetAllRegion=true',
    # ':scheme': 'https',
    # 'accept-encoding': 'gzip, deflate, br',
}
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
# strip the jsonp wrapper "jsonp103(...)" to leave pure json
content = content.split('(')[1].split(')')[0]
with open('解析_jsonpath解析淘票票.json', 'w', encoding='utf-8') as fp:
    fp.write(content)

import json
import jsonpath

obj = json.load(open('解析_jsonpath解析淘票票.json', 'r', encoding='utf-8'))
# $..regionName recursively collects every regionName value
city_list = jsonpath.jsonpath(obj, '$..regionName')
print(city_list)
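For readers without the jsonpath module installed, what the recursive-descent expression $..regionName does can be sketched with the standard library alone (the find_all helper and the sample data are hypothetical, not part of the jsonpath module):

```python
import json

def find_all(obj, key):
    """Recursively collect every value stored under `key`,
    mimicking the jsonpath expression '$..key'."""
    hits = []
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                hits.append(v)
            hits.extend(find_all(v, key))
    elif isinstance(obj, list):
        for item in obj:
            hits.extend(find_all(item, key))
    return hits

# a tiny stand-in for the saved Tao Piao Piao response
sample = json.loads('{"returnValue":{"A":[{"regionName":"Anshan"},{"regionName":"Anqing"}]}}')
print(find_all(sample, 'regionName'))  # ['Anshan', 'Anqing']
```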
5. Basic use of bs4
BeautifulSoup, bs4 for short, is an html parser like lxml; its main job is to parse pages and extract data.
Install the bs4 module first.
bs4 crawls Starbucks data.
Open the Starbucks official website and select the menu.
Requirement: crawl the text of the items in the menu.
Find the interface that returns the menu page; here the xpath approach above is redone with bs4.
import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.starbucks.com.cn/menu/'
response = urllib.request.urlopen(url)
content = response.read().decode('utf-8')

soup = BeautifulSoup(content, 'lxml')
# CSS selector: every <strong> inside the product grid <ul>
name_list = soup.select('ul[class="grid padded-3 product"] strong')
for name in name_list:
    print(name.string)
Running the script prints the menu item names (screenshot of the output omitted).
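The select() call uses CSS selector syntax; a self-contained sketch on an inline snippet (the class name mirrors the Starbucks page structure at the time of writing, and the menu items are made up) shows the same pattern with the stdlib parser:

```python
from bs4 import BeautifulSoup

html = '''
<ul class="grid padded-3 product">
  <li><strong>Espresso</strong></li>
  <li><strong>Latte</strong></li>
</ul>'''

# 'html.parser' is the built-in parser; 'lxml' also works if installed
soup = BeautifulSoup(html, 'html.parser')
for name in soup.select('ul[class="grid padded-3 product"] strong'):
    print(name.string)
```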