python reptile tutorial: python3 xpath requests and applications Detailed

This article describes the python3 xpath Comments and requests applications, a good reference value, we want to help. Come and see, to follow the small series together
according to a crawl IMDb ranked small application, simple to use and request etree library.

etree use xpath syntax.

import requests
import ssl
from lxml import etree
 
 
ssl._create_default_https_context = ssl._create_unverified_context
 
session = requests.Session()
for id in range(0, 251, 25):
 URL = 'https://movie.douban.com/top250/?start=' + str(id)
 req = session.get(URL)
 # 设置网页编码格式
 req.encoding = 'utf8'
 # 将request.content 转化为 Element
 root = etree.HTML(req.content)
 # 选取 ol/li/div[@class="item"] 不管它们在文档中的位置
 items = root.xpath('//ol/li/div[@class="item"]')
 for item in items:
  # 注意可能只有中文名,没有英文名;可能没有quote简评
  rank, name, alias, rating_num, quote, url = "", "", "", "", "", ""
  try:
   url = item.xpath('./div[@class="pic"]/a/@href')[0]
   rank = item.xpath('./div[@class="pic"]/em/text()')[0]
   title = item.xpath('./div[@class="info"]//a/span[@class="title"]/text()')
   name = title[0].encode('gb2312', 'ignore').decode('gb2312')
   alias = title[1].encode('gb2312', 'ignore').decode('gb2312') if len(title) == 2 else ""
   rating_num = item.xpath('.//div[@class="bd"]//span[@class="rating_num"]/text()')[0]
   quote_tag = item.xpath('.//div[@class="bd"]//span[@class="inq"]')
   if len(quote_tag) is not 0:
    quote = quote_tag[0].text.encode('gb2312', 'ignore').decode('gb2312').replace('\xa0', '')
   # 输出 排名,评分,简介
   print(rank, rating_num, quote)
   # 输出 中文名,英文名
   print(name.encode('gb2312', 'ignore').decode('gb2312'),
     alias.encode('gb2312', 'ignore').decode('gb2312').replace('/', ','))
  except:
   print('faild!')
   pass

Operating results
Here Insert Picture Description
additional knowledge: requests to crawl and parse Xpath

Code:

# requests抓取
import requests
  
# 新浪新闻的一篇新闻的url
url = 'http://news.sina.com.cn/s/2018-05-09/doc-ihaichqz1009657.shtml'
  
res = requests.get(url)
# 查看编码方式
enconding = requests.utils.get_encodings_from_content(res.text)
#print(enconding)
  
  
# 打印网页内容
html_doc = res.content.decode("utf-8")
print(html_doc[:500])
  
# 保存网页内容
with open('test.html', 'w') as f:
 f.write(html_doc)

operation result:

<!DOCTYPE html>
<!-- [ published at 2018-05-09 18:23:13 ] -->
<!-- LLTJ_MT:name ="澎湃新闻" -->
  
<html>
<head>
<meta charset="utf-8"/>
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<meta name="sudameta" content="urlpath:s/; allCIDs:51924,257,51895,200856,56264,258,38790">
<title>小学老师罚学生赤脚跑操场 官方:将按规定处理|赤脚|学生|华龙网_新浪新闻</title>
<meta name="keywords" content="赤脚,学生,华龙网" />
<meta name="tags" content="赤脚,学生,华龙网" />
<meta name="description" content="原标题:潼南一小学体育老师罚学生赤脚跑操场续:区教委向华龙网发来情况

Code:

# xpath解析
from lxml import etree
  
# 建立html的树
tree = etree.HTML(html_doc)
  
# 设置目标路径(标题)
path_title = '/html/body//h1[@class="main-title"]//text()'
  
# 提取节点
node_title = tree.xpath(path_title)
print("===" * 20)
print(node_title[0])
  
# 设置内容路径
path_content = '//div[@class="article-content-left"]//div[@id="article"]//text()'
  
# 提取节点
node_content = tree.xpath(path_content)
print("===" * 20)
print("。".join(node_content))

operation result

:============================================================
小学老师罚学生赤脚跑操场 官方:将按规定处理
============================================================
  
 。  原标题:潼南一小学体育老师罚学生赤脚跑操场续:区教委向华龙网发来情况说明。
。  重庆客户端-华龙网5月9日消息,这两天,重庆潼南区朝阳小学二年级6班不少家长心疼不已,因为多个娃儿脚底被磨出了泡。一问才知道,是因为有些学生体育课上没穿运动鞋,被体育老师要求赤脚在操场上跑步。收到重庆网络问政平台这一投诉后,华龙网记者立即进行了调查。今(9)日,华龙网发布了。《重庆潼南一小学体育老师罚学生赤脚跑操场脚底磨出泡当地教委介入》。报道后,潼南教委高度重视并给华龙网传来官方的情况说明。。
。 。 [说明全文]。
。  关于家长在华龙网投诉教师上体育课体罚学生的情况说明。
。  潼南区朝阳小学体育教师邹老师于2018年5月7日上午上体育课时,发现该班有少部分名学生未按体育课的要求穿运动鞋。该教师认为,穿着凉鞋跑步对学生本人及他人存在安全隐患,塑胶跑道不会对学生光脚运动造成影响,于是就叫未穿运动鞋的学生,脱掉凉鞋进行随班热身跑步。当时邹老师未发现学生有异常情况,也未接到学生有异常情况的反映。后经家长反映到学校,有极少数光着脚跑步的学生有异常情况,学校庚即与部分家长进行了沟通,并及时调查了解了此事,并对该教师这种不恰当教学方法进行了批评教育,我们将按相关规定对该教师作出相应的处理。。
。  重庆市潼南区教育委员会。
。  2018年5月9日。
。  来源:华龙网。
  
。责任编辑:张义凌 。

I write to you, for everyone to recommend a very wide python learning resource gathering, click to enter , there is a senior programmer before learning to share experiences, study notes, there is a chance of business experience, and for everyone to carefully organize a python zero the basis of the actual project data, daily python to you on the latest technology, prospects, learning small details that need to comment on
the above application requests this python3 xpath and explain that small series to share the entire contents of all of the

Published 46 original articles · won praise 30 · views 70000 +

Guess you like

Origin blog.csdn.net/haoxun09/article/details/104827998