Python crawler - scraping Douban Movie with XPath

Steps
  1. Download the page from the target site
  2. Extract data from the downloaded page according to certain rules
 
Detailed process
  1. Download the page from the target site
1. Import the library
import requests
2. Request headers (sometimes these can be omitted)
headers = {
    # Identifies the client; without this, requests sends its default python-requests User-Agent
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36',
    'Referer': 'https://movie.douban.com/'
}
3. URL
url = 'https://movie.douban.com/cinema/nowplaying/zhengzhou/'

4. Get the response
response = requests.get(url, headers=headers)  # the response object
#print(response.text)
text = response.text
response.text: the response body decoded into a string; its type is str (unicode)
response.content: the raw body exactly as downloaded from the site, without decoding; its type is bytes
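
The relationship between the two attributes can be sketched without touching the network. In the snippet below, raw and decoded are stand-ins (my own names, not part of requests) for what response.content and response.text would hold, assuming the page is UTF-8:

```python
# Stand-in for response.content: the raw bytes as they arrive from the site.
raw = "<title>豆瓣电影</title>".encode("utf-8")

# Stand-in for response.text: the same bytes decoded into a str.
decoded = raw.decode("utf-8")

print(type(raw).__name__)      # bytes
print(type(decoded).__name__)  # str

# Decoding with the wrong encoding does not fail here, it silently garbles the text.
garbled = raw.decode("latin-1")
print(garbled == decoded)      # False
```

If response.text ever comes back garbled, one option is to decode the bytes yourself with response.content.decode('utf-8'), or to set response.encoding before reading response.text.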
 
2. Extract data from the downloaded page according to certain rules
1. Parse the downloaded data with lxml
from lxml import etree
html = etree.HTML(text)
2. Get the 'title', 'score' and 'poster' of each li under the ul
Take a look at the page structure
ul (class='lists')
  li ······
    ul
      li
        a ······
ul = html.xpath("//ul[@class='lists']")[0]
#print(etree.tostring(ul,encoding='utf-8').decode('utf-8'))
lis = ul.xpath('./li')
for li in lis:
    #print(etree.tostring(li,encoding='utf-8').decode('utf-8'))
    title = li.xpath('@data-title')[0]
    #print(title)
    score = li.xpath('@data-score')[0]
    # print(score)
    poster = li.xpath('.//img/@src')[0]
    # print(poster)
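[0] takes only the first item
// selects matching elements anywhere in the page
./ selects among the direct children of the current tag
.// selects among all descendants of the current tag
xpath() returns its results as a list, e.g. [''], so [0] gives you just the content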
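
Here is a small sketch of these rules that runs offline. The markup in sample is a simplified imitation I made up for illustration, not the real Douban page, and the example.com URLs are placeholders:

```python
from lxml import etree

# Hypothetical, simplified markup imitating the ul > li > ul > li > a > img nesting.
sample = """
<ul class="lists">
  <li data-title="Movie A" data-score="8.1">
    <ul><li><a href="#"><img src="https://example.com/a.jpg"/></a></li></ul>
  </li>
  <li data-title="Movie B" data-score="7.4">
    <ul><li><a href="#"><img src="https://example.com/b.jpg"/></a></li></ul>
  </li>
</ul>
"""
html = etree.HTML(sample)

# xpath() returns a list, so [0] picks the single matching ul.
ul = html.xpath("//ul[@class='lists']")[0]

all_li = html.xpath("//li")    # every li anywhere in the document
direct_li = ul.xpath("./li")   # only the direct children of ul
nested_li = ul.xpath(".//li")  # every li below ul, at any depth

titles = [li.xpath("@data-title")[0] for li in direct_li]
posters = [li.xpath(".//img/@src")[0] for li in direct_li]

print(len(all_li), len(direct_li), len(nested_li))  # 4 2 4
print(titles)
print(posters)
```

Note that ./li finds only the two movie entries, while .//li also picks up the inner li tags, which is why the crawler uses ./li for the loop and .//img only when digging for the poster.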
 
3. Save the information
1. Download
from urllib import request
request.urlretrieve(poster, 'D:/A/' + score + title + '.jpg')

The poster is downloaded into folder A on drive D, with the file name score + title + .jpg
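
One thing to watch out for: urlretrieve fails if the target folder does not exist, and a title containing characters such as : or ? is not a valid Windows file name. The helper below is a sketch of one way to handle both; build_poster_path is a hypothetical name of my own, and the demo writes to a temporary folder instead of D:/A/ so it runs anywhere:

```python
import re
import tempfile
from pathlib import Path


def build_poster_path(folder, score, title):
    """Hypothetical helper: return <folder>/<score><title>.jpg, creating the folder if needed."""
    # Replace characters that Windows does not allow in file names.
    safe_title = re.sub(r'[\\/:*?"<>|]', "_", title)
    target_dir = Path(folder)
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir / "{}{}.jpg".format(score, safe_title)


# Demonstrated in a temporary folder; the score and title are made-up values.
with tempfile.TemporaryDirectory() as tmp:
    path = build_poster_path(Path(tmp) / "A", "8.1", "Movie: Part 2")
    name = path.name
    folder_created = path.parent.is_dir()

print(name)  # 8.1Movie_ Part 2.jpg
```

In the crawler this would be used as request.urlretrieve(poster, str(build_poster_path('D:/A/', score, title))).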

2. Show a progress bar

fns_num = 1
num = len(lis)
for li in lis:
    ···
    print("\rProgress: {:.2f}%".format(fns_num * 100 / num), end="")
    fns_num += 1
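
The same idea as a self-contained sketch you can run on its own. The items list is a made-up stand-in for lis, and enumerate replaces the manual counter; the leading \r moves the cursor back to the start of the line so each update overwrites the previous one:

```python
# Hypothetical stand-in for the list of li elements.
items = ["a", "b", "c", "d"]
total = len(items)

messages = []
for done, item in enumerate(items, start=1):
    # ... the work for one item (e.g. downloading a poster) would go here ...
    message = "\rProgress: {:.2f}%".format(done * 100 / total)
    messages.append(message)
    # end="" keeps the cursor on the same line; flush makes the update show immediately.
    print(message, end="", flush=True)

print()  # move to a new line once the loop has finished
```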

 

Complete code
#coding=UTF-8

import requests
from lxml import etree
from urllib import request

headers = {
	'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36',
	'Referer': 'https://movie.douban.com/'
}
url = 'https://movie.douban.com/cinema/nowplaying/zhengzhou/'
response = requests.get(url,headers=headers)
# print(response.text)
text = response.text 

html = etree.HTML(text)
ul = html.xpath("//ul[@class='lists']")[0]
# print(etree.tostring(ul,encoding='utf-8').decode('utf-8'))
lis = ul.xpath("./li")
# movies = []
fns_num = 1
num = len(lis)
for li in lis:
    # print(etree.tostring(li,encoding='utf-8').decode('utf-8'))
    title = li.xpath('@data-title')[0]
    # print(title)
    score = li.xpath('@data-score')[0]
    # print(score)
    poster = li.xpath('.//img/@src')[0]
    # print(poster)
    
    request.urlretrieve(poster, 'D:/A/' + score + title + '.jpg')
    print("\rProgress: {:.2f}%".format(fns_num * 100 / num), end="")
    fns_num += 1
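
Since xpath() always returns a list, indexing with [0] raises IndexError whenever an li is missing the attribute or the img tag. A possible guard, sketched here with a hypothetical helper named first_or_default (my own name, not part of lxml):

```python
def first_or_default(results, default=""):
    """Hypothetical helper: return the first XPath result, or a default when the list is empty."""
    return results[0] if results else default


# Plain lists stand in for what li.xpath(...) would return.
print(first_or_default(["8.1"]))    # 8.1
print(first_or_default([], "N/A"))  # N/A
```

Inside the loop this could look like score = first_or_default(li.xpath('@data-score'), 'N/A').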
    

 


Origin www.cnblogs.com/m718/p/11831697.html