1. Parsing the page with XPath and extracting data
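Some books have a subtitle, for example 《人类简史:从动物到上帝》 (literally "A Brief History of Humankind: From Animals to Gods"). The colon and the text after it sit at a different XPath from the main title, so they have to be extracted separately. Other books have no subtitle. For those, the subtitle XPath returns an empty list and indexing it with [0] raises an error, so the lookup is wrapped in try…except.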
name = p.xpath('./tr/td[2]/div[1]/a/text()')[0].strip()
name = name.replace('\n', '').replace(' ', '')
try:
    # The subtitle sits in a nested <span>; a book without one yields an empty list.
    add = p.xpath('./tr/td[2]/div[1]/a/span/text()')[0].replace(' ', '')
    name = name + add
except IndexError:
    pass
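As an alternative to try…except, the list returned by xpath() can be tested for emptiness before indexing. The sketch below runs the same two XPath expressions against a small hand-written HTML fragment. The fragment and the helper name full_title are assumptions made for illustration, not Douban's real markup or part of the original script.

```python
from lxml import etree

# Hypothetical, simplified markup shaped to match the XPath used in this post.
SAMPLE = """
<table>
  <tr><td></td><td><div><a href="#1">
      人类简史
      <span>:从动物到上帝</span>
  </a></div></td></tr>
</table>
<table>
  <tr><td></td><td><div><a href="#2">
      活着
  </a></div></td></tr>
</table>
"""


def full_title(table):
    """Return the main title plus the subtitle, if the entry has one."""
    main = table.xpath('./tr/td[2]/div[1]/a/text()')[0]
    main = main.replace('\n', '').replace(' ', '')
    # xpath() returns a list; it is simply empty when there is no <span>.
    sub = table.xpath('./tr/td[2]/div[1]/a/span/text()')
    if sub:
        return main + sub[0].replace(' ', '')
    return main


titles = [full_title(t) for t in etree.HTML(SAMPLE).xpath('//table')]
print(titles)
```

Checking the list keeps the "no subtitle" case on the normal code path, so an unexpected error elsewhere is not silently swallowed.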
2. Saving the data with xlwings
When saving to an xlsx file with xlwings, pass visible=False when creating the app if you do not want to see the Excel window open.
app = xlwings.App(visible=False, add_book=False)
If you open the xlsx file with wb = xlwings.Book(), there is no need to call quit() at the end; wb.close() is enough. If you created an app, you have to call app.quit() at the end.
app = xw.App(visible=False, add_book=False)
wb = app.books.add()
# ...
wb.save('file.xlsx')
wb.close()
app.quit()
The xlwings save() method overwrites the file if it already exists.
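With visible=False, an exception raised before app.quit() is reached can leave a hidden Excel process running. One way to guard against that is a small context manager that always calls quit(). This is a minimal sketch: the name managed_app and the factory argument make_app are hypothetical helpers, not part of xlwings.

```python
from contextlib import contextmanager


@contextmanager
def managed_app(make_app):
    """Yield the object returned by make_app() and always call its quit()."""
    app = make_app()
    try:
        yield app
    finally:
        # Runs on normal exit and when the body raises.
        app.quit()


# Intended usage with xlwings (assumes Excel is installed):
#
#   import xlwings as xw
#   with managed_app(lambda: xw.App(visible=False, add_book=False)) as app:
#       wb = app.books.add()
#       wb.save('file.xlsx')
#       wb.close()
```

Passing a factory rather than an app instance keeps the creation of the app inside the helper, so quit() is only attempted on an app that was actually started.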
3. Complete code
import requests
from lxml import etree
import xlwings as xw
import time

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
    'Host': 'book.douban.com'
}
books = []

# 10 pages of 25 books each.
for i in range(0, 10):
    link = 'https://book.douban.com/top250?start=' + str(i * 25)
    res = requests.get(link, headers=headers, timeout=20)
    print('Page', i + 1, 'response status code:', res.status_code)
    html = etree.HTML(res.text)
    # Each book is one <table> element.
    paths = html.xpath('//*[@id="content"]/div/div[1]/div/table')
    for p in paths:
        name = p.xpath('./tr/td[2]/div[1]/a/text()')[0].strip()
        name = name.replace('\n', '').replace(' ', '')
        try:
            # Subtitle, if any, sits in a nested <span>.
            add = p.xpath('./tr/td[2]/div[1]/a/span/text()')[0]
            name = name + add
        except IndexError:
            pass
        info = p.xpath('./tr/td[2]/p[1]/text()')[0]
        rating = p.xpath('./tr/td[2]/div[2]/span[2]/text()')[0]
        rating_people = p.xpath('./tr/td[2]/div[2]/span[3]/text()')[0].strip()
        # Strip the parentheses, whitespace and the "人评价" ("people rated") suffix.
        rating_people = rating_people.replace('(', '').replace(')', '').replace('\n', '').replace(' ', '').replace('人评价', '')
        # Guard the quote lookup like the subtitle, in case an entry has none.
        quote = p.xpath('./tr/td[2]/p[2]/span/text()')
        quote = quote[0] if quote else ''
        book_link = p.xpath('./tr/td[2]/div[1]/a/@href')[0]
        books.append([name, info, rating, rating_people, quote, book_link])
    # Pause between pages.
    time.sleep(2)

print('Writing into file...')
app = xw.App(visible=False, add_book=False)
wb = app.books.add()
sht0 = wb.sheets[0]
sht0.range('A1').value = ['Title', 'Author / Publisher / Publication date / Price', 'Rating', 'Number of ratings', 'Quote', 'URL']
# Assigning a list of lists writes one row per book, starting at A2.
sht0.range('A2').value = books
wb.save('books_by_xpath.xlsx')
wb.close()
app.quit()
print('Finish!')
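The chain of replace() calls that cleans the rating count could also be written as a single regular-expression search for the digits. The sketch below is one possible variant. The function name parse_rating_count and the sample string are assumptions for illustration; the real text on the page may differ in its whitespace.

```python
import re


def parse_rating_count(raw):
    """Return the first run of digits in raw as an int, or None if there is none."""
    match = re.search(r'\d+', raw)
    return int(match.group()) if match else None


# Hypothetical sample shaped like "(<whitespace>N人评价<whitespace>)".
sample = '(\n                    123456人评价\n                )'
print(parse_rating_count(sample))
```

Returning an int rather than a string also lets Excel treat the column as numeric, at the cost of having to handle the None case when no digits are found.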