Development environment: Eclipse + PyDev (Windows)
URL: https://book.douban.com/top250?start=0
from lxml import etree  # parse and extract data
import requests         # request the web page data
import csv              # store the data

fp = open('D:/Pyproject/douban.csv', 'wt', newline='', encoding='utf-8')  # create the csv file
writer = csv.writer(fp)
# Write the header information, i.e. the first row
writer.writerow(('name', 'url', 'author', 'publisher', 'date', 'price', 'rate', 'comment'))

# Construct the page URLs (10 pages, 25 books each)
urls = ['https://book.douban.com/top250?start={}'.format(str(i)) for i in range(0, 250, 25)]
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
}

for url in urls:  # loop over the pages: grab the big blocks first, then the small fields (!!! important, detailed below)
    html = requests.get(url, headers=headers)
    selector = etree.HTML(html.text)
    infos = selector.xpath('//tr[@class="item"]')  # the big blocks: one <tr> per book
    for info in infos:
        name = info.xpath('td/div/a/@title')[0]
        url = info.xpath('td/div/a/@href')[0]
        book_infos = info.xpath('td/p/text()')[0]
        author = book_infos.split('/')[0]
        publisher = book_infos.split('/')[-3]
        date = book_infos.split('/')[-2]
        price = book_infos.split('/')[-1]
        rate = info.xpath('td/div/span[2]/text()')[0]
        comments = info.xpath('td/p/span/text()')
        comment = comments[0] if len(comments) != 0 else ''
        writer.writerow((name, url, author, publisher, date, price, rate, comment))  # write one row of data

fp.close()  # close the csv file, do not forget
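The field-splitting step above can be tried offline. The sketch below runs the same `split('/')` logic on an invented sample string that mimics Douban's "author / publisher / date / price" format (the string itself is an assumption, not fetched from the site). Negative indices are used because the author part may itself contain slashes (co-authors, translators), so counting from the end is more robust:

```python
# Hypothetical sample in the "author / publisher / date / price" format.
book_infos = '[Colombia] Gabriel Garcia Marquez / Nan Hai Publishing Company / 2011-6 / 39.50 Yuan'

# Split on '/' and strip the surrounding spaces from each field.
fields = [s.strip() for s in book_infos.split('/')]
author = fields[0]      # everything before the first slash
publisher = fields[-3]  # third field from the end
date = fields[-2]
price = fields[-1]
print(author, '|', publisher, '|', date, '|', price)
```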
The results show:
# If the csv output appears garbled, open it with Notepad and save it as a UTF-8 file to resolve the issue.
This case mainly practices using the csv library and fetching data in batches (that is, grab the big blocks first, then the small fields, and look for the loop point).
Creating a csv file and writing data with the csv library:
import csv

fp = open('C:/Users/LP/Desktop/text.csv', 'w+', newline='')
writer = csv.writer(fp)
writer.writerow(('id', 'name'))  # writerow takes a single iterable per row
writer.writerow(('1', 'OCT'))
writer.writerow(('2', 'NOV'))    # write rows
fp.close()
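To confirm the rows were written correctly, the same writer/reader pair can be exercised without touching the disk. This sketch substitutes an in-memory `io.StringIO` buffer for the file above (an assumption made so the example is self-contained), then reads the rows back with `csv.reader`:

```python
import csv
import io

# In-memory buffer standing in for the csv file on disk.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(('id', 'name'))
writer.writerow(('1', 'OCT'))
writer.writerow(('2', 'NOV'))

# Rewind and read the rows back.
buf.seek(0)
rows = list(csv.reader(buf))
print(rows)
```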
Batch fetching data:
Copying a selector from the browser and deleting its predicate part, as one might with BeautifulSoup, is not feasible here; the approach should instead be "grab the big blocks first, then the small fields, and look for the loop point" (write the XPath by hand rather than copying it).
In Chrome, open "Inspect", use the triangle symbols to fold and unfold elements, and find the tag that holds the complete record for one item.
(selector.xpath() first grabs the largest data blocks; the path expressions that follow are relative, with the leading prefix of the path removed.)
Then extract each individual field (name, price, etc.).
For the name, for example, trace back from the target element: <a> -> <div> -> <td>.
Therefore, write the path starting from the invariant point and take the smallest possible scope:
name = info.xpath('td/div/a/@title')