Learning Example - Crawling the Douban Books TOP250

  Development environment: (Windows) Eclipse + PyDev

  URL: https://book.douban.com/top250?start=0

from lxml import etree          # parse and extract the data
import requests                 # request the web page
import csv                      # store the data

fp = open('D:\\Pyproject\\douban.csv', 'wt', newline='', encoding='utf-8')   # create a csv file
writer = csv.writer(fp)
writer.writerow(('name', 'url', 'author', 'publisher', 'date', 'price', 'rate', 'comment'))   # write the header row

urls = ['https://book.douban.com/top250?start={}'.format(str(i)) for i in range(0, 250, 25)]   # construct the URLs

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}

for url in urls:                      # loop over the URLs: grab the big first, then the small (!!! important, detailed below)
    html = requests.get(url, headers=headers)
    selector = etree.HTML(html.text)
    infos = selector.xpath('//tr[@class="item"]')    # grab the big: each <tr class="item"> is one book
    for info in infos:                               # the <tr> elements are the loop point
        name = info.xpath('td/div/a/@title')[0]
        book_url = info.xpath('td/div/a/@href')[0]   # renamed so it does not shadow the loop variable url
        book_infos = info.xpath('td/p/text()')[0]
        author = book_infos.split('/')[0]
        publisher = book_infos.split('/')[-3]
        date = book_infos.split('/')[-2]
        price = book_infos.split('/')[-1]
        rate = info.xpath('td/div/span[2]/text()')[0]
        comments = info.xpath('td/p/span/text()')
        comment = comments[0] if len(comments) != 0 else ''
        writer.writerow((name, book_url, author, publisher, date, price, rate, comment))   # write one row of data
fp.close()                            # close the csv file (don't forget)

The results:

# If the saved csv file appears garbled, open it with Notepad and re-save it as UTF-8 to fix it
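
A variation that avoids the manual re-save step (a small sketch, not in the original post): write the file with a UTF-8 byte-order mark so that Excel detects the encoding directly.

# 'utf-8-sig' prepends a BOM, which lets Excel recognize the file as UTF-8
fp = open('D:\\Pyproject\\douban.csv', 'wt', newline='', encoding='utf-8-sig')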

This case mainly practices using the csv library and fetching data in batches (that is, grab the big first, then the small, and look for the loop point).

How the csv library creates a csv file and writes data:

import csv
fp = open('C:/Users/LP/Desktop/text.csv', 'w+', newline='')
writer = csv.writer(fp)
writer.writerow(('id', 'name'))
writer.writerow(('1', 'OCT'))
writer.writerow(('2', 'NOV'))             # write rows
fp.close()
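
The file can be read back the same way in reverse; a minimal sketch (not in the original post), using csv.reader and a with block so the file closes automatically:

import csv

# the with block closes the file automatically, so no fp.close() is needed
with open('C:/Users/LP/Desktop/text.csv', newline='') as fp:
    reader = csv.reader(fp)
    for row in reader:          # each row comes back as a list of strings
        print(row)              # e.g. ['id', 'name'], then ['1', 'OCT'], ...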

Batch fetching data:

Copying a selector and deleting the predicate part, as one might with BeautifulSoup, is not feasible here; the idea should be "grab the big first, then the small, and look for the loop point" (write the XPath by hand rather than copying it), as sketched below.
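
A minimal self-contained sketch of this pattern (the toy markup below is invented; only its structure mirrors the Douban page):

from lxml import etree

# toy markup: each <tr class="item"> is one record, so the <tr> is the loop point
html = '''
<table>
  <tr class="item"><td><div><a href="/b/1" title="Book One"></a></div></td></tr>
  <tr class="item"><td><div><a href="/b/2" title="Book Two"></a></div></td></tr>
</table>'''

selector = etree.HTML(html)
infos = selector.xpath('//tr[@class="item"]')    # grab the big: the repeating blocks
for info in infos:                               # loop over the blocks
    title = info.xpath('td/div/a/@title')[0]     # grab the small: relative path inside each block
    href = info.xpath('td/div/a/@href')[0]
    print(title, href)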

In Chrome, open the developer tools with "Inspect", use the triangle symbols to fold and unfold elements, and find the tag that holds the complete information for one entry (here, <tr class="item">).

(selector.xpath() first grabs the largest data set; its path expression is absolute, prefixed with //.)

After that, each individual field (name, price, etc.) is extracted separately:

Take name as an example, tracing upward from the tag that holds it: <a> -> <div> -> <td>

Therefore, when tracing upward, stop at the invariant loop point and take the smallest range, which gives the relative path:

 name = info.xpath('td/div/a/@title')[0]    # xpath() returns a list; [0] takes the match
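
The author/publisher/date/price fields then come from splitting the single info line under each title. A small worked example (the sample string is made up, but follows the Douban "author / publisher / date / price" format):

# a sample of the info line under a title, in the Douban format
book_infos = '[清] 曹雪芹 著 / 人民文学出版社 / 1996-12 / 59.70元'

parts = book_infos.split('/')
author = parts[0]        # everything before the first slash
publisher = parts[-3]    # counted from the end, so extra middle fields (e.g. a translator) do not shift it
date = parts[-2]
price = parts[-1]
print(author, publisher, date, price)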

 


Source: www.cnblogs.com/junecode/p/11443471.html