Douban book data collection

1. A web crawler collects the book data from Douban, connects to a MongoDB database, and writes the data into it. The code is as follows:

# Access the URL
# Use requests to fetch the page
import pandas as pd
import requests
import pymongo
import re

u = 'https://book.douban.com/tag/哲学'
r = requests.get(url=u)

# Parse the page
# Use BeautifulSoup to parse the HTML
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'lxml')

# Connect to the local MongoDB server; the database "douban_data"
# and the collection "test" are created on the first insert
myclient = pymongo.MongoClient("mongodb://localhost:27017")
db = myclient['douban_data']
datatable = db['test']

# Build the paginated tag URLs (20 books per page, 7 pages)
urlist = []
for i in range(7):
    urlist.append('https://book.douban.com/tag/哲学?start=' + str(20 * i) + '&type=T')

n = 0
for u in urlist:
    r = requests.get(url=u)
    soup = BeautifulSoup(r.text, 'lxml')
    lis = soup.find('ul', class_='subject-list').find_all('li')  # one <li> per book
    for li in lis:
        dic = {}  # empty dictionary to hold one book's data
        dic['title'] = li.h2.text.replace(' ', '').replace('\n', '')
        dic['pub_info'] = li.find('div', class_="pub").text.replace(' ', '').replace('\n', '')
        dic['rating'] = li.find('span', class_="rating_nums").text
        dic['rating_count'] = re.search(r'(\d*)人', li.find('span', class_="pl").text.replace(' ', '').replace('\n', '')).group(1)
        datatable.insert_one(dic)  # store each record as soon as it is scraped
        n += 1
        print("Successfully collected %i records" % n)
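Since pandas is already imported, the stored records can be read back out of MongoDB for a quick check. A minimal sketch, assuming the same local server and the "douban_data" database and "test" collection used above:

import pandas as pd
import pymongo

# Reconnect to the same database and collection used by the crawler above
client = pymongo.MongoClient("mongodb://localhost:27017")
datatable = client['douban_data']['test']

# Load every stored document into a DataFrame, dropping MongoDB's internal _id field
df = pd.DataFrame(list(datatable.find({}, {'_id': 0})))
print(df.head())
print("Total records stored: %i" % len(df))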

2. MongoDB installation and configuration: https://www.cnblogs.com/zhoulifeng/p/9429597.html#4242074 (a quick connection check from Python is sketched after this list)

3. Robo 3T installation: https://www.cnblogs.com/tugenhua0707/p/9250673.html
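After installing MongoDB, it is worth confirming that the local server is reachable before running the crawler. A minimal sketch, assuming MongoDB is listening on the default port 27017:

import pymongo
from pymongo.errors import ServerSelectionTimeoutError

# Try to reach the local MongoDB server with a short timeout
client = pymongo.MongoClient("mongodb://localhost:27017", serverSelectionTimeoutMS=2000)
try:
    client.admin.command('ping')  # raises if no server answers within the timeout
    print("MongoDB connection OK")
except ServerSelectionTimeoutError as e:
    print("MongoDB not reachable:", e)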
