Crawling the Douban Top 250 book list

I've recently been taking a course on web crawlers, so I wanted to work through a small crawler example.

Because I like browsing rankings, I chose Douban, which publishes Top 250 charts; the script below crawls its Top 250 book list.


Development environment: Python 3.6

Editor: Sublime Text

The main dependencies:

requests - fetches the web page data
BeautifulSoup (bs4) - parses the page data
time - sets the crawler's wait time between requests
xlwt - writes the results to an Excel file
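If these libraries are not already installed, they can be pulled in with pip (beautifulsoup4 is the PyPI package that provides the bs4 module):

pip install requests beautifulsoup4 xlwt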

import requests                 # fetch web page data
from bs4 import BeautifulSoup   # parse the web page data
import time                     # set the crawler's wait time
import xlwt                     # write the results to Excel


# fetch a Douban page and parse the data
def get_douban_books(url, num):
    res = requests.get(url)   # static page, so a plain GET request is enough
    soup = BeautifulSoup(res.text, 'html.parser')

    # m, n, j track the Excel rows for titles, authors and images
    m = n = j = num

    items_title = soup.find_all("div", class_="pl2")
    for i in items_title:
        tag = i.find("a")
        # strip spaces and newlines
        name = ''.join(tag.text.split())
        link = tag["href"]
        title_markdown = "[{}]({})".format(name, link)
        sheet.write(m, 0, title_markdown)
        m += 1

    items_author = soup.find_all("p", class_="pl")
    for i in items_author:
        author_markdown = i.text
        sheet.write(n, 1, author_markdown)
        n += 1

    items_image = soup.find_all("a", class_="nbg")
    for i in items_image:
        tag = i.find("img")
        link = tag["src"]
        image_markdown = "![]({})".format(link)
        sheet.write(j, 2, image_markdown)
        j += 1
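One practical caveat: Douban may answer requests that carry no browser User-Agent with an error page, so it can help to send one. A minimal sketch (the header string is illustrative, not part of the original script):

# inside get_douban_books, the fetch could become:
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
res = requests.get(url, headers=headers)
res.raise_for_status()   # stop early on a 403/404 instead of parsing an error page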
# define where to save the results in Excel
workbook = xlwt.Workbook()                   # create the workbook
sheet = workbook.add_sheet('famous books')   # add a sheet
head = ['title', 'author', 'picture']        # table header
for h in range(len(head)):
    sheet.write(0, h, head[h])   # write the header into row 0 (xlwt rows are zero-based)

# Douban's chart has 10 pages of data in total
# build the URLs in advance
url = 'https://book.douban.com/top250?start={}'
urls = [url.format(num * 25) for num in range(10)]
page_num = [25 * num + 1 for num in range(10)]
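For clarity, the first entries these comprehensions produce look like this (row 0 is reserved for the header, so the first page writes from row 1):

assert urls[0] == 'https://book.douban.com/top250?start=0'
assert urls[1] == 'https://book.douban.com/top250?start=25'
assert page_num[:2] == [1, 26]   # each page of 25 books starts 25 rows later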
for i in range(10):
    get_douban_books(urls[i], page_num[i])
    # pause for 1 second so fast access does not get us blocked
    time.sleep(1)

# save the Excel file
workbook.save('famous books.xls')
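To spot-check the result, the .xls file can be read back with xlrd (a minimal sketch; assumes xlrd is installed):

import xlrd   # companion library for reading legacy .xls files

book = xlrd.open_workbook('famous books.xls')
sheet = book.sheet_by_index(0)
print(sheet.nrows)             # expect 251: one header row plus 250 books
print(sheet.cell_value(1, 0))  # the first title, stored as a markdown link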

Running the script above crawls the title, author, and cover image of all 250 books on the chart and saves them to an Excel document.

Through this exercise I picked up some basic knowledge of web crawlers and how to use several fundamental Python libraries. Next, I plan to use more advanced networking and data-handling libraries to crawl other targets, as a way to gain a deeper understanding of Python web crawlers.


Original post: www.cnblogs.com/wt714/p/11875930.html