I've recently been taking a course on web crawlers, so I wanted to build a small crawler project of my own.
Since Douban publishes a Top 250 ranking, I chose its Top 250 list as the target for this project.
Development environment: Python 3.6
Editor: Sublime Text
Main dependencies:
requests - fetch the web page data
BeautifulSoup - parse the page HTML
time - throttle the crawler between requests
xlwt - write the results to an Excel file
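To show how requests and BeautifulSoup fit together before the full script, here is a minimal sketch that parses a hand-written HTML snippet modeled on the Douban page's markup (the snippet itself and the example book entry are assumptions for illustration; a real run would fetch the page with requests.get first):

```python
from bs4 import BeautifulSoup

# A hand-written snippet imitating one entry of the Douban Books Top 250 page.
html = '''
<div class="pl2"><a href="https://book.douban.com/subject/1770782/">追风筝的人</a></div>
<p class="pl">[美] 卡勒德·胡赛尼 / 李继宏 / 上海人民出版社</p>
<a class="nbg" href="#"><img src="https://img.example/cover.jpg"></a>
'''

soup = BeautifulSoup(html, "html.parser")

# Title: the <a> inside <div class="pl2">; strip whitespace and newlines.
title = ''.join(soup.find("div", class_="pl2").find("a").text.split())
# Author line: the text of <p class="pl">.
author = soup.find("p", class_="pl").text
# Cover image: the <img> src inside <a class="nbg">.
cover = soup.find("a", class_="nbg").find("img")["src"]

print(title, cover)
```

The same three find/find_all patterns drive the full script below; only the HTML source changes from a literal string to the fetched page.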
import requests                # fetch web page data
from bs4 import BeautifulSoup  # parse the page HTML
import time                    # throttle the crawler
import xlwt                    # write Excel files

# Fetch one Douban page and parse the data out of it
def get_douban_books(url, num):
    res = requests.get(url)  # a GET request is enough for this static page
    soup = BeautifulSoup(res.text, 'html.parser')
    m = n = j = num  # first spreadsheet row for this page
    items_title = soup.find_all("div", class_="pl2")
    for i in items_title:
        tag = i.find("a")
        name = ''.join(tag.text.split())  # strip spaces and newlines
        link = tag["href"]
        title_markdown = "[{}]({})".format(name, link)
        sheet.write(m, 0, title_markdown)
        m += 1
    items_author = soup.find_all("p", class_="pl")
    for i in items_author:
        author_markdown = i.text
        sheet.write(n, 1, author_markdown)
        n += 1
    items_image = soup.find_all("a", class_="nbg")
    for i in items_image:
        tag = i.find("img")
        link = tag["src"]
        image_markdown = "![]({})".format(link)
        sheet.write(j, 2, image_markdown)
        j += 1

# Set up the Excel file the results are saved to
workbook = xlwt.Workbook()                  # create the workbook
sheet = workbook.add_sheet('famous books')  # add a sheet
head = ['title', 'author', 'picture']       # header row
for h in range(len(head)):
    sheet.write(0, h, head[h])              # write the header into Excel

# Douban splits the Top 250 across 10 pages; build the URLs in advance
url = 'https://book.douban.com/top250?start={}'
urls = [url.format(num * 25) for num in range(10)]
page_num = [num * 25 + 1 for num in range(10)]
for i in range(10):
    get_douban_books(urls[i], page_num[i])
    time.sleep(1)  # pause 1 second so rapid requests don't get us blocked

# save the Excel file
workbook.save('famous_books.xls')
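The pagination and cell formatting can be restated as a standalone sketch: each page lists 25 books, the URL's start parameter advances by 25, and the first spreadsheet row for page num is num * 25 + 1 because row 0 holds the header. The two helper functions here are illustrative refactorings of the inline formatting in the script, not part of the original code:

```python
# Build the 10 page URLs and the first spreadsheet row for each page.
url = 'https://book.douban.com/top250?start={}'
urls = [url.format(num * 25) for num in range(10)]
page_num = [num * 25 + 1 for num in range(10)]  # row 0 is the header

def title_cell(name, link):
    # collapse internal whitespace/newlines, then format as a Markdown link
    return "[{}]({})".format(''.join(name.split()), link)

def image_cell(src):
    # store the cover image as Markdown image syntax
    return "![]({})".format(src)

print(urls[0], page_num[0])
```

Storing the cells as Markdown keeps the spreadsheet easy to paste into a Markdown document later, which is presumably why the script formats them this way.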
After running the script, the crawler has collected the title, author, and cover image of all 250 books on the chart and saved them to an Excel document.
Through this project I picked up some basic knowledge of web crawling and learned how to use several fundamental Python libraries. Next, I plan to use more advanced networking and data-handling libraries to crawl other sites and deepen my understanding of Python web crawlers.