Python Advanced Application Design: Assignment Requirements

Implement a topic-focused Python web crawler and complete the tasks below.
(Note: one topic per person, chosen by the student; all design documentation and source code are to be posted to the cnblogs (博客园) blog platform.)

Part One: Design of the topic-focused web crawler (15 points)
1. Name of the crawler
Novels published over the years by Fenghuo Xi Zhuhou (烽火戏诸侯)
2. Content and data features to crawl and analyze
The name of each of the author's novels
The total click count of each novel
3. Design overview (including implementation approach and technical difficulties)
http://home.zongheng.com/show/userInfo/166130.html
http://book.zongheng.com/book/{}.html
Starting from the author information page (the first URL), crawl the URL of each book; then crawl each book's name and click count from its own page (the second URL pattern); finally write the names and click counts to an Excel sheet and draw a chart.
Part Two: Analysis of the target page structure (15 points)
1. Structural features of the target pages

http://home.zongheng.com/show/userInfo/166130.html

The author information page of Fenghuo Xi Zhuhou.

http://book.zongheng.com/book/{}.html

The page of an individual work, whose URL is crawled from the author information page; the braces are filled in with each book's ID.
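As a small illustration of how the template is filled in (the book ID below is a made-up value):

# hypothetical book ID, for illustration only
book_id = '672340'
book_url = 'http://book.zongheng.com/book/{}.html'.format(book_id)
# -> http://book.zongheng.com/book/672340.html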
2. HTML page parsing

From the div with class "imgbox", grab the <a> tags, then take each work's URL from the href attribute.

From the div with class "book-name", crawl the work's name.

From the div with class "nums", crawl the work's click count from the <i> tag at index 2.
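A minimal parsing sketch against an invented stub of this structure (the real pages are much larger; the HTML below only mimics the three nodes described above):

from bs4 import BeautifulSoup

# invented stub that mimics the page structure described above
html = '''
<div class="imgbox"><a href="http://book.zongheng.com/book/672340.html"></a></div>
<div class="book-name">雪中悍刀行</div>
<div class="nums"><i>0</i><i>0</i><i>123.4</i></div>
'''
soup = BeautifulSoup(html, 'html.parser')
url = soup.find('div', class_='imgbox').a['href']                 # work URL from the href
name = soup.find('div', class_='book-name').string                # work name
clicks = soup.find('div', class_='nums').find_all('i')[2].string  # click count, <i> at index 2
print(url, name, clicks)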

3. Node (tag) lookup and traversal method
(node tree shown where necessary)
def namesinfo(html):
    soup = BeautifulSoup(html, 'html.parser')
    # get the div whose class attribute is "book-name"
    name = soup.find_all("div", attrs='book-name')
    # regex to extract the Chinese book name
    namess = re.findall(r"[\u4e00-\u9fa5]+", str(name[0]))

 


The node is located with find_all, and the Chinese title is then extracted with a regular expression.
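For instance, applied to the string form of an invented book-name node, the regular expression keeps only the Chinese characters:

import re

tag = '<div class="book-name">雪中悍刀行</div>'   # invented node text
print(re.findall(r"[\u4e00-\u9fa5]+", tag))      # ['雪中悍刀行']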


Part Three: Web crawler design (60 points)
The crawler body must include the parts below; attach the source code with detailed comments, and show the output of each part of the program.

from bs4 import BeautifulSoup
import requests, matplotlib, re, xlwt
import matplotlib.pyplot as plt


# fetch a page
def gethtml(url):
    info = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36'}
    try:
        data = requests.get(url, headers=info)
        data.raise_for_status()
        data.encoding = data.apparent_encoding
        return data.text
    except:
        return ""


# book URLs
def urlinfo(url):
    books = []
    book = gethtml(url)
    soup = BeautifulSoup(book, "html.parser")
    # get the <p> tags whose class attribute is "tit"
    p = soup.find_all("p", attrs="tit")
    for item in p:
        # collect each book's address
        books.append(item.a.attrs['href'])
    return books


# click-count information
def numsinfo(html):
    n = []
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find_all("div", attrs='nums')
    nums = div[0]
    i = 0
    for spa in nums.find_all("i"):
        if i == 2:
            # get the click count
            n.append(spa.string.split('.')[0])
            break
        i += 1
    return n


# book-name information
def namesinfo(html):
    soup = BeautifulSoup(html, 'html.parser')
    # get the div whose class attribute is "book-name"
    name = soup.find_all("div", attrs='book-name')
    # regex to extract the Chinese title
    namess = re.findall(r"[\u4e00-\u9fa5]+", str(name[0]))
    return namess


# fix Chinese character display in matplotlib
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['font.family'] = 'sans-serif'
matplotlib.rcParams['axes.unicode_minus'] = False


# bar chart
def Bar(x, y, user):
    plt.bar(x, y, color='y', width=0.5)
    plt.ylabel('点击量')
    plt.xlabel('书名')
    plt.title(user)
    plt.savefig(user, dpi=300)
    plt.show()


def file(book, nums, address):
    # create a Workbook, the equivalent of an Excel file
    excel = xlwt.Workbook(encoding='utf-8')
    # create a sheet named "one"
    sheet1 = excel.add_sheet(u'one', cell_overwrite_ok=True)
    # write the column names
    sheet1.write(0, 0, 'book')
    sheet1.write(0, 1, 'nums')

    # write the data below the header row
    for i in range(len(book)):
        sheet1.write(i + 1, 0, book[i])
    for j in range(len(nums)):
        sheet1.write(j + 1, 1, nums[j])
    excel.save(address)


# flatten a list of single-element lists
def convert(lista):
    listb = []
    for i in lista:
        listb.append(i[0])
    return listb


def main():
    # author page
    author = 'http://home.zongheng.com/show/userInfo/166130.html'
    user = '烽火戏诸侯'
    urls = urlinfo(author)
    namelist = []
    countlist = []
    for url in urls:
        html = gethtml(url)
        namelist.append(namesinfo(html))
        countlist.append(numsinfo(html))
    namelist = convert(namelist)
    countlist = convert(countlist)
    for i in range(len(countlist)):
        countlist[i] = int(countlist[i])
    # save path
    addr = f'D:\\{user}.xls'
    file(namelist, countlist, addr)
    Bar(namelist, countlist, user)


if __name__ == '__main__':
    main()

1. Data acquisition and crawling

def urlinfo(url):
    books = []
    book = gethtml(url)
    soup = BeautifulSoup(book, "html.parser")
    # get the <p> tags whose class attribute is "tit"
    p = soup.find_all("p", attrs="tit")
    for item in p:
        # collect each book's address
        books.append(item.a.attrs['href'])
    return books
def numsinfo(html):
    n = []
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find_all("div", attrs='nums')
    nums = div[0]
    i = 0
    for spa in nums.find_all("i"):
        if i == 2:
            # get the click count
            n.append(spa.string.split('.')[0])
            break
        i += 1
    return n

 


2. Data processing and cleaning
Cleaning the click count (keep only the integer part before the decimal point):
 
    for spa in nums.find_all("i"):
        if i == 2:
            # get the click count
            n.append(spa.string.split('.')[0])
            break
        i += 1

Cleaning the book name (keep only the Chinese characters):

    namess = re.findall(r"[\u4e00-\u9fa5]+", str(name[0]))
    return namess
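As a small check of the first cleaning step (the raw figure below is invented; main() later casts the cleaned strings to int):

raw = '123.45'                    # invented raw click figure as scraped
clicks = int(raw.split('.')[0])   # -> 123, only the integer part is kept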

 


3. Text analysis (optional): jieba word segmentation, wordcloud visualization
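A minimal sketch of this optional step, assuming the jieba and wordcloud packages are installed and using an invented list of book names as the corpus (in practice the crawled namelist would be passed in):

import jieba
from wordcloud import WordCloud

# invented corpus; in practice use the crawled book names
names = ['雪中悍刀行', '极品公子', '陈二狗的妖孽人生']
words = []
for name in names:
    words.extend(jieba.lcut(name))   # segment each title into words

# font_path must point to a font with Chinese glyphs, e.g. SimHei
wc = WordCloud(font_path='simhei.ttf', background_color='white')
wc.generate(' '.join(words))
wc.to_file('wordcloud.png')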

 

 


 


4. Data analysis and visualization
(for example: bar chart, histogram, scatter plot, box plot, distribution analysis, regression analysis, etc.)

 

 

def Bar(x, y, user):
    plt.bar(x, y, color='y', width=0.5)
    plt.ylabel('点击量')
    plt.xlabel('书名')
    plt.title(user)
    plt.savefig(user, dpi=300)
    plt.show()
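A usage sketch with invented data (the call signature matches the function above; the Chinese labels need the SimHei fix from earlier):

# invented sample data; real values come from main()
Bar(['雪中悍刀行', '极品公子'], [1200000, 800000], '烽火戏诸侯')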

 

 
5. Data persistence
The crawled book names and click counts are written to an Excel workbook (.xls) with xlwt:

def file(book, nums, address):
    # create a Workbook, the equivalent of an Excel file
    excel = xlwt.Workbook(encoding='utf-8')
    # create a sheet named "one"
    sheet1 = excel.add_sheet(u'one', cell_overwrite_ok=True)
    # write the column names
    sheet1.write(0, 0, 'book')
    sheet1.write(0, 1, 'nums')

    # write the data below the header row
    for i in range(len(book)):
        sheet1.write(i + 1, 0, book[i])
    for j in range(len(nums)):
        sheet1.write(j + 1, 1, nums[j])
    excel.save(address)
Conclusions (10 points)
1. What conclusions can be drawn from the analysis and visualization of the data?
Sword Snow Stride (雪中悍刀行) is the author's best-selling novel.
2. Briefly summarize how this programming task went.
Through this exercise I applied Python in practice, and it gave me a much deeper understanding of web crawling.
