Crawlers (4): Code Analysis

[A follow-up post will organize the slides and make the source material available for download ~]

[This crawler is for learning purposes only, not for commercial or any other use; it will be deleted on request if it infringes]

[If you have ideas for other sites to test, you can send me a private message on my public account ~]


I. Demo


First, let's look at the result. The demo covers back-end crawling only; it does not yet involve storage or front-end display. [The CentOS virtual machine's network was unreliable, so I switched to Windows for the demonstration]

[Follow the public account to watch the demo]

I write the crawled content into HTML files, and the HTML files keep the original styling. The style URLs point to the official site; I did not crawl the stylesheets themselves.

II. Installation


In the previous two sections we installed Python and Django. Based on the site analysis in section three, we need to install two more libraries:

pip install requests
pip install PyQuery

requests: used to send HTTP requests for the web pages;

PyQuery: used to parse the returned pages; its syntax is similar to jQuery.
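
If you want to confirm that both libraries installed correctly, here is a minimal fetch-and-parse round trip against the series' target site (assuming network access to ituring.com.cn):

import requests
from pyquery import PyQuery as pq

# Fetch the book list page and print its <title> as a sanity check.
response = requests.get("http://www.ituring.com.cn/book")
doc = pq(response.text)
print(doc("title").text())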

III. Code analysis


 

1) Parsing the home page

$(".PagedList-skipToPage")

When we analyzed the home page, we demonstrated the effect of typing this selector into the browser console; you can go back and review it in the section "Crawlers (3): Site Analysis".

The selector above grabs the page-number element for every page of results, which tells us the total number of pages and hence the total number of free books. That count drives the value of the kw parameter sent with each request; see the code below:

import requests
from pyquery import PyQuery as pq

def FreeComputerBook():
    url="http://www.ituring.com.cn/book"
    kw="tab=free&sort=vote"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
    response=requests.get(url,params=kw,headers=headers)
    htmlResult=pq(response.text)

    # One ".PagedList-skipToPage" element exists per result page, so
    # iterating over them requests every page of the free-book list.
    pageCount=htmlResult(".PagedList-skipToPage").items()
    pageNum=1          # the site's page numbers start at 1
    bookCount=0
    books=[]
    for page in pageCount:
        kw="tab=free&sort=vote&page="+str(pageNum)
        pageNum=pageNum+1
        response=requests.get(url,params=kw,headers=headers)
        htmlResult=pq(response.text)
        pageBooks=allBookDetails(htmlResult,bookCount,headers)
        bookCount+=len(pageBooks)   # keep the running total across pages
        books.append(pageBooks)
    print(books)

By varying the page parameter in kw, we send one request per page and collect the book URLs:

http://www.ituring.com.cn/book?tab=free&sort=vote&page=1
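
As a side note, requests accepts the query string directly as params. Here is a small sketch showing how the final URL is assembled (the page value is illustrative):

import requests

# Build (but don't send) a request to inspect the final URL requests will use.
req = requests.Request(
    "GET",
    "http://www.ituring.com.cn/book",
    params="tab=free&sort=vote&page=1",
).prepare()
print(req.url)   # -> http://www.ituring.com.cn/book?tab=free&sort=vote&page=1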

2) After determining the total number of pages, we collect each book's URL and related information so we can enter its detail page. The selector is the same one entered in the previous section:

$(".block-books li")

This selector grabs the list items holding every book's information; a single item contains one book's URL, author, title, and so on. Since the selector returns an array of items, we traverse it with a for loop to extract each book's details:

def allBookDetails(htmlResult,bookCount,headers):
    books=[]
    bookItems=htmlResult(".block-books li").items()
    for bookitem in bookItems:
        book={}
        bookCount=bookCount+1
        # Each <li> holds one book: the cover link carries the URL and
        # title, and the info block carries the author.
        bookUrl=bookitem.find(".book-img a").attr("href")
        bookName=bookitem.find(".book-img a").attr("title")
        bookAuthor=bookitem.find(".book-info a").text()
        book["url"]="http://www.ituring.com.cn"+bookUrl
        book["name"]=bookName
        book["author"]=bookAuthor
        book["catalogue"]=BookCatalogue(book["url"],headers,bookName)
        books.append(book)
        print("book "+str(bookCount)+" has been crawled.")
    return books

We store each book's information in a dictionary and append the dictionary to an array; this is mainly to prepare the data for storage.
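
For reference, a single entry in that array has roughly the following shape; the values here are made up for illustration:

book = {
    "url": "http://www.ituring.com.cn/book/1234",   # detail-page URL (id illustrative)
    "name": "Example Free Book",
    "author": "Example Author",
    "catalogue": [...],   # filled in by BookCatalogue(), see step 3
}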

 

3) Obtaining a book's table of contents. Observing the site shows that the URL for a chapter's content page differs only by an id, so we need to extract each chapter's URL. The selector is again the one entered in section three:

$(".table tr")

Looking at the site, the entire table of contents is rendered as a table, so we need to pull the URL out of each table row. The code is as follows:

def BookCatalogue(url,headers,bookName):
    # bookName is kept for later use (e.g. naming output files).
    catalogueList=[]
    response=requests.get(url,headers=headers)
    htmlResult=pq(response.text)
    catalogues=htmlResult(".table tr").items()

    for catalogue in catalogues:
        catalogueUrl=catalogue.eq(0).find("a").attr("href")
        catalogueName=catalogue.eq(0).find("a").text()
        if not catalogueUrl:
            continue   # skip rows without a chapter link
        bookCatalogue={}
        bookCatalogue["catalogueUrl"]="http://www.ituring.com.cn"+catalogueUrl
        bookCatalogue["catalogueName"]=catalogueName
        catalogueList.append(bookCatalogue)
    return catalogueList

Again, I put this information into dictionaries to prepare it for storage.

 

4) After obtaining the table of contents, we need to fetch the content behind each chapter entry; the selector again comes from section three ~

$(".article-detail").html()

This selector grabs a chapter page's content, including its style information. The code is as follows:

def BookContent(url,headers):
    response=requests.get(url,headers=headers)
    htmlResult=pq(response.text)
    # .text() would strip the markup; .html() keeps the tags and style info.
    content=htmlResult(".article-detail").html()
    return content.encode('UTF-8') if content else b""
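
To produce the HTML files shown in the demo, the functions above could be wired together roughly as follows. This is my own sketch; saveChapter is a hypothetical helper, not part of the original code:

import os

def saveChapter(bookName, catalogueName, content):
    # Write one chapter's HTML (UTF-8 bytes from BookContent) into a
    # folder named after the book.
    folder = bookName.replace("/", "_")
    os.makedirs(folder, exist_ok=True)
    fileName = catalogueName.replace("/", "_") + ".html"
    with open(os.path.join(folder, fileName), "wb") as f:
        f.write(content)

# Example wiring for one book dictionary produced in step 2:
# for item in book["catalogue"]:
#     content = BookContent(item["catalogueUrl"], headers)
#     if content:
#         saveChapter(book["name"], item["catalogueName"], content)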

IV. Conclusion


That's the whole process of crawling the pages. Next we will need to store all of this information in a database. Also, as you may have noticed while watching the demo, performance is poor and execution takes too long; the optimization process will be explained in a follow-up ~

Source: www.cnblogs.com/VVsky/p/11183011.html