Crawling Douban fiction with Python

This was my first time using Python to crawl page data. The code is very simple, but debugging still took quite a while; the result felt worthwhile, so I am recording it here.

Using BeautifulSoup (Python 2):

# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
from urllib2 import urlopen, HTTPError, URLError

def getUrl(url):
    try:
        html = urlopen("https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4" + "?" + url)
    except (HTTPError, URLError):
        return None
    try:
        bsObj = BeautifulSoup(html.read(), from_encoding="utf8")
    except AttributeError as e:
        return None
    return bsObj

# extract the book titles, ratings, and number of ratings from the page
def getAll(bsObj):
    qu = []   # use a list to record the qualifying books on this page
    try:
        allSubject = bsObj.findAll("li", {"class":"subject-item"})
        for i in allSubject:
            # get the book title
            j = i.find("h2").find("a")
            name = j.attrs['title']

            # get the rating
            k = i.find("span", {"class": "rating_nums"})
            score = float(k.get_text())

            # get the number of ratings; the text contains Chinese characters as well as digits, so use filter to keep only the digits
            g = i.find("span", {"class": "pl"})
            peo = g.get_text().encode('utf-8')
            people = int(filter(str.isdigit,peo))

            # check whether the number of ratings and the rating meet the thresholds
            if people >= 30000 and score >= 8.5:
                qu.append([name, score, people])
        return qu

    except AttributeError as e:
        return None

def main():
    All = []
    # only one page is fetched here for testing; change the second 20 to a larger number to crawl more pages
    for i in range(0, 20, 20):

        bsObj = getUrl("start=" + str(i) + "&type=T")
        if bsObj is None:
            print "what?"
        else:
            k = getAll(bsObj)
            if k is not None:
                All.append(k)

    for i in All:
        for j in i:
            print j[0], j[1], j[2]

main()
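
For reference, here is a minimal Python 3 sketch of the same request-and-parse step (urllib2 was split into urllib.request and urllib.error in Python 3, and Douban may reject the default urllib User-Agent, so one is set explicitly; the function name and header value here are just illustrative):

# -*- coding:utf-8 -*-
# Minimal Python 3 sketch of the getUrl step; html.parser is the standard-library parser.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def get_url(query):
    req = Request("https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4" + "?" + query,
                  headers={"User-Agent": "Mozilla/5.0"})  # Douban may reject the default User-Agent
    try:
        html = urlopen(req)
    except (HTTPError, URLError):
        return None
    try:
        return BeautifulSoup(html.read(), "html.parser")
    except AttributeError:
        return None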

Here are the problems I ran into and how I solved them:

Problems:
1. When fetching the number of ratings, what is actually retrieved is the whole string "(XXX people rated it)", but we only need the number XXX.
2. An encoding problem: at the beginning, to test things, I printed everything saved in All directly, and found that the book titles were output as Unicode escapes.

Solutions:
1. I searched online for ways to extract the digits from a string in Python and settled on the filter() function.

filter() is defined as follows:
The filter() function is used to filter a sequence: it drops the elements that do not qualify and returns a new list made up of the qualifying elements.
It takes two arguments, the first a function and the second a sequence. Each element of the sequence is passed to the function in turn to be judged, and the function returns True or False; the elements for which it returns True are collected into the result. (In Python 2, when the sequence is a string, filter() returns a string rather than a list, which is why the result can be passed straight to int().)

There is one more catch: when extracting, remember to convert the encoding of the data,
peo = g.get_text().encode('utf-8')
otherwise g.get_text() on its own returns a unicode object, and str.isdigit cannot be applied to its characters. A minimal sketch of the whole step is shown below.
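
A minimal sketch of this extraction step, using a made-up ratings string of the form Douban returns (Python 2):

# -*- coding:utf-8 -*-
# Minimal sketch of extracting the number of ratings (Python 2).
# "raw" below is a made-up example of the text in the span with class "pl".
raw = u'(  35712人评价  )'

peo = raw.encode('utf-8')           # get_text() returns unicode; encode to a byte string first
digits = filter(str.isdigit, peo)   # in Python 2, filtering a str returns the str "35712"
people = int(digits)
print people                        # 35712
# In Python 3, filter() returns an iterator, so it would be: int("".join(filter(str.isdigit, raw)))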

2. The second problem was sillier; I got stuck in a dead end wondering why the Chinese was being printed as Unicode escapes. Later I understood why: I was printing the list directly, so the list's repr was shown; printing the elements by index instead displays the Chinese correctly. I am still recording it here as a reminder to myself.
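
A minimal sketch of that behaviour, using a made-up book entry of the form stored in All (Python 2):

# -*- coding:utf-8 -*-
# Printing a list shows each element's repr, so unicode titles appear as escape sequences.
books = [u'某小说', 8.9, 45000]      # made-up entry of the form stored in All
print books                          # the title is shown as u'\u67d0\u5c0f\u8bf4'
print books[0], books[1], books[2]   # printing by index shows the Chinese title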

Since this was my first time crawling data, the code is relatively simple.


Origin blog.csdn.net/RebelHero/article/details/78473504