This was my first time using Python to crawl pages of data. The code is very simple, yet debugging still took a long time. The experience was worthwhile, so I am recording it here.
Using BeautifulSoup
    # -*- coding:utf-8 -*-
    from bs4 import BeautifulSoup
    from urllib2 import urlopen, HTTPError, URLError

    def getUrl(url):
        try:
            html = urlopen("https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4" + "?" + url)
        except (HTTPError, URLError):
            return None
        try:
            bsObj = BeautifulSoup(html.read(), from_encoding="utf8")
        except AttributeError:
            return None
        return bsObj

    # Extract each book's title, rating, and number of raters from the page
    def getAll(bsObj):
        qu = []  # collect the books on this page that meet the criteria
        try:
            allSubject = bsObj.findAll("li", {"class": "subject-item"})
            for i in allSubject:
                # book title
                j = i.find("h2").find("a")
                name = j.attrs['title']
                # rating
                k = i.find("span", {"class": "rating_nums"})
                score = float(k.get_text())
                # number of raters; the text contains Chinese characters
                # as well as digits, so use filter() to keep only the digits
                g = i.find("span", {"class": "pl"})
                peo = g.get_text().encode('utf-8')
                people = int(filter(str.isdigit, peo))
                # keep the book only if both thresholds are met
                if people >= 30000 and score >= 8.5:
                    qu.append([name, score, people])
            return qu
        except AttributeError:
            return None

    def main():
        All = []
        # Only one page is fetched here for testing; raise the second 20
        # to crawl more pages (Douban paginates in steps of 20)
        for i in range(0, 20, 20):
            bsObj = getUrl("start=" + str(i) + "&type=T")
            if bsObj is None:
                print "what?"
            else:
                k = getAll(bsObj)
                if k is not None:
                    All.append(k)
        for i in All:
            for j in i:
                print j[0], j[1], j[2]

    main()
Here are the problems I ran into and how I solved them:
Problems:
1. The rater count is scraped as a full string like "(XXX people rated)", but I only need the number XXX.
2. An encoding problem: while testing, I printed the whole All list directly, and the book titles came out as Unicode escape sequences.
Solutions:
1. I searched online for ways to extract the digits from a string in Python and settled on the filter() function.
filter() is defined as follows:
filter() filters a sequence: it drops the elements that fail a test and returns a new sequence of the elements that pass.
It takes two arguments, a function and a sequence. Each element of the sequence is passed to the function, which returns True or False; the elements for which it returns True make up the result.
One more pitfall: remember to convert the encoding of the extracted text first,
peo = g.get_text().encode('utf-8')
because g.get_text() by itself returns a unicode object, and in Python 2 str.isdigit cannot be applied to it.
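As a minimal sketch of this digit-extraction trick (the sample string is hypothetical; ''.join() is used so the same code also runs under Python 3, where filter() returns an iterator instead of a string):

```python
# Extract only the digits from a rater-count string such as "(2906 people rated)".
# In Python 2, filter() on a str returns a str directly; joining the filtered
# characters works in both Python 2 and 3.
text = "(2906 people rated)"
digits = ''.join(filter(str.isdigit, text))
people = int(digits)
print(people)  # 2906
```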
2. This problem was sillier; I went down a dead end wondering why the Chinese characters were printed as Unicode escapes. Later I understood why: I was printing the list as a whole, which shows the repr of each element, so the escapes appeared. Printing the elements individually by index displays the Chinese correctly. I am recording it anyway as a warning to my future self.
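To illustrate the difference (the title string is a hypothetical example): printing a list calls repr() on each element, and in Python 2 the repr of a unicode string is its \u escapes. Python 3's repr shows the characters directly, but ascii() reproduces the old escaped look:

```python
# Printing a string shows its characters; printing a list shows each
# element's repr. ascii() mimics Python 2's escaped list output.
title = '三体'  # hypothetical book title
print(title)           # 三体
print(ascii([title]))  # ['\u4e09\u4f53']
```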
Since this was my first attempt at crawling data, the code is fairly simple.