Douban Books Top 250 Crawler in Practice
Preface
This article covers crawling and preprocessing the Douban Books Top 250 data. The main libraries used are re, urllib.request, BeautifulSoup, and lxml. The focus is on the pitfalls I ran into while crawling and how I dealt with them. The crawling code and data are attached at the end of the article. This is a sister piece to my first real crawler project, the Douban Movies Top 250.
Crawler
Defining the download function
Errors often occur when downloading web pages. To reduce unnecessary errors, pay attention to three points:
- Pause between downloads, otherwise you may get banned
- Add a header to disguise the request as a normal browser visit; the exact header content does not seem to matter much
- Decode the response with decode('utf-8'), otherwise the crawled content may be garbled
```python
# Import libraries
import re
import pandas as pd
import time
import urllib.request
from lxml.html import fromstring
from bs4 import BeautifulSoup

# Download a URL and return the decoded HTML
def download(url):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36')  # disguise as a browser
    resp = urllib.request.urlopen(request)
    html = resp.read().decode('utf-8')  # decode as UTF-8
    time.sleep(3)  # pause 3 s between requests to avoid being banned
    return html
```
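Transient server errors (HTTP 5xx) are another common failure when downloading. As a minimal sketch, the function above could be wrapped with a retry; the function name, retry count, and delay below are my own illustrative choices, not from the original code:

```python
import time
import urllib.request
from urllib.error import URLError

def download_with_retry(url, num_retries=2, pause=3):
    """Download a URL, retrying on transient 5xx errors (illustrative sketch)."""
    request = urllib.request.Request(url)
    request.add_header('User-agent', 'Mozilla/5.0')  # any browser-like value works
    try:
        html = urllib.request.urlopen(request).read().decode('utf-8')
    except URLError as e:
        # Retry only on server-side (5xx) errors, a bounded number of times
        if num_retries > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            time.sleep(pause)  # wait before retrying
            return download_with_retry(url, num_retries - 1, pause)
        raise
    time.sleep(pause)  # be polite between requests
    return html
```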
Choosing what to crawl
When I crawled the Douban Movies Top 250, my approach was to download each index page and then download every movie page linked from it. There are 10 index pages with 25 movies each, so 25*10+10=260 pages had to be downloaded. This yields a bit more information per movie, but the obvious drawback is that it takes a long time.
For Douban Books, I instead downloaded each index page and extracted the book information directly from it (it appears under each title), so only 10 pages need to be downloaded! Moreover, when I tried the old Douban Movies approach on Douban Books, I found it hard to locate the author information on the detail pages. The lesson: choose your crawling strategy flexibly.
Choosing a positioning method
In the first Douban Movies project, I used three ways to locate content:
- Regular expressions
- The find function in BeautifulSoup
- XPath in lxml
So which method should be used? My principle: the simpler, the better.
I personally recommend XPath, because it is the easiest and most mindless: press F12, select the content to crawl, and right-click Copy XPath. The second choice is the find function in BeautifulSoup. Use regular expressions only as a last resort.
For example, XPath is used to locate the book titles, but there are two points to note:
- Remove /tbody from the copied XPath
- Use .strip() to remove line breaks and surrounding spaces
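Both points can be illustrated on a toy snippet (not the real Douban page): browsers insert a tbody element into tables, so a browser-copied XPath contains /tbody, but lxml's HTML parser does not insert it, so that XPath matches nothing until /tbody is removed:

```python
from lxml.html import fromstring

# Toy snippet mimicking the index-page structure (not the real Douban HTML)
html = '<table><tr><td><div><a href="#">\n  活着\n</a></div></td></tr></table>'
tree = fromstring(html)

# The browser-copied XPath includes /tbody, which lxml's parser never inserts
assert tree.xpath('//table/tbody/tr/td/div/a/text()') == []

# Drop /tbody and the node is found; .strip() removes newlines and spaces
title = tree.xpath('//table/tr/td/div/a/text()')[0].strip()
print(title)
```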
```python
# Lists to hold the crawled fields
name = []
rate = []
info = []

# Crawl each of the 10 index pages
for k in range(10):
    html = download('https://book.douban.com/top250?start={}'.format(k*25))
    tree = fromstring(html)
    soup = BeautifulSoup(html, 'lxml')
    # Extract all 25 books on this page
    for i in range(25):
        name.append(tree.xpath('//*[@id="content"]/div/div[1]/div/table[{}]/tr/td[2]/div[1]/a/text()'.format(i+1))[0].strip())
        rate.append(soup.find_all('span', {'class': 'rating_nums'})[i].get_text())
        info.append(soup.find_all('p', {'class': 'pl'})[i].get_text())

# Combine the three columns into one DataFrame
name_pd = pd.Series(name)
rate_pd = pd.Series(rate)
info_pd = pd.Series(info)
book_data = pd.concat([name_pd, rate_pd, info_pd], axis=1)
book_data.columns = ['书名', '评分', '信息']
book_data.head()
```
Data preprocessing
Next, preprocess the data crawled above:
- Split the information variable into author, publisher, publication year, and price
- Manually fix two abnormal rows
- Use regular expressions to extract the numeric parts of the publication year and price
- Finally, drop the information column, which is no longer needed
Note that the author, publisher, publication year, and price are located by index position after splitting, but two rows have irregular formats and need manual adjustment.
The specific code is as follows:
```python
# Data preprocessing:
# Split the info string on '/'
Info = book_data['信息'].apply(lambda x: x.split('/'))

# Extract fields by position
book_data['作家'] = Info.apply(lambda x: x[0])
book_data['出版社'] = Info.apply(lambda x: x[-3])
book_data['出版年'] = Info.apply(lambda x: x[-2])
book_data['定价'] = Info.apply(lambda x: x[-1])

# Manually fix the two abnormal rows
book_data.iloc[9, 4] = '群众出版社'
book_data.iloc[9, 5] = '1981'
book_data.iloc[184, 5] = '1996'
book_data.iloc[184, 6] = '0'

# Extract the four-digit year
f = lambda x: re.search(r'[0-9]{4}', x).group()
book_data['出版年'] = book_data['出版年'].apply(f)

# Extract the numeric part of the price
g = lambda x: re.search(r'([0-9]+\.[0-9]+|[0-9]+)', x).group()
book_data['定价'] = book_data['定价'].apply(g)

# Drop the raw info column
book_data = book_data.drop(['信息'], axis=1)

# Write to CSV (change the path to your own!)
outputpath = 'c:/Users/zxw/Desktop/修身/与自己/数据分析/数据分析/爬虫/豆瓣读书/book.csv'
book_data.to_csv(outputpath, sep=',', index=False, header=True, encoding='utf_8_sig')
```
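The two regular expressions above can be sanity-checked on sample strings of the kind that appear in the info column (the sample values below are made up for illustration):

```python
import re

# Year: first run of four digits
year = lambda x: re.search(r'[0-9]{4}', x).group()
# Price: a decimal number if present, otherwise a plain integer
price = lambda x: re.search(r'([0-9]+\.[0-9]+|[0-9]+)', x).group()

print(year(' 1996-12'))   # -> 1996
print(price(' 29.80元'))  # -> 29.80
print(price(' 28元'))     # -> 28
```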
Postscript
I now have a preliminary understanding of crawlers. Next, I may move on to machine learning, starting with An Introduction to Statistical Learning with R.