Python Crawler in Practice 2: Crawling the Douban Books Top 250

Preface

This article covers crawling and preprocessing the data of the Douban Books Top 250. The main libraries used are re, urllib.request, BeautifulSoup (bs4), lxml, and pandas. The focus is on summarizing the pitfalls I ran into while crawling and how I dealt with them. The code and data are attached at the end of the article. This is the companion piece to my first crawler project: the Douban Movies Top 250.

Crawler

Defining the download function

When downloading web pages, errors are common. To reduce unnecessary failures, pay attention to the following three points:

  1. Pause between downloads, otherwise you may be blocked
  2. Add a header so the request looks like a normal browser rather than a crawler; the exact header content does not seem to matter much
  3. Decode the response with decode('utf-8'), otherwise the crawled content may be garbled
# import libraries
import re
import pandas as pd
import time
import urllib.request
from lxml.html import fromstring
from bs4 import BeautifulSoup
# download a page
def download(url):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36') # disguise the request as a browser
    resp = urllib.request.urlopen(request)
    html = resp.read().decode('utf-8') # decode with utf-8
    time.sleep(3)   # pause 3s to avoid being blocked
    return html
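
A quick sanity check (my own addition, not part of the original article): fetch the first index page and peek at the decoded HTML.

# fetch the first index page of the Douban Books Top 250
html = download('https://book.douban.com/top250?start=0')
print(len(html))     # a non-trivial length suggests the download succeeded
print(html[:200])    # peek at the start of the decoded HTML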

Choice of crawling content

When I crawled the Douban Movies Top 250, my approach was to download each index page and then download the detail page of every movie listed on it. There are 10 index pages with 25 movies each, so 25*10+10=260 pages had to be downloaded. Although this yields a bit more information, the obvious drawback is that it takes a long time.

When I crawled the Douban Books Top 250, my approach was to download each index page and extract the book information directly from it (the line under each title). Only 10 pages need to be downloaded! Moreover, when I tried to apply the old Douban Movies approach to Douban Books, I found it hard to locate the author information. The lesson is to choose how to crawl content flexibly.

Choice of positioning method

In the earlier Douban Movies article, three methods were used to locate content:

  1. Locating with regular expressions
  2. Locating with the find function in BeautifulSoup
  3. Locating with XPath in lxml

So which method should be used? My principle is: the simpler, the better.
I personally recommend XPath, because it requires the least thought: press F12, select the content to be crawled, then right-click and copy the XPath. The find function in BeautifulSoup is my second choice, and regular expressions are a last resort.
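
To make the comparison concrete, here is a minimal sketch (my own addition) of locating the ratings on one index page with all three methods. It assumes the ratings sit in span tags with class rating_nums, as the find_all call later in this article implies, and that html is the page returned by download().

# 1. regular expression
ratings_re = re.findall(r'<span class="rating_nums">(.*?)</span>', html)

# 2. BeautifulSoup find_all
soup = BeautifulSoup(html, 'lxml')
ratings_bs = [s.get_text() for s in soup.find_all('span', {'class': 'rating_nums'})]

# 3. lxml XPath
ratings_xp = fromstring(html).xpath('//span[@class="rating_nums"]/text()')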
For example, XPath is used to locate the book titles, but there are two points to note:

  1. The /tbody element must be removed from the copied XPath
  2. .strip() is needed to remove line breaks and spaces

[Figure: copying the XPath of a book title from the browser developer tools]
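
The sketch below (my own addition) applies the two notes above; the first XPath is a hypothetical example of what the browser copies (including /tbody), and tree is assumed to be fromstring(html) for an index page.

# XPath as copied from the browser (hypothetical, note the /tbody)
copied = '//*[@id="content"]/div/div[1]/div/table[1]/tbody/tr/td[2]/div[1]/a'
# 1. remove /tbody, which the browser inserts but the raw HTML does not contain
# 2. append /text() and call .strip() to drop surrounding line breaks and spaces
fixed = '//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div[1]/a/text()'
title = tree.xpath(fixed)[0].strip()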

# content to crawl
name = []
rate = []
info = []


# loop over the 10 index pages
for k in range(10):
    html = download('https://book.douban.com/top250?start={}'.format(k*25))
    tree = fromstring(html)
    soup = BeautifulSoup(html, 'lxml')
    # extract the information of the 25 books on this page
    for j in range(25):
        name.append(tree.xpath('//*[@id="content"]/div/div[1]/div/table[{}]/tr/td[2]/div[1]/a/text()'.format(j+1))[0].strip())
        rate.append(soup.find_all('span', {'class': 'rating_nums'})[j].get_text())
        info.append(soup.find_all('p', {'class': 'pl'})[j].get_text())

# convert the lists to Series and concatenate into a DataFrame
name_pd = pd.Series(name)
rate_pd = pd.Series(rate)
info_pd = pd.Series(info)
book_data = pd.concat([name_pd, rate_pd, info_pd], axis=1)
book_data.columns = ['书名', '评分', '信息']  # title, rating, info
book_data.head()

Data preprocessing

Next, preprocess the data crawled above:

  1. Split the information variable into author, publisher, publication year, and price
  2. Two abnormal entries need to be adjusted manually
  3. Use regular expressions to extract the numeric part of the publication year and the price
  4. Finally, drop the now-redundant information column

Note that the author, publisher, publication year, and price are picked out by their position in the split string, but two entries do not follow the usual author/publisher/year/price pattern and need to be adjusted manually:
the Sherlock Holmes entry
the Bible entry
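
Rather than hard-coding the two row numbers, you could first locate the anomalies by looking for info strings that do not split into at least four fields. This is a sketch of my own, not part of the original code:

# find entries whose info string lacks the usual author / publisher / year / price fields
fields = book_data['信息'].apply(lambda x: x.split('/'))
suspect = book_data[fields.apply(len) < 4]
print(suspect[['书名', '信息']])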

The specific code is as follows:

# Data preprocessing:

# split the info string on '/'
Info = book_data['信息'].apply(lambda x: x.split('/'))

# extract the fields by position
book_data['作家'] = Info.apply(lambda x: x[0])     # author
book_data['出版社'] = Info.apply(lambda x: x[-3])  # publisher
book_data['出版年'] = Info.apply(lambda x: x[-2])  # publication year
book_data['定价'] = Info.apply(lambda x: x[-1])    # price

# manual adjustment of the two abnormal entries
book_data.iloc[9, 4] = '群众出版社'
book_data.iloc[9, 5] = '1981'
book_data.iloc[184, 5] = '1996'
book_data.iloc[184, 6] = '0'

# extract the year
f = lambda x: re.search(r'[0-9]{4}', x).group()
book_data['出版年'] = book_data['出版年'].apply(f)

# extract the price
g = lambda x: re.search(r'([0-9]+\.[0-9]+|[0-9]+)', x).group()
book_data['定价'] = book_data['定价'].apply(g)

# drop the original info column
book_data = book_data.drop(['信息'], axis=1)

# output
outputpath = 'c:/Users/zxw/Desktop/修身/与自己/数据分析/数据分析/爬虫/豆瓣读书/book.csv'  ## change the path to your own!
book_data.to_csv(outputpath, sep=',', index=False, header=True, encoding='utf_8_sig')
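
The regex extraction leaves the year and price as strings; an optional follow-up of my own (not in the original) converts them to numeric types for later analysis.

# optional: convert year and price to numeric types
book_data['出版年'] = pd.to_numeric(book_data['出版年'])
book_data['定价'] = pd.to_numeric(book_data['定价'])
print(book_data.dtypes)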

Postscript

At this point I have gained a preliminary understanding of crawlers. Next I may move on to machine learning, starting with An Introduction to Statistical Learning with R.

Code and data set (extraction code: disq)


Origin blog.csdn.net/weixin_43084570/article/details/108666114