Python3 crawler notes: crawling China Book Network (Amoy Book Group)

I am a novice who has just started learning Python web crawling; this post simply records my learning process so I can review it later.


The URL to crawl: http://tuan.bookschina.com/

Content to crawl: book name, book price, and a link to the corresponding preview image

Python packages used in this article: requests, BeautifulSoup, json, csv

When I opened the group-purchase page of China Book Network, I found that the site's information is loaded dynamically:

For now, let's not worry about how the additional pages of books are loaded, and start by grabbing the book information on the first page:

The browser used for this crawler is Chrome.

So we open the browser's developer tools (F12), where we can see the details of how the page is loaded.

To make our requests look like they come from a real browser, we need to inspect the request headers:

The corresponding code:

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36',
    'Host': 'tuan.bookschina.com',
    'Referer': 'http://tuan.bookschina.com/',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9'
}

What we need to do next is analyze the page's DOM to find out which tags hold the information we need.

After an exhaustive search... we find that the information we need is wrapped in the <li> children of <ul id="taoList" ...>.
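Roughly, the relevant markup looks like the sketch below. This is reconstructed from the selectors we end up using (taoListInner, h2, salePrice, img), not copied verbatim from the page, so the exact surrounding tags may differ:

<div class="taoListInner">
    <ul id="taoList" ...>
        <li>
            <h2>book title</h2>
            <span class="salePrice">price</span>
            <img src="preview image URL">
        </li>
        <!-- one <li> per book -->
    </ul>
</div>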


So we plan to use BeautifulSoup to parse the page and extract the information we need from each <li>.

Corresponding code:

url = 'http://tuan.bookschina.com/'
response = requests.get(url, headers=header)  # send the request with browser-like headers
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'html.parser')
for item in soup.select('div .taoListInner ul li'):
    print(item.select('h2')[0].text)           # select() returns a list
    print(item.select('.salePrice')[0].text)
    print(item.select('img')[0].get('src'))    # get() reads an attribute value from the tag

First we call the get method of requests to fetch the page, then parse the response with BeautifulSoup. We find that the <ul> whose <li> items we want sits inside the div with class name taoListInner.

After checking the BeautifulSoup documentation and comparing find_all with select, I decided to use the select method to grab the matching tags: the book title from the <h2> tag, the price from the element with class salePrice, and the preview link from the src attribute of the <img> tag. This prints the information for every book shown on the first page.
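For comparison, here is roughly how the same extraction would look with find_all instead of select; this is a sketch that assumes the same soup object and page structure as above:

# same extraction using find_all() instead of select()
inner = soup.find(class_='taoListInner')      # the container of the book list
if inner is not None:
    for item in inner.find_all('li'):
        h2 = item.find('h2')
        price = item.find(class_='salePrice')
        img = item.find('img')
        if h2 and price and img:              # skip <li> items that hold no book data
            print(h2.text, price.text, img.get('src'))

select was chosen mainly because a CSS selector expresses the whole nesting in one string; find_all is equivalent here.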


But then a problem arises: how do we get the book information on subsequent pages? BeautifulSoup's select method can only parse the static DOM it is given.

So we suspect that the later book data is loaded via Ajax or JavaScript.

Switching to the XHR tab in developer tools, we find that every time we scroll down and more books load, a new GroupList?... request appears.

Opening it:


To our pleasant surprise, the Preview pane shows that the data we need is packaged as JSON.
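Based on the fields we read from it below (the real response may well contain more), the JSON looks roughly like this, with placeholder values:

{
    "Data": [
        {
            "book_name": "...",
            "group_price": "...",
            "group_image": "..."
        },
        ...
    ]
}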


To fetch this JSON data, we need its Request URL.

The current URL is: http://tuan.bookschina.com/Home/GroupList?Type=0&Category=0&Price=0&Order=11&Page=2&Tyjson=true

A pattern emerges: every time new book information is loaded, the Page= parameter increases by one.

So the problem is solved: we only need to request this URL, parse the returned JSON, and we get all the data we want.

This also bears out a common claim: many dynamically loaded sites package their data as JSON responses, which gives our crawlers a convenient shortcut.
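As an aside, instead of concatenating the query string by hand, requests can build it from a dict through its params argument; a minimal sketch with the parameter names taken from the URL above:

params = {
    'Type': 0,
    'Category': 0,
    'Price': 0,
    'Order': 11,
    'Page': 2,
    'Tyjson': 'true'
}
response = requests.get('http://tuan.bookschina.com/Home/GroupList', params=params)

The hand-built URL in the snippet below works just as well for this experiment.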

url = 'http://tuan.bookschina.com/Home/GroupList?Type=0&Category=0&Price=0&Order=11&Page=2&Tyjson=true'
response = requests.get(url)
result = json.loads(response.text)  # parse the JSON body into Python objects
for data in result['Data']:
    bookinfo = {}
    bookinfo['bookName'] = data['book_name']
    bookinfo['price'] = data['group_price']
    bookinfo['iconLink'] = data['group_image']
    print(bookinfo)

The loads() method is called here to convert the returned JSON text into a Python dictionary, which makes the data easy to access.
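Equivalently, requests can do the parsing for us with its .json() shortcut:

result = response.json()  # same result as json.loads(response.text)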

After getting the data, we decided to save it to disk as a CSV file (which Excel can open) for further analysis.
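A side note on the design: the helper functions in the full code below reopen the CSV file in append mode for every single row, which is simple but slow. A sketch of the open-once alternative (rows here is a hypothetical list of bookinfo dicts collected beforehand):

import csv

rows = []  # hypothetical: bookinfo dicts collected during crawling

with open('bookInfo.csv', 'w', encoding='gb18030', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['bookName', 'price', 'iconLink'])
    writer.writeheader()
    writer.writerows(rows)  # one open() call for the whole file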


The complete code for this crawler experiment is as follows:

import requests
from bs4 import BeautifulSoup
import json
import csv

def parse_one_page():
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36',
        'Host': 'tuan.bookschina.com',
        'Referer': 'http://tuan.bookschina.com/',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9'
    }
    url = 'http://tuan.bookschina.com/'
    response = requests.get(url, headers=header)  # send the request with browser-like headers
    response.encoding = 'utf-8'
    soup = BeautifulSoup(response.text, 'html.parser')
    for item in soup.select('div .taoListInner ul li'):
        print(item.select('h2')[0].text)           # select() returns a list
        print(item.select('.salePrice')[0].text)
        print(item.select('img')[0].get('src'))    # get() reads an attribute value from the tag


def dynamic_crawl_data(page, headers, fileName):
    for i in range(1, page + 1):  # page numbering starts at 1
        url = 'http://tuan.bookschina.com/Home/GroupList?Type=0&Category=0&Price=0&Order=11&Page=' + str(i) + '&Tyjson=true'
        response = requests.get(url)
        result = json.loads(response.text)
        for data in result['Data']:
            bookinfo = {}
            bookinfo['bookName'] = data['book_name']
            bookinfo['price'] = data['group_price']
            bookinfo['iconLink'] = data['group_image']
            write_csv_rows(fileName, headers, bookinfo)
        print(url)  # progress indicator: which page was just crawled


def write_csv_headers(path, headers):
    with open(path, 'a', encoding='gb18030', newline='') as f:
        f_csv = csv.DictWriter(f, headers)
        f_csv.writeheader()


def write_csv_rows(path, headers, rows):
    with open(path, 'a', encoding='gb18030', newline='') as f:
        f_csv = csv.DictWriter(f, headers)
        # if rows is a single dict, write one row; otherwise write multiple rows
        if isinstance(rows, dict):
            f_csv.writerow(rows)
        else:
            f_csv.writerows(rows)


def main(page):
    # parse_one_page()  # tip: BeautifulSoup test for the first page
    csv_filename = "bookInfo.csv"
    headers = ['bookName', 'price', 'iconLink']
    write_csv_headers(csv_filename, headers)
    dynamic_crawl_data(page, headers, csv_filename)


if __name__ == '__main__':
    main(20)  # number of pages to crawl
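To sanity-check the result, the generated file can be read back with csv.DictReader, using the same file name and encoding as above:

import csv

with open('bookInfo.csv', encoding='gb18030', newline='') as f:
    for row in csv.DictReader(f):
        print(row['bookName'], row['price'])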
