Python web crawler - crawling new book recommendation information from People's Posts and Telecommunications Publishing House

 

This program crawls new book recommendations from the website of People's Posts and Telecommunications Publishing House. It uses the requests library to fetch the list of new books and the details of each book, then saves the data to an Excel file with openpyxl. The steps are as follows:

  1. Import the required libraries: requests, json and openpyxl.
  2. Define the URL used in the GET request for the new book recommendation list.
  3. Set the request headers, including User-Agent and Cookie.
  4. Send the HTTP request with the get() method of the requests library and parse the response body as JSON.
  5. Define a function save_excel() that creates the Excel file and saves the data. It first creates a Workbook object and takes its active Worksheet.
  6. Write a header row to the worksheet, then iterate over each book in the list, call json_detail() to get the book's details, and append them as a row.
  7. Save the workbook as an Excel file with the save() method of the Workbook object.
  8. Define a function json_detail() that gets the details of one book. It sends a POST request and passes the book's ID in the bookId parameter.
  9. Parse the response body as JSON, extract the book's author and discount price, and return them.
  10. Finally, call save_excel() to save the collected data to an Excel file.

The code breaks down into the following parts:

  1. Import the libraries
import requests
import json
import openpyxl

This part imports the three libraries the program uses: requests, json and openpyxl.

  2. Define the request URL and request headers
url = 'https://www.ptpress.com.cn/recommendBook/getRecommendBookListForPortal?bookTagId=d5cbb56d-09ef-41f5-9110-ced741048f5f'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36 Edg/95.0.1020.44',
    'Cookie': '...'
}

This section defines the request URL and the request headers. The URL is the interface address for the new book recommendation list. The headers contain a User-Agent and a Cookie; the Cookie value is abbreviated here and given in full in the complete code below.
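
When several requests share the same headers, a requests.Session can hold them in one place. The sketch below is an illustration and not part of the original program: the helper name make_session and the example User-Agent string are assumptions, and this article does not establish whether the site actually requires the Cookie.

```python
import requests


def make_session(user_agent, cookie=None):
    """Return a Session whose headers are sent with every request it makes."""
    session = requests.Session()
    session.headers.update({'User-Agent': user_agent})
    if cookie:
        # Only set the Cookie header when a value is supplied.
        session.headers['Cookie'] = cookie
    return session


# Hypothetical User-Agent value, used only for this demonstration.
session = make_session('Mozilla/5.0 (example)')
print(session.headers['User-Agent'])
```

With such a session, session.get(url) and session.post(url, ...) would take the place of requests.get and requests.post, with no headers argument needed on each call.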

  3. Send the request and parse the JSON response
text_json = requests.get(url=url, headers=headers)
res = json.loads(text_json.content)

A GET request fetches the JSON data for the new book recommendations. json.loads then parses the response body into a Python object, which is stored in the variable res.
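
The parsing step can be tried without touching the network. The following is a minimal sketch under stated assumptions: the function name extract_books and the sample body are hypothetical, and the only fields used are the ones the article's own code reads (data, bookId and bookName).

```python
import json


def extract_books(raw_body):
    """Parse a list-response body and return (bookId, bookName) pairs."""
    payload = json.loads(raw_body)
    # Treat a missing or null 'data' field as an empty list.
    books = payload.get('data') or []
    return [(book['bookId'], book['bookName']) for book in books]


# Hypothetical sample body, shaped like the fields the article's code reads.
sample = b'{"data": [{"bookId": "id-1", "bookName": "Sample Book"}]}'
print(extract_books(sample))  # [('id-1', 'Sample Book')]
```

json.loads accepts bytes as well as str, which is why the article can pass the response's content attribute to it directly.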

  4. Define the save_excel function, which saves the data to Excel
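def save_excel(res):
    wb1 = openpyxl.Workbook()
    sheet = wb1.active
    # Sheet title: "People's Posts and Telecommunications New Book Recommendations"
    sheet.title = "人民邮电新书推荐"
    # Header row: book title, author, price
    title = ['书名', '作者', '价格']
    sheet.append(title)

    # 'book' avoids shadowing the name of the standard re module
    for book in res['data']:
        author, discountPrice = json_detail(book['bookId'])
        sheet.append([book['bookName'], author, discountPrice])

    # File name: "Basic information on new lifestyle books"
    wb1.save('生活类新书基本信息.xlsx')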

This function saves the new book recommendations to an Excel file. It creates a workbook and renames its active worksheet to "人民邮电新书推荐" ("People's Posts and Telecommunications New Book Recommendations"). It then appends a header row (book title, author, price), iterates over the books in the list, calls json_detail to get each book's author and discount price, and appends one row per book. Finally, it saves the workbook as 生活类新书基本信息.xlsx ("Basic information on new lifestyle books").
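
The worksheet-building part can be separated from the network calls so that it can be checked on its own. This is a sketch and not the author's code: build_workbook, the English header labels and the sample row are assumptions, and the workbook is saved to an in-memory buffer here so that the demonstration leaves no file behind.

```python
import io

import openpyxl


def build_workbook(rows, sheet_title='New books'):
    """Create a workbook with a header row followed by one row per book."""
    wb = openpyxl.Workbook()
    sheet = wb.active
    sheet.title = sheet_title
    sheet.append(['Title', 'Author', 'Price'])
    for row in rows:
        sheet.append(list(row))
    return wb


# Hypothetical sample row: (book name, author, discount price).
rows = [('Sample Book', 'A. Author', 39.9)]
wb = build_workbook(rows)

# Save to memory and read the workbook back to inspect what was written.
buffer = io.BytesIO()
wb.save(buffer)
buffer.seek(0)
reloaded = openpyxl.load_workbook(buffer)
written = list(reloaded.active.iter_rows(values_only=True))
print(written)
```

Passing a file name to wb.save, as the article does, writes the same content to disk.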

  5. Define the json_detail function, which gets book details
def json_detail(bookid):
    url = 'https://www.ptpress.com.cn/bookinfo/getBookDetailsById'
    # requests sends params in the URL query string, even for a POST
    params = {
        'bookId': bookid,
    }
    text_json = requests.post(url=url, headers=headers, params=params)
    # The book's fields sit under the 'data' key of the response
    res = json.loads(text_json.content)['data']
    author = res['author']
    discountPrice = res['discountPrice']
    print(res['bookName'], author, discountPrice)
    return author, discountPrice

This function gets the details of a book by its ID. It sets the URL of the book-details interface and puts the book ID in the bookId parameter, then sends a POST request. Because the ID is passed through params, requests appends it to the URL as a query string instead of placing it in the request body. The JSON response is parsed into a dictionary, and the author and discount price are read from its data field. Finally, the function prints the book's name, author, and discount price, and returns the author and discount price.

  6. Call the function to save to Excel
save_excel(res)

In this part of the code, the save_excel function is called to save the new book recommendation information into an Excel file.

Complete code:

import requests
import json
import openpyxl

url = 'https://www.ptpress.com.cn/recommendBook/getRecommendBookListForPortal?bookTagId=d5cbb56d-09ef-41f5-9110-ced741048f5f'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36 Edg/95.0.1020.44',
    'Cookie': 'gr_user_id=796019e3-dc58-40f5-a6df-892a38008bcd; '
              'acw_tc=2760822416373059896443147efcf3dd457a5539d63a07fdafd12f3041cd93; '
              'JSESSIONID=A0FD72E84771D06417CF145392DAA679; '
              'gr_session_id_9311c428042bb76e=1a1d8cc2-0de9-4409-adc4-07de4cdb503f;'
              ' gr_session_id_9311c428042bb76e_1a1d8cc2-0de9-4409-adc4-07de4cdb503f=true'
}
text_json = requests.get(url=url, headers=headers)
res = json.loads(text_json.content)


def save_excel(res):
    wb1 = openpyxl.Workbook()
    sheet = wb1.active
    # Sheet title: "People's Posts and Telecommunications New Book Recommendations"
    sheet.title = "人民邮电新书推荐"
    # Header row: book title, author, price
    title = ['书名', '作者', '价格']
    sheet.append(title)

    # One detail request and one worksheet row per book in the list
    for book in res['data']:
        author, discountPrice = json_detail(book['bookId'])
        sheet.append([book['bookName'], author, discountPrice])

    # File name: "Basic information on new lifestyle books"
    wb1.save('生活类新书基本信息.xlsx')


def json_detail(bookid):
    url = 'https://www.ptpress.com.cn/bookinfo/getBookDetailsById'
    # requests sends params in the URL query string, even for a POST
    params = {
        'bookId': bookid,
    }
    text_json = requests.post(url=url, headers=headers, params=params)
    res = json.loads(text_json.content)['data']
    author = res['author']
    discountPrice = res['discountPrice']
    print(res['bookName'], author, discountPrice)
    return author, discountPrice


save_excel(res)


Origin blog.csdn.net/weixin_66547608/article/details/134126857