Complete tutorial on crawling a website's document data with Python (with source code)

Preface

The text and images in this article are from the Internet and are for learning and communication purposes only; they are not for any commercial use. If you have any concerns, please contact us and we will handle them.

Basic development environment

Python 3.6

PyCharm

Related modules used

import os
import requests
import time
import re
import json
from docx import Document
from docx.shared import Cm

Install Python and add it to the environment variables, then use pip to install the required modules.
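
Of the modules above, os, time, re, and json are part of the standard library; the two third-party packages can be installed with pip (python-docx is the package that provides the docx module):

pip install requests
pip install python-docx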


Web page analysis


The document content on this website is rendered as images, and the site exposes its own data interface for them.

Interface link:

https://openapi.book118.com/getPreview.html?&project_id=1&aid=272112230&t=f2c66902d6b63726d8e08b557fef90fb&view_token=SqX7ktrZ_ZakjDI@vcohcCwbn_PLb3C1&page=1&callback=jQuery18304186406662159248_1614492889385&_=1614492889486

Interface request parameters

As can be read from the link above, the request carries project_id, aid, t, view_token, page, a jQuery callback name, and a millisecond timestamp (_).

The overall approach

  • Request the web page and get the response data (a string)
  • Use the re module to match and extract the data in the middle (a list), then take index 0 (a string)
  • Use the json module to convert the extracted string into a Python dictionary (see the sketch after this list)
  • Traverse the dictionary to get the URL of each image
  • Save each image to a local folder
  • Insert the images into a Word document
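
To make the extraction steps concrete, here is a minimal sketch (the response string below is a made-up example of the JSONP format the interface returns):

import re
import json

# Made-up example of a JSONP response body
text = 'jsonpReturn({"status": 200, "data": {"1": "//view-cache.book118.com/1.jpg"}});'

# Match the content inside jsonpReturn(...) and take index 0
result = re.findall(r'jsonpReturn\((.*?)\)', text)[0]

# Parse the JSON string into a Python dictionary
json_data = json.loads(result)['data']
print(json_data)  # {'1': '//view-cache.book118.com/1.jpg'}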

Crawler code implementation

def download():
    content = 0
    # Page numbers step by 6: each request returns six pages of images
    for page in range(1, 96, 6):
        # Wait 2 seconds between requests
        time.sleep(2)
        # Current timestamp in milliseconds
        now_time = int(time.time() * 1000)
        url = 'https://openapi.book118.com/getPreview.html'
        # Request parameters
        params = {
            'project_id': '1',
            'aid': '272112230',
            't': 'f2c66902d6b63726d8e08b557fef90fb',
            'view_token': 'SqX7ktrZ_ZakjDI@vcohcCwbn_PLb3C1',
            'page': f'{page}',
            '_': now_time,
        }
        # Request headers
        headers = {
            'Host': 'openapi.book118.com',
            'Referer': 'https://max.book118.com/html/2020/0427/8026036013002110.shtm',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
        }
        response = requests.get(url=url, params=params, headers=headers)
        # Extract the JSONP payload with a regular expression
        result = re.findall(r'jsonpReturn\((.*?)\)', response.text)[0]
        # Parse the JSON string and take the 'data' field
        json_data = json.loads(result)['data']
        # Iterate over the dictionary values (the image URLs)
        for value in json_data.values():
            content += 1
            # Build the full image URL
            img_url = 'http:' + value
            print(img_url)
            headers_1 = {
                'Host': 'view-cache.book118.com',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
            }
            # Request the image URL and get its binary content
            img_content = requests.get(url=img_url, headers=headers_1).content
            # File name: a sequential number
            img_name = str(content) + '.jpg'
            # Save directory (created if it does not exist)
            filename = 'img/'
            os.makedirs(filename, exist_ok=True)
            # Save in binary mode (images, audio, and video are all binary files)
            with open(filename + img_name, mode='wb') as f:
                f.write(img_content)

Important points:

1. A delay must be added, otherwise the interface will not return data.

2. When requesting the image URL, the headers must be written out completely (including Host), otherwise the saved image cannot be opened.

3. The file names are best kept as sequential numbers (1.jpg, 2.jpg, ...) so the images can later be written to the Word document in order.

The crawler part of the code is relatively simple; there is nothing especially difficult about it.

Since the point of crawling these documents is to print or consult them, the individual images need to be combined into a single Word document.

Writing the Word document

def save_picture():
    document = Document()
    path = './img/'
    lis = os.listdir(path)
    # Strip the '.jpg' extension so the names can be sorted numerically
    c = []
    for li in lis:
        index = li.replace('.jpg', '')
        c.append(index)
    c_1 = sorted(list(map(int, c)))
    print(c_1)
    # Rebuild the file names in numeric order: 1.jpg, 2.jpg, ...
    new_files = [(str(i) + '.jpg') for i in c_1]
    for num in new_files:
        img_path = path + num
        # Insert one image per page, sized to roughly fill the page
        document.add_picture(img_path, width=Cm(17), height=Cm(24))
        os.remove(img_path)  # delete the image saved locally
    document.save('tu.docx')  # save the document (python-docx writes .docx, not .doc)
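
For completeness, the two functions can be run in order with a simple entry point (a minimal sketch; the original post does not show how it invokes them):

if __name__ == '__main__':
    download()      # fetch all page images into ./img/
    save_picture()  # merge them into tu.docx and delete the local images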


Source: blog.csdn.net/m0_48405781/article/details/114398129