How to implement resumable (breakpoint) web page collection with Python + pandas

Table of contents

1. Practical scenario

2. Knowledge points

Basic Python syntax

Python file reading and writing

pandas data processing

Resumable web page collection

3. Hands-on practice

Resumable collection of list pages

Basic idea

Code for resumable collection of list pages

Saving data to a CSV file with pandas

Resumable collection of detail pages

Basic idea

Code for resumable collection of detail pages

Saving data to a CSV file with pandas

Running results

Console output

Screenshots of continuous collection


1. Practical scenario

How to implement resumable collection of web pages with Python + pandas.

2. Knowledge points

Basic Python syntax

Python file reading and writing

pandas data processing

Resumable web page collection

3. Hands-on practice

Resumable collection of list pages

Basic idea

When collecting list pages, the scraped data is saved to a file. On each run, that file is read to determine which page was collected last, so that collection can resume from the following page.
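
For illustration, here is a minimal, self-contained sketch of that resume logic (the file name, column name, and page step are hypothetical, not taken from the article's project); the article's own implementation follows below.

import os
import pandas as pd

LIST_FILE = "list_data.csv"   # hypothetical checkpoint file
PAGE_START = 1                # default first page
PAGE_STEP = 5                 # hypothetical number of pages collected per run

start_page = PAGE_START
if os.path.isfile(LIST_FILE):
    # The checkpoint is simply the largest page number already written to the CSV
    saved = pd.read_csv(LIST_FILE, usecols=["page"])
    start_page = int(saved["page"].max()) + 1

print("Collecting pages %d to %d" % (start_page, start_page + PAGE_STEP - 1))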

Code for resumable collection of list pages

def __init__(self):
    # Initialize logging
    self.init_log()

    # By default, start collecting from the first page
    start_page = self.PAGE_START

    list_file_path = self.fileManger.get_data_file_path(self.list_data_file)
    if os.path.isfile(list_file_path):
        # The list data file exists: read it to find the last collected page,
        # so that collection can resume from the next page
        self.logger.info("List data file exists")
        self.data_file_exist = True
        # Work out which page to resume from
        list_df = pd.read_csv(list_file_path, usecols=['第几页'], encoding=self.encoding)
        max_page = int(list_df['第几页'].max())  # largest page number already collected
        start_page = max_page + 1

    print("Collection page range: page [%s] to page [%s]" % (start_page, start_page + self.PAGE_STEP - 1))

    for page in range(start_page, start_page + self.PAGE_STEP):
        # Build the URL of the page to collect
        url = self.target_url.replace("p1", "p" + str(page))
        # Wrap it in a collection item
        url_item = UrlItem(url=url, page=page)
        self.url_items.append(url_item)

Saving data to a CSV file with pandas

def save_to_file(self, data, cols):
    # Save the collected list data to the CSV file
    file_path = self.fileManger.get_data_file_path(self.list_data_file)

    # Build a DataFrame from the collected rows
    frame = pd.DataFrame(data)
    if not self.data_file_exist:
        # First write: include the column headers and overwrite any existing file
        frame.columns = cols
        frame.to_csv(file_path, encoding=self.encoding, index=False)
        self.data_file_exist = True  # mark the data file as existing after the first write
    else:
        # Subsequent writes: append rows without repeating the headers
        frame.to_csv(file_path, mode="a", encoding=self.encoding, index=False, header=False)

    self.logger.debug("File saved")

Resumable collection of detail pages

Basic idea

When collecting detail pages, the scraped data is saved to a file. To avoid collecting the same page twice, each URL is first checked against the already-collected data file: if it is present, the URL is skipped; otherwise it is collected.
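
The code below only shows loading the already-collected details; the skip check itself is not shown in the article. A minimal sketch of such a check, assuming a hypothetical "url" column in the detail CSV, could look like this:

import os
import pandas as pd

DETAIL_FILE = "detail_data.csv"   # hypothetical file of already-collected detail pages

# URLs collected in previous runs (empty on the first run)
collected_urls = set()
if os.path.isfile(DETAIL_FILE):
    collected_urls = set(pd.read_csv(DETAIL_FILE, usecols=["url"])["url"])

def collect_detail(url):
    # Skip URLs that are already in the detail data file; collect the rest
    if url in collected_urls:
        print("skip: %s" % url)
        return
    print("collect: %s" % url)
    # ... fetch and parse the detail page, then append one row to DETAIL_FILE ...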

Code for resumable collection of detail pages

def __init__(self):
    # Initialize logging
    self.init_log()

    # Read the URLs waiting to be collected from the list data file
    list_file_path = self.fileManger.get_data_file_path(self.list_data_file)
    list_df = pd.read_csv(list_file_path, encoding=self.encoding)
    self.url_items = list_df.values  # array of URLs waiting to be collected

    detail_file_path = self.fileManger.get_data_file_path(self.detail_data_file)
    if os.path.isfile(detail_file_path):
        # Read the already-collected details so that collected URLs can be skipped
        self.data_file_exist = True
        detail_df = pd.read_csv(detail_file_path, encoding=self.encoding)
        self.detail_df = detail_df

Saving data to a CSV file with pandas

def save_to_detail_file(self, data, cols):
    # Save the collected detail data to the CSV file
    file_path = self.fileManger.get_data_file_path(self.detail_data_file)

    # Build a DataFrame from the collected rows
    frame = pd.DataFrame(data)

    if not self.data_file_exist:
        # First write: include the column headers and overwrite any existing file
        frame.columns = cols
        frame.to_csv(file_path, encoding=self.encoding, index=False)
        self.data_file_exist = True  # mark the data file as existing after the first write
    else:
        # Subsequent writes: append rows without repeating the headers
        frame.to_csv(file_path, mode="a", encoding=self.encoding, index=False, header=False)

    self.logger.debug("File saved")

Running results

Console output

Collection page range: page [16] to page [20]

100%|██████████| 5/5 [00:14<00:00, 2.91s/it]

python version 3.10.4

Screenshots of continuous collection


Resource link

Source code - How Python + pandas implements resumable collection of web pages - Python Documentation Resources - CSDN Download

Hands-on practice, keep learning!


Origin: blog.csdn.net/qq_39816613/article/details/128619530