Table of contents
Basic idea of resumable collection of list pages
Implementing resumable collection of list pages
Basic idea of resumable collection of detail pages
Implementing resumable collection of detail pages
Screenshots of continuous collection
1. Hands-on scenario
How to implement resumable collection of web pages with Python + pandas
2. Knowledge points
Basic Python syntax
Python file reading and writing
pandas data processing
Resumable web collection
3. Rookie hands-on
Basic idea of resumable collection of list pages
The basic idea
When collecting list pages, the collected data is saved to a file. On each run, the file is read to determine which page was collected last, so collection can resume from the next page.
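The resume logic can be sketched as a minimal standalone function. The file name `list_data.csv` and the column name `page` are placeholders for this example; the article's own code uses the class's configured file path and a Chinese column header.

```python
import os
import pandas as pd

# Placeholder names for the example
DATA_FILE = "list_data.csv"
PAGE_START = 1

def next_start_page(data_file=DATA_FILE):
    # Resume from the page after the largest page number already saved;
    # start from PAGE_START when no data file exists yet.
    if os.path.isfile(data_file):
        df = pd.read_csv(data_file, usecols=["page"])
        return int(df["page"].max()) + 1
    return PAGE_START
```

Because the maximum page number is recomputed from the saved file on every run, the scraper can be stopped and restarted at any time without re-collecting finished pages.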
Implementing resumable collection of list pages
def __init__(self):
    # Initialize logging
    self.init_log()
    # By default, start collecting from page 1
    start_page = self.PAGE_START
    list_file_path = self.fileManger.get_data_file_path(self.list_data_file)
    if os.path.isfile(list_file_path):
        # Read the list file to determine the last collected page,
        # so collection can resume from there
        self.logger.info("Data file exists")
        self.data_file_exist = True
        # Work out which page to start collecting from
        list_df = pd.read_csv(list_file_path, usecols=['第几页'], encoding=self.encoding)
        max_page = list_df['第几页'].max()
        start_page = int(max_page) + 1
    print("Collection page range: page [%s] to page [%s]" % (start_page, start_page + self.PAGE_STEP - 1))
    for page in range(start_page, start_page + self.PAGE_STEP):
        # Build the collection URL for this page
        url = self.target_url.replace("p1", "p" + str(page))
        # Construct the collection item
        url_item = UrlItem(url=url, page=page)
        self.url_items.append(url_item)
Saving data to a CSV file with pandas
def save_to_file(self, data, cols):
    # Save to file
    file_path = self.fileManger.get_data_file_path(self.list_data_file)
    # Build the DataFrame
    frame = pd.DataFrame(data)
    if not self.data_file_exist:
        # First write: include the header row and overwrite the file
        frame.columns = cols
        frame.to_csv(file_path, encoding=self.encoding, index=False)
        self.data_file_exist = True  # Update the data-file flag after writing
    else:
        # Subsequent writes: append without the header row
        frame.to_csv(file_path, mode="a", encoding=self.encoding, index=False, header=False)
    self.logger.debug("File saved")
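The write-header-once, append-afterwards pattern can be demonstrated in isolation. Here the existence of the file on disk stands in for the class's `data_file_exist` flag; the file and column names are made up for the example.

```python
import os
import pandas as pd

def save_rows(file_path, rows, cols):
    # Write rows to a CSV file: include the header row only on the
    # first write, then append without a header on later writes.
    frame = pd.DataFrame(rows, columns=cols)
    if not os.path.isfile(file_path):
        frame.to_csv(file_path, encoding="utf-8", index=False)
    else:
        frame.to_csv(file_path, mode="a", encoding="utf-8", index=False, header=False)
```

Passing `header=False` on the append path is what keeps a second header row from being mixed into the data, which would otherwise break the `max()` computation when the file is read back.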
Basic idea of resumable collection of detail pages
The basic idea
When collecting detail pages, the collected data is saved to a file. To avoid collecting the same item twice, each link is first checked against the collected-data file: if the link is already there, it is skipped; otherwise it is collected.
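The skip-if-already-collected check reduces to a membership test against the URLs in the detail file. The column name `url` and the in-memory `detail_df` below are illustrative stand-ins for the article's detail data file.

```python
import pandas as pd

# Stand-in for the detail records already loaded from disk
detail_df = pd.DataFrame({"url": [
    "https://example.com/item/1",
    "https://example.com/item/2",
]})

def is_collected(url, detail_df):
    # True if this link already appears in the collected detail data
    return url in set(detail_df["url"])
```

Building a `set` of collected URLs once makes each lookup O(1), which matters when the detail file grows to thousands of rows.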
Implementing resumable collection of detail pages
def __init__(self):
    # Initialize logging
    self.init_log()
    # Read the links waiting to be collected from the list file
    list_file_path = self.fileManger.get_data_file_path(self.list_data_file)
    list_df = pd.read_csv(list_file_path, encoding=self.encoding)
    self.url_items = list_df.values  # Initialize the array of links to collect
    detail_file_path = self.fileManger.get_data_file_path(self.detail_data_file)
    if os.path.isfile(detail_file_path):
        # Read the already-collected records from the detail file
        self.data_file_exist = True
        detail_df = pd.read_csv(detail_file_path, encoding=self.encoding)
        self.detail_df = detail_df
Saving data to a CSV file with pandas
def save_to_detail_file(self, data, cols):
    # Save to the detail file
    file_path = self.fileManger.get_data_file_path(self.detail_data_file)
    # Build the DataFrame
    frame = pd.DataFrame(data)
    if not self.data_file_exist:
        # First write: include the header row and overwrite the file
        frame.columns = cols
        frame.to_csv(file_path, encoding=self.encoding, index=False)
        self.data_file_exist = True  # Update the data-file flag after writing
    else:
        # Subsequent writes: append without the header row
        frame.to_csv(file_path, mode="a", encoding=self.encoding, index=False, header=False)
    self.logger.debug("File saved")
Running result
Collection page range: page [16] to page [20]
100%|██████████| 5/5 [00:14<00:00, 2.91s/it]
Python version: 3.10.4
Screenshots of continuous collection
Resource link
Rookie hands-on, keep learning!