Scraping MATLAB Documentation Data with Python and Selenium

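When installing MATLAB, you have to tick the product components you want, but the list has 112 entries and I did not recognize most of them. Selecting everything takes up a lot of disk space, while selecting only the ones I recognized risked leaving out something I needed.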

I stored the names of the 112 components in components.txt, one name per line.

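The scraped results are published in a separate post, "Matlab安装勾选产品说明" (notes on the product checkboxes in the MATLAB installer), on Toblerone_Wind's CSDN blog.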

MATLAB
Simulink
5G Toolbox
Aerospace Blockset
Aerospace Toolbox
Antenna Toolbox
Audio Toolbox
Automated Driving Toolbox
AUTOSAR Blockset
Bioinformatics Toolbox
Bluetooth Toolbox
C2000 Microcontroller Blockset
Communications Toolbox
Computer Vision Toolbox
Control System Toolbox
Curve Fitting Toolbox
Data Acquisition Toolbox
Database Toolbox
Datafeed Toolbox
DDS Blockset
Deep Learning HDL Toolbox
Deep Learning Toolbox
DSP HDL Toolbox
DSP System Toolbox
Econometrics Toolbox
Embedded Coder
Filter Design HDL Coder
Financial Instruments Toolbox
Financial Toolbox
Fixed-Point Designer
Fuzzy Logic Toolbox
Global Optimization Toolbox
GPU Coder
HDL Coder
HDL Verifier
Image Acquisition Toolbox
Image Processing Toolbox
Industrial Communication Toolbox
Instrument Control Toolbox
Lidar Toolbox
LTE Toolbox
Mapping Toolbox
MATLAB Coder
MATLAB Compiler
MATLAB Compiler SDK
MATLAB Report Generator
MATLAB Test
Medical Imaging Toolbox
Mixed-Signal Blockset
Model Predictive Control Toolbox
Model-Based Calibration Toolbox
Motor Control Blockset
Navigation Toolbox
Optimization Toolbox
Parallel Computing Toolbox
Partial Differential Equation Toolbox
Phased Array System Toolbox
Powertrain Blockset
Predictive Maintenance Toolbox
Radar Toolbox
Reinforcement Learning Toolbox
Requirements Toolbox
RF Blockset
RF PCB Toolbox
RF Toolbox
Risk Management Toolbox
Robotics System Toolbox
Robust Control Toolbox
ROS Toolbox
Satellite Communications Toolbox
Sensor Fusion and Tracking Toolbox
SerDes Toolbox
Signal Integrity Toolbox
Signal Processing Toolbox
SimBiology
SimEvents
Simscape
Simscape Battery
Simscape Driveline
Simscape Electrical
Simscape Fluids
Simscape Multibody
Simulink 3D Animation
Simulink Check
Simulink Code Inspector
Simulink Coder
Simulink Compiler
Simulink Control Design
Simulink Coverage
Simulink Design Optimization
Simulink Design Verifier
Simulink Desktop Real-Time
Simulink PLC Coder
Simulink Real-Time
Simulink Report Generator
Simulink Test
SoC Blockset
Spreadsheet Link
Stateflow
Statistics and Machine Learning Toolbox
Symbolic Math Toolbox
System Composer
System Identification Toolbox
Text Analytics Toolbox
UAV Toolbox
Vehicle Dynamics Blockset
Vehicle Network Toolbox
Vision HDL Toolbox
Wavelet Toolbox
Wireless HDL Toolbox
Wireless Testbench
WLAN Toolbox
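
Before scraping, it can help to clean the name list and check it for accidental duplicates. The following is only a sketch: parse_components and find_duplicates are hypothetical helper names that do not appear in the original script, and the short sample string stands in for the real contents of components.txt.

```python
def parse_components(text):
    """Return the non-empty lines of text with surrounding whitespace removed."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def find_duplicates(names):
    """Return the names that occur more than once, in first-seen order."""
    seen = set()
    duplicates = []
    for name in names:
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    return duplicates


# Sample text standing in for components.txt (assumption: one name per line).
sample = "MATLAB\nSimulink\n\n5G Toolbox \nSimulink\n"
names = parse_components(sample)
print(names)
print(find_duplicates(names))
```

For the real file, the text could be read with open("components.txt", encoding="utf-8").read() and passed to parse_components in the same way.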

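Take the 4th product, Aerospace Blockset, as an example. To find out what it actually does, open the official MATLAB documentation page and look it up in the search box in the upper-right corner.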
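
The same search can be expressed as a URL, which is what makes it scriptable. The sketch below uses the search address that appears in the script later in this post; build_search_url is a hypothetical helper name, and encoding spaces as "+" is an assumption about what the site accepts rather than something the original script does.

```python
from urllib.parse import quote_plus

# Search endpoint taken from the script in this post.
SEARCH_URL = "https://ww2.mathworks.cn/support/search.html?q="


def build_search_url(component):
    """Append the URL-encoded component name to the search endpoint."""
    return SEARCH_URL + quote_plus(component)


print(build_search_url("Aerospace Blockset"))
```

Encoding the query explicitly avoids relying on the browser to fix up spaces or other special characters in a product name.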

The first search result is the documentation for that component, and its text has the form "product name - description".
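
Since the result text has the form "product name - description", the two parts can be separated on the first " - ". This is a minimal sketch that assumes the separator is an ASCII hyphen with a space on each side; split_result is a hypothetical helper name, and the description strings are placeholders rather than real search results.

```python
def split_result(text):
    """Split 'name - description' into (name, description).

    Splitting on ' - ' (with spaces) leaves hyphens inside a product name,
    such as 'Fixed-Point Designer', untouched. If no separator is found,
    the name is returned empty and the whole text is the description.
    """
    name, sep, description = text.partition(" - ")
    if not sep:
        return "", text.strip()
    return name.strip(), description.strip()


print(split_result("Aerospace Blockset - placeholder description"))
print(split_result("Fixed-Point Designer - placeholder description"))
```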

We also need its link, so that when the description alone is not clear enough we can open the documentation page and read further.

In short, we need two things: the product description and the documentation link.

Press F12 to open the element inspector, then click the element-picker button.

Next, move the mouse over a search result and click it; the corresponding element is highlighted in the panel on the right.

A single top-level <div id="results_container"> contains a number of search results, each one a <div class="search_result_content">.

<p class="search_result_title"> holds the link we need, but the link has to be cleaned up by removing the trailing index.html?xxx.
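
One way to strip the index.html?xxx part is to parse the URL instead of slicing it by position. The sketch below uses the standard-library urllib.parse module; doc_root is a hypothetical helper name, and the URL is the placeholder pattern used in this post, not a real documentation address.

```python
from urllib.parse import urlsplit, urlunsplit


def doc_root(url):
    """Drop the query string and a trailing '/index.html' from url."""
    parts = urlsplit(url)
    path = parts.path
    suffix = "/index.html"
    if path.endswith(suffix):
        path = path[: -len(suffix)]
    # Rebuild the URL without query or fragment, and without a trailing slash.
    return urlunsplit((parts.scheme, parts.netloc, path.rstrip("/"), "", ""))


print(doc_root("https://ww2.mathworks.cn/help/component/index.html?xxx"))
```

Parsing the URL means a link that has no query string, or that does not end in index.html, comes back trimmed sensibly instead of being cut at the wrong position.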

<span class="search_result_summary"> holds the description we need.

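Select the element you need, right-click it, and choose Copy XPath. The XPath is then used to locate the element and read its contents.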
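
As an alternative to position-based XPaths, the class names described above can be used to pick out the same two pieces of information. The sketch below parses a hand-written HTML fragment with the standard-library html.parser module so that it runs without a browser. The fragment is hypothetical: it only mirrors the id and class names mentioned in this post, and the real page may nest its elements differently. FirstResultParser is likewise an illustrative name.

```python
from html.parser import HTMLParser

# Hypothetical fragment modeled on the structure described in this post.
SAMPLE_HTML = """
<div id="results_container">
  <div class="search_result_content">
    <p class="search_result_title">
      <a href="https://ww2.mathworks.cn/help/component/index.html?xxx">Component</a>
    </p>
    <span class="search_result_summary">Placeholder summary</span>
  </div>
</div>
"""


class FirstResultParser(HTMLParser):
    """Collect the first title link and the first summary text."""

    def __init__(self):
        super().__init__()
        self.href = None
        self.summary_parts = []
        self._in_title = False
        self._in_summary = False
        self._summary_done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "p" and "search_result_title" in classes:
            self._in_title = True
        elif tag == "a" and self._in_title and self.href is None:
            self.href = attrs.get("href")
        elif (tag == "span" and "search_result_summary" in classes
              and not self._summary_done):
            self._in_summary = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_title = False
        elif tag == "span" and self._in_summary:
            self._in_summary = False
            self._summary_done = True

    def handle_data(self, data):
        if self._in_summary:
            self.summary_parts.append(data)


parser = FirstResultParser()
parser.feed(SAMPLE_HTML)
first_href = parser.href
first_summary = "".join(parser.summary_parts).strip()
print(first_href)
print(first_summary)
```

In Selenium, a comparable lookup would pass a class-based CSS selector such as "#results_container .search_result_title a" to find_element with By.CSS_SELECTOR. That selector has not been checked against the live page.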

With that, the code is fairly straightforward to write:

from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from tqdm import tqdm

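# input: 'https://ww2.mathworks.cn/help/component/index.html?xxx'
# output: 'https://ww2.mathworks.cn/help/component'
def removeURL(url):
    url = url.split("?", 1)[0]     # drop the query string, if there is one
    suffix = "/index.html"
    if url.endswith(suffix):
        url = url[:-len(suffix)]   # drop the trailing '/index.html'
    return url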

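# input: 'component - summary'
# output: 'summary'
def removeSummary(summary: str):
    # Split on the first ' - ' (with spaces) so that hyphenated product names
    # such as 'Fixed-Point Designer' are not cut in the middle.
    _, sep, rest = summary.partition(" - ")
    return rest.strip() if sep else summary.strip()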

# refer to: https://blog.csdn.net/yuan2019035055/article/details/122294472
def is_contains_chinese(strs):
    for _char in strs:
        if '\u4e00' <= _char <= '\u9fa5':
            return True
    return False

# Read the product component names, skipping blank lines
components = []
with open("components.txt", "r", encoding="utf-8") as f:
    for line in f:
        name = line.strip()
        if name:
            components.append(name)

# print(components)
searchUrl = 'https://ww2.mathworks.cn/support/search.html?q='
driver = Chrome()

urls = []          # documentation links
en_summaries = []  # English descriptions
cn_summaries = []  # Chinese descriptions

# Loop over all product component names, with a tqdm progress bar
for component in tqdm(components):

    driver.get(searchUrl+component)  # open the search page

    # Locators for the elements of the first search result
    url_locator = (By.XPATH, '//*[@id="results_container"]/div[1]/div[1]/p[1]/a')
    summary_locator = (By.XPATH, '//*[@id="results_container"]/div[1]/div[1]/p[1]')
    
    # Explicit wait: block for up to 20 seconds until the link element is present
    WebDriverWait(driver,20).until(EC.presence_of_element_located(url_locator))
    
    # Read the information from the located elements
    url = driver.find_element(*url_locator).get_attribute("href") # link
    summary = driver.find_element(*summary_locator).text          # text

    # Clean up the link and the text
    url = removeURL(url)
    summary = removeSummary(summary)
    
    # Append the link to urls
    urls.append(url)

    # Route the description by language; the other column gets an empty string
    if is_contains_chinese(summary): 
        cn_summaries.append(summary)
        en_summaries.append("")
    else:
        en_summaries.append(summary)
        cn_summaries.append("")

driver.quit()  # close the browser once scraping is done

# Save the results
import pandas as pd
df = pd.DataFrame()
df['component'] = components
df['url'] = urls
df['en_summary'] = en_summaries
df['cn_summary'] = cn_summaries
df.to_csv('a.csv', index=False)

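Because the international MATLAB website is slow to load for me, I scraped the Chinese-language site for China instead, so some of the descriptions come back in Chinese. The code therefore includes a function that checks whether a description contains Chinese characters: if it does, it is treated as a Chinese description; otherwise it is treated as an English one.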
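
The language check and the two-column split can be tried in isolation. The sketch below restates the same character-range test in compact form; contains_chinese and split_by_language are illustrative names, and the sample descriptions are made-up placeholders rather than scraped text.

```python
def contains_chinese(text):
    """Return True if any character lies in the range U+4E00 to U+9FA5."""
    return any("\u4e00" <= ch <= "\u9fa5" for ch in text)


def split_by_language(summaries):
    """Return (en, cn) lists of equal length.

    Each description goes into one list and an empty string goes into the
    other, so both lists stay aligned with the input order.
    """
    en, cn = [], []
    for summary in summaries:
        if contains_chinese(summary):
            cn.append(summary)
            en.append("")
        else:
            en.append(summary)
            cn.append("")
    return en, cn


# Placeholder descriptions, including one that mixes both languages.
samples = ["placeholder description", "占位描述", "使用 MATLAB 的占位描述"]
en, cn = split_by_language(samples)
print(en)
print(cn)
```

A description that mixes Chinese and English counts as Chinese under this rule, since a single Chinese character is enough to match.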

The comments in the code are fairly detailed, so I will not go through it line by line.


Reposted from blog.csdn.net/qq_42276781/article/details/129962258