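I. Environment setup
Download: Python official site: https://www.python.org/
Installation: after installing, configure the environment variables so that python and pip can be run from the command line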
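II. Installing the crawler's dependency
The example in this article uses the BeautifulSoup library.
Installation steps:
1. Run this command in cmd: pip install beautifulsoup4
2. If the pip command is not recognized, go to step 3.
3. Change to the Scripts directory under the Python installation directory and run the command there, as follows: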
C:\Users\admin\AppData\Local\Programs\Python\Python36-32\Scripts>pip install beautifulsoup4
Collecting beautifulsoup4
Downloading https://files.pythonhosted.org/packages/3b/c8/a55eb6ea11cd7e5ac4bacdf92bac4693b90d3ba79268be16527555e186f0/beautifulsoup4-4.8.1-py3-none-any.whl (101kB)
100% |████████████████████████████████| 102kB 11kB/s
Collecting soupsieve>=1.2 (from beautifulsoup4)
Downloading https://files.pythonhosted.org/packages/81/94/03c0f04471fc245d08d0a99f7946ac228ca98da4fa75796c507f61e688c2/soupsieve-1.9.5-py2.py3-none-any.whl
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.8.1 soupsieve-1.9.5
C:\Users\admin\AppData\Local\Programs\Python\Python36-32\Scripts>
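To confirm that the installation worked, one option is to ask the interpreter whether it can find the package. The following is a minimal sketch, not part of the original project; the helper name is_installed is hypothetical. Note that beautifulsoup4 is imported under the module name bs4.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the current interpreter can locate the top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # beautifulsoup4 is imported as "bs4", so that is the name to look up
    print("bs4 available:", is_installed("bs4"))
```

If pip is not on the PATH but python is, running python -m pip install beautifulsoup4 is a common alternative to changing into the Scripts directory.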
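III. Information to fetch from the site (a simple application)
Goal: from the page https://www.runoob.com/python/python-variable-types.html,
fetch every entry in the tutorial directory and save it locally
IV. Creating the project
Create a new Python file: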
from bs4 import BeautifulSoup
# Alias only: this is the standard library's urllib.request,
# not the third-party "requests" package
import urllib.request as requests
import json


class Runoob(object):
    def __init__(self):
        # Site root, prepended to each link's href
        self.root = "https://www.runoob.com"
        # Page whose tutorial directory is scraped
        self.url = 'https://www.runoob.com/python/python-variable-types.html'
        # File that stores the label-to-URL mapping (a+ = append mode)
        self.file = open("D:/py/Runoob_Label.txt", mode="a+", encoding="utf-8")

    def __del__(self):
        self.file.close()

    def get_labels(self, url):
        # Dictionary to return: label -> URL
        dictionary = {}
        # Fetch the page content
        response = requests.urlopen(url)
        text = response.read()
        # Build the BeautifulSoup object
        bs = BeautifulSoup(text, "html.parser")
        for div in bs.find_all("div", class_="design"):
            for a in div.find_all("a", target="_top"):
                label = a.string.strip()
                # Build the link URL from the tag's href
                href = self.root + a.attrs['href']
                dictionary.setdefault(label, href)
        return dictionary

    @staticmethod
    def process_label(label, url):
        response = requests.urlopen(url)
        # Decode the response bytes as UTF-8
        text = response.read().decode('utf-8')
        # One file per page (w+ = write mode, creates the file if missing)
        with open("D:/py/Runoob/" + label + ".html", mode="w+", encoding="utf-8") as file:
            file.write(text)

    def run(self):
        labels = self.get_labels(self.url)
        # Save the label-to-URL mapping
        content = json.dumps(labels, ensure_ascii=False)
        self.file.write(content)
        # Fetch the content of each linked page
        for label in labels:
            url = labels.get(label)
            try:
                Runoob.process_label(label, url)
            except Exception as e:
                print(e)


if __name__ == '__main__':
    Runoob().run()
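The script writes to D:/py/Runoob_Label.txt and to the D:/py/Runoob/ folder, and open() does not create missing directories, so both folders presumably have to exist before the first run. The following is a minimal sketch of a writer that creates the output folder on demand; the helper name save_page and the temporary directory used in the demo are assumptions for illustration, not part of the original script.

```python
import tempfile
from pathlib import Path


def save_page(out_dir, name, text):
    """Write text to <out_dir>/<name>.html, creating out_dir if it is missing."""
    directory = Path(out_dir)
    # parents=True also creates intermediate folders; exist_ok avoids an
    # error when the folder is already there
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / (name + ".html")
    target.write_text(text, encoding="utf-8")
    return target


if __name__ == "__main__":
    # Demo in a throwaway directory instead of D:/py
    with tempfile.TemporaryDirectory() as tmp:
        path = save_page(Path(tmp) / "Runoob", "demo", "<html></html>")
        print(path.name)
```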
Run the script's main block, and the directory file and the data for each linked page appear in the local directory:
Runoob_Label.txt
{
"Python 基础教程": "https://www.runoob.com/python/python-tutorial.html",
"Python 简介": "https://www.runoob.com/python/python-intro.html",
"Python 环境搭建": "https://www.runoob.com/python/python-install.html",
"Python 中文编码": "https://www.runoob.compython-chinese-encoding.html",
"Python 基础语法": "https://www.runoob.com/python/python-basic-syntax.html",
"Python 变量类型": "https://www.runoob.com/python/python-variable-types.html",
"Python 运算符": "https://www.runoob.com/python/python-operators.html",
"Python 条件语句": "https://www.runoob.com/python/python-if-statement.html",
"Python 循环语句": "https://www.runoob.com/python/python-loops.html",
"Python While 循环语句": "https://www.runoob.com/python/python-while-loop.html",
"Python for 循环语句": "https://www.runoob.com/python/python-for-loop.html",
"Python 循环嵌套": "https://www.runoob.com/python/python-nested-loops.html",
"Python break 语句": "https://www.runoob.com/python/python-break-statement.html",
"Python continue 语句": "https://www.runoob.com/python/python-continue-statement.html",
"Python pass 语句": "https://www.runoob.com/python/python-pass-statement.html",
"Python Number(数字)": "https://www.runoob.com/python/python-numbers.html",
"Python 字符串": "https://www.runoob.com/python/python-strings.html",
"Python 列表(List)": "https://www.runoob.com/python/python-lists.html",
"Python 元组": "https://www.runoob.com/python/python-tuples.html",
"Python 字典(Dictionary)": "https://www.runoob.com/python/python-dictionary.html",
"Python 日期和时间": "https://www.runoob.com/python/python-date-time.html",
"Python 函数": "https://www.runoob.com/python/python-functions.html",
"Python 模块": "https://www.runoob.com/python/python-modules.html",
"Python 文件I/O": "https://www.runoob.com/python/python-files-io.html",
"Python File 方法": "https://www.runoob.comfile-methods.html",
"Python 异常处理": "https://www.runoob.com/python/python-exceptions.html",
"Python OS 文件/目录方法": "https://www.runoob.comos-file-methods.html",
"Python 内置函数": "https://www.runoob.compython-built-in-functions.html",
"Python 面向对象": "https://www.runoob.com/python/python-object.html",
"Python 正则表达式": "https://www.runoob.com/python/python-reg-expressions.html",
"Python CGI 编程": "https://www.runoob.com/python/python-cgi.html",
"Python MySQL": "https://www.runoob.com/python/python-mysql.html",
"Python 网络编程": "https://www.runoob.compython-socket.html",
"Python SMTP": "https://www.runoob.com/python/python-email.html",
"Python 多线程": "https://www.runoob.com/python/python-multithreading.html",
"Python XML 解析": "https://www.runoob.com/python/python-xml.html",
"Python GUI 编程(Tkinter)": "https://www.runoob.com/python/python-gui-tkinter.html",
"Python2.x与3.x版本区别": "https://www.runoob.com/python/python-2x-3x.html",
"Python IDE": "https://www.runoob.com/python/python-ide.html",
"Python JSON": "https://www.runoob.com/python/python-json.html",
"Python 100例": "https://www.runoob.com/python/python-100-examples.html"
}
Runoob folder (one .html file per label)
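Several URLs in the Runoob_Label.txt listing above have no slash after the host, for example https://www.runoob.compython-chinese-encoding.html. This suggests that some href values on the page are relative (no leading "/"), so plain concatenation of root + href produces an invalid address. A possible remedy, sketched below, is to resolve each href against the page URL with urllib.parse.urljoin, which follows the same rules a browser uses. The helper name resolve is hypothetical, and whether the resolved address is the page's real location is an assumption.

```python
from urllib.parse import urljoin

# The page the links were read from (taken from the script)
PAGE_URL = "https://www.runoob.com/python/python-variable-types.html"


def resolve(href, page_url=PAGE_URL):
    """Resolve an href against the page it appeared on."""
    return urljoin(page_url, href)


if __name__ == "__main__":
    # Root-relative href: resolved against the host
    print(resolve("/python/python-intro.html"))
    # Relative href: resolved against the page's directory
    print(resolve("python-chinese-encoding.html"))
```

In get_labels, this would stand in for the line that computes self.root + a.attrs['href'].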
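V. Problems encountered
1. The console prints the error: [Errno 2] No such file or directory: 'D:/py/Runoob/Python 文件I/O.html'
Cause: the file name contains '/', so when the file is opened the name is interpreted as a path. The directory D:/py/Runoob/Python 文件I does not exist, and the open fails.
Fix: sanitize the label when it is obtained, for example by replacing special characters.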
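The fix described above can be sketched as follows. This is one possible approach, not the author's code: the helper name safe_filename and the choice of "_" as the replacement are assumptions. The character set covers the characters Windows rejects in file names.

```python
import re

# Characters that Windows does not allow in a file name
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|]')


def safe_filename(label, replacement="_"):
    """Replace characters that are not valid in a file name."""
    return _INVALID_CHARS.sub(replacement, label).strip()


if __name__ == "__main__":
    print(safe_filename("Python 文件I/O"))  # Python 文件I_O
```

In process_label, the sanitized name would be used only when building the file path, so the original label can still serve as the key in Runoob_Label.txt.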