Python Learning Path (Notes): A Simple Web Crawler, Summarized

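I. Environment Setup
Download: the official Python site, https://www.python.org/
Install: run the installer, then configure the environment variables (add the Python installation directory to PATH so that python and pip can be run from cmd)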

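II. Installing the Crawler's Dependency
The example in this post uses the BeautifulSoup library.
Installation steps:
1. Run this command in cmd: pip install beautifulsoup4
2. If the pip command is not recognized, go to step 3
3. Change into the Scripts directory under the Python installation directory and run the command there, as shown below: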

C:\Users\admin\AppData\Local\Programs\Python\Python36-32\Scripts>pip install beautifulsoup4
Collecting beautifulsoup4
  Downloading https://files.pythonhosted.org/packages/3b/c8/a55eb6ea11cd7e5ac4bacdf92bac4693b90d3ba79268be16527555e186f0
/beautifulsoup4-4.8.1-py3-none-any.whl (101kB)
    100% |████████████████████████████████| 102kB 11kB/s
Collecting soupsieve>=1.2 (from beautifulsoup4)
  Downloading https://files.pythonhosted.org/packages/81/94/03c0f04471fc245d08d0a99f7946ac228ca98da4fa75796c507f61e688c2
/soupsieve-1.9.5-py2.py3-none-any.whl
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.8.1 soupsieve-1.9.5

C:\Users\admin\AppData\Local\Programs\Python\Python36-32\Scripts>
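
If pip is still not recognized, running `python -m pip install beautifulsoup4` is a common alternative, provided python itself is on the PATH. The following is a minimal sketch for checking the result of the steps above from Python; the helper names `is_installed` and `on_path` are made up for illustration and are not part of any library.

```python
import importlib.util
import shutil


def is_installed(module_name):
    """Return True if this interpreter can locate the given top-level module."""
    return importlib.util.find_spec(module_name) is not None


def on_path(command):
    """Return True if the command can be found through the PATH variable."""
    return shutil.which(command) is not None


if __name__ == "__main__":
    # beautifulsoup4 is imported under the module name "bs4"
    print("pip on PATH:", on_path("pip"))
    print("bs4 importable:", is_installed("bs4"))
```

If the second line prints False, the package was probably installed into a different Python installation than the one running the script.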

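III. The Site Information to Fetch (a Simple Application)
Goal: from the tutorial page https://www.runoob.com/python/python-variable-types.html,
fetch every entry in the table of contents and save it locally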
IV. Creating the Project
Create a new Python file:

from bs4 import BeautifulSoup
import urllib.request as requests
import json


class Runoob(object):

    def __init__(self):
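        # Site root, used to build absolute links
        self.root = "https://www.runoob.com"
        # Page whose table of contents is scraped
        self.url = 'https://www.runoob.com/python/python-variable-types.html'
        # File that stores the label-to-link mapping (a+ is append mode)
        self.file = open("D:/py/Runoob_Label.txt", mode="a+", encoding="utf-8")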

    def __del__(self):
        self.file.close()

    def get_labels(self, url):
        # Dictionary to return: label -> link
        dictionary = {}
        # Fetch the page content
        response = requests.urlopen(url)
        text = response.read()
        # Build the BeautifulSoup object
        bs = BeautifulSoup(text, "html.parser")
        for div in bs.find_all("div", class_="design"):
            for a in div.find_all("a", target="_top"):
                label = a.string.strip()
                # Build the absolute link from the href (an href without a
                # leading "/" is glued straight onto the root)
                href = self.root + a.attrs['href']
                dictionary.setdefault(label, href)
        return dictionary

    @staticmethod
    def process_label(label, url):
        response = requests.urlopen(url)
        # Decode the response bytes as UTF-8
        text = response.read().decode('utf-8')
        # One file per page (w+ creates the file if it is missing and truncates it otherwise)
        file = open("D:/py/Runoob/" + label + ".html", mode="w+", encoding="utf-8")
        file.write(text)
        file.close()

    def run(self):
        labels = self.get_labels(self.url)
        # Save the label-to-link mapping
        content = json.dumps(labels, ensure_ascii=False)
        self.file.write(content)
        # Fetch the content of each listed page
        for label in labels:
            url = labels.get(label)
            try:
                Runoob.process_label(label, url)
            except Exception as e:
                print(e)


if __name__ == '__main__':
    Runoob().run()
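
The script above assumes that D:/py and D:/py/Runoob already exist, and it closes its files by hand. As a hedged alternative, the page-writing step could be sketched with pathlib and a with block, so that the directory is created on demand and the file is closed even if writing fails. The function name `save_page` is hypothetical.

```python
from pathlib import Path


def save_page(directory, name, text):
    """Write text to <directory>/<name>.html, creating the directory if needed."""
    target_dir = Path(directory)
    # parents=True creates missing parent folders; exist_ok=True tolerates reruns
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / (name + ".html")
    # The with block closes the file even when write() raises
    with target.open(mode="w", encoding="utf-8") as f:
        f.write(text)
    return target
```

A call such as `save_page("D:/py/Runoob", label, text)` would stand in for the open/write/close sequence in process_label.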

Run the main block, and the directory file and the data for each listed page appear in the local directory:
Runoob_Label.txt

{
 "Python 基础教程": "https://www.runoob.com/python/python-tutorial.html", 
 "Python 简介": "https://www.runoob.com/python/python-intro.html", 
 "Python 环境搭建": "https://www.runoob.com/python/python-install.html", 
 "Python 中文编码": "https://www.runoob.compython-chinese-encoding.html",
 "Python 基础语法": "https://www.runoob.com/python/python-basic-syntax.html",
 "Python 变量类型": "https://www.runoob.com/python/python-variable-types.html", 
 "Python 运算符": "https://www.runoob.com/python/python-operators.html", 
 "Python 条件语句": "https://www.runoob.com/python/python-if-statement.html", 
 "Python 循环语句": "https://www.runoob.com/python/python-loops.html", 
 "Python While 循环语句": "https://www.runoob.com/python/python-while-loop.html", 
 "Python for 循环语句": "https://www.runoob.com/python/python-for-loop.html", 
 "Python 循环嵌套": "https://www.runoob.com/python/python-nested-loops.html", 
 "Python break 语句": "https://www.runoob.com/python/python-break-statement.html", 
 "Python continue  语句": "https://www.runoob.com/python/python-continue-statement.html", 
 "Python pass 语句": "https://www.runoob.com/python/python-pass-statement.html", 
 "Python Number(数字)": "https://www.runoob.com/python/python-numbers.html", 
 "Python 字符串": "https://www.runoob.com/python/python-strings.html", 
 "Python 列表(List)": "https://www.runoob.com/python/python-lists.html", 
 "Python 元组": "https://www.runoob.com/python/python-tuples.html", 
 "Python 字典(Dictionary)": "https://www.runoob.com/python/python-dictionary.html", 
 "Python 日期和时间": "https://www.runoob.com/python/python-date-time.html", 
 "Python 函数": "https://www.runoob.com/python/python-functions.html", 
 "Python 模块": "https://www.runoob.com/python/python-modules.html", 
 "Python 文件I/O": "https://www.runoob.com/python/python-files-io.html", 
 "Python File 方法": "https://www.runoob.comfile-methods.html", 
 "Python 异常处理": "https://www.runoob.com/python/python-exceptions.html", 
 "Python OS 文件/目录方法": "https://www.runoob.comos-file-methods.html", 
 "Python 内置函数": "https://www.runoob.compython-built-in-functions.html", 
 "Python 面向对象": "https://www.runoob.com/python/python-object.html", 
 "Python 正则表达式": "https://www.runoob.com/python/python-reg-expressions.html", 
 "Python CGI 编程": "https://www.runoob.com/python/python-cgi.html", 
 "Python MySQL": "https://www.runoob.com/python/python-mysql.html", 
 "Python 网络编程": "https://www.runoob.compython-socket.html", 
 "Python SMTP": "https://www.runoob.com/python/python-email.html", 
 "Python 多线程": "https://www.runoob.com/python/python-multithreading.html", 
 "Python XML 解析": "https://www.runoob.com/python/python-xml.html", 
 "Python GUI 编程(Tkinter)": "https://www.runoob.com/python/python-gui-tkinter.html", 
 "Python2.x与3​​.x版本区别": "https://www.runoob.com/python/python-2x-3x.html", 
 "Python IDE": "https://www.runoob.com/python/python-ide.html", 
 "Python JSON": "https://www.runoob.com/python/python-json.html", 
 "Python 100例": "https://www.runoob.com/python/python-100-examples.html"
}
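
A few entries in the listing above, such as "https://www.runoob.compython-socket.html", are missing the "/python/" segment. That is what comes out when the site root is concatenated with an href that has no leading "/". One possible fix, sketched here under the assumption that such hrefs are relative to the page being scraped, is to resolve every href against the page URL with urljoin from the standard library. The function name `absolute_link` is made up for illustration.

```python
from urllib.parse import urljoin

PAGE_URL = "https://www.runoob.com/python/python-variable-types.html"


def absolute_link(page_url, href):
    """Resolve an href the way a browser would, relative to the page URL."""
    return urljoin(page_url, href)


if __name__ == "__main__":
    # A relative href is resolved against the page's directory
    print(absolute_link(PAGE_URL, "python-socket.html"))
    # An href that starts with "/" is resolved against the site root
    print(absolute_link(PAGE_URL, "/python/python-intro.html"))
```

In get_labels, this would replace `self.root + a.attrs['href']` with `urljoin(url, a.attrs['href'])`.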

Runoob folder
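V. Problems Encountered
1. The console printed an exception: [Errno 2] No such file or directory: 'D:/py/Runoob/Python 文件I/O.html'
Analysis: the label contains '/', so when the file is opened that character is treated as a path separator. The directory "D:/py/Runoob/Python 文件I" does not exist, so the open call fails.
Solution: sanitize the label when it is read, for example by replacing special characters, before using it as a file name.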
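
The replacement step could look like the minimal sketch below. It assumes that swapping each character Windows forbids in file names for an underscore is acceptable; the function name `safe_filename` and the choice of "_" are illustrative, not taken from the original script.

```python
import re

# Characters that Windows does not allow in file names
_FORBIDDEN = re.compile(r'[\\/:*?"<>|]')


def safe_filename(label, replacement="_"):
    """Return the label with forbidden file-name characters replaced."""
    return _FORBIDDEN.sub(replacement, label).strip()


if __name__ == "__main__":
    print(safe_filename("Python 文件I/O"))  # Python 文件I_O
```

process_label would then build its path from `safe_filename(label)` instead of the raw label. Two labels that differ only in a forbidden character would map to the same file, so this is a sketch rather than a complete solution.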

Reposted from blog.csdn.net/weixin_38422258/article/details/103688643