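Computing Ecosystem

Nearly 20 years of the open-source movement have produced a large body of reusable resources rooted in every area of information technology, forming what is called the "computing ecosystem".

Python's official index of third-party libraries: http://pypi.python.org/pypi
"Glue language": Python can call many specialized libraries written in C, C++, and other languages
Python standard library: distributed with the installer and importable at any time without extra installation
The Python interpreter provides 68 built-in functions (the exact count varies slightly between versions)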
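
The difference between built-in functions and the standard library can be sketched as follows. This is a minimal illustration, not part of the original program; the variable names are chosen only for this example, and the list it builds also contains exception classes and interactive helpers, so its length will not equal the documented number of built-in functions.

```python
import builtins
import math  # standard library: ships with Python, but needs an import

# Built-in functions need no import at all.
word_length = len("python")
largest = max(3, 7)

# List the public callables exposed by the builtins module. The count depends
# on the Python version and includes exception classes and interactive
# helpers, so it will not match the documented function count exactly.
builtin_names = [
    name for name in dir(builtins)
    if not name.startswith("_") and callable(getattr(builtins, name))
]

root = math.sqrt(16)
print(word_length, largest, root)
print("len" in builtin_names, "print" in builtin_names)
```

Third-party libraries from the index, by contrast, have to be installed before they can be imported.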

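Web Page Information Extraction

Web pages are generally HTML pages
HTML: HyperText Markup Language. Strictly speaking, HTML is not a programming language but a markup language for information, used to describe web content

Steps

Read the local HTML file
Extract the image links
Print the results to the screen
Save the results to a file

Main program: top-level design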

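# ------- Main function: keeps the top-level flow readable -------
def main():
    # Raw strings keep the backslashes in Windows paths literal
    inputfile = r'D:\python3.7\Web获取实验\csdn.html'  # input HTML file
    outputfile = r'D:\python3.7\Web获取实验\csdn-urls.txt'  # where results are saved

    # Read the HTML file
    htmlLines = getHTMLlines(inputfile)

    # Parse the lines and extract the image links
    imageUrls = extractImageUrls(htmlLines)

    # Print the extracted links to the screen
    showResults(imageUrls)

    # Save the results
    saveResults(outputfile, imageUrls)

# Call main() only after the four helper functions have been defined
main()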
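
The Windows paths in main() contain backslashes, which Python treats as the start of an escape sequence in ordinary string literals. The sketch below shows what can go wrong and how a raw string avoids it. The path in it is hypothetical and was chosen only because it contains \n and \t.

```python
# A hypothetical path used only for illustration.
plain = 'D:\new\table.html'   # \n and \t are interpreted as newline and tab
raw = r'D:\new\table.html'    # raw string: backslashes are kept literally

# The ordinary literal silently lost both backslashes.
print(len(plain), len(raw))
print('\n' in plain, '\\' in plain)
print('\n' in raw, '\\' in raw)
```

Writing paths as raw strings, or with forward slashes, keeps them independent of which letters happen to follow a backslash.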

Reading

def getHTMLlines(htmlpath):
    """Read the HTML file and return its content as a list of lines (HTML markup)."""
    # The with block closes the file even if reading fails
    with open(htmlpath, 'r', encoding='utf-8') as f:
        return f.readlines()

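Parsing and extracting image links

def extractImageUrls(htmllist):
    """Core of the program: extract image URLs, which appear in <img> tags, e.g.

    <img title="photo story" src="http://image.nationalgeographic.com.cn/2018/0122/20180122042251164.jpg" width="968px" />

    The URL introduced by src= is the actual location of the image,
    and every such URL starts with http.
    """
    urls = []
    for line in htmllist:
        if 'img' in line:
            # Take the text after the last src= and split on double quotes;
            # element 1 is then the first quoted value
            parts = line.split('src=')[-1].split('"')
            # Skip lines with no quoted value instead of raising IndexError.
            # 'http' in ... is a substring test, so any quoted value
            # containing http passes
            if len(parts) > 1 and 'http' in parts[1]:
                urls.append(parts[1])  # keep the extracted link
    return urls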
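
To see how the chained split calls isolate the URL, the expression can be walked through one step at a time. This sketch assumes the sample <img> tag from the function's description, written on a single line; the intermediate variable names are introduced only for illustration.

```python
line = ('<img title="photo story" '
        'src="http://image.nationalgeographic.com.cn/2018/0122/20180122042251164.jpg" '
        'width="968px" />')

# Step 1: keep everything after the last occurrence of src=
after_src = line.split('src=')[-1]

# Step 2: split on double quotes; parts[0] is the empty string before the
# opening quote and parts[1] is the quoted URL
parts = after_src.split('"')
url = parts[1]

print(after_src)
print(parts)
print(url)
```

If a line contains img but no src=, step 1 returns the whole line, so parts[1] becomes the first quoted value on that line, whatever attribute it belongs to.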

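Printing to the screen

def showResults(urls):
    """Print the extracted links to the screen, numbered from 0."""
    # The format string prints "URL number N: <url>" in Chinese,
    # with N right-aligned in a field of width 2
    for count, url in enumerate(urls):
        print('第{:2}个URL：{}'.format(count, url))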

Saving

def saveResults(filepath, urls):
    """Write each URL to the output file, one per line."""
    with open(filepath, 'w', encoding='utf-8') as f:
        for url in urls:
            f.write(url + '\n')
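
One way to check the saving step is a round trip: write a few URLs in the same one-per-line layout and read them back. This is a sketch under stated assumptions; the URLs are placeholders and the temporary directory stands in for the real output path.

```python
import os
import tempfile

# Placeholder data, not taken from the real page.
urls = ['https://example.com/a.png', 'https://example.com/b.png']

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'urls.txt')

    # Same layout as saveResults: one URL per line
    with open(path, 'w', encoding='utf-8') as f:
        for url in urls:
            f.write(url + '\n')

    # splitlines() drops the trailing newline characters again
    with open(path, 'r', encoding='utf-8') as f:
        restored = f.read().splitlines()

print(restored == urls)
```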

第 0个URL：https://csdnimg.cn/public/favicon.ico
第 1个URL：https://t.csdnimg.cn/c8Q4
第 2个URL：https://t.csdnimg.cn/c8Q4
第 3个URL：https://t.csdnimg.cn/c8Q4
第 4个URL：https://blog.csdn.net/m0_37907797
第 5个URL：https://blog.csdn.net/qq_36903042
第 6个URL：https://blog.csdn.net/qing_gee
第 7个URL：https://blog.csdn.net/Eastmount
第 8个URL：https://blog.csdn.net/siyuanwai
第 9个URL：https://blog.csdn.net/weixin_43570367
第10个URL：https://kunyu.csdn.net/1.png?p=436&a=1932&c=1087&k=&d=1&t=3&u=91ec659eb8d84b5bb9542a26efcc2bef
第11个URL：https://blog.csdn.net/kexuanxiu1163
第12个URL：https://blog.csdn.net/u014044812
第13个URL：https://blog.csdn.net/dam454450872
第14个URL：https://blog.csdn.net/qing_gee
第15个URL：https://blog.csdn.net/JiuZhang_ninechapter
第16个URL：https://blog.csdn.net/qq_42322103
第17个URL：https://blog.csdn.net/weiwenhou
第18个URL：https://blog.csdn.net/kexuanxiu1163
第19个URL：https://blog.csdn.net/Design407
第20个URL：https://blog.csdn.net/caoz
第21个URL：https://blog.csdn.net/weixin_37649168
第22个URL：https://blog.csdn.net/TeFuirnever
第23个URL：https://blog.csdn.net/m0_38106923
第24个URL：https://blog.csdn.net/zzti_erlie
第25个URL：https://blog.csdn.net/qq_35190492
第26个URL：https://blog.csdn.net/qq_16855077
第27个URL：https://blog.csdn.net/qing_gee
第28个URL：https://blog.csdn.net/qq_36903042
第29个URL：https://blog.csdn.net/qq_36894974
第30个URL：https://blog.csdn.net/qq_35190492
第31个URL：https://blog.csdn.net/Eastmount
第32个URL：https://blog.csdn.net/harvic880925
第33个URL：https://blog.csdn.net/hebtu666
第34个URL：https://blog.csdn.net/u013486414
第35个URL：https://blog.csdn.net/HarderXin
第36个URL：https://blog.csdn.net/qq_42322103
第37个URL：https://blog.csdn.net/BEYONDMA
第38个URL：https://blog.csdn.net/coderising
第39个URL：https://blog.csdn.net/yuanziok
第40个URL：{"mod":"popu_474","dest":"https://t.csdnimg.cn/jfJx","strategy":"","index":"1"}
第41个URL：https://kunyu.csdn.net/1.png?d=2&k=&m=JcfvbHEJitLDtEiDAytibAHAicAnpAfJJbpQLbnbfHtfUJinfHbmJnfXiHLpiiEAJQbJibSEWSQpSScyXAvoUciHJEpcAQtpnLnQ
第42个URL：https://kunyu.csdn.net/1.png?d=2&k=&m=cLDtniUHEnEcUbmLSAEHcvUmLHfSfDLtEJAnHicDpJJvQtpnnHSiAtntbtLnpQSnEptLWiSSQcvAUfpfmyHncniAnbAbLtfininQ
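
Several entries in the output above are blog profile links or a JSON fragment rather than images. That is consistent with the line-based approach picking up the first quoted value on any line that merely contains img. As a possible alternative, not the method used in this article, the standard library's html.parser can read the src attribute of each <img> tag directly. The class name ImgSrcCollector and the sample HTML are assumptions made for this sketch.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == 'img':
            src = dict(attrs).get('src')
            if src and src.startswith('http'):
                self.urls.append(src)


# Made-up sample: a link wrapping an image whose tag spans two lines,
# plus an image with a relative path that should be skipped
sample = '''<a href="https://blog.csdn.net/someone"><img alt="x"
src="https://example.com/pic.png" width="10" /></a>
<img src="relative/logo.png">'''

collector = ImgSrcCollector()
collector.feed(sample)
print(collector.urls)
```

Because the parser works on tags rather than lines, the href of the surrounding link is ignored and a tag split across lines is still handled.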

