1.4.2 Python sitemap crawler (updated daily)

# -*- coding: utf-8 -*-
'''
Created on 2019-05-06

@author: 薛卫卫
'''

import urllib.request
import re

def download(url, user_agent="wswp", num_retries=2):
    """Download a URL, retrying up to num_retries times on 5xx server errors."""
    print("Downloading:", url)
    headers = {'User-agent': user_agent}
    request = urllib.request.Request(url, headers=headers)
    try:
        html = urllib.request.urlopen(request).read()
    except urllib.request.URLError as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            # retry only for 5xx server errors; client errors (4xx) are not retried
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, user_agent, num_retries - 1)
    return html

def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    # keep the regular expression as-is; instead decode the bytes returned by urlopen().read()
    sitemap = sitemap.decode('utf-8')
    # extract the sitemap links
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = download(link)
        # scrape html here
        # ...
        
crawl_sitemap("http://example.webscraping.com/sitemap.xml")

  


Reposted from www.cnblogs.com/xww115/p/10828446.html