方法一：用 Python（urllib2 + lxml）简单爬取一个博客的文章内容

# -*- coding: utf-8 -*-

from urllib2 import urlopen,Request

import urllib

from lxml import *

import lxml.html as HTML

import time

def error(txt):
    """Append *txt* as a single line to the shared error log (../it/error.txt)."""
    with open("../it/error.txt", "a") as log:
        log.write('%s\n' % txt)

def con(url,count=4):

    try:

        req = Request(url)

        req.add_header('Referer','http://www.baidu.com')

        req.add_header('User-Agent','Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')

        res = urlopen(req,timeout = 20)

        page = res.read()

        res.close()

        #dom = HTML.document_fromstring(page)

        return page

    except Exception,e:

        if count >= 10:

            print e

            error(url)

        else:

            count += 1

            time.sleep(1)

            return con(url,count)

def menu(url):
    """Yield {'title': ..., 'url': ...} for each article link on the index page.

    Links are taken from //h5/a nodes of the page at *url*; the href is
    protocol-relative on the site, so "http:" is prefixed.  Entries with
    an empty title or missing href are skipped.  Yields nothing when the
    page could not be fetched (con() returned None) — the original
    crashed with TypeError in both of those cases.
    """
    page = con(url)
    if not page:
        # con() exhausted its retries and already logged the url.
        return
    dom = HTML.document_fromstring(page)
    for node in dom.xpath("//h5/a"):
        title = node.text_content()
        href = node.get("href")  # may be None for anchor-less <a> tags
        if title and href:
            yield {'title': title, 'url': "http:" + href}

def save(title, content):
    """Write *content* to ../it/<title>.html.

    Path separators in *title* are replaced with '_' so a title such as
    "a/b" cannot escape the target directory or raise IOError — the
    original used the title verbatim in the file path.
    """
    safe = unicode(title).replace('/', '_')
    with open('../it/' + safe + '.html', 'w') as f:
        f.write(content)

def blog():

    prev = menu("http://www.schooltop.net")

    for dic in prev:

        title = dic.get("title",'')

        url = dic.get("url",'')

        page = con(url)

        save(title,page)

        print "saved      ",unicode(title)

 

if __name__ == "__main__":
    # Run the crawler when executed as a script.
    blog()



方法二：使用 urllib2 + 正则表达式直接提取文章正文：
import urllib2
import re  
arr = ['289','300']
for i in arr:
  content = urllib2.urlopen('http://www.schooltop.net/blogs/'+i).read()
  pattern = re.compile('<div class="article">(.*?)<div class="row t_margin_20">', re.S)
  match = re.search(pattern, content)
  if match:
    print match.group(1)
  else: 
    print 111

猜你喜欢

转载自schooltop.iteye.com/blog/2399769