net spider (Python web crawler)

Category: Other | 2018-06-18 22:04:42

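# -*- coding: utf-8 -*-
# Python 2 script (urllib2 and cookielib are Python 2 module names).
import re  # needed for the re.compile() call in the BeautifulSoup section
import cookielib
import urllib2
from bs4 import BeautifulSoup
url = "http://www.baidu.com"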


# Method 1: fetch the URL directly with urlopen()
response1=urllib2.urlopen(url)
print response1.getcode()
print len(response1.read())

# Method 2: build a Request so a custom User-Agent header can be sent
request=urllib2.Request(url)
request.add_header("user-agent","Mozilla/5.0")
response2=urllib2.urlopen(request)
print response2.getcode()
print len(response2.read())


# Method 3: install a cookie-aware opener so urlopen() stores cookies in cj
cj=cookielib.CookieJar()
opener=urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
response3=urllib2.urlopen(url)
print response3.getcode()
print cj
print response3.read()
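
The three approaches above use Python 2 module names. In Python 3, `urllib2` became `urllib.request` and `cookielib` became `http.cookiejar`. A minimal Python 3 sketch of methods 2 and 3 follows; the helper names `build_request`, `build_cookie_opener`, and `fetch` are illustrative and do not come from the original post, and the snippet as written sends no network request.

```python
import http.cookiejar
import urllib.request

URL = "http://www.baidu.com"  # same URL as in the listing above


def build_request(url, user_agent="Mozilla/5.0"):
    """Method 2: wrap the URL in a Request carrying a custom User-Agent."""
    request = urllib.request.Request(url)
    request.add_header("User-Agent", user_agent)
    return request


def build_cookie_opener():
    """Method 3: return (opener, jar); the opener stores cookies in the jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch(opener, target, timeout=10):
    """Open a URL or Request and return (status code, body bytes)."""
    with opener.open(target, timeout=timeout) as response:
        return response.status, response.read()


request = build_request(URL)
opener, jar = build_cookie_opener()
# urllib stores header names capitalized, hence "User-agent" in the lookup.
print(request.get_header("User-agent"))
# The jar stays empty until a response has been processed by the opener.
print(len(jar))
```

To perform the request, call `fetch(opener, request)`; afterwards `list(jar)` shows any cookies the server set. Returning the opener instead of calling `install_opener` keeps the cookie handling local rather than changing the process-wide default.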


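# BeautifulSoup example
# Note: html_doc below is masked in the source. The lookups that follow only
# succeed if it holds real HTML with <a href=...> tags and a <p class="title">;
# otherwise find() returns None and the attribute access raises AttributeError.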
html_doc="""********************************************
**********************
******************
*************
*******
"""
soup=BeautifulSoup(html_doc,
                   'html.parser',
                   from_encoding='utf-8')
print "All links:"
links = soup.find_all("a")
for link in links:
    print link.name, link['href'], link.get_text()
print "A single link:"
link_node = soup.find('a', href='http://example.com/lacie')
print link_node.name, link_node['href'], link_node.get_text()


print "Regular expression match:"
link_node=soup.find('a',href=re.compile(r"ill"))
print link_node.name,link_node['href'],link_node.get_text()


print "Text of the <p> with class 'title':"
p_node=soup.find('p',class_="title")
print p_node.name,p_node.get_text()
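
Because the sample HTML in the listing is masked, the link lookups are hard to follow without data. The sketch below shows the same idea (collect every link, then pick the first whose `href` matches a regular expression) in Python 3. It is not the post's method: it swaps BeautifulSoup for the standard library's `html.parser.HTMLParser` so that it runs without extra packages. `SAMPLE_HTML`, `LinkCollector`, and `collect_links` are hypothetical names, and apart from `http://example.com/lacie` the sample content is invented for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample document; the original post's html_doc was masked.
SAMPLE_HTML = """
<html><body>
<p class="title"><b>Sample title</b></p>
<p>Links:
<a href="http://example.com/lacie">Lacie</a>
<a href="http://example.com/tillie">Tillie</a>
</p>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for <a> tags, roughly like soup.find_all("a")."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href yield None here and are skipped.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def collect_links(html):
    """Parse html and return the list of (href, text) pairs."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


links = collect_links(SAMPLE_HTML)
for href, text in links:
    print(href, text)

# Regex filter, analogous to soup.find('a', href=re.compile(r"ill")).
pattern = re.compile(r"ill")
first_match = next((link for link in links if pattern.search(link[0])), None)
print(first_match)
```

With this sample, the regular expression `ill` matches only the `tillie` URL, so `first_match` is that link; when nothing matches, `first_match` is `None` and should be checked before use.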

Reposted from www.cnblogs.com/1314520xh/p/9196186.html
