Python: web crawling in practice (work in progress)

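What is a crawler?

A program that automatically fetches information from the internet: it visits web pages on its own and extracts data from them.

What is a crawler good for?

It collects and processes internet data so that you can put it to use.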

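Installing Python (macOS ships with version 2.7)

1. Download the installer from the official website.
2. macOS ships with 2.7. Avoid changing environment variables if you can, and use the python3 command instead: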
python3 --version
python3
Current Python 3 releases bundle pip. Because the system Python is installed side by side with Python 3, use the pip3 command rather than pip.

PyCharm (treasure chest ⭐️)

Tested and working
Tutorial link
Get the code
Link to commonly used settings

Crawler sequence diagram


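URL manager: prevents the same page from being fetched twice. URLs that have already been fetched can be kept in (1) an in-memory Python set, (2) MySQL, or (3) Redis.
Downloader (the core component): downloads pages from the internet into a local file or a string. There are two options: urllib2 from the standard library, and requests, a third-party package that is more powerful.
Parser: extracts the valuable data. Options are regular expressions, html.parser, BeautifulSoup, and lxml.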
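The in-memory option for the URL manager can be sketched as follows. This is a minimal illustration, not a fixed design: the class name UrlManager and its method names are hypothetical, and a real crawler might swap the two sets for MySQL or Redis.

```python
class UrlManager:
    """Tracks which URLs are waiting to be crawled and which are done.

    Hypothetical helper for illustration; both collections live in memory.
    """

    def __init__(self):
        self.new_urls = set()  # discovered but not yet fetched
        self.old_urls = set()  # already fetched

    def add_new_url(self, url):
        # Ignore empty values and anything that was seen before
        if not url or url in self.new_urls or url in self.old_urls:
            return False
        self.new_urls.add(url)
        return True

    def add_new_urls(self, urls):
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # Move one URL from the waiting set to the done set
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```

Because a set only stores unique values, adding the same URL twice has no effect, and a URL that was already handed out by get_new_url is rejected when it shows up again.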

1. A first crawler

# hello: this is a single-line comment
import requests
newUrl = 'http://news.sina.com.cn/china'
res = requests.get(newUrl)
res.encoding = 'utf-8'
print(res.text)

Run it with python3 hello.py
#### Troubleshooting
If the requests package is missing, install it with pip3 install requests
List the installed packages with pip3 list
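As an alternative to reading through the pip3 list output, the check can be done from Python itself. The sketch below is only one way to do it; the helper name is_installed and the PIP_NAMES mapping are made up for this example.

```python
import importlib.util

# Import name -> name to pass to pip3 install (mapping made up for this example)
PIP_NAMES = {"requests": "requests", "bs4": "beautifulsoup4"}


def is_installed(package_name):
    """Return True if a top-level package can be found by this interpreter."""
    return importlib.util.find_spec(package_name) is not None


for import_name, pip_name in PIP_NAMES.items():
    if is_installed(import_name):
        print(import_name, "is installed")
    else:
        print(import_name, "is missing, try: pip3 install", pip_name)
```

Run it with python3 so that it checks the same interpreter the crawler will use.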
### 2. Structured data (the parser)
BeautifulSoup4 official site
Use BeautifulSoup to pick apart the elements of a page. Install it with pip3 install beautifulsoup4

# hello
import requests
from bs4 import BeautifulSoup
newUrl = 'http://news.sina.com.cn/china'
res = requests.get(newUrl)
res.encoding = 'utf-8'
# print(res.text)
# Parse the DOM tree
soup = BeautifulSoup(res.text,"html.parser")
print(soup.text)

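soup = BeautifulSoup(html_doc, "html.parser")
If this statement carries from_encoding="utf-8", delete that argument.
Reason:
In Python 3 a str is already Unicode text. When BeautifulSoup is handed a str, from_encoding is ignored (with a warning), so simply drop from_encoding="utf-8".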
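To make the difference concrete, here is a small offline sketch. The HTML snippet is made up for the example. from_encoding only matters when the input is bytes; with str input it is left out.

```python
from bs4 import BeautifulSoup

# Made-up sample document, used instead of a live page
html_text = "<html><head><title>新闻</title></head></html>"

# Case 1: str input. The text is already decoded, so from_encoding is omitted.
soup_from_str = BeautifulSoup(html_text, "html.parser")

# Case 2: bytes input. Here from_encoding tells BeautifulSoup how to decode.
html_bytes = html_text.encode("utf-8")
soup_from_bytes = BeautifulSoup(html_bytes, "html.parser", from_encoding="utf-8")

print(soup_from_str.title.string)
print(soup_from_bytes.title.string)
```

Both lines print the same title. With requests, res.text is a str (case 1) and res.content is bytes (case 2).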

Select elements by tag name (nested tags such as span are included in the output). The result is a list, printed inside [].

import requests
from bs4 import BeautifulSoup
newUrl = 'http://news.sina.com.cn/china'
res = requests.get(newUrl)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text,"html.parser")
# Select all <a> tags
alink = soup.select('a')
print(alink)

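# Alternative: find_all returns the same <a> tags
soup = BeautifulSoup(res.text, "html.parser")
links = soup.find_all('a')
for link in links:
    # .get returns None instead of raising KeyError when href is missing
    print(link.name, link.get('href'), link.get_text())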

print(alink[0])
print(alink[1])
print(alink[2])
print(alink[0].text)
# Output
<a href="http://www.sina.com.cn/">新浪首页</a>
<a href="http://news.sina.com.cn/">新闻</a>
<a href="http://sports.sina.com.cn/">体育</a>
新浪首页

Extract the text inside every <a> tag and look up its attributes

import requests
from bs4 import BeautifulSoup
newUrl = 'http://news.sina.com.cn/china'
res = requests.get(newUrl)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text,"html.parser")
alink = soup.select('a')
for link in alink:
    print(link.text)
    # Read the href attribute; .get returns None instead of raising KeyError when it is missing
    print(link.get('href'))
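Real pages often contain <a> tags without an href, as well as relative links. The sketch below shows one way to handle both. The function name extract_links and the sample HTML are invented for this example, and it runs offline.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return (text, absolute_url) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for link in soup.select("a"):
        href = link.get("href")  # None when the attribute is missing
        if not href:
            continue
        # urljoin turns a relative href into an absolute URL
        results.append((link.get_text(strip=True), urljoin(base_url, href)))
    return results


# Made-up sample: one absolute link, one relative link, one anchor without href
sample_html = """
<a href="http://www.sina.com.cn/">新浪首页</a>
<a href="/china/">China</a>
<a name="top">no link here</a>
"""

for text, url in extract_links(sample_html, "http://news.sina.com.cn/china"):
    print(text, url)
```

The third tag is skipped because it has no href, and the relative /china/ link is resolved against the page URL.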

Find elements and their content by id or class

import requests
from bs4 import BeautifulSoup
newUrl = 'http://news.sina.com.cn/china'
res = requests.get(newUrl)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text,"html.parser")
# Select the element whose id is "select"
alink = soup.select('#select')
print(alink)
# Select the elements whose class is "link"
blink = soup.select('.link')
print(blink)
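Whether the live page actually contains an id of select or a class of link depends on its current markup, so the selectors are easier to try on a fixed snippet. The HTML below is invented for the example.

```python
from bs4 import BeautifulSoup

# Made-up sample markup
sample_html = """
<div id="select">chosen block</div>
<a class="link" href="/a">first</a>
<a class="link extra" href="/b">second</a>
<a href="/c">third</a>
"""
soup = BeautifulSoup(sample_html, "html.parser")

by_id = soup.select("#select")          # a list, even when only one element matches
by_class = soup.select(".link")         # every element whose class list contains "link"
first_match = soup.select_one(".link")  # a single element, or None if nothing matches

print([el.get_text() for el in by_id])
print([el["href"] for el in by_class])
print(first_match.get_text())
```

select always returns a list, so an empty result is [] rather than an error. select_one is convenient when only the first match is needed.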

### Two downloaders
A: urllib2 (Python 2.7)

import urllib2
# Request the page directly
response = urllib2.urlopen("http://www.baidu.com")
# Get the status code
print(response.getcode())
# Get the response body
content = response.read()

Python 3

# In Python 3, urllib.request replaces urllib2
import urllib.request
url = 'http://www.baidu.com'

print('Method 1')
response1 = urllib.request.urlopen(url)
print(response1.getcode())
print(len(response1.read()))
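response.read() returns bytes, so the body still has to be decoded before it can be used as text. The helper below is a sketch of one way to do that. The name fetch and the UTF-8 fallback are assumptions made for this example.

```python
import urllib.request


def fetch(url, timeout=10):
    """Download a URL and return (status_code, decoded_text)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        status = response.getcode()
        raw = response.read()  # bytes
        # Use the charset from the Content-Type header; fall back to UTF-8 (assumption)
        charset = response.headers.get_content_charset() or "utf-8"
    # errors="replace" keeps the crawl going if a few bytes cannot be decoded
    return status, raw.decode(charset, errors="replace")
```

Usage would look like status, text = fetch('http://www.baidu.com'), after which text can be handed to the parser. The timeout keeps the crawler from hanging on a server that never answers.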
