Search Engines: Getting Started with Python Web Crawlers

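Today I put together a simple introductory Python crawler. The most annoying thing for a beginner is version compatibility, both between Python versions and between library versions. BeautifulSoup releases before 4.4.0 do not support Python 3.5 and later, and that cost me a lot of time. From now on I will check a plugin's or library's documentation (BeautifulSoup's, in this case) for the Python versions it supports.
BeautifulSoup learning site: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id4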

BeautifulSoup compatibility

Reposted from: https://blog.csdn.net/u010358168/article/details/62040603

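On macOS, after installing bs4 with python setup.py install, from bs4 import BeautifulSoup ran fine under python2, but under python3 the same import failed with: ImportError: cannot import name 'HTMLParseError'. After a long search online I found the cause: BeautifulSoup releases before 4.4.0 do not support Python 3.5 and later (HTMLParseError was removed from html.parser in Python 3.5, and the older bs4 code still tries to import it). That leaves two options: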

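Option 1: upgrade BeautifulSoup to 4.4.0 or later
Option 2: downgrade Python to 3.4 or 2.7

Option 2 treats the symptom rather than the cause, and I did not want to be stuck on 3.4 or 2.7 anyway, so let's go with option 1.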

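Upgrade BeautifulSoup with pip:
sudo pip install --upgrade beautifulsoup4
After running this, it still did not work. Why? By default, pip here is the one that belongs to the system's bundled python2, so it upgraded the python2 copy. To upgrade the python3 copy, use pip3:
sudo pip3 install --upgrade beautifulsoup4
Try again: perfect, problem solved.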
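
The check described above can be sketched in Python 3. This is only a minimal illustration: the helper names (parse_version, needs_upgrade, upgrade_command) are made up for this post, and the 4.4.0 threshold is the one quoted above, not something verified against the BeautifulSoup changelog.

```python
import re
import sys

# Threshold quoted in this post: bs4 releases before 4.4.0 fail on Python 3.5+.
MIN_BS4_FOR_PY35 = (4, 4, 0)


def parse_version(text):
    """Turn a version string such as '4.3.2' into a tuple such as (4, 3, 2)."""
    numbers = [int(n) for n in re.findall(r"\d+", text)[:3]]
    numbers += [0] * (3 - len(numbers))  # pad '4.4' to (4, 4, 0)
    return tuple(numbers)


def needs_upgrade(bs4_version, python_version=sys.version_info[:2]):
    """Return True when this bs4 version is too old for this Python version."""
    too_old = parse_version(bs4_version) < MIN_BS4_FOR_PY35
    return tuple(python_version) >= (3, 5) and too_old


def upgrade_command(executable=sys.executable):
    """Build a pip command bound to one specific interpreter.

    Running pip as '<python> -m pip' avoids the pip/pip3 mix-up,
    because the package is installed for exactly that interpreter.
    """
    return [executable, "-m", "pip", "install", "--upgrade", "beautifulsoup4"]


if __name__ == "__main__":
    print(needs_upgrade("4.3.2", (3, 5)))
    print(" ".join(upgrade_command("python3")))
```

The version string to pass in can be read from the output of pip3 show beautifulsoup4. When the import itself fails, the version cannot be read from the bs4 module.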

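urllib compatibility

In Python 3 (and therefore in 3.5), the old urllib modules have been merged into a single urllib package.
The official documentation says: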
a new urllib package was created. It consists of code from
urllib, urllib2, urlparse, and robotparser. The old
modules have all been removed. The new package has five submodules:
urllib.parse, urllib.request, urllib.response,
urllib.error, and urllib.robotparser. The
urllib.request.urlopen() function uses the url opener from
urllib2. (Note that the unittests have not been renamed for the
beta, but they will be renamed in the future.)
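
One common way to cope with this reorganization is an import fallback, so that the same script runs on both Python 2 and Python 3. The sketch below is an assumption about how one might structure it, not something from the quoted documentation. It tries the Python 3 locations first and falls back to the Python 2 module names listed in the quote.

```python
try:
    # Python 3: everything lives in submodules of the urllib package.
    from urllib.request import urlopen
    from urllib.parse import urlparse
except ImportError:
    # Python 2: the same functions live in separate top-level modules.
    from urllib2 import urlopen
    from urlparse import urlparse

# urlparse behaves the same on both versions.
parts = urlparse("https://www.csdn.net/nav/python")
print(parts.netloc)
print(parts.path)
```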

urlopen() usage:
import urllib.request
url = "http://www.baidu.com"
get = urllib.request.urlopen(url).read()
print(get)
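
Note that in Python 3, read() returns bytes, so the print above shows a b'...' literal rather than readable text. A minimal sketch of decoding the body is given below. The helper names decode_body and fetch_text are hypothetical, and falling back to UTF-8 when the server declares no charset is an assumption that will not suit every site.

```python
import urllib.request


def decode_body(raw, charset=None, default="utf-8"):
    """Decode response bytes, falling back to a default charset."""
    return raw.decode(charset or default, errors="replace")


def fetch_text(url, timeout=10):
    """Download a page and return its body as str."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Charset from the Content-Type header, or None if not declared.
        charset = response.headers.get_content_charset()
    return decode_body(raw, charset)


if __name__ == "__main__":
    print(fetch_text("http://www.baidu.com")[:200])
```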


Crawler code

Python 2.7

# -*- coding: UTF-8 -*-
__author__ = 'Administrator'
from urllib import urlopen  # Python 2.7
# from urllib.request import urlopen  # use this import on Python 3 instead
from bs4 import BeautifulSoup
# Fetch the HTML content
html = urlopen("https://www.csdn.net/")
# Parse it with BeautifulSoup, naming the parser explicitly
soup = BeautifulSoup(html, "html.parser")
# Print the whole document
# print(soup.prettify())
# Title text and tag name
print(soup.title.string)
print(soup.title.name)
# Print the href of every link on the page
for link in soup.find_all('a'):
    print(link.get('href'))
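
The href values printed by the loop above are raw attribute values: they can be None, relative paths, or javascript: pseudo-links. Before a crawler can follow them, they usually need cleaning up. The following is a rough Python 3 sketch of one way to do that with urllib.parse. The function name absolute_links is made up for illustration, and keeping only http/https links is a design assumption.

```python
from urllib.parse import urljoin, urlparse


def absolute_links(hrefs, base_url):
    """Resolve hrefs against base_url, dropping empty, non-HTTP and duplicate links."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:  # link.get('href') returns None for <a> tags without href
            continue
        url = urljoin(base_url, href.strip())
        if urlparse(url).scheme not in ("http", "https"):
            continue  # skip javascript:, mailto: and similar
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


if __name__ == "__main__":
    sample = ["/nav/python", None, "https://blog.csdn.net/", "javascript:;"]
    for url in absolute_links(sample, "https://www.csdn.net/"):
        print(url)
```

With BeautifulSoup, the input list could come from [a.get('href') for a in soup.find_all('a')].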
