Python Web Scraping

A Simple Example

The first step of a web crawler is to fetch a page's HTML from its URL. In Python 3, either urllib.request or requests can do this.

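urllib is part of Python's standard library, so it is available wherever Python is installed and needs no separate installation.
requests is a third-party library that has to be installed separately. (requests is powerful and easy to use, so this article uses it to fetch the HTML. Its GitHub repository: https://github.com/requests/requests)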
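
For comparison, the built-in route can be sketched roughly as follows. This is a minimal illustration, not the approach used in the rest of the article: `fetch_html` is a hypothetical helper name, the UTF-8 fallback is an assumption, and the `data:` URL is only there so the demo runs without network access.

```python
import urllib.request


def fetch_html(url, timeout=10.0):
    """Download `url` with the standard library and return the body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        raw = resp.read()
        # Use the charset declared in the Content-Type header; assume UTF-8 otherwise.
        charset = resp.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


if __name__ == "__main__":
    # A data: URL keeps the demo offline; swap in a real http(s) URL to fetch a live page.
    print(fetch_html("data:text/html;charset=utf-8,%3Ch1%3Ehello%3C%2Fh1%3E"))
```

With requests the same step is a single call, `requests.get(url).text`, which is the form the downloader later in this article relies on.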
(1) Installing requests

In cmd, install requests with the following command:
```
pip install requests
```
Or, with the older easy_install tool (now deprecated in favor of pip):
```
easy_install requests
```
(If the command does not work, the environment variables may be misconfigured. For a fix, see: https://jingyan.baidu.com/article/86f4a73ea7766e37d7526979.html)
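
One way to confirm that the installation worked is to ask Python whether the module can be found. The sketch below uses only the standard library; `is_installed` is a hypothetical helper name introduced here for illustration.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


# Reports whether the requests package is visible to this interpreter.
print("requests installed:", is_installed("requests"))
```

If this prints `False` right after a successful install, pip most likely installed into a different Python interpreter than the one running the script.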
(2) Beautiful Soup
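The first step of the crawler, fetching the whole page's HTML, is now done. The second step is to parse that HTML and extract the content we care about. For this walkthrough, that content is the body text of the article. There are many ways to extract it, for example regular expressions, XPath, or Beautiful Soup. For beginners, Beautiful Soup is the easiest to understand and the simplest to use.

Beautiful Soup is installed the same way as requests, with either of the following commands:
- pip install beautifulsoup4
- easy_install beautifulsoup4

Any mature third-party library comes with detailed official documentation, and fortunately Beautiful Soup also has an official Chinese translation: http://beautifulsoup.readthedocs.io/zh_CN/latest/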
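
As a rough illustration of the parsing step, the sketch below runs Beautiful Soup on a small inline HTML string instead of a downloaded page. The sample markup and the helper name `extract_body` are assumptions made up for this example; only the `showtxt` class name is taken from the downloader code in this article.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a downloaded chapter page.
SAMPLE_HTML = """
<html><body>
  <div class="header">Site navigation</div>
  <div class="showtxt">First paragraph.<br/>Second paragraph.</div>
</body></html>
"""


def extract_body(html):
    """Return the text of the first <div class="showtxt">, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_="showtxt")
    # get_text("\n") joins the text pieces inside the div with newlines.
    return div.get_text("\n") if div is not None else None


print(extract_body(SAMPLE_HTML))
```

Passing `"html.parser"` explicitly selects Python's built-in parser, so nothing beyond beautifulsoup4 itself has to be installed.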
The full code is as follows (Python 3):
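
    import requests
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup


    class downloader(object):
        """Download every chapter of a novel from biqukan.com into one text file."""

        def __init__(self):
            self.server = "http://www.biqukan.com/"         # site root
            self.target = "http://www.biqukan.com/1_1094/"  # table-of-contents page
            self.names = []  # chapter titles
            self.urls = []   # chapter URLs
            self.nums = 0    # number of chapters

        def get_download_url(self):
            """Collect chapter titles and URLs from the table-of-contents page.

            Modified: 2018-06-05
            """
            req = requests.get(url=self.target)
            html = req.text
            # Name the parser explicitly so bs4 does not have to guess (and warn).
            div_bf = BeautifulSoup(html, "html.parser")
            div = div_bf.find_all("div", class_="listmain")
            a = div[0].find_all("a")
            # Skip the first 15 links: the "latest chapters" block that repeats chapters.
            chapters = a[15:]
            self.nums = len(chapters)
            for each in chapters:
                self.names.append(each.string)
                # urljoin avoids a doubled slash when href already starts with "/".
                self.urls.append(urljoin(self.server, each.get("href")))

        def get_contents(self, target):
            """Return the body text of one chapter.

            Parameters:
                target - chapter URL
            Returns:
                texts - chapter text
            Modified: 2018-06-05
            """
            req = requests.get(url=target)
            html = req.text
            bf = BeautifulSoup(html, "html.parser")
            texts = bf.find_all("div", class_="showtxt")
            # Replace each run of 8 non-breaking spaces with a blank line.
            texts = texts[0].text.replace("\xa0" * 8, "\n\n")
            return texts

        def writer(self, name, path, text):
            """Append one chapter (title, then text) to the file at `path`."""
            with open(path, "a", encoding="utf-8") as f:
                f.write(name + "\n")
                f.write(text)
                f.write("\n\n")


    if __name__ == "__main__":
        dl = downloader()
        dl.get_download_url()
        print("Downloading 《一念永恒》:")
        for i in range(dl.nums):
            dl.writer(dl.names[i], "一念永恒.txt", dl.get_contents(dl.urls[i]))
            # Percentage of chapters written so far; "\r" redraws the same console line.
            print("Downloaded: %.3f%%" % (100 * (i + 1) / dl.nums), end="\r", flush=True)
        print("\n《一念永恒》 download complete")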
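
Two string-handling steps in the downloader above can be checked in isolation, without touching the network. The sketch below is illustrative only: the href and the chapter text are made-up values, `build_chapter_url` and `clean_chapter_text` are hypothetical helper names, and it assumes the site's chapter links are root-relative (start with "/"). Under that assumption, plain concatenation with the site root produces a doubled slash, which `urljoin` avoids.

```python
from urllib.parse import urljoin

SERVER = "http://www.biqukan.com/"


def build_chapter_url(href):
    """Resolve a chapter link against the site root."""
    return urljoin(SERVER, href)


def clean_chapter_text(raw):
    """Replace each run of 8 non-breaking spaces with a blank line."""
    return raw.replace("\xa0" * 8, "\n\n")


# Hypothetical href and chapter text, made up for illustration.
href = "/1_1094/0001.html"
print(SERVER + href)            # naive concatenation
print(build_chapter_url(href))  # urljoin result

raw = "\xa0" * 8 + "First paragraph." + "\xa0" * 8 + "Second paragraph."
print(repr(clean_chapter_text(raw)))
```

`urljoin` also handles an href without a leading slash, so the same helper works whichever form the page uses.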

