Python3 web crawler (grabbing text information)

This article is my study notes for the tutorial linked below: Getting started with python3 web crawler in one hour.
The original notes were taken under Windows; in this article, all operations are done under Ubuntu.
The general idea of a crawler really boils down to two points (a minimal sketch of both steps follows the list):

  • Get the HTML of a webpage
  • Parse the HTML and extract the information we actually need
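As a minimal sketch of these two steps (using the requests and Beautiful Soup libraries introduced below, and example.com as a stand-in URL):

    # -*- coding:UTF-8 -*-
    import requests                  # step 1: get the HTML
    from bs4 import BeautifulSoup    # step 2: parse the HTML

    if __name__ == '__main__':
        req = requests.get(url='http://example.com/')    # fetch the page
        bf = BeautifulSoup(req.text, 'html.parser')      # parse it (built-in parser)
        print(bf.title.string)                           # extract one piece of information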

I. Introduction

II. Introduction to web crawlers

1. Inspecting elements

In Chrome, press F12 to open the developer tools and inspect a page's HTML.

2. Simple example

A web crawler fetches a page's HTML from the URL it is given. In Python 3, either the third-party
requests library or the built-in urllib.request module can be used to fetch a page; this article uses requests.

(1) Installing requests on Ubuntu (use the python3- package, since this article uses Python 3):

    sudo apt-get install python3-requests
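Alternatively, if the apt package is unavailable, requests can be installed through pip:

    pip3 install requests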

(2) Simple example

    requests.request()   # constructs a request; the base method underpinning all those below
    requests.get()       # fetches an HTML page; corresponds to HTTP GET
    requests.head()      # fetches only a page's headers; corresponds to HTTP HEAD
    requests.post()      # submits a POST request to a page; corresponds to HTTP POST
    requests.put()       # submits a PUT request; corresponds to HTTP PUT
    requests.patch()     # submits a partial-modification request; corresponds to HTTP PATCH
    requests.delete()    # submits a deletion request; corresponds to HTTP DELETE
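A quick, hedged illustration of two of these methods (httpbin.org is a public echo service, used here only as a stand-in target):

    import requests

    # HEAD fetches only the response headers, not the body
    r = requests.head('https://httpbin.org/get')
    print(r.headers['Content-Type'])

    # POST submits form data to the server
    r = requests.post('https://httpbin.org/post', data={'key': 'value'})
    print(r.status_code)    # 200 means the request succeeded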

For a full tutorial, see the requests library documentation. A GET request, as the name suggests, obtains data from the server. Here is an example:

    # -*- coding:UTF-8 -*-
    import requests

    if __name__ == '__main__':
        target = 'http://gitbook.cn/'
        req = requests.get(url=target)  # req holds the fetched response
        print(req.text)

The following is the HTML captured after executing the program. (In the original screenshot, the left side is the information captured by the program and the right side is the rendered web page.)
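Before relying on req.text, it is worth checking the response; a small sketch using standard requests attributes:

    import requests

    req = requests.get(url='http://gitbook.cn/')
    print(req.status_code)          # 200 means the request succeeded
    # requests guesses the encoding from the headers; if the printed text is
    # garbled, fall back to the encoding detected from the body itself
    req.encoding = req.apparent_encoding
    print(req.text[:200])           # first 200 characters of the HTML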

III. Crawler in practice

1. Novel download

(1) Actual background

Target website: http://www.biqukan.com/
This is a novel website. The goal this time is to crawl a novel called "One Thought Eternal" (一念永恒) and save it to a file.

(2) A first attempt

We start by crawling the content of the first chapter of "One Thought Eternal".
Just modify the code written earlier and run it, as follows:

    # -*- coding:UTF-8 -*-
    import requests

    if __name__ == '__main__':
        target = 'http://www.biqukan.com/1_1094/5403177.html'
        req = requests.get(url=target)
        print(req.text)

Running the code, you will find that what you get is the novel content mixed with all sorts of HTML tags. The next goal is to extract the novel text and filter out those useless tags.

(3)Beautiful Soup

There are many ways to extract what we really need, such as regular expressions, XPath, or Beautiful Soup; Beautiful Soup is used here.
Beautiful Soup is a third-party library with good documentation (including a Chinese translation). To install Beautiful Soup 4 on Ubuntu:

    sudo apt-get install python3-bs4
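The code below passes 'lxml' to Beautiful Soup as its parser, which requires the separate lxml package; on Ubuntu:

    sudo apt-get install python3-lxml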

To check whether Beautiful Soup installed successfully, import it in a Python 3 shell; if no error is raised, it worked:

from bs4 import BeautifulSoup
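As a minimal sketch of how Beautiful Soup is used once installed (the HTML string here is made up for illustration):

    from bs4 import BeautifulSoup

    html = '<div class="showtxt">正文内容</div>'
    bf = BeautifulSoup(html, 'html.parser')       # html.parser is the built-in parser
    div = bf.find_all('div', class_='showtxt')    # class_ avoids the Python keyword class
    print(div[0].text)                            # prints: 正文内容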

Inspecting the chapter page shows that a <div> tag stores the body text of the novel, so the goal now is to extract the content of that div.
The div carries two attributes, class and id: id is the unique identifier of the div, and class specifies one or more class names for the element.
The code to extract the body of the novel is as follows:

    # -*- coding:utf-8 -*-
    import requests
    from bs4 import BeautifulSoup

    if __name__ == '__main__':
        target = 'http://www.biqukan.com/1_1094/5403177.html'
        req = requests.get(url=target)
        html = req.text
        bf = BeautifulSoup(html, 'lxml')
        # Use find_all to get every div tag in the HTML whose class attribute is "showtxt".
        # The first argument of find_all is the tag name; the second, class_, is the attribute.
        # "class" is a keyword in Python, so bs4 spells the attribute "class_" to avoid a clash.
        texts = bf.find_all('div', class_='showtxt')
        # decode() turns texts[0] into readable Chinese text; without it the
        # output is just a pile of encoded markup.
        print(texts[0].decode())

As the output shows, some other HTML tags such as <br> remain in the content.
The next step is to remove these unwanted characters and also delete some unneeded spaces. The code is as follows:

    # -*- coding:utf-8 -*-
    import requests
    from bs4 import BeautifulSoup

    if __name__ == '__main__':
        target = 'http://www.biqukan.com/1_1094/5403177.html'
        req = requests.get(url=target)
        html = req.text
        bf = BeautifulSoup(html, 'lxml')
        # As before, get every div tag whose class attribute is "showtxt"
        texts = bf.find_all('div', class_='showtxt')
        # .text strips the tags; replace() turns each run of eight
        # non-breaking spaces into a paragraph break (explained below)
        print(texts[0].text.replace('\xa0' * 8, '\n\n'))

After running the code, the chapter text prints cleanly, broken into paragraphs. HTML uses &nbsp; to represent a non-breaking space (remember to add the ;), which Python receives as the character '\xa0'. The last line of the code above replaces every run of eight of these spaces in the text with two newlines, i.e. a paragraph break.
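The effect of that replace can be checked on a made-up string:

    text = '\xa0' * 8 + '第一段。' + '\xa0' * 8 + '第二段。'
    # each run of eight non-breaking spaces becomes a blank line
    print(text.replace('\xa0' * 8, '\n\n'))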
So far, we can grab the content of one chapter of the novel and display it paragraph by paragraph. The next goal is to download the entire novel.
Inspecting the elements again shows that all the chapter titles of the target novel live under the <div class="listmain"> tag.
The individual chapters sit inside <dd><a></a></dd> tags within that div. In HTML, the <a></a> tag stores a hyperlink, with the link address in its href attribute.
Next, first grab the directory list of the novel, the code is as follows:

    # -*- coding:utf-8 -*-
    import requests
    from bs4 import BeautifulSoup

    if __name__ == '__main__':
        target = 'http://www.biqukan.com/1_1094/'
        req = requests.get(url=target)
        html = req.text
        # name the parser explicitly; omitting it triggers a bs4 warning
        div_bf = BeautifulSoup(html, 'lxml')
        div = div_bf.find_all('div', class_='listmain')
        print(div[0])

The crawling result is the whole directory block, containing every chapter title and link.
Next, go through each captured <a></a> tag and extract the chapter name and chapter link. Taking the first chapter as an example, the content of its tag is as follows:

<a href="/1_1094/5403177.html">第一章 他叫白小纯</a>

For a match a returned by Beautiful Soup, use a.get('href') to read the value of the href attribute and a.string to read the chapter name. The code is as follows:

    # -*- coding:utf-8 -*-
    import requests
    from bs4 import BeautifulSoup

    if __name__ == '__main__':
        server = 'http://www.biqukan.com'
        target = 'http://www.biqukan.com/1_1094/'
        req = requests.get(url=target)
        html = req.text
        div_bf = BeautifulSoup(html, 'lxml')
        div = div_bf.find_all('div', class_='listmain')
        # parse just the directory div, then collect every <a> tag in it
        a_bf = BeautifulSoup(str(div[0]), 'lxml')
        a = a_bf.find_all('a')
        for each in a:
            print(each.string, server + each.get('href'))

The code prints each chapter title together with its full link.
Now the chapter name and chapter link of every chapter are available. The next step is to integrate the code and write the fetched content into a text file for storage. The code is as follows:

    # -*- coding:utf-8 -*-
    import requests, sys
    from bs4 import BeautifulSoup


    class downloader(object):
        def __init__(self):
            self.server = 'http://www.biqukan.com'   # no trailing slash: the hrefs already start with /
            self.target = 'http://www.biqukan.com/1_1094/'
            self.names = []   # chapter names
            self.urls = []    # chapter links
            self.nums = 0     # number of chapters

        # Collect the download address of every chapter
        def get_download_url(self):
            req = requests.get(url=self.target)
            html = req.text
            div_bf = BeautifulSoup(html, 'lxml')
            div = div_bf.find_all('div', class_='listmain')
            a_bf = BeautifulSoup(str(div[0]), 'lxml')
            a = a_bf.find_all('a')
            # the first 15 links are the "latest chapters" block, so skip them
            self.nums = len(a[15:])
            for each in a[15:]:
                self.names.append(each.string)
                self.urls.append(self.server + each.get('href'))

        # Fetch the content of one chapter
        def get_contents(self, target):
            req = requests.get(url=target)
            html = req.text
            bf = BeautifulSoup(html, 'lxml')
            texts = bf.find_all('div', class_='showtxt')
            texts = texts[0].text.replace('\xa0' * 8, '\n\n')
            return texts

        # Append a fetched chapter to the output file
        def writer(self, name, path, text):
            with open(path, 'a', encoding='utf-8') as f:
                f.write(name + '\n')
                f.writelines(text)
                f.write('\n\n')


    # Main program
    if __name__ == "__main__":
        dl = downloader()
        dl.get_download_url()
        print('Downloading 一念永恒:')
        for i in range(dl.nums):
            dl.writer(dl.names[i], '一念永恒.txt', dl.get_contents(dl.urls[i]))
            # i/dl.nums is a fraction, so multiply by 100 to show a percentage
            sys.stdout.write("  Downloaded: %.3f%%" % float(i / dl.nums * 100) + '\r')
            sys.stdout.flush()
        print('一念永恒 download finished')
