Python simple crawler, installment three!

Lecture three:

  Today we introduce a new module, Beautiful Soup!

  This module can do some almost magical things. But before any of that, it has to be installed properly; only then can it show off what it can do.

  Install as follows:

  Open cmd (the Command Prompt) on Windows 10

  Type pip install beautifulsoup4

  That takes care of installation. We won't dwell on the details here; let's focus on its amazing features!
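A quick way to confirm the installation worked is simply to import the package (this assumes the pip command above finished without errors):

```python
# If this import succeeds, beautifulsoup4 is installed correctly
import bs4

print(bs4.__version__)
```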

 

''' Add new module '''

import requests
from bs4 import BeautifulSoup


def getHTML_Text(url):
    try:
        r = requests.get(url)
        r.raise_for_status()  # if the status is not 200, raise an exception
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return 'An exception occurred'


if __name__ == '__main__':
    url = 'https://www.hao123.com/'
    print(getHTML_Text(url)[:5000])

 

 

  This is essentially the code from the previous lecture, with a small modification. Run it:

 

 

 

  We get back a long string which, as mentioned before, contains the information we need (refer to Example 1 in the second lecture). What we have to do today is extract it, and that is exactly where the magic of the BeautifulSoup module comes in.

  Without further ado, let's first go over some basic uses of BeautifulSoup.

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>', 'html.parser')

 

  Here '<p>data</p>' stands for the string to be parsed; that is, as we mentioned in the last two lectures, the text content of the crawled web page.

  'html.parser' names the parser used to parse HTML documents.

  The code above builds a tag tree corresponding to the entire content of the HTML document; once the tree is built, finding the content we want becomes much more convenient.
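A quick way to see the tag tree in action, assuming only that beautifulsoup4 is installed:

```python
from bs4 import BeautifulSoup

# Build a tag tree from a one-tag document
soup = BeautifulSoup('<p>data</p>', 'html.parser')

print(soup.p)         # the whole <p> tag
print(soup.p.string)  # just the text inside it
```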

  Basic elements of the BeautifulSoup class:

Tag: a tag, the most basic unit of information organization; <> and </> mark its beginning and end
Name: the name of the tag; accessed as <tag>.name
Attributes: the attributes of the tag, organized as a dictionary; accessed as <tag>.attrs
NavigableString: the non-attribute string inside a tag, i.e. the text between <>...</>; accessed as <tag>.string
Comment: the comment portion of a string inside a tag
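These elements can be tried out on a small inline document, with no network access needed (the HTML string here is made up purely for illustration):

```python
from bs4 import BeautifulSoup, Comment

html = '<a href="https://example.com" class="link">text</a><b><!--hidden--></b>'
soup = BeautifulSoup(html, 'html.parser')

tag = soup.a
print(tag.name)    # the tag's name: a
print(tag.attrs)   # the attributes as a dict; note class is multi-valued
print(tag.string)  # the NavigableString inside the tag

# When a tag's only content is a comment, .string is a Comment object
print(type(soup.b.string) is Comment)
print(soup.b.string)
```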

  Let's show them one by one:

''' BeautifulSoup demo '''

import requests
from bs4 import BeautifulSoup


def getHTML_Text(url):
    try:
        r = requests.get(url)
        r.raise_for_status()  # if the status is not 200, raise an exception
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return 'An exception occurred'


if __name__ == '__main__':
    url = 'https://www.hao123.com/'
    soup = BeautifulSoup(getHTML_Text(url), 'html.parser')
    title = soup.title
    tag = soup.a
    print(title.string)
    print(str(tag.name))
    print(str(tag.attrs))
    print(str(tag.string))
    print(str(tag.comment))  # note: .comment is not a real Tag attribute, so this prints None

 

  The output: first the content of the page's title tag is printed; then, after grabbing the first <a> tag, its name, attributes, string content, and comment part are printed in turn.

 

  Those are the basic uses. There are more interesting functions you can discover on your own; they are only touched on briefly here.

  The prettify() method is also worth introducing here, as it makes the output much friendlier to read.

  For example:

'''BeautifulSoup demo'''

import requests
from bs4 import BeautifulSoup


def getHTML_Text(url):
    try:
        r = requests.get(url)
        r.raise_for_status()  # if the status is not 200, raise an exception
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return 'An exception occurred'


if __name__ == '__main__':
    url = 'https://www.hao123.com/'
    soup = BeautifulSoup(getHTML_Text(url), 'html.parser')
    print(soup.prettify()[:5000])

 

 

  The result is a much friendlier text layout; printing the raw string directly gives a dense wall of characters that is genuinely painful to read.
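To see the effect without relying on the original screenshot, here is a minimal offline sketch (the HTML string is made up for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><body><p>data</p></body></html>', 'html.parser')
print(soup.prettify())
```

With 'html.parser', prettify() puts each tag on its own line and indents each nesting level by one space, printing something like <html> at column zero, then <body>, <p>, and the text data each one step further in.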

  Finally, a small demonstration of one of BeautifulSoup's most useful methods; the details will be covered in the next lecture. Look at the code!

'''BeautifulSoup demo'''

import requests
from bs4 import BeautifulSoup


def getHTML_Text(url):
    try:
        r = requests.get(url)
        r.raise_for_status()  # if the status is not 200, raise an exception
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return 'An exception occurred'


if __name__ == '__main__':
    url = 'https://www.hao123.com/'
    soup = BeautifulSoup(getHTML_Text(url), 'html.parser')
    for link in soup.find_all('a'):
        print(link.get('href'))

 

 

  We have now grabbed the URL links of all the <a> tags. Pretty powerful, right?
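The same find_all pattern works on any HTML string; here is a self-contained sketch with made-up URLs:

```python
from bs4 import BeautifulSoup

# A made-up page with three links, one of them missing an href
html = ('<a href="https://a.example/">one</a>'
        '<a href="https://b.example/">two</a>'
        '<a>no link</a>')
soup = BeautifulSoup(html, 'html.parser')

# find_all('a') returns every <a> tag; .get('href') returns None when absent
for link in soup.find_all('a'):
    print(link.get('href'))
```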

  What is this good for, you ask? Aha, what indeed; we'll continue with that in the next lecture.

  That's all for today. Thank you.
