Lecture three:
Today we introduce a new module, Beautiful Soup!
This module offers many almost magical functions, but the magic only works once it is properly installed.
Install as follows:
Run cmd (the command prompt) on a Windows 10 system
Type pip install beautifulsoup4
That's all there is to installation. Let's not dwell on the details here and focus instead on its amazing features!
```python
'''Add a new module'''

import requests
from bs4 import BeautifulSoup


def getHTML_Text(url):
    try:
        r = requests.get(url)
        r.raise_for_status()  # If the status is not 200, raise an exception
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return 'An exception occurred'


if __name__ == '__main__':
    url = 'https://www.hao123.com/'
    print(getHTML_Text(url)[:5000])
```
This is essentially the code from our last lecture; after a simple modification, run it:
You get a long string which, as mentioned earlier, contains the information we need (see Example 1 in the second lecture). What we have to do today is extract that information, and this is where the magic of the BeautifulSoup module becomes self-evident.
Without further ado, let's first look at some basic uses of BeautifulSoup.
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>data</p>', 'html.parser')
```
Here '<p>data</p>' is a placeholder string; in practice it would be the text of the crawled web page, as in the last two lectures. 'html.parser' is the parser used to parse the HTML document.
Through the above code, a tag tree can be established, corresponding to the entire content of the HTML document, so that after establishment, it is more convenient for us to find the desired content.
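As a minimal sketch of that idea (using a small hypothetical HTML fragment rather than a crawled page), once the tag tree is built, content can be reached by dotted attribute names:

```python
from bs4 import BeautifulSoup

# A small hypothetical HTML fragment for illustration
html = '<html><body><p class="intro">data</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.body.p)         # the whole <p> tag
print(soup.p.string)       # just its text: data
print(soup.p.parent.name)  # the enclosing tag's name: body
```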
Basic elements of the BeautifulSoup class:

| Basic element | Description |
| --- | --- |
| Tag | A tag, the most basic unit of information organization; <> and </> mark its beginning and end |
| Name | The name of the tag; format: <tag>.name |
| Attributes | The attributes of the tag, organized as a dictionary; format: <tag>.attrs |
| NavigableString | The non-attribute string inside a tag, i.e. the string between <>…</>; format: <tag>.string |
| Comment | A special string type for the comment portion inside a tag |
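The elements in the table above can be demonstrated on a self-contained snippet (a hypothetical fragment, chosen so that the first <a> tag contains only a comment, which is exactly the subtle case the Comment type exists for):

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# A hypothetical snippet exercising every element in the table above
html = '<a href="http://example.com" class="link"><!-- a comment --></a><b>bold</b>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.a.name)    # Name: a
print(soup.a.attrs)   # Attributes as a dict; note 'class' becomes a list: ['link']
print(soup.b.string)  # NavigableString: bold
print(soup.a.string)  # the comment text:  a comment
print(type(soup.a.string) is Comment)  # True: comments get their own type
```

The catch is that <tag>.string returns the comment text with the <!-- --> markers stripped, so checking the type is the only reliable way to tell a comment apart from ordinary text.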
Let's show them one by one:
```python
'''BeautifulSoup demo'''

import requests
from bs4 import BeautifulSoup


def getHTML_Text(url):
    try:
        r = requests.get(url)
        r.raise_for_status()  # If the status is not 200, raise an exception
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return 'An exception occurred'


if __name__ == '__main__':
    url = 'https://www.hao123.com/'
    soup = BeautifulSoup(getHTML_Text(url), 'html.parser')
    title = soup.title
    tag = soup.a
    print(title.string)  # text of the <title> tag
    print(tag.name)      # name of the first <a> tag
    print(tag.attrs)     # its attributes, as a dict
    print(tag.string)    # its string content
    # Note: tags have no .comment attribute; this looks for a child
    # tag named <comment> and prints None when there isn't one
    print(tag.comment)
```
Running result:
It first prints the content of the <title> tag of the HTML document; then, for the first <a> tag, it prints the tag name, the tag attributes, the tag's string content, and the comment part.
In general, these are the main uses; there are more interesting functions you can discover for yourself, so they are only briefly mentioned here.
The prettify() method is also introduced here, which makes the output much friendlier. For example:
```python
'''BeautifulSoup demo'''

import requests
from bs4 import BeautifulSoup


def getHTML_Text(url):
    try:
        r = requests.get(url)
        r.raise_for_status()  # If the status is not 200, raise an exception
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return 'An exception occurred'


if __name__ == '__main__':
    url = 'https://www.hao123.com/'
    soup = BeautifulSoup(getHTML_Text(url), 'html.parser')
    print(soup.prettify()[:5000])
```
The result, as shown above, is displayed in a much friendlier textual form. Printed directly, the dense wall of characters is genuinely headache-inducing to read.
Finally, a small demonstration of one of BeautifulSoup's most useful methods; the details will be covered next time. Look at the code!
```python
'''BeautifulSoup demo'''

import requests
from bs4 import BeautifulSoup


def getHTML_Text(url):
    try:
        r = requests.get(url)
        r.raise_for_status()  # If the status is not 200, raise an exception
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return 'An exception occurred'


if __name__ == '__main__':
    url = 'https://www.hao123.com/'
    soup = BeautifulSoup(getHTML_Text(url), 'html.parser')
    for link in soup.find_all('a'):
        print(link.get('href'))
```
We have now obtained the URL links of all the <a> tags. Isn't that powerful?
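One common follow-up worth sketching (this is my addition, not part of the lecture): href values can be relative or missing entirely, so filtering out the None values and resolving the rest against the base URL with the standard library's urljoin makes the list directly usable. The HTML string below is a hypothetical stand-in for the downloaded page:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base = 'https://www.hao123.com/'
# A small hypothetical page standing in for the downloaded HTML
html = '<a href="/sub/page">rel</a><a>no href</a><a href="https://example.com/">abs</a>'
soup = BeautifulSoup(html, 'html.parser')

links = []
for a in soup.find_all('a'):
    href = a.get('href')  # returns None when the attribute is absent
    if href:
        links.append(urljoin(base, href))  # resolve relative links against the base

print(links)
# ['https://www.hao123.com/sub/page', 'https://example.com/']
```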
You ask what use this is? Aha, what use indeed? We will continue in the next lecture.
That's all for today. Thank you.