Introduction: Now that we have written a polished article page, how should the article be stored in the database? In fact, articles are stored in the database as HTML. What we normally see is the output of a rich-text editor: the editor does the conversion for us, so we can write text directly instead of writing HTML by hand.
Beautiful Soup, as introduced in the official documentation
Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to extract. Because it is simple, a complete application takes very little code.
Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it automatically. In that case, you only need to state the original encoding.
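As a rough illustration of the encoding behaviour described above, the sketch below feeds Beautiful Soup some GBK-encoded bytes and states the encoding explicitly through the `from_encoding` argument. The sample markup and the choice of GBK are assumptions made up for this example.

```python
from bs4 import BeautifulSoup

# Sample bytes in a non-UTF-8 encoding (GBK is an assumption for this demo).
raw = "<p>中文内容</p>".encode("gbk")

# State the original encoding instead of letting Beautiful Soup guess.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.get_text()           # decoded to a Unicode str
utf8_bytes = soup.encode("utf-8")  # document re-encoded as UTF-8 bytes

print(text)
print(soup.original_encoding)      # the encoding used to decode the input
```

If the input is already a `str`, no decoding takes place and `from_encoding` is not needed.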
Beautiful Soup has become an excellent Python parsing tool alongside lxml and html5lib, giving users the flexibility to choose between different parsing strategies or to trade them for speed.
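To make the idea of interchangeable parsers concrete, here is a minimal sketch that parses the same broken snippet with each parser name Beautiful Soup accepts. It assumes only the built-in `html.parser` is guaranteed to be present; `lxml` and `html5lib` are optional third-party packages, so the code skips any that are not installed. The sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

html = "<p>Hello <b>world"  # deliberately left unclosed

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(html, parser))
    except FeatureNotFound:
        # Raised when the requested parser library is not installed.
        results[parser] = None

for parser, output in results.items():
    print(parser, "->", output if output is not None else "not installed")
```

Each parser repairs invalid markup in its own way, so the printed trees may differ from one another.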
Installing Beautiful Soup
We recommend installing the latest version, Beautiful Soup 4. You can install it directly into your Python environment, for example with pip install beautifulsoup4.
Using Beautiful Soup
Import the module
from bs4 import BeautifulSoup
Create the soup object
soup = BeautifulSoup(content)  # content is the text you want to process
Get the parsed content
tags = soup.find_all()
How it differs from plain text
[<html><body> </pre>]
The result above is one big list holding the whole document, and it contains more than the plain text, because it captures all of the HTML, including any JavaScript written directly into the article.
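The point above can be sketched with a small, made-up article body: calling `find_all()` with no arguments returns every tag in document order, script tags included. The `content` string here is a hypothetical stand-in for what an editor might store, and `html.parser` is chosen only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical article body, as an editor might store it in the database.
content = "<div><h1>Title</h1><p>Body text</p><script>alert(1)</script></div>"

soup = BeautifulSoup(content, "html.parser")
tags = soup.find_all()  # no filter: every tag in the document, in order

names = [tag.name for tag in tags]
print(names)
print(isinstance(tags, list))  # find_all returns a list-like ResultSet
```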
Get all tag names
for tag in tags: print(tag.name, type(tag.name))
div <class 'str'>
h1 <class 'str'>
p <class 'str'>
p <class 'str'>
h2 <class 'str'>
div <class 'str'>
Preventing script injection attacks
for tag in tags:
    if tag.name == 'script':
        tag.decompose()  # remove the tag and its contents from the tree
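A self-contained sketch of the same idea follows. It asks `find_all` for script tags directly rather than checking every tag's name, which is an alternative to the loop above, not a replacement for it. The sample `content` is invented for the demonstration.

```python
from bs4 import BeautifulSoup

content = "<div><p>Safe text</p><script>alert('xss')</script></div>"
soup = BeautifulSoup(content, "html.parser")

# Select script tags directly instead of filtering every tag by name.
for tag in soup.find_all("script"):
    tag.decompose()  # removes the tag and everything inside it

cleaned = str(soup)
print(cleaned)
```

Removing script tags covers only the case discussed here. Markup can also carry scripts in other places, such as inline event-handler attributes, which this sketch does not touch.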
Truncating the text to produce a summary
desc = soup.text[0:150]
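Putting the steps together, a summary helper might look like the sketch below. The function name `make_summary` and its `length` parameter are hypothetical, introduced only for this example; the 150-character limit mirrors the slice used above.

```python
from bs4 import BeautifulSoup

def make_summary(html, length=150):
    """Return up to `length` characters of plain text from an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("script"):
        tag.decompose()  # drop scripts before extracting text
    # soup.text is shorthand for soup.get_text()
    return soup.text[0:length]

# Made-up article: a heading, a long paragraph, and an injected script.
article = "<h1>Title</h1><p>" + "word " * 100 + "</p><script>alert(1)</script>"
desc = make_summary(article)
print(len(desc), desc[:20])
```

Slicing by character count can cut a word in half; whether that matters depends on how the summary is displayed.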
With that, the article's HTML has been turned into plain text and the script injection problem has been dealt with. Beautiful Soup can do far more than this, and later posts will cover more!