Using BeautifulSoup to generate article summaries and prevent script attacks

Introduction: Now that we have written our articles, how are they stored in the database? In fact, every article is stored in the database as HTML. What we normally see is plain text only because we write in an editor that does the conversion internally, so we can type text directly instead of writing HTML by hand.

Introduction to Beautiful Soup (from the official documentation)

  Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands the user the data they want to extract. Because it is so simple, a complete application does not take much code.

Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it automatically. In that case, you only need to state the original encoding.
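As a minimal sketch of stating the original encoding yourself: the sample bytes below are made up for illustration and are assumed to be Latin-1 with no encoding declared. `from_encoding` is the constructor argument Beautiful Soup accepts for this purpose.

```python
from bs4 import BeautifulSoup

# Hypothetical input: Latin-1 bytes that carry no encoding declaration.
raw = "<p>caf\u00e9</p>".encode("latin-1")

# Tell Beautiful Soup the original encoding explicitly.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.get_text()      # a Unicode str
out = soup.encode("utf-8")    # the document serialized as UTF-8 bytes
print(text)
```

Inside the program the text is ordinary Unicode, and the bytes are only produced again when the document is written out.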

Beautiful Soup has become an excellent Python parsing tool on a par with lxml and html5lib, giving users the flexibility to choose between different parsing strategies or to favor speed.
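Beautiful Soup can also sit on top of lxml or html5lib when they are installed. The sketch below is one possible way to pick whichever parser is available; `make_soup` and its default parser order are hypothetical names chosen for this example, not part of the library.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser that is installed."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, name)
        except FeatureNotFound:
            # This parser is not installed, so try the next one.
            continue
    raise RuntimeError("no usable parser found")

soup = make_soup("<p>hello</p>")
print(soup.p.get_text())
```

`html.parser` ships with Python, so the last entry should always be available as a fallback.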

Installing Beautiful Soup

  We recommend the latest version, Beautiful Soup 4. It can be installed directly with pip, like any other Python package (the package name is beautifulsoup4).

Use Beautiful Soup

Import module

from bs4 import BeautifulSoup

Create the soup object

soup = BeautifulSoup(content, 'html.parser')    # content is the text you want to process; naming the parser avoids a warning

Get the parsed content

tags = soup.find_all()

How it differs from plain text

[<html><body>     </pre>]

The result above is one big list that holds the whole document. It contains more than ordinary text, because it includes every HTML tag in the file, along with any JavaScript that was written directly into it.
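To make the difference concrete, here is a small sketch with a made-up `content` string (the article's real content is not shown). It lists what `find_all()` returns for a document that contains a script.

```python
from bs4 import BeautifulSoup

# Hypothetical article body, including a script someone typed in.
content = "<div><h1>Title</h1><p>Body</p><script>alert(1)</script></div>"

soup = BeautifulSoup(content, "html.parser")
tags = soup.find_all()            # every tag in the document, in order

names = [tag.name for tag in tags]
print(names)
print(tags[0])                    # the outer div, markup and script included
```

Each entry is a tag object that still carries its markup and children, which is why the script is still present at this point.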

Get all tag names

for tag in tags:
    print(tag.name, type(tag.name))    # the tag name and its type

div <class 'str'>
h1 <class 'str'>
p <class 'str'>
p <class 'str'>
h2 <class 'str'>
div <class 'str'>

Preventing script attacks

for tag in tags:
    if tag.name == 'script':
        tag.decompose()   # remove the tag and its contents from the tree
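The loop above can be wrapped up as a small function. This is only a sketch: `strip_scripts` is a hypothetical name, and it removes `<script>` tags only, as the article does, so it should not be read as a complete defense against every kind of injected markup.

```python
from bs4 import BeautifulSoup

def strip_scripts(html):
    """Hypothetical helper: return the HTML with all <script> tags removed."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all("script") builds the list first, so removing tags
    # while looping over it is safe.
    for tag in soup.find_all("script"):
        tag.decompose()
    return str(soup)

cleaned = strip_scripts('<p>hi</p><script>alert("x")</script>')
print(cleaned)
```

Searching for `script` directly avoids walking over every tag just to compare names.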

Generating the article summary

desc = soup.text[0:150]
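Putting the two steps together, a summary helper might look like the sketch below. `make_desc` and its `length` parameter are hypothetical names for this example; the 150-character default comes from the slice above.

```python
from bs4 import BeautifulSoup

def make_desc(html, length=150):
    """Hypothetical helper: plain-text summary of an HTML article."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop scripts first so their code never leaks into the summary.
    for tag in soup.find_all("script"):
        tag.decompose()
    # soup.text joins all remaining text; keep the first `length` characters.
    return soup.text[0:length]

html = "<div><h1>Title</h1><p>Body text</p><script>alert(1)</script></div>"
print(make_desc(html))
print(make_desc(html, 5))
```

Note that `soup.text` joins neighboring pieces of text without a separator. If that matters, `soup.get_text(" ", strip=True)` is an alternative that puts a space between them.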

With that, the article's HTML has been turned into plain text and the script attack problem has been dealt with. Beautiful Soup can do far more than this, and later posts will cover more of it.

 


Origin www.cnblogs.com/mcc61/p/11084810.html