A small web crawler demo: scrape a poem, clean the data, and save it to a txt file

import requests, re
from bs4 import BeautifulSoup

URL = "https://trade.500.com/sfc/"
URL2 = "https://so.gushiwen.org/shiwenv_4d3b4d132c82.aspx"

req = requests.get(URL2)
if req.status_code == 200:
    # Pages reported as GBK (or with the ISO-8859-1 default) are decoded as GBK
    if req.encoding == "GBK" or req.encoding == "ISO-8859-1":
        html = req.content.decode("GBK")
    else:
        html = req.text

    soup = BeautifulSoup(html, 'lxml')

    # Locating elements with regular expressions
    # Find all heading tags h1-h6
    # result = soup.findAll(re.compile("h[1-6]"))
    # Find all <a> tags whose text contains "500"
    # result2 = soup.findAll("a", text=re.compile(".*(500)+.*"))
    # Find external links, i.e. href="http://..." or "https://..."
    # result3 = soup.findAll("a", attrs={"href": re.compile(r"^(http:)|^(https:).*")})

    # Locating elements by navigating the tree
    # soup.body.children
    # soup.body.descendants
    # soup.body.find("div").next_siblings
    # soup.body.find("div").parent

    # Print the full page source
    # print(soup)
    # Get the title:
    title = soup.findAll("h1")
    title = [x.text for x in title]
    title = "".join(title)
    print(title)
    # Get the content:
    # content = soup.body.findAll("div", id="contson4d3b4d132c82")
    content = soup.body.findAll("div", attrs={"id": "contson4d3b4d132c82"})  # same effect as the line above
    content = [x.text for x in content]

    # Clean the content:
    content = "".join(content).strip()  # strip leading and trailing whitespace
    # content = re.sub("original text", "replacement text", content)
    # content = re.sub(r"\(.*?\)", "", content)  # .*? is a lazy match; without the ? it is a greedy match
    print(content)

    # Finally, write the result to a txt file
    with open(f"{title}.txt", "w", encoding="utf-8") as f:
        f.write(title + "\n" + content)

else:
    print("Connection failed, please check the program and the network.")
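
The encoding check and the commented-out `re.sub` cleaning step can be tried without a network connection. The sketch below is only an illustration and is not part of the original script: `decode_body`, `strip_notes`, and the sample strings are hypothetical names and data made up for this example. It assumes, as the script does, that a page reported as GBK or ISO-8859-1 is really GBK.

```python
import re


def decode_body(raw, declared):
    """Decode response bytes the way the script does.

    `raw` stands in for req.content and `declared` for req.encoding
    (both are assumptions made for this offline sketch).
    """
    if declared in ("GBK", "ISO-8859-1"):
        return raw.decode("GBK")
    # Fall back to UTF-8 when no encoding was reported.
    return raw.decode(declared or "utf-8")


def strip_notes(text, lazy=True):
    """Remove parenthesized notes from text.

    With lazy=True the pattern \\(.*?\\) stops at the first closing
    parenthesis; with lazy=False the pattern \\(.*\\) runs to the last one.
    """
    pattern = r"\(.*?\)" if lazy else r"\(.*\)"
    return re.sub(pattern, "", text)


# Made-up sample data, not taken from the scraped page.
sample = "first line(note a), second line(note b)"

lazy_result = strip_notes(sample, lazy=True)
greedy_result = strip_notes(sample, lazy=False)

print(lazy_result)    # first line, second line
print(greedy_result)  # first line
print(decode_body("词".encode("gbk"), "ISO-8859-1"))  # 词
```

The lazy pattern removes each note separately, while the greedy one also swallows the text between the first `(` and the last `)`, which is why the script's comment recommends `.*?` for this kind of cleaning.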












Origin www.cnblogs.com/yiyea/p/11442405.html