Beautiful Soup is a Python library for extracting data from HTML and XML files. The bs4 module, combined with the requests library, makes it easy to write simple web crawlers.
Installation
- Command: pip install beautifulsoup4
Parser
- The main parsers, with their trade-offs, are: Python's built-in html.parser (no extra dependency, decent speed), lxml (very fast and lenient), and html5lib (slowest, but parses pages the same way a browser does).
Installation command:
- pip install lxml
- pip install html5lib
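The choice of parser is just the second argument to the BeautifulSoup constructor. A minimal sketch with a toy HTML string; the lxml and html5lib lines are commented out so it runs with only the built-in parser installed:

```python
from bs4 import BeautifulSoup

html = "<html><body><p>Hello</p><p>World</p></body></html>"

# Python's built-in parser: no extra install needed
soup_builtin = BeautifulSoup(html, "html.parser")

# lxml: very fast, lenient (requires `pip install lxml`)
# soup_lxml = BeautifulSoup(html, "lxml")

# html5lib: slowest, but browser-like parsing (requires `pip install html5lib`)
# soup_html5 = BeautifulSoup(html, "html5lib")

print([p.get_text() for p in soup_builtin.find_all("p")])
```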
requests
- requests is implemented on top of urllib; it automatically unpacks (gzip-compressed, etc.) web content for us
- Installation command: pip install requests
- It is recommended to use response.content.decode() to get the HTML text of the response
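A minimal sketch of that pattern. The fetch_html helper name is my own, not part of requests, and the network call is commented out so the snippet stays offline:

```python
import requests

def fetch_html(url, encoding="utf-8"):
    """Fetch a page and decode the raw bytes explicitly.

    response.content holds the raw response bytes (requests has already
    un-gzipped them); decoding them ourselves avoids relying on requests'
    charset guess, which is what makes response.content.decode() preferable
    to response.text for pages with unusual encodings.
    """
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.content.decode(encoding)

# html = fetch_html("http://example.com")  # network call, shown for illustration
```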
pandas
- Installation command: pip install pandas
- pandas is a tool built on top of NumPy, created to solve data-analysis tasks.
Data structures:
- Series: a one-dimensional array, similar to a one-dimensional NumPy array. Both are also very similar to Python's basic list data structure. The difference is that a list's elements can have different data types, while a Series only allows elements of the same data type; this uses memory more efficiently and improves performance.
- Time-Series: a Series indexed by timestamps.
- DataFrame: a two-dimensional tabular data structure with many features similar to R's data.frame. A DataFrame can be understood as a container of Series. The content below is mainly based on DataFrame.
- Panel: a three-dimensional array that can be understood as a container of DataFrames. (Panel was removed in pandas 0.25; use a MultiIndex DataFrame instead.)
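The data structures above can be sketched as follows (Panel is omitted since it no longer exists in current pandas):

```python
import pandas as pd

# Series: one-dimensional; all elements share a single dtype
s = pd.Series([21, 18, 25], index=["mon", "tue", "wed"])
print(s.dtype)   # int64

# Time-Series: a Series whose index consists of timestamps
ts = pd.Series([1.0, 2.0], index=pd.to_datetime(["2018-12-01", "2018-12-02"]))

# DataFrame: two-dimensional; each column behaves like a Series
df = pd.DataFrame({"temp": s, "city": "hefei"})
print(df.shape)              # (3, 2)
print(df["temp"].tolist())   # [21, 18, 25]
```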
Usage
Beautiful Soup converts a complex HTML document into a tree of Python objects; every node is a Python object, and all objects fall into four kinds:
- Tag
- NavigableString
- BeautifulSoup
- Comment
Tag: a Tag object corresponds to a tag in the original XML or HTML document. A tag's most important attributes are its name and its attributes.
To get a specified tag or attribute value from a web page:
- Get the tag name: tag.name; the tag's type is <class 'bs4.element.Tag'>
- Get all attributes: tag.attrs
- Get a single attribute: tag.get('attribute name') or tag['attribute name']
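A small illustration of those accessors on a made-up tag:

```python
from bs4 import BeautifulSoup

html = '<a href="http://example.com" class="link" id="first">Example</a>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.a

print(tag.name)          # a
print(type(tag))         # <class 'bs4.element.Tag'>
print(tag.attrs)         # all attributes as a dict; note class is multi-valued
print(tag.get("href"))   # http://example.com
print(tag["id"])         # first
```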
Tag features
- stripped_strings: the output strings may contain extra whitespace or blank lines; use .stripped_strings to remove the excess whitespace
- Pretty-print the page: soup.prettify()
Finding elements:
- find_all(class_="class") returns multiple tags
- find(class_="class") returns a single tag
- select_one() returns a single tag
- select() returns multiple tags
- soup = BeautifulSoup(backdata, 'html.parser')  # convert to a BeautifulSoup object
- soup.find_all('tag name', attrs={'attribute name': 'attribute value'})  # returns a list
- limit controls the number of results find_all returns
- recursive=False returns only the tag's direct children
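The lookup methods above, including limit and recursive=False, demonstrated on a toy document:

```python
from bs4 import BeautifulSoup

html = """
<div id="outer">
  <p class="row">one</p>
  <div><p class="row">nested</p></div>
  <p class="row">two</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.find_all("p", class_="row")))    # 3: searches all descendants
print(soup.find("p", class_="row").get_text())  # one: first match only
print(len(soup.find_all("p", limit=2)))         # 2: limit caps the result count

outer = soup.find("div", attrs={"id": "outer"})
print(len(outer.find_all("p", recursive=False)))  # 2: direct children only

print(soup.select_one("div#outer > p.row").get_text())  # one: CSS selector, first match
print(len(soup.select("p.row")))                        # 3: CSS selector, all matches
```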
Demo

```python
import sys
import io

import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup as bs
from py_teldcore import sqlserver_db as db  # the author's SQL Server helper module

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')

url = "http://www.tianqihoubao.com/lishi/hefei/month/201812.html"


def get_soap():
    try:
        r = requests.get(url)
        soap = bs(r.text, "lxml")
        return soap
    except Exception as e:
        print(e)
        return "Request Error"


def save2csv(data, path):
    result_weather = pd.DataFrame(data, columns=['date', 'tq', 'temp', 'wind'])
    result_weather.to_csv(path, encoding='gbk')
    print('save weather success')


def save2mssql(data):
    sql = "Insert into Weather(date, tq, temp, wind) values(%s, %s, %s, %s)"
    data_list = np.ndarray.tolist(data)
    sqlvalues = [tuple(iq) for iq in data_list]
    try:
        db.exec_sqlmany(sql, sqlvalues)
    except Exception as e:
        print(e)


def get_data():
    soap = get_soap()
    # every <tr> of the weather-history table; the first row is the header
    all_weather = soap.find("div", class_="wdetail").find("table").find_all("tr")
    data = list()
    for tr in all_weather[1:]:
        td_li = tr.find_all("td")
        for td in td_li:
            s = td.get_text()
            # collapse internal whitespace and newlines
            data.append("".join(s.split()))
    # four cells per row: date, weather, temperature, wind
    res = np.array(data).reshape(-1, 4)
    return res


if __name__ == "__main__":
    data = get_data()
    save2mssql(data)
    print("save2 Sqlserver ok!")
```