Python bs4 (BeautifulSoup)

Beautiful Soup is a Python library for extracting data from HTML and XML files. Combined with the requests library, the bs4 module makes it easy to write simple web crawlers.

Installation


  • Command: pip install beautifulsoup4

Parser


  • The main parsers, along with their advantages and disadvantages: html.parser (built in, no extra install, moderate speed), lxml (very fast and lenient, requires the external C library), and html5lib (parses pages the same way a browser does, but is the slowest).

Installation command:

  • pip install lxml
  • pip install html5lib

requests


  • requests is implemented on top of urllib; it automatically unpacks (gzip-compressed, etc.) web content for us
  • Installation command: pip install requests
  • Recommended: use response.content.decode() to get the HTML of the response
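The bullets above can be sketched as a small helper. The weather site scraped later in this post serves GBK-encoded pages, so the usage example passes "gbk" explicitly; the function names and the timeout value are illustrative, not part of the requests API:

```python
import requests


def decode_body(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode the raw response bytes with the page's real charset.

    requests has already unpacked gzip/deflate compression, so the
    bytes in response.content are the plain HTML.
    """
    return raw.decode(encoding)


def fetch_html(url: str, encoding: str = "utf-8") -> str:
    # response.text guesses the encoding from the HTTP headers, which is
    # often wrong for Chinese pages; decoding response.content ourselves
    # is the approach recommended above.
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return decode_body(response.content, encoding)
```

Usage: `html = fetch_html("http://www.tianqihoubao.com/lishi/hefei/month/201812.html", "gbk")`.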

pandas


  • Installation command: pip install pandas
  • pandas is a tool built on top of NumPy, created to make data-analysis tasks easier.

Data structures:

  • Series: a one-dimensional array, similar to a one-dimensional NumPy array and also very similar to the basic Python list. The difference: a list's elements may have different data types, while a Series only stores elements of a single data type, which uses memory more efficiently and improves performance.
  • Time Series: a Series indexed by time.
  • DataFrame: a two-dimensional tabular data structure, with many features similar to R's data.frame. A DataFrame can be understood as a container of Series. The content below is mainly DataFrame-based.
  • Panel: a three-dimensional array, which can be understood as a container of DataFrames.
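A minimal sketch of the first three structures (the index labels and values are made up for illustration):

```python
import pandas as pd

# A Series holds values of a single dtype, unlike a plain Python list
s = pd.Series([21, 18, 25], index=["d1", "d2", "d3"])

# A "time series" is simply a Series with a DatetimeIndex
ts = pd.Series([1.0, 2.0, 3.0], index=pd.date_range("2018-12-01", periods=3))

# A DataFrame is a 2-D table; each of its columns is a Series
df = pd.DataFrame({"temp": s, "wind": [3, 4, 2]})
```

Each column of `df` can be pulled back out as a Series with `df["temp"]`, which is why a DataFrame is described above as a container of Series.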

Usage


Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All objects fall into four kinds:

  • Tag 
  • NavigableString 
  • BeautifulSoup 
  • Comment 

Tag: a Tag object corresponds to a tag in the native XML or HTML document. A tag's most important attributes are its name and its attributes.

 

You can get a specified tag, or a tag's attribute values, from a web page in the following ways:

  • Get the tag name: tag.name; the tag's type is <class 'bs4.element.Tag'>
  • Get all attributes: tag.attrs
  • Get a single attribute: tag.get('attribute name') or tag['attribute name']
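A minimal, self-contained sketch of these accessors (the HTML snippet is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<a class="link" href="http://example.com">demo</a>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.a

print(type(tag))        # <class 'bs4.element.Tag'>
print(tag.name)         # a
# class is a multi-valued attribute, so bs4 stores it as a list
print(tag.attrs)        # {'class': ['link'], 'href': 'http://example.com'}
print(tag.get("href"))  # http://example.com
print(tag["href"])      # same value, dictionary-style access
```

`tag.get()` returns None for a missing attribute, while `tag[...]` raises KeyError, mirroring dict behavior.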

Tag features

  • stripped_strings: the strings in a tag may contain extra whitespace or blank lines; use .stripped_strings to remove the excess whitespace
  • Pretty-print the page: soup.prettify()
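Both features in one small sketch (the markup is made up for illustration):

```python
from bs4 import BeautifulSoup

html = "<div>\n  <p> hello </p>\n  <p> world </p>\n</div>"
soup = BeautifulSoup(html, "html.parser")

# .strings keeps the blank lines and padding from the source markup;
# .stripped_strings drops whitespace-only strings and trims the rest
print(list(soup.stripped_strings))  # ['hello', 'world']

# prettify() re-indents the parse tree, one tag per line
print(soup.prettify())
```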

Finding elements:

  • find_all(class_="class") returns multiple tags
  • find(class_="class") returns one tag
  • select_one() returns one tag
  • select() returns multiple tags
  • soup = BeautifulSoup(backdata, 'html.parser')  # convert to a BeautifulSoup object
  • soup.find_all('tag name', attrs={'attribute name': 'attribute value'})  # returns a list
  • limit controls how many results find_all returns
  • recursive=False returns only the tag's direct children
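The search methods above, exercised on a small made-up table similar in shape to the weather page scraped in the demo below:

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr class="head"><td>date</td></tr>
  <tr class="row"><td>2018-12-01</td></tr>
  <tr class="row"><td>2018-12-02</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = soup.find_all("tr", attrs={"class": "row"})  # list of matching tags
first = soup.find("tr", class_="row")               # first match only
limited = soup.find_all("tr", limit=2)              # stop after 2 results
css = soup.select("tr.row td")                      # CSS selector, all matches
one = soup.select_one("tr.head td")                 # CSS selector, first match
```

With `soup.table.find_all("tr", recursive=False)` the search would look only at the table's direct children rather than the whole subtree.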

Demo


 

import sys
import io
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import numpy as np
from py_teldcore import sqlserver_db as db

# Re-wrap stdout so Chinese text prints correctly on a GBK console
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')

url = "http://www.tianqihoubao.com/lishi/hefei/month/201812.html"


def get_soup():
    try:
        r = requests.get(url)
        return bs(r.text, "lxml")
    except Exception as e:
        print(e)
        return None


def save2csv(data, path):
    result_weather = pd.DataFrame(data, columns=['date', 'tq', 'temp', 'wind'])
    result_weather.to_csv(path, encoding='gbk')
    print('save weather success')


def save2mssql(data):
    sql = "Insert into Weather(date, tq, temp, wind) values(%s, %s, %s, %s)"
    # executemany expects a list of tuples, one tuple per row
    sqlvalues = [tuple(row) for row in data.tolist()]
    try:
        db.exec_sqlmany(sql, sqlvalues)
    except Exception as e:
        print(e)


def get_data():
    soup = get_soup()
    if soup is None:
        return np.empty((0, 4))
    # Each <tr> after the header row holds date, weather, temperature and wind cells
    all_weather = soup.find("div", class_="wdetail").find("table").find_all("tr")
    data = []
    for tr in all_weather[1:]:
        for td in tr.find_all("td"):
            # Collapse the whitespace inside each cell
            data.append("".join(td.get_text().split()))
    return np.array(data).reshape(-1, 4)


if __name__ == "__main__":
    data = get_data()
    save2mssql(data)
    print("save2 Sqlserver ok!")

  

 


Origin www.cnblogs.com/tgzhu/p/11385068.html