Python Crawler (7): Unstructured Data and Structured Data

Page parsing and data extraction

In fact, a crawler comes down to five main steps:

  1. Target (decide which scope or website you are going to search)
  2. Crawl (download all of the site's content)
  3. Extract (parse the data and discard what is useless to us)
  4. Store (save the data in the way we want to store and use it)
  5. Present (display the data with charts suited to its type)
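The steps above can be sketched end to end as follows. This is only a minimal illustration: the URL, the canned HTML returned by `fetch`, and all function names are assumptions made for this example, and nothing is actually downloaded.

```python
import csv
import io
import re
from collections import Counter

# Step 1 (target): a hypothetical start URL; nothing is downloaded in this sketch.
START_URL = "https://example.com/articles"


def fetch(url):
    """Step 2 (crawl): stand-in for a real HTTP request, returns canned HTML."""
    return (
        "<ul>"
        '<li class="post" data-tag="python">Crawler basics</li>'
        '<li class="post" data-tag="python">Parsing HTML</li>'
        '<li class="post" data-tag="json">Reading JSON</li>'
        '<li class="ad">Buy now</li>'
        "</ul>"
    )


def extract(html):
    """Step 3 (extract): keep only post entries and drop everything else."""
    pattern = r'<li class="post" data-tag="(\w+)">(.*?)</li>'
    return [{"tag": tag, "title": title} for tag, title in re.findall(pattern, html)]


def store(rows):
    """Step 4 (store): serialize rows as CSV text (a file works the same way)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["tag", "title"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def summarize(rows):
    """Step 5 (present): count posts per tag, ready to feed into a chart."""
    return Counter(row["tag"] for row in rows)


rows = extract(fetch(START_URL))
csv_text = store(rows)
tag_counts = summarize(rows)
print(csv_text)
print(tag_counts)
```

In a real crawler, `fetch` would issue an HTTP request and `summarize` would hand its counts to a plotting library; both are left out here to keep the sketch self-contained.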

The previous posts covered how to crawl data from a site, but not how to parse the data that was crawled. Now it is time to start parsing it.

Data can be divided into unstructured data and structured data.

  • Unstructured data: the data exists first, and the structure comes afterwards
  • Structured data: the structure exists first, and the data comes afterwards
  • Different types of data call for different processing methods
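To make the distinction concrete, here is a small sketch that reads the same two values from a free-text sentence and from a JSON string. The sample sentence, the JSON record, and the variable names are invented for illustration.

```python
import json
import re

# Unstructured: the data exists first, and we impose structure with a pattern.
free_text = "The book costs 42 yuan and 7 copies are left."
match = re.search(r"costs (\d+) yuan and (\d+) copies", free_text)
price_from_text = int(match.group(1))
stock_from_text = int(match.group(2))

# Structured: the keys exist first, and values are simply looked up by name.
json_text = '{"title": "The book", "price": 42, "stock": 7}'
record = json.loads(json_text)

print(price_from_text, stock_from_text)
print(record["price"], record["stock"])
```

The regular expression breaks as soon as the wording of the sentence changes, while the JSON lookup only depends on the key names. That is why the two kinds of data are handled with different tools.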

Unstructured data processing

Text, phone numbers, e-mail addresses
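One common way to pull phone numbers and e-mail addresses out of plain text is a regular expression. The sketch below uses Python's standard `re` module; the sample text and both patterns are simplified assumptions and will not cover every real-world format.

```python
import re

text = """
Sales: sales@example.com, phone 138-0013-8000
Support: support@example.org, phone 139-1234-5678
Fax is no longer supported.
"""

# Simplified patterns for illustration; real formats vary far more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
PHONE_RE = re.compile(r"\b\d{3}-\d{4}-\d{4}\b")

# findall returns every non-overlapping match in order of appearance.
emails = EMAIL_RE.findall(text)
phones = PHONE_RE.findall(text)

print(emails)
print(phones)
```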

HTML file

  • Regular Expressions
  • XPath
  • CSS selectors
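As a rough illustration of path-based extraction, the sketch below uses the limited XPath subset in the standard library's `xml.etree.ElementTree`. It assumes the markup is well-formed, which real HTML often is not; full XPath and CSS selector support usually comes from third-party libraries such as lxml or Beautiful Soup, which are not used here. The sample markup is invented.

```python
import xml.etree.ElementTree as ET

# A small, well-formed snippet; real-world HTML is often not valid XML.
html = """
<html>
  <body>
    <div class="post"><a href="/p/1">First post</a></div>
    <div class="post"><a href="/p/2">Second post</a></div>
    <div class="sidebar"><a href="/about">About</a></div>
  </body>
</html>
"""

root = ET.fromstring(html)

# XPath subset: every <a> directly inside a <div> whose class is "post".
links = root.findall(".//div[@class='post']/a")
posts = [(a.get("href"), a.text) for a in links]

print(posts)
```

The equivalent CSS selector would be `div.post > a`. Note that the `@class='post'` predicate compares the whole attribute value, so an element with several classes would need a different test.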

Structured data processing

JSON file

  • JSON Path
  • Conversion to Python types for processing (json module)
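Both approaches can be sketched with the standard `json` module. The sample document is invented, and `get_path` is a hypothetical helper that walks a dotted path; it only stands in for the idea of a path query and does not implement JSONPath syntax.

```python
import json

json_text = """
{
  "store": {
    "books": [
      {"title": "Python Basics", "price": 30},
      {"title": "Web Crawling", "price": 45}
    ]
  }
}
"""

# json.loads turns JSON text into dicts, lists, strings, and numbers.
data = json.loads(json_text)


def get_path(obj, path):
    """Hypothetical helper: walk a dotted path such as 'store.books.1.title'."""
    for key in path.split("."):
        # List items are addressed by integer position, dict items by key.
        obj = obj[int(key)] if isinstance(obj, list) else obj[key]
    return obj


second_title = get_path(data, "store.books.1.title")
all_prices = [book["price"] for book in get_path(data, "store.books")]

# json.dumps goes the other way: Python objects back to JSON text.
round_trip = json.dumps(data["store"]["books"][0], sort_keys=True)

print(second_title, all_prices)
print(round_trip)
```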

XML file

  • Conversion to Python types (xmltodict)
  • XPath
  • CSS selectors
  • Regular Expressions
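The idea of converting XML into Python types can be sketched with the standard library alone. The `element_to_dict` helper below is a hypothetical stand-in written for this example, not the xmltodict API. The `@` prefix for attributes and the `#text` key are choices made for this sketch, and the sample XML is invented.

```python
import xml.etree.ElementTree as ET

xml_text = """
<library>
  <book id="1"><title>Python Basics</title><price>30</price></book>
  <book id="2"><title>Web Crawling</title><price>45</price></book>
</library>
"""


def element_to_dict(element):
    """Hypothetical helper: turn an element into nested dicts, lists, and strings."""
    children = list(element)
    text = (element.text or "").strip()
    if not children and not element.attrib:
        # A plain leaf element becomes just its text.
        return text
    result = {"@" + name: value for name, value in element.attrib.items()}
    if text:
        result["#text"] = text
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # Repeated tags are collected into a list.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


root = ET.fromstring(xml_text)
library = {root.tag: element_to_dict(root)}
titles = [book["title"] for book in library["library"]["book"]]

print(titles)
```

All values come back as strings, so numeric fields such as `price` still need an explicit `int()` or `float()` conversion after parsing.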

Origin www.cnblogs.com/moying-wq/p/11569914.html