Page parsing and extracting data
A crawler really comes down to five main steps:
- Define (know the scope or the website you are going to search)
- Crawl (download the entire content of the site)
- Extract (analyze the data and discard what is useless to us)
- Store (save and use the data in the way we want)
- Present (display the data with charts suited to its type)
So far we have learned how to crawl data from a site, but we have not yet analyzed the data we crawled. Now we start on the analysis.
Data can be divided into unstructured data and structured data.
- Unstructured data: the data comes first, and the structure is worked out afterwards
- Structured data: the structure comes first, and the data is filled in afterwards
- Different types of data need to be handled in different ways
Unstructured data processing
Text, phone numbers, e-mail addresses
- Regular expressions (Python regular expressions)
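As a rough illustration of the regular-expression approach, the sketch below pulls e-mail addresses and phone numbers out of a piece of free text with Python's built-in re module. The sample text and both patterns are assumptions made up for this example; the patterns are deliberately simplified and are not complete validators.

```python
import re

# Sample text invented for this example.
text = ("Contact Li at li.wei@example.com or 13812345678; "
        "sales: sales@example.org, 010-62345678.")

# Simplified e-mail pattern: local part, "@", domain, dot, top-level domain.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Simplified phone pattern: an 11-digit mobile number starting with 1,
# or a landline written as area code, hyphen, 7 or 8 digits.
PHONE_RE = re.compile(r"\b1\d{10}\b|\b0\d{2,3}-\d{7,8}\b")

emails = EMAIL_RE.findall(text)
phones = PHONE_RE.findall(text)

print(emails)
print(phones)
```

Compiling the patterns once with re.compile keeps them reusable when the same extraction runs over many pages.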
HTML file
- Regular Expressions
- XPath
- CSS selectors
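Of the three options for HTML, regular expressions are the only one that needs nothing beyond the standard library, so the sketch below uses them. The HTML fragment, the class name "job" and the variable names are all hypothetical. A pattern like this depends on the exact attribute order in the markup, which is one reason XPath or CSS selectors (usually through a third-party parser, not shown here) tend to be more robust on real pages.

```python
import re

# Hypothetical page fragment used only for this example.
html = """
<html><head><title>Job list</title></head>
<body>
  <a class="job" href="/jobs/1">Python developer</a>
  <a class="job" href="/jobs/2">Data engineer</a>
  <a href="/about">About us</a>
</body></html>
"""

# Non-greedy group so the match stops at the first closing tag.
title = re.search(r"<title>(.*?)</title>", html, re.S).group(1)

# Capture the href and the link text of anchors whose class is "job".
JOB_RE = re.compile(r'<a class="job" href="([^"]+)">(.*?)</a>', re.S)
jobs = JOB_RE.findall(html)

print(title)
for href, text in jobs:
    print(href, text)
```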
Structured data processing
JSON file
- JSONPath
- Conversion to Python types for processing (the json module)
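The conversion route can be sketched with the standard json module: json.loads turns a JSON string into ordinary Python dicts and lists, which can then be handled with normal Python code. The JSON string and its field names below are invented for illustration.

```python
import json

# Invented JSON payload standing in for an API response.
raw = (
    '{"city": "Beijing", "jobs": ['
    '{"title": "Python developer", "salary": 20000}, '
    '{"title": "Data engineer", "salary": 25000}]}'
)

# JSON object -> dict, JSON array -> list, JSON number -> int/float.
data = json.loads(raw)

# Once converted, plain indexing and comprehensions do the extraction.
titles = [job["title"] for job in data["jobs"]]
best_paid = max(data["jobs"], key=lambda job: job["salary"])

# json.dumps goes the other way, from Python types back to a JSON string.
back = json.dumps(data, ensure_ascii=False)

print(titles)
print(best_paid["title"])
```

Passing ensure_ascii=False keeps non-ASCII characters such as Chinese text readable in the output instead of escaping them.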
XML file
- Conversion to Python types (xmltodict)
- XPath
- CSS selectors
- Regular Expressions
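For XML, a minimal sketch of the XPath route is possible with the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath. The document, tag names and attributes below are made up for the example. The final step builds plain dicts by hand; it is only a stand-in for the idea of converting XML to Python types and does not reproduce what xmltodict returns.

```python
import xml.etree.ElementTree as ET

# Made-up XML document for this example.
xml_text = """
<bookstore>
  <book lang="en"><title>Learning XML</title><price>39.95</price></book>
  <book lang="zh"><title>Python Crawler Notes</title><price>29.00</price></book>
</bookstore>
""".strip()

root = ET.fromstring(xml_text)

# Path expression: every <title> that is a child of a <book> under the root.
titles = [t.text for t in root.findall("./book/title")]

# Attribute predicate: only books whose lang attribute equals "en".
en_titles = [b.findtext("title") for b in root.findall(".//book[@lang='en']")]

# Hand-rolled conversion of each record into a plain Python dict.
books = [
    {
        "lang": b.get("lang"),
        "title": b.findtext("title"),
        "price": float(b.findtext("price")),
    }
    for b in root.findall("book")
]

print(titles)
print(en_titles)
print(books)
```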