Data collection: how to automate data collection
A trend in the data is influenced by multiple dimensions. Collect as many data dimensions as possible while ensuring data quality, so that the data mining results are of high quality.
Data sources fall into four categories: open data sources (government, enterprises, universities), crawlers (web pages, apps), log collection (client-side collection, server-side scripts), and sensors (image capture, speed measurement, heat sensing).
How to use open data sources
Open data sources can be considered from two dimensions. One is the organization dimension, such as government, enterprises, and universities; the other is the industry dimension, such as transportation, finance, and energy. If you are looking for a data source in a particular field, such as finance, you can search directly for open financial data sources.
How to use crawlers
Writing a crawler in Python:
- Use Requests to crawl content: Requests is an HTTP library for Python; use it to fetch page data.
- Use XPath to parse the content: XPath stands for XML Path Language, a language for locating parts of an XML document; it can index elements by position and by attributes.
- Use Pandas to save the data: collect the crawled data with Pandas, then write it to an XLS file or to a MySQL database.
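The three steps above can be sketched as follows. The HTML structure and XPath expressions are illustrative assumptions, not taken from the original text; the fetch/parse/save pipeline is the point.

```python
# Sketch of the Requests -> XPath -> Pandas pipeline described above.
# The XPath expressions ("//h2/a/...") assume a hypothetical page layout.
import requests
import pandas as pd
from lxml import etree


def parse_page(html: str) -> pd.DataFrame:
    # Step 2: parse the HTML and locate elements by XPath
    tree = etree.HTML(html)
    titles = tree.xpath("//h2/a/text()")  # hypothetical element paths
    links = tree.xpath("//h2/a/@href")
    # Step 3: collect the results into a DataFrame
    return pd.DataFrame({"title": titles, "link": links})


def crawl(url: str) -> pd.DataFrame:
    # Step 1: fetch the page with the Requests library
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return parse_page(resp.text)


if __name__ == "__main__":
    df = crawl("https://example.com")
    df.to_excel("result.xls")  # save to XLS
    # or write to MySQL, e.g. via SQLAlchemy: df.to_sql("result", engine)
```

Separating `parse_page` from `crawl` keeps the XPath logic testable without network access.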
Of course, you can also crawl page information without programming, using visual collection tools such as Locoy (火车头采集器), Octoparse (八爪鱼), and GooSeeker (集搜客).
Sensor collection is basically based on specific devices: you simply gather the information the devices collect.
How to use log collection tools
Why collect logs? By analyzing user visits, you can improve system performance and thereby increase the system's load capacity.
A log records the whole process of a user's visit to the website: who, at what time, through what channel (search engine, direct URL entry), performed what operations, and whether the system produced errors. The data can be written to one file or split into different log files, such as access logs and error logs.
Log collection is divided into two forms:
- Collection by the web server: httpd, Nginx, and Tomcat come with their own logging, and many Internet companies have their own massive-data collection tools for system log collection.
- Custom collection of user behavior, for example monitoring user actions on a page with JavaScript code.
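For the first form, server-side collection, a minimal sketch is parsing one line of a web server access log. The regular expression below assumes the common "combined"-style format used by httpd and Nginx; the sample line is made up for illustration.

```python
# Sketch: parse one access-log line (who, when, what action, what status),
# assuming a common httpd/Nginx access-log layout.
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '   # client IP and timestamp
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '         # request method and path
    r'(?P<status>\d{3}) (?P<size>\d+)'              # response status and size
)


def parse_access_log(line: str) -> dict:
    """Return the log fields as a dict, or {} if the line does not match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else {}
```

Aggregating these parsed records (e.g. status codes per path) is the kind of access analysis the text describes for improving system performance.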
What is event tracking ("buried points")
Event tracking means reporting the necessary information at the relevant positions. For example, on a page visit this includes user information, device information, the user's operations on the page, and dwell time. Each tracking point acts like a camera that collects user behavior data; multi-dimensional analysis of these data can reconstruct the user's real usage scenarios and needs.
To implement tracking, plant statistical code wherever you need statistics.
How to implement event tracking: https://blog.csdn.net/feishangbeijixing/article/details/86445704
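A minimal sketch of what one tracking report might contain, following the description above (user info, device info, the action, dwell time). The field names and the `report` function are illustrative assumptions, not a specific tracking SDK.

```python
# Sketch: build one tracking ("buried point") event and serialize it,
# as one line of an event log. Field names are hypothetical.
import json
import time


def build_event(user_id: str, device: str, page: str, action: str,
                duration_ms: int) -> dict:
    return {
        "user_id": user_id,          # who
        "device": device,            # device information
        "page": page,                # where the point is buried
        "action": action,            # what the user did
        "duration_ms": duration_ms,  # dwell time on the page
        "ts": int(time.time()),      # when the event happened
    }


def report(event: dict) -> str:
    # In practice this would be sent to a collection endpoint;
    # here we just serialize it as one JSON line.
    return json.dumps(event, ensure_ascii=False)
```

Each such event is one "camera frame"; aggregating many of them across users and dimensions is what enables the multi-dimensional analysis the text mentions.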