Data collection: how to automate data collection

A trend in the data is influenced by multiple dimensions, so collect as many data dimensions as possible while ensuring data quality, in order to obtain high-quality data mining results

Data sources fall into four categories: open data sources (government, enterprises, universities), crawler scraping (web pages, apps), log collection (front-end capture, back-end scripts), and sensors (image, velocity, thermal)

How to use open data sources

Open data sources can be considered along two dimensions. One is the unit dimension, such as government, enterprises, and universities; the other is the industry dimension, such as transportation, finance, and energy. If you are looking for a data source in a particular field, such as finance, you can search directly for open financial data sources
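For example, many open-data portals publish datasets as CSV files that Pandas can load directly over HTTP. A minimal sketch, assuming a hypothetical portal URL:

```python
# A minimal sketch of consuming an open data source; the URL below is a
# hypothetical placeholder for a dataset published by an open-data portal.
import pandas as pd

url = "https://data.example.gov/finance/daily_rates.csv"  # hypothetical dataset URL

df = pd.read_csv(url)   # pandas can read a CSV file directly over HTTP
print(df.head())        # inspect the first rows before any mining step
```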

How to use web crawlers

Writing a crawler in Python:

  • Use Requests to fetch content: Requests is a Python HTTP library, and you fetch page data through it
  • Use XPath to parse content: XPath stands for XML Path Language, a language for locating parts of an XML document; it can index elements and attributes by position
  • Use Pandas to save data: store the crawled data with Pandas, then write it out to XLS files or to a MySQL database (see the sketch after this list)
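Putting the three steps together, here is a minimal crawler sketch. The target URL and the XPath expressions are hypothetical placeholders, and the XPath parsing uses the lxml library:

```python
# A minimal fetch -> parse -> save pipeline, assuming a hypothetical page layout.
import requests
import pandas as pd
from lxml import etree

url = "https://example.com/articles"          # hypothetical page to crawl
resp = requests.get(url, timeout=10)          # step 1: fetch the page with Requests
resp.raise_for_status()

html = etree.HTML(resp.text)                  # step 2: parse the HTML for XPath queries
titles = html.xpath("//h2[@class='title']/a/text()")   # hypothetical XPath expressions
links = html.xpath("//h2[@class='title']/a/@href")

df = pd.DataFrame({"title": titles, "link": links})    # step 3: collect with Pandas
df.to_excel("articles.xlsx", index=False)     # save to a spreadsheet; DataFrame.to_sql
                                              # could write to MySQL instead
```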

Of course, you can also scrape page data without programming, using tools such as 火车采集器 (LocoySpider), 八爪鱼 (Octoparse), and 集搜客 (GooSeeker)

How to use log collection tools

Sensor collection is based on specific devices: the information captured by the device is gathered in

Why collect logs? By analyzing user visits, you can improve system performance and thereby raise the system's carrying capacity

Logs record the whole process of a user's visit to the website: who, at what time, through what channel (search engine, direct URL entry), performed what actions, and whether the system produced errors. The data can be written to one file or split into different log files, such as access logs and error logs, as sketched below
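As an illustration, here is a minimal Python sketch that parses one line of an access log in the common "combined" format used by httpd and Nginx; the sample log line is made up:

```python
# Parse one access-log line in the combined log format; the line is a made-up sample.
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('203.0.113.7 - - [10/Mar/2020:13:55:36 +0800] '
        '"GET /index.html HTTP/1.1" 200 2326 '
        '"https://www.google.com/" "Mozilla/5.0"')

m = LOG_PATTERN.match(line)
if m:
    # who (ip), when (time), through what channel (referrer), what action (path)
    print(m.group("ip"), m.group("time"), m.group("referrer"), m.group("path"))
```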

Log collection is divided into two forms:

  • Collection through the web server: httpd, Nginx, and Tomcat all come with built-in logging, and many internet companies also have their own massive-scale data collection tools for system log collection
  • Custom collection of user behavior, for example monitoring user behavior with embedded JavaScript code

What is event tracking (buried points)

Event tracking means reporting the required information at the corresponding positions. For example, when a user visits a page, the tracking point reports user information, device information, and the user's actions on the page, including how long they stayed. Each tracking point is like a camera collecting user behavior data; multi-dimensional analysis of that data can reconstruct the user's real usage scenarios and needs

To implement event tracking, embed statistics code wherever you need statistics, as in the sketch below
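As an illustration (not the article's own code), here is a minimal Python sketch of a server-side tracking-point report; the collection endpoint and payload fields are hypothetical:

```python
# A minimal tracking-point report; endpoint and field names are hypothetical.
import time
import requests

COLLECT_URL = "https://log.example.com/collect"  # hypothetical collection endpoint

def report_event(user_id, page, action, duration_ms):
    """Send one tracking-point event to the collection service."""
    payload = {
        "user_id": user_id,             # who
        "page": page,                   # where the tracking point fires
        "action": action,               # what the user did
        "duration_ms": duration_ms,     # how long they stayed
        "ts": int(time.time() * 1000),  # when, in epoch milliseconds
    }
    # Fire-and-forget report; a production system would batch and retry
    requests.post(COLLECT_URL, json=payload, timeout=3)

report_event("u_42", "/product/123", "click_buy", 5300)
```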

For more on how to implement event tracking, see: https://blog.csdn.net/feishangbeijixing/article/details/86445704
