Big Data: Data Collection

A big data system is generally divided into several layers: data collection, data computation, data services, and data applications.

At the data collection layer, the work mainly falls into two categories: log collection and data source synchronization.

Log collection

Depending on the type of product, log collection can be divided into:

  • Browser page log collection
  • App client log collection

Browser page log collection:

Mainly collects page view logs (PV/UV, etc.) and interaction logs (operation events).

These logs are generally collected by embedding standard statistics JavaScript code in the page. The code can be added manually by developers during page development, or injected dynamically by the server when the corresponding page is requested at runtime.

The statistics JS can send data to the data center immediately after collection, or aggregate it appropriately before sending; which strategy to use depends on the needs of the scenario.
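The "aggregate before sending" strategy can be sketched as a small client-side buffer that batches events and flushes them in one request. This is only an illustrative sketch; the class and field names (`EventBuffer`, `track`, `flush_size`) are assumptions, not a real analytics API.

```python
import json
import time

class EventBuffer:
    """Hypothetical client-side buffer that aggregates log events
    before sending them to the data center in batches."""

    def __init__(self, flush_size=3, sender=None):
        self.flush_size = flush_size            # flush once this many events accumulate
        self.sender = sender or (lambda batch: None)  # stand-in for a network call
        self.events = []
        self.sent_batches = []

    def track(self, event_type, payload):
        # Record the event with a timestamp; flush when the buffer is full.
        self.events.append({"type": event_type, "ts": time.time(), **payload})
        if len(self.events) >= self.flush_size:
            self.flush()

    def flush(self):
        if not self.events:
            return
        batch = json.dumps(self.events)         # one request per batch, not per event
        self.sender(batch)
        self.sent_batches.append(batch)
        self.events = []

buf = EventBuffer(flush_size=2)
buf.track("pv", {"page": "/home"})
buf.track("click", {"page": "/home", "target": "buy-button"})
# Reaching the flush threshold triggers a single batched send.
```

Immediate sending would simply use `flush_size=1`; the trade-off is request volume versus data freshness.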

After page logs are collected, they need to be cleaned and pre-processed on the server side: filtering fake traffic, identifying attacks, filling in missing fields, eliminating invalid records, formatting the data, isolating data, and so on.
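A minimal sketch of such server-side pre-processing might look like the following. The field names and filtering rules here are illustrative assumptions, not a production cleaning pipeline.

```python
def clean_logs(raw_logs, valid_pages):
    """Sketch of server-side log cleaning: drop invalid records,
    filter suspect traffic, de-duplicate, and normalize fields."""
    cleaned = []
    seen = set()
    for rec in raw_logs:
        # Eliminate invalid data: records missing required fields.
        if not rec.get("user_id") or not rec.get("page"):
            continue
        # Crude fake-traffic rule: drop hits on pages outside the known set.
        if rec["page"] not in valid_pages:
            continue
        # De-duplicate identical (user, page, timestamp) records.
        key = (rec["user_id"], rec["page"], rec.get("ts"))
        if key in seen:
            continue
        seen.add(key)
        # Data formatting: normalize page paths to lowercase.
        cleaned.append({**rec, "page": rec["page"].lower()})
    return cleaned

logs = [
    {"user_id": "u1", "page": "/Home", "ts": 1},
    {"user_id": "u1", "page": "/Home", "ts": 1},   # duplicate
    {"user_id": "",   "page": "/home", "ts": 2},   # missing user id
    {"user_id": "u2", "page": "/evil", "ts": 3},   # unknown page
]
result = clean_logs(logs, valid_pages={"/Home", "/home"})
```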

App client log collection:

For app clients, data is generally collected through a dedicated statistics SDK, a practice also known as "event tracking" (burying points).

Client data collection is highly business-specific and requires heavy customization. Therefore, besides some basic data about the application environment, data is collected from an event perspective: click events, login events, business operation events, and so on.

Basic data is collected by the SDK by default; other events are defined by the business side, which then calls the SDK interface according to the specification.
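The split between default base data and business-defined events can be sketched as follows. The class name `TrackingSDK` and its fields are illustrative assumptions; real SDKs collect far richer environment data.

```python
import platform

class TrackingSDK:
    """Hypothetical sketch of a client statistics SDK: basic environment
    data is attached by default; business events call track() per spec."""

    def __init__(self, app_id, device_id):
        # Basic application-environment data collected by default.
        self.base = {
            "app_id": app_id,
            "device_id": device_id,
            "os": platform.system(),
        }
        self.queue = []

    def track(self, event_name, properties=None):
        # The business side defines the event, then calls the SDK interface;
        # every event carries the default base data.
        self.queue.append({**self.base,
                           "event": event_name,
                           "props": properties or {}})

sdk = TrackingSDK(app_id="demo-app", device_id="device-123")
sdk.track("login", {"method": "password"})
sdk.track("click", {"target": "buy-button"})
```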

Because more and more apps now adopt a Hybrid architecture (a combination of H5 and Native), log collection involves not only H5 page logs but also Native client logs. In this case, the two kinds of data can be collected and sent separately, or merged and sent together.

Under normal circumstances, it is recommended to merge the H5 data into Native and send everything uniformly through the SDK. This has two advantages: the collected user behavior data remains a complete behavior chain, and the SDK can apply compression schemes to reduce log volume and improve efficiency.
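The merge-then-compress idea can be illustrated with a short sketch: H5 and Native events are merged and ordered by timestamp so the behavior chain stays intact, then the batch is gzip-compressed before sending. The record fields here are assumptions for illustration.

```python
import gzip
import json

def merge_and_compress(h5_logs, native_logs):
    """Sketch: merge H5 logs into the Native side, order by timestamp to
    keep the behavior chain complete, then compress the batch."""
    merged = sorted(h5_logs + native_logs, key=lambda e: e["ts"])
    payload = json.dumps(merged).encode("utf-8")
    return gzip.compress(payload)           # smaller uplink payload

h5 = [{"ts": 2, "src": "h5", "event": "pv"}]
native = [{"ts": 1, "src": "native", "event": "launch"},
          {"ts": 3, "src": "native", "event": "click"}]
blob = merge_and_compress(h5, native)

# The server side decompresses and recovers the ordered event stream.
restored = json.loads(gzip.decompress(blob))
```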

Another important element of app data collection is the unique ID: all data must be associated with a unique ID before it can be analyzed meaningfully. The unique ID of a mobile device was discussed in detail in a previous article.
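As a hedged illustration of the principle (not the device-ID scheme from the earlier article, which involves platform identifiers), an SDK can generate an ID once, cache it, and attach the same value to every event:

```python
import uuid

class DeviceId:
    """Illustrative sketch only: generate a unique ID once and reuse it so
    every collected event is associated with the same ID. Real SDKs derive
    device IDs from platform identifiers (IDFA/OAID, etc.)."""

    _cached = None

    @classmethod
    def get(cls):
        if cls._cached is None:
            cls._cached = str(uuid.uuid4())   # generated once, then reused
        return cls._cached

event_a = {"event": "pv", "device_id": DeviceId.get()}
event_b = {"event": "click", "device_id": DeviceId.get()}
# Both events carry the same ID, so they can be joined in analysis.
```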

Another important principle of log collection is standardization. Only when the collection method is standardized and well-specified can collection cost be minimized, log collection efficiency improved, and subsequent statistical computation carried out more efficiently.

Data source data synchronization

Depending on the synchronization method, it can be divided into:

  • Direct data source synchronization
  • Data file synchronization
  • Database log synchronization

Direct data source synchronization:
refers to connecting directly to the business database and reading the target data through a standardized interface (such as JDBC). This method is relatively easy to implement, but if the business volume is large, it may affect the performance of the source database.

Data file synchronization:
refers to generating data files from the source system and transferring them through a file system to be loaded into the target database.
This method suits scenarios where data sources are relatively dispersed. Verification must be done before and after the file transfer, and the files should be compressed and encrypted to improve efficiency and ensure security.

Database log synchronization:
refers to synchronization based on the log files of the source database. Most databases can generate change log files and support using them to restore data, so these log files can also be used for incremental synchronization.
This method has little impact on source system performance and offers high synchronization efficiency.
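Conceptually, the target side replays the source's change log entry by entry. The sketch below models this with a dict replica and a simplified insert/update/delete log format (real change logs such as MySQL binlog are binary and richer; this is only the idea).

```python
def apply_change_log(replica, change_log):
    """Sketch of log-based incremental sync: replay insert/update/delete
    entries from the source database's change log onto a replica."""
    for entry in change_log:
        op, key = entry["op"], entry["key"]
        if op in ("insert", "update"):
            replica[key] = entry["value"]
        elif op == "delete":
            replica.pop(key, None)       # tolerate deletes of missing keys
    return replica

log_entries = [
    {"op": "insert", "key": 1, "value": "alice"},
    {"op": "insert", "key": 2, "value": "bob"},
    {"op": "update", "key": 1, "value": "alice2"},
    {"op": "delete", "key": 2},
]
state = apply_change_log({}, log_entries)   # replica converges to source state
```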

Data collection is not a goal in itself. What matters is collecting data that is available, usable, and able to serve the final applications and analysis.

Origin blog.csdn.net/weixin_47580822/article/details/113826525