Data service operations: big data collection

The overall data collection process includes data integration, data import, and data normalization.

First, the data collection process needs to integrate data from different sources. The data integration architecture has to consider storage, collection methods, interface modes, the collection cycle, and so on.

In terms of storage architecture, the data staging area (Staging Area) can be placed on the data source side or on the collection platform side. The staging area should be sized reasonably according to how quickly data accumulates, so that it does not overflow.
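As a rough illustration of the sizing consideration, here is a minimal sketch; the ingest rate, collection cycle, and safety factor are hypothetical numbers, not recommendations:

```python
# Minimal sizing sketch: the staging area must hold the data that
# accumulates between two collection runs, plus a safety margin.
# All numbers below are hypothetical examples.

ingest_rate_mb_per_min = 50      # how fast the source produces data
collection_cycle_min = 60        # how often the platform pulls from staging
safety_factor = 1.5              # headroom for bursts and delayed pulls

required_staging_mb = ingest_rate_mb_per_min * collection_cycle_min * safety_factor
print(f"Staging area should hold at least {required_staging_mb:.0f} MB")
# -> Staging area should hold at least 4500 MB
```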

In terms of collection methods, different applications may require different approaches. Collection comes in two forms: single-record collection and batch collection. For small volumes of data with high timeliness requirements, single-record collection can be used so that data is synchronized to the data warehouse as soon as it is generated; operation logs kept for audit purposes, for example, should be collected record by record and synchronized to the data warehouse in real time as they are produced. For data spread across multiple files with relatively low real-time requirements, collection can wait until the files reach a certain size or a certain amount of time has passed, and then push them to the data warehouse in batch.
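To make the distinction concrete, here is a minimal Python sketch; the send_to_warehouse function and the thresholds are placeholders rather than a real warehouse API:

```python
import time

def send_to_warehouse(records):
    """Placeholder for the actual load into the data warehouse."""
    print(f"loaded {len(records)} record(s)")

# Single-record collection: e.g. audit logs, synchronized immediately.
def collect_single(record):
    send_to_warehouse([record])

# Batch collection: buffer records and flush by count or elapsed time.
class BatchCollector:
    def __init__(self, max_records=1000, max_age_seconds=300):
        self.buffer = []
        self.max_records = max_records
        self.max_age_seconds = max_age_seconds
        self.last_flush = time.monotonic()

    def collect(self, record):
        self.buffer.append(record)
        too_many = len(self.buffer) >= self.max_records
        too_old = time.monotonic() - self.last_flush >= self.max_age_seconds
        if too_many or too_old:
            self.flush()

    def flush(self):
        if self.buffer:
            send_to_warehouse(self.buffer)
            self.buffer = []
        self.last_flush = time.monotonic()
```

The batch collector flushes whenever either threshold is crossed, which matches the "certain size or certain period of time" rule described above.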

In terms of interface mode, FTP can be considered for data collected in batch, while a Web Services or API interface can be used for single-record collection.
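A hedged sketch of the two interface styles using Python's standard library; the FTP host, credentials, and API endpoint below are invented placeholders:

```python
import json
from ftplib import FTP
from urllib import request

# Batch interface: push a collected file to the warehouse landing zone over FTP.
def upload_batch_file(local_path, remote_name):
    with FTP("ftp.warehouse.example.com") as ftp:   # hypothetical host
        ftp.login(user="collector", passwd="secret")
        with open(local_path, "rb") as f:
            ftp.storbinary(f"STOR {remote_name}", f)

# Single-record interface: post one event to a Web Services / API endpoint.
def post_single_record(record):
    req = request.Request(
        "https://warehouse.example.com/api/events",  # hypothetical endpoint
        data=json.dumps(record).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return resp.status
```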

In terms of the collection cycle, the shorter the cycle, the fresher the data and the more timely the analysis it supports. A company can set different collection cycles according to application needs, while checking that the data staging area can meet the resulting storage requirements.

In terms of data import, imports fall into three types according to data volume.

The first is scenarios where a large volume of data is imported together with its data definitions, such as definitions that include indexes and partitions. Here a large-file import mode can be used, which preserves the integrity of the source data.

The second is data sources with a simple structure whose import files are large. Such data can be imported as batch files, so that errors produced during the import can be seen and corrected, ensuring the quality of the imported data (a sketch of this appears after the third case below).

The last is small amounts of data in single files, such as certain code tables and configuration files. These can be imported with a packaged import tool, which is relatively simple and flexible.
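For the second case (simple structure, large batch files), here is a minimal sketch of a row-by-row CSV import that keeps a record of failed rows so they can be reviewed and corrected; the file layout and the insert_row target are assumptions:

```python
import csv

def insert_row(row):
    """Placeholder for the actual insert into the target table."""
    if not row.get("id"):
        raise ValueError("missing id")

def import_batch_file(path):
    errors = []
    with open(path, newline="", encoding="utf-8") as f:
        # Header is line 1, so data rows start at line 2.
        for line_no, row in enumerate(csv.DictReader(f), start=2):
            try:
                insert_row(row)
            except Exception as exc:
                # Keep importing, but record the failure for later correction.
                errors.append((line_no, dict(row), str(exc)))
    return errors

# errors = import_batch_file("customers_batch.csv")
# for line_no, row, reason in errors:
#     print(f"line {line_no}: {reason} -> {row}")
```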

Data normalization is a very important part of the data collection phase, because data analysis must rest on a unified standard, while different data sources often differ in both the form and the content of the same data. For example, data source A stores dates in "year-month-day" form, while data source B stores them in "month-day-year" form, so the two sources need to be unified into a single format.
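A minimal sketch of unifying the two date layouts; the source formats are the ones described above, and the rest is illustrative:

```python
from datetime import datetime

# Source A stores dates as year-month-day, source B as month-day-year.
SOURCE_DATE_FORMATS = {
    "A": "%Y-%m-%d",
    "B": "%m-%d-%Y",
}

def normalize_date(value, source):
    """Parse a date string from a given source and return it in ISO format."""
    parsed = datetime.strptime(value, SOURCE_DATE_FORMATS[source])
    return parsed.strftime("%Y-%m-%d")

print(normalize_date("2019-12-01", "A"))  # 2019-12-01
print(normalize_date("12-01-2019", "B"))  # 2019-12-01
```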

Some data is also stored with different field types: in data source A the age field is stored as a string, while in data source B it is stored as an integer, so the two fields need to be unified into one data type. In other cases the stored content differs between data sources even though it expresses the same thing. For example, data source A's "sex" field uses "M" and "F" for "male" and "female", while data source B uses "1" for "male" and "0" for "female"; the two sources' "gender" fields therefore need to be unified semantically.
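The same idea applied to field types and coded values, with mapping tables that mirror the example above; the function names are illustrative:

```python
# Unify the age field to an integer, whether it arrives as a string (source A)
# or already as an integer (source B).
def normalize_age(value):
    return int(value)

# Unify the gender field: source A uses "M"/"F", source B uses "1"/"0".
GENDER_CODES = {
    "A": {"M": "male", "F": "female"},
    "B": {"1": "male", "0": "female"},
}

def normalize_gender(value, source):
    return GENDER_CODES[source][str(value)]

print(normalize_age("42"), normalize_gender("M", "A"))  # 42 male
print(normalize_age(42), normalize_gender("1", "B"))    # 42 male
```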

The reason the same information is represented differently across data sources is that each system was designed without considering the others, or that different application providers did not follow a common coding scheme.


Source: blog.51cto.com/14640779/2458464