How to build a data middle platform?

Building a data middle platform is, in essence, about reducing duplicate data, improving data sharing, and connecting data. These three goals correspond to three methodologies: OneData, OneService and OneEntity.

OneData requires that all data in the data warehouse be processed only once. At the warehouse design level, this means that in the detail layer, a measure with the same dimensions and the same granularity is processed only once, and in the summary layer, an indicator at a given granularity exists in only one copy.

OneService is a unified query service. Previously the boundary between data development and application development was fairly blurry: it was unclear which logic should be written by data developers and which by application developers. We even found cases where large-scale data computation was being done inside a big Redis cluster, which was very costly and could not be shared. Data services draw the boundary between data and applications. They expose indicator data that has already been processed, and applications obtain the computed results directly through them. This forces shared calculation logic to sink down to the data layer and improves data sharing.

OneEntity mainly solves the data connection problem. The same user may appear in duplicate records, for example depending on whether the user is logged in or is using the same device model. Identifying that two IDs belong to the same user, so that every user has exactly one ID, is the problem OneEntity sets out to solve.
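The OneEntity idea of collapsing several identifiers into one user ID can be illustrated with a union-find (disjoint-set) structure. This is only a minimal sketch under assumptions: the class name `IdentityResolver`, the `device:` / `account:` identifier formats and the rule for choosing the representative ID are all hypothetical and are not taken from Mammoth.

```python
class IdentityResolver:
    """Groups raw identifiers (device IDs, account IDs, ...) into one entity."""

    def __init__(self):
        self._parent = {}  # identifier -> parent identifier in the same group

    def _find(self, x):
        # Unknown identifiers start out as their own group.
        self._parent.setdefault(x, x)
        # Walk up to the root, shortening the path as we go (path halving).
        while self._parent[x] != x:
            self._parent[x] = self._parent[self._parent[x]]
            x = self._parent[x]
        return x

    def link(self, a, b):
        """Record evidence that identifiers a and b belong to the same user."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            # Assumption: keep the lexicographically smaller root so the
            # resulting entity ID is deterministic.
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            self._parent[root_b] = root_a

    def entity_id(self, x):
        """Return the single ID that represents x's user."""
        return self._find(x)


resolver = IdentityResolver()
# The same account was seen logging in on two different devices.
resolver.link("device:abc", "account:1001")
resolver.link("device:xyz", "account:1001")
```

After these two links, `resolver.entity_id("device:abc")` and `resolver.entity_id("device:xyz")` resolve to the same ID, while an identifier that was never linked keeps its own ID.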

For these three methodologies, our experience is that they have to be implemented through systems: the standards must be built into the tooling to ensure that the construction work is effective. To support building the data middle platform, we developed a full-link big data product, NetEase Mammoth 6.0. Its architecture is as follows:

 

The full-link big data product NetEase Mammoth 6.0 is built on Hadoop and includes 16 sub-products (the modules marked in green in the figure above), covering the full chain of data production and management. To keep the product easy to use, we adopted a modular design: each sub-product focuses on one typical scenario, and a business can select and combine sub-products according to its own needs to solve the problems it currently faces. Mammoth 6.0 also has an extensible architecture: business teams can build new products on top of the basic capabilities that the existing products provide.

On top of the existing big data platform capabilities (big data development, task operations and maintenance, data integration and so on), NetEase Mammoth 6.0 adds two main parts. The first is the OneData system. It uses the metadata center as its foundation and provides five middle-platform products on top of it: the data warehouse design center, the data asset center, the data quality center, the indicator system and the data map.

Data warehouse design center: Models are designed by subject domain, business process and layer, with dimensional modeling as the theoretical basis. Models are defined in terms of dimensions and measures, which ensures that models and fields follow a uniform naming convention.
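A uniform naming convention is easiest to enforce when it is checked by code. The sketch below validates table names against a `<layer>_<subject>_<process>` pattern. The pattern itself, the layer prefixes (`ods`, `dwd`, `dws`, `ads`) and the subject domains are assumptions made for illustration; the article does not state which convention the design center uses.

```python
import re

# Assumed layer prefixes and subject domains (hypothetical, not from the article).
LAYERS = ("ods", "dwd", "dws", "ads")
SUBJECTS = ("trade", "user", "item")

# <layer>_<subject>_<process>, all lowercase, words separated by underscores.
TABLE_NAME = re.compile(
    r"^(?P<layer>%s)_(?P<subject>%s)_(?P<process>[a-z][a-z0-9]*(?:_[a-z0-9]+)*)$"
    % ("|".join(LAYERS), "|".join(SUBJECTS))
)


def parse_table_name(name):
    """Return the layer, subject and process parts, or None if the name is invalid."""
    match = TABLE_NAME.match(name)
    return match.groupdict() if match else None


valid = parse_table_name("dwd_trade_order_detail")
invalid = parse_table_name("tmp_orders")
```

Here `valid` holds the three parsed parts and `invalid` is `None`, so a design tool could reject the second table before it is created.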

Data asset center: Its main role is to take stock of data assets and to govern cost based on data lineage and on how frequently data is accessed.
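One way lineage and access frequency can drive cost governance is to look for tables that nobody reads and that do not feed any table that is read. The following is a rough sketch of that idea only; the `TableAsset` fields, the 30-day window and the sample tables are invented for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class TableAsset:
    name: str
    storage_gb: float
    reads_last_30d: int     # assumed access-heat metric
    downstream: tuple = ()  # names of tables built from this one (lineage)


def governance_candidates(assets):
    """Tables that are unread and feed no read table, largest first."""
    # Invert the lineage: table name -> names of its direct upstream tables.
    upstream = {}
    for asset in assets:
        for child in asset.downstream:
            upstream.setdefault(child, set()).add(asset.name)

    # Every table that is read is needed, and so is everything upstream of it.
    needed = {a.name for a in assets if a.reads_last_30d > 0}
    queue = deque(needed)
    while queue:
        for parent in upstream.get(queue.popleft(), ()):
            if parent not in needed:
                needed.add(parent)
                queue.append(parent)

    idle = [a for a in assets if a.name not in needed]
    return sorted(idle, key=lambda a: a.storage_gb, reverse=True)


assets = [
    TableAsset("ods_orders", 500, 0, ("dwd_orders",)),
    TableAsset("dwd_orders", 200, 12),
    TableAsset("tmp_export", 800, 0),
    TableAsset("dws_old_report", 50, 0),
]
candidates = [a.name for a in governance_candidates(assets)]
```

`ods_orders` is never read directly but is kept because it feeds `dwd_orders`, which is read; the two remaining unread tables become candidates for archiving or deletion.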

Data quality center: After data is processed, it is verified against a rich set of audit and monitoring rules. This ensures that data problems are discovered at the earliest possible moment, avoids wasted downstream computation, and makes it possible to analyze the scope of impact of a data problem.
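The audit rules described above can be pictured as small checks that run right after a table is produced, with downstream tasks held back if any check fails. This is a minimal sketch; the rule names, the message format and the sample rows are assumptions and do not reflect the actual rule set of the data quality center.

```python
def check_not_null(rows, column):
    """Pass only if no row has a missing value in the column."""
    bad = sum(1 for row in rows if row.get(column) is None)
    return bad == 0, f"{column}: {bad} null value(s)"


def check_unique(rows, column):
    """Pass only if the column has no repeated values."""
    values = [row.get(column) for row in rows]
    dupes = len(values) - len(set(values))
    return dupes == 0, f"{column}: {dupes} duplicate value(s)"


def check_min_rows(rows, minimum):
    """Pass only if the table has at least `minimum` rows."""
    return len(rows) >= minimum, f"row count {len(rows)} (minimum {minimum})"


def audit(rows, rules):
    """Run every rule and return the messages of the rules that failed."""
    failures = []
    for rule, args in rules:
        ok, message = rule(rows, *args)
        if not ok:
            failures.append(message)
    return failures


rows = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},
    {"order_id": 2, "amount": 5.0},
]
rules = [
    (check_not_null, ("amount",)),
    (check_unique, ("order_id",)),
    (check_min_rows, (1,)),
]
failures = audit(rows, rules)
```

A scheduler could treat a non-empty `failures` list as a signal to stop the downstream tasks that depend on this table.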

Indicator system: Manages each indicator's business definition, calculation logic and data source, and uses a process-driven workflow to build a complete collaboration flow from indicator requirement, through indicator development, to indicator release.
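The requirement, development and release flow can be modeled as a small state machine in which each indicator name is defined exactly once. The sketch below is illustrative only: `IndicatorRegistry`, its method names, the stage names and the sample indicator are hypothetical.

```python
from enum import Enum


class Stage(Enum):
    REQUESTED = 1
    DEVELOPED = 2
    PUBLISHED = 3


class IndicatorRegistry:
    """Keeps one definition per indicator and enforces the stage order."""

    def __init__(self):
        self._indicators = {}

    def request(self, name, definition, source):
        # A second definition of the same indicator is rejected.
        if name in self._indicators:
            raise ValueError(f"indicator {name!r} is already defined")
        self._indicators[name] = {
            "definition": definition,
            "source": source,
            "logic": None,
            "stage": Stage.REQUESTED,
        }

    def develop(self, name, logic):
        indicator = self._indicators[name]
        if indicator["stage"] is not Stage.REQUESTED:
            raise ValueError(f"indicator {name!r} is not awaiting development")
        indicator["logic"] = logic
        indicator["stage"] = Stage.DEVELOPED

    def publish(self, name):
        indicator = self._indicators[name]
        if indicator["stage"] is not Stage.DEVELOPED:
            raise ValueError(f"indicator {name!r} has not been developed yet")
        indicator["stage"] = Stage.PUBLISHED

    def stage(self, name):
        return self._indicators[name]["stage"]


registry = IndicatorRegistry()
registry.request("gmv", "Total amount of paid orders", "dwd_trade_order_detail")
registry.develop("gmv", "SUM(pay_amount)")
registry.publish("gmv")
```

Rejecting a second `request` for the same name is one simple way to keep a single agreed definition per indicator.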

Data map: Provides fast metadata search and lets users look up the data dictionary, data lineage and data profile information. It is effectively a portal for metadata.

The second part is the OneService system, which corresponds to data services. Data services expose a RESTful API to the outside and shield callers from the various underlying data sources. Processed indicators are exported to Greenplum, MySQL, Redis or HBase for querying, and the data service translates each RESTful API call from a user into low-level access to the appropriate data source. Data services can be regarded as the gateway to the data warehouse.
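The gateway role of data services can be sketched as a registry that maps each API path to a backend and a query, so callers never deal with the storage engine directly. Everything in this example is a stand-in: the `DataService` class, the API paths and the two in-memory backends (a dict playing the role of a key-value store and a list playing the role of a relational table) are assumptions, and no real Greenplum, MySQL, Redis or HBase client is used.

```python
class DataService:
    """Routes a logical API path to the backend that holds the data."""

    def __init__(self):
        self._backends = {}  # source name -> callable(query, params)
        self._apis = {}      # API path -> (source name, query)

    def register_backend(self, name, executor):
        self._backends[name] = executor

    def register_api(self, path, source, query):
        self._apis[path] = (source, query)

    def handle(self, path, params):
        # The caller only knows the path; the source stays hidden.
        source, query = self._apis[path]
        return self._backends[source](query, params)


# Stand-in for a key-value store: the query is a key template.
kv_store = {"gmv:2020-03-01": 1250}


def kv_backend(query, params):
    return kv_store.get(query.format(**params))


# Stand-in for a relational table: the query is the column to return.
order_table = [
    {"date": "2020-03-01", "city": "Hangzhou", "orders": 40},
    {"date": "2020-03-01", "city": "Beijing", "orders": 25},
]


def table_backend(query, params):
    return [
        row[query]
        for row in order_table
        if all(row.get(key) == value for key, value in params.items())
    ]


service = DataService()
service.register_backend("kv", kv_backend)
service.register_backend("table", table_backend)
service.register_api("/metrics/gmv", "kv", "gmv:{date}")
service.register_api("/metrics/orders", "table", "orders")

gmv = service.handle("/metrics/gmv", {"date": "2020-03-01"})
orders = service.handle("/metrics/orders", {"city": "Hangzhou"})
```

Because applications depend only on the path, an indicator could be moved from one store to another by changing its registration, without touching the callers.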

Above data services sits the application layer, which can be divided into two categories. The first is general-purpose data applications, including reporting systems, large-screen dashboard systems and self-service analysis systems. These have no industry-specific attributes, so any business can use them. The second is industry-specific data applications, such as supply chain systems for e-commerce and public opinion systems for media. In our definition of the data middle platform, general-purpose data applications also fall within its scope, because the essence of the middle platform is to provide common capabilities and shared data.


Origin www.cnblogs.com/163yun/p/12463453.html