Recent thoughts on big data service platforms and the Mammoth big data platform [excerpt]

After a year of rapid development, the Mammoth big data platform has become the big data development tool of choice for many products in the company. It was originally positioned as a portal-style development website with scheduling management at its core: it integrates the company's existing big data tools and provides a visual user interface and a unified user permission management mechanism. Users with even a rough understanding of the development process can quickly feel at home in Mammoth, from DS (DataStream) access and uploads, to the scheduling and control of MR tasks, to Hive queries. Driven by user feedback, Mammoth keeps evolving: more and more components are covered, and the interaction and workflows keep improving. But is a framework like this the final form of Mammoth? Naturally not. What lies before Mammoth is only the tip of the iceberg of the big data ecosystem. Stacking tools together does not make a true ecosystem, only a pile of tools. Only by organically integrating the components into complete workflow solutions, supplemented by public services that sit closer to the business, can the platform play a larger role across the whole big data development process.

 

This article discusses what an ideal big data platform looks like from several angles: the platform's users, its core, and the form its services take.

1 Segmenting the users

Who exactly should a big data platform serve? This is a question I have kept thinking about since the very beginning of the project.

Mammoth's original name was the "Mammoth big data development platform", which means that from the earliest versions its main goal was to serve the data developers on each project in the company. Data development work in the company is essentially ETL (Extract-Transform-Load): synchronizing the various data sources, using MR or Hive to extract and transform data, and, together with the scheduling system, maintaining a set of data job flows. Data developers can be said to be the core of a data system: they are the processors of data, the bridge between raw data and the data applications that are the ultimate goal.

However, as more and more users came onto Mammoth, the dependency between the platform and the underlying systems became increasingly important, and platform administrators became another important class of Mammoth users. A further goal of Mammoth is to be the single entrance to the big data platform as a whole: all tasks and schedules are submitted through Mammoth, and permissions, auditing, and so on are taken over and managed by Mammoth. Consequently, platform management roles such as Hadoop administrators and DBAs also have a strong desire to perform management and statistical queries in Mammoth, so that they can conveniently view cluster load, the status of submitted tasks, and other operations and maintenance information.

Besides data developers and system administrators, there is a third category of users that Mammoth does not yet take into account: data users. As data takes up a growing share of product operations, more and more people are starting to use data, including analysts and ordinary operational decision-makers. If people with some knowledge of the data could obtain simple data on a self-service basis, the pressure on data developers would be relieved. The old model, in which every data request goes through the data developers in a unified process, is unreasonable and does not scale. Under that model data responses become very slow and data developers easily become the bottleneck: their precious time is swallowed up by miscellaneous business requests, while the real work of building the data system is likely to be shelved.

Segmenting services is the future trend. Different users have very different needs, and how to make the platform feel easy to use for every role is a question that urgently needs to be considered. Dividing the platform's functional modules by user category is a viable solution; the latest version of Mammoth already separates data developers and data users and guides them through different entrances.

 

2 The core of the data platform

Is the real core value of a big data platform a set of interactive UIs on top of the underlying systems? Of course not. Task scheduling? Is that all? What, then, is the real core of a big data platform?

The data warehouse: yes, I think the data warehouse is the real core of a data platform. The data warehouse referred to here is broadly defined. It is a complete system that stores data drawn from multiple databases and other sources, processes it through transformations, and provides applications with a unified interface for querying and analyzing the data. In Inmon's definition: "A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process."

The data warehouse is a matter of data management. In development work centered on the data warehouse, the most important part of the process is ETL, and the following topics are inseparable from ETL yet easily overlooked.

  • Metadata Management

First, metadata management. The role of metadata shows up in several ways: metadata is a necessity for data integration; it helps users understand the data; it is the key to guaranteeing data quality, because it lets you understand the ins and outs of the data; and it is the foundation for supporting changing data requirements. Concretely, a metadata management system is needed to manage these descriptions of the data, covering both technical-level and business-level meta-information. For example, it records the layered model of the data warehouse, which tables belong to the detail layer and which to the aggregate layer; what each field in each table means and what its standard format is; and, in multidimensional analysis models, which tables are fact tables and which are dimension tables, along with dimension hierarchies, levels, attributes, measure definitions, and filters. The flow of data, that is, data lineage, must also be tracked. On top of all this there needs to be a mechanism for retrieving metadata quickly, to help users who are unfamiliar with the data warehouse locate and obtain the data they need.

A good metadata management system must provide at least the following:

  1. A complete data dictionary repository

  2. Data lineage

A data warehouse contains raw data, intermediate data produced during processing, wide tables, data marts, and so on. A well-organized data warehouse must be able to track and manage its data quickly and accurately: distinguishing data by subject, by update cycle, and by level of granularity, and describing data completeness and consistency at a higher level. In the past, data developers often described data tables and data files using tools such as a wiki or git. This has two problems. First, the format of such descriptions is quite free, so the data cannot be described in the unified terms that data warehouse metadata requires. Second, this kind of metadata is hard to keep up to date: once other users modify something in the data warehouse, the metadata is difficult to update promptly and reliably, which eventually leads to lagging, stale, and chaotic metadata.
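To make the contrast with free-form wiki pages concrete, here is a minimal sketch of what a structured data dictionary entry might look like. It is only an illustration: the field names (layer, update_cycle, columns) and the idea of keyword search over them are assumptions, not Mammoth's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ColumnMeta:
    name: str          # physical column name
    dtype: str         # storage type, e.g. "string", "bigint"
    meaning: str       # business meaning of the column
    fmt: str = ""      # expected standard format, e.g. "yyyy-MM-dd"

@dataclass
class TableMeta:
    name: str                      # e.g. "dw.user_daily" (hypothetical table name)
    layer: str                     # warehouse layer: "detail" or "aggregate"
    subject: str                   # business subject area
    update_cycle: str              # e.g. "daily", "hourly"
    is_fact: bool                  # fact table vs. dimension table
    owner: str                     # responsible data developer
    columns: List[ColumnMeta] = field(default_factory=list)

# A data user searching the dictionary only needs simple keyword matching
# over table names, subjects and column descriptions to locate candidate tables.
def search(catalog: List[TableMeta], keyword: str) -> List[TableMeta]:
    kw = keyword.lower()
    return [t for t in catalog
            if kw in t.name.lower()
            or kw in t.subject.lower()
            or any(kw in c.meaning.lower() for c in t.columns)]
```

Because the entries are structured rather than free text, the same records can feed search, classification, and the lineage and quality checks discussed below.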

In its latest version Mammoth has added some data management functions. To help users understand the data at this layer, Mammoth provides table and field records, search, bookmarking, and other functions, and it can be associated with existing Hive tables: data developers can link Hive tables into the data dictionary repository, classify and tag the data, and add further descriptions and explanations. Ordinary data users can then quickly search the data dictionary and get help on the tables and columns they need on their own. In the future Mammoth will also provide data lineage, which tracks data longitudinally along the chain of data development. The core of lineage is the task scheduling system: from the scheduler we can obtain each task's jobs, their inputs and outputs, their execution logs, and so on. Associating each job's inputs with its outputs and organizing them yields a directed acyclic graph (DAG), where the nodes are input and output files and the edges are the jobs registered in the scheduling system. As for implementation, MR jobs and Spark RDDs themselves are, or contain, lineage relationships, and Hive has a dedicated tool, org.apache.hadoop.hive.ql.tools.LineageInfo, that can extract the input and output tables of a query.
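A rough sketch of how lineage could be assembled from scheduler metadata, following the idea above: nodes are input/output datasets and each edge carries the job that produced the dependency. The job records here are invented for illustration; in practice they would come from the scheduler's job definitions and execution logs (or, for Hive, from a tool such as LineageInfo).

```python
import networkx as nx

# Hypothetical job records as they might be exported from the scheduler:
# each job lists its input and output datasets.
jobs = [
    {"job": "clean_logs",   "inputs": ["raw.app_log"],                  "outputs": ["dwd.app_log"]},
    {"job": "daily_active", "inputs": ["dwd.app_log"],                  "outputs": ["dws.daily_active"]},
    {"job": "report_join",  "inputs": ["dws.daily_active", "dim.user"], "outputs": ["ads.active_report"]},
]

def build_lineage(job_records):
    """Build a DAG whose nodes are datasets and whose edges are labelled with jobs."""
    g = nx.DiGraph()
    for rec in job_records:
        for src in rec["inputs"]:
            for dst in rec["outputs"]:
                g.add_edge(src, dst, job=rec["job"])
    return g

g = build_lineage(jobs)

# Upstream lineage of a table: every dataset it ultimately depends on.
print(sorted(nx.ancestors(g, "ads.active_report")))
# Downstream impact of a table: every dataset derived from it.
print(sorted(nx.descendants(g, "dwd.app_log")))
```

With such a graph, "what does this table depend on" and "what breaks if this table is wrong" both become simple graph queries.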

  • Data quality control

Besides metadata management, monitoring data quality is another very important aspect. Data quality can be broken down into four basic elements: completeness, consistency, accuracy, and timeliness. It involves three important processes: profiling, auditing, and correcting.

Profiling: a summary analysis of the data, used to check whether the data is usable and to collect statistics and other information about it. It is similar to ANALYZE in a traditional database, but a more comprehensive profile includes the number of records in the collection, maximum and minimum values, maximum and minimum lengths, cardinality (the number of unique values), the number of nulls, the mean, the median, distribution information, and so on. From these key quantitative indicators one can spot potential outliers in the data, and some tools can even produce a data quality score.
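As a minimal sketch of the kind of profile described above, assuming the table has already been pulled into a pandas DataFrame (for large Hive tables the same statistics would be computed with SQL or Spark); the sample table is made up:

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary statistics used to judge whether the data looks usable."""
    rows = []
    for col in df.columns:
        s = df[col]
        numeric = pd.api.types.is_numeric_dtype(s)
        rows.append({
            "column": col,
            "records": len(s),
            "nulls": int(s.isna().sum()),
            "cardinality": int(s.nunique(dropna=True)),
            "min": s.min() if s.notna().any() else None,
            "max": s.max() if s.notna().any() else None,
            "mean": s.mean() if numeric else None,
            "median": s.median() if numeric else None,
        })
    return pd.DataFrame(rows)

# Example with a small, made-up table.
df = pd.DataFrame({"uid": [1, 2, 2, None], "amount": [9.9, 120.0, 120.0, -3.0]})
print(profile(df))
```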

Auditing: reviewing and verifying the data against the four basic elements of data quality. Timeliness is checked directly by monitoring the ETL tasks. Completeness includes the completeness of records and the completeness of fields: the most common field-level anomaly is an excessive number of null values in the statistics, while common record-level anomalies are record counts that are far too high or far too low. Consistency includes consistency of encoding rules and consistency of logical rules: the former can be judged against the established encoding rules, whereas the logical consistency of data is more complex, with rules within a single attribute as well as rules across attributes. Accuracy problems are typically errors such as wrong orders of magnitude, truncation, and garbled characters, which can be detected by analyzing anomalies in the median, the mean, and the data distribution.
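Auditing can then be expressed as rules evaluated against a profile like the one above. The thresholds below are invented placeholders; in practice they would be configured per table by the data owner.

```python
def audit(column_stats, expected_min_rows=1000, max_null_ratio=0.05):
    """Check completeness and a simple accuracy rule against configured thresholds.

    `column_stats` is a list of per-column profiles like those produced in the
    profiling sketch (records, nulls, min, ...). Returns human-readable findings;
    an empty list means the audit passed.
    """
    findings = []
    for col in column_stats:
        if col["records"] < expected_min_rows:
            findings.append(f'{col["column"]}: only {col["records"]} records, expected at least {expected_min_rows}')
        if col["records"] and col["nulls"] / col["records"] > max_null_ratio:
            findings.append(f'{col["column"]}: null ratio {col["nulls"] / col["records"]:.1%} exceeds {max_null_ratio:.0%}')
        if isinstance(col.get("min"), (int, float)) and col["min"] < 0:
            findings.append(f'{col["column"]}: negative minimum {col["min"]} looks like an accuracy problem')
    return findings

# Example: an "amount" column with too many nulls and a negative value.
stats = [{"column": "amount", "records": 2000, "nulls": 300, "min": -3.0}]
for finding in audit(stats):
    print(finding)
```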

Correcting: fixing the data, including filling in missing values, removing duplicate records, making data consistent, and repairing abnormal data. Missing records can be re-fetched from the original data; missing field values need to be predicted or estimated; deduplication requires judging against the unique-value constraints; and inconsistent records require correction rules drawn up from familiarity with the data sources. In general, most abnormal data is hard to correct, and much of it can never be restored one hundred percent. The last resort is to filter the abnormal data out, excluding it from the data warehouse to avoid interference.
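A short pandas sketch of the correcting step under the same assumptions: fill what can be estimated, deduplicate on the key assumed to be unique, and filter out what cannot be repaired. The column names and rules are illustrative only.

```python
import pandas as pd

def correct(df: pd.DataFrame) -> pd.DataFrame:
    """Fill estimable gaps, drop duplicates, and filter out remaining bad rows."""
    out = df.copy()
    # Fill missing amounts with the column median (a simple estimate).
    out["amount"] = out["amount"].fillna(out["amount"].median())
    # Deduplicate on the business key assumed to be unique per day.
    out = out.drop_duplicates(subset=["uid", "day"])
    # Last resort: rows that still violate basic rules are excluded, not "fixed".
    out = out[out["amount"] >= 0]
    return out

raw = pd.DataFrame({
    "uid":    [1, 2, 2, 3],
    "day":    ["2019-07-01"] * 4,
    "amount": [9.9, 120.0, 120.0, None],
})
print(correct(raw))
```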

Of data quality management, Mammoth currently covers only the monitoring of ETL tasks: it can monitor each ETL task and job, down to the Hadoop jobs, and raise alarms for failed or timed-out tasks. Consistency, completeness, and accuracy checks are still blank, and this is where Mammoth's development effort will focus in the next stage. Given the subjectivity and uncertainty involved in correcting data, Mammoth's data quality assurance will concentrate on profiling and auditing, in the hope of helping users discover abnormal data as early as possible so that remediation can start as early as possible. Data quality assurance in Mammoth will be provided as a pluggable module that can optionally be integrated with the task scheduling system. For users whose data is highly important and whose computing resources are relatively ample, data quality assurance can greatly improve the service experience and keep the data warehouse system running normally.

 

3 A change in the data development model

In conventional data processing, data integration and ETL (or ELT) development are completed with the help of professional tools such as Oracle OWB or MS SSIS. In the big data era, when the core of the data warehouse moved to Hadoop, there was for a long time no suitable open source ETL tool to assist the work. We can see many data developers hand-writing MapReduce to complete a whole series of data cleansing and data transformation steps. Maintaining this pile of code and keeping it running normally is a big project; it is as if, in modern application development, developers were still maintaining a pile of assembly code!

The shortcomings of hand-written MapReduce are obvious. As a low-level API it is quite complex and offers a very high degree of freedom, and with such freedom, coding standards, readability, and maintainability all become problems. It is true that MapReduce code can deliver better performance when carefully optimized, but compared with readability and maintainability, that promised performance advantage is not what matters most. In real projects one often sees, at handover time, a pile of MR code that is very difficult to maintain; the colleague taking it over modifies it with fear and trembling, and when the code finally has to be extended it is a great deal of trouble and development efficiency is very low.

To solve this problem the community has produced a variety of tools: Hive, Cascading, and so on. Hive lets you express complex MR tasks in SQL, and combined with Hive's extensible UDFs and UDAFs, SQL, as a more concise and more general language, can take on all of the ETL development work. Maintaining SQL effectively solves the problems of hand-written MapReduce, and because SQL has such a wide audience it further lowers the barrier to entry, letting data developers focus more on the business process rather than struggling to maintain a pile of low-level code.

Mammoth will likewise try to provide a set of SQL-based ETL tools, in the hope that this convenient development approach will help developers get started faster with building a data warehouse. The transformation starts with data storage, with the core idea of "structuring the data" running through the whole development process. Data on Hadoop changes from HDFS files into two-dimensional tables as the basic unit, which also connects better with the ETL processes of traditional relational-database-based data warehouses. In the new development model, Mammoth hides the HDFS distributed file system behind the scenes and puts an end to looking at the data warehouse from the perspective of files. Before importing user logs from DataStream, the user specifies the log SerDe and the table definition, and streaming tools such as Storm or Spark Streaming can be used to preprocess the data. Combined with the data lineage analysis mentioned above, Hive tooling can be used to quickly build SQL over tables and the tables linked to them. UDFs and UDAFs can be shared among data developers: users edit and manage their UDFs/UDAFs in Mammoth in a unified way and share them within a group, and Mammoth can also provide a UDF/UDAF development framework so that user-defined functions can be written quickly.
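A minimal sketch of what SQL-based ETL with a shared UDF might look like on top of Hive, assuming a PySpark environment with Hive support. The table names (ods.app_log, dwd.app_log) and the normalize_udid function are hypothetical examples, not Mammoth's actual interface.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = (SparkSession.builder
         .appName("sql-etl-sketch")
         .enableHiveSupport()          # work with Hive tables instead of raw HDFS files
         .getOrCreate())

# A shared user-defined function: normalize device ids before they enter the warehouse.
def normalize_udid(udid):
    return udid.strip().lower() if udid else None

spark.udf.register("normalize_udid", normalize_udid, StringType())

# The ETL step itself is plain SQL over tables, not hand-written MapReduce over files.
spark.sql("""
    INSERT OVERWRITE TABLE dwd.app_log PARTITION (day = '2019-07-01')
    SELECT normalize_udid(udid) AS udid,
           event,
           ts
    FROM ods.app_log
    WHERE day = '2019-07-01'
      AND udid IS NOT NULL
""")
```

The point of the example is the division of labour: the platform owns the table definitions and the shared UDF registry, while the developer writes only the SQL that expresses the business transformation.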

 

4 Managing the data platform

The management referred to here is not only the platform's management of data, but also the management of the platform itself as a system. Mammoth's current management functions include user resource allocation, approval of user directory permissions, and user group management. These alone are not enough.

The platform's management functions should make visible how every user and every project on the platform is operating, including:

  • User resource footprint

User resources include storage, computing, and so on. Storage resources are easier to understand: how much space each user's files occupy on the distributed file system. This data helps with better planning of the cluster's storage resources and with spotting users whose data volume is growing too fast, so they can be helped to plan their resources sensibly. Rankings of the largest tables and projects let operations staff see the current cluster load at a glance. Computing resources are harder to measure and can only be estimated from each user's queue usage.
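One simple way to collect the storage half of this picture is to sum each user's directory usage with the standard HDFS shell, as in the hedged sketch below; the /user/&lt;name&gt; layout and the user list are assumptions about how home directories are organized.

```python
import subprocess

def user_storage_bytes(user: str) -> int:
    """Return the total bytes used under a user's HDFS home directory.

    Uses the standard `hdfs dfs -du -s` command; on recent Hadoop versions the
    output format is "<size> <disk-space-consumed> <path>", so the first field
    is the logical size.
    """
    out = subprocess.check_output(
        ["hdfs", "dfs", "-du", "-s", f"/user/{user}"], text=True)
    return int(out.split()[0])

# Rank a (hypothetical) list of users by footprint so operators can see
# at a glance whose data is growing fastest.
users = ["alice", "bob", "etl_music"]
usage = sorted(((user_storage_bytes(u), u) for u in users), reverse=True)
for size, user in usage:
    print(f"{user:>12}  {size / 2**30:8.1f} GiB")
```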

  • User task operations

Statistics on users' task operations come from the task scheduling system: how many tasks a user has, how many jobs each task carries, what type each job is, how long each job and task takes to run, which jobs fail most often and why, and so on. These statistics are very important. The Azkaban scheduling system Mammoth currently uses is, frankly, a system whose resources are hard to plan and measure: all users share one scheduler with no isolation between them, the scheduler itself does not schedule its own resources, and it is fragile when resources run short. For the platform to plan and manage task scheduling resources sensibly, it needs to rely on exactly these statistics. Analyzing the causes of job failures is another pressing problem: inexperienced data developers are often slow at locating the cause of an error, but the platform can provide an effective set of rules for extracting information from error logs and inferring the cause of the failure. For example, if a user's jobs fail very frequently because of insufficient memory, the system can help the user adjust the memory allocation or the trigger time to correct the problem; if a Sqoop task fails repeatedly because of database permissions, the DBA can be contacted as soon as possible to check for a missing ACL or whitelist entry.
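The rule-based diagnosis mentioned here could be as simple as a table of regular expressions mapped to probable causes and suggested actions, as in this sketch; the patterns, the sample log line, and the advice strings are illustrative, not an exhaustive rule set.

```python
import re

# Each rule: (pattern expected in the job's error log, inferred cause, suggested action)
RULES = [
    (re.compile(r"java\.lang\.OutOfMemoryError|is running beyond .* memory limits"),
     "insufficient memory",
     "increase the job's memory setting or reduce the data processed per task"),
    (re.compile(r"Permission denied|Access denied for user", re.IGNORECASE),
     "missing database or file permission",
     "contact the DBA/administrator to check the ACL or whitelist"),
    (re.compile(r"FileNotFoundException|No such file or directory"),
     "upstream output missing",
     "check whether the upstream task finished and produced its output"),
]

def diagnose(error_log: str):
    """Return (cause, advice) for the first matching rule, or None if no rule matches."""
    for pattern, cause, advice in RULES:
        if pattern.search(error_log):
            return cause, advice
    return None

log = "Container [pid=1234] is running beyond physical memory limits. Killing container."
print(diagnose(log))
```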

 

5 Future

A complete big data platform keeps developing on the basis of the underlying technology and user needs. Mammoth needs to provide the ideas for complete solutions and introduce more and better features and tools to meet the data needs of different groups of people. Real-time data processing, richer data synchronization, log retrieval, key-value systems, and algorithm modules are all things we plan to introduce in the future. Users' demands for data keep growing richer, and in the big data era there is no one-size-fits-all solution; adopting targeted measures, aiming at the main segmented scenarios, and breaking through them one by one is the direction of development for the near term. Mammoth's original intention will not change: to lower the threshold of big data and provide more comprehensive, easier-to-use big data services.


Source: www.cnblogs.com/yako/p/11206346.html