Big data platform architecture hierarchy

1. Data source layer: traditional databases, data warehouses, distributed databases, NoSQL databases, semi-structured and unstructured data, crawlers, log systems, and so on. This layer is where the big data platform's data is generated.

2. Data preparation layer: data cleaning, data conversion, data processing, data association, data annotation, data preprocessing, data loading, data extraction, and so on. The role of this layer is to turn raw data into product-ready data.

3. Data storage layer (data center): stores the cleaned data that production systems can use, such as metadata, business databases, and model databases. This layer directly serves application systems, so it requires high reliability, high concurrency, and high precision.

4. Data modeling and mining layer: this layer performs deep processing of the data. Driven by business needs, it builds statistical and analytical models suited to the business, establishes a big data computation and processing platform, and applies algorithms from data analysis, data mining, and deep learning to production data, digging out the intrinsic value of the data to provide data and decision support for business systems.

5. Industry application layer: analyzes the data characteristics of each industry in depth, sorts out the industry's demand for data products, and builds data application products suited to different industries.

6. Data visualization: provides data display and data sharing services in various forms, such as intelligent reports, thematic reports, BI dashboards, and platform interfaces.

There is no single standard for dividing a big data platform architecture into layers. When I did big data application planning in the past, I also struggled with this, because applications can be classified both vertically and horizontally. In the end I feel the division should follow a "usable" principle: clear, easy to understand, and able to guide construction. Here the big data platform is divided into "five horizontal layers and one vertical layer".

 

For details, see the example below. The diagram is fairly classic, the result of some compromise, and it can be mapped onto many of the big data architecture diagrams found on the Internet.

[Figure: the "five horizontal, one vertical" big data platform architecture diagram]


 

Following the flow of data, the platform is divided into five layers from bottom to top. It is actually very similar to a traditional data warehouse, and the two are conceptually connected. The layers are the data collection layer, data processing layer, data analysis layer, data access layer, and data application layer.

 

At the same time, at each corresponding level the big data platform architecture differs from the traditional data warehouse: to cover different scenarios it uses many more technical components, a "hundred flowers blooming" diversity that is itself a source of difficulty.

 

Data collection layer: includes traditional offline ETL collection as well as real-time collection, Internet crawling and parsing, and so on.

 

Data processing layer: depending on the processing scenario, it can be divided into Hadoop, MPP, stream processing, and so on.

 

Data analysis layer: mainly contains the analysis engines, such as data mining, machine learning, and deep learning.

 

Data access layer: its main purpose is to separate reads from writes, splitting the application-facing query capability away from the computing capability. It covers application scenarios such as real-time query, multi-dimensional query, and conventional query.

 

Data application layer: applications are divided by type according to the characteristics of the enterprise. For a telecom operator, for example, there are precision marketing, customer service and complaints, base station analysis, location-based passenger flow analysis, tag-based advertising applications, and so on.

 

Data management: this is the one vertical layer. It mainly implements data management and operations, spanning the other layers to achieve unified administration.

 

1. Data collection layer

 

Offline batch collection uses Hadoop, which has become the mainstream engine for offline collection today. On top of this platform, data collection applications or tools still need to be deployed.

 

BAT (Baidu, Alibaba, Tencent), for example, build such tools themselves. Ordinary companies can use commercial versions, and there are now many options of this kind, such as Huawei BDI. Many vendors are technically strong, but when starting out they often understand the application scenarios poorly and handle the details badly, so the products they build struggle to meet requirements; for example, they may lack statistical functions, a big gap compared with BAT. Traditional companies should be cautious when purchasing such products.

 

Being able to build a tool and being able to build a product are two different levels of achievement. Small Internet companies can certainly build collection tools that are useful to themselves, but abstracting them into a real product is hard. BAT's in-house development has in fact given them a huge advantage here.

 

Real-time collection is now standard on big data platforms. The mainstream stack is probably Flume + Kafka, combined with stream processing and an in-memory database. The technology is certainly reliable, but while such open source components are good, once problems occur the resolution cycle is often long.
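
As a minimal sketch of the consuming end of such a Flume + Kafka pipeline (the topic name, broker address, and downstream handling are illustrative assumptions, not from the original text):

```python
# Minimal sketch of the consuming side of a Flume + Kafka pipeline.
# Assumes Flume agents publish log lines to the (hypothetical) topic "app-logs".
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "app-logs",                          # assumed topic name
    bootstrap_servers="kafka1:9092",     # assumed broker address
    group_id="realtime-collector",
    auto_offset_reset="latest",
    value_deserializer=lambda b: b.decode("utf-8"),
)

for message in consumer:
    # Each record is one collected log line; parse it and hand it off to the
    # stream-processing / in-memory-database stage mentioned in the text.
    record = message.value
    print(message.topic, message.partition, message.offset, record[:80])
```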

 

In addition to Flume, to collect Oracle database tables in real time you can also use technologies such as OGG or DSG for real-time log collection, which solves the load problem caused by full-table extraction in a traditional data warehouse.

 

Crawlers have gradually become standard collection equipment for many companies, because new data on the Internet depends mainly on them: by parsing web pages you can obtain a large amount of online information for public opinion analysis, website ranking, and so on. The author suggests every company consider building an enterprise-level crawler center; if it is not yet in the plan for your big data platform, it is worth thinking about, because if you cannot get the data, nothing else matters.

 

Building an enterprise-level crawler center is quite difficult, because you need not only the crawler itself but also URL and application knowledge bases, plus Chinese word segmentation, inverted indexing, and text mining over the crawled page text. Making this whole stack work is very challenging. There are many open source components available, such as Solr, Lucene, Nutch, and Elasticsearch, but the road to using them well is long.
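
To make the first step concrete, here is a minimal fetch-and-parse sketch (the URL is a placeholder; word segmentation, inverted indexing, and the knowledge bases mentioned above would sit downstream of this):

```python
# Minimal fetch-and-parse step of a crawler; the URL is a placeholder.
# Segmentation, indexing, and text mining would consume the returned text.
import requests                   # pip install requests
from bs4 import BeautifulSoup     # pip install beautifulsoup4

def fetch_page_text(url: str) -> str:
    """Download one page and strip it down to its visible text."""
    resp = requests.get(url, timeout=10, headers={"User-Agent": "demo-crawler"})
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    for tag in soup(["script", "style"]):   # drop non-content elements
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)

if __name__ == "__main__":
    text = fetch_page_text("https://example.com/")  # placeholder URL
    print(text[:200])
```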

 

Another point: if possible, the author recommends upgrading the data collection platform into a data exchange platform, because in reality a great deal of data flows within an enterprise. It is not just one-way collection; there is also a lot of data exchange, such as the need to pour data into GBase, or to move data from HBase to Aster. For applications, this is of great value.

 

Since data collection and data exchange share many very similar functions, why not merge them? It would also make unified management easier. In my experience, much enterprise data exchange is application-driven and its interface management is a mess, hence this suggestion.

 

In general, building a big data collection platform is very difficult. From the customer's perspective, it must satisfy at least the following three requirements (a small scheduling sketch follows the list):

 

Diversified data collection capability: support for real-time incremental collection (using Flume, message queues, OGG, and similar technologies) and distributed batch collection (Sqoop, FTP over HDFS) for tables, files, messages, and other data types, with an order-of-magnitude performance improvement over traditional ETL. This is fundamental.

 

Visual quick-configuration capability: a graphical development and maintenance interface supporting drag-and-drop development without writing code, lowering the difficulty of collection; each data interface should take little time to configure, reducing labor costs.

 

Unified scheduling and control capability: unified scheduling of collection tasks, supporting multiple Hadoop technical components (such as MapReduce, Spark, and Hive), relational database stored procedures, shell scripts, and so on, as well as multiple scheduling strategies (timed / interface notification / manual).
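
As one concrete shape such unified scheduling can take, here is a hypothetical Apache Airflow DAG (Airflow is not named in the original text; the connection strings, paths, and table names are placeholders) that chains a Sqoop batch import to a Hive summarization under a timed strategy:

```python
# Hypothetical unified-scheduling sketch using Apache Airflow (a tool not
# named in the text). Connection strings, paths, and tables are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_collection",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",   # timed strategy: every day at 02:00
    catchup=False,
) as dag:
    # Batch collection: pull a relational table into HDFS with Sqoop.
    sqoop_import = BashOperator(
        task_id="sqoop_import_orders",
        bash_command=(
            "sqoop import --connect jdbc:oracle:thin:@db:1521/ORCL "
            "--username etl --password-file /etc/sqoop/pwd "
            "--table ORDERS --target-dir /data/raw/orders -m 4"
        ),
    )
    # Processing: summarize the freshly loaded data with a Hive script.
    hive_summary = BashOperator(
        task_id="hive_daily_summary",
        bash_command="hive -f /etl/hql/daily_order_summary.hql",
    )
    sqoop_import >> hive_summary
```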

 

2. Data processing layer

 

Hadoop's Hive is the distributed alternative to the traditional data warehouse. It suits the scenarios handled by traditional ETL, such as data cleaning, filtering, conversion, and straightforward aggregation, and the larger the data volume the better the cost-performance. So far, though, the analysis scenarios it supports are limited: simple offline analysis and computation over massive data is what it is good at, while complex multi-way join operations are correspondingly very slow.
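
A minimal sketch of the clean / filter / aggregate pattern Hive is good at, run here through Spark's Hive support (the database, table, and column names are assumptions):

```python
# Minimal sketch of the clean / filter / aggregate pattern that Hive handles
# well, run through Spark's Hive support. Tables and columns are assumed.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-etl-sketch")
    .enableHiveSupport()     # read/write Hive tables via the metastore
    .getOrCreate()
)

# Clean and filter raw records, then aggregate: the classic offline ETL shape.
spark.sql("""
    INSERT OVERWRITE TABLE dw.daily_order_summary   -- assumed target table
    SELECT order_date,
           region,
           COUNT(*)    AS order_cnt,
           SUM(amount) AS total_amount
    FROM   ods.orders                               -- assumed source table
    WHERE  amount IS NOT NULL AND amount > 0        -- cleaning / filtering
    GROUP  BY order_date, region
""")
spark.stop()
```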

 

To give a sense of the limits: building a unified customer-view wide table with Hive is relatively inefficient, because it involves joining data from many sources. It is not impossible, just slow at worst, so a trade-off is still needed.

 

Hadoop cannot support clusters at the X000-node scale, and the data volumes of many enterprises today should exceed that. Apart from companies like Alibaba that have the R&D capability to build their own platform (such as ODPS), enterprises should consider splitting their Hadoop clusters along business lines; Zhejiang Mobile, for example, has split off multiple Hadoop clusters for fixed network, mobile network, and innovation.

 

Hadoop's Spark is very well suited to iterative machine learning, but whether it can be applied at scale to data correlation analysis, and whether it can replace MPP to some extent, still needs to be verified in practice.

 

MPP should be regarded as the best distributed-architecture alternative to the traditional data warehouse; after all, it is essentially a variety of relational database, with full SQL support. Used after Hive's conversion and analysis, its performance is more than sufficient for fused data warehouse modeling, and its cost-performance beats traditional DB2. For example, in practical use, a GBase cluster of 30 to 40 nodes can outperform two fully configured IBM 780s.

 

There are now many MPP products, and it is hard to rank their merits, but some practical results can be shared. GBase is good: many of our company's systems run on it, it is mainly domestic, and its technical service guarantees are relatively reliable. Aster remains to be seen, though the algorithm library it ships with gives it some advantages. Greenplum and Vertica we have never used, so it is hard to say.

 

It is now said that MPP will eventually be replaced by the Hadoop framework as components such as Spark gradually stabilize and mature, but in the short term I think MPP is still very dependable. If a data warehouse is to evolve progressively, MPP is indeed a good choice.

 

Many companies, such as China Mobile and eBay, are now adopting this kind of hybrid architecture to suit different application scenarios, which is clearly a natural choice.

 

Stream processing is the indispensable third horse in the big data platform troika, alongside Hadoop and MPP.

 

For many companies it is clearly a weapon of strategic importance, and a large number of application scenarios require it, so it must be built. For example, the real-time and quasi-real-time data warehouse scenarios that were unimaginable in the IOE era become very simple with stream processing. Computing a real-time indicator used to be a painful undertaking; now a real-time system such as anti-fraud can be deployed in a single day.
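
The engines the author evaluates below are Storm and IBM Streams; purely to illustrate how simple a real-time indicator has become, here is a sketch using Spark Structured Streaming (a different engine than those discussed; the topic and broker names are assumptions) that counts events per minute from Kafka:

```python
# Illustrative real-time indicator: events per minute from a Kafka topic.
# Uses Spark Structured Streaming (not the engines evaluated in the text);
# the broker and topic names are assumptions. Requires the
# spark-sql-kafka connector package on the Spark classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("realtime-indicator").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka1:9092")  # assumed broker
    .option("subscribe", "transactions")               # assumed topic
    .load()
)

# Count records per 1-minute event-time window: a simple real-time indicator.
counts = (
    events
    .withWatermark("timestamp", "2 minutes")
    .groupBy(window(col("timestamp"), "1 minute"))
    .count()
)

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```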

 

I have only tried Storm and IBM Streams, and I recommend IBM Streams. Although it is a commercial product, its processing capacity is far beyond Storm's, and Storm is said to be barely updated any more. Where data volumes are not enormous, a commercial product like IBM's is a good choice, more than sufficient to support all kinds of real-time application scenarios.

 

A stream processing cluster combines stream processing technology with an in-memory database for real-time and quasi-real-time data processing. In our case, an IBM Streams cluster carries the company's real-time business:

[Figure: IBM Streams stream processing cluster carrying the company's real-time business]


 

3. Data analysis layer

 

Let's talk about languages first. R and Python are the current pair of friendly rivals in the open source data mining world. If I had to choose, I could not say which is better; my feeling is that Python leans more toward engineering, with direct support for things like word segmentation, while R's plotting ability is exceptionally strong. But both historically focused on sample-based statistics, so their support for very large data sets is limited.

 

Spark is an option, and Spark + Scala is the recommended pairing: Spark itself is written in Scala, so this combination gets quick access to many native features.
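
As a minimal sketch of Spark's machine learning support (shown here through the Python API rather than the Scala API recommended above; the tiny inline dataset is synthetic):

```python
# Minimal Spark MLlib sketch, via the Python API rather than the Scala API
# recommended above; the tiny inline dataset is synthetic.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

df = spark.createDataFrame(
    [(0.0, 1.2, 0.7), (1.0, 3.1, 2.4), (0.0, 0.8, 0.2), (1.0, 2.9, 3.3)],
    ["label", "f1", "f2"],
)

# Assemble feature columns into the single vector column MLlib expects.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
train = assembler.transform(df)

# Iterative training is exactly the workload Spark is suited to.
model = LogisticRegression(maxIter=50).fit(train)
model.transform(train).select("label", "prediction").show()
spark.stop()
```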

 

TD's MPP database Aster also has many algorithms embedded in it, optimized for its parallel architecture, and it looks like an option. I exchanged views on it several times in the past: the speed is indeed very fast, but its support for the range of usable data is very limited.

 

Traditional data mining tools are not standing still either: SPSS now offers IBM SPSS Analytic Server, which strengthens its support for big data on Hadoop, and feedback from the business staff who use it is good.

 

Perhaps future machine learning will also settle into a high/low mix: advanced users on Spark, less technical users on SPSS, each suiting different application scenarios.

 

In any case, a tool is only a tool; in the end everything depends on the modeling engineer's ability to master it.

 

4. Data access layer

 

Some engineers expose Hive directly for queries. Although this is unreasonable, it does show that computation and query demand completely different technical capabilities, and even within the query domain, different technologies must be chosen for different scenarios.

 

HBase is very easy to use. Built on column-family storage, its query latency is in the millisecond range, it handles queries over tens of billions of records with ease, and it offers a degree of high availability. Detail-record queries and index-library lookups in our production systems are very good application scenarios for it. But reads only support access by key or key range, so the rowkey must be designed carefully.
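
A minimal sketch of the key / key-range access pattern (using the happybase client; the table name and the rowkey layout are assumptions):

```python
# Minimal HBase access sketch via the happybase client; the table name and
# the "<userid>_<yyyymmdd>" rowkey layout are illustrative assumptions.
import happybase  # pip install happybase (talks to the HBase Thrift gateway)

connection = happybase.Connection("hbase-thrift-host")  # assumed host
table = connection.table("user_detail")                 # assumed table

# Point lookup by rowkey: millisecond-range access.
row = table.row(b"100234_20240101")
print(row)

# Range scan: good rowkey design makes "one user's January" a single scan.
for key, data in table.scan(row_start=b"100234_20240101",
                            row_stop=b"100234_20240201"):
    print(key, data)

connection.close()
```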

 

Redis is a key-value database with read and write speeds even faster than HBase. Most of the time, what HBase can do Redis can also do, but Redis is memory-based and mainly used as a key-value in-memory cache, with some risk of data loss. We currently use it for real-time tag queries, and most of the Internet and advertising companies we have worked with use this technology; but once the data grows large enough, HBase becomes the only option.
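
A minimal sketch of the real-time tag-query pattern mentioned above (using the redis-py client; the key naming and TTL are assumptions):

```python
# Minimal real-time tag-cache sketch with redis-py; the "tags:<userid>"
# key naming and the one-day TTL are illustrative assumptions.
import redis  # pip install redis

r = redis.Redis(host="redis-host", port=6379, decode_responses=True)

# Write a user's tags as a hash, expiring after a day (memory is finite,
# and as the text notes, losing cached data here is tolerable).
r.hset("tags:100234", mapping={"segment": "high_value", "churn_risk": "low"})
r.expire("tags:100234", 86400)

# Real-time tag lookup, e.g. from an ad-serving or marketing decision path.
tags = r.hgetall("tags:100234")
print(tags)  # {'segment': 'high_value', 'churn_risk': 'low'}
```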

 

In addition, we use Impala for real-time online queries over Internet logs, and SQLFire and GemFire on the marketing platform for distributed, memory-based SQL correlation analysis. The speed is acceptable, but there are many bugs, and the cost of introduction and adaptation is relatively high.

 

Kylin is currently the killer tool for multi-dimensional analysis on Hadoop/Spark. It has many application scenarios, and I hope to have the chance to use it.

 

5. Data application layer

 

Each enterprise should plan its applications according to its own situation. Drawing up an application blueprint is in fact difficult, because the higher you go in the big data architecture, the less stable things are; the pace of change is simply too fast. Below is a fairly typical application plan for a telecom operator at the current stage, for reference:

[Figure: a typical operator big data application plan]

 


 

6. Data management

 

Management of the big data platform divides into application management and system management. On the application side, for example, we built the DACP visual management platform, which adapts to 11 big data technology components and provides transparent access to each of them. Through the platform we also manage the whole life cycle from data design and development to data destruction, with standards, quality rules, and security policies solidified into the platform, achieving beforehand management, in-process control, and after-the-fact auditing: comprehensive quality management and security management.

 

Other aspects, such as scheduling management, metadata management, and quality management, go without saying; because the source of development is controlled, the complexity of data management drops greatly.

 

From the system management perspective, the company has brought the big data platform into a unified cloud management platform (a private cloud). The cloud management platform includes visual operation and maintenance tools supporting one-click and incremental deployment, a multi-tenant computing resource management and control system (tenant management, security management, resource management, load management, quota management, and metering management), and a complete user rights management system, providing enterprise-grade operation and maintenance capability for the big data platform. Of course, goals this ambitious are not the work of a single day.

 

To sum up, here are some of the revolutionary implications of the big data platform.

 

In the era of big data, the architecture of most enterprises will inevitably become distributed, scalable, and diversified. As the saying goes, what is long united must divide: no single technology can dominate everything any more. This impacts the traditional, centralized technology outsourcing model of enterprises, and the challenge is huge.


 

In the era of big data and cloud computing there are a great many technology components, and with every new technology an enterprise adopts, opportunity and risk coexist:

 

With the commercial versions of big data platforms, the enterprise faces partners whose services cannot keep up, because development moves too fast; with the open source versions, the enterprise faces the challenge of its own operations and technical capabilities, with distinctly higher demands on self-reliance.

 

At present, companies such as BAT, Huawei, and the newer Internet firms are sweeping up the talent, and the challenge this poses to operators and similar enterprises is huge, but it also contains opportunity. In fact, for people committed to big data, joining an operator or similar enterprise is also a good choice: such enterprises are transforming, their data volumes are large enough, and there are more chances to take the technical lead.
