Common big data platform architecture design ideas

In recent years, with the continuous development of big data and machine learning technology, more and more enterprises have realized the value of their data: they manage the data itself as a valuable asset and use big data and machine learning capabilities to mine, identify, and exploit those data assets. Without an effective overall data architecture design, or with a design that is only partially in place, the business layer finds it hard to use big data directly; a chasm opens up between big data and the business, and during actual use the business runs into problems such as not knowing what data exists, requirements that are hard to implement, and data that is hard to share. This article describes some design ideas for data platforms to help enterprises reduce the difficulties and pain points of data development.

This article includes the following sections:

  1. The first part introduces the components of a big data infrastructure and related background knowledge.

  2. The second part describes the Lambda and Kappa architectures.

  3. The third part describes the typical big data architecture under the Lambda and Kappa architectural patterns.

  4. The fourth part describes the difficulties and pain points of a bare, end-to-end data architecture.

  5. The fifth part describes an excellent overall big data architecture design.

  6. From the fifth part onward, the article introduces how a variety of data platforms and components can be combined into an efficient, easy-to-use data platform that improves the efficiency of business systems and business development, so that developers no longer fear complex big data components and do not need to pay attention to the underlying implementation. Development can be completed one-stop using only SQL, with data flowing back to the business, so that big data is no longer a skill that only big data engineers possess.

1. The big data technology stack

The big data pipeline as a whole involves many modules, each of which is fairly complex. The figure below lists the modules and components together with their functions and characteristics. Follow-up articles will introduce the knowledge of the related modules in detail, such as data collection, data transmission, real-time computing, offline computing, and big data storage.

 

 

2. The Lambda and Kappa architectures

Essentially all big data architectures today are based on the Lambda or Kappa architecture; on top of these two patterns, different companies design the data architecture that fits their own needs. The Lambda architecture lets developers build large-scale distributed data processing systems. It has good flexibility and scalability, as well as good fault tolerance against hardware failures and human error; many articles about the Lambda architecture can be found online. The Kappa architecture addresses the problem that the Lambda architecture requires maintaining two separate data processing systems, which leads to various costs. It is the direction of current research on stream-batch unification, and many companies have started to adopt this more advanced architecture.

 

Lambda architecture

 

Kappa architecture
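To make the difference between the two patterns concrete, here is a minimal, purely illustrative Python sketch (the view stores and function names are hypothetical, not any specific system's API): under Lambda, a query merges a periodically recomputed batch view with an incrementally updated speed view, while under Kappa a single stream job maintains the view, and historical recomputation is done by replaying the log through that same job.

```python
# Purely illustrative; batch_view, speed_view and stream_view stand for
# hypothetical key-value stores, not a real library API.

def serve_query_lambda(key, batch_view, speed_view):
    """Lambda: merge the complete-but-stale batch layer result with the
    fresh-but-partial speed layer result at query time."""
    batch_result = batch_view.get(key, 0)   # recomputed periodically from all raw data
    speed_result = speed_view.get(key, 0)   # incrementally updated from the stream
    return batch_result + speed_result

def serve_query_kappa(key, stream_view):
    """Kappa: one stream job maintains the view; historical recomputation is done
    by replaying the log through the same job, not by a second batch system."""
    return stream_view.get(key, 0)

# Example: counting page views per user.
print(serve_query_lambda("user_1", {"user_1": 100}, {"user_1": 3}))  # 103
print(serve_query_kappa("user_1", {"user_1": 103}))                  # 103
```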

 

3. Big data architecture under the Lambda and Kappa patterns

 

At present, most major companies adopt either the Kappa or the Lambda architectural pattern. In the early stages of development, the overall big data architecture under these two patterns may look like the following:

 

4. End-to-end data pain points

 

Although the architecture above looks like it links a wide variety of big data components together into integrated management, anyone who has actually done data development on it will feel the friction strongly. With such a bare architecture, business data development requires paying a great deal of attention to how the underlying tools are used, and real data development has many pain points and difficulties, in particular in the following areas.

 

  1. There is no data development IDE to manage the entire data development pipeline, so long-running processes cannot be managed.

  2. There is no standard data modeling system, so different data engineers understand metrics differently and compute them with inconsistent definitions.

  3. Big data components demand a lot from developers; when the business directly uses technical components such as HBase and ES, all kinds of problems arise.

  4. The big data team at almost every company is complex and spans many areas; when a problem occurs it is hard to locate and hard to find the right owner.

  5. Data silos are hard to break down; sharing data across teams and departments is difficult, and teams do not know what data the others have.

  6. Two computation models, batch and stream, have to be maintained, which raises the bar for development; a unified SQL for both batch and streaming is needed.

  7. There is no company-level metadata system plan, so the same data is hard to reuse between real-time and offline computation, and every development task requires combing through everything again.

Basically, most companies' data platforms provide capabilities to address all of the problems and pain points above. In a complex data architecture, any link whose function is unclear or unfriendly to data users makes an already complex pipeline even more complicated. To solve these pain points, every link needs to be carefully polished and the technology components above need to be seamlessly connected, so that the business can use data end to end as easily as writing SQL to query a database.

5. An excellent overall big data architecture design

The data platform provides a variety of platforms and tools to support data work: a data collection platform for data sources, a one-click data synchronization platform, a data quality and modeling platform, a metadata system, a unified data access platform, real-time and offline computing platforms, a resource scheduling platform, and a one-stop development IDE.

 

6. Metadata: the cornerstone of the big data system

Metadata connects data sources, the data warehouse, and data applications, and records the complete link of data from production to consumption. Metadata includes static information about tables, columns, and partitions (i.e., the MetaStore); dynamic task and table dependency mappings; data warehouse model definitions and data life cycles; and ETL task scheduling information with inputs and outputs. Metadata is the foundation of data management, data content, and data applications. For example, metadata can be used to build the mappings between tasks, tables, columns, and users; to build the task-dependency DAG and schedule task execution order; to build task profiles and manage task quality; and to manage assets per BU or per individual and give an overview of computing resource consumption, and so on.
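As an illustration only (the field names below are assumptions, not any specific metastore's schema), a table's metadata record covering both the static schema and the lineage needed to build the task DAG might look like this:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ColumnMeta:
    name: str          # column name
    dtype: str         # data type, e.g. "bigint", "string"
    comment: str = ""  # business meaning of the column

@dataclass
class TableMeta:
    database: str                      # e.g. "dw"
    table: str                         # e.g. "dwd_trade_order_di"
    columns: List[ColumnMeta]          # static schema information
    partition_keys: List[str]          # e.g. ["ds"]
    owner: str                         # person in charge, used for asset management
    lifecycle_days: int                # retention defined by the warehouse model
    upstream_tables: List[str] = field(default_factory=list)   # lineage: inputs of the producing task
    downstream_tables: List[str] = field(default_factory=list) # lineage: consumers, used for the DAG
```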

The entire big data flow can be considered to rely on metadata management. Without a complete metadata design, the problems mentioned above follow: data is hard to trace, permissions are hard to control, resources are hard to manage, data is hard to share, and so on.

Many companies rely on Hive to manage metadata, but in my personal opinion, at a certain stage of development a company still needs to build its own metadata platform to match the related architecture.

7. Unified stream and batch computing

If two computing engines have to be maintained, such as Spark for offline computing and Flink for real-time computing, users are greatly inconvenienced, because they need to learn both stream computing and batch computing. If instead a custom DSL or SQL dialect is developed to describe jobs and adapted to the different computing engines, users do not need to care about the underlying implementation details: by mastering a single language they can complete development against compute engines such as Spark and Flink.
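As a rough sketch of this idea, assuming PySpark as the batch engine and PyFlink's Table API as the streaming engine (catalog registration, connectors, and table definitions are omitted, so the snippet only illustrates the routing, not a complete job):

```python
# Minimal sketch: route the same SQL text to a batch or a streaming engine.
# Tables must already be registered in each engine for this to actually run.

def run_sql(statement: str, mode: str):
    if mode == "batch":
        from pyspark.sql import SparkSession
        spark = SparkSession.builder.appName("unified-sql-batch").getOrCreate()
        return spark.sql(statement)                 # Spark executes the batch job
    elif mode == "streaming":
        from pyflink.table import EnvironmentSettings, TableEnvironment
        t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
        return t_env.execute_sql(statement)         # Flink executes the streaming job
    raise ValueError(f"unknown mode: {mode}")

# The business side writes one statement; the platform decides where it runs:
# run_sql("SELECT shop_id, COUNT(*) FROM orders GROUP BY shop_id", mode="batch")
```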

8. Real-time and offline ETL platform

ETL stands for Extract-Transform-Load and describes the process of extracting data from the source, transforming it, and loading it into the destination. The term ETL is more commonly used in data warehousing, but its scope is not limited to data warehouses. An ETL platform plays a very important role in data cleansing, data format conversion, data completion, and data quality management. As an important intermediate layer for data cleansing, an ETL platform should generally provide at least the following capabilities (a minimal sketch of such an operator pipeline follows the list):

  1. Support for multiple data sources, such as message systems and file systems.

  2. Support for multiple operators, including filtering, splitting, conversion, output, and operators that complement data by querying external data sources.

  3. Support for dynamic logic changes; for example, the operators above can be changed dynamically, e.g. by publishing new jars, without stopping the service.
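The sketch below illustrates the operator idea in plain Python; the operator names and the pipeline runner are hypothetical, not any specific platform's API. In a real ETL platform these operators would be loaded dynamically, as described in point 3, without restarting the service.

```python
# Hypothetical operator pipeline: each record flows through a configurable
# chain of filter / transform / enrich steps.

def filter_op(predicate):
    def op(records):
        return (r for r in records if predicate(r))
    return op

def transform_op(fn):
    def op(records):
        return (fn(r) for r in records)
    return op

def enrich_op(lookup):
    """Complement a record with fields queried from an external source."""
    def op(records):
        for r in records:
            r.update(lookup(r))
            yield r
    return op

def run_pipeline(records, operators):
    for op in operators:
        records = op(records)
    return list(records)

# Example: drop invalid events, normalize the schema, attach user info.
raw = [{"uid": 1, "event": "click"}, {"uid": None, "event": "view"}]
result = run_pipeline(raw, [
    filter_op(lambda r: r["uid"] is not None),
    transform_op(lambda r: {**r, "event": r["event"].upper()}),
    enrich_op(lambda r: {"user_name": f"user_{r['uid']}"}),
])
print(result)  # [{'uid': 1, 'event': 'CLICK', 'user_name': 'user_1'}]
```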

 

 

9. Unified intelligent query platform

 

Most data queries are demand driven: one or several interfaces are developed for each requirement, an interface document is written, and the interface is opened for the business side to call. In a big data system this model has many problems:

  1. The architecture is simple, but the interfaces are very coarse-grained, with low flexibility, poor scalability, and a low reuse rate. As business requirements increase, the number of interfaces grows substantially and maintenance costs become high.

  2. Development efficiency is also low. For a system with massive amounts of data this clearly causes a great deal of duplicated development; logic and data are hard to reuse, which severely degrades the business side's experience.

  3. Without a unified query platform, HBase and other stores are exposed directly to the business. Subsequent permission management and operations become difficult, accessing big data components is equally painful for the business side, and the slightest mistake leads to all kinds of problems.

     

A unified, intelligent query platform can solve these data query pain points, as sketched below.
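As a rough illustration (the router, the backend names, and the metadata fields below are invented for the example), the platform accepts one logical query and decides which storage or compute engine should answer it, instead of exposing HBase or ES clients to the business side:

```python
# Hypothetical query router: the business side sends one logical query, and the
# platform picks the backend based on table metadata and the query's shape.

BACKENDS = {
    "kv": "point lookups by primary key (e.g. an HBase-like store)",
    "search": "full-text / filtered search (e.g. an Elasticsearch-like store)",
    "olap": "ad-hoc aggregation (e.g. a Presto/ClickHouse-like engine)",
}

def route_query(query: dict, table_meta: dict) -> str:
    """Pick a backend from the query shape and the table's registered storages."""
    if query.get("by_primary_key") and "kv" in table_meta["storages"]:
        return "kv"
    if query.get("full_text") and "search" in table_meta["storages"]:
        return "search"
    return "olap"

meta = {"storages": ["kv", "olap"]}
print(route_query({"by_primary_key": True}, meta))   # kv
print(route_query({"group_by": ["shop_id"]}, meta))  # olap
```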

 

10. Standardized data warehouse modeling

As business complexity and data volume grow, chaotic data calls and copies waste resources, duplicated and redundant metric definitions introduce ambiguity, and the barrier to using data keeps rising. Take the tracking data and data warehouses I have actually worked with as an example: the same product identifier is called good_id in some tables, spu_id in others, and has many other names besides, which causes great confusion for anyone who wants to use the data. Without a complete big data modeling system, data governance becomes very difficult, in particular in the following areas:

 

  1. Data standards are inconsistent; even metrics with the same name are defined with inconsistent calibers. For example, a single metric such as uv can have a dozen definitions. This leads to questions like: they are all called uv, so which one should I use? They are all uv, so why are the numbers different?

  2. Research and development costs are enormous; every engineer needs to know every detail of the development process from beginning to end, and everyone steps into the same "pits" again and again, wasting developers' time and energy. This is also a problem the author has encountered: developers find it very difficult to extract the data they actually need.

  3. Without unified standards and specifications, resources are wasted through things like duplicated computation. Table layers and granularities are unclear, so duplicated storage is also serious.

 

Therefore, big data development and warehouse table design must follow design principles, and the data platform should be designed to constrain unreasonable designs, for example through a system like Alibaba's OneData. In general, data development should follow guidelines such as the following:

 

Readers who are interested can refer to Alibaba's OneData design system.
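As a small illustration of how a platform can enforce such modeling standards (the rules below are made-up examples, not OneData's actual specification), table and field names can be validated against a registry before a model is allowed to go online:

```python
import re

# Made-up example rules, only to illustrate platform-enforced modeling standards.
LAYER_PREFIXES = ("ods_", "dwd_", "dws_", "ads_")       # warehouse layering
CANONICAL_FIELDS = {"good_id": "spu_id"}                # one canonical name per business concept

def check_table_design(table_name: str, columns: list) -> list:
    issues = []
    if not table_name.startswith(LAYER_PREFIXES):
        issues.append(f"table '{table_name}' does not follow the layer naming convention")
    for col in columns:
        if not re.fullmatch(r"[a-z][a-z0-9_]*", col):
            issues.append(f"column '{col}' is not lower_snake_case")
        if col in CANONICAL_FIELDS:
            issues.append(f"column '{col}' should use canonical name '{CANONICAL_FIELDS[col]}'")
    return issues

print(check_table_design("trade_orders", ["good_id", "OrderTime"]))
```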

11. One-click data integration platform

Put simply, all kinds of data can be connected to the data collection platform with one click and transmitted seamlessly to the ETL platform. The ETL platform is connected to the metadata platform and standardizes Schema definitions; the data is then converted and split into flows feeding the real-time and offline computing platforms. Any subsequent offline or real-time processing of the data only requires applying for permission on the metadata tables before developing tasks and completing the computation. Data collection supports many kinds of data sources, e.g. binlogs, log collection, front-end tracking events, and Kafka message queues.
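As a rough sketch (the configuration keys below are invented for illustration), registering a new source on such a platform can be reduced to a declarative description that the platform turns into collection, ETL, and routing tasks:

```python
# Invented configuration format, only to illustrate "one-click" source registration:
# the platform would translate this into collection, ETL and routing tasks, using
# the schema registered in the metadata system.

order_binlog_source = {
    "name": "trade_order_binlog",
    "type": "binlog",                       # other supported types: "log", "sdk_event", "kafka"
    "connection": {"host": "db.example.internal", "database": "trade", "table": "orders"},
    "schema_ref": "metadata://dw/ods_trade_order",   # schema comes from the metadata platform
    "sinks": [
        {"target": "realtime", "topic": "ods_trade_order_rt"},   # streaming compute
        {"target": "offline", "table": "ods_trade_order_di"},    # batch warehouse
    ],
}
```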

12. Data development IDE: an efficient end-to-end tool

An efficient, one-stop data development tool: both real-time and offline computing tasks can be developed through the IDE, which connects all of the platforms above to provide a one-stop solution. The data development IDE provides a full set of products and services such as data integration, data development, data management, data quality, and data services, with a single development and management interface through which data transmission, conversion, and integration are completed. It ingests data from different sources, transforms and develops it, and finally sends the processed data to other systems through data synchronization. With an efficient big data development IDE, most of the pain points of big data engineering can be hidden; combined with the capabilities of the platforms above, big data development can become as easy as writing SQL.
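To give a feel for what "one-stop" development looks like, here is an invented task definition (not any real product's format): the developer writes the SQL, and scheduling, dependencies, output registration, and quality checks are declared alongside it.

```python
# Invented task definition, only to illustrate reducing a pipeline step to
# "SQL plus declarations"; the platform resolves the rest from metadata.

daily_gmv_task = {
    "name": "ads_shop_gmv_1d",
    "sql": """
        SELECT shop_id, SUM(pay_amount) AS gmv
        FROM dwd_trade_order_di
        WHERE ds = '${bizdate}'
        GROUP BY shop_id
    """,
    "schedule": "0 2 * * *",                     # run daily at 02:00
    "depends_on": ["dwd_trade_order_di"],        # resolved from metadata lineage
    "output": "ads_shop_gmv_1d",                 # registered back into the metadata system
    "quality_checks": ["row_count > 0"],         # data quality rules attached to the task
}
```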

For development tools, you can refer to Alibaba Cloud DataWorks.

Getting through the remaining end-to-end difficulties also requires several other auxiliary capabilities, which are not described here; interested readers can study them on their own.

13. Other

A complete data R&D system also includes an alarm and monitoring center, a resource scheduling center, computing resource isolation, data quality testing, and a one-stop data processing system, which will not be discussed further here.

