Big data architecture: from Lambda to the ETL-free IOTA

After years of development, big data has moved from the BI / data warehouse era of Big Data 1.0, through the Web / App era of Big Data 2.0, into the IoT era of Big Data 3.0, and the data architecture has changed with each step.

▌Lambda architecture

Over the past years, the Lambda architecture became a necessary piece of infrastructure for every company's big data platform: it covers a company's needs for both offline batch processing and real-time data processing. A typical Lambda architecture looks like this:

Data starts from the underlying data sources and enters the big data platform in a variety of formats, collected through components such as Kafka and Flume. It then splits into two computation lines. One line enters a stream-computing platform (e.g. Storm, Flink or Spark Streaming) to compute real-time metrics; the other enters an offline batch platform (e.g. MapReduce, Hive, Spark SQL) to compute the T+1 business metrics that are only visible the next day.
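To make the two lines concrete, here is a minimal Spark sketch of both layers in one program: a Structured Streaming job counting Kafka events per minute (the speed layer) and a batch job recomputing the daily metric from raw files on HDFS (the batch layer). The broker address, topic name and HDFS path are placeholders, not anything prescribed by the architecture.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object LambdaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("lambda-sketch").getOrCreate()
    import spark.implicits._

    // Speed layer: read events from Kafka and keep rolling per-minute counts.
    val realtime = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092") // hypothetical broker
      .option("subscribe", "events")                   // hypothetical topic
      .load()
      .select($"timestamp", $"value".cast("string").as("event"))
      .groupBy(window($"timestamp", "1 minute"))
      .count()
    realtime.writeStream.outputMode("complete").format("console").start()

    // Batch layer: recompute the full T+1 metric from raw files landed on HDFS,
    // typically run overnight by a scheduler.
    val batch = spark.read.json("hdfs:///raw/events/dt=2018-04-11/") // hypothetical path
      .groupBy($"page")
      .count()
    batch.show()

    spark.streams.awaitAnyTermination()
  }
}
```

Note that the same "count events" logic is written twice, once per framework; this duplication is exactly the caliber problem discussed below.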

After years of development, Lambda's advantages are stability and a controllable real-time compute cost: batch processing can use the idle overnight window, keeping the real-time peak and the offline computation apart. This architecture carried the industry through the early years of data development, but it also has some fatal flaws, and it grows ever less suited to the data-analysis needs of the Big Data 3.0 era. Its shortcomings are as follows:

●  Inconsistent results between real-time and batch pipelines: because batch and real-time computation run on two different frameworks with two different programs, their results often differ; it is common to look at a number one day and find that yesterday's figure has changed when you look again the next day.

●  Batch jobs that no longer finish inside the batch window: in the IoT era data volumes have grown to the point where the four-or-five-hour overnight window often cannot process the 20-plus hours of data accumulated during the day, and guaranteeing that the data is ready before the workday starts has become a headache for every data team.

●  Redevelopment on every data-source change, with long cycles: every change to a data source's format or to the business logic requires modifying both the ETL and the streaming programs, so the overall development cycle is long and the business cannot react quickly enough.

●  Heavy server storage: the typical data-warehouse design produces large numbers of intermediate result tables, causing data volume to balloon rapidly and increasing the storage pressure on the servers.

▌Kappa architecture

In response to the drawbacks above, above all the need to maintain two sets of programs, LinkedIn's Jay Kreps drew on practical and personal experience to propose the Kappa architecture. Its core idea is to handle the full data set by improving the stream-processing system, so that real-time and batch processing use the same code. Furthermore, Kappa recomputes historical data only when necessary, and when a recomputation is needed it simply starts additional instances of the stream job to redo it.

A typical Kappa architecture is shown below:

The core idea of the Kappa architecture comes down to three points (a replay sketch follows the list):

1. Use Kafka or a similar MQ system to collect all the data, retaining as many days of the log as a recomputation would need.

2. When a full recomputation is required, start a new stream-processing instance, read the data again from the beginning, and write the results to a new store.

3. When the new instance has caught up, stop the old stream instance and delete the old results.
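A minimal sketch of point 2, using the plain Kafka consumer API: a "new instance" is just a consumer with a fresh group id that reads the retained log from the earliest offset and writes to a new result store. The broker, topic and group names below are illustrative.

```scala
import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

object KappaReplay {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "kafka:9092")  // hypothetical broker
    props.put("group.id", "metrics-v2")           // a NEW group id = a new instance
    props.put("auto.offset.reset", "earliest")    // replay the retained log from the start
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(List("events").asJava)

    var counts = Map.empty[String, Long]
    while (true) {
      for (r <- consumer.poll(Duration.ofMillis(500)).asScala) {
        counts += r.value -> (counts.getOrElse(r.value, 0L) + 1L)
        // Write counts to a NEW results table (e.g. results_v2); once this
        // instance has caught up, switch readers over and drop the old table.
      }
    }
  }
}
```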

Kappa's advantages are that the real-time and offline code are unified, maintenance is easy, and the data-caliber inconsistency problem disappears. Its disadvantages are just as clear:

●  Stream processing is powerless against high-throughput historical data: since all data is computed by the stream, even increasing the number of concurrent instances makes it hard to meet the IoT era's requirements for real-time query response.

●  Long development cycles: because the collected data formats are not uniform, Kappa still requires a separate streaming program to be developed for each format, so development cycles stay long.

●  Wasted server cost: at its core Kappa relies on high-performance external storage services such as Redis and HBase, yet neither of these components was designed for storing the full volume of data, which seriously wastes server cost.

▌IOTA architecture

Meanwhile, in the IoT wave, the computing power of smartphones, PCs and smart hardware keeps growing, and so do the demands for real-time data and fast business response. The centralized, non-real-time processing mindset of the past no longer meets the needs of big data analysis, so I propose a new-generation IOTA big data architecture to solve the problems above. The overall idea: set a standard data model, use edge computing to disperse all the computation to where the data is generated, and run that unified data model through the whole process of data generation, computation and query, thereby improving overall computational efficiency, meeting the need for immediate computation, and allowing ad-hoc queries over all the underlying data:

The overall IOTA technical architecture divides into several parts:

●  Common Data Model: a data model that runs through the whole business from end to end. This model is the core of the business and must stay consistent across the SDK, the cache, the historical data and the query engine. For user data analysis it can be abstracted as a "subject - predicate - object" model, or an "object - event" model, to satisfy all kinds of queries. Taking the familiar APP user model as an example, the "subject - predicate - object" form describes "user X - viewed - page A (2018/4/11 20:00)". Depending on business needs, you could equally use "product - event", "location - time" and similar models. The model itself can be defined in a protocol such as Protobuf and implemented on the SDK side and in central storage. The core point is that one unified Common Data Model runs from the SDK through storage to processing.
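As a sketch of what such a model can look like in code (the article only fixes the three-part structure plus a timestamp; the field names here are my own):

```scala
import java.time.Instant

object CommonDataModel {
  // A minimal "subject - predicate - object" (Object-Event) event shape that
  // every layer (SDK, cache, history, query) would share.
  final case class Event(
    subject: String,   // who or what acted, e.g. "user:X" or "mac:aa:bb:cc:dd:ee:ff"
    predicate: String, // what happened, e.g. "viewed", "appeared", "entered"
    obj: String,       // the target, e.g. "page:A" or "floor:A"
    ts: Instant        // when it happened
  )

  // The APP example from the text: "user X - viewed - page A (2018/4/11 20:00)".
  val appEvent = Event("user:X", "viewed", "page:A", Instant.parse("2018-04-11T20:00:00Z"))
}
```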

●  Edge SDKs & Edge Servers: the data-collection side. This is no longer just the simple SDK of the past: in cases that need complex computation, a more capable SDK or edge server transforms the end device's data into the unified model before transmission. For example, Wi-Fi data collected on the AC side becomes the subject-predicate-object structure "MAC address X - appeared - floor A (2018/4/11 18:00)"; for a camera, an Edge AI Server transforms the feed into "face feature X - entered - railway station A (2018/4/11 20:00)". For an APP it can stay as simple as the page-level "user X - event 1 - page A (2018/4/11 20:00)" above; APP and H5 pages need no computational power at all, only instrumentation in the agreed format.
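A sketch of the normalization step the edge side would perform, reusing the Event shape above; the WifiProbe record is a hypothetical stand-in for whatever the access controller actually reports:

```scala
import java.time.Instant
import CommonDataModel.Event // the Event sketch from the previous block

object EdgeSdkSketch {
  // Hypothetical raw record as reported by a Wi-Fi access controller (AC).
  final case class WifiProbe(mac: String, floor: String, seenAt: Instant)

  // The edge side normalizes device-specific records into the shared model
  // before transmission; the center then only collects, indexes and queries.
  def toCommonModel(p: WifiProbe): Event =
    Event(s"mac:${p.mac}", "appeared", s"floor:${p.floor}", p.seenAt)
}
```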

●  Real-Time Data: the real-time data cache. Massive amounts of incoming data cannot be written into the historical store in real time without causing delayed indexing, fragmented history files and similar problems; to still achieve real-time computation, this layer caches only the last few minutes or seconds of data. It can be implemented with Kudu, HBase or other components. The Dumper later merges this data into the historical data. The data model here is identical to the SDK side: the Common Data Model, e.g. the "subject - predicate - object" model.
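For illustration, a minimal write into an HBase-backed realtime cache, assuming a hypothetical realtime_events table with a single e column family; the subject-plus-reversed-timestamp row key is one common way to keep a subject's newest events cheap to scan:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

object RealtimeCacheSketch {
  def main(args: Array[String]): Unit = {
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("realtime_events")) // hypothetical table

    // Reversed timestamp makes the most recent events sort first per subject.
    val rowKey = s"user:X|${Long.MaxValue - System.currentTimeMillis()}"
    val put = new Put(Bytes.toBytes(rowKey))
    put.addColumn(Bytes.toBytes("e"), Bytes.toBytes("predicate"), Bytes.toBytes("viewed"))
    put.addColumn(Bytes.toBytes("e"), Bytes.toBytes("obj"), Bytes.toBytes("page:A"))
    table.put(put)

    table.close(); conn.close()
  }
}
```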

●  Historical Data: the settling area for historical data, storing the great bulk of it. To enable ad-hoc queries, relevant indexes are created automatically over the historical data to raise overall query efficiency, making second-level responses to complex queries over tens of billions of rows possible. HDFS, for example, can be used to store the historical data, and the data model here is still the same Common Data Model as on the SDK side.

●  Dumper: the Dumper's main job is to take the real-time data from the last few seconds or minutes, aggregate it according to the configured rules, index it, and store it in the historical storage structure; it can be written with MapReduce, C or Scala, and moves the relevant data from the Realtime Data area into the Historical Data area.
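A Dumper pass could look like the following Spark sketch: read the recent slice, derive a day partition from the event time, and append it to the historical area in a layout that queries can prune. The paths and the assumption that the data carries ts and predicate columns are mine, not the article's.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DumperSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("dumper-sketch").getOrCreate()

    // Hypothetical: the realtime area exposed as a readable path/table
    // (in practice this could be Kudu or HBase instead of Parquet files).
    val recent = spark.read.parquet("hdfs:///realtime/events/")

    // Partition so the historical area is prune-friendly for ad-hoc queries;
    // a real Dumper would also clear the slice it has just rolled over.
    recent
      .withColumn("day", to_date(col("ts")))
      .write.mode("append")
      .partitionBy("day", "predicate")
      .parquet("hdfs:///history/events/")
  }
}
```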

●  Query Engine: the query engine provides a unified external query interface and protocol (such as SQL over JDBC) and merges the Realtime Data and the Historical Data at query time, making ad-hoc queries over real-time data possible. Common compute engines such as Presto, Impala or ClickHouse can serve here.
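For example, with Presto's JDBC driver the merge can be expressed as a plain UNION ALL over the two areas; the endpoint, catalog, schema and table names below are illustrative:

```scala
import java.sql.DriverManager

object UnifiedQuerySketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical Presto coordinator; requires the presto-jdbc driver on the classpath.
    val conn = DriverManager.getConnection("jdbc:presto://presto:8080/hive/default", "analyst", null)
    val sql =
      """SELECT subject, count(*) AS events
        |FROM (
        |  SELECT subject FROM realtime.events   -- last few minutes, from the cache
        |  UNION ALL
        |  SELECT subject FROM history.events    -- indexed historical area
        |) merged
        |GROUP BY subject""".stripMargin
    val rs = conn.createStatement().executeQuery(sql)
    while (rs.next()) println(s"${rs.getString("subject")}: ${rs.getLong("events")}")
    conn.close()
  }
}
```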

●  Realtime Feedback Model: with edge computing, more of the interaction can happen at the edge itself. The Realtime Data side can push rules down through the Edge SDK to control the client, for example to reduce the data-upload frequency, respond quickly to voice control, or fire when certain conditions and rules are triggered. Simple event processing is done locally on the IoT terminal; many cameras, for instance, now ship with on-device suspect recognition.
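One possible shape for such a pushed-down rule, again reusing the Event sketch; the fields are hypothetical and meant only to show that both throttling and local triggering can be expressed as data the edge evaluates:

```scala
import CommonDataModel.Event // the Event sketch from above

object FeedbackRuleSketch {
  // Hypothetical rule pushed from the Realtime Data side down to an Edge SDK.
  final case class EdgeRule(uploadIntervalSec: Int, localTrigger: Event => Boolean)

  val suspectAlert = EdgeRule(
    uploadIntervalSec = 60, // reduce the upload frequency
    localTrigger = e => e.predicate == "entered" && e.subject.startsWith("face:")
  ) // when the trigger matches, the device acts immediately, without the center
}
```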

The IOTA big data architecture mainly has the following characteristics:

●  ETL-free: ETL and its associated development have always been the pain point of data processing. Through the Common Data Model design, IOTA lets domain-specific computation start on the SDK side, while the central side only does collection, indexing and querying, improving the overall efficiency of data analysis.

●  Ad-hoc instant queries: given the overall computation flow, an event occurring on a phone or a smart IoT device can be delivered straight to the cloud's real-time data area and queried immediately through the front-end Query Engine. Users can run all kinds of queries and see an event a few seconds after it occurred, instead of waiting for ETL or streaming jobs to be developed and run.

●  Edge computing (Edge-Computing): the computation that used to be entirely centralized is dispersed to where the data is generated, stored and queried, and the generated data conforms to the Common Data Model. At the same time the Realtime model gives feedback, letting the client respond to data immediately instead of waiting for the central side to process every event and push the result back down.

As suggested above, the IOTA architecture admits many implementations. To validate it, Analysys independently designed and implemented a "second-level computing" engine, which now supports edge-side computation for Analysys's 550 million monthly active devices internally. On top of this engine Analysys also built "Analysys Ark", a user-analysis and digital-marketing product that enterprise customers can deploy independently in-house; it can be tried at ark.analysys.cn.

In the Big Data 3.0 era, the Lambda architecture can no longer meet enterprises' daily needs for big data analysis and lean operations; the ETL-free IOTA big data architecture is the future.

Original: http://www.sohu.com/a/228020781_115326


Origin: https://blog.csdn.net/BD_fuhong/article/details/93487941