IOTA: A Next-Generation Architecture for Big Data Analytics

What is IOTA? Are you ready for the next generation of big data architecture?

After years of development, big data has moved from the BI/data-warehouse era of Big Data 1.0, through the Web/app transition of Big Data 2.0, to the IoT-driven era of Big Data 3.0, and data architectures have changed along with it.

▌Lambda Architecture

For years, the Lambda architecture has been the default choice for every company's big data platform. It addresses both large-scale offline batch processing and real-time data processing. A typical Lambda architecture looks like this:

[Figure: typical Lambda architecture]

Data originates from the underlying data sources, enters the big data platform in various formats, and is collected by components such as Kafka and Flume. It then splits into two computation paths. One path enters a stream computing platform (such as Storm, Flink, or Spark Streaming) to compute real-time metrics; the other enters an offline computing platform (such as MapReduce, Hive, or Spark SQL) for batch processing, producing T+1 business metrics that are only visible the next day.
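The cost of those two paths can be seen in a few lines of Python. This is a toy illustration, not any specific framework: the same metric (unique users per page) implemented once as a batch scan over the full day's log and once as an incremental streaming update. The two implementations must be kept consistent by hand.

```python
# Toy illustration of Lambda's dual code paths: the same metric
# (unique users per page) implemented twice, once batch, once streaming.

def batch_unique_users(events):
    """Batch path: scan the whole day's log at once (MapReduce/Hive style)."""
    result = {}
    for user, page in events:
        result.setdefault(page, set()).add(user)
    return {page: len(users) for page, users in result.items()}

class StreamingUniqueUsers:
    """Streaming path: update state one event at a time (Storm/Flink style)."""
    def __init__(self):
        self.state = {}

    def on_event(self, user, page):
        self.state.setdefault(page, set()).add(user)

    def snapshot(self):
        return {page: len(users) for page, users in self.state.items()}

events = [("u1", "home"), ("u2", "home"), ("u1", "cart"), ("u1", "home")]

stream = StreamingUniqueUsers()
for user, page in events:
    stream.on_event(user, page)

# Two separate implementations of one metric: when one drifts from the
# other, the "data caliber" problem described below appears.
```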

After years of use, the Lambda architecture's main advantage is stability. The cost of the real-time computing path is controllable, and batch jobs can run at night, separating the real-time and offline computing peaks. This architecture supported the early development of the data industry, but it also has some fatal shortcomings, and in the Big Data 3.0 era it is increasingly ill-suited to the needs of data analysis. Its disadvantages are as follows:

● Data-caliber problems caused by inconsistent real-time and batch results: because batch and real-time computation use two different frameworks and programs, their results often differ. A metric may show one value on the day it is produced, only for "yesterday's" value to have changed when viewed the next day.

● Batch computation cannot finish within its window: in the IoT era, data volumes keep growing, and teams often find that the 4-5 hour window available at night is no longer enough to process the 20+ hours of data accumulated during the day. Guaranteeing that metrics are ready by the time people arrive at work in the morning has become a headache for every big data team.

● Any data-source change requires redevelopment, and the cycle is long: every time a data source's format changes, the ETL and streaming business logic must be modified and redeveloped. The overall development cycle is long and the response to business needs is slow.

● Heavy storage burden: a typical data-warehouse design generates a large number of intermediate result tables, causing rapid data expansion and growing storage pressure on servers.

▌Kappa Architecture

To address shortcomings of Lambda such as maintaining two sets of programs, Jay Kreps of LinkedIn proposed the Kappa architecture based on his practical experience. Its core idea is to handle full data processing by improving the stream computing system, so that real-time and batch processing share a single code base. The Kappa architecture also holds that historical data should be recomputed only when necessary; when recomputation is needed, multiple stream instances can be started to perform it.

A typical Kappa architecture is shown below:

[Figure: typical Kappa architecture]

The core ideas of the Kappa architecture are the following three points:

1. Use Kafka or a similar message-queue system to collect data, retaining as many days of data as reprocessing may require.

2. When a full recomputation is needed, start a new stream computing instance, replay the data from the beginning, and write the output to a new result store.

3. When the new instance has caught up, stop the old stream computing instance and delete the old results.
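The three steps above can be sketched in miniature. This is a toy simulation under stated assumptions: a Python list stands in for a retained Kafka topic, and `count_v1`/`count_v2` are illustrative business logic, with v2 representing an updated version that must backfill by replaying the log from offset 0.

```python
# Toy sketch of Kappa reprocessing: the log is the source of truth; a new
# stream instance replays it from the beginning into a fresh result store,
# and serving switches over once it catches up.

log = []  # stands in for a Kafka topic retained long enough to replay

def append(event):
    log.append(event)

def run_instance(process, from_offset=0):
    """Run one stream-computing instance over the log; return its result store."""
    store = {}
    for event in log[from_offset:]:
        process(store, event)
    return store

def count_v1(store, event):          # original business logic
    store[event] = store.get(event, 0) + 1

def count_v2(store, event):          # updated logic (e.g. a normalization fix)
    key = event.lower()
    store[key] = store.get(key, 0) + 1

for e in ["Click", "click", "view"]:
    append(e)

serving = run_instance(count_v1)      # old instance's result store
reprocessed = run_instance(count_v2)  # new instance replays from offset 0
serving = reprocessed                 # switch over; retire the old instance
```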

The advantage of the Kappa architecture is that real-time and offline processing share the same code, which simplifies maintenance and eliminates the data-caliber problem. Its disadvantages are also obvious:

● Stream processing struggles with high-throughput historical data: all data is computed through streaming, and even with more concurrent instances it is difficult to meet the query-latency requirements of the IoT era.

● Long development cycle: because collected data formats are inconsistent under the Kappa architecture, a different streaming program must be developed for each format, resulting in long development cycles.

● Wasted server cost: the Kappa architecture depends on high-performance external storage such as Redis and HBase. Neither system is designed to store full data sets, so using them this way is a serious waste of server resources.

▌IOTA Architecture

With the IoT wave, the computing power of smartphones, PCs, and smart hardware keeps growing, and businesses require data to respond to demand in real time. The architectures above are no longer suitable for today's big data analysis needs. I propose a new-generation IOTA architecture to solve these problems. The overall idea is to define a standard data model and use edge computing to distribute computation across data generation, computation, and query, with a unified data model running through the whole process. This improves overall computing efficiency, meets real-time computing needs, and allows all kinds of ad-hoc queries directly against the underlying data:

[Figure: IOTA architecture]

The overall technical architecture of IOTA is divided into the following parts:

● Common Data Model: the data model that runs through the entire pipeline. This model is the core of the whole system and must stay consistent across the SDK, cache, historical data, and query engine. For user behavior analysis, it can be abstracted as a "subject-predicate-object" or "object-event" model to satisfy all kinds of queries. Taking the familiar app user model as an example, the "subject-predicate-object" form describes an event as "user X - event 1 - page A (2018/4/11 20:00)". Depending on business needs, "product-event" or "place-time" models can also be used. The model itself can be defined on the SDK side via a protocol (such as Protocol Buffers) together with the central storage format. The key point is that a single Common Data Model is used from SDK to storage to processing.
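A minimal sketch of such a "subject-predicate-object" record, shared verbatim by SDK, buffer, historical store, and query engine. The field names are illustrative, not a published schema:

```python
# A minimal "subject-predicate-object" Common Data Model record.
# Field names are illustrative; in practice the schema would be defined
# in a protocol such as Protocol Buffers and shared by every component.
from dataclasses import dataclass

@dataclass(frozen=True)
class SPOEvent:
    subject: str    # who/what acted, e.g. "user X"
    predicate: str  # what happened, e.g. "event 1"
    obj: str        # what it happened to, e.g. "page A"
    ts: str         # when, as an ISO-8601 timestamp

# The example event from the text, as one model instance:
e = SPOEvent("user X", "event 1", "page A", "2018-04-11T20:00:00")
```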

● Edge SDKs & Edge Servers: the data collection side, which is no longer just a simple SDK. In complex scenarios, the SDK takes on more computation and transforms data into the unified model on the device before transmission. For example, data collected by smart Wi-Fi becomes, on the AC side, the subject-predicate-object record "user X's MAC address - appeared - floor A (2018/4/11 18:00)"; an Edge AI Server can produce "X's face feature - entered - railway station A (2018/4/11 20:00)"; and it can also be the simple app- or page-level "user X - event 1 - page A (2018/4/11 20:00)" mentioned above. For apps and H5 pages there is no computational workload, only a required tracking-event format.
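The edge-side transform described above can be sketched as a small function that reshapes a raw device record into the unified model before upload. The input fields and function name here are assumptions for illustration, not a real SDK API:

```python
# Hedged sketch of an edge-side transform: turn a raw AC-side Wi-Fi probe
# record into the unified subject-predicate-object shape before upload.
# Input fields and the function name are illustrative assumptions.

def wifi_probe_to_spo(raw):
    """Transform one raw probe record into a Common Data Model record."""
    return {
        "subject": f"MAC {raw['mac']}",
        "predicate": "appeared",
        "object": raw["location"],
        "ts": raw["ts"],
    }

record = wifi_probe_to_spo(
    {"mac": "aa:bb:cc:dd:ee:ff", "location": "floor A",
     "ts": "2018-04-11T18:00:00"}
)
# The central side only ever sees records already in the common model.
```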

● Real-Time Data: the real-time data buffer. This exists for real-time computation: massive data cannot be written into the historical store in real time without causing problems such as index delays and fragmented historical files. A real-time buffer therefore holds the last few minutes or seconds of data, implemented with components such as Kudu or HBase. This data is later merged into the historical data by the Dumper. The data model here is the same Common Data Model as on the SDK side, e.g. the "subject-predicate-object" model.
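The buffer's job can be sketched as a sliding window: keep only the last few seconds or minutes of events and hand everything older to the Dumper. This is a toy in-memory stand-in for what Kudu or HBase would do in practice; the window length is illustrative.

```python
# Toy sketch of the real-time buffer: hold only a short window of events
# (Kudu/HBase in practice), leaving older data for the Dumper to merge
# into historical storage. The 60-second window is illustrative.
from collections import deque

class RealTimeBuffer:
    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.events = deque()  # (timestamp, record) in arrival order

    def append(self, ts, record):
        self.events.append((ts, record))

    def drain_older_than(self, now):
        """Pop events outside the window (for the Dumper); keep the rest."""
        drained = []
        while self.events and now - self.events[0][0] > self.window:
            drained.append(self.events.popleft()[1])
        return drained

buf = RealTimeBuffer(window_seconds=60)
buf.append(0, "old event")
buf.append(55, "recent event")
to_dump = buf.drain_older_than(100)  # "old event" leaves; "recent event" stays
```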

● Historical Data: the historical data settling area, which stores the bulk of the data. To support ad-hoc queries, relevant indexes are built automatically to improve query efficiency, enabling second-level responses to complex queries over tens of billions of records. HDFS, for example, can be used to store the historical data. The data model here is still the same Common Data Model as on the SDK side.

● Dumper: the Dumper's main job is to write the last few seconds or minutes of real-time data into the historical storage structure according to aggregation rules and indexes. It can be written in MapReduce, C, or Scala, moving data from the Real-Time Data area to the Historical Data area.
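A toy sketch of that merge step: take drained real-time records, aggregate them by an index key, and append to a partitioned historical store. In practice this would be a MapReduce or Scala job writing indexed files to HDFS; the in-memory structures and partition key here are illustrative.

```python
# Toy sketch of the Dumper: aggregate drained real-time records by an
# index key and merge them into a partitioned historical store.
# In practice: a MapReduce/Scala job writing to HDFS.

historical = {}  # partition key (e.g. date) -> {index key -> count}

def dump(records, partition):
    """Aggregate SPO records by (subject, predicate, object) and merge them."""
    part = historical.setdefault(partition, {})
    for rec in records:
        key = (rec["subject"], rec["predicate"], rec["object"])
        part[key] = part.get(key, 0) + 1

batch = [
    {"subject": "user X", "predicate": "event 1", "object": "page A"},
    {"subject": "user X", "predicate": "event 1", "object": "page A"},
]
dump(batch, partition="2018-04-11")
```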

● Query Engine: provides a unified external query interface and protocol (such as SQL over JDBC) and combines Real-Time Data and Historical Data at query time, enabling real-time ad-hoc queries. Common engines such as Presto, Impala, or ClickHouse can be used.
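The engine's merge step can be sketched as answering one count query across both tiers, so that events from seconds ago (still in the buffer) are already visible alongside the merged history. This is a toy illustration, not an engine API; all names are assumptions.

```python
# Toy sketch of the query engine's merge step: combine the historical
# store with the still-unmerged real-time buffer at query time, so very
# recent events are immediately queryable. Names are illustrative.

def query_count(historical_counts, realtime_events, predicate):
    """Count events with a given predicate across both storage tiers."""
    total = sum(
        n for (_, p, _), n in historical_counts.items() if p == predicate
    )
    total += sum(1 for rec in realtime_events if rec["predicate"] == predicate)
    return total

hist = {("user X", "event 1", "page A"): 5}  # already merged by the Dumper
rt = [{"subject": "user Y", "predicate": "event 1", "object": "page B"}]
n = query_count(hist, rt, "event 1")  # history (5) + buffer (1) = 6
```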

● Real-time model feedback: with edge computing, more interaction can happen at the edge. Rules set in the Real-Time Data area can control the Edge SDK, for example reducing the upload frequency, enabling faster feedback for voice control, or triggering actions on specific conditions and rules. Simple event processing is done on the local IoT device; for example, many cameras can now identify suspects locally.

The IOTA big data architecture has the following main characteristics:

● De-ETL-ization: ETL and its associated development have always been pain points of big data processing. Through the Common Data Model, the IOTA architecture focuses computation on a specific domain so that it can begin on the SDK side, while the central side only collects data, builds indexes, and serves queries, improving the efficiency of overall data analysis.

● Ad-hoc real-time query: given the overall computing mechanism, events from mobile devices and smart IoT devices can be sent directly to the cloud, land in the Real-Time Data area, and be queried immediately by the front-end Query Engine. Users can query events that happened only seconds ago, instead of waiting for ETL or streaming development and processing.

● Edge computing: computation that used to be centralized is distributed across data generation, storage, and query. Data is generated in conformance with the Common Data Model, and real-time model feedback lets the client receive feedback immediately while transmitting data, rather than sending every event to the center for processing first.

[Figure: IOTA architecture implementation]

As shown in the figure above, the IOTA architecture can be implemented in various ways. To validate it, Analysys independently designed and implemented a "second-level computing" engine, on top of which the "Analysys Ark" product was built; it can be deployed on-premises at enterprise customers for digital user analysis and marketing. Visit ark.analysys.cn to try it.

In the Big Data 3.0 era, the Lambda architecture can no longer meet enterprises' needs for daily big data analysis and lean operations; the de-ETL-ized IOTA architecture is the future.
