This article walks you through Flink, the core technology at Alibaba, Didi, ByteDance, Meituan, and Ctrip

Foreword

Apache Flink is widely recognized as one of the industry's best stream-processing engines. However, Flink is not limited to stream processing: it positions itself as a big data engine that unifies stream and batch processing, machine learning, and other computing workloads.

Recently, Flink has made major strides in many big data scenarios, particularly batch processing and machine learning. On the one hand, Flink's batch performance improved by an order of magnitude after Alibaba's optimizations. On the other hand, the Flink community is steadily investing in the Table API, Python, ML, and other areas, continuously improving the experience of doing data science and AI computing with Flink. In addition, Flink's integration with other open-source software keeps growing, including Hive and notebooks such as Zeppelin and Jupyter.
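To make the stream/batch unification concrete, here is a minimal PyFlink Table API sketch (my own illustration, not code from any of the companies mentioned; the table and connector settings are invented) showing that the same SQL runs in either streaming or batch execution mode:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Choose streaming or batch execution; the Table/SQL API is the same in both.
settings = EnvironmentSettings.in_streaming_mode()   # or: EnvironmentSettings.in_batch_mode()
t_env = TableEnvironment.create(settings)

# A toy bounded source generated in memory, standing in for a real Kafka/HDFS source.
t_env.execute_sql("""
    CREATE TEMPORARY TABLE clicks (
        user_id STRING,
        cnt BIGINT
    ) WITH (
        'connector' = 'datagen',
        'number-of-rows' = '10'
    )
""")

# The same aggregation query works unchanged in streaming and batch mode.
t_env.execute_sql(
    "SELECT user_id, SUM(cnt) AS total FROM clicks GROUP BY user_id"
).print()
```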

Flink learning path

(Figure: Flink learning path)

What is the current state of Flink at Alibaba, Didi, ByteDance, Meituan, Vipshop, and Ctrip?


1. The current status of Flink at Alibaba

Alibaba's real-time computing platform built on Apache Flink formally launched in 2016, starting with two scenarios: search and recommendation. Today all of Alibaba's businesses, including all of its subsidiaries, have adopted the real-time computing platform built on Flink. The Flink platform runs on open-source Hadoop clusters, using Hadoop YARN as the resource manager and scheduler and HDFS as data storage, so Flink integrates seamlessly with the open-source Hadoop big data ecosystem.
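As a small, hedged illustration of that Hadoop integration (the HDFS path and schema below are invented for the example), a Flink SQL table can read files directly from HDFS through the filesystem connector while the job itself is scheduled by YARN:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

# Hypothetical HDFS path and schema; the filesystem connector reads the files directly.
t_env.execute_sql("""
    CREATE TEMPORARY TABLE orders_on_hdfs (
        order_id STRING,
        amount DOUBLE
    ) WITH (
        'connector' = 'filesystem',
        'path' = 'hdfs:///warehouse/orders',
        'format' = 'csv'
    )
""")

# A simple batch query over the HDFS-backed table.
t_env.execute_sql("SELECT COUNT(*) AS order_cnt FROM orders_on_hdfs").print()
```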


Today this Flink-based real-time computing platform not only serves Alibaba Group internally, but is also offered as a cloud product on Alibaba Cloud, supporting the entire developer ecosystem through its cloud APIs.

2. Apache Flink at Didi

At Didi, essentially all data can be divided into four categories:

1. Trajectory data: trajectory data and order data are of particular interest to the business side. Because each user needs to see the trajectory of their trip in real time after taking a ride, this data has strong real-time requirements.

2. Transaction data: Didi's transaction data.

3. Event-tracking (buried-point) data: event-tracking data from each of Didi's business lines, covering all client-side and back-end business data.

4. Log data: parts of the logging system have particularly strong real-time requirements.


3. ByteDance's migration from JStorm to Apache Flink
The figure below shows ByteDance's business scenario.

(Figure: ByteDance's business scenario)

First, the application layer includes ads, A/B testing, push, data warehousing, and other services. Second, the middle layer abstracts a Python template for users: users only need to write their own business code inside the template, and a YAML configuration describes the spouts and bolts that make up the DAG. Finally, the compute engine underneath runs on JStorm.
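Purely as an illustration of what such an abstraction could look like (the class names, YAML fields, and logic below are hypothetical and are not ByteDance's actual template API), a spout/bolt DAG declared in YAML and filled in with user Python code might resemble the following:

```python
# Hypothetical YAML describing the DAG (spouts feed bolts):
#
#   spouts:
#     - name: click_spout
#       class: ClickSpout
#   bolts:
#     - name: count_bolt
#       class: CountBolt
#       inputs: [click_spout]


class ClickSpout:
    """User code: emit one tuple per click event (source stub)."""

    def next_tuple(self):
        return {"user_id": "u1", "ts": 1690000000}


class CountBolt:
    """User code: count clicks per user (processing stub)."""

    def __init__(self):
        self.counts = {}

    def process(self, tup):
        self.counts[tup["user_id"]] = self.counts.get(tup["user_id"], 0) + 1


if __name__ == "__main__":
    # Tiny local run wiring the spout into the bolt once.
    spout, bolt = ClickSpout(), CountBolt()
    bolt.process(spout.next_tuple())
    print(bolt.counts)
```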

Around July 2017, there were roughly 20 JStorm clusters, with cluster sizes of up to 5,000 machines.


4. Apache Flink in practice at Meituan
The status and background of Meituan's real-time computing platform

Real-time platform architecture

(Figure: Meituan's real-time computing platform architecture)

The figure above is a brief overview of Meituan's current real-time computing platform architecture. The bottom layer is the data cache layer: all of Meituan's log data is collected into Kafka through a unified log collection system. As the largest data transfer layer, Kafka supports a large number of Meituan's business lines, including offline pulls and part of the real-time processing business.

Above the data cache layer is the engine layer. On its left side are the real-time compute engines currently offered, namely Storm and Apache Flink (hereafter Flink). Storm was previously deployed in standalone mode; for Flink, given the environment it runs in, Meituan chose the on-YARN mode. Besides the compute engines, the platform also provides real-time storage for intermediate state, computation results, and dimension data; this storage currently includes HBase, Redis, and ES.

Above the compute engines sits the platform layer, which mainly serves data development engineers. Real-time data development faces many problems; for example, debugging and tuning a real-time program is much harder than for an ordinary program. At this layer Meituan provides a user-facing real-time computing platform that can not only host jobs but also provide tuning diagnostics, monitoring and alerting, real-time data retrieval, permission management, and other functions. In addition to this platform for data developers, Meituan is also building a metadata center. This is a prerequisite for offering SQL in the future, and the metadata center is an important carrier of the real-time streaming system; we can think of it as the brain of the real-time system, storing data schemas and metadata.

The top layer of the architecture is the set of businesses currently supported by the real-time computing platform. It covers not only online businesses such as real-time log query and retrieval, but also the currently very popular real-time machine learning. Machine learning often involves search and recommendation scenarios, which have two notable characteristics:
First, they produce massive amounts of real-time data.
Second, their traffic QPS is quite high. For these, the real-time computing platform needs to extract part of the real-time features that the search and recommendation applications consume. Another common class of scenarios includes real-time feature aggregation (a small sketch of such a job follows below), Zebra Watcher (which can be considered a monitoring service), and the real-time data warehouse.
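Below is that sketch: a hypothetical Flink SQL job (the topic, broker address, and field names are invented, and the Kafka SQL connector jar is assumed to be on the classpath) that computes a per-user click count over one-minute tumbling windows, the kind of real-time feature a recommendation service might consume:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Hypothetical Kafka source of click events with an event-time watermark.
t_env.execute_sql("""
    CREATE TEMPORARY TABLE clicks (
        user_id STRING,
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'clicks',
        'properties.bootstrap.servers' = 'broker:9092',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
""")

# One-minute tumbling-window click counts per user: a simple real-time feature.
t_env.execute_sql("""
    SELECT
        user_id,
        TUMBLE_END(event_time, INTERVAL '1' MINUTE) AS window_end,
        COUNT(*) AS clicks_1m
    FROM clicks
    GROUP BY user_id, TUMBLE(event_time, INTERVAL '1' MINUTE)
""").print()
```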

This is a brief overview of Meituan's current real-time computing platform.


5. Apache Flink in practice at Vipshop
The main content of this part covers the following aspects:

1. The current status of Vipshop's real-time platform

2. Apache Flink (hereafter Flink) in practice at Vipshop

3. Flink on K8s

4. Follow-up plans


6. Ctrip's real-time feature platform based on Apache Flink


Feature platform system architecture

 

The Lambda architecture is the standard architecture here; the offline part is composed of Spark SQL + DataX. The KV storage system now used is Aerospike; its main difference from Redis is that it uses SSD as primary storage, and in our load tests its read/write performance is on the same order as Redis for most scenarios with the same data.

Real-time part: Flink is used as the compute engine. The user workflow is as follows:

• Register the data source: the real-time data sources currently supported are mainly Kafka and Aerospike. Aerospike data is registered automatically if real-time or offline features have been configured on the platform; for a Kafka data source, a corresponding schema sample file has to be uploaded.

• Computational logic: expressed with SQL.

• Define the output: define the Aerospike table and, if needed, the Kafka topic that results are written to, and configure whether data is pushed to the user key by Insert or Update. Once the steps above are completed, all of the information is written into a JSON configuration file on the platform. The platform then submits this configuration file together with the prepared flinkTemplate.jar (which contains all the Flink functionality the platform needs) to YARN and starts the Flink job. A sketch of what such a job might boil down to follows below.
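Here is that sketch, a rough, hypothetical illustration only (the topic, schema, JSON profile fields, and the print sink standing in for Aerospike are all invented; Ctrip's actual flinkTemplate.jar and connectors are internal, and the Kafka SQL connector jar is assumed to be available):

```python
import json

from pyflink.table import EnvironmentSettings, TableEnvironment

# Stand-in for the platform's JSON profile (all values are illustrative).
profile = json.loads("""
{
  "source_topic": "user_events",
  "sql": "SELECT user_id, COUNT(*) AS feature_cnt FROM user_events GROUP BY user_id",
  "sink_table": "feature_sink"
}
""")

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Step 1: register the Kafka data source (the schema would come from the uploaded sample file).
t_env.execute_sql(f"""
    CREATE TEMPORARY TABLE user_events (
        user_id STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = '{profile['source_topic']}',
        'properties.bootstrap.servers' = 'broker:9092',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# Step 3: define the output; the 'print' connector stands in for the real Aerospike/Kafka sink.
t_env.execute_sql(f"""
    CREATE TEMPORARY TABLE {profile['sink_table']} (
        user_id STRING,
        feature_cnt BIGINT
    ) WITH ('connector' = 'print')
""")

# Step 2: the user's SQL, wired from source to sink and submitted as one streaming job.
t_env.execute_sql(f"INSERT INTO {profile['sink_table']} {profile['sql']}")
```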


Due to length restrictions, I have only given a brief introduction to these cases and have not gone into too much technical detail. If you need the Flink practice manual and technical documentation, you can follow this account and send a private message with the word "documents" to get it.


Come and learn with me. Thank you for your support!


Origin blog.csdn.net/qq_1813353297/article/details/105227648