Tencent Cloud Big Data Practical Case

Content source: On May 20, 2017, Tencent senior software engineer Wu Youqiang gave a talk on "Tencent Cloud Big Data in Practice" at the "Mesozoic Technology Salon Series: Internet Big Data" event. IT Dakashuo, as the exclusive video partner, publishes this recap with the review and authorization of the organizer and the speaker.

Word count: 1954 | Reading time: 3 minutes

For the video replay and slides of this talk, see: http://t.cn/RgMHJEC

Summary

Tencent Cloud is the public cloud platform that Tencent builds for businesses and individuals. In this talk, Tencent senior software engineer Wu Youqiang shares how big data is put into practice on Tencent Cloud.

1. Introduction to TDF (Data Workshop)

Introduction to TDF

TDF is a lightweight, cloud-based big data product derived from the Tencent Cloud Data Intelligence big data suite; it provides a SQL-based big data computing framework.

It suits scenarios that need to acquire big data computing capacity dynamically and flexibly, such as batch computing, log processing, and data warehouse applications.

Because users on the public cloud expect simplicity, TDF needs a visual integrated development environment covering data lineage management, project/workflow management, user management, and alerting/logging. Data is imported into storage through tooling, then processed, and finally exported. Below that, a task and resource scheduling layer dispatches user tasks onto the various resources, and the bottom layer is Tencent Cloud's infrastructure.

2. CDP (Data Pipeline) Implementation Details

CDP overall architecture: design

The diagram above shows the design we drew up before development began. On the far left are the customer's many data sources, such as logs, DB binlogs, self-built Kafka, and custom data. We provide tools, including a Flume plug-in we developed, to help upload this data to the cloud.

When the data arrives at the middle stage, it is validated and processed. After processing, it can be imported in real time into TDF, COS, or other storage through plug-ins, according to the user's needs.

CDP overall architecture: current state

The diagram above shows what we have implemented so far. We developed our own Flume plug-in that sends data in real time to the data-receiving endpoint on Tencent public cloud. The receiver then decides, according to the user's choice, whether to use Kafka or CKafka. CKafka is a Kafka-protocol-compatible messaging system developed in-house by Tencent Cloud; it is written in C++ and performs much better than the native implementation. The data is then fed into NiFi, on which we did secondary development, and finally exported to Hive.

Introduction to Flume

Flume NG is a distributed, reliable, and highly available system that can efficiently collect, aggregate, and move massive volumes of logs from many different data sources into a centralized data store. Between the original Flume OG and the current Flume NG the architecture was refactored, and the NG version is completely incompatible with the OG version. After the refactoring, Flume NG is more like a lightweight tool: very simple, easy to adapt to all kinds of log-collection setups, and with support for failover and load balancing.

Flume's architecture is built around the following core concepts (a small sketch follows the list):

Event: the basic unit of data transfer, carrying a payload and optional message headers.

Flow: an abstraction of how Events migrate from a source point to a destination point.

Client: operates at the point where Events originate and sends them to a Flume Agent.

Agent: an independent Flume process containing the Source, Channel, and Sink components.

Source: consumes the Events delivered to it and places them into the Channel.

Channel: temporary storage for Events in transit, holding the Events passed in by the Source component.

Sink: reads and removes Events from the Channel and passes them to the next Agent in the flow pipeline (if any).
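As a minimal illustration of these concepts, the sketch below builds a Flume Event with optional headers using Flume's `EventBuilder`; the header keys and body here are invented for the example, not taken from the talk.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public class EventDemo {
    public static void main(String[] args) {
        // Optional message headers, e.g. a timestamp and the source host.
        Map<String, String> headers = new HashMap<>();
        headers.put("timestamp", String.valueOf(System.currentTimeMillis()));
        headers.put("host", "app-server-01");

        // The body carries the raw log line; the headers remain optional.
        Event event = EventBuilder.withBody(
                "GET /index.html 200".getBytes(StandardCharsets.UTF_8), headers);
        System.out.println(new String(event.getBody(), StandardCharsets.UTF_8));
    }
}
```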

Flume plugin

Flume supports plug-in development; the easiest way to start is to copy an existing plug-in and modify it. A minimal custom Sink skeleton is sketched below.
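The talk does not show the plug-in's code, so the following is only a sketch of what a custom Sink can look like, with a hypothetical `upload` step standing in for the real cloud endpoint. It follows the standard Flume pattern: take an Event from the Channel inside a transaction, and roll back on failure so the Event can be retried.

```java
import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;

// Skeleton custom Sink: forwards Events to a (hypothetical) cloud endpoint.
public class CloudUploadSink extends AbstractSink implements Configurable {

    private String endpoint;

    @Override
    public void configure(Context context) {
        // Read plug-in settings from the agent's configuration file.
        this.endpoint = context.getString("endpoint", "http://localhost:8080");
    }

    @Override
    public Status process() throws EventDeliveryException {
        Channel channel = getChannel();
        Transaction txn = channel.getTransaction();
        txn.begin();
        try {
            Event event = channel.take();
            if (event == null) {
                txn.commit();
                return Status.BACKOFF; // nothing queued right now
            }
            upload(event.getBody()); // send to the remote endpoint
            txn.commit();
            return Status.READY;
        } catch (Exception e) {
            txn.rollback(); // leave the Event in the Channel for retry
            throw new EventDeliveryException("Upload failed", e);
        } finally {
            txn.close();
        }
    }

    private void upload(byte[] body) {
        // Placeholder: a real plug-in would POST to `endpoint` here.
    }
}
```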

The endpoint we provide requires authentication, based mainly on Tencent Cloud account credentials. With that in place, client data can be encrypted or formatted for storage in real time.

First, we are a multi-tenant system; second, we must guard against users sending excessive amounts of data. The data size limit, which each user determines through their own configuration, covers more than 90% of user needs.

During transmission we use a custom protocol formatted on top of Avro, mainly to make serializing and deserializing the data easy.
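The actual protocol and schema are not given in the talk, so as a sketch of the Avro-based approach, the example below serializes a record with an assumed two-field schema into compact Avro binary using the Avro Java API:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroSerializeDemo {
    // Illustrative record schema for one uploaded log line (an assumption,
    // not the real CDP wire format).
    private static final Schema SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"LogEvent\",\"fields\":["
          + "{\"name\":\"topic\",\"type\":\"string\"},"
          + "{\"name\":\"body\",\"type\":\"bytes\"}]}");

    public static byte[] serialize(String topic, byte[] body) throws IOException {
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("topic", topic);
        record.put("body", ByteBuffer.wrap(body));

        // Encode to compact binary, which the server side can cheaply decode.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(SCHEMA).write(record, encoder);
        encoder.flush();
        return out.toByteArray();
    }
}
```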

Adapting the Kafka client to support CKafka

CKafka (Cloud Kafka) is a distributed, high-throughput, highly scalable messaging system that is 100% compatible with the open-source Kafka API (version 0.9). CKafka is based on the publish/subscribe model: by decoupling through messages, producers and consumers interact asynchronously without waiting for each other. CKafka offers data compression and supports offline and real-time data processing at the same time, making it suitable for scenarios such as compressed log collection and monitoring-data aggregation.

On the public cloud, CKafka is mainly exposed through VIPs (virtual IPs); a VIP can only be bound to the corresponding virtual machine, which guarantees security. Since we access CKafka directly via intranet IPs, we had to adjust the client's interaction protocol and replace the VIP with the real IP to keep the data flowing smoothly. We also built custom management APIs and a packaged Java SDK.
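Because CKafka is compatible with the open-source Kafka API, a stock Java producer should work once it points at the CKafka address. The broker address and topic below are placeholders, not real CKafka endpoints:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CKafkaProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder address: in practice this would be the CKafka VIP
        // reachable from the bound virtual machine over the intranet.
        props.put("bootstrap.servers", "10.0.0.10:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // Since CKafka speaks the open-source Kafka protocol, the stock
        // Java client needs no code changes.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("log-topic", "key-1",
                    "GET /index.html 200"));
        }
    }
}
```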

NiFi

Apache NiFi is an easy-to-use, powerful, and reliable system for processing and distributing data. It is designed for data flows: it supports powerful, highly configurable data routing, transformation, and system-mediation logic based on directed graphs, and it can dynamically pull data from multiple data sources. Apache NiFi was originally an NSA project; it is now open source and managed by the Apache Foundation. (A processor skeleton is sketched after the feature list below.)

Main features:

Web-based user interface: a seamless experience for design, control, and monitoring.

Highly configurable: loss tolerance and guaranteed delivery; low latency and high throughput; dynamic prioritization; flows can be modified at runtime; back pressure.

Data provenance: track the data flow from beginning to end.

Designed for extension: build your own processors; supports rapid development and effective testing.

Secure: SSL, SSH, HTTPS, encrypted content, etc.; multi-tenant authorization and internal authorization/policy management.
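The talk mentions secondary development on NiFi without showing code. As a rough sketch, a custom NiFi processor extends `AbstractProcessor` and moves FlowFiles between relationships; the processor and attribute names below are invented for illustration:

```java
import java.util.Collections;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

// Skeleton NiFi processor: tags each FlowFile with an attribute and
// routes it to the "success" relationship.
public class TagProcessor extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .description("FlowFiles that were tagged")
            .build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session)
            throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return; // nothing queued on the incoming connection
        }
        // The attribute is illustrative; a real processor would transform
        // or route the content here.
        flowFile = session.putAttribute(flowFile, "tagged", "true");
        session.transfer(flowFile, REL_SUCCESS);
    }
}
```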

Hive plugin

Fetching metadata: retrieve the Hive table's structure and determine whether it supports writes via the Streaming API.

Writing data: INSERT-based writes with multi-partition batch insert; streaming writes; direct writes to HDFS. A streaming-write sketch follows.
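The plug-in's actual code is not shown, so as a sketch of the streaming path, the example below uses the Hive Streaming API of that era (`org.apache.hive.hcatalog.streaming`) to write delimited rows into a transactional, partitioned table. The metastore URI, database, table, partition, and columns are placeholders:

```java
import java.util.Arrays;

import org.apache.hive.hcatalog.streaming.DelimitedInputWriter;
import org.apache.hive.hcatalog.streaming.HiveEndPoint;
import org.apache.hive.hcatalog.streaming.StreamingConnection;
import org.apache.hive.hcatalog.streaming.TransactionBatch;

public class HiveStreamingDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder metastore URI, database, table, and partition values.
        HiveEndPoint endPoint = new HiveEndPoint(
                "thrift://metastore-host:9083", "logs_db", "access_log",
                Arrays.asList("2017-05-20"));
        // true = create the partition if it does not exist yet.
        StreamingConnection conn = endPoint.newConnection(true);

        String[] fieldNames = {"ip", "url", "status"};
        DelimitedInputWriter writer =
                new DelimitedInputWriter(fieldNames, ",", endPoint);

        // Writes are grouped into transaction batches and committed.
        TransactionBatch batch = conn.fetchTransactionBatch(10, writer);
        batch.beginNextTransaction();
        batch.write("1.2.3.4,/index.html,200".getBytes());
        batch.commit();
        batch.close();
        conn.close();
    }
}
```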

Where is CDP headed?

1. Support ETL: group and transform data at the front end and perform some real-time computation.

2. Support real-time computation and analysis. Users should be able to get results directly for display on the front end, instead of having to run the computation and analysis on other systems.

3. Support real-time SQL. Full real-time computing may be too expensive for some users, and most people who do statistics are more proficient in SQL. Real-time SQL means running SQL queries and computations over the incoming data.

4. A visual, graphical operation interface. User needs are becoming more and more diverse, and many products on Tencent Cloud need this data; we hope that through such an interface, users can choose their own data sources.

That's all for my sharing today, thanks for listening!
