The Evolution of Big Data Platforms: Taobao, Didi, and Meituan

Disclaimer: This article was written on the basis of previously published articles about the big data platforms of Taobao, Didi, and Meituan; reference links are given at the end of the text.
My thanks go to the technical staff of these three companies for their generous sharing. If this article infringes on any rights, please contact me to have it removed. While respecting the facts, I have reorganized the language and content, aiming to give readers a comprehensive picture of how a big data platform is composed and how it evolves.
This article may not be reproduced without my permission; violations will be pursued as copyright infringement.

By: Big Data Technology and Architecture
Scenario: I hope this article inspires readers who are building their own big data platforms.

Keywords: big data platform

A big data platform exists to store, compute on, and present the ever-growing volume of data produced by today's society. Big data technology is the ability to quickly extract valuable information from many kinds of data. Technologies suited to big data include massively parallel processing (MPP) databases, data mining grids, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.
In short, a big data platform takes shape gradually as the business develops, data grows, and the demand for data analysis and data mining increases. This article describes how the big data platforms of three Internet companies, Taobao, Didi, and Meituan, evolved, to give you a basic blueprint for building such a platform.

Taobao

Taobao was probably one of the earliest companies in the Chinese Internet industry to build its own big data platform. Below is Taobao's early Hadoop-based big data platform, which is quite typical.

[Figure] The "Ladder" data warehouse architecture (from "The Road of Taobao's Big Data Platform")

Taobao's big data platform is divided into three parts. At the top are the data sources and data synchronization; in the middle is "Ladder 1", Taobao's Hadoop big data cluster; at the bottom are the big data applications that consume the computation results of the cluster.

The data sources are mainly backups of the Oracle and MySQL databases, the logging system, and the crawler system. These data are synchronized into the Hadoop cluster through gateway servers. Among the synchronization tools, DataExchange performs non-real-time, full-volume database synchronization, DBSync synchronizes incremental database data in real time, and TimeTunnel synchronizes log and crawler data in real time. All data is written to HDFS.

[Figure] Data synchronization tools (from "The Road of Taobao's Big Data Platform")
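To make the two synchronization styles concrete, here is a minimal sketch of a full-volume dump and an incremental dump in Python. It is illustrative only: the hosts, credentials, table, and HDFS path are hypothetical, and it uses the open-source pymysql and hdfs packages rather than Taobao's own DataExchange or DBSync (which work at far larger scale, with DBSync reading the binlog rather than polling a table).

```python
# Sketch of full vs. incremental DB-to-HDFS sync; all names are hypothetical.
import json
import pymysql                      # MySQL client
from hdfs import InsecureClient     # WebHDFS client

hdfs_client = InsecureClient("http://namenode:9870", user="etl")

def full_dump(table: str, target_path: str) -> None:
    """Non-real-time, full-volume sync (the DataExchange style)."""
    conn = pymysql.connect(host="mysql-backup", user="reader",
                           password="...", db="shop")
    try:
        with conn.cursor(pymysql.cursors.DictCursor) as cur:
            cur.execute(f"SELECT * FROM {table}")
            rows = cur.fetchall()
        # Write the whole table as newline-delimited JSON into HDFS.
        data = "\n".join(json.dumps(r, default=str) for r in rows)
        hdfs_client.write(target_path, data=data.encode(), overwrite=True)
    finally:
        conn.close()

def incremental_dump(table: str, last_id: int, target_path: str) -> int:
    """Incremental sync keyed on an auto-increment id (a polling stand-in
    for DBSync's binlog-based real-time replication)."""
    conn = pymysql.connect(host="mysql-backup", user="reader",
                           password="...", db="shop")
    try:
        with conn.cursor(pymysql.cursors.DictCursor) as cur:
            cur.execute(f"SELECT * FROM {table} WHERE id > %s", (last_id,))
            rows = cur.fetchall()
        if rows:
            data = "\n".join(json.dumps(r, default=str) for r in rows)
            hdfs_client.write(target_path, data=data.encode(), overwrite=True)
            last_id = max(r["id"] for r in rows)
        return last_id
    finally:
        conn.close()
```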

Computing tasks on the Hadoop cluster are scheduled by the Skynet system, which submits jobs for execution according to cluster resources, job priority, and task dependencies. The results are written back to HDFS and then synchronized to the MySQL and Oracle databases by DataExchange. Applications below the platform, such as Data Cube and the recommendation system, read data from these databases and can respond to user requests in real time.

The core of Taobao's big data platform is the Skynet scheduling system, shown on the left side of the diagram. Tasks submitted to the Hadoop cluster must be scheduled for execution in order of priority; tasks already defined on the cluster must be scheduled; imports from the database, log, and crawler systems must be scheduled; and exports of Hadoop results to the application databases must likewise be scheduled. One can say that the entire big data platform runs under the unified planning and scheduling of the Skynet system.
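Skynet itself is not open source, but its central idea, running tasks in dependency order and by priority, can be sketched in a few lines. A minimal sketch follows; the Task class and the task names are hypothetical stand-ins for real Hadoop jobs.

```python
# Dependency-aware, priority-ordered scheduling in the spirit of Skynet;
# all task names and the Task class itself are hypothetical.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Task:
    priority: int                      # lower value = runs first
    name: str = field(compare=False)
    deps: set = field(compare=False, default_factory=set)

def run_schedule(tasks):
    done, ready = set(), []
    pending = {t.name: t for t in tasks}
    while pending or ready:
        # Move every task whose dependencies are all finished into the heap.
        for name in list(pending):
            if pending[name].deps <= done:
                heapq.heappush(ready, pending.pop(name))
        task = heapq.heappop(ready)    # highest-priority runnable task
        print(f"running {task.name}")  # here Skynet would submit a Hadoop job
        done.add(task.name)

run_schedule([
    Task(1, "import_mysql"),           # data imports run first
    Task(1, "import_logs"),
    Task(2, "daily_report", {"import_mysql", "import_logs"}),
    Task(3, "export_to_oracle", {"daily_report"}),
])
```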

DBSync, TimeTunnel, and DataExchange are data synchronization components developed in-house by Taobao to cover the import, export, and synchronization needs of different data sources. Most of these components have since been open-sourced by Taobao, so we can use them and learn from them.

Didi

To date, Didi's platform has gone through roughly three stages: in the first, business teams built their own small clusters; in the second, the clusters were consolidated into one large centralized platform; the third is the "SQL-ization" of the platform, making SQL the main development interface.
[Figure] From "The Evolution of Didi's Big Data Platform"

The architecture of the offline computing platform is shown below. Didi's offline big data platform is built on Hadoop 2 (HDFS, YARN, MapReduce), Spark, and Hive; on this foundation Didi developed its own scheduling system and development platform. The scheduling system, like the others described here, orders the execution of big data jobs by priority. The development platform is a visual SQL editor in which developers can inspect table structures, develop SQL, and publish jobs to the big data cluster.

[Figure] From "The Evolution of Didi's Big Data Platform"
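A development platform like this ultimately turns what the user writes in the visual SQL editor into SQL jobs running against Hive tables on the cluster. Below is a minimal sketch of such a job in PySpark with Hive support, one of the engines named above; the database, table, and column names are hypothetical.

```python
# Sketch of a SQL-on-Hadoop batch job; assumes PySpark configured with a
# Hive metastore. All table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("daily-order-summary")
    .enableHiveSupport()          # read table metadata from the Hive metastore
    .getOrCreate()
)

# The visual SQL editor boils down to SQL like this, executed against
# Hive tables stored on HDFS.
summary = spark.sql("""
    SELECT dt, city, COUNT(*) AS orders
    FROM dw.ride_orders
    WHERE dt = '2024-01-01'
    GROUP BY dt, city
""")

summary.write.mode("overwrite").saveAsTable("dw.order_daily_summary")
spark.stop()
```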

In addition, Didi uses HBase heavily and has done custom development on HBase and related products (HBase, Phoenix). It maintains an HBase platform that serves both the real-time and the offline big data platforms; its architecture is shown below.

[Figure] From "The Evolution of Didi's Big Data Platform"

Results from both the real-time computing platform and the offline computing platform are saved to HBase, and applications then access HBase through Phoenix. Phoenix is a SQL engine built on top of HBase that lets you access HBase data with SQL.
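As a minimal sketch of that access path, the open-source phoenixdb driver can talk to the Phoenix Query Server from Python; the host, table, and columns below are hypothetical.

```python
# Sketch of SQL access to HBase through Phoenix; assumes a running
# Phoenix Query Server. Host, table, and columns are hypothetical.
import phoenixdb

conn = phoenixdb.connect("http://phoenix-queryserver:8765/", autocommit=True)
cursor = conn.cursor()

# Phoenix maps this DDL and SQL onto HBase tables and scans under the hood.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS ride_metrics (
        city VARCHAR NOT NULL,
        dt   VARCHAR NOT NULL,
        orders BIGINT,
        CONSTRAINT pk PRIMARY KEY (city, dt)
    )
""")
cursor.execute(
    "UPSERT INTO ride_metrics VALUES (?, ?, ?)",
    ("beijing", "2024-01-01", 12345),
)
cursor.execute("SELECT * FROM ride_metrics WHERE city = ?", ("beijing",))
for row in cursor.fetchall():
    print(row)
conn.close()
```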

To make it as convenient as possible for business teams to develop and manage stream computing tasks, Didi built the real-time computing platform shown below. On top of the stream computing engine, it provides a StreamSQL IDE, monitoring and alerting, diagnostics, data lineage, and task management and control capabilities.
[Figure] From "The Evolution of Didi's Big Data Platform"

Meituan

Viewed from the angle of data flow, the diagram below shows the overall architecture of Meituan's big data platform. The platform's data comes from the MySQL databases and from logs: MySQL binlogs are captured by Canal and output to the Kafka message queue, while logs are output to Kafka by Flume and also flow back to ODPS.

[Figure] From "Meituan's Big Data Platform"
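As a minimal sketch of the consuming side of this pipeline, the open-source kafka-python client can read Canal-style binlog events from Kafka. The broker address and topic are hypothetical; the JSON field names follow Canal's flat-message format, assuming Canal is configured to publish JSON.

```python
# Sketch of consuming Canal binlog events from Kafka; broker, topic, and
# group id are hypothetical.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "canal-binlog",                          # topic Canal publishes to
    bootstrap_servers="kafka-broker:9092",
    group_id="realtime-etl",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    event = message.value
    # A Canal JSON event carries the database and table names, the operation
    # type (INSERT/UPDATE/DELETE), and the changed rows.
    print(event.get("database"), event.get("table"), event.get("type"))
```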

The data flowing through Kafka is consumed by two compute engines, one for stream processing and one for batch processing. Stream processing uses Storm, and the computed results are output to databases or HBase. Batch processing uses Hive for analysis, and the results are output to the query system and the BI (business intelligence) platform.

Data analysts can run interactive queries against the data through the BI products on the platform, and can also view commonly used, pre-computed analysis metrics through visual reporting tools. Company executives likewise view the company's main business metrics and reports through this platform.

[Figure] Deployment architecture of Meituan's offline data platform

This figure is the deployment architecture of the offline data platform. At the bottom are three basic services: YARN, HDFS, and HiveMeta. Different computing scenarios are supported by different computing engines, and for a new company this is where the real architecture choices lie. Cloud Table is Meituan's own wrapper around HBase. Hive is used to build the data warehouse, Spark handles data mining and machine learning, and Presto supports ad hoc queries and can also run complex SQL. Presto is not deployed on YARN but is kept in step with it, while Spark runs on YARN. Hive still depends on MapReduce for now; Hive on Tez is being tested in preparation for production deployment.
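As a minimal sketch of the ad hoc query path mentioned above, the open-source presto-python-client can submit SQL to a Presto coordinator that queries the Hive warehouse; the host, schema, and table below are hypothetical.

```python
# Sketch of an ad hoc Presto query against the Hive catalog; the
# coordinator host, schema, and table are hypothetical.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator",
    port=8080,
    user="analyst",
    catalog="hive",        # query the Hive data warehouse through Presto
    schema="default",
)
cursor = conn.cursor()
cursor.execute("""
    SELECT city, COUNT(*) AS orders
    FROM orders
    WHERE dt = '2024-01-01'
    GROUP BY city
    ORDER BY orders DESC
    LIMIT 10
""")
for row in cursor.fetchall():
    print(row)
```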

In addition, we have learned that in building its real-time data warehouse, Meituan has migrated from Storm to Flink. Flink's API, fault-tolerance, and state-persistence mechanisms solve some of the problems encountered with Storm. Flink not only supports a large number of common SQL statements, covering the common development scenarios; its Table abstraction can also be managed through a TableSchema, supports rich data types, data structures, and data sources, and can easily be integrated with an existing metadata or configuration management system.
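As a minimal sketch of what such a Flink SQL job looks like, here is a PyFlink example that declares a Kafka source with a schema and runs a windowed aggregation in plain SQL. The topic, fields, and connector settings are hypothetical, and the Kafka SQL connector jar must be on the classpath; this is an illustration, not Meituan's actual job.

```python
# Sketch of a streaming Flink SQL job in PyFlink; topic, fields, and
# connector settings are hypothetical.
from pyflink.table import EnvironmentSettings, TableEnvironment

env_settings = EnvironmentSettings.in_streaming_mode()
t_env = TableEnvironment.create(env_settings)

# Declare the source with an explicit schema, as Flink's Table layer allows.
t_env.execute_sql("""
    CREATE TABLE orders (
        city STRING,
        amount DOUBLE,
        ts TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'kafka-broker:9092',
        'format' = 'json'
    )
""")

t_env.execute_sql("""
    CREATE TABLE order_stats (
        city STRING,
        window_end TIMESTAMP(3),
        total DOUBLE
    ) WITH ('connector' = 'print')
""")

# Continuous aggregation over 1-minute tumbling windows, in plain SQL.
t_env.execute_sql("""
    INSERT INTO order_stats
    SELECT city, TUMBLE_END(ts, INTERVAL '1' MINUTE), SUM(amount)
    FROM orders
    GROUP BY city, TUMBLE(ts, INTERVAL '1' MINUTE)
""").wait()
```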

The entire data processing flow of Meituan's big data platform is managed by a scheduling platform. Through an in-house development platform, developers access the big data platform to do ETL (extract, transform, load) development, submit jobs, and manage tasks and data.
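As a minimal sketch of what declaring such an ETL pipeline can look like, here is an example using the open-source Apache Airflow (2.4+) as a stand-in for Meituan's in-house scheduling platform; the DAG id, task names, and commands are all hypothetical.

```python
# Sketch of a daily ETL pipeline declared in Airflow; everything here is
# a hypothetical stand-in for an in-house scheduling platform.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_order_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",       # run once per day, like a typical batch ETL
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract",
                           bash_command="echo extract from Kafka/HDFS")
    transform = BashOperator(task_id="transform",
                             bash_command="echo run Hive/Spark job")
    load = BashOperator(task_id="load",
                        bash_command="echo load results into HBase/BI store")

    # The scheduler runs the tasks in dependency order.
    extract >> transform >> load
```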

Reference links and authors:
Wei less the Java
https://www.jianshu.com/p/58869272944b

The Road of Taobao's Big Data Platform
http://www.raincent.com/content-85-7736-1.html

The Evolution of Didi's Big Data Platform
https://blog.csdn.net/yulidrff/article/details/85680731

Meituan's Big Data Platform
https://blog.csdn.net/love284969214/article/details/83652012

Big Data Technology and Architecture
Scan the QR code to follow my WeChat public account; reply [JAVAPDF] to get 200 autumn-recruitment interview questions!
