How Well-Known Tech Companies Build Their Big Data Platforms and Architectures

Today we look at the big data platforms of Taobao, Meituan, and Didi. On the one hand, this lets us study further how large companies architect their big data platforms; on the other, we can learn how their engineers draw architecture diagrams. Looking at these diagrams, you will find that not only are the platform designs of these companies similar, but the way the diagrams themselves are drawn also follows recognizable patterns.

Taobao big data platform

Taobao was probably one of the earliest companies in the Chinese Internet industry to build its own big data platform. Below is Taobao's early Hadoop-based big data platform, which is fairly typical.

Taobao's big data platform is divided into three parts. At the top are the data sources and data synchronization; in the middle is Yunti 1 (云梯1, "Cloud Ladder 1"), which is Taobao's Hadoop cluster; at the bottom are the big data applications, which consume the results computed by the cluster.

The data mainly comes from standby Oracle and MySQL databases, the log system, and the crawler system. This data enters the Hadoop cluster through the data synchronization gateway servers. DataExchange synchronizes full database data in non-real time, DBSync synchronizes incremental database data in real time, and TimeTunnel synchronizes log and crawler data in real time. All data is written to HDFS.
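The division of labor among the three synchronization components can be summarized as a small routing rule. The sketch below is only an illustration of the paragraph above: the `Source` class and `pick_sync_component` function are hypothetical names, not part of any Taobao tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    kind: str       # "database", "log", or "crawler" (hypothetical labels)
    realtime: bool  # True for real-time incremental sync, False for full sync


def pick_sync_component(source: Source) -> str:
    """Return the name of the component that would carry this source into HDFS."""
    if source.kind == "database":
        # Incremental changes go through DBSync; full, non-real-time loads
        # go through DataExchange.
        return "DBSync" if source.realtime else "DataExchange"
    if source.kind in ("log", "crawler"):
        # The text describes logs and crawler data as always real-time.
        return "TimeTunnel"
    raise ValueError(f"unknown source kind: {source.kind}")


if __name__ == "__main__":
    for s in (Source("database", False), Source("database", True),
              Source("log", True), Source("crawler", True)):
        print(s, "->", pick_sync_component(s))
```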

Computing tasks on Hadoop are scheduled by the Skynet (天网) scheduling system, which orders and submits jobs according to cluster resources and job priorities. The results are written to HDFS and then synchronized by DataExchange to MySQL and Oracle databases. The Data Cube and the recommendation system at the bottom of the platform read data from these databases, so they can respond to user requests in real time.

The core of Taobao's big data platform is the Skynet scheduling system, located on the left side of the architecture diagram. Tasks submitted to the Hadoop cluster must be executed in order and by priority; tasks already defined on the cluster must be scheduled; imports from the databases, logs, and crawler system must be scheduled; and exports of Hadoop results to the application systems' databases must be scheduled as well. It is fair to say that the whole big data platform operates under Skynet's unified planning and scheduling.

DBSync, TimeTunnel, and DataExchange are data synchronization components developed in-house at Taobao. They import and export data for different data sources and synchronization requirements. Taobao has open-sourced most of these components, so we can use them or refer to them.
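To make "scheduled in order and by priority" concrete, here is a minimal sketch of a scheduler that respects job dependencies and, among the jobs that are ready, runs the highest-priority one first. It is not Skynet's actual algorithm, which the article does not describe; the job names and the `schedule` function are assumptions made for illustration.

```python
import heapq


def schedule(jobs):
    """Order jobs so dependencies run first; among ready jobs, higher priority wins.

    jobs: dict of name -> (priority, [dependency names]); larger priority = more urgent.
    """
    remaining = {name: set(deps) for name, (_, deps) in jobs.items()}
    dependents = {name: [] for name in jobs}
    for name, deps in remaining.items():
        for dep in deps:
            dependents[dep].append(name)

    # heapq is a min-heap, so negate the priority to pop the most urgent job first.
    ready = [(-jobs[name][0], name) for name, deps in remaining.items() if not deps]
    heapq.heapify(ready)

    order = []
    while ready:
        _, name = heapq.heappop(ready)
        order.append(name)
        for child in dependents[name]:
            remaining[child].discard(name)
            if not remaining[child]:
                heapq.heappush(ready, (-jobs[child][0], child))

    if len(order) != len(jobs):
        raise ValueError("cyclic dependency between jobs")
    return order


if __name__ == "__main__":
    # Hypothetical jobs: two imports, one computation, one export.
    demo = {
        "import_db": (5, []),
        "import_logs": (3, []),
        "daily_report": (9, ["import_db", "import_logs"]),
        "export_to_mysql": (7, ["daily_report"]),
    }
    print(schedule(demo))
```

Even though `daily_report` has the highest priority here, it only runs after both imports finish, which mirrors the import, compute, export sequence the text describes.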

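Meituan big data platform

Meituan's big data platform takes its data from MySQL databases and logs. For the databases, Canal reads the MySQL binlog and publishes it to the Kafka message queue; logs are also sent to Kafka, through Flume.

The data in Kafka is consumed separately by two engines, one for stream computing and one for batch computing. Stream processing uses Storm, with results written to HBase or a database. Batch processing uses Hive for analysis, with results sent to the query system and the BI (business intelligence) platform.

Data analysts can run interactive queries through the BI product platform, or view commonly used, already computed metrics through visual reporting tools. Company executives also use the Tianji (天机) system on this platform to view the company's main business metrics and reports.

The whole workflow of Meituan's big data platform is managed by a scheduling platform. In-house developers use the data development platform to access the big data platform, do ETL (extract, transform, load) development, submit jobs, and manage data.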
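The reason one Kafka topic can feed both engines is that each consumer group tracks its own read position, so the stream side and the batch side each see every record. The sketch below imitates that behavior with a plain in-memory class. It does not use the Kafka client API, and the `Topic` class, group names, and sample records are all made up for illustration.

```python
class Topic:
    """A tiny in-memory stand-in for a Kafka topic: an append-only log."""

    def __init__(self):
        self.log = []
        self.offsets = {}  # consumer group -> next offset to read

    def produce(self, record):
        self.log.append(record)

    def poll(self, group):
        """Return records this group has not seen yet and advance its offset."""
        start = self.offsets.get(group, 0)
        records = self.log[start:]
        self.offsets[group] = len(self.log)
        return records


if __name__ == "__main__":
    topic = Topic()
    topic.produce({"source": "binlog", "table": "orders", "op": "insert"})
    topic.produce({"source": "log", "path": "/checkout", "status": 200})

    # Each group reads independently, so both receive the full stream.
    stream_records = topic.poll("stream-engine")
    batch_records = topic.poll("batch-engine")
    print(len(stream_records), len(batch_records))

    # A second poll returns only records produced since the last poll.
    topic.produce({"source": "log", "path": "/pay", "status": 500})
    print(topic.poll("stream-engine"))
```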

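Didi big data platform

Didi's big data platform consists of two parts: a real-time computing platform (stream computing) and an offline computing platform (batch computing).

The architecture of the real-time computing platform is as follows. After collection, data is sent to the Kafka message queue, which has two consumption channels. One is data ETL: Spark Streaming or Flink cleans, transforms, and processes the data and writes it to HDFS for later batch computing. The other is Druid, which computes real-time monitoring metrics and sends the results to the alerting system and the real-time charting system, DashBoard.

The architecture of the offline computing platform is as follows. Didi's offline big data platform is built on Hadoop 2 (HDFS, Yarn, MapReduce), Spark, and Hive, on top of which Didi developed its own scheduling system and development system. As in the other platforms above, the scheduling system decides the priority and execution order of big data jobs. The development platform is a visual SQL editor that makes it easy to look up table schemas, develop SQL, and publish it to the big data cluster.

In addition, Didi uses HBase heavily and has done some custom development on the related products (HBase, Phoenix). It maintains an HBase platform at the same level as the real-time and offline big data platforms; its architecture is shown below.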
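The two channels can be sketched as two small functions applied to the same events: one cleans records for storage, the other rolls them up into per-minute metrics and flags buckets that exceed a threshold. This is a plain-Python illustration, not Spark Streaming, Flink, or Druid code; the field names `ts` and `latency_ms` and the threshold are assumptions. For brevity, the monitoring side here reuses the cleaned records rather than reading the raw stream itself.

```python
from collections import defaultdict


def clean(event):
    """ETL channel: drop malformed events and normalize field types."""
    if "ts" not in event or "latency_ms" not in event:
        return None
    return {"ts": int(event["ts"]), "latency_ms": float(event["latency_ms"])}


def minute_averages(events):
    """Monitoring channel: average latency per one-minute bucket."""
    sums = defaultdict(lambda: [0.0, 0])  # bucket start -> [total, count]
    for e in events:
        bucket = e["ts"] // 60 * 60
        sums[bucket][0] += e["latency_ms"]
        sums[bucket][1] += 1
    return {bucket: total / count for bucket, (total, count) in sums.items()}


def alerts(averages, threshold_ms):
    """Return the buckets whose average latency exceeds the threshold."""
    return sorted(b for b, avg in averages.items() if avg > threshold_ms)


if __name__ == "__main__":
    raw = [
        {"ts": 0, "latency_ms": 100},
        {"ts": 30, "latency_ms": 300},
        {"ts": 60, "latency_ms": 50},
        {"ts": 61},  # malformed: no latency, dropped by clean()
    ]
    cleaned = [c for c in map(clean, raw) if c is not None]
    averages = minute_averages(cleaned)
    print(averages)
    print(alerts(averages, threshold_ms=150))
```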



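Results from the real-time and offline computing platforms are stored in HBase, and applications then access HBase through Phoenix. Phoenix is a SQL engine built on top of HBase that lets you access the data in HBase with SQL.

Summary

As you can see, the big data platforms of these well-known companies really are much alike. Each company adjusts its product choices and architectural details to its own scenarios and technology stack, but the overall approach is essentially the same.

Yet it is precisely these small variations on a shared design that let us see big data platform architecture from several angles and understand it more deeply.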


Origin www.cnblogs.com/xiaodf/p/11611970.html