Distributed computing platform based on Mesos and Docker

In response to the business growth, rapid change, and large-scale computing demands of the "Internet+" era, inexpensive, highly scalable distributed x86 clusters have become a standard solution; Google, for example, has deployed distributed systems across tens of millions of servers. The emergence and development of Docker and related technologies have opened new possibilities for large-scale cluster management. How can the two be combined effectively? This article introduces the practice behind Shuren Technology's distributed computing platform, which is built on Mesos and Docker.

Distributed System Design Guidelines

Scalability

First of all, a distributed system must be a large-scale system with good scalability. For cost reasons, large-scale distributed systems are generally built from inexpensive PC servers rather than large, high-performance machines.

No single point of failure

Inexpensive PC servers run into all kinds of problems when used at scale; their hardware simply cannot be made highly reliable. In Google's data centers, for example, large numbers of hard drives fail every day, so a distributed system must tolerate hardware faults and ensure there is no single point of failure. To build a distributed system that provides highly reliable services on such unstable, unreliable hardware, the software itself must be fault-tolerant. There are two design patterns for meeting the no-single-point-of-failure requirement. For enterprise service applications, each backend service must run multiple instances, so that one or two hardware failures do not take down all instances of the service. For data storage applications, each piece of data must have multiple replicas, so that data is not lost even when some hardware breaks.

High reliability

Beyond eliminating single points of failure, high reliability must also be guaranteed. In a distributed environment, enterprise service applications need load balancing and service discovery to stay highly reliable. For data services, the data must first be sharded across servers according to some algorithm (because no single server can hold it all), and lookups must then follow the same algorithm to locate the right shard.

Data locality

Another distributed design principle is data locality: network communication is the bottleneck of distributed systems, so to reduce network overhead, computation should move to where the data is, rather than moving data to the computation.

Comparison of Distributed System and Linux Operating System

Since vertical scaling leaves little room for optimization (a single server's performance limit is quickly reached), distributed systems emphasize horizontal scaling and horizontal optimization: when a cluster runs short of computing resources, servers are added to increase its capacity. A distributed system must manage all of the cluster's servers in a unified way and hide the underlying management details, such as fault tolerance, scheduling, and communication, so that to developers the distributed cluster logically looks like a single server.

Compared with the single-machine Linux operating system, distributed systems have not yet matured into a true "distributed operating system", but like single-machine Linux they must provide five essential operating-system functions: resource allocation, process management, task scheduling, inter-process communication (IPC), and a file system. These can be handled by Mesos, Docker, Marathon/Chronos, RabbitMQ, and HDFS/Ceph respectively, corresponding to the Linux kernel, the Linux kernel, init.d/cron, Pipe/Socket, and ext4 on a single Linux machine, as shown in Figure 1.

Figure 1 Comparison of distributed systems and Linux operating systems 

Mesos-based distributed computing platform

Mesos resource allocation principle

Our Mesos cluster is currently deployed on a public cloud, with more than 100 virtual machines forming the cluster. Mesos does not care whether a compute node is a physical or a virtual server, as long as it runs Linux. Mesos can be understood as a distributed kernel: it only allocates cluster computing resources and is not responsible for task scheduling. Different distributed computing platforms can run on top of Mesos, such as Spark, Storm, Hadoop, Marathon, and Chronos. Spark, Storm, and Hadoop have their own task schedulers: they use the Mesos SDK to request resources from Mesos, schedule their computing tasks themselves, and tolerate hardware faults on their own. Marathon provides task scheduling for service-type distributed applications, such as corporate websites and other long-running services. A typical website application has no task scheduling or fault-tolerance capability of its own; it cannot handle complex questions like which machine to restart on after a backend instance dies. Marathon schedules such applications on its behalf: for example, if Marathon runs 100 backend instances of a website service and one instance dies, Marathon restores that instance on another server. Chronos provides task scheduling for distributed batch applications, such as periodic log processing or periodically launched offline tasks like Hadoop jobs.

The biggest benefit of Mesos is fine-grained resource allocation across a distributed cluster. In Figure 2, the left side shows coarse-grained resource allocation and the right side shows fine-grained resource allocation.

Figure 2 Two ways of Mesos resource scheduling

On the left side of Figure 2 are three separate clusters of three servers each, each running one distributed computing platform: Hadoop on top, Spark in the middle, and Storm at the bottom, with the three frameworks managed separately. On the right, a single Mesos cluster manages all nine servers, and tasks from Spark, Hadoop, and Storm run mixed together across them. Mesos first improves resource utilization: coarse-grained management inevitably wastes resources, because when, say, the Hadoop cluster sits idle, the Spark cluster cannot borrow its machines, whereas Mesos can immediately offer any idle resources to whichever framework needs them. There is also a data-locality benefit: when each framework owns its own machines, computing resources are not shared and storage is not easily shared either, so moving data over the network, for example for Spark, clearly hurts performance; a unified Mesos cluster avoids such migrations. The allocation mechanism is called resource offers: Mesos keeps reporting which resources are currently available, and frameworks such as Spark or Hadoop choose from the offered resources. This keeps Mesos's allocation logic very simple. The resource-offer approach also has a potential disadvantage: allocation decisions are decentralized, so the result may not be globally optimal. For now this drawback is not serious, since the resource utilization of a typical computing center rarely reaches 50 percent, and most computing centers are largely idle.

Mesos resource allocation example

A concrete example illustrates how Mesos allocates resources. In Figure 3, the Mesos Master is in the middle, the Mesos Slaves are at the bottom, and Spark and Hadoop run on top of Mesos. The Mesos Master reports available resources to Spark or Hadoop. Suppose Hadoop has a task to run: Hadoop selects a Mesos Slave node from the available resources reported by the Mesos Master, and the task then executes on that Mesos Slave node. This completes one resource allocation, and the Mesos Master continues making further resource offers.

Figure 3 Example of Mesos resource allocation
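The offer cycle described above can be sketched as a toy simulation. This is not the real Mesos SDK or HTTP API; the class and method names are illustrative, and the framework's placement policy here (pick the slave with the most free CPUs) is just one possible choice:

```python
class MesosMasterSim:
    """Toy simulation of Mesos resource offers (illustrative only,
    not the real Mesos API)."""

    def __init__(self, slaves):
        # slaves: dict of slave_id -> {"cpus": float, "mem": int}
        self.slaves = slaves

    def offer(self):
        """Report all currently free resources to a framework."""
        return [(sid, dict(res)) for sid, res in self.slaves.items()
                if res["cpus"] > 0]

    def launch(self, slave_id, cpus, mem):
        """A framework accepted part of an offer; deduct the resources."""
        res = self.slaves[slave_id]
        if res["cpus"] < cpus or res["mem"] < mem:
            raise ValueError("offer no longer valid")
        res["cpus"] -= cpus
        res["mem"] -= mem


# A framework (e.g. Hadoop) picks a slave from the offer and launches a task.
master = MesosMasterSim({
    "slave-1": {"cpus": 4.0, "mem": 8192},
    "slave-2": {"cpus": 2.0, "mem": 4096},
})
offers = master.offer()
slave_id, res = max(offers, key=lambda o: o[1]["cpus"])  # framework's own policy
master.launch(slave_id, cpus=1.0, mem=1024)
print(slave_id, master.slaves[slave_id])
```

Note that the allocation decision (which slave, how much to take) lives in the framework, not the master, which is exactly why Mesos's own logic stays simple.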

Task scheduling

Mesos does only one thing, allocating the resources of a distributed cluster; it does not handle task scheduling. Marathon and Chronos perform task scheduling on top of Mesos. As Figure 4 shows, a Mesos cluster runs a mix of different task types from Marathon and Chronos. When Marathon and Chronos schedule tasks on Mesos, the scheduling is necessarily dynamic: before a task executes, it does not know which server it will run on or which port it will bind. As Figure 5 shows, a Mesos cluster of nine servers runs a mix of Marathon-scheduled tasks; when the server in the middle fails, the two tasks on it are affected, and Marathon migrates them to other servers. This is the benefit of dynamic task scheduling: fault tolerance becomes very easy to implement.

Figure 4 A Mesos cluster running different types of tasks

Figure 5 Marathon dynamic task scheduling
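The failover behaviour in Figure 5 can be sketched in a few lines. This is a toy model, not Marathon's actual scheduler (whose placement policy is far richer); the class and server names are illustrative:

```python
class MarathonSim:
    """Toy model of Marathon-style failover: tasks on a dead server
    are rescheduled onto the surviving servers (illustrative only)."""

    def __init__(self, servers):
        self.placement = {s: [] for s in servers}  # server -> task ids

    def schedule(self, task_id):
        # Place on the least-loaded server (a simplification).
        server = min(self.placement, key=lambda s: len(self.placement[s]))
        self.placement[server].append(task_id)
        return server

    def fail_server(self, server):
        """The server died: reschedule its orphaned tasks elsewhere."""
        orphans = self.placement.pop(server)
        for task_id in orphans:
            self.schedule(task_id)
        return orphans


m = MarathonSim(["s1", "s2", "s3"])
for i in range(6):
    m.schedule(f"web-{i}")
lost = m.fail_server("s2")   # two tasks were on s2; both get new homes
print(lost, m.placement)
```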

To reduce the impact of hardware failures on application services, applications should be stateless wherever possible. The benefit of statelessness is that when a program is hit by a failure, no recovery work is needed; the program simply has to be rescheduled. Statelessness requires putting state data into a storage service or a message queue, which makes recovery during fault handling very convenient.

High reliability for services

For service-type tasks, the distributed environment guarantees high service reliability through load balancing and service discovery. One difficulty of load balancing in a distributed environment is that the backend instances may change dynamically: if a node fails, the instances on it are affected and migrate to other nodes, yet a traditional load balancer assumes static backend addresses and ports. So in a distributed environment, load balancing requires service discovery. For example, a service previously had four instances and two new ones have just been added; the load balancer must be told the addresses and ports of the new instances. Service discovery is accomplished by several cooperating modules. Say Marathon adds new instances to a service: it writes the newly scheduled instances' addresses and ports into Zookeeper, and Bamboo then passes that information from Zookeeper to the load balancer. The load balancer now knows the new instances' addresses and ports, and service discovery is complete.
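The Marathon → Zookeeper → Bamboo → load-balancer flow can be sketched as a minimal simulation. The classes below are stand-ins, not the real Zookeeper or Bamboo APIs, and the addresses are made up:

```python
class LoadBalancerSim:
    """Stand-in for a load balancer whose backend list can be reloaded
    at runtime (illustrative only)."""

    def __init__(self):
        self.backends = []

    def reload(self, backends):
        self.backends = list(backends)


class RegistrySim:
    """Plays the combined role of Zookeeper + Bamboo: stores instance
    addresses and pushes every change to the load balancer."""

    def __init__(self, balancer):
        self.instances = set()
        self.balancer = balancer

    def register(self, addr):
        self.instances.add(addr)
        self.balancer.reload(sorted(self.instances))


lb = LoadBalancerSim()
reg = RegistrySim(lb)
for addr in ["10.0.0.1:31001", "10.0.0.2:31005"]:
    reg.register(addr)
reg.register("10.0.0.3:31017")   # Marathon launches a new instance
print(lb.backends)               # the balancer now knows all three
```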

High reliability for data

For service-type applications, the distributed system uses load balancers and service discovery to guarantee highly reliable services. For data-type applications, the distributed system must likewise guarantee a highly reliable data service. The first step is data sharding: when one server cannot hold all the data, the data is split into multiple parts, but the split must follow some rule, and later lookups must follow the same rule to find the right shard; this is consistency. Suppose the most basic scheme uses hash computation: the data is split three ways across three machines on a linear hash space. Once those machines fill up, redistributing the data on a straight line becomes awkward, so instead the start and end points of the hash space are connected to form a data ring, as shown in Figure 6, and each data item is placed on the corresponding segment of the ring. Adding a new data node then simply carves a new slice out of the ring: the data falling within that slice moves to the new node's shard, while everything else stays put, which is very convenient.

Figure 6 Data sharding

Nodes can also be removed: the removed slice's data (the yellow portion in the figure) is simply merged into the adjacent slice (the red portion), which is the advantage of the ring layout. For actual high reliability, any key that maps into a given slice always lands on the same machine, and beneath that machine there are multiple replicas backing up the data; this is a practical example of data sharding. That is how high data reliability is achieved. Data sharding, like load balancing, exists to cope with the unreliability and failure of hardware in a distributed deployment, which is the defining characteristic of working with distributed systems.
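The ring scheme described above is consistent hashing, and a minimal version fits in a few lines. This is a sketch, not any particular database's implementation; the node names are placeholders, and real systems usually add virtual nodes to even out the slices:

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring: servers and keys map onto the same
    circular hash space, and a key belongs to the first server clockwise."""

    def __init__(self, nodes=()):
        self._ring = []           # sorted list of (hash, node)
        for n in nodes:
            self.add(n)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add(self, node):
        bisect.insort(self._ring, (self._hash(node), node))

    def lookup(self, key):
        h = self._hash(key)
        # First node clockwise from the key's position (wrapping around).
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.lookup(k) for k in ("user:1", "user:2", "user:3")}
ring.add("node-d")                            # carve a new slice out of the ring
after = {k: ring.lookup(k) for k in before}
moved = [k for k in before if before[k] != after[k]]
print(moved)   # only keys falling in the new slice change owner
```

The point of the ring: adding `node-d` relocates only the keys in its slice; every other key keeps its old owner, unlike naive `hash(key) % n` sharding, where changing `n` reshuffles nearly everything.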

Docker-based distributed computing platform

Docker workflow

We mainly use Docker for process management in the distributed environment. Our Docker workflow, shown in Figure 7, applies Docker not only in production but also in development: every day we edit Dockerfiles, build Docker images, test and release them, push the images to our internal private Docker Registry, and then deploy them into our production Docker cluster. This is no different from other typical Docker workflows.

Figure 7 Docker workflow

Submitting Docker tasks on Mesos

Mesos and Docker are already seamlessly integrated. Service-type applications and batch applications are submitted through Marathon and Chronos, respectively. Marathon and Chronos accept tasks through RESTful APIs, with a JSON document specifying the number of backend instances, the application's parameters, the path of the Docker image, and so on.
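A Marathon app definition of the kind just described looks like the JSON below, built here in Python. The field names follow Marathon's `/v2/apps` request format; the app id, image path, and registry host are placeholders for illustration:

```python
import json

# Fields follow Marathon's /v2/apps format; image and id are placeholders.
app = {
    "id": "/web-service",
    "instances": 3,          # number of backend instances
    "cpus": 0.5,
    "mem": 256,
    "container": {
        "type": "DOCKER",
        "docker": {
            "image": "registry.example.com/web-service:latest",
            "network": "BRIDGE",
            # hostPort 0 = let Mesos pick a port (dynamic scheduling)
            "portMappings": [{"containerPort": 8080, "hostPort": 0}],
        },
    },
}
payload = json.dumps(app)
print(payload)
# Submit it to Marathon over REST, e.g.:
#   curl -X POST http://marathon:8080/v2/apps \
#        -H "Content-Type: application/json" -d "$payload"
```

Note the `hostPort: 0` setting: it matches the earlier point that under dynamic scheduling a task does not know in advance which port it will bind.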

Inter-process communication in a distributed environment

Communication between application services in a distributed environment is done through a distributed message queue; we use RabbitMQ. RabbitMQ is itself a distributed system, so it too must guarantee high reliability and solve fault tolerance. First, RabbitMQ supports clustering: as Figure 8 shows, six nodes form a RabbitMQ cluster in which the nodes back each other up, so if any one node fails, the other five can still provide service. This redundancy guarantees RabbitMQ's high reliability.

Figure 8 RabbitMQ cluster

Second, RabbitMQ also has a data sharding mechanism. A message queue can grow so long that its messages cannot all fit on one node; sharding then splits the long queue into segments placed on different nodes. Figure 9 shows RabbitMQ's federation mechanism: a queue is split into two segments, one upstream and one downstream. When the downstream segment's messages have been consumed, messages from the upstream segment are automatically moved downstream. This way, even a very long queue is no problem; it is simply sharded across multiple nodes.

Figure 9 Message queue sharding
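The upstream/downstream refill behaviour can be sketched as a toy in-memory model. This is not the RabbitMQ federation plugin itself (which works over AMQP links between brokers); it only illustrates the flow described above:

```python
from collections import deque

class FederatedQueueSim:
    """Toy model of a queue split into an upstream and a downstream
    segment: consumers read downstream, which refills from upstream
    (illustrative only, not RabbitMQ's federation plugin)."""

    def __init__(self, batch=2):
        self.upstream = deque()
        self.downstream = deque()
        self.batch = batch

    def publish(self, msg):
        self.upstream.append(msg)

    def consume(self):
        if not self.downstream:
            # Downstream drained: automatically pull the next batch upstream.
            for _ in range(min(self.batch, len(self.upstream))):
                self.downstream.append(self.upstream.popleft())
        return self.downstream.popleft() if self.downstream else None


q = FederatedQueueSim()
for i in range(5):
    q.publish(i)
out = [q.consume() for _ in range(5)]
print(out)   # messages arrive in order even though the queue is split
```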

Distributed file systems

Finally, a word on the distributed file systems HDFS and Ceph. In the Hadoop file system HDFS, as shown in Figure 10, each data block has three replicas, which must reside on different servers, with at most two replicas per rack; this, too, is for fault tolerance. Ceph is another popular open-source distributed file system. Ceph abstracts networked storage devices into a logical disk that is then "mounted" on every server in the distributed cluster, much like Linux mounting a physical disk. As a result, user programs access the Ceph file system just as they would a local Linux path, which is very convenient.

Figure 10 Distributed file systems
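The HDFS placement constraint just mentioned (three replicas, different servers, at most two per rack) can be sketched as a greedy picker. This is only a sketch of the constraint; real HDFS placement also weighs client locality, node load, and disk space:

```python
from collections import Counter

def place_replicas(racks, n_replicas=3, max_per_rack=2):
    """Pick servers for a block's replicas under the HDFS-style rule
    that no rack holds more than two of the three copies (sketch only)."""
    chosen = []
    rack_count = Counter()
    for rack, servers in racks.items():
        for server in servers:
            if rack_count[rack] < max_per_rack:
                chosen.append((rack, server))
                rack_count[rack] += 1
            if len(chosen) == n_replicas:
                return chosen
    raise ValueError("not enough servers to satisfy placement")


# Hypothetical topology: two racks, five servers.
racks = {"rack-1": ["r1s1", "r1s2", "r1s3"],
         "rack-2": ["r2s1", "r2s2"]}
placement = place_replicas(racks)
print(placement)   # three distinct servers, spanning both racks
```

The two-per-rack cap is what lets a block survive the loss of an entire rack, not just a single server.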

Monitoring in a distributed environment

In a distributed environment, programs run not locally but on a cluster; without monitoring they are running inside a black box and cannot be tuned, so monitoring is essential. Monitoring in a distributed environment has two parts: performance monitoring and alerting. Performance monitoring must reveal the running state of every application, that is, how much CPU and memory each application uses, the service's request-handling latency, and so on. We use Graphite for application performance monitoring; for other systems, such as open-source components like MongoDB and Hadoop, we use Ganglia for performance monitoring of CPU, memory, and disk usage. Alerting must notify developers and operators when a critical service fails so they can resolve the fault promptly; we use Zabbix for alerting. (Editor: Zhou Jianding)
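Feeding an application metric into Graphite is simple: its plaintext protocol accepts one metric per line, `<metric.path> <value> <unix-timestamp>`, over TCP (port 2003 by default). The metric name and host below are made-up examples:

```python
import time

def graphite_line(path, value, timestamp=None):
    """Format one metric in Graphite's plaintext protocol:
    '<metric.path> <value> <unix-timestamp>\n'."""
    ts = int(timestamp if timestamp is not None else time.time())
    return f"{path} {value} {ts}\n"


line = graphite_line("webapp.api.latency_ms", 42, timestamp=1433817600)
print(line, end="")
# To actually ship it, open a TCP connection to the carbon daemon
# (host is deployment-specific), e.g.:
#   sock = socket.create_connection(("graphite-host", 2003))
#   sock.sendall(line.encode())
#   sock.close()
```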

About the author: Wang Pu worked in Silicon Valley at StumbleUpon, Groupon, and Google, and specializes in massive data processing, distributed computing, and large-scale machine learning. In 2014 he returned to China and founded Shuren Technology, which builds a distributed computing platform based on Mesos and Docker, providing enterprise customers with a one-stop solution for big data analysis and processing.

 

http://www.csdn.net/article/2015-06-09/2824906
