A Preliminary Study of Mesos Persistent Storage

Persistence is a key piece of work for the next version of Mesos, and a problem that must be solved to improve resource utilization in a Mesos distributed environment. This article is based on my talk at the second Mesos Meetup; it explains how Mesos approaches the problem of persistent storage and introduces two related features of the upcoming Mesos 0.23: Persistent Volumes and Dynamic Reservations.

How to integrate MySQL, MongoDB, and other storage or stateful services into Mesos is a question anyone using Mesos needs to consider. I believe that only when MySQL clusters, MongoDB clusters, RabbitMQ clusters, and the like all draw from Mesos' resource pool can Mesos become a true distributed operating system and improve the resource utilization of the entire distributed environment. However, as of the current Mesos 0.22 the community has not provided a generic solution; a few teams have only open-sourced projects for individual cases. For example, Twitter's open-source Mysos is a Mesos framework for running MySQL instances. Setting aside the production readiness of Mysos and its peers, I think what Mesos needs is a unified, general-purpose solution.

Recently, the Mesos community announced the Mesos 0.23 release plan (Proposal) on its mailing list, which mentions three important features: SSL, Persistent Volumes, and Dynamic Reservations; the latter two are storage-related. These features are the community's first step toward solving the persistence problem, and here I will share the general idea behind them. Note that since Mesos 0.23 had not been released at the time of writing, there may be some discrepancies with the final version. Moreover, as the original Mesos paper makes clear, Mesos was born for short-lived or stateless computing tasks, so the stability and feasibility of using Mesos to manage long-running services have yet to be verified in practice.

How can we solve the persistence problem today?

  • Move stateful services outside the Mesos cluster

Obviously, this goes against our original intention in adopting Mesos: we cannot use Mesos to improve the resource utilization of nodes outside the cluster. Still, in many scenarios the resource usage of a stateful service tends to be roughly constant, and using existing mature solutions to quickly build stable Redis or RabbitMQ clusters is still the path of least effort.

  • Use the local file system on a Mesos node

Mesos can place tasks on a designated slave by restricting roles, so we can have a stateful service's task persist data to that slave's data disk while keeping the data disk itself out of the cluster's resource pool. This prevents the persisted data from being reclaimed by Mesos, and a new task can recover the previous task's data. Take MySQL as an example: publish MySQL (Dockerized) via Marathon to the only slave with role=mysql, and map the data volume /data on the slave to the directory /var/lib/mysql inside the Docker container. The MySQL task will then persist its data to /data on the slave, and future MySQL tasks dispatched to this slave can still recover the data in /data.
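For illustration, here is a minimal sketch of such a Marathon app definition submitted through Marathon's REST API. The app id, role name, image tag, and Marathon endpoint are assumptions for this example, and the field names follow Marathon's app format as I understand it; treat them as illustrative rather than authoritative.

import json
import requests  # third-party HTTP library

# Hypothetical Marathon app: one MySQL container that should only run on
# slaves whose resources are assigned to the (illustrative) role "mysql".
app = {
    "id": "/mysql",
    "cpus": 2,
    "mem": 4096,
    "instances": 1,
    "acceptedResourceRoles": ["mysql"],   # only accept offers for this role
    "container": {
        "type": "DOCKER",
        "docker": {"image": "mysql:5.6", "network": "HOST"},
        "volumes": [{
            "hostPath": "/data",                 # data disk on the slave
            "containerPath": "/var/lib/mysql",   # MySQL data directory
            "mode": "RW"
        }]
    }
}

# Assumed Marathon endpoint; adjust to your deployment.
requests.post("http://marathon.example.com:8080/v2/apps",
              data=json.dumps(app),
              headers={"Content-Type": "application/json"})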

This solution has many problems:

  • To guarantee the CPU, memory, and other resources the MySQL task needs, we must statically reserve resources on the slave. Whether or not the MySQL task is actually using them, other tasks in the cluster cannot. This can be partially addressed by the distributed file system discussed below.
  • The resources under /data cannot be released by the cluster. Although the data in /data is persistent, we may still want to delete outdated data at some point to reclaim space; under the current architecture this can only be done by means outside the cluster.
  • Data conflicts can occur. If multiple MySQL tasks use the same host directory /data, conflicts are unavoidable and the data size cannot be limited. For now this can be worked around by restricting a slave to run only one MySQL task.
  • Nothing here helps with MySQL clustering, so high availability of the service cannot be guaranteed.

  • Configure a distributed file system for the Mesos cluster

To ensure that a stateful service can access its persistent data from any slave, we can configure a DFS for the cluster, which removes the need to statically reserve resources on particular nodes. But this brings network latency to data access, and how well databases like MySQL or PostgreSQL tolerate that latency remains to be verified. Moreover, under this architecture Mesos cannot manage the network traffic between DFS nodes, which adds complexity to the cluster. An obvious mitigation is to sacrifice network resource utilization: configure a dedicated high-performance network for the DFS outside the Mesos cluster so that the DFS does not consume the cluster's network resources.

  • Local file system + the stateful service's built-in data sharding

Many stateful services support data sharding natively, such as Cassandra, MariaDB Galera, and MongoDB; these are the stateful services best suited to joining a Mesos cluster. Mesosphere has implemented [cassandra-mesos](https://github.com/mesosphere/cassandra-mesos), a Cassandra Mesos framework. I also found a blog post on [MariaDB Galera on Mesos](http://sttts.github.io/galera/mesos/2015/03/04/galera-on-mesos.html), which publishes MariaDB Galera with Marathon.

Taking Cassandra as an example: the Cassandra framework is restricted to the resources of several designated slave nodes, and Cassandra persists its data shards in the local file systems of those nodes. Obviously there is no network latency problem in this case, and since Cassandra's tasks are managed by the Mesos cluster, their network traffic is also under the cluster's control and does not add complexity.

To sum up, with the current version, persisting data requires either statically partitioning cluster resources or provisioning disks and even network resources outside the Mesos cluster. Either way, the cluster's resource utilization is significantly reduced.

Disk Isolation and Monitoring

Before discussing Persistent Volumes and Dynamic Reservations, it helps to understand Mesos' disk isolation and monitoring mechanisms. Disks may be cheap, but to keep the other tasks on a Mesos slave running normally, Mesos still needs to enforce disk quotas for the tasks on that slave.

First of all, the task disk discussed here is a general concept. In one case it is a dedicated file system created on a physical disk or an LVM volume. Here, if a task tries to write more data than it requested, it receives ENOSPC ("No space left on device") and is interrupted. (I have not studied the implementation details; Mesos may also detect the overrun through kernel event callbacks.) This setup is better suited to production environments and is known as hard enforcement. The other case, soft enforcement, is when the task's disk is a directory on a file system shared with the other tasks running on the slave. Here Mesos monitors the task's disk usage by periodically running the du command, so the amount of data a task writes may temporarily exceed the disk size it requested.
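To make the soft-enforcement idea concrete, here is a rough Python sketch of du-based polling. This is not Mesos' actual implementation; the sandbox path, polling interval, and quota are all illustrative.

import subprocess
import time

def disk_usage_bytes(sandbox):
    # 'du -sb <dir>' prints "<bytes>\t<dir>"; -b counts apparent size in bytes.
    out = subprocess.check_output(["du", "-sb", sandbox])
    return int(out.split()[0])

def poll_quota(sandbox, quota_bytes, interval=15):
    # Periodic polling means a task can overshoot its quota between checks,
    # which is exactly why this style of enforcement is "soft".
    while True:
        used = disk_usage_bytes(sandbox)
        if used > quota_bytes:
            print("task exceeded disk quota: %d > %d" % (used, quota_bytes))
            # A real isolator would report this rather than kill the task
            # outright (see the discussion of interruption below).
        time.sleep(interval)

# Example: watch a hypothetical sandbox directory with a 10 GB quota.
# poll_quota("/tmp/mesos/sandboxes/executor-1", 10 * 1024**3)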

Secondly, for the shared file system described above, Mesos implements shared filesystem isolation, though only for Mesos' built-in containerizer. It maps a path in the container to a path on the slave using the command mount -n --bind, and when the task finishes the kernel automatically cleans up the mount. For Docker containers, Mesos relies on Docker's own volume mapping mechanism.
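The bind mount itself is a one-liner; the sketch below just wraps it in Python, with both paths chosen for illustration only.

import subprocess

# Bind-mount a slave directory into the container's view of the filesystem.
# -n avoids writing an /etc/mtab entry, which matters inside a private
# mount namespace. Both paths below are assumptions for this example.
slave_path = "/tmp/mesos/sandboxes/executor-1"   # sandbox path on the slave
container_path = "/var/lib/app/data"             # path seen by the container
subprocess.check_call(["mount", "-n", "--bind", slave_path, container_path])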

Third, to prevent one task's heavy disk I/O from blocking another task's disk I/O, the Mesos community is discussing using the cgroups blkio controller for disk I/O isolation.

Finally, should Mesos interrupt a task when the data it writes to disk exceeds its requested disk size? For the first kind of task disk above, the task will clearly be terminated by the resulting error, and its data will then be garbage-collected by Mesos. For the second kind, since current Mesos does not support persistent disks, an interrupted task would be unable to recover its previous data; Mesos therefore does not interrupt the task by default and leaves the decision to the framework.

For more details, please refer to the issue [enforce disk quota in MesosContainerizer](https://issues.apache.org/jira/browse/MESOS-1588).

Persistent Volumes

In the upcoming Mesos 0.23, Mesos can provide persistent volumes for tasks. After a task finishes, the data in its persistent volumes is not reclaimed by Mesos, and a new task can read the data stored by a previously completed task through the persistent volume. Unlike the approaches above that provide disk resources outside the Mesos cluster, a persistent volume is a resource inside the cluster, managed by Mesos. Moreover, even if the slave is restarted or its info/id changes, the data in the persistent volume is not lost.

Based on persistent volumes, Mesos offers the framework two kinds of resources: regular resources and persistent resources. Regular resources are the resources of version 0.22, suitable for stateless services: after a task completes, its CPU, RAM, and disk are reclaimed by Mesos. Persistent resources are a new concept in 0.23. Besides the persistent volume itself, they also bundle resources such as CPU and RAM, which may run counter to intuition. The reason is simple: the cluster must avoid the situation where a slave can offer a task persistent volumes but its CPU/RAM is already occupied, so Mesos packages some or all of the CPU/RAM together with the persistent volumes and offers them to the framework as a unit. A framework providing a stateful service runs its tasks on persistent resources, and after a task completes, the data in the persistent volume is retained. This is similar to how Google Borg works. Here is an example of an offer containing persistent resources:

 

{"id" : { "value" : "offerid-123456789" },
 "framework_id" : "MYID",
 "slave_id" : "slaveid-123456789",
 "hostname" : "hostname1.prod.twttr.net",
 "resources" : [
   // Regular resources.
   { "name" : "cpu", "type" : SCALAR, "scalar" : 10 },
   { "name" : "mem", "type" : SCALAR, "scalar" : 12GB },
   { "name" : "disk", "type" : SCALAR, "scalar" : 90GB },
   // Persistent resources.
   { "name" : "cpu", "type" : SCALAR, "scalar" : 2,
     "persistence" : { "framework_id" : "MYID", "handle" : "uuid-123456789" } },
   { "name" : "mem", "type" : SCALAR, "scalar" : 2GB,
     "persistence" : { "framework_id" : "MYID", "handle" : "uuid-123456789" } },
   { "name" : "disk", "type" : SCALAR, "scalar" : 10GB,
     "persistence" : { "framework_id" : "MYID", "handle" : "uuid-123456789" } }
 ]
 ...
}

The community gives an example of a framework using persistent volumes here: [an example framework for testing persistent volumes](https://reviews.apache.org/r/32984/diff/2/).

In addition, since multiple tasks may need to share the same persistent data, Mesos will support resource sharing in the future. For example, a MySQL framework may run one task for the mysqld server and another task that periodically backs up the database, and both need to access the same persistent data files.
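Sticking with the proposal-style offer format shown above, sharing could look like two task definitions referencing the same persistence handle. This is a speculative sketch in Python-dict form, not a released API.

# Speculative sketch (not a released API): two tasks referencing the same
# persistence handle from the offer example above.
shared_disk = {
    "name": "disk", "type": "SCALAR", "scalar": "10GB",
    "persistence": {"framework_id": "MYID", "handle": "uuid-123456789"},
}

mysqld_task = {
    "name": "mysqld",
    "resources": [shared_disk],       # reads and writes the volume
}

backup_task = {
    "name": "mysql-backup",
    "resources": [shared_disk],       # periodically reads the same volume
}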

For more details, please refer to the issue [persistent resources support for storage-like services](https://issues.apache.org/jira/browse/MESOS-1554).

Dynamic Reservations

As mentioned earlier, a stateful-service framework needs to run new tasks on one or a few specific slaves in order to read the data that completed tasks stored in persistent volumes. In version 0.22, to guarantee that this is always possible on those slaves, each slave statically reserves resources for the corresponding roles at startup; resources reserved for a role that supports stateful services cannot be used by other frameworks, which is extremely inflexible. Dynamic Reservations in 0.23 improve this situation: alongside the static role, the cluster introduces a new reserved role.

  • When launching a task, the framework can set "reserved_role" to mark the resources used (whether regular or persistent) as dynamically reserved, where "reserved_role" is the role the framework registered with Mesos;
  • When the task completes, resources with "reserved_role" set are re-offered to the corresponding framework;
  • When a framework receives resources with "reserved_role", it knows they are dynamically reserved. If it no longer needs them, it can release them through the new "release" API; statically reserved resources, however, cannot be released for use by other frameworks. A sketch of this flow follows the list.
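Again in the spirit of the proposal, here is what a dynamically reserved resource and its release might look like. The field name "reserved_role" and the "release" call come from the bullets above; the concrete shapes are my assumptions, not a released API.

# Speculative sketch based on the proposal's field names, not a released API:
# a resource in an offer that was dynamically reserved for the framework.
reserved_cpu = {
    "name": "cpu", "type": "SCALAR", "scalar": 2,
    "reserved_role": "mysql",   # role the framework registered with Mesos
}

# On receiving such a resource, the framework can either launch a new task
# on it or hand it back via the proposed "release" API:
#   driver.release([reserved_cpu])   # hypothetical call from the proposal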

For more details, please refer to the issue [Dynamic Reservation](https://issues.apache.org/jira/browse/MESOS-2018).

Outlook

Based on Mesos 0.23, how can we get stateful services onto Mesos cluster resources better and faster? From the perspective of how much work it takes to integrate them with Mesos, stateful services fall into two categories:

  • Stateful services without a central node. For example Cassandra: since all nodes are functionally equal, we can package the Cassandra program into an executor with Docker and publish it through an existing scheduler such as Marathon to specific, persistently configured slave nodes.
  • Stateful services in master-slave or leader-follower mode. For example HDFS, MongoDB, and MySQL: the nodes of such services have different functions, including name nodes, config nodes, data nodes, and so on. We need to Dockerize executors for each kind of node, develop schedulers, and also solve problems such as fault tolerance and backup before these services are production-ready.

In summary, the release of Mesos 0.23 will greatly improve how stateful services use Mesos clusters, and it makes integrating the entire enterprise environment into Mesos a real possibility.

 

Original article: http://www.csdn.net/article/2015-07-02/2825111
