Talking about distributed computing frameworks

If you ask what MapReduce and Spark have in common, or how they relate to each other, you will probably answer that both are big data processing engines. If you ask the same about Spark and TensorFlow, it gets a bit confusing: their areas of concern do not seem to be the same. And Spark versus MPI? That seems even farther apart. The question is admittedly imprecise, but these systems do share something, and that shared part is today's topic, a rather large one: distributed computing frameworks.

Whether it is MapReduce, Spark, or TensorFlow, each of them exploits distributed resources to run certain computations and solve certain problems. At this level, each of them defines a "distributed computing model": a way of expressing computation that lets us solve many problems over distributed data. The difference between them lies in the computing model each one proposes. MapReduce, as its name suggests, offers a very basic map-reduce style model (one could almost say it is that model). Spark defines a set of RDDs, which essentially form a DAG of map/reduce stages. TensorFlow's model is also a graph, but a "more complex" one than Spark's: you have to define every node and edge in the graph, and those definitions tell TensorFlow how to evaluate it. This specificity makes TensorFlow well suited to a particular class of computation, namely neural networks, while the RDD model makes Spark well suited to data-parallel tasks with no dependencies between them. Is there a distributed computing model that is general, simple, and high-performance all at once? I find that hard to believe. Generality usually rules out the optimizations that come from specializing for a particular case, and a framework that must handle arbitrary tasks, rather than one specific kind, can hardly stay simple.
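To make the "map-reduce style" concrete, here is a purely illustrative single-machine sketch of the model using plain Python's map and functools.reduce for a word count; a real MapReduce engine adds partitioning, shuffling, and fault tolerance on top of this same shape.

from functools import reduce
from collections import Counter

lines = ["to be or not to be", "to do is to be"]

# Map phase: each line is processed independently into partial word counts.
partial_counts = map(lambda line: Counter(line.split()), lines)

# Reduce phase: the partial results are merged into one final result.
total = reduce(lambda a, b: a + b, partial_counts, Counter())

print(total)  # e.g. Counter({'to': 4, 'be': 3, ...})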

As an aside, a distributed computing model always comes with a companion: scheduling. It gets less attention, but it is an essential part of any distributed computing engine. MapReduce is scheduled by YARN; Spark ships with its own built-in scheduler, and so does TensorFlow. What about MPI? Its scheduling is almost no scheduling at all: it assumes the cluster resources are already available and simply launches all tasks over ssh. Scheduling should really be split into resource scheduling and task scheduling. The former asks a resource manager for hardware resources; the latter dispatches the tasks of the computation graph onto those remote resources for execution, which is what is usually called two-level scheduling. Projects such as TensorflowOnSpark have appeared in recent years; in essence they use Spark for resource scheduling, combined with TensorFlow's computing model.

When we finish writing a single-machine program and then face a much larger volume of data, a natural thought is: can I just make it run in a distributed environment? If no change, or only a tiny change, were enough to make it distributed, that would be wonderful. Reality, of course, is crueler. For a general program, the user usually has to write the distributed version by hand, for example with a framework like MPI, controlling data distribution and aggregation themselves, and dealing with failed tasks on their own (usually with no fault tolerance at all). If the goal happens to be processing a batch of data, i.e. batch processing, one can use the predefined APIs of MapReduce or Spark; for this kind of task the framework has already taken care of everything outside the business logic (the scaffolding code), as the sketch below illustrates. Likewise, if the task is to train a neural network, a framework such as TensorFlow or PyTorch is enough. The point of this paragraph is: if the problem you face already has a corresponding framework, just use it. But what if it doesn't? Is there any option other than implementing the distribution yourself?
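As an illustration of "the framework has already handled the scaffolding", here is a minimal word count sketch using Spark's predefined RDD API (assuming PySpark is installed; input.txt is a hypothetical input file). The user only writes the per-record logic, while partitioning, shuffling, and recovery are handled by the engine.

from pyspark import SparkContext

sc = SparkContext(appName="wordcount-sketch")

counts = (sc.textFile("input.txt")                  # hypothetical input file
            .flatMap(lambda line: line.split())     # map: line -> words
            .map(lambda word: (word, 1))            # map: word -> (word, 1)
            .reduceByKey(lambda a, b: a + b))       # reduce: sum counts per word

print(counts.take(10))
sc.stop()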

Today I want to mention a project, Ray, which claims that with only slight modifications to your code, you can make it distributed (the project has actually been released for quite a while; I just never paid deliberate attention to it). Of course, this only applies to Python code, as in the following example:

+------------------------------------------------+----------------------------------------------------+
| **Basic Python**                               | **Distributed with Ray**                           |
+------------------------------------------------+----------------------------------------------------+
|                                                |                                                    |
|  # Execute f serially.                         |  # Execute f in parallel.                          |
|                                                |                                                    |
|                                                |  @ray.remote                                       |
|  def f():                                      |  def f():                                          |
|      time.sleep(1)                             |      time.sleep(1)                                 |
|      return 1                                  |      return 1                                      |
|                                                |                                                    |
|                                                |                                                    |
|                                                |  ray.init()                                        |
|  results = [f() for i in range(4)]             |  results = ray.get([f.remote() for i in range(4)]) |
+------------------------------------------------+----------------------------------------------------+
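The right-hand column can be assembled into a complete script. The sketch below (assuming ray is installed) also times the calls to show that the four one-second tasks overlap instead of running back to back.

import time
import ray

ray.init()

@ray.remote
def f():
    time.sleep(1)
    return 1

start = time.time()
# The four remote calls return futures immediately; ray.get waits for all of them.
results = ray.get([f.remote() for _ in range(4)])
print(results, "in %.1f s" % (time.time() - start))  # roughly 1 s on a machine with >= 4 CPUs, not 4 s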

That simple? It reminded me of OpenMP (note: OpenMP, not OpenMPI). Let's take a look:


#include <cstdlib>   // for system()
#include <iostream>
#include <omp.h>

using namespace std;

int main() {
#pragma omp parallel for
    for (int i = 0; i < 10; ++i) {
        cout << "Test" << endl;   // output from multiple threads may interleave
    }
    system("pause");  // Windows-only; keeps the console window open
    return 0;
}

Include the header, add one preprocessor directive, and the loop immediately runs in parallel. Of course OpenMP is not distributed: the compiler turns the annotated part of the code into multi-threaded code, so everything still runs inside a single process, and the speedup is limited by the number of hardware threads (you also need to compile with OpenMP enabled, e.g. g++ -fopenmp). With a dual-threaded CPU you get at most a 2x speedup; on a server CPU with 32 hardware threads, the parallelized portion can enjoy up to a 32x speedup. But that is not the point. From the user's perspective, isn't Ray's approach somewhat similar to OpenMP's? You barely change your code, and it becomes capable of distributed (or parallel) execution. OpenMP is even more remarkable in one respect: a compiler that does not support it simply treats the pragma as a comment, and the code still runs correctly.

So how does Ray achieve this? Ray's approach is actually fairly simple to describe: it defines a small set of APIs, somewhat like the communication primitives defined by MPI. When these APIs are "injected" into the right places, the user's code becomes a mixture of user logic and calls into the Ray framework, and the whole program effectively forms a computation graph. What remains is simply to wait for Ray to finish executing that graph and return the result. Ray's paper gives an example:

@ray.remote
def create_policy():
    # Initialize the policy randomly.
    return policy
@ray.remote(num_gpus=1)
class Simulator(object):
    def __init__(self):
        # Initialize the environment.
        self.env = Environment()
    def rollout(self, policy, num_steps):
        observations = []
        observation = self.env.current_state()
        for _ in range(num_steps):
            action = policy(observation)
            observation = self.env.step(action)
            observations.append(observation)
        return observations
@ray.remote(num_gpus=2)
def update_policy(policy, *rollouts):
    # Update the policy.
    return policy
@ray.remote
def train_policy():
    # Create a policy.
    policy_id = create_policy.remote()
    # Create 10 actors.
    simulators = [Simulator.remote() for _ in range(10)]
    # Do 100 steps of training.
    for _ in range(100):
        # Perform one rollout on each actor.
        rollout_ids = [s.rollout.remote(policy_id)
                       for s in simulators]
        # Update the policy with the rollouts.
        policy_id = update_policy.remote(policy_id, *rollout_ids)
    return ray.get(policy_id)

The resulting computation graph looks like this:

[Figure: the computation graph generated by the code above]

So all a user needs to do is add the appropriate Ray API calls to their code, and the code effectively becomes a distributed computation graph. For comparison, let's look at how TensorFlow defines its graph:

import tensorflow as tf
# Build the dataflow graph: y = W * x + b, where W and b are variable (storage) nodes and x is a data node.
x = tf.placeholder(tf.float32)
W = tf.Variable(1.0)
b = tf.Variable(1.0)
y = W * x + b
with tf.Session() as sess:
    tf.global_variables_initializer().run() # Operation.run
    fetch = y.eval(feed_dict={x: 3.0})      # Tensor.eval
    print(fetch)                            # fetch = 1.0 * 3.0 + 1.0
'''
Output:
4.0
'''

As you can see, TensorFlow requires you to explicitly define each node yourself, placeholder, Variable, and so on (these are specific node types in the graph), whereas in Ray the graph is defined implicitly. I think the latter is the more natural way from a developer's point of view, while the former feels more like bending your own logic to fit the wheel that TensorFlow has built.

So is Ray the general, simple, and flexible distributed computing framework we have been looking for? Since I don't have much hands-on experience with Ray, it is hard to say. Judging from the official introduction, the small set of APIs is indeed simple enough; whether those few APIs are enough to achieve generality and flexibility is harder to tell. Fundamentally, TensorFlow's graph definition is also quite general, yet it is not a general-purpose distributed computing framework. And for some problems the difficulty lies not in the framework but in distributing the problem itself, so hoping for a universal distributed computing framework that can solve any single-machine problem may be a false proposition in the first place.

But I digress. Suppose Ray really does let us execute a program in a distributed way with relatively little effort, then what? Not long ago Databricks open-sourced a new project, Koalas, which tries to parallelize pandas on top of the RDD framework. Since pandas targets data analysis, a scenario similar to Spark's, and the two have similar underlying storage structures and concepts, distributing pandas via RDDs is feasible. I suspect that if Ray were simple and usable enough, sprinkling some Ray API calls into pandas would take far less time and effort than developing something like Koalas. The downside is that adding Ray inside pandas would tie pandas to Ray, even on a single machine, because Ray cannot do what OpenMP does: if OpenMP is supported, great, and if not, the code still runs unaffected.
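To make the "sprinkle some Ray calls into pandas" idea concrete, here is a rough sketch of my own (an illustration only, not how Koalas or pandas actually work internally): split a DataFrame into chunks, hand each chunk to a Ray task, and concatenate the results.

import numpy as np
import pandas as pd
import ray

ray.init()

@ray.remote
def process_chunk(chunk):
    # Stand-in for an expensive per-row pandas operation.
    return chunk.assign(double=chunk["value"] * 2)

df = pd.DataFrame({"value": range(1_000_000)})

# Split into chunks, process them in parallel, then reassemble.
chunks = np.array_split(df, 8)
result = pd.concat(ray.get([process_chunk.remote(c) for c in chunks]))

print(result.head())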

I have rambled on a lot, but the real intent was to step back from all these specific engines and think about what a distributed computing framework really is, what problem each one was designed to solve, and what its strengths and weaknesses are. Let me close with a view from a heavyweight. In his talk "A New Golden Age for Computer Architecture", David Patterson argued that general-purpose hardware is approaching its limits, and that to gain further efficiency we need Domain Specific Architectures. This is an era of domain-specific computing architectures: each architecture exists to solve the problems of its own domain and necessarily contains optimizations specific to those problems. Generality is not the user's starting point for solving a problem; it is more of a framework designer's "wishful thinking", while users only ever care about their own problem domain. In that sense, domain-oriented computing architecture should be the right direction.

Disclaimer: my knowledge is limited, and some of the statements above may be incorrect. Criticism and corrections are welcome.

Origin yq.aliyun.com/articles/704637