[Paper Reading] SketchML: Accelerating Distributed Machine Learning with Data Sketches


csdn:https://blog.csdn.net/qq_36645271

github:https://github.com/aimi-cn/AILearners



Title: SketchML: Accelerating Distributed Machine Learning with Data Sketches
Venue: 2018 ACM SIGMOD International Conference on Management of Data


I. Summary

  1. Background: many distributed machine learning algorithms trained with stochastic gradient descent (SGD) must exchange gradients over the network, so compressing the transmitted gradients is important.
  2. Problem: existing compression methods do not fit gradients that are sparse and non-uniformly distributed, even though gradient values can tolerate low precision.
  3. Question raised: is there a compression method that works effectively for sparse, non-uniformly distributed gradients composed of key-value pairs?
  4. Proposed solution:
    1. Quantile-Bucket Quantification to compress the gradient values.
    2. MinMaxSketch to compress the bucket indexes and resolve hash collisions.
    3. Delta-Binary Encoding to compress the gradient keys in an incremental manner.
  5. Contribution: the first work to combine data sketches with ML; experiments show the method is up to 10x faster than existing methods.

II. Introduction

  1. Background and Motivation
    With the unprecedented growth of data volumes, centralized systems can no longer run ML tasks effectively, so deploying ML in a distributed environment is inevitable. In this setting, a key question is how to exchange gradients efficiently between nodes, because communication often dominates the total cost.
  • Case 1: large models
  • Case 2: cloud computing environments
  • Case 3: geo-distributed machine learning
  • Case 4: Internet of Things
    In all of the above cases, it is important to reduce the gradient traffic over the network while preserving the correctness of the algorithm. Compression techniques are generally used for this purpose. Existing compression methods fall into two categories: lossless and lossy. Lossless methods work on repeated integer data and cannot be applied to the non-repeating floating-point gradient values or to the gradient keys. Lossy methods compress floating-point gradients with threshold-based truncation or quantization strategies, but threshold-based truncation is too aggressive and can prevent the ML algorithm from converging.
    From the above analysis, existing compression methods are not robust enough for large-scale gradient optimization. Facing this challenge, we study the question: what data structure should we use to compress the key-value pairs of a sparse gradient vector?
  2. Summary of Technical Contributions
    1. Data model
      We focus on machine learning algorithms trained with stochastic gradient descent (SGD), such as logistic regression and support vector machines. In the distributed setting, we adopt a data-parallel strategy and partition the dataset across W workers.
    2. How to compress gradient values
      The first goal is to compress the values of the gradient key-value pairs. Since uniform quantization does not fit the non-uniform distribution of gradients, an alternative is the sketch, a probabilistic data structure widely used for analyzing data streams. Existing sketch algorithms include the Quantile Sketch and the Frequency Sketch: the Quantile Sketch estimates the distribution of items, while the Frequency Sketch estimates the frequency of items.
    3. How to compress gradient keys
      The second goal is to compress the keys of the gradient key-value pairs. Unlike the gradient values, which can tolerate low precision, the gradient keys are sensitive to errors. Therefore a lossless compression method is needed for the gradient keys, otherwise the convergence of the optimization algorithm cannot be guaranteed. Since the key-value pairs are sorted by key, i.e. the keys are in ascending order, we propose storing the keys in a delta format.
    4. Evaluation
      To evaluate the proposed method systematically, we implemented a prototype on Spark. On a real cluster at Tencent, we ran a series of machine learning jobs on two large datasets. The proposed sketch-based framework is 2-10x faster than state-of-the-art methods.

III. Preliminaries

  1. Symbol Definitions
  • W: number of workers.
  • N: number of training instances.
  • D: dimensionality of the model.
  • g: a gradient vector.
  • d: number of nonzero dimensions of the gradient vector.
  • (kj, vj): the j-th nonzero key-value pair of the sparse gradient vector.
  • m: size of the quantile sketch.
  • q: number of quantile split points.
  • s, t: number of rows and columns of the MinMaxSketch; s is the number of hash tables and t is the number of bins in each hash table.
  • r: number of groups of the MinMaxSketch.
  2. Quantile Sketch
    A Quantile Sketch is a compact data structure used to approximate the exact distribution of item values. Its main component is the quantile summary, which consists of a small subset of the original data points. A quantile summary supports two main operations, merge and prune: merge combines two quantile summaries into one, while prune reduces the number of data points after a merge so that the summary does not exceed its maximum size.
  3. Frequency Sketch
    Another common situation in data streams is repeated items. Since the range of item values can be very large, it is impossible to store every possible item, so the Frequency Sketch was proposed to estimate the frequency of distinct item values. (A toy Python illustration of both structures follows.)
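To make the two structures concrete, below is a toy Python illustration of a quantile summary with merge/prune operations and a Count-Min-style frequency sketch. The class names, size limits, and the uniform-subsampling prune rule are simplifications chosen for readability, not the exact algorithms used in the paper.

```python
import numpy as np

# Simplified quantile summary: keeps at most `max_size` sorted points.
# Real quantile sketches (e.g. GK) maintain rank-error bounds; this toy
# version simply subsamples uniformly in the prune step.
class SimpleQuantileSummary:
    def __init__(self, points=None, max_size=100):
        self.max_size = max_size
        self.points = sorted(points) if points else []

    def merge(self, other):
        # merge: combine two summaries, then prune back to the size budget
        return SimpleQuantileSummary(self.points + other.points,
                                     self.max_size).prune()

    def prune(self):
        # prune: keep max_size evenly spaced points so the summary stays small
        if len(self.points) > self.max_size:
            idx = np.linspace(0, len(self.points) - 1, self.max_size).astype(int)
            self.points = [self.points[i] for i in idx]
        return self

    def quantile(self, phi):
        # approximate phi-quantile for phi in [0, 1]
        return self.points[int(phi * (len(self.points) - 1))]


# Count-Min-style frequency sketch: s hash tables of t counters each;
# the frequency estimate is the minimum counter across the s rows.
class FrequencySketch:
    def __init__(self, s=3, t=1024, seed=0):
        self.s, self.t = s, t
        self.seeds = [int(x) for x in
                      np.random.RandomState(seed).randint(1, 1 << 30, size=s)]
        self.table = np.zeros((s, t), dtype=np.int64)

    def _bin(self, i, key):
        return hash((self.seeds[i], key)) % self.t

    def insert(self, key):
        for i in range(self.s):
            self.table[i, self._bin(i, key)] += 1

    def query(self, key):
        return int(min(self.table[i, self._bin(i, key)] for i in range(self.s)))
```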

IV. SketchML Framework

  1. Framework Overview
    The framework has three main components: Quantile-Bucket Quantification, MinMaxSketch, and Delta-Binary Encoding. The first two components together compress the gradient values, while the third compresses the gradient keys.
  • Encoding phase.
    1. A Quantile Sketch is used to generate candidate split points, with which we bucket-sort the gradient values.
    2. Each gradient value is represented by the index of its bucket.
    3. The bucket indexes are inserted into the MinMaxSketch by applying hash functions to the keys.
    4. The gradient keys are converted into delta keys, i.e. each key is represented by its increment over the previous key.
    5. The delta keys are written with a binary encoding that uses fewer bytes than a four-byte integer.
  • Decoding phase.
    1. The delta keys are restored to the original keys.
    2. The recovered keys are used to query the MinMaxSketch.
    3. The sketch returns a bucket index for each key.
    4. The bucket index is used to look up the bucket value and recover the gradient value.
  2. Quantile-Bucket Quantification
  • Step 1: quantile split.
    1. We scan all the gradient values and insert them into a quantile sketch.
    2. We extract q candidate quantiles from the quantile sketch. Specifically, we generate q evenly spaced ranks {0, 1/q, 2/q, ..., (q-1)/q}.
    3. We use these quantile values together with the maximum value as the split points, written {rank(0), rank(1/q), rank(2/q), ..., rank(1)}. Note that the number of items lying between two consecutive splits is N/q, which means the splits are chosen by item count rather than by value, so each interval between two splits contains the same number of gradient values.
  • Step 2: bucket sort.
    1. We call each interval between two splits a bucket. The smaller split is the lower threshold of the bucket and the larger split is the upper threshold.
    2. Given the bucket thresholds, each gradient value belongs to exactly one bucket. For example, the value 0.21 in Figure 3 falls into the fourth bucket.
    3. Each bucket is represented by its mean, i.e. the average of the values in it.
    4. Each gradient value is converted to the mean of its bucket.
  • Step 3: index encoding.
    Although we quantize the gradient values with the bucket means, the space consumed is unchanged. To reduce the space cost, we instead store the bucket index: each bucket mean is encoded as its bucket index. For example, after 0.21 is quantized to the mean of the fourth bucket, we encode it by the bucket index (starting from zero), so 0.21 corresponds to the number 3.
  • Step 4: binary encoding.
    Normally the number of buckets is a small integer, so we compress the bucket indexes by encoding them as binary numbers. If q = 256, one byte is enough to encode a bucket index. In this way the space occupied by the indexes shrinks and the amount of transmitted data is greatly reduced. (A minimal sketch of the whole bucketization procedure follows this list.)
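Below is a minimal Python sketch of quantile-bucket quantification, assuming exact quantiles from np.quantile; the paper instead derives the split points from an approximate Quantile Sketch, and the function names and the q = 256 default are illustrative.

```python
import numpy as np

def encode_values(values, q=256):
    values = np.asarray(values, dtype=np.float64)
    # Step 1: quantile split - q+1 split points so each bucket holds ~len/q items
    splits = np.quantile(values, np.linspace(0.0, 1.0, q + 1))
    # Step 2: bucket sort - assign each value to the bucket whose thresholds contain it
    idx = np.clip(np.searchsorted(splits, values, side="right") - 1, 0, q - 1)
    # Step 3: represent each bucket by the mean of the values that fall into it
    means = np.array([values[idx == b].mean() if np.any(idx == b) else 0.0
                      for b in range(q)])
    # Step 4: binary encoding - with q = 256, one byte per bucket index suffices
    encoded = idx.astype(np.uint8)
    return encoded, means

def decode_values(encoded, means):
    # recover each gradient value as the mean of its bucket
    return means[encoded.astype(np.int64)]
```

Decoding only needs the q bucket means plus one byte per gradient value, which is where the space saving comes from.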
  3. MinMaxSketch
  • Insertion phase.
    1. Each input item consists of an original key and an encoded bucket index, (kj, b(vj)).
    2. s hash functions are used to compute hash codes. In Figure 5 there are three hash functions, h1(-), h2(-), and h3(-).
    3. After a hash bin is selected in the i-th hash table, the current value H(i, hi(kj)) is compared with b(vj). If H(i, hi(kj)) > b(vj), the current value is replaced with b(vj); otherwise the current value is left unchanged.
  • Query phase.
    1. The input is a gradient key, denoted kj. The s hash functions are applied to kj, and each hash function selects one hash bin from its hash table.
    2. Given the s candidates from the different rows, the largest one is chosen as the final result. In Figure 5 the three candidates are {0, 2, 2}, and we choose 2 as the result. (A minimal sketch of this structure follows this list.)
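The insert/query logic above can be sketched as follows. This is a toy Python version with illustrative hashing and table sizes, not the paper's implementation; bins are initialized to a value larger than any bucket index so that the minimum rule works on the first insertion.

```python
import numpy as np

class MinMaxSketch:
    def __init__(self, s=3, t=1 << 16, q=256, seed=0):
        self.s, self.t = s, t
        self.seeds = [int(x) for x in
                      np.random.RandomState(seed).randint(1, 1 << 30, size=s)]
        # initialize every bin to q, which is larger than any bucket index
        self.table = np.full((s, t), q, dtype=np.int32)

    def _bin(self, i, key):
        return hash((self.seeds[i], key)) % self.t

    def insert(self, key, bucket_index):
        for i in range(self.s):
            j = self._bin(i, key)
            # keep the smaller bucket index when a collision occurs
            if bucket_index < self.table[i, j]:
                self.table[i, j] = bucket_index

    def query(self, key):
        # stored values can only under-estimate, so take the maximum candidate
        return max(int(self.table[i, self._bin(i, key)]) for i in range(self.s))
```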
  4. Delta-Binary Encoding
    An analysis of the distribution of gradient keys shows that they have three properties. First, the keys are non-repeating. Second, the keys are in ascending order. Third, although the keys can be very large in many high-dimensional applications, the differences between adjacent keys are much smaller. Based on these three observations, we propose to store only the increments of the keys.
  • Step 1: delta encoding.
    The gradient keys are stored in an array. We scan the array from beginning to end and compute the difference between every two adjacent keys. The result is the key increments, which we call the delta keys.
  • Step 2: binary encoding.
    After delta encoding, the delta keys are obviously much smaller than the original keys. However, if we store the delta keys as integers or long integers, the compression is pointless because the memory and communication costs stay the same. To address this, we allocate different amounts of space to different delta keys and encode them in a binary format. (A minimal sketch of this encoding follows.)
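As an illustration, here is a minimal, hypothetical delta-binary encoder and decoder in Python. It uses a varint-style scheme (7 payload bits per byte, with the high bit as a continuation flag), whereas the paper assigns fixed byte widths to groups of delta keys; the idea of spending fewer bytes on small increments is the same.

```python
def encode_keys(keys):
    out = bytearray()
    prev = 0
    for k in keys:
        delta = k - prev          # keys are ascending, so delta >= 0
        prev = k
        while True:
            byte = delta & 0x7F
            delta >>= 7
            out.append(byte | (0x80 if delta else 0x00))  # set high bit if more bytes follow
            if not delta:
                break
    return bytes(out)

def decode_keys(data):
    keys, prev = [], 0
    delta, shift = 0, 0
    for byte in data:
        delta |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:       # last byte of this delta
            prev += delta         # restore the original key from the running sum
            keys.append(prev)
            delta, shift = 0, 0
    return keys
```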

V. Experiments
  1. Experimental Setup
  • Implementation. We implemented a prototype on Spark. The training dataset is partitioned across the executors. Each executor reads its subset and computes gradients. The driver aggregates the gradients from the executors, updates the trained model, and broadcasts the updated model back to the executors. This process iterates until convergence. (A simplified simulation of this loop is sketched after this list.)
  • Clusters. Two clusters were used in the experiments. Cluster 1 is a ten-node cluster in our laboratory, used to evaluate the effectiveness of the proposed method. Cluster 2 is a 300-node production cluster at Tencent, used to compare the end-to-end performance of the three models.
  • Datasets. Three datasets were used. The first, KDD10, is a public dataset released by KDD CUP 2010 with 19 million instances and 29 million features. The second, KDD12, is the next generation of KDD10. The third, CTR, is a proprietary dataset from Tencent.
  • Statistical models. We chose three commonly used machine learning models: L2-regularized logistic regression (LR), support vector machine (SVM), and linear regression (Linear).
  • Baselines. SketchML was compared with two competitors: Adam SGD and ZipML.
  • Metrics. The average run time per epoch, and the loss function versus run time.
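For intuition, here is a toy pure-Python simulation of the data-parallel loop described in the Implementation item (synthetic linear-regression gradients, illustrative function names); the actual prototype runs this loop on Spark executors and a driver.

```python
import numpy as np

def simulate_training(X, y, num_workers=4, lr=0.1, epochs=10):
    d = X.shape[1]
    w = np.zeros(d)                       # model held by the "driver"
    parts = np.array_split(np.arange(X.shape[0]), num_workers)
    for _ in range(epochs):
        grads = []
        for p in parts:                   # each "executor" computes a local gradient
            pred = X[p] @ w
            grads.append(X[p].T @ (pred - y[p]) / len(p))  # linear-regression gradient
        w -= lr * np.mean(grads, axis=0)  # driver aggregates and updates the model
    return w
```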
  2. Efficiency of the Proposed Method
    1. Run time
      According to the results shown in the figure, the proposed method significantly speeds up the execution of three different ML algorithms.
    2. Message size and compression ratio
      The main benefit of compression is a smaller message size. The figure shows the average message size and the compression ratio during execution.
    3. CPU overhead
      To evaluate the computational overhead introduced by compression, we ran an experiment whose results are shown in the figure. Our method increases CPU usage by 25% on average and has no obvious impact on peak CPU usage.
    4. Effect of batch size and sparsity
      Since our method compresses sparse gradients, a natural question is how data sparsity affects performance. In our setting, the sparsity of the gradient is influenced by the batch size, so we vary the sparsity by changing the batch size.
      The communication cost of delta-binary encoding is directly affected by data sparsity, so we also record the performance of delta-binary encoding as the data sparsity varies.
  3. End-to-End Performance
    1. KDD12 dataset
    • Logistic regression. As shown in the figure, SketchML runs much faster than Adam and ZipML in both run time and convergence speed.
    • Support vector machine. The SVM results are similar to those of logistic regression. Adam is the slowest, followed by ZipML.
    • Linear regression. For linear regression, Adam and ZipML take 903 and 330 seconds per epoch respectively, while SketchML needs only 96 seconds.
    2. CTR dataset
    • Logistic regression. On this larger dataset, Adam is still the slowest, followed by ZipML. SketchML is 3.8x and 2.7x faster than the other two methods, respectively.
    • Support vector machine. SketchML is 4.59x and 3.88x faster than Adam and ZipML. Compared with logistic regression and linear regression, SVM converges more easily on this dataset.
    • Linear regression. SketchML takes 32 seconds to train one epoch of linear regression, while Adam and ZipML take 97 and 78 seconds.
  4. Scalability
    Since scalability is roughly the same for logistic regression, SVM, and linear regression, we only report the scalability of logistic regression. As shown in the figure, the performance of all three methods improves as the number of executors increases from 5 to 10.

VI. Summary
To accelerate distributed machine learning, this paper presents a sketch-based approach, SketchML, to compress the gradient key-value pairs exchanged over the network. First, a method based on a quantile sketch and bucket sort is presented to represent the gradient values with smaller, binary-encoded bucket indexes. Then, the MinMaxSketch algorithm is designed to compress the bucket indexes approximately. In addition, a delta-binary encoding method is proposed for the gradient keys. The error of the method is analyzed theoretically. Experimental results on a range of large-scale datasets and machine learning algorithms show that SketchML can be up to 10x faster than state-of-the-art methods.



Origin: blog.csdn.net/qq_36645271/article/details/91044383