Heterogeneous Computing Series (2): Heterogeneous acceleration technologies emerging in the field of machine learning


Author | Yi Xiaomeng, Guo Rentong

Planning | Yuying

"Heterogeneous computing" refers to computing that jointly uses processors of different architectures within one system. In the AI field, common processors include CPUs (x86, Arm, RISC-V, etc.), GPUs, FPGAs, and ASICs, listed in descending order of generality. This article, the second in our heterogeneous computing series, focuses on heterogeneous acceleration technologies emerging in the field of machine learning.

Machine Learning and Heterogeneous Computing

In the field of machine learning, the application of heterogeneous computing has attracted much attention from industry and academia in recent years. Against the backdrop of rapid data growth, heterogeneous computing is an important way to improve the efficiency of both the "people" and the "machines" in the development of machine learning applications. This article walks through the closed development loop of a machine learning application and introduces recently emerged heterogeneous acceleration technologies for each stage.

[Figure: the closed development loop of a machine learning application]

As shown in the figure above, the closed development loop of a machine learning application includes data integration, feature extraction, model design, training, and validation. First, the raw data is collected and organized; it is then analyzed and features are extracted to serve as model input. During model design, the model type, optimization algorithm, and configuration parameters must be chosen. After training, data scientists adjust the upstream stages based on validation results, for example by adding new data sources, expanding the feature set, or revising the model choice and parameters; the model is then retrained and revalidated, iterating until satisfactory results are obtained.

First, the "people". The saying "as much intelligence as there is manual labor" describes a common phenomenon in production: the process above contains many manual decision points, and data scientists must make sound choices based on expertise and experience. Because application scenarios are so diverse, generic designs usually cannot meet the specific needs of a machine learning system in each scenario; data scientists must study the actual problem through extensive observation and analysis, plus many rounds of trial and tuning, before arriving at a truly suitable design. As machine learning methods and application scenarios grow richer, data scientists face an unprecedented number and difficulty of decisions. As the work gets harder, human effort weighs ever more heavily on the development efficiency of machine learning systems and can even become the bottleneck of the whole process.

Now the "machines". From the machine's perspective, the iteration above involves heavy data processing and computation. Data integration requires correlating and cleaning large volumes of multi-dimensional data from multiple sources. Feature extraction requires statistical analysis of the raw data and the construction and encoding of feature data, which involve many floating-point and matrix operations. Model training and validation involve training and inference computations with large amounts of numerical, matrix, and floating-point work. Rapid data growth places ever higher demands on the data-processing performance of the computer system, and the computational efficiency of these stages directly affects both the efficiency of the humans in the loop and the overall iteration speed of the machine learning system.

Heterogeneous acceleration brings large headroom for improving the efficiency of both "people" and "machines". Current heterogeneous algorithms cover data integration, feature extraction, model training, and other stages. Compared with traditional CPU-based algorithms, heterogeneous parallel algorithms can achieve one to two orders of magnitude of speedup, significantly improving machine efficiency. They also help data scientists obtain results faster and can effectively accelerate AutoML's search of the solution space, improving design and tuning efficiency.

The following sections introduce emerging heterogeneous computing technologies for data integration, feature extraction, model design and tuning, and model training.

Data Integration

Data integration sits at the upstream end of the machine learning development process and includes data source integration, data extraction, and data cleaning. Because application scenarios differ widely, data sources and data types are numerous and complex, and the methods and tools involved at this stage are correspondingly rich. Databases, data processing engines, and data analysis libraries play central roles, handling data aggregation and interfacing, general-purpose data processing, and customized data processing, respectively.

On the database side, ZILLIZ launched MegaWise [1][2], a GPU analysis engine for the PostgreSQL ecosystem; Alibaba provides GPU acceleration in AnalyticDB [3]; and BlazingSQL [4] builds GPU-accelerated SQL analysis on the RAPIDS [5] engine. Recent heterogeneous acceleration work in the database field concentrates on analytical processing (AP). These new analysis engines achieve 10x to 100x speedups on specific workloads such as data loading, transformation, filtering, aggregation, and joins.

On the data processing engine side, Spark 3.0 will introduce GPU-aware scheduling [6], and the preview release also adds a columnar processing mode to SparkR and Spark SQL. Heterogeneous resource scheduling and columnar processing lay a solid foundation for GPU-accelerating Spark's core components, and also make it possible for advanced users with customized needs to accelerate their UDFs.

On the data analysis library side, NVIDIA launched cuDF [7]. A large-scale refactoring began with version 0.10: while the performance of the underlying library kept improving, the Python-layer API was also extended. As of the current 0.13 release, a set of pandas-like APIs has gradually taken shape, and the interface is now mature enough to support collaborative data processing between pandas and cuDF.
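As a rough illustration of that collaboration, the sketch below uses plain pandas; because cuDF mirrors the pandas API for calls like these, the intent is that on a GPU machine `cudf` can stand in for `pandas` (method coverage varies by cuDF version, so treat the swap as an assumption to verify):

```python
import pandas as pd  # on a GPU machine the idea is: import cudf as pd

df = pd.DataFrame({
    "user": ["a", "a", "b", "b", "c"],
    "amount": [10.0, 20.0, 5.0, 15.0, 40.0],
})

# Typical integration-stage work: filter rows, then aggregate per key.
active = df[df["amount"] >= 10.0]
totals = active.groupby("user")["amount"].sum()

print(totals.to_dict())  # {'a': 30.0, 'b': 15.0, 'c': 40.0}
```

The same filter/group-by/aggregate pattern is precisely the workload class for which the GPU engines above report their largest speedups.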

For specific data types, OpenCL provides GPU acceleration capabilities for image processing [8], NVIDIA offers a GPU-accelerated string processing library in the cuStrings project [9], and ZILLIZ will ship a GPU acceleration engine for geospatial data processing in its upcoming open-source Arctern project [10].

Feature Extraction

Feature extraction distills the key information in the raw data and encodes it into structured data, which then serves as model input during training and validation. The computations involved mainly cover feature analysis, transformation, and selection: computing statistics such as mean, variance, covariance, and correlation coefficients; data transformations such as normalization and whitening; and feature selection operations such as PCA and SVD. These operations generally apply the same or similar processing to large volumes of data, making them well suited to heterogeneous acceleration.
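Operations like normalization and SVD-based PCA reduce to dense array math, which is exactly the kind of workload these GPU libraries parallelize. A minimal CPU sketch in NumPy (the GPU analogues of these calls live in cuPy and cuML):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # 100 samples, 5 raw features

# Normalization: zero mean, unit variance per feature column.
Xn = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD: project onto the top-2 principal components.
U, S, Vt = np.linalg.svd(Xn, full_matrices=False)
X2 = Xn @ Vt[:2].T

print(X2.shape)  # (100, 2)
```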

For statistical feature analysis, cuDF [7] provides interfaces for common statistics such as extrema, mean, variance, kurtosis, and skewness. cuDF also supports UDFs: a UDF is JIT-compiled into a CUDA kernel and executed on the GPU, enabling user-defined feature analysis. This capability is currently weaker than pandas UDFs and supports only numerical and Boolean computation.
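As a definitional sketch of what those reductions compute, below are population-form skewness and excess kurtosis in NumPy. cuDF's `Series.skew()` and `Series.kurt()` perform the same kind of reduction on the GPU, though with bias corrections that can differ numerically from this simple form:

```python
import numpy as np

def skewness(x):
    # Third standardized central moment (population form).
    x = np.asarray(x, dtype=float)
    m, s = x.mean(), x.std()
    return ((x - m) ** 3).mean() / s ** 3

def kurtosis(x):
    # Fourth standardized central moment, excess form (normal -> 0).
    x = np.asarray(x, dtype=float)
    m, s = x.mean(), x.std()
    return ((x - m) ** 4).mean() / s ** 4 - 3.0

data = [1.0, 2.0, 3.0, 4.0, 100.0]  # heavy right tail
print(skewness(data) > 0)  # True: long right tail => positive skew
```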

For data transformation, the cuPy project [11], developed in the Chainer ecosystem by Preferred Networks, targets high-dimensional array computation. It organizes high-dimensional data with an ndarray structure and provides a large set of GPU-accelerated operations on top of it, including common transformations such as the Fourier transform and linear-algebra matrix operations.

For feature selection, NVIDIA's cuML project provides a library of GPU-accelerated machine learning primitives. Since its release in 2018, the project has steadily extended heterogeneous acceleration support for common machine learning algorithms, and it currently includes feature and component analysis methods such as SVD, PCA, UMAP, t-SNE, and Random Projection.

Model Design and Tuning

After feature extraction, data scientists design and tune the machine learning model according to the actual problem and the characteristics of the training data. Model design includes choosing the model type, the algorithm used to solve the optimization problem during training, and the model parameters. After training, the model's accuracy must be validated, and the design is iteratively tuned accordingly. In traditional machine learning systems, this stage is driven entirely by human decisions, so its efficiency depends heavily on the expertise and experience of data scientists and algorithm engineers.

To reduce the dependence on manpower and domain expertise, academia and industry have devoted considerable attention and effort to AutoML in recent years. AutoML aims to automate model design: it searches the model design space automatically based on validation results, converging toward a near-optimal model choice and configuration. By reducing manual involvement, AutoML promises to improve the efficiency of the machine learning iteration loop.
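The core of such a search loop can be sketched in a few lines of plain Python. The design space and `evaluate` function below are toy stand-ins; in a real AutoML system `evaluate` is a full train-plus-validate run, which is why accelerating training (next section) also accelerates the search itself:

```python
import random

# Hypothetical design space: a model "depth" and a learning rate.
SPACE = {"depth": [2, 4, 8, 16], "lr": [0.001, 0.01, 0.1]}

def evaluate(cfg):
    # Stand-in for train + validate; returns a validation score.
    # (Toy objective: prefer depth 8 and lr 0.01.)
    return -abs(cfg["depth"] - 8) - 100 * abs(cfg["lr"] - 0.01)

def random_search(n_trials, seed=0):
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {k: rng.choice(v) for k, v in SPACE.items()}
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg

print(random_search(200))  # best configuration found by the search
```

Smarter strategies (Bayesian optimization, successive halving) replace the random draw, but the train-evaluate-update loop stays the same, so each saved training second multiplies across every trial.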

No heterogeneous acceleration project or library dedicated to AutoML has appeared yet. However, whether model design is done manually or automatically, it requires many iterations of model training and validation, and there heterogeneous computing is already widely used to accelerate the computation.

Model Training

Model training involves a large number of compute-intensive tasks, whose load depends not only on the algorithm's logic but also on the size of the training and validation data. With the explosive growth of data, the training workload has risen sharply, and traditional CPU-based solutions face serious challenges in performance, hardware cost, and energy consumption. Heterogeneous acceleration has therefore become an important way to address these challenges, and faster model training directly improves the efficiency of the human-in-the-loop stages of model iteration.

For dataset handling, cuML provides train_test_split, which behaves like the corresponding scikit-learn interface and splits data into training and test sets.
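The semantics of such a split are simple to state. Below is a minimal NumPy stand-in; the cuML and scikit-learn versions add options such as shuffling control and stratification:

```python
import numpy as np

def train_test_split(X, y, test_size=0.25, seed=0):
    # Shuffle indices, then reserve a `test_size` fraction as the test set.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3)
print(len(X_tr), len(X_te))  # 7 3
```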

On the algorithm side, cuML offers a suite of GPU-accelerated ML algorithms. As of the current 0.13 release, it covers common algorithms such as linear regression, SGD, random forest, SVM, and k-means, and also supports time-series forecasting with three models: Holt-Winters, Kalman filter, and ARIMA. In early versions, cuML's support for large models or large training sets was unsatisfactory due to GPU memory limits. Since version 0.9, cuML has provided a multi-node / multi-GPU (MNMG) scheme; current MNMG algorithms include k-means, SVD, PCA, KNN, and random forest.
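To make the workload concrete, here is a toy Lloyd's-iteration k-means in NumPy. cuML's GPU version performs the same distance computations and mean reductions, which parallelize naturally across points and are where the speedup comes from:

```python
import numpy as np

def kmeans(X, k, n_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: label each point with its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: move each center to the mean of its cluster.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# Two well-separated blobs; k-means should recover them as the clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
centers, labels = kmeans(X, k=2)
```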

For tree-based algorithms, XGBoost began GPU-accelerating its algorithms as early as the end of 2016 and added multi-GPU support in 2017. Recent cuML versions have also optimized the performance of tree-based algorithms [12], and since version 0.10 cuML can interoperate with XGBoost's GPU-accelerated algorithms [13].
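Enabling XGBoost's GPU training has historically been a one-parameter change. The fragment below is a sketch only: it assumes the `xgboost` Python package and a CUDA-capable GPU (the library calls are commented out so the fragment stands alone), and parameter names have shifted across XGBoost releases (newer versions use `device="cuda"` with `tree_method="hist"`), so check the documentation for the release you use:

```python
# import xgboost as xgb  # requires the xgboost package and a CUDA GPU

params = {
    "objective": "binary:logistic",
    "max_depth": 6,
    "tree_method": "gpu_hist",  # GPU histogram algorithm (name varies by version)
}
# dtrain = xgb.DMatrix(X_train, label=y_train)
# booster = xgb.train(params, dtrain, num_boost_round=100)
```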

Summary and Outlook

Heterogeneous computing shows great potential for improving the efficiency of both "people" and "machines" in the closed development loop of machine learning applications, and some libraries, systems, and products are already deployed in production. However, heterogeneous computing in AI is still in a period of rapid development; further enriching the tool chain and improving integration with existing ecosystems are major challenges for adoption. The main driving force today is technology giants such as NVIDIA, but a group of emerging teams such as ZILLIZ, Kinetica, and OmniSci has also appeared, and mainstream computing frameworks such as Spark are gradually improving native support for heterogeneous computing. It is foreseeable that heterogeneous computing will become an important technology trend in AI, playing a crucial role in improving product iteration efficiency and reducing hardware and labor costs.

Related links:

[1] MegaWise introduction https://zilliz.com/cn/docs/megawise_intro

[2] A first look at MegaWise: query optimization and compilation for heterogeneous computing https://zhuanlan.zhihu.com/p/100407033

[3] How Alibaba achieves real-time analysis of massive data: AnalyticDB https://www.cnblogs.com/barrywxx/p/10141153.html

[4] BlazingSQL https://blazingsql.com/

[5] RAPIDS https://rapids.ai/

[6] Apache Spark 3.0 preview released with several major new features https://www.infoq.cn/article/oBXcj0dre2r3ii415oTr

[7] cuDF https://github.com/rapidsai/cudf

[8] OpenCL https://developer.nvidia.com/opencl

[9] cuStrings https://github.com/rapidsai/custrings

[10] Arctern https://github.com/zilliztech/arctern

[11] cuPy https://cupy.chainer.org/

[12] Accelerating Random Forests up to 45x using cuML https://medium.com/rapids-ai/accelerating-random-forests-up-to-45x-using-cuml-dfb782a31bea

[13] A New, Official Dask API for XGBoost https://medium.com/rapids-ai/a-new-official-dask-api-for-xgboost-e8b10f3d1eb7

About the authors:

Yi Xiaomeng is a senior researcher at ZILLIZ. He received his PhD in computer system architecture from Huazhong University of Science and Technology in 2017 and then joined Huawei's public cloud architecture design team. His main research areas are cloud resource scheduling and heterogeneous resource scheduling, with results published in journals and conferences including IEEE Network Magazine, IEEE TON, IEEE ICDCS, and ACM TOMPECS.

Guo Rentong is Technical Director of ZILLIZ and holds a PhD in Computer Software and Theory from Huazhong University of Science and Technology. His main research areas are heterogeneous computing, cache systems, and distributed systems, with results published in venues including USENIX ATC, ICS, DATE, and IEEE TPDS. After working on Huawei Cloud's deep learning team, he is now building a heterogeneous data analysis system at ZILLIZ.



Origin blog.51cto.com/15060462/2675609