At the end of the article, computing power | tool recommendation: high-performance wheels designed for GPU

https://mp.weixin.qq.com/s/w1iN4PgA-cp75lAihcr2aw

By 超神经

GPU 和数据库各有所长,GPU 擅长处理机器学习等任务,而数据库擅长有特定要求的计算,比如复杂的连接计算。

目前有一些提供 GPU 加速的数据库解决方案产品,其中有大家熟悉的 MapD、Kinetica,我们今天要介绍是一款年轻的开源产品 BlazingSQL。

At the end of the article, computing power | tool recommendation: high-performance wheels designed for GPU

BlazingSQL is a GPU-accelerated database query tool built on RAPIDS. BlazingSQL extends RAPIDS and enables users to run SQL queries directly on Apache Arrow in GPU memory.

In addition to the degree of GPU adaptation and speed, which is much faster than other similar products, most SQL data warehouses require enterprises to extract and copy data by themselves, while BlazingDB can directly read data from Apache Parquet, which simplifies data channels Architecture can also support high-performance loads.

More importantly, BlazingSQL has also received investment from NVIDIA and Samsung, and has maintained a very good cooperative relationship with NVIDIA.

Performance evaluation

To compare the performance between tools, you have to compare the bechmark test, first run an end-to-end analysis workload.

  • The steps are: data lake> FTL feature engineering> XGBoost training

  • We built two clusters at comparable prices on GCP, using Apache Spark and BlazingSQL respectively.

At the end of the article, computing power | tool recommendation: high-performance wheels designed for GPU

*The final result is that BlazingSQL runs 5 times faster than Apache Spark.

*Under the same workload, the new version runs 20 times faster than Apache Spark.

At the end of the article, computing power | tool recommendation: high-performance wheels designed for GPU

A good horse with a good saddle

The reason why Blazing SQL can get efficient running results is also because GCP's T4 GPU is used extravagantly, which is a new entry-level GPU that is cheap but has strong performance.

Using the new T4 GPU cut our costs by half, and in order to keep the price consistent, we reduced the Apache Spark cluster to 4 CPU nodes.

At the end of the article, computing power | tool recommendation: high-performance wheels designed for GPU

But the final result is that even if the GPU memory is halved, the overall workload will be significantly accelerated.

Blazing SQL engineers also developed a GPU execution kernel built specifically for GPU DataFrames (GDF), which is called the "SIMD expression interpreter."

It takes a lot of space to describe the SIMD expression interpreter. I will simply share some details here about how it works and why it produces such a performance improvement.

The performance improvement of the SIMD expression interpreter is mainly through these key steps:

  1. The machine supports multiple inputs. These inputs can be GDF columns, text, and functions.

  2. When loading these inputs, the SIMD expression interpreter optimizes the allocation of registers on the GPU, which increases the occupancy rate of the GPU and ultimately improves performance.

  3. In addition, the virtual machine processes these inputs and generates multiple outputs simultaneously. For example, assuming the following SQL query: SELECT colA + colB * 10, sin(colA) — cos(colD) FROM tableA

It is these efforts that make BlazingSQL have such a big improvement in efficiency.

Free GPU computing power

Happy Lantern Festival!
Nerve Miss Sister sent the calculation benefits of the Lantern Festival!

Our partner manufacturers are conducting internal testing activities for the public cloud of machine learning.
Currently, 50 internal testing places are open, including the usage time of CPU and GPU (NVIDIA T4)!

Add the WeChat of Miss Nervous Sister (without verification) to get the registration invitation code
At the end of the article, computing power | tool recommendation: high-performance wheels designed for GPU

Super Nervous Encyclopedia

Similarity Measure

The similarity measure is used to estimate the degree of similarity between different samples, and is often used as a criterion for classification problems.

In machine learning and data mining, you need to know the size of the differences between individuals, and then evaluate the similarity and category of individuals.

The most common ones are correlation analysis in data analysis, classification and clustering algorithms in data mining, such as K nearest neighbors and K means.

Depending on the characteristics of the data, different measurement methods can be used.

At the end of the article, computing power | tool recommendation: high-performance wheels designed for GPU

Guess you like

Origin blog.51cto.com/14929242/2535594