HBase in Practice | Using HBase in Artificial Intelligence Scenarios

In recent years, artificial intelligence has become increasingly popular, especially in combination with big data. Its main application scenarios include image processing, speech, natural language processing, and user profiling. All of these scenarios involve processing massive amounts of data, and the processed data generally needs to be stored. This data has the following main characteristics:

  • Large: the more data we have, the better it is for modeling later;

  • Sparse: each row may have different attributes. In user profile data, for example, attributes vary widely from person to person: user A may have an attribute that user B lacks. We therefore want a storage system in which absent attributes take up no space in the underlying storage, which can save a lot of space;

  • Dynamic columns: the number of columns differs from row to row.
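As a toy illustration of the sparse, dynamic-column shape described above, we can model each user profile as a map that stores only the attributes that user actually has (the class and attribute names below are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class SparseProfiles {
    public static void main(String[] args) {
        // Each row (user) carries only the attributes it actually has;
        // absent attributes take no space, mirroring sparse storage.
        Map<String, Map<String, String>> profiles = new HashMap<>();

        Map<String, String> userA = new HashMap<>();
        userA.put("age", "28");
        userA.put("city", "Hangzhou");
        profiles.put("userA", userA);

        Map<String, String> userB = new HashMap<>();
        userB.put("device", "iPhone"); // userB has an attribute userA lacks
        profiles.put("userB", userB);

        // The two rows have different column sets and different column counts.
        System.out.println(profiles.get("userA").size());            // 2
        System.out.println(profiles.get("userB").size());            // 1
        System.out.println(profiles.get("userB").containsKey("age")); // false
    }
}
```

This is exactly the shape HBase stores natively: each row key maps to its own set of column qualifiers, and missing columns cost nothing.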

To better introduce the use of HBase in artificial intelligence scenarios, the following analyzes a customer case from the AI industry: how to design a system with HBase that can quickly look up facial features.

The company's business scenarios involve a large amount of face feature data: more than 34 million records in total, each about 3.2 KB. The face data is divided into groups, and each face feature belongs to a certain group. There are currently nearly 620,000 face groups, and the number of faces per group ranges from 1 to 10,000. Each group contains different forms of face data of the same person. The distribution of groups and faces is as follows:

  • About 43% of the groups contain a single face record;

  • About 47% of the groups contain 2-9 face records;

  • The remaining groups contain between 10 and 10,000 faces.

The current business needs mainly fall into the following two categories:

  • Find all faces in a group by face group id;

  • Find the data of a specific face by face group id + face id.
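Before looking at concrete storage engines, the two access patterns above can be sketched against a simple in-memory model, a nested map keyed first by group id and then by face id (all names and values are illustrative):

```java
import java.util.Map;
import java.util.TreeMap;

public class FaceLookupModel {
    public static void main(String[] args) {
        // Logical data model: group id -> (face id -> feature bytes)
        Map<String, Map<String, byte[]>> groups = new TreeMap<>();

        Map<String, byte[]> group1 = new TreeMap<>();
        group1.put("faceA", new byte[]{1, 2, 3});
        group1.put("faceB", new byte[]{4, 5, 6});
        groups.put("group1", group1);

        // Query 1: all faces in a group, by group id
        Map<String, byte[]> allFaces = groups.get("group1");
        System.out.println(allFaces.size()); // 2

        // Query 2: one specific face, by group id + face id
        byte[] feature = groups.get("group1").get("faceA");
        System.out.println(feature.length); // 3
    }
}
```

Both queries are a direct lookup on this nested structure; the rest of the article is about which storage system can serve that structure efficiently at scale.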


MySQL + OSS solution

When the business data volume was still relatively small, the storage layer consisted mainly of MySQL and OSS (Object Storage). The related tables were a face group table (group) and a face table (face), with the following formats:

group table:

| group_id | size |
| -------- | ---- |
| 1        | 2    |

face table:

| face_id                            | group_id | feature          |
| ---------------------------------- | -------- | ---------------- |
| "c5085f1ef4b3496d8b4da050cab0efd2" | 1        | "cwI4S/HO/nm6H……" |

Here, feature is the actual face feature data, about 3.2 KB per face, stored as the Base64 encoding of the raw binary data.
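As a rough illustration of why the stored value is about 3.2 KB, recall that Base64 inflates binary data by a factor of 4/3 (the 2,400-byte raw size below is a hypothetical figure chosen for illustration, not a number from the source):

```java
import java.util.Base64;

public class FeatureSize {
    public static void main(String[] args) {
        byte[] rawFeature = new byte[2400]; // hypothetical raw feature bytes
        String encoded = Base64.getEncoder().encodeToString(rawFeature);
        // 2,400 raw bytes -> 3,200 Base64 characters, i.e. roughly the
        // 3.2 KB per-face size mentioned above.
        System.out.println(encoded.length()); // 3200
    }
}
```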

Currently, the correspondence between face group ids and face ids is stored in MySQL, corresponding to the group table above, while the face ids and their feature data are stored in OSS, corresponding to the face table above.

Because the number of faces per group varies widely (1 to 10,000), under the table design above we must store one row per (face group id, face feature id) pair, so data belonging to the same face group is actually spread across many rows in MySQL. For example, if a face group id corresponds to 10,000 face features, 10,000 rows must be stored in MySQL.

To find all the faces in a group by face group id, we must first read many rows from MySQL to obtain the group-to-face mapping, and then fetch the feature data of each of those faces from OSS by face id, as shown in the left part of the figure below.
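To make this query path concrete, here is a small in-memory simulation of the two-hop lookup: a list of (group_id, face_id) rows stands in for MySQL, and a map stands in for OSS (all names are illustrative, and the real systems would add a network round trip per hop):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TwoHopLookup {
    public static void main(String[] args) {
        // "MySQL": one row per (group_id, face_id) pair, so a
        // 10,000-face group occupies 10,000 rows.
        List<String[]> mysqlRows = new ArrayList<>();
        // "OSS": face_id -> feature blob
        Map<String, String> oss = new HashMap<>();

        for (int i = 0; i < 10_000; i++) {
            String faceId = "face" + i;
            mysqlRows.add(new String[]{"group1", faceId});
            oss.put(faceId, "feature-" + i);
        }

        // Hop 1: scan "MySQL" for all face ids of group1 (10,000 rows)
        List<String> faceIds = new ArrayList<>();
        for (String[] row : mysqlRows) {
            if (row[0].equals("group1")) faceIds.add(row[1]);
        }
        // Hop 2: one "OSS" fetch per face id
        List<String> features = new ArrayList<>();
        for (String id : faceIds) features.add(oss.get(id));

        System.out.println(faceIds.size());  // 10000
        System.out.println(features.size()); // 10000
    }
}
```

Even in this toy version, the row explosion and the per-face second hop are visible; with real MySQL and OSS, each hop also pays network latency.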

image

As the query path in the figure above shows, this design leads to a very long request chain. If the queried group contains many faces, we have to scan a large number of rows in MySQL and then fetch the feature data of those faces from OSS. The entire query takes about 10 s, which falls far short of the needs of the rapidly growing business.


HBase solution

The above design scheme has two problems:

  • Data that logically belongs to one record cannot be stored in a single row because of its size, so every query has to access two storage systems;

  • Since MySQL does not support dynamic columns, data belonging to the same face group is split across many rows.

After analyzing the two problems above, we concluded that this is a typical HBase scenario, for the following reasons:

  • HBase supports dynamic columns and scales to trillions of rows and millions of columns;

  • HBase supports multiple versions, and all modifications are recorded in HBase;

  • HBase 2.0 introduced the MOB (Medium-sized Object) feature for storing small files. MOB targets values between 1 KB and 10 MB, such as pictures, short videos, and documents, and offers low latency, strong read/write consistency, good retrieval capability, and easy scaling.

We can use these three features to redesign the MySQL + OSS solution above. Given the two query requirements of the application, we can use the face group id as the HBase rowkey. The system design is shown in the right part of the figure above. When creating the table, we enable the MOB feature as follows:

create 'face', {NAME => 'c', IS_MOB => true, MOB_THRESHOLD => 2048}

The above creates a table named face. The IS_MOB attribute indicates that column family c enables the MOB feature, and MOB_THRESHOLD is the MOB file size threshold in bytes; the setting here means that any value larger than 2 KB is stored as a small file. You may have noticed that the original solution used OSS object storage, so why not store the face feature data directly in OSS? If you have this question, take a look at the performance comparison in the table below:
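For reference, the same table definition can also be expressed through the Java Admin API. This is a configuration sketch assuming an HBase 2.0+ client on the classpath and a reachable cluster configured via hbase-site.xml; it is not verified against a live cluster here:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateMobTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // Column family 'c' with MOB enabled; values larger than
            // 2048 bytes are stored as MOB files, matching the shell
            // command above.
            admin.createTable(TableDescriptorBuilder
                .newBuilder(TableName.valueOf("face"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("c"))
                    .setMobEnabled(true)
                    .setMobThreshold(2048)
                    .build())
                .build());
        }
    }
}
```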

| Attribute | Object storage | Cloud HBase |
| --- | --- | --- |
| Modeling capability | KV | KV, table, sparse table, SQL, full-text index, spatio-temporal, time series, graph query |
| Query capability | Prefix lookup | Prefix lookup, filters, indexes |
| Performance | | Excellent, with especially low latency for small objects; more than 10x faster than object storage in complex query scenarios |
| Cost | Billed by traffic and request count; suited to low-frequency access | Fully managed; lower cost in high-concurrency, high-throughput scenarios |
| Scalability | | |
| Applicable object size | General | < 10 MB |

Based on the comparison above, using the HBase MOB feature to store objects smaller than 10 MB has several advantages over using object storage directly.
Let's now look at the concrete table design, shown in the figure below:

image

The column family of the HBase table above is named c, and we use the face id as the column name. A single HBase table replaces all the tables of the previous solution! Although MOB is enabled, data is inserted exactly as usual; a code snippet follows:

String CF_DEFAULT = "c";
Put put = new Put(groupId.getBytes());
put.addColumn(CF_DEFAULT.getBytes(), faceId1.getBytes(), feature1.getBytes());
put.addColumn(CF_DEFAULT.getBytes(), faceId2.getBytes(), feature2.getBytes());
……
put.addColumn(CF_DEFAULT.getBytes(), faceIdn.getBytes(), featuren.getBytes());
table.put(put);

To fetch all face data for a given face group id, use the following:

Get get = new Get(groupId.getBytes());
Result re = table.get(get);

This returns all face data for the given face group id. To look up the data of a specific face by face group id + face id, use the following:

Get get = new Get(groupId.getBytes());
get.addColumn(CF_DEFAULT.getBytes(), faceId1.getBytes());
Result re = table.get(get);

After this migration, on two HBase worker nodes with 32 GB of memory, 8 cores, and four 250 GB SSD disks each, we wrote 1,000,000 rows of 10,000 columns each; reading one such row takes about 100-500 ms. For rows with 1,000 faces, reading a row takes roughly 20-50 ms, a 200-500x improvement over the previous 10 s.

The following table compares the solutions.

| Attribute | Object storage | MySQL + object storage | HBase MOB |
| --- | --- | --- | --- |
| Strong read/write consistency | Y | N | Y |
| Query capability | Weak | Strong | Strong |
| Query response time | High | High | Low |
| Operation and maintenance cost | Low | High | Low |
| Horizontal scalability | Y | Y | Y |


Use Spark to accelerate data analysis

We have now stored the face feature data in Alibaba Cloud HBase, but this is only the first step. How do we unlock the value hidden in this data? That requires data analysis: in this scenario, machine learning methods are needed for operations such as clustering. We can use Spark to analyze the data stored in HBase, and Spark itself supports machine learning. However, reading HBase data directly with open-source Spark affects HBase's own reads and writes.

To address these problems, the Alibaba Cloud HBase team has optimized Spark, for example by reading HFiles directly and pushing operators down, and provides a fully managed Spark product that simplifies the use of Spark through the SQL service ThriftServer and the job service LivyServer. The current Spark technology stack is shown in the figure below.

image

Through the Spark service we can integrate well with HBase, combining real-time stream processing with face feature mining. The overall architecture is as follows:

image

We collect real-time data from various face data sources and perform simple ETL with Spark Streaming; we then use the Spark MLlib library to mine face features from the collected data and write the results back into HBase. Finally, applications can access the mined face feature data in HBase.





Origin blog.51cto.com/15060465/2677217