Engineering: building a search-by-image service

Content-based recall is a common recall strategy in recommendation systems. Typical examples are tag-based recall keyed on user or item tags, and recall based on a user's age or region. This strategy is usually implemented on top of the open source software Elasticsearch. The recalled results are reasonable, but they offer little novelty or surprise. For example, recalling by the tag "Andy Lau" returns essentially only items that contain Andy Lau; items for Leon Lai, Jacky Cheung, and the other "Four Heavenly Kings" are unlikely to be recalled. In recent years, with the idea that everything can be embedded, and especially after the successful application of word2vec, item2vec, graph2vec, and similar techniques, recalling items by searching item vectors has also become a common recall strategy in recommendation systems. This article describes how to build a vector search service with the open source software Vearch and use it to implement search by image.

Introduction

Recently I have been working on recommendation optimization for small videos, where the optimization target is the number of screens viewed per user. Based on earlier experience with feed recommendation, the hope was to recall more relevant small videos for users based on their playback history. The small-video scenario is similar to the currently popular Douyin: a short video (about 5-30s) plays automatically and occupies almost the entire screen. Users can like, share, and comment on the video; if they do not like it, they swipe up to watch the next one. The factor that decides whether a user wants to watch is unlikely to be the two lines of title text below the video. It is more likely whether the cover image arouses the user's interest. Given this background, I wanted to build a search-by-image service to recall small videos by their cover images.

I had previously wrapped Faiss with gRPC to build a vector recall service, and in an earlier image classification project I had converted images into vectors as classifier input. From that experience, the task comes down to two problems:

  1. Preprocess each image and convert it into a fixed-length vector
  2. Load the vector of every picture into Faiss and use it for vector search
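
To make the second step concrete, here is a minimal sketch of what a vector search does, written as a brute-force scan in plain Python. It is not the Faiss API; the function names and the toy 3-dimensional vectors are made up for illustration, and a real index would use approximate structures instead of scoring every item.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def top_k(query: Sequence[float],
          index: Dict[str, Sequence[float]],
          k: int) -> List[Tuple[str, float]]:
    """Return the k items with the highest inner product against the query."""
    scored = ((item_id, inner_product(query, vec)) for item_id, vec in index.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy "image vectors"; real features would have far more dimensions.
index = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.8, 0.6, 0.0],
    "img_c": [0.0, 0.0, 1.0],
}
result = top_k([1.0, 0.0, 0.0], index, k=2)
print(result)
```

The scan costs one inner product per stored vector, which is why dedicated engines become necessary once the collection grows.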

For example, in the blog post "The use of Faiss in the project", the author uses the SIFT algorithm to extract image features as 128-dimensional vectors and loads the feature vector of each picture into Faiss for similar-vector recall. In the blog post "Faiss server practice based on gRPC", the MXPlayer technical team upgraded a user/item vector recall service that was originally built on the Flask framework; in single-machine load tests, QPS more than doubled. My original plan was to adapt my gRPC-based Faiss service to the current business scenario, but by chance I found Vearch, open source software from JD.com. I set that plan aside and decided to learn this software instead.

Service composition

The image search service consists of two parts: the vector search service, provided by Vearch, and image feature-vector extraction, provided by Vearch's plugin python-algorithm-plugin.

Vector search service: Vearch

Vearch is a flexible distributed system for high-performance similarity search over large-scale deep learning vectors. Its core is the Gamma vector search engine, which is built on Faiss. Besides vector search, Gamma can store documents that contain scalar fields and can index and filter on those fields quickly. In short, it supports both scalars and vectors, whereas Elasticsearch in general supports only scalars. Faiss alone can only provide a single-machine vector search service. Vearch uses Gamma as its vector search engine, implements multi-replica storage with the Raft protocol, and adds Master and Router components to form a flexible distributed system for vector similarity search. Its architecture is shown in the following diagram:

[Figure: Vearch architecture diagram]

The diagram has three main components, Master, Router, and PartitionServer, with the following roles:

  • Master is responsible for schema management and for coordinating metadata and resources at the cluster level
  • Router provides a RESTful API for creating, deleting, updating, and querying data, routes and forwards requests, and merges results
  • PartitionServer implements multi-replica storage based on the Raft protocol; the actual storage, indexing, and retrieval capabilities are provided by the Gamma engine

From the above, Gamma is to Vearch what Lucene is to Elasticsearch.
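
The Router's "route, forward, and merge" role can be sketched as a scatter-gather search. This is an illustration only, not Vearch's implementation: the hash-based routing rule and every function name below are assumptions. Each partition returns its local top k, and the router merges those partial lists into a global top k.

```python
import heapq
import zlib
from typing import Dict, List, Sequence, Tuple

Hit = Tuple[str, float]  # (document id, similarity score)


def pick_partition(doc_id: str, partition_num: int) -> int:
    """Map a document id to a partition with a stable hash (illustrative rule)."""
    return zlib.crc32(doc_id.encode("utf-8")) % partition_num


def search_partition(query: Sequence[float],
                     docs: Dict[str, Sequence[float]],
                     k: int) -> List[Hit]:
    """Local top-k by inner product inside one partition."""
    scored = [(doc_id, sum(q * v for q, v in zip(query, vec)))
              for doc_id, vec in docs.items()]
    return heapq.nlargest(k, scored, key=lambda hit: hit[1])


def router_search(query: Sequence[float],
                  partitions: List[Dict[str, Sequence[float]]],
                  k: int) -> List[Hit]:
    """Send the query to every partition, then merge the partial results."""
    partial: List[Hit] = []
    for docs in partitions:
        partial.extend(search_partition(query, docs, k))
    return heapq.nlargest(k, partial, key=lambda hit: hit[1])


docs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
partition_num = 2
partitions: List[Dict[str, Sequence[float]]] = [{} for _ in range(partition_num)]
for doc_id, vec in docs.items():
    partitions[pick_partition(doc_id, partition_num)][doc_id] = vec

hits = router_search([1.0, 0.0], partitions, k=2)
print(hits)
```

Because every partition contributes its own best k candidates, the merged result matches a search over the whole collection, however the documents happen to be distributed.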

Image processing service: Vearch plugin

For image processing, Vearch provides a corresponding plugin, python-algorithm-plugin. Vearch's goal is to be a high-performance, elastic distributed system for similarity search. Text, pictures, and videos can all be converted into vectors, so the Vearch team provides plugins that integrate this conversion with Vearch. For pictures, the plugin provides object detection, feature extraction, and similarity search. The processing logic is as follows:

[Figure: search by image]

The logic is to extract a feature vector from each picture, store it in Vearch's Gamma engine, and serve retrieval from there.

Service construction

The image search service consists of two services: the vector search service, provided by Vearch, and the service that extracts image features as vectors, provided by Vearch's plugin python-algorithm-plugin.

vearch

Vearch is written in Go, and its core engine Gamma is written in C++ (Faiss itself is developed in C++), so installation and deployment are simple and direct. Once the dependent libraries (Faiss, Gamma, RocksDB) are in place and you have the compiled vearch binary, stand-alone mode is started directly with ./vearch -conf config.toml. For a cluster, a final argument selects the role of each process: ./vearch -conf config.toml ps/router/master.

However, the GCC version on our online servers is too old and they have no Go environment, among other issues, so we used Docker. To understand in detail how a Faiss service evolves into a flexible distributed system like Vearch, I compiled and installed it from source.

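# Download the source code
git clone https://github.com/vearch/vearch
# Change to the image build directory
cd vearch/cloud
# Build the environment image vearch/vearch_env:3.2.2, which installs gcc, git, faiss, rocksdb, go, etc.
# This step is slow; you can use the official image instead: docker pull vearch/vearch_env:3.2.2
sh compile_env.sh
# Use vearch_env to compile the vearch binary; this mainly pulls the gamma source and compiles it
sh compile.sh
# Build vearch/vearch:3.2.2, which bundles the compiled vearch binary and its dependent libraries.
# You can use the official image instead: docker pull vearch/vearch:3.2.2
sh build.sh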

The official image build still has room for optimization. It is recommended to use CentOS sources for packaging and to prepare the source files for Faiss, RocksDB, and Go in advance.

Image Processing

There is no ready-made Docker image for the image processing service, and the image build provided in the GitHub repository has a problem. You can build from the following repository instead.

# Download the source code (this branch uses a corrected Dockerfile)
git clone -b study https://github.com/haojunyu/python-algorithm-plugin
# Change to the image directory and build the image haojunyu/vimgs:3.2.2
# You can use the prebuilt image instead: docker pull haojunyu/vimgs:3.2.2
cd python-algorithm-plugin && docker build -t haojunyu/vimgs:3.2.2 .

Starting with swarm

Because both services have been packaged as Docker images, they can be started directly with the command docker stack deploy -c docker-compose.yml vearch. The docker-compose.yml reads as follows:

version: '3.3'

services:
    vearch:
        image: vearch/vearch:3.2.2
        ports:
            - "8817:8817"
            - "9001:9001"
        volumes:
            - ./config.toml:/vearch/config.toml
            - ./data:/datas
            - ./logs:/logs
        deploy:
            mode: replicated
            replicas: 1
            restart_policy:
                condition: on-failure
                delay: 10s
                max_attempts: 3
        logging:
            driver: "json-file"
            options:
                max-size: "1g"

    imgs:
        image: haojunyu/vimgs:3.2.2
        ports:
            - "4101:4101"
        volumes:
            - ./python-algorithm-plugin/src/config.py:/app/src/config.py
            - ./images/imgs:/app/src/imgs
        command: ["bash", "../bin/run.sh", "image"]
        deploy:
            mode: replicated
            replicas: 3
            restart_policy:
                condition: on-failure
                delay: 10s
                max_attempts: 3

Note: The mounted file python-algorithm-plugin/src/config.py is the configuration file of the image processing service. Generally, you only need to change the following four settings for your own environment:

  • port is the port of the image processing service; the default is 4101
  • gpus specifies whether the service uses a GPU; the default is -1, meaning no GPU is used
  • master_address and router_address are the addresses of the Vearch master and router services

Service usage

The picture service and the Vearch service are tightly integrated. In general, you call the picture service directly and let it handle writing picture vectors into Vearch. For detailed Vearch operations, refer to the Vearch documentation.

Service monitoring

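# Here master_server means the Vearch master node and its port: localhost:8817
# Check cluster status
curl -XGET http://master_server/_cluster/stats
# Check health status
curl -XGET http://master_server/_cluster/health
# Check port status
curl -XGET http://master_server/list/server
# Clear the lock (creating a table locks the cluster; if the service fails during that process, the lock is not released and must be cleared manually before a new table can be created)
curl -XGET http://master_server/clean_lock
# Scale replicas up or down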
curl -XPOST -H "content-type: application/json"  -d'
{
    "partition_id":1,
    "node_id": 1,
    "method": 0
}
' http://master_server/partition/change_member

Library and space operations

The concepts of library and space are similar to the concepts of database and table in MySQL.

  • Library operations
# List all libraries in the cluster
curl -XGET http://master_server/list/db
# Create a library
curl -XPUT -H "content-type:application/json" -d '{
    "name": "sv_month"
}
' http://master_server/db/_create
# View a library
curl -XGET http://master_server/db/$db_name
# Delete a library (fails if table spaces still exist under it)
curl -XDELETE http://master_server/db/$db_name
# List all table spaces under the given library
curl -XGET "http://master_server/list/space?db=$db_name"
  • Table space operations
# Create the table space test under the library sv_month (for images)
curl -XPUT -H "content-type: application/json" -d '{
    "name":"test",
    "partition_num":1,
    "replica_num":1,
    "engine":{
        "name":"gamma",
        "index_size":70000,
        "max_size":10000000,
        "id_type":"String",
        "retrieval_type":"IVFPQ",
        "retrieval_param":{
            "metric_type":"InnerProduct",
            "ncentroids":256,
            "nsubvector":32
        }
    },
    "properties":{
        "itemid":{
            "type":"keyword",
            "index":true
        },
        "feature1":{
            "type":"vector",
            "dimension":512,
            "model_id":"vgg16",
            "format":"normalization"
        }
    }
}' http://image_server:4101/space/sv_month/_create
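
The numbers in this request can be sanity-checked with a little arithmetic. The sketch below assumes that vectors are stored as float32 and that each product-quantizer code takes 8 bits, which is a common setting for IVFPQ indexes but is not confirmed for Gamma by this article. The function name pq_layout is made up for illustration.

```python
def pq_layout(dimension: int, nsubvector: int, bits_per_code: int = 8) -> dict:
    """Sizes implied by splitting one vector into nsubvector quantized chunks."""
    if dimension % nsubvector != 0:
        raise ValueError("dimension must be divisible by nsubvector")
    raw_bytes = dimension * 4                      # float32 storage (assumption)
    code_bytes = nsubvector * bits_per_code // 8   # one code per sub-vector
    return {
        "subvector_dim": dimension // nsubvector,
        "raw_bytes": raw_bytes,
        "code_bytes": code_bytes,
        "compression_ratio": raw_bytes / code_bytes,
    }


# Values taken from the space definition above: dimension 512, nsubvector 32.
layout = pq_layout(dimension=512, nsubvector=32)
print(layout)
```

Under these assumptions each 512-dimensional vector is split into 32 sub-vectors of 16 dimensions, and a 2048-byte raw vector is represented by a 32-byte code.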

Data manipulation

  • Data insertion
# Insert a local picture into the table space
curl -XPOST -H "content-type: application/json"  -d' {
    "itemid":"COCO_val2014_000000123599",
    "feature1":{
        "feature":"../images/COCO_val2014_000000123599.jpg"
    }
} ' http://image_server:4101/sv_month/test/AW63W9I4JG6WicwQX_RC
  • Data search
# Query for similar results
curl -H "content-type: application/json" -XPOST -d '{
    "query": {
        "sum": [ {
            "feature":"../images/COCO_val2014_000000123599.jpg",
            "field":"feature1"
        }]
    }
    }
}' http://image_server:4101/sv_month/test/_search
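
The same search can be issued from Python. The following is a minimal sketch using only the standard library; the URL pattern and request body mirror the curl command above, while the helper name build_search_request is hypothetical. The request is only constructed here, and the lines that would send it are left as comments because they need a running service.

```python
import json
import urllib.request


def build_search_request(image_server: str, db: str, space: str,
                         image_path: str,
                         field: str = "feature1") -> urllib.request.Request:
    """Build the POST request for a search-by-image query (does not send it)."""
    url = f"http://{image_server}/{db}/{space}/_search"
    body = {"query": {"sum": [{"feature": image_path, "field": field}]}}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )


req = build_search_request(
    "image_server:4101", "sv_month", "test",
    "../images/COCO_val2014_000000123599.jpg",
)
print(req.full_url)

# To send it against a running service:
#     with urllib.request.urlopen(req, timeout=10) as resp:
#         result = json.loads(resp.read())
```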

Service results and launch

Results

After the service is built, the quality of search by image needs to be checked. Evaluation starts with manual inspection and ultimately relies on online metrics. You should have expectations for a service's effectiveness from the start, for example:

  1. The similarity between identical pictures is close to 100%
  2. A query of one type returns results of the same type, such as searching with a puppy and getting puppies, or searching with a car and getting cars
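
The first expectation follows from the space definition, which stores vectors with "format":"normalization" and compares them with "metric_type":"InnerProduct". Assuming that normalization means L2 normalization, the inner product of two normalized vectors is their cosine similarity, so a picture compared with itself scores 1. The sketch below checks this with exact arithmetic; with a quantized index such as IVFPQ the returned score is an approximation, which is why "close to 100%" is the realistic target. The function names are illustrative.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of the L2-normalized vectors, i.e. cosine similarity."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


same = similarity([3.0, 4.0], [3.0, 4.0])        # identical pictures
scaled = similarity([3.0, 4.0], [6.0, 8.0])      # same direction, different length
orthogonal = similarity([1.0, 0.0], [0.0, 1.0])  # nothing in common
print(same, scaled, orthogonal)
```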

The following are screenshots of search-by-image results:

[Figures: animal, beauty, beauty2, car]

Overall the effect is quite good.

Launch

There are several ways to take a recommendation strategy online:

  1. Serve directly online, like a ranking model. This approach requires the service to support high concurrency, high performance, and high availability
  2. Online calls plus a cache, like content search. This approach requires the service to support high performance and high availability, and the cache needs a high hit rate
  3. Write results into the cache offline, as with collaborative filtering (cf), hot lists, and other results that can be computed in advance
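
The second approach, online calls plus a cache, can be sketched as a small time-to-live cache in front of the search call. This is an in-process illustration only; the class name TTLCache and the fake_search stand-in are hypothetical, and a production setup would more likely use a shared cache service.

```python
import time
from typing import Callable, Dict, List, Tuple


class TTLCache:
    """Cache search results per key and refetch them after ttl_seconds."""

    def __init__(self, fetch: Callable[[str], List[str]], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, List[str]]] = {}

    def get(self, key: str) -> List[str]:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]            # cache hit: no call to the search service
        value = self._fetch(key)       # cache miss or expired entry
        self._store[key] = (now, value)
        return value


calls: List[str] = []


def fake_search(item_id: str) -> List[str]:
    """Stand-in for a call to the image search service."""
    calls.append(item_id)
    return [item_id + "_similar"]


cache = TTLCache(fake_search, ttl_seconds=60.0)
first = cache.get("video_1")
second = cache.get("video_1")   # served from the cache
print(first, second, len(calls))
```

The search service only sees traffic for keys that are missing or expired, so the load it must sustain depends on the cache hit rate rather than on total QPS.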

A total of 90,000 items (small videos added in the last 7 days plus videos exposed in the last 30 days) were imported into a stand-alone Vearch service, which could not withstand the traffic of two buckets averaging 48 QPS. The second approach was therefore used to get the service online. The strategy of searching image vectors with a user vector (the average of the vectors of the pictures shown to the user) was launched using the third approach.
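
The user vector described above is an element-wise average of picture vectors. A minimal sketch, with a made-up function name and toy 2-dimensional vectors:

```python
from typing import List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise average of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(dim)]


# Cover-image vectors of two videos shown to one user (toy values).
user_vector = mean_vector([[1.0, 0.0], [0.0, 1.0]])
print(user_vector)
```

The averaged vector generally no longer has unit length, so if the index expects normalized input it may need to be normalized again before searching.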

references

  1. The use of Faiss in the project
  2. faiss-web-service
  3. Faiss server practice based on gRPC
  4. JD distributed vector retrieval system vearch
  5. vearch Chinese document
  6. vearch core engine gamma
  7. vearch image processing plugin
  8. Image search page

Origin blog.csdn.net/haojunyu2012/article/details/112695988