VCED project practice 2--Jina introduction

VCED project practice 2 – Getting started with Jina

Introduction

  • Jina is an MLOps framework for building cross-modal and multimodal applications in the cloud. It elevates proof-of-concept (POC) prototypes to near-production-level services. Because Jina handles unstructured data and the surrounding infrastructure, it gives every developer access to advanced solution engineering and cloud-native technologies.

basic concept

Document

  • Document is the basic multimodal and cross-modal data type in Jina and the basic unit of IO in Jina.
  • It can be understood as an abstraction that wraps text, images, video, audio, 3D meshes, etc. in a unified data structure.
  • Official documentation

Attributes

  • id: unique document identifier
  • blob: raw binary data
  • tensor: ndarray holding video, image, or audio data
  • text: text content
  • modality: the modality of this document, used when searching
  • embedding: the embedding of this document
  • For more attributes, please refer to the official documentation

example

  • Instantiation with different parameters

    • from docarray import Document
      import numpy
      
      d1 = Document(text='hello')
      d2 = Document(blob=b'\f1')
      d3 = Document(tensor=numpy.array([1, 2, 3]))
      d4 = Document(
          uri='https://jina.ai',
          mime_type='text/plain',
          granularity=1,
          adjacency=3,
          tags={'foo': 'bar'},
      )
      
  • Find the nearest embedding

    • The essence of face recognition, similar image search, and multi-modal search is the process of finding the most similar/nearby embedding for the compressed embedding of the query.

    • Compared with writing loops by hand, the Document.match method provides a more convenient API.

      • Note that the document's embedding must be set before use, either with the .embed method or through the .embeddings attribute.
    • from docarray import DocumentArray, Document
      import numpy as np
      
      da = DocumentArray.empty(10)  # create a DocumentArray of length 10 whose elements are empty Documents
      da.embeddings = np.random.random([10, 256])  # set each Document's embedding to random values
      
      q = Document(embedding=np.random.random([256])) # query
      q.match(da)
      
      q.summary()
      
    • The result holds the distance between the query and each element of the candidate DocumentArray, with the matches sorted from most to least similar.

      • Note that the default distance metric is cosine distance, $1 - a \cdot b / (\|a\| \cdot \|b\|)$, so the smaller the value, the more similar the match.
      • image-20221115210707351
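
To make the ranking concrete, the following is a minimal pure-Python sketch of what a nearest-embedding search computes. It does not use docarray: `cosine_distance` and `rank_by_distance` are hypothetical helper names chosen for illustration, and the tiny 2-D vectors stand in for real embeddings.

```python
import math


def cosine_distance(a, b):
    """Return 1 - a.b / (||a|| * ||b||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(query, candidates):
    """Return (distance, index) pairs sorted from most to least similar."""
    scored = [(cosine_distance(query, c), i) for i, c in enumerate(candidates)]
    return sorted(scored)


# Toy candidate set: same direction as the query, orthogonal, and opposite.
candidates = [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]
query = [1.0, 0.0]

ranking = rank_by_distance(query, candidates)
for dist, idx in ranking:
    print(f"candidate {idx}: cosine distance {dist:.2f}")
```

A distance of 0 means the two vectors point in the same direction regardless of their length, 1 means they are orthogonal, and 2 means they point in opposite directions.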

DocumentArray

  • DocumentArray is a container for multiple Documents, similar to a list in Python.

Executor

  • Executor is a Python class with a set of methods that take a DocumentArray as input and output. An Executor can be regarded as a microservice.

Flow

  • Flow chains a series of Executors into a logical pipeline, which can be regarded as an end-to-end service.

Gateway

  • Gateway is the entrance to a Flow, equivalent to a router for its internal communication.
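
The relationship between these concepts can be sketched without Jina at all. The toy code below is only a conceptual model, not Jina's implementation: `ToyDoc`, `ToyFlow`, `upper_exec`, and `length_exec` are hypothetical names, and a real Flow runs its Executors as separate services behind the Gateway rather than as in-process function calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyDoc:
    """Stand-in for a Document: a text field plus free-form tags."""
    text: str = ""
    tags: dict = field(default_factory=dict)


# An "executor" here is any function that modifies a list of docs in place.
ToyExecutor = Callable[[List[ToyDoc]], None]


def upper_exec(docs: List[ToyDoc]) -> None:
    for d in docs:
        d.text = d.text.upper()


def length_exec(docs: List[ToyDoc]) -> None:
    for d in docs:
        d.tags["length"] = len(d.text)


class ToyFlow:
    """Stand-in for a Flow: applies its executors to the docs in order."""

    def __init__(self) -> None:
        self.executors: List[ToyExecutor] = []

    def add(self, executor: ToyExecutor) -> "ToyFlow":
        self.executors.append(executor)
        return self  # returning self allows chained .add(...).add(...) calls

    def post(self, docs: List[ToyDoc]) -> List[ToyDoc]:
        # Plays the role of the gateway: the single entry point for a request.
        for executor in self.executors:
            executor(docs)
        return docs


flow = ToyFlow().add(upper_exec).add(length_exec)
result = flow.post([ToyDoc(text="hello"), ToyDoc(text="jina")])
print([(d.text, d.tags) for d in result])
```

Each executor only sees a list of documents going in and coming out, which is what makes the stages independently replaceable.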

flow chart

image-20221116015733488

coding style

  • The Jina project supports two styles of code

    • pythonic

      • from jina import DocumentArray, Executor, Flow, requests
        
        
        class FooExec(Executor):
            @requests
            async def add_text(self, docs: DocumentArray, **kwargs):
                for d in docs:
                    d.text += 'hello, world!'
        
        
        class BarExec(Executor):
            @requests
            async def add_text(self, docs: DocumentArray, **kwargs):
                for d in docs:
                    d.text += 'goodbye!'
        
        
        f = Flow(port=12345).add(uses=FooExec, replicas=3).add(uses=BarExec, replicas=2)
        
        with f:
            f.block()
        
    • yamlish

      • # executor.py
        from jina import DocumentArray, Executor, requests
        
        
        class FooExec(Executor):
            @requests
            async def add_text(self, docs: DocumentArray, **kwargs):
                for d in docs:
                    d.text += 'hello, world!'
        
        
        class BarExec(Executor):
            @requests
            async def add_text(self, docs: DocumentArray, **kwargs):
                for d in docs:
                    d.text += 'goodbye!'
        
      • #flow.yaml
        jtype: Flow
        with:
          port: 12345
        executors:
          - uses: FooExec
            replicas: 3
            py_modules: executor.py
          - uses: BarExec
            replicas: 2
            py_modules: executor.py
        
      • jina flow --uses flow.yaml
        
  • The pythonic style packages the Executors and the Flow in a single Python file.

  • The YAML style puts the Executors in a Python file and the Flow configuration in a separate YAML file, so the Flow logic is decoupled from the Executor code. Use this style for complex projects in production environments.

Preparation

Configure pycharm to use the wsl interpreter

  • File > Settings > Project: xxx > Python Interpreter
  • image-20221116021031939

Configure jupyter

  • Because PyCharm has problems starting Jupyter, the Jupyter service must be started from the WSL terminal.

  • jupyter notebook
    
  • Copy the jupyter link and open it in a browser or in pycharm

    • image-20221116021418813
    • image-20221116021500364

demo: Hello world

  • The following example demonstrates a basic helloworld demo

    • An Executor is defined that appends 'hello, world!' to the text attribute of each Document.
    • The Flow.post method sends an empty DocumentArray into the Flow.
  • # demo_hello_world/main.py
    
    from jina import DocumentArray, Executor, Flow, requests
    
    
    class MyExec(Executor):
        """
        Defines an executor with one async method. The method receives a
        DocumentArray from the incoming request and appends "hello, world!"
        to the text attribute of every Document in it.
        """
        @requests
        async def add_text(self, docs: DocumentArray, **kwargs):
            for d in docs:
                d.text += "hello, world!"
    
    # The Flow chains MyExec twice in a row, so "hello, world!" is appended to the text field twice
    f = Flow().add(uses=MyExec).add(uses=MyExec)
    
    
    # `with` opens and closes the Flow; the Flow is shut down automatically on exit
    with f:
        
        r = f.post('/', DocumentArray.empty(2))
        print(r.texts)
    
  • # run
    python demo_hello_world/main.py
    
  • result

    • image-20221115203428407

gRPC server

  • Note that a Flow written this way in a Python file shuts down as soon as the `with f:` block finishes, which is convenient for debugging. To deploy it as a long-running server, use the YAML style.

  • This style is scalable and cloud-native.

  • Move the MyExec class into a separate executor.py file and write the corresponding YAML.

    • # executor.py
      from jina import DocumentArray, Executor, requests
      
      
      class MyExec(Executor):
          @requests
          async def add_text(self, docs: DocumentArray, **kwargs):
              for d in docs:
                  d.text += 'hello, world!'
      
    • # toy.yml
      jtype: Flow
      with:
        port: 51000
        protocol: grpc
      executors:
      - uses: MyExec
        name: foo
        py_modules:
          - executor.py
      - uses: MyExec
        name: bar
        py_modules:
          - executor.py
      
    • # start the project
      jina flow --uses toy.yml
      
    • image-20221115215011096

    • A gRPC service is now running. It cannot be accessed directly from a browser, so we need to write a client to access it or use Postman for testing.

client access

  • client code

    • from jina import Client, Document
      
      c = Client(host='grpc://0.0.0.0:51000')  # create a client
      result = c.post('/', Document())  # the client posts an empty Document
      print(result.texts)
      
      
    • result

      • image-20221115220024996

Postman test

  • reference

  • img

  • You can import the proto files or use Postman's server reflection.

  • If the service is normal, after entering the URL of the server, a drop-down menu of the calling structure will appear. Click invoke to test.

  • image-20221115221955940

demo: Use resnet to find similar images

import jina
import urllib
from jina import DocumentArray, Executor, requests, Document
from docarray.array.mixins.plot import PlotMixin
from docarray import DocumentArray
import matplotlib.pyplot as plt
import os
plt.rcParams['font.family'] = 'SimHei'

Build a mini retrieval data set from local images

img_path_list = [os.path.join('./test_img', i) for i in os.listdir('./test_img')]
img_path_list
for idx, img_path in enumerate(img_path_list):
    # f = urllib.request.urlopen(img_uri)
    plt.subplot(2, 3, idx + 1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.xlabel(img_path.split('/')[-1])
    img = plt.imread(img_path)
    plt.imshow(img)

image-20221116013944539

preprocessing

  • Pass the image path in through the Document's uri parameter, convert the image to a tensor, normalize it, resize it to a common size, and finally move the channel axis to the first dimension, which is equivalent to torch.permute.
def get_doc_from_img(img_path):
    return Document(uri=img_path).load_uri_to_image_tensor().set_image_tensor_normalization().set_image_tensor_shape(
        (300, 300)).set_image_tensor_channel_axis(-1, 0)
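
The last step, moving the channel axis from position -1 to position 0, is easy to misread, so here is a small pure-Python sketch of what it does to the layout. It uses nested lists instead of real image tensors, and `hwc_to_chw` is a hypothetical helper name for illustration, not a docarray function.

```python
def hwc_to_chw(image):
    """Rearrange a height x width x channel nested list to channel x height x width."""
    channels = len(image[0][0])
    return [
        [[pixel[c] for pixel in row] for row in image]
        for c in range(channels)
    ]


# A 2 x 2 RGB "image": each pixel is [R, G, B].
image = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]

chw = hwc_to_chw(image)
print(len(chw), len(chw[0]), len(chw[0][0]))  # channel, height, width sizes
print(chw[0])  # the red plane only
```

After the move, each top-level entry is one full color plane. Channel-first is the layout that PyTorch convolutional models such as ResNet expect.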

Define search image database

  • Use a DocumentArray to hold multiple Documents.
docs = DocumentArray([get_doc_from_img(i) for i in img_path_list])
docs

image-20221116014013725

Import the resnet pre-trained model and generate embedding

  • If this step fails because of network problems, you can download the pretrained weights manually and place them in the path given in the error message.
  • Use the DocumentArray.embed method to get the embeddings of the images.
import torchvision

model = torchvision.models.resnet50(pretrained=True)
docs.embed(model).embeddings

embedding tensor

tensor([[-1.8643,  1.8063, -2.5505,  ...,  0.3757,  1.5515, -0.1677],
        [-1.0711,  1.8354, -2.3960,  ...,  0.4941,  1.7618, -0.4173],
        [-1.6836,  1.7230, -2.4809,  ...,  0.8211,  5.2785, -0.2270],
        [-0.9699,  1.8454, -2.8017,  ...,  4.9436,  3.8553,  0.9916],
        [ 2.9779,  1.3210, -4.4595,  ...,  3.3578,  1.9920, -1.8143],
        [ 3.9424,  0.8940, -3.4237,  ...,  4.5399,  2.4077, -1.0326]])

Visualization

# docs.plot_embeddings()  # visualize the embeddings

image-20221116002224785

Compare image similarity

  • The query image must first be passed through the model to obtain its embedding; the Document.match method then computes its similarity to every embedding in the database.
# query
d_query = get_doc_from_img(img_path_list[0])
d_query.embed(model)  # the query must also be embedded by the model first
d_query.match(docs)

image-20221116014148729

query_result = [(m.scores['cosine'].value, m.uri) for m in d_query.matches]
query_result.sort()  # sort by distance, smallest first
plt.subplot(3, 3, 1)
plt.xticks([])
plt.yticks([])
plt.grid(False)
plt.xlabel(f'query image')
plt.imshow(plt.imread(img_path_list[0]))
for idx, (dis,img_path) in enumerate(query_result):
    # f = urllib.request.urlopen(img_uri)
    plt.subplot(3, 3, idx + 4)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.xlabel(f'emb distance {dis:.2f}')
    img = plt.imread(img_path)
    plt.imshow(img)
plt.tight_layout()
plt.show()

image-20221116014121084

Use Flow to deploy as a service

  • Encapsulate the operations in an Executor and query it with the client.
# executor.py
from jina import DocumentArray, Executor, requests, Document
import torchvision, os


def get_doc_from_img(img_path):
    return Document(uri=img_path).load_uri_to_image_tensor().set_image_tensor_normalization().set_image_tensor_shape(
        (300, 300)).set_image_tensor_channel_axis(-1, 0)


class MyExec(Executor):
    @requests
    async def get_similar_img(self, docs: DocumentArray, **kwargs):
        docs.embed(model)
        docs.match(img_database)


model = torchvision.models.resnet50(pretrained=True)
img_path_list = [os.path.join('./test_img', i) for i in os.listdir('./test_img')]
img_database = DocumentArray([get_doc_from_img(i) for i in img_path_list])
img_database.embed(model)  # the retrieval images must be embedded in advance

# toy.yml
jtype: Flow
with:
  port: 51000
  protocol: grpc
executors:
- uses: MyExec
  name: img_query
  py_modules:
    - executor.py
# client test
from jina import Client, Document

c = Client(host='grpc://0.0.0.0:51000')  # create a client
d_query = get_doc_from_img(img_path_list[0])
# d_query.embed(model)  # the embedding is now computed by the model on the server side
result = c.post('/', d_query)  # the result is a DocumentArray
print(result[0])

query_result = [(m.scores['cosine'].value, m.uri) for m in result[0].matches]
query_result.sort()  # sort by distance, smallest first
plt.subplot(3, 3, 1)
plt.xticks([])
plt.yticks([])
plt.grid(False)
plt.xlabel(f'query image')
plt.imshow(plt.imread(img_path_list[0]))
for idx, (dis,img_path) in enumerate(query_result):
    # f = urllib.request.urlopen(img_uri)
    plt.subplot(3, 3, idx + 4)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.xlabel(f'emb distance {dis:.2f}')
    img = plt.imread(img_path)
    plt.imshow(img)
plt.tight_layout()
plt.show()

Origin blog.csdn.net/u011459717/article/details/127877087