The Best Way to Deploy a Deep Learning Model: Implementing an HTTP API Server in Python


    After a deep learning model has been trained and tested, putting it into a production environment requires some additional work. Because deep learning models demand a lot of computing power, there are generally three ways to bring one online: shrink the network, use dedicated hardware, or do the computation in the cloud over a networked C/S architecture. The AI Lemon blogger recommends the third approach: deploy the model on a server, have the client send the input data to the server over the network, and return the computed result to the client. With the 5G era just around the corner and the IPv6 protocol being deployed at scale, connecting everything over the wireless mobile Internet, an important piece of infrastructure, is the general trend. Through the network, even the cheapest low-end hardware can obtain the results of a deep learning model quickly and without any loss of accuracy. The ASRT speech recognition system, for example, is deployed this way, and it already provides speech recognition services such as voice search for the AI Lemon website.

1 What are the model deployment methods?

1.1 Shrinking the network and porting it to the device

    Generally speaking, the models we train are quite large. Approaches like MobileNet either construct a lightweight neural network directly and port it to the user's mobile device or other embedded hardware, or prune an existing large network and save it as a simplified model. The accuracy of a model obtained this way is not much worse than that of the original network, while the computational overhead drops sharply and the running speed increases.

    However, it is not just a matter of accepting some loss of accuracy: the pruning process itself is very cumbersome, and deciding how and how far to prune takes a lot of work. The model also still has to be ported. A model trained with PyTorch requires more work to port than one trained with TensorFlow, and other frameworks more still, because Google offers the whole "TensorFlow family bucket": TensorFlow officially releases a Go library for efficient model inference, and its Lite version (TensorFlow Lite) lets models run on mobile devices, with support for Java, Swift, and Objective-C in addition to Python. Even so, the migration process is very cumbersome, requires constant reference to the TensorFlow documentation, and still runs into compatibility issues with hardware devices.
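For concreteness, the conversion step itself is short; this is a minimal sketch assuming TensorFlow 2 and a trained Keras model named model (the hard parts the text describes come afterwards, in deploying and running the .tflite file on the device):

# Minimal sketch: convert a trained Keras model to TensorFlow Lite.
# Assumes TensorFlow 2; "model" is a placeholder for your trained model.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # optional weight quantization
tflite_model = converter.convert()

with open('model.tflite', 'wb') as f:
    f.write(tflite_model)  # ship this file to the mobile/embedded runtime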

    In short, no matter how you do it, you end up introducing a pile of new problems in order to solve one; however you look at it, it is not worthwhile.

1.2 Using dedicated hardware on the device

    This method is best suited to environments where Internet access is prohibited or simply unavailable. Are there places that cannot (in both senses) connect to the Internet? Everyone understands. Once the networking restriction is gone, dedicated hardware is not cost-effective: the biggest problem at present is its high overall cost and poor price-performance ratio. Besides the model porting costs discussed in the previous section, there is the hardware cost itself. Dedicated hardware is currently expensive; although it is generally said to be faster than a GPU, a dedicated accelerator costs more than an ordinary GPU, can only be used to compute deep learning models, and porting to it requires hardware-specific programming. A typical example is the FPGA, and the cost of some companies' related dedicated hardware is understood to be very high.

1.3 Using a C/S architecture for network-based cloud computing

    The "C/S" architecture is the classic architecture model in the software architecture-the "client/server" architecture. A typical C/S architecture often uses the MVC model, namely: model ( M odel), view ( V iew) , Controller ( C ontroller).

Figure 1: Schematic diagram of the MVC pattern

The following explanation of the MVC pattern is from the Runoob tutorial:

Model - The model represents an object or Java POJO that accesses data. It can also carry logic to update the controller when the data changes.

View - The view represents the visualization of the data contained in the model.

Controller - The controller acts on both the model and the view. It controls the flow of data into the model object and updates the view when the data changes. It keeps the view separate from the model.

I won't go into details here; see the original text for a more detailed explanation of MVC: https://www.runoob.com/design-pattern/mvc-pattern.html

    With networking and the C/S architecture, we can deploy the complete deep learning computation on a high-performance server; the client needs no expensive equipment, just a network connection, to enjoy the convenience AI brings. This way, only a small number of high-performance computing servers need to be deployed in the cloud, while a large number of cheap devices can use model inference, so the overall cost is very low. This deployment method removes the need for large amounts of expensive hardware on the client side, avoids wasting idle computing resources, and avoids the manpower and material cost of porting the model.

2 Why use an HTTP service

    We have established that a networked C/S architecture is the better way to deploy a deep learning model, and there are many concrete ways to implement a C/S architecture. The AI Lemon blogger recommends using the HTTP protocol for this. Why HTTP?

    Because HTTP is the most widely used, most compatible application-layer network protocol, functionally suitable for almost any scenario. The many kinds of browser-based websites are one example, although browser-based websites use the "B/S" (browser/server) architecture, and here we only discuss the C/S architecture.

2.1 What is HTTP?

    HTTP, the Hypertext Transfer Protocol, is an application-layer protocol built on TCP/IP, and it is generally considered connectionless and stateless. HTTP offers request methods such as GET, POST, PUT, and DELETE, corresponding to query, create, modify, and delete operations; the first two are by far the most common. We generally use GET requests to open web pages. When there is data to submit, such as registering, logging in, or posting articles and comments, we generally use POST, because browsers and servers impose a maximum URL length on GET requests, while POST bodies generally have no such limit. POST is therefore the right choice for uploading data such as images, speech, and video to the server, which runs the deep learning computation and returns the result to us.
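To make the difference concrete, here is roughly what the two kinds of requests look like on the wire (a schematic sketch; the host, path, and body are placeholder values):

GET /search?q=lemon HTTP/1.1
Host: example.com

POST /api/recognize HTTP/1.1
Host: example.com
Content-Type: application/x-www-form-urlencoded
Content-Length: 16

fs=16000&token=x

In a GET request the parameters travel in the URL, which is where the length limit bites; in a POST request they travel in the message body.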

    HTTP is generally connectionless and stateless, but we can make it stateful when needed, as with user registration and login: we all know that after logging in we can post articles and comments under our own account. This kind of connection and state is achieved in HTTP through Cookies and Sessions. The Cookie is stored on the user side (for example, in the browser), while the Session is stored on the server side (in the HTTP server software or in a database); the two are generally used together.

    Deep learning computations are usually context-independent: to recognize everyone in picture 1 and everyone in picture 2, we generally do not need Cookies and Sessions to carry context across requests. But sometimes we do. A machine question-answering system, for example, usually needs to consider the context of the recent conversation with the user, as does speech recognition of very long text. In such cases, connected and stateful processing can be implemented, and again this only requires Cookie + Session.
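As a rough illustration of how that could look with the plain http.server module used later in this article (a minimal sketch; the SESSIONS store, the "sid" cookie name, and the uuid-based IDs are illustrative assumptions):

# Sketch: a stateful GET handler method for a BaseHTTPRequestHandler subclass.
import uuid

SESSIONS = {}  # server-side session store; a real service might use a database

def do_GET(self):
    cookie = self.headers.get('Cookie', '')
    if 'sid=' in cookie:
        sid = cookie.split('sid=')[1].split(';')[0]  # returning client
    else:
        sid = uuid.uuid4().hex                       # new client: create a session
        SESSIONS[sid] = {'history': []}
    self.send_response(200)
    self.send_header('Set-Cookie', 'sid=' + sid)     # the client stores this cookie
    self.end_headers()
    self.wfile.write(b'session: ' + sid.encode('utf-8'))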

2.2 Benefits of deploying over the HTTP protocol

    As the AI Lemon blogger just said, HTTP is widely used: almost every language has library implementations available, the protocol is standardized so there is no problem of inconsistent implementations, it has stood the test of time and practice, and there is a large amount of ready-made software and libraries. With HTTP, elastic scaling is also easy to achieve, even without writing any additional code: a stock Nginx server can do it with a single set of configuration files!

Figure 2: Schematic diagram of a high-performance deep learning inference server cluster architecture

    For example, suppose we start an HTTP server for deep learning model computation. If one server is not enough, we can start 100, with IP addresses 192.168.1.100 through 192.168.1.199 (or listen on 100 different ports on the same machine, if the machine is powerful enough). We then list these 100 addresses in an Nginx upstream block, let Nginx reverse-proxy them, and have each computation request forwarded to one of the 100 servers by round-robin polling. If one recognition takes 1 second, a single server can handle only one recognition per second; after Nginx load balancing, such a cluster can handle 100 per second, raising the throughput from 1 to 100! The AI Lemon blogger currently deploys the ASRT speech recognition API server this way: with several (N) servers on the back end, the speech recognition throughput rises to N times the original.
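A minimal sketch of that Nginx configuration, trimmed to two back ends (the addresses and port follow the example above; round-robin is Nginx's default upstream policy):

http {
    upstream dl_api {
        server 192.168.1.100:20000;
        server 192.168.1.101:20000;
        # ...one line per inference server, up to 192.168.1.199
    }
    server {
        listen 80;
        location / {
            proxy_pass http://dl_api;  # each request goes to one back end
        }
    }
}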

    Therefore, with the C/S architecture we only need to deploy the deep learning inference computation on the server, while all other functionality stays on the terminal, bringing AI algorithm models into use at low cost and high efficiency. Combined with cloud computing, this can further reduce an enterprise's total cost.

2.3 Why a private application protocol is not recommended

    The main reason a private protocol is not recommended is precisely that it is private, which places real demands on an enterprise's software engineering and management capabilities. A new protocol developed from scratch on top of TCP/IP is prone to defects, those defects cause all kinds of problems, and protocol version updates easily cause compatibility problems. If you have sufficient development capability, you can try it.

3 Why use the native HTTP library rather than Django or Flask

    Django and Flask are Python web development frameworks first released in 2005 and 2010, respectively. Both have their own usage conventions and extra learning costs, and for an API server we do not need most of their features. Using Python's native http.server library directly, we only need to write a small handler subclass implementing the methods that process GET and POST requests, and the syntax stays familiar.

4 A sample HTTP service program in Python, with IPv6 support

    In this section we walk through a demo HTTP server that uses Python to process API requests, and then make it support the IPv6 protocol.

apiserver.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
@author: AI Lemon
HTTP server program of a deep learning model API
"""

import http.server
import urllib.parse

class TestHTTPHandle(http.server.BaseHTTPRequestHandler):

    def setup(self):
        self.request.settimeout(10)  # set the socket timeout to 10 seconds
        http.server.BaseHTTPRequestHandler.setup(self)

    def _set_response(self):
        self.send_response(200)  # HTTP status code 200
        self.send_header('Content-type', 'text/html')  # response content type
        self.end_headers()

    def do_GET(self):
        buf = 'AI Lemon Deep Learning Model Computing API Service'
        self._set_response()
        buf = bytes(buf, encoding='utf-8')  # encode the text as UTF-8 bytes
        self.wfile.write(buf)  # write the response body to the client

    def do_POST(self):
        '''
        Process the input data received via POST, run the computation
        and return the result
        '''
        path = self.path  # the requested URL path
        print(path)
        # read the data submitted by POST
        datas = self.rfile.read(int(self.headers['content-length']))
        # datas = urllib.parse.unquote(datas.decode('utf-8'))  # if URL-encoded
        datas = datas.decode('utf-8')
        '''
        The input data is now in the variable "datas"; insert the deep
        learning computation here and store the result in the variable "buf"
        '''
        buf = datas  # placeholder only: echo the input back
        self._set_response()
        buf = bytes(buf, encoding='utf-8')
        self.wfile.write(buf)  # write the result back to the client

To add IPv6 support, append the following after the code above:

import socket

class HTTPServerV6(http.server.HTTPServer):
    address_family = socket.AF_INET6

def start_server(ip, port):
    if ':' in ip:
        http_server = HTTPServerV6((ip, int(port)), TestHTTPHandle)
    else:
        http_server = http.server.HTTPServer((ip, int(port)), TestHTTPHandle)
    print('Server started')
    try:
        http_server.serve_forever()  # keep listening for and serving requests
    except KeyboardInterrupt:
        pass
    http_server.server_close()
    print('HTTP server closed')

HTTPServerV6 is a server class that supports the IPv6 protocol. To be able to switch between pure IPv4 and IPv6, we implement the start_server() function as the startup entry point and call it wherever we want to start the server program. For example:

if __name__ == '__main__':
    start_server('', 20000)       # for IPv4 networks only
    # start_server('::', 20000)   # for IPv6 networks
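With the server running, a quick way to check it is a short standard-library client; this is a minimal sketch, assuming the server was started on localhost port 20000 as above, and the POST body is just a placeholder string:

import urllib.request

# GET: prints the service banner returned by do_GET
with urllib.request.urlopen('http://127.0.0.1:20000/') as resp:
    print(resp.read().decode('utf-8'))

# POST: passing data= makes urllib send a POST; do_POST echoes it back
req = urllib.request.Request('http://127.0.0.1:20000/',
                             data='hello'.encode('utf-8'))
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode('utf-8'))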

5 Application example: how does the ASRT speech recognition API server work?

    The code with which the ASRT speech recognition system implements its API server program is at:

https://github.com/nl8590687/ASRT_SpeechRecognition/blob/master/asrserver.py

As mentioned above, first add the deep-learning-related initialization code outside the HTTP server class:

import keras
from SpeechModel251 import ModelSpeech
from LanguageModel import ModelLanguage

datapath = './'
modelpath = 'model_speech/'
ms = ModelSpeech(datapath)
ms.LoadModel(modelpath + 'm251/speech_model251_e_0_step_12000.model')
ml = ModelLanguage('model_language')
ml.LoadModel()

Then add the deep learning computation code at the point in do_POST where the input data has been obtained:

# Continues the code above: recognize is a method of the handler class,
# and the code after it goes inside do_POST, right after datas is decoded
def recognize(self, wavs, fs):
    r = ''
    try:
        r_speech = ms.RecognizeSpeech(wavs, fs)  # acoustic model: audio -> pinyin
        print(r_speech)
        str_pinyin = r_speech
        r = ml.SpeechToText(str_pinyin)          # language model: pinyin -> text
    except:
        r = ''
        print('[*Message] Server raise a bug. ')
    return r

datas_split = datas.split('&')
token = ''
fs = 0
wavs = []
# type = 'wavfilebytes'  # wavfilebytes or python-list
for line in datas_split:
    [key, value] = line.split('=')
    if 'wavs' == key and '' != value:
        wavs.append(int(value))
    elif 'fs' == key:
        fs = int(value)
    elif 'token' == key:
        token = value
    # elif 'type' == key:
    #     type = value
    else:
        print(key, value)

if token != 'qwertasd':
    buf = '403'
    print(buf)
    self._set_response()  # send the headers before the body
    buf = bytes(buf, encoding='utf-8')
    self.wfile.write(buf)
    return

# if 'python-list' == type:
if len(wavs) > 0:
    r = self.recognize([wavs], fs)
else:
    r = ''
# else:
#     r = self.recognize_from_file('')

if token == 'qwertasd':
    # buf = 'Success\n' + 'wavs:\n' + str(wavs) + '\nfs:\n' + str(fs)
    buf = r
else:
    buf = '403'
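For reference, a matching client request could look roughly like this; it is a sketch only, and the sample values (the tiny fake waveform, the 16000 Hz rate, and the demo token) are illustrative assumptions rather than real audio:

import urllib.request

samples = [0, 12, -7, 3]  # placeholders; a real client sends actual PCM samples
body = '&'.join(['wavs=%d' % s for s in samples] + ['fs=16000', 'token=qwertasd'])
req = urllib.request.Request('http://127.0.0.1:20000/',
                             data=body.encode('utf-8'))
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode('utf-8'))  # recognized text, or '403' on a bad token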

    At this point someone may ask how to adapt their own deep learning code to run inside such an HTTP server program. If you find the rebuild very difficult, that only shows the code is too poor and its structure not reasonable enough. Whoever wrote such code needs to learn elegant, structured programming, especially the authors of the kind of Python file that simply runs straight from top to bottom. Many people doing deep learning can write code but do not really understand programming, so when the algorithm is implemented the software architecture is a mess, and the same goes for back-end development.

6 Elastic scaling: what to do when API requests surge

    In a word: expand. Faced with growing request volume, the first thing to think of is horizontal scaling of the servers, not vertical scaling such as replacing them with machines of better computing performance. As shown in Figure 2, it is only necessary to increase the number of ASRT speech recognition API servers on the back end, and on today's cloud computing platforms there are already automated operations tools and products that handle this for us. When not in an emergency, we gradually optimize the various algorithms, including the deep learning models themselves, from the software side: find the program's biggest performance and time bottlenecks, speed up execution, and reduce the time cost of a single computation.

7 Summary

    This article has presented what I consider the best way to deploy a deep learning model. It first introduced three common deployment methods and their respective characteristics, then introduced the HTTP protocol and its advantages, then showed how to use Python's native HTTP library to implement an HTTP server program that serves deep learning model computation as an API, taking the implementation in the ASRT speech recognition system as an example. Finally, it discussed how to handle ever-growing API request volume: scale the servers horizontally in an emergency, and optimize the bottlenecks in normal times.

Copyright Statement
Unless otherwise specified, the articles on this blog are original works and the author holds the copyright. Reprinting is welcome; please credit the author and link to the source. Thank you.
This article: https://blog.ailemon.me/2020/11/09/dl-best-deployment-python-impl-http-api-server/
All articles are licensed under Attribution-NonCommercial-ShareAlike 4.0
