Concurrent processing of AI services [Python]

For a while, I focused on the research side of machine learning, developing custom solutions for different tasks. But lately new projects have been coming in, and sometimes it's faster to handle the initial deployment yourself than to enlist other developers. Along the way I found several deployment options that differ in scale, ease of use, pricing, and so on.

Today, we'll discuss a simple yet powerful way to deploy machine learning models. It allows us to handle multiple requests concurrently and scale the application when needed. We'll also discuss the responsibilities of a data scientist when putting machine learning models into production, and how to load test web applications using some handy Python tools.


1. Responsibilities of a Data Scientist

You can find tons of open source solutions for almost every task. Some existing services can even handle data validation and processing, data storage, model training and evaluation, model inference and monitoring, etc.

But what if you still need a custom solution? You have to develop the whole infrastructure yourself. This is the question I've been thinking about: what exactly is a data scientist responsible for? Is it just the model itself, or do we have to put it into production?

Typically, a data scientist's responsibilities vary from company to company. I talked this over with my CTO, and we agreed on a few areas where a data scientist should have expertise: they should be able to deliver their solution as an API, containerize it, and ideally build the solution to handle multiple requests concurrently.

For mobile devices, it is usually sufficient to provide the mobile developer with the model converted to the appropriate format, along with documentation that describes what the model takes as input and what it returns as output.
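For example, assuming the target is TensorFlow Lite (one common format for mobile), the conversion can be as short as the sketch below. The MobileNetV2 model here is just a stand-in for your own trained Keras model, and the exact format and converter options depend on the mobile stack you are targeting.

import tensorflow as tf

# A stand-in for your trained Keras model
model = tf.keras.applications.MobileNetV2(weights='imagenet')

# Convert the model to TensorFlow Lite, a format commonly used on mobile devices
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# Save the converted model so the mobile developer can bundle it with the app
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)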

If a Docker container cannot handle the expected traffic, data scientists should delegate further scaling to appropriate experts.

2. Flask and concurrency

We'll experiment with Flask as part of a simple application. It's a tiny web framework built in Python, designed for small applications.

When a request is received, the application will send its own request to httpbin.org - a service that helps you experiment with different kinds of requests. The response comes back after a two-second delay, and we need this delay to experiment with concurrency.

from flask import Flask
import requests


app = Flask(__name__)

@app.route('/')
def test_request():
    # Ask httpbin.org to respond after a 2-second delay to simulate slow I/O
    response = requests.get('https://httpbin.org/delay/2').json()
    return response

if __name__ == "__main__":
    # Run Flask's built-in development server
    app.run(host='0.0.0.0')

Pure Python has its "notorious" GIL limitation, which essentially allows only one Python thread to execute at a time. If we want our application to handle more requests in a given amount of time, we have two options: threading and multiprocessing. Which one to use depends on the application's bottleneck.

3. When to choose multithreading?

Multithreading should be used whenever there is waiting time. The way our application is written represents a typical I/O-bound workload: most of the execution time is spent waiting for other services (the operating system, a database, a network connection, and so on). In this case we can benefit from multithreading, because one thread can make progress while another is waiting.
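As a rough standalone illustration (not part of the service itself), the standard library's concurrent.futures makes the effect easy to see: ten delayed requests sent from a thread pool finish in roughly the time of one, because the threads spend almost all of their time waiting rather than computing.

import time
from concurrent.futures import ThreadPoolExecutor

import requests


def fetch(_):
    # Each call waits about 2 seconds on the remote service (I/O bound)
    return requests.get('https://httpbin.org/delay/2').status_code


start = time.time()
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(fetch, range(10)))
print(f"10 requests in {time.time() - start:.1f}s")  # close to ~2s, not ~20s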

4. When to choose multiprocessing?

On the other hand, multiprocessing should be used when you want to improve computational performance. Assuming our application actively uses the CPU (for example, passing data forward through a neural network), its performance depends entirely on the CPU's computing power. Such an application is described as CPU bound. To improve its performance we need multiprocessing: unlike threads, we create separate interpreter instances, and the computations run in parallel.
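A comparable standalone sketch for the CPU-bound case uses a process pool, so each worker gets its own interpreter and its own GIL. The busy loop here is just a stand-in for a forward pass through a model.

import time
from concurrent.futures import ProcessPoolExecutor


def busy_work(n):
    # CPU-bound stand-in for model inference: pure-Python number crunching
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    start = time.time()
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(busy_work, [10_000_000] * 4))
    print(f"4 tasks in {time.time() - start:.1f}s")  # noticeably faster than running the same tasks in threads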

Flask's built-in server has been threaded by default since version 1.0. So why not deploy your application with Flask alone? Flask's own documentation clearly states that "Flask's built-in server is not suitable for production" because it doesn't scale well.

We'll look at another deployment solution in a minute. But first, I recommend testing the application to see how well it can handle the load.

5. Use Locust for load testing

It's important to load test your API before sending traffic to it. One way is to use a Python library called Locust. It runs a web application on localhost and has a simple interface that allows us to customize tests and visualize the testing process.

Let's run some tests on our Flask application on localhost.

1) Install locust with the following command:

pip3 install locust

2) Create the load-testing script (load_testing.py) and add it to our project directory:

from locust import HttpUser, between, task


class MyWebsiteUser(HttpUser):
    # Each simulated user waits 5-15 seconds between tasks
    wait_time = between(5, 15)

    @task
    def load_main(self):
        # Hit the root endpoint of the service under test
        self.client.get("/")

3) Run the application with the following command:

python3 demo.py

4) Run our load test application with another command:

locust -f load_testing.py --host=http://0.0.0.0:5000/

5) Visit http://localhost:8089 in the browser; you will see the Locust interface

6) Specify the number of users to simulate and the spawn rate (how many new users are started per second)

7) Adjust the remaining parameters as needed and start the test
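As a side note, recent Locust versions can also run the same test headless from the command line, which is handy for automated or repeatable runs. The user count, spawn rate, and duration below are arbitrary examples:

locust -f load_testing.py --host=http://0.0.0.0:5000/ --headless -u 50 -r 10 --run-time 1m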

6. Test multithreading

Now let's test the application with and without multithreading enabled. This will show whether threading helps process more requests in a given amount of time. Remember that we set a two-second delay before getting a response from the server.

The following figure shows the situation when multithreading is turned off:

...
if __name__ == "__main__":
    # Explicitly disable handling requests in separate threads
    app.run(host='0.0.0.0', threaded=False)

[Figure: load test results with threaded=False]

The next test enables threading. To do this, remove the threaded=False argument (or set threaded=True, which has been the default since Flask 1.0).
[Figure: load test results with threaded=True]

As you can see, you can get higher RPS (requests per second) rates by using threads.

But what if we develop an application to classify images? This operation actively uses our CPU. In this case, rather than using threads, it would be better to use a separate process to handle the requests.

7. Use uWSGI to handle concurrency

To set up concurrent request handling, we need uWSGI. It is a tool that gives us more control over multiprocessing and threading, with enough power and flexibility to deploy applications while still being approachable.

Let's change the Flask application we created earlier to look more like a real machine learning service:

from flask import Flask
import numpy as np
import tensorflow as tf


app = Flask(__name__)

# Load a pretrained MobileNetV2 classifier once, when the application starts
model = tf.keras.applications.MobileNetV2(input_shape=(160, 160, 3),
                                          include_top=True,
                                          weights='imagenet')

@app.route('/')
def predict():
    # Run a forward pass on a dummy input to simulate real inference work
    data = np.zeros((1, 160, 160, 3))
    model.predict(data)
    return "Hello, World!"

if __name__ == "__main__":
    app.run(host='0.0.0.0')

When the application starts, it initializes the model. Then, every time it receives a request, it runs a forward pass of an array of zeros through the model to simulate real inference work.

First, let's look at the RPS (requests per second) of the application without uWSGI:

[Figure: load test results without uWSGI, multithreading off (threaded=False)]

[Figure: load test results without uWSGI, multithreading on (threaded=True)]

We tested the "service" as a pure Flask application with threaded=False and threaded=True. As you can see, although the RPS is higher when threaded=True, the improvement is modest. This is because the application is still mostly limited by the CPU.

8. Use uWSGI for testing

First, we need to install uWSGI:

pip3 install uwsgi

Then, we need to add a configuration file (uwsgi.ini) to our project directory. This file contains all the parameters uWSGI needs to run our application.

[uwsgi]
module = demo:app
master = true
processes = 2
threads = 1
enable-threads = true
listen = 1024
need-app = true

http = 0.0.0.0:5000

Let's review the most important parameters you will see in this configuration file:

  • module = demo:app — the Python module that contains the application (demo.py) and the name of the Flask object inside it (app)
  • master = true — runs the master uWSGI process, which spawns and respawns workers, handles logging, and manages other functions. In most cases this should be set to "true"
  • processes = 2 / threads = 1 — the number of worker processes and the number of threads per worker. You can also use the uWSGI subsystem called cheaper to scale the number of workers automatically (see the sketch after this list)
  • enable-threads = true — allows Python threads to run inside the uWSGI workers
  • listen = 1024 — the size of the request (listen) queue
  • need-app = true — prevents uWSGI from starting if it cannot find or load the application. If set to false, uWSGI ignores import problems and returns a 500 status for every request
  • http = 0.0.0.0:5000 — the address and port used to access the application. It is only used when clients send requests directly to the application (rather than through a front-end web server)
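As a rough sketch of how the cheaper subsystem mentioned above could be configured (the option names follow uWSGI's documentation, and the numbers are arbitrary examples to be tuned for your workload):

[uwsgi]
# ... same settings as in the configuration above ...
# upper bound on the number of workers
processes = 8
# adaptive scaling: never drop below 2 workers
cheaper = 2
# number of workers started at launch
cheaper-initial = 2
# spawn at most one extra worker at a time
cheaper-step = 1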

By default, uWSGI loads your application once and then forks worker processes from it. You can instead specify lazy-apps = true, so that uWSGI loads the application separately in each worker. This helps avoid errors with TensorFlow models and other state that cannot safely be shared across forked workers.
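For the TensorFlow service above, that just means adding one line to uwsgi.ini (a sketch; keep the rest of the configuration as before):

[uwsgi]
# ... same settings as before ...
# load the application once per worker instead of load-then-fork
lazy-apps = true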

Another key parameter is listen, the size of the socket backlog. It should be set to the maximum number of pending connections you expect to queue up during a reload; otherwise, some of them may fail. By default, listen equals 100, and values above the operating system's own limit (net.core.somaxconn on Linux) require raising that limit as well.

uWSGI has more useful parameters, but for now, let's run the application:

uwsgi uwsgi.ini

We can now view the load test results for the Flask application wrapped in uWSGI:

[Figure: load test results with uWSGI, 1 process / 2 threads]

[Figure: load test results with uWSGI, 2 processes / 1 thread]

Using two separate processes radically improves RPS. However, it's not always obvious whether an application is I/O bound or CPU bound, so choosing between multiprocessing and multithreading can be a bit tricky.

9. Conclusion

Hopefully now you understand why this question has been on my mind. Do data scientists need to focus more on getting models ready for deployment, and if so, what are the constraints? Ideally, with Flask and uWSGI, you already have the basics to get it up and running. But the sky is the limit, and your situation may call for more.

Finally, if you want your application to be open to the world, you should pay attention to security. We didn't cover security in this article because it's another topic entirely, but it's something you should keep in mind.

As always, I hope this article was useful to you. Stay safe.


Original Link: Concurrent Processing of AI Services—BimAnt
