Implementation of Machine Learning Service Based on Task Queue

There are many ways to deploy machine learning models to production. One common approach is to implement the model as a web service, and the most popular type is a REST API. The service is deployed and runs 24/7, waiting to receive JSON requests from clients, extracting the input, and sending it to the ML model to predict the outcome. The result is then wrapped into a response and returned to the user.

1. Simple implementation mode

Suppose you start by googling "deploying machine learning as a REST API". You will get a million results, and if you put in the effort to read through the top ones, you'll see a common pattern for solving this problem, described below.

The popular way is to use a web framework (Flask, Django, or FastAPI) to build the API. Next, we need a machine learning model that takes the input and returns a prediction. To run the system in production, we also put a WSGI server (if we use Flask) or an ASGI server (if we use FastAPI) in front of the web module.

It is worth noting here that the machine learning model in this approach is usually implemented in the same code block as the web framework (Flask/FastAPI/…). This means that the machine learning model runs in the same process as the web module. This causes many problems:

Each Flask/FastAPI process can host only one instance of the ML model.
At any given moment, that ML model instance can handle only one request.
If we want to scale the application, we can use a WSGI server (such as gunicorn) or an ASGI server (such as uvicorn) to create many child processes. This multiplies the web modules and the machine learning models together, since they live in the same process.
For some heavy tasks, the ML model can take a long time (even seconds) to run inference. Because the model runs inside the web module's code, it blocks other requests: each task must complete before the next one is processed.
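
The pattern above can be sketched with only the standard library. This is a minimal illustration, not code from any of the frameworks mentioned: DummyModel, MODEL, and handle_request are hypothetical names, and the "model" is a made-up formula standing in for real inference.

```python
import json


class DummyModel:
    """Hypothetical stand-in for a real ML model."""

    def predict(self, trip_distance: float) -> float:
        # Made-up formula; a real model would run inference here.
        return 4.0 * trip_distance


# Loaded once per web process: every extra web process loads another copy.
MODEL = DummyModel()


def handle_request(raw_body: str) -> str:
    """Imitates a web handler: parse JSON, run inference inline, reply."""
    payload = json.loads(raw_body)
    # Inference runs inside the handler, so the handler is busy
    # until the model returns.
    duration = MODEL.predict(payload["trip_distance"])
    return json.dumps({"duration": duration})
```

Because MODEL lives in the same module as handle_request, there is no way to add model capacity without also adding web processes, which is the coupling the list above describes.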

So is there a better way? For heavy and long-running tasks, is there any way to process them without blocking client requests? Today, I'm going to cover an approach that is not new but does not seem to be widely applied to deploying ML models to production: using a distributed task queue.

2. What is a task queue?

Task queues are used as a mechanism for distributing work across threads or machines.

The input to a task queue is a unit of work, called a task, and a dedicated worker process then constantly monitors the queue to perform new work. – Celery Github

A task queue is a tool that lets you run different pieces of software in separate machines, processes, or threads. In an application, some parts (tasks) often run for a long time, or we don't know when they will finish. It is better to run these tasks in separate processes or on distributed machines and be notified of the results when they complete, so that they don't block other parts of the application. This suits long-running tasks such as sending emails, scraping web content, or, in this case, running ML models. Consider the architecture described below.

The architecture of the distributed task queue consists of three main modules: producers, consumers and message brokers.

The client sends a request to our web application (the producer), which is a FastAPI application in this article.
The producer sends a task message to the message broker.
ML workers (the consumers) consume messages from the message broker. After a task completes, the worker saves the result to the result backend (Redis serves as both broker and result backend here) and updates the task status.
After sending a task to the message broker, the FastAPI application can monitor the status of the task. When the status is complete, it retrieves the result and returns it to the client.
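
The four steps above can be imitated in a single process with the standard library. This is only a sketch of the message flow, assuming a queue.Queue in place of the message broker, a dict in place of the result backend, and a thread in place of the ML worker; all names (send_task, worker, get_result, demo) are hypothetical and this is not how Celery is implemented.

```python
import queue
import threading
import uuid

broker = queue.Queue()  # stands in for the message broker
results = {}            # stands in for the result backend
done = {}               # task_id -> Event, stands in for the task status


def send_task(payload: dict) -> str:
    """Producer side: enqueue a task message and return its id."""
    task_id = str(uuid.uuid4())
    done[task_id] = threading.Event()
    broker.put((task_id, payload))
    return task_id


def worker() -> None:
    """Consumer side: pop messages, run the task, store the result."""
    while True:
        item = broker.get()
        if item is None:  # sentinel used to stop the worker
            break
        task_id, payload = item
        # Made-up formula standing in for model inference.
        results[task_id] = {"duration": 4.0 * payload["trip_distance"]}
        done[task_id].set()


def get_result(task_id: str, timeout: float = 5.0) -> dict:
    """Producer side: wait for the task status, then fetch the result."""
    if not done[task_id].wait(timeout):
        raise TimeoutError(task_id)
    return results.pop(task_id)


def demo() -> dict:
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    task_id = send_task({"trip_distance": 3.0})
    outcome = get_result(task_id)
    broker.put(None)
    thread.join()
    return outcome
```

In the real system the three roles run in separate containers, so the queue and the result store have to live in an external service such as Redis rather than in process memory.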

The three modules are started in different processes or on distributed machines so that they can run independently. There are many tools for building task queues across several programming languages. In this blog, I'll focus on Python and use Celery, the most popular task queue tool for Python projects. To learn more about the benefits of Celery and distributed task queues, check out this great explainer. Now, let's jump to the problem below.

3. Travel time prediction model

To illustrate this, I'll build a simple machine learning model that predicts the average travel time given a pick-up location, a drop-off location, and a trip distance. This is a regression model. Note that I'm not focusing on building an accurate model in this blog; I only use it to set up the web service API. The model weights and how the model was built can be found at this link.

3.1 API overview

We will develop a web API for serving the machine learning model. It consists of 3 modules: web, Redis, and the ML model (worker). These modules are Dockerized and deployed as containers.

├── apps
│   └── api
│       ├── api_routers.py
│       └── main.py
├── boot
│   ├── docker
│   │   ├── celery
│   │   │   ├── cuda90.yml
│   │   │   └── trip
│   │   │       ├── Dockerfile
│   │   │       └── entrypoint.sh
│   │   ├── compose
│   │   │   └── trip_duration_prediction
│   │   │       ├── docker-compose.cpu.yml
│   │   │       ├── docker-compose.dev.yml
│   │   │       ├── docker-compose.yml
│   │   │       ├── docker-services.sh
│   │   │       ├── my_build.sh
│   │   └── uvicorn
│   │       ├── Dockerfile
│   │       ├── entrypoint.sh
│   │       └── requirements.txt
│   └── uvicorn
│       └── config.py
├── config.py
├── core
│   ├── managers
│   ├── schemas
│   │   ├── api_base.py
│   │   ├── health.py
│   │   └── trip.py
│   ├── services
│   │   ├── trip_duration_api.py
│   │   └── trip_duration_prediction_task.py
│   └── utilities
├── repo
│   ├── logs
│   └── models
│       └── lin_reg.bin
├── tasks
│   └── trip
│       └── tasks.py
└── tests
    ├── http_test
    │   └── test_api.py
    └── model_test
        └── test_trip_prediction_task.py

Here are the details of the repository folder structure:

apps: defines the main application and the API router for the web module using FastAPI.
boot: defines the Dockerfiles for the web and ML modules and the docker-compose file that links the 3 modules. It also contains the configuration for each Docker image and the yml file (cuda90.yml) that lists the worker's packages.
config.py: defines configuration values such as CELERY_BROKER_URL, CELERY_RESULT_BACKEND, TRIP_DURATION_MODEL, TRIP_DURATION_THRESHOLD, etc.
core: defines all implementation scripts used in the Web, Redis, and Worker modules.
repo: stores application and task logs when the API starts. It also stores the model weights.
tasks: defines the Celery task scripts.
tests: defines the unit tests for the API.

3.2 Web module

In the web module, I use FastAPI as the web framework. FastAPI offers many nice features, such as high performance, integration with Uvicorn, automatic type validation using Pydantic, automatic documentation generation, and more.

Let's see how the FastAPI application is started. The entry point is boot/docker/uvicorn/entrypoint.sh:

#!/usr/bin/env sh

USERNAME="$(id -u -n)"
MODULE="apps.api"
SOCKET="0.0.0.0:8182"
MODULE_APP="${MODULE}.main:app"
CONFIG_PATH="boot/uvicorn/config.py"
REPO_ROOT="repo"
LOGS_ROOT="${REPO_ROOT}/logs/apps/api"
LOGS_PATH="${LOGS_ROOT}/daemon.log"

sudo mkdir -p ${LOGS_ROOT} && \
sudo chown -R ${USERNAME} ${LOGS_ROOT} && \
sudo chown -R ${USERNAME} ${REPO_ROOT} && \

gunicorn \
    --name "${MODULE}" \
    --config "${CONFIG_PATH}" \
    --bind "${SOCKET}" \
    --log-file "${LOGS_PATH}" \
    "${MODULE_APP}"

Then, I define the API router and message schema - apps/api/api_routers.py:

import os
import json
from typing import Dict

from loguru import logger
from fastapi import Request, APIRouter

import config
from core.schemas.trip import TripAPIRequestMessage, TripAPIResponseMessage
from core.schemas.health import Health
from core.services.trip_duration_api import TripDurationApi

API_VERSION = config.API_VERSION
MODEL_VERSION = config.MODEL_VERSION

api_router = APIRouter()


@api_router.get("/health", response_model=Health, status_code=200)
def health() -> Dict:
    return Health(
        name=config.PROJECT_NAME, api_version=API_VERSION, model_version=MODEL_VERSION
    ).dict()


@api_router.post(
    f"/{config.API_VERSION}/trip/predict",
    tags=["Trips"],
    response_model=TripAPIResponseMessage,
    status_code=200,
)
def trip_predict(request: Request, trip_request: TripAPIRequestMessage):
    api_service = TripDurationApi()
    results = api_service.process_raw_request(request, trip_request)
    return results

core/schemas/trip.py:

from core.schemas.api_base import APIRequestBase, APIResponseBase


class TripAPIRequestMessage(APIRequestBase):
    PULocationID: int
    DOLocationID: int
    trip_distance: float

    class Config:
        schema_extra = {
            "example": {
                "request_id": "99999",
                "PULocationID": 130,
                "DOLocationID": 250,
                "trip_distance": 3.0,
            }
        }


class TripAPIResponseMessage(APIResponseBase):
    duration: float

    class Config:
        schema_extra = {"example": {"reply_code": 0, "duration": 12.785509620119132}}

core/services/trip_duration_api.py:

import collections

import celery
from loguru import logger
from fastapi import Request
from fastapi.encoders import jsonable_encoder

import config
from core.schemas.trip import TripAPIRequestMessage, TripAPIResponseMessage
from core.utilities.cls_time import Timer

task_celery = config.CeleryTasksGeneralConfig
celery_app = celery.Celery()
celery_app.config_from_object(task_celery)


class TripDurationApi:
    # pylint: disable=too-many-instance-attributes
    def call_celery_matching(
        self,
        pu_location_id: int,
        do_location_id: int,
        trip_distance: float,
    ):
        """
        :type celery_result: celery.result.AsyncResult
        """
        celery_result = celery_app.send_task(
            task_celery.task_process_trip,
            args=[
                pu_location_id,
                do_location_id,
                trip_distance,
            ],
            queue=task_celery.task_trip_queue,
        )

        return celery_result

    def process_api_request(self):
        celery_result = self.call_celery_matching(
            self.trip_request.PULocationID,
            self.trip_request.DOLocationID,
            self.trip_request.trip_distance,
        )

        results: dict = {}
        try:
            results = celery_result.get(timeout=60)
            celery_result.forget()
            results = results or {}
        except celery.exceptions.TimeoutError:
            results = {}

        reply_code: int = results.pop("reply_code", 1)
        duration: float = float(results.pop("duration", 0.0))

        self.response = TripAPIResponseMessage(reply_code=reply_code, duration=duration)
        self.status_code = 200
        self.timings = results.pop("timings", {})
        self.results = results

Next, I define a class called TripDurationApi in core/services/trip_duration_api.py, where I handle the request logic.

The process_raw_request function, which the router calls, collects the request information (its body is omitted from the listing above).
The call_celery_matching function adds a task to the Redis message broker queue, which is deployed in another container. The ML module, deployed in yet another container, pops the job from Redis and starts processing it. The returned value is a promise (a Celery AsyncResult) that lets the web module know when the worker has completed the task or when the timeout has elapsed. Pay attention to lines 29 and 35 of the listing, inside the send_task call, where task_celery.task_process_trip is passed as the Celery task name and task_celery.task_trip_queue as the Celery queue.
Lines 12 to 14, which create and configure celery_app, let the web module connect to the ML module via Celery.
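
The "send, then wait with a timeout" flow in process_api_request can be tried out without Celery. The sketch below uses concurrent.futures from the standard library as an analogy: a Future plays the role of the AsyncResult promise. The function names and the made-up prediction formula are hypothetical; only the fallback values (reply_code 1, duration 0.0) mirror the listing above.

```python
import concurrent.futures
import time


def slow_predict(trip_distance: float, delay: float) -> dict:
    """Hypothetical worker task: sleeps to imitate slow inference."""
    time.sleep(delay)
    return {"reply_code": 0, "duration": 4.0 * trip_distance}


def request_prediction(executor, trip_distance, delay, timeout) -> dict:
    # submit() returns immediately with a Future, a promise for the result.
    future = executor.submit(slow_predict, trip_distance, delay)
    try:
        results = future.result(timeout=timeout) or {}
    except concurrent.futures.TimeoutError:
        # The worker did not finish in time: fall back to an empty result.
        results = {}
    # Same defaults as the listing above: reply_code 1 and duration 0.0.
    return {
        "reply_code": results.get("reply_code", 1),
        "duration": float(results.get("duration", 0.0)),
    }


def demo():
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
        fast = request_prediction(executor, 3.0, delay=0.0, timeout=2.0)
        slow = request_prediction(executor, 3.0, delay=0.5, timeout=0.05)
    return fast, slow
```

The first call returns the prediction, while the second times out and yields the error reply, which is the behaviour the client sees when a worker takes longer than the web module is willing to wait.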

Everything is combined and built into a single docker image.

Web Dockerfile:

ARG VM_BASE
FROM $VM_BASE

ARG VM_USER
ARG VM_HOME
ARG VM_CODE
ARG VM_PIP

COPY . $VM_CODE
WORKDIR $VM_CODE/
RUN rm -rf libs
RUN apk add --no-cache sudo \
    && apk add --no-cache --virtual .build-deps gcc musl-dev g++\
    && pip install --no-cache-dir -r $VM_PIP \
    && apk del .build-deps

RUN apk add --no-cache bash

RUN adduser --disabled-password --gecos '' $VM_USER \
    && addgroup sudo \
    && adduser $VM_USER sudo \
    && echo '%sudo ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers \
    && chown -R $VM_USER $VM_HOME

USER $VM_USER

WORKDIR $VM_CODE/
EXPOSE 8182
ENTRYPOINT ["boot/docker/uvicorn/entrypoint.sh"]

3.3 Worker module

In Celery, each piece of work that can be done in a separate process or machine is called a task. Tasks can vary from scraping web content to sending emails, or even running complex machine learning models. Tasks can be triggered at runtime or periodically. Once deployed, each worker can run tasks in a process, a green thread, and so on, depending on the Celery execution pool we're using. To get a better understanding of Celery execution pools, you can read more on this blog. In this API, I choose the gevent pool. The entry point for Celery can be found in boot/docker/celery/trip/entrypoint.sh:

#!/usr/bin/env bash

USERNAME="$(id -u -n)"
MODULE="tasks.trip"
REPO_ROOT="repo"
LOGS_ROOT="${REPO_ROOT}/logs/tasks/trip"
LOGS_PATH="${LOGS_ROOT}/daemon.log"

sudo mkdir -p ${LOGS_ROOT} && \
sudo chown -R ${USERNAME} ${LOGS_ROOT} && \
sudo chown -R ${USERNAME} ${REPO_ROOT} && \
source activate venv && \
python -m celery worker \
    -A ${MODULE} \
    -Q ${MODULE} \
    -P gevent \
    --prefetch-multiplier=1 \
    --concurrency=4 \
    --loglevel=INFO \
    --logfile="${LOGS_PATH}"

Note the -P gevent option, which selects gevent as the Celery execution pool. The prefetch multiplier is the number of messages to prefetch at a time; setting it to 1 means the worker reserves only one task per concurrency slot (a green thread here) at a time. Concurrency is the number of green threads created per Celery worker instance.
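
As a quick worked example: Celery's documentation describes the prefetch setting as a multiplier applied to the number of concurrent workers, so the number of messages one worker instance may reserve ahead is the product of the two. The helper name below is hypothetical and only does the arithmetic.

```python
def max_prefetched_messages(prefetch_multiplier: int, concurrency: int) -> int:
    """Messages a single worker instance may reserve ahead of time."""
    return prefetch_multiplier * concurrency


# Settings from the entrypoint above: --prefetch-multiplier=1 --concurrency=4
with_article_settings = max_prefetched_messages(1, 4)

# Celery's documented default multiplier is 4, for comparison.
with_default_multiplier = max_prefetched_messages(4, 4)
```

With the settings above the worker holds at most 4 messages, one per green thread; with the default multiplier it could hold 16, which is less desirable when each task is a long-running inference.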

Celery configuration

Celery configuration is defined in config.py:

class CeleryTasksGeneralConfig:
    task_trip_queue = "tasks.trip"
    task_trip_prefix = "tasks.trip.tasks"
    task_process_trip = f"{task_trip_prefix}.predict_ride"

    broker_url = os.environ.get("CELERY_BROKER_URL", None)
    result_backend = os.environ.get("CELERY_RESULT_BACKEND", None)
    worker_prefetch_multiplier = int(
        os.environ.get("CELERY_WORKER_PREFETCH_MULTIPLIER", 1)
    )

The above file contains all the configuration Celery needs to run. Lines 6 and 7 set the broker URL (broker_url) and the result backend (result_backend), both Redis in this case. These values are read from the container's environment variables, which I explain later when defining docker-compose.

When a producer sends a message to the message broker, it needs to specify which task to run and on which queue. Based on the queue name and task name, Celery can then dispatch the message to the worker that handles that task. So, in line 2, I define the queue name as "tasks.trip", and in lines 3 and 4 the task name as "tasks.trip.tasks.predict_ride". Recall that these parameters are used in core/services/trip_duration_api.py when the web module sends the Celery task.
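
To make the idea of dispatching by task name concrete, here is a minimal sketch of a name-to-function registry. It is an illustration only, not Celery's internals: TASK_REGISTRY, register, and dispatch are hypothetical names, and the prediction formula is made up.

```python
TASK_REGISTRY = {}


def register(name: str):
    """Decorator that files a function under a dotted task name."""
    def decorator(func):
        TASK_REGISTRY[name] = func
        return func
    return decorator


# The name is the module path plus the function name, as in config.py.
TASK_PREFIX = "tasks.trip.tasks"


@register(f"{TASK_PREFIX}.predict_ride")
def predict_ride(pu_location_id: int, do_location_id: int, trip_distance: float):
    # Made-up formula standing in for the real model.
    return {"duration": 4.0 * trip_distance}


def dispatch(message: dict):
    """Looks up the task named in the message and calls it with its args."""
    func = TASK_REGISTRY[message["task"]]
    return func(*message["args"])
```

The producer only needs to know the string name and the argument list, which is why the web module can send the task without importing any of the worker's code.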

Celery tasks

Celery tasks are implemented in tasks/trip/tasks.py

import celery

import config
from core.utilities.cls_loguru_config import loguru_setting
from core.services.trip_duration_prediction_task import TripDurationTask

loguru_setting.setup_app_logging()

app = celery.Celery()
app.config_from_object(config.CeleryTasksGeneralConfig)
app.autodiscover_tasks(["tasks.trip"])


@celery.shared_task(time_limit=60, soft_time_limit=60)
def predict_ride(pu_location_id: int, do_location_id: int, trip_distance: float):
    return TripDurationTask().process(pu_location_id, do_location_id, trip_distance)

Lines 9 to 11 create the Celery app, load its configuration, and register the tasks under tasks.trip. This way, when the client later calls the trip task by name, Celery triggers the task defined in this file. In line 14, I set the maximum execution time to 60 seconds: if the task does not complete within 60 seconds, it fails and the client is notified of the error.

Travel Time Prediction Task

The prediction logic is implemented in core/services/trip_duration_prediction_task.py:

import pickle
import collections

# import boto3
from loguru import logger

import config
from core.utilities.cls_time import DictKeyTimer
from core.utilities.cls_constants import APIReply

# s3 = boto3.resource("s3")

# TRIP_DURATION_MODEL_KEY = config.ModelConfig.s3_trip_model_key()
# TRIP_DURATION_MODEL_BUCKET = config.ModelConfig.s3_bucket()


TRIP_DURATION_MODEL_PATH = config.ModelConfig.trip_duration_model()
with open(TRIP_DURATION_MODEL_PATH, "rb") as f_in:
    dv, model = pickle.load(f_in)


def preprare_feature(pu_location_id: int, do_location_id: int, trip_distance: float):
    features = {}
    features["PU_DO"] = f"{pu_location_id}_{do_location_id}"
    features["trip_distance"] = trip_distance
    return features


class TripDurationTask:
    @classmethod
    def process(cls, pu_location_id: int, do_location_id: int, trip_distance: float):
        try:
            timings = collections.OrderedDict()
            step_name = "feature_prepare"
            with DictKeyTimer(timings, step_name):
                features = preprare_feature(
                    pu_location_id, do_location_id, trip_distance
                )
            step_name = "model_predict"
            with DictKeyTimer(timings, step_name):
                pred = cls.predict(features)
                logger.info(f"Predict duration:{pred}")

            result = {
                "duration": pred,
                "reply_code": APIReply.SUCCESS,
                "timings": timings,
            }
        # pylint: disable=broad-except
        except Exception:
            result = {
                "duration": 0.0,
                "reply_code": APIReply.ERROR_SERVER,
                "timings": timings,
            }
        return result

    @classmethod
    def predict(cls, features: dict):
        X = dv.transform(features)
        preds = model.predict(X)
        return float(preds[0])

The above file is where the ML model is actually used. From line 17 to line 19, I load the model weights stored in the repo/models folder. The rest is self-explanatory: the regression model takes the pick-up location, drop-off location, and trip distance as inputs and predicts the travel time.
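
For readers curious about what dv.transform and model.predict do with the feature dictionary, here is a rough pure-Python sketch. It assumes that dv behaves like a dictionary vectorizer (string values become one-hot columns, numbers pass through) and that the model is linear, as the file name lin_reg.bin suggests; neither is confirmed by the listing. TinyDictVectorizer, linear_predict, and the weights are hypothetical.

```python
from typing import Dict, List


class TinyDictVectorizer:
    """Hypothetical stand-in for dv: one-hot encodes string values and
    passes numeric values through."""

    def fit(self, rows: List[Dict[str, object]]) -> "TinyDictVectorizer":
        names = set()
        for row in rows:
            for key, value in row.items():
                names.add(f"{key}={value}" if isinstance(value, str) else key)
        self.feature_names = sorted(names)
        self.index = {name: i for i, name in enumerate(self.feature_names)}
        return self

    def transform(self, row: Dict[str, object]) -> List[float]:
        vector = [0.0] * len(self.feature_names)
        for key, value in row.items():
            if isinstance(value, str):
                name, x = f"{key}={value}", 1.0
            else:
                name, x = key, float(value)
            if name in self.index:  # categories unseen during fit are ignored
                vector[self.index[name]] = x
        return vector


def linear_predict(weights: List[float], bias: float, x: List[float]) -> float:
    """Linear model: bias plus the dot product of weights and features."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


dv = TinyDictVectorizer().fit([
    {"PU_DO": "130_250", "trip_distance": 3.0},
    {"PU_DO": "43_151", "trip_distance": 1.2},
])
x = dv.transform({"PU_DO": "130_250", "trip_distance": 3.0})
# Made-up weights, one per column in dv.feature_names.
duration = linear_predict([2.0, -1.0, 3.5], 1.0, x)
```

The combined "PU_DO" string is what turns a pick-up and drop-off pair into a single categorical feature, so each route gets its own weight in the linear model.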

4. Connect everything with Docker Compose

As I explained at the beginning, we need 3 modules: web, Redis, and ML. To connect these three parts and let them communicate with each other, I use docker-compose to define three services. When the application starts, three corresponding containers are created and communicate with each other over the Docker network. Details can be found in boot/docker/compose/trip_duration_prediction/docker-compose.yml.

version: "2.3"
services:

  web:
    build:
      context: "${DC_UNIVERSE}"
      dockerfile: "${WEB_VM_FILE}"
      args:
        VM_BASE: "${WEB_VM_BASE}"
        VM_USER: "${WEB_VM_USER}"
        VM_HOME: "${WEB_VM_HOME}"
        VM_CODE: "${WEB_VM_CODE}"
        VM_PIP: "${WEB_VM_PIP}"
    platform: linux/amd64
    image: ${DOCKER_IMAGE_PROJECT_ROOT_NAME}_web:${COMMIT_ID}
    ports:
      - "${HTTP_PORT}:8182"
    volumes:
      - "${HOST_REPO_DIR}:${WEB_VM_CODE}/repo"
    restart: always
    environment:
      VERSION: "${WEB_VERSION}"
      PROJECT_APP: "${WEB_VM_PROJECT_APP}"
      REDIS_HOST: "redis"
      REDIS_PORT: "${REDIS_PORT}"
      CELERY_BROKER_URL: "redis://redis:${REDIS_PORT}"
      CELERY_RESULT_BACKEND: "redis://redis:${REDIS_PORT}"

  redis:
    image: redis:latest
    restart: on-failure
    expose:
      - "${REDIS_PORT}"
    command: redis-server --port "${REDIS_PORT}"

  worker:
    build:
      context: "${DC_UNIVERSE}"
      dockerfile: "${WORKER_VM_FILE}"
      args:
        VM_BASE: "${WORKER_VM_BASE}"
        VM_USER: "${WORKER_VM_USER}"
        VM_HOME: "${WORKER_VM_HOME}"
        VM_CODE: "${WORKER_VM_CODE}"
        VM_CONDA: "${WORKER_VM_CONDA}"
    platform: linux/amd64
    volumes:
      - "${HOST_REPO_DIR}:${WORKER_VM_CODE}/repo"
    image: ${DOCKER_IMAGE_PROJECT_ROOT_NAME}_worker:${COMMIT_ID}
    restart: always
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: "0"
      PROJECT_APP: "${WORKER_VM_PROJECT_APP}"
      REDIS_HOST: "redis"
      REDIS_PORT: "${REDIS_PORT}"
      CELERY_BROKER_URL: "redis://redis:${REDIS_PORT}"
      CELERY_RESULT_BACKEND: "redis://redis:${REDIS_PORT}"

The .env file contains all the parameters used when running docker-compose. It can be found in boot/docker/compose/trip_duration_prediction/.env.

DC_UNIVERSE=../../../..

HTTP_PORT=8182
REDIS_PORT=6379
GPU_MEMORY_SET=800


WEB_VERSION=v1
WEB_VM_FILE=boot/docker/uvicorn/Dockerfile
WEB_VM_BASE=python:3.8-alpine
WEB_VM_USER=docker
WEB_VM_HOME=/home/docker
WEB_VM_CODE=/home/docker/workspace
WEB_VM_PIP=./boot/docker/uvicorn/requirements.txt
WEB_VM_PROJECT_APP=apps.api

WORKER_VM_FILE=boot/docker/celery/trip/Dockerfile
WORKER_VM_BASE=nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04
WORKER_VM_USER=docker
WORKER_VM_HOME=/home/docker
WORKER_VM_CODE=/home/docker/workspace
WORKER_TORCH_DIR=/home/docker/.torch/models
WORKER_VM_CONDA=./boot/docker/celery/cuda90.yml
WORKER_VM_PROJECT_APP=tasks.trip
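
To see how the values in .env end up in the compose file, the plain ${VAR} substitution can be imitated with string.Template from the standard library. This is only a sketch of the idea: parse_env and resolve are hypothetical helpers, and Docker Compose's own interpolation supports more forms than shown here.

```python
from string import Template


def parse_env(text: str) -> dict:
    """Parses simple KEY=VALUE lines, skipping blank lines and comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key] = value
    return env


def resolve(template: str, env: dict) -> str:
    """Replaces ${VAR} placeholders with values from env."""
    return Template(template).substitute(env)


env = parse_env("HTTP_PORT=8182\nREDIS_PORT=6379\n")
broker_url = resolve("redis://redis:${REDIS_PORT}", env)
port_mapping = resolve("${HTTP_PORT}:8182", env)
```

The host name redis in the URL is the compose service name, which is how the web and worker containers reach the Redis container on the Docker network.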

5. Test the application

Run the entire application. When I start the API, there are three containers running.

Figure: 3 Docker containers up and running

The web container runs on port 8182, and we can access the API documentation at localhost:8182/docs. This is one of the nice features of FastAPI: once the API is implemented, we get the Swagger documentation with no extra effort.

Figure: API Swagger documentation

Then, let's call the API endpoint at /v1/trip/predict to see the prediction and check the logs.

Figure: Trip prediction endpoint

Figure: Trip prediction response

Figure: Logging for the web module

Figure: Logging for the worker module
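
The same call can be made from a script instead of the Swagger page. The sketch below uses only urllib from the standard library and the example payload from the request schema; it assumes the API is reachable at http://localhost:8182 as configured above, and build_predict_request and predict are hypothetical helper names.

```python
import json
import urllib.request


def build_predict_request(
    base_url: str = "http://localhost:8182",
) -> urllib.request.Request:
    """Builds the POST request without sending it."""
    payload = {
        "request_id": "99999",
        "PULocationID": 130,
        "DOLocationID": 250,
        "trip_distance": 3.0,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/trip/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(base_url: str = "http://localhost:8182") -> dict:
    """Sends the request; the containers must be running for this to work."""
    request = build_predict_request(base_url)
    # The client timeout is set above the 60 s the web module waits for Celery.
    with urllib.request.urlopen(request, timeout=70) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling predict() with the containers up should return a JSON object with the reply_code and duration fields defined in the response schema.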

Once a request is sent from the client to the web module, the prediction is handed off through Celery and processed in a worker, in a separate process or thread from the web module. This brings a lot of benefits:

Heavy tasks are handled in a separate process or thread, which can increase the number of requests we can handle because inference no longer runs inside the web module.
ML modules run in their own process and are packaged in a separate Docker image, which means that data scientists or machine learning engineers can maintain their implementation code and packages independently.
If the number of requests increases, we can easily increase the number of ML modules to handle the surge, while the web module stays the same.

6. Conclusion

In this blog, I showed how to use a distributed task queue architecture to implement an API that serves ML models. Using Celery, FastAPI, and Redis can help handle long-running tasks such as ML inference better, improving overall performance.

The original idea was something I developed and improved while working at my previous company. Thank you to Shan Hong and Jonathan, great former colleagues from whom I learned a lot.

If you want to refer to the full code, please check out GitHub.

Original Link: Queue-based ML Service Implementation—BimAnt