Practice of TensorFlow Serving in Kubernetes


About TensorFlow Serving

The following is the architecture diagram of TensorFlow Serving:

(Figure: TensorFlow Serving architecture diagram)

For more basic concepts and background on TensorFlow Serving, please refer to the official documentation; no translation is as good as the original text.

Here I summarize the points that I think are most important:

  • Through the Model Version Policy, TensorFlow Serving can serve multiple versions of multiple models at the same time;
  • By default, only the latest version of a model is loaded;
  • Supports automatic discovery and loading of models based on the file system;
  • Low request-processing latency;
  • Stateless, so it supports horizontal scaling;
  • Different model versions can be used for A/B testing;
  • Supports scanning and loading TensorFlow models from the local file system;
  • Supports scanning and loading TensorFlow models from HDFS;
  • Provides a gRPC interface for client calls (a minimal client sketch follows this list);
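
To make the last point concrete, below is a minimal sketch of a gRPC client. It assumes a recent tensorflow-serving-api Python package, a model named mnist served on localhost:9000, and an exported signature that accepts a single float input tensor called images; all of these names are illustrative and depend on how your model was exported.

import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# Open an insecure gRPC channel to the serving endpoint.
channel = grpc.insecure_channel("localhost:9000")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Build the request: model name, signature name, and input tensor.
request = predict_pb2.PredictRequest()
request.model_spec.name = "mnist"                      # which model to call
request.model_spec.signature_name = "serving_default"  # exported signature name
request.inputs["images"].CopyFrom(
    tf.make_tensor_proto(np.zeros((1, 784), dtype=np.float32)))

# Send the request and print the returned output tensors.
response = stub.Predict(request, timeout=10.0)
print(response.outputs)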

TensorFlow Serving Configuration

When I combed through the official TensorFlow Serving documentation, I still could not find out how to write a complete model config, which was very frustrating. There is no way around it: development moves too fast for the documentation to keep up, so the only option is to read the code.

In the main method of model_servers, we see the complete configuration items and descriptions of tensorflow_model_server as follows:

tensorflow_serving/model_servers/main.cc#L314

int main(int argc, char** argv) {
...
    std::vector<tensorflow::Flag> flag_list = {
        tensorflow::Flag("port", &port, "port to listen on"),
        tensorflow::Flag("enable_batching", &enable_batching, "enable batching"),
        tensorflow::Flag("batching_parameters_file", &batching_parameters_file,
                       "If non-empty, read an ascii BatchingParameters "
                       "protobuf from the supplied file name and use the "
                       "contained values instead of the defaults."),
        tensorflow::Flag("model_config_file", &model_config_file,
                       "If non-empty, read an ascii ModelServerConfig "
                       "protobuf from the supplied file name, and serve the "
                       "models in that file. This config file can be used to "
                       "specify multiple models to serve and other advanced "
                       "parameters including non-default version policy. (If "
                       "used, --model_name, --model_base_path are ignored.)"),
        tensorflow::Flag("model_name", &model_name,
                       "name of model (ignored "
                       "if --model_config_file flag is set"),
        tensorflow::Flag("model_base_path", &model_base_path,
                       "path to export (ignored if --model_config_file flag "
                       "is set, otherwise required)"),
        tensorflow::Flag("file_system_poll_wait_seconds",
                       &file_system_poll_wait_seconds,
                       "interval in seconds between each poll of the file "
                       "system for new model version"),
        tensorflow::Flag("tensorflow_session_parallelism",
                       &tensorflow_session_parallelism,
                       "Number of threads to use for running a "
                       "Tensorflow session. Auto-configured by default."
                       "Note that this option is ignored if "
                       "--platform_config_file is non-empty."),
        tensorflow::Flag("platform_config_file", &platform_config_file,
                       "If non-empty, read an ascii PlatformConfigMap protobuf "
                       "from the supplied file name, and use that platform "
                       "config instead of the Tensorflow platform. (If used, "
                       "--enable_batching is ignored.)")};
...
}

So the model version configuration is all done through --model_config_file. The following is the complete structure of the model config:

tensorflow_serving/config/model_server_config.proto#L55

// Common configuration for loading a model being served.
message ModelConfig {
  // Name of the model.
  string name = 1;

  // Base path to the model, excluding the version directory.
  // E.g. for a model at /foo/bar/my_model/123, where 123 is the version, the
  // base path is /foo/bar/my_model.
  //
  // (This can be changed once a model is in serving, *if* the underlying data
  // remains the same. Otherwise there are no guarantees about whether the old
  // or new data will be used for model versions currently loaded.)
  string base_path = 2;

  // Type of model.
  // TODO(b/31336131): DEPRECATED. Please use 'model_platform' instead.
  ModelType model_type = 3 [deprecated = true];

  // Type of model (e.g. "tensorflow").
  //
  // (This cannot be changed once a model is in serving.)
  string model_platform = 4;

  reserved 5;

  // Version policy for the model indicating how many versions of the model to
  // be served at the same time.
  // The default option is to serve only the latest version of the model.
  //
  // (This can be changed once a model is in serving.)
  FileSystemStoragePathSourceConfig.ServableVersionPolicy model_version_policy =
      7;

  // Configures logging requests and responses, to the model.
  //
  // (This can be changed once a model is in serving.)
  LoggingConfig logging_config = 6;
}

Here model_version_policy is the configuration we are looking for; it is defined as follows:

tensorflow_serving/sources/storage_path/file_system_storage_path_source.proto

message ServableVersionPolicy {
    // Serve the latest versions (i.e. the ones with the highest version
    // numbers), among those found on disk.
    //
    // This is the default policy, with the default number of versions as 1.
    message Latest {
      // Number of latest versions to serve. (The default is 1.)
      uint32 num_versions = 1;
    }

    // Serve all versions found on disk.
    message All {
    }

    // Serve a specific version (or set of versions).
    //
    // This policy is useful for rolling back to a specific version, or for
    // canarying a specific version while still serving a separate stable
    // version.
    message Specific {
      // The version numbers to serve.
      repeated int64 versions = 1;
    }
}

So model_version_policy currently supports three options:

  • all: {} means that all versions found are loaded;
  • latest: { num_versions: n } means that only the latest n versions are loaded; this is the default option (with n = 1);
  • specific: { versions: m } means that only the specified version(s) are loaded, which is usually used for testing or rollback;

So, when starting with tensorflow_model_server --port=9000 --model_config_file=<file>, a complete model_config_file can look like the following:

model_config_list: {
	config: {
		name: "mnist",
		base_path: "/tmp/monitored/_model",mnist
		model_platform: "tensorflow",
		model_version_policy: {
		   all: {}
		}
	},
	config: {
		name: "inception",
		base_path: "/tmp/monitored/inception_model",
		model_platform: "tensorflow",
		model_version_policy: {
		   latest: {
		   	num_versions: 2
		   }
		}
	},
	config: {
		name: "mxnet",
		base_path: "/tmp/monitored/mxnet_model",
		model_platform: "tensorflow",
		model_version_policy: {
		   specific: {
		   	versions: 1
		   }
		}
	}
}
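
Since model_config_file is just an ascii ModelServerConfig protobuf, it can be validated programmatically before pointing tensorflow_model_server at it. Below is a minimal sketch; it assumes the tensorflow-serving-api Python package (which ships the config protos in recent releases) and a config file named models.config.

from google.protobuf import text_format
from tensorflow_serving.config import model_server_config_pb2

# Parse the ascii config; text_format.Parse raises ParseError on any typo.
with open("models.config") as f:
    config = text_format.Parse(f.read(),
                               model_server_config_pb2.ModelServerConfig())

# Print the models that would be served.
for model in config.model_config_list.config:
    print(model.name, model.base_path, model.model_platform)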

TensorFlow Serving compilation

In fact, the compilation and installation of TensorFlow Serving are already clearly described in the GitHub setup document. Here I just want to emphasize one point, a very important one, which is mentioned in the document:

Optimized build

It's possible to compile using some platform specific instruction sets (e.g. AVX) that can significantly improve performance. Wherever you see 'bazel build' in the documentation, you can add the flags -c opt --copt=-msse4.1 --copt=-msse4.2 --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-O3 (or some subset of these flags). For example:

bazel build -c opt --copt=-msse4.1 --copt=-msse4.2 --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-O3 tensorflow_serving/...
Note: These instruction sets are not available on all machines, especially with older processors, so it may not work with all flags. You can try some subset of them, or revert to just the basic '-c opt' which is guaranteed to work on all machines.

This is very important. At first we compiled without the corresponding copt options, and testing showed that the resulting tensorflow_model_server performed very poorly (at least it could not meet our requirements): the latency of concurrent client requests to TensorFlow Serving was very high, with basically every request taking more than 100 ms. After adding these copt options and testing the same model at the same concurrency, 99.987% of requests completed within 50 ms, a huge difference.

Regarding --copt=-O2 or -O3 and what they mean, please see the description of GCC optimization options; it will not be discussed here (because I don't fully understand it either...).

So should everyone compile with exactly the copt options given in the official documentation? The answer is no! It depends on the CPU of the server that will run TensorFlow Serving. You can see which copt flags you should use by looking at /proc/cpuinfo:

(Figure: the flags field of /proc/cpuinfo, showing the instruction sets supported by the CPU)
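
For convenience, here is a small sketch (not part of the official tooling) that reads /proc/cpuinfo and prints a bazel command containing only the --copt flags your CPU actually supports:

# Map the CPU flags reported in /proc/cpuinfo to the corresponding --copt options.
FLAG_TO_COPT = {
    "sse4_1": "--copt=-msse4.1",
    "sse4_2": "--copt=-msse4.2",
    "avx":    "--copt=-mavx",
    "avx2":   "--copt=-mavx2",
    "fma":    "--copt=-mfma",
}

cpu_flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            cpu_flags = set(line.split(":", 1)[1].split())
            break

supported = [copt for flag, copt in FLAG_TO_COPT.items() if flag in cpu_flags]
print("bazel build -c opt " + " ".join(supported) + " tensorflow_serving/...")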

Precautions for use

  • Since TensorFlow Serving can serve multiple versions of multiple models at the same time, it is recommended that clients specify both the model name and the version they want when making gRPC calls, because different versions of a model may return very different predictions (see the sketch after this list);
  • When copying a trained model into the model base path, compress it into a tar file first, copy the tar into the base path, and then decompress it there. Models are large and copying takes time, so if you copy the files directly, the exported model file may arrive before its corresponding meta file. If TensorFlow Serving starts loading the model at that moment and cannot find the meta file, it will fail to load that version and will not try to load it again;
  • If you use a protobuf version <= 3.2.0, note that TensorFlow Serving can only load models up to 64 MB in size. You can check your version with pip list | grep proto. My environment uses 3.5.0.post1, so this problem does not occur there. See issue 582 for more;
  • Officially, dynamically changing the model_config_list through the gRPC interface is claimed to be supported, but in practice it requires custom development, so it is not available out of the box. Keep an eye on issue 380.
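
As a sketch of the first point above, the model version can be pinned on the PredictRequest itself. This builds on the client sketch earlier in this article; the model name inception and version 2 are made-up values for illustration.

from tensorflow_serving.apis import predict_pb2

request = predict_pb2.PredictRequest()
request.model_spec.name = "inception"
# Pin an exact version; if this field is left unset, the server picks a version
# according to its version policy (by default, the latest one).
request.model_spec.version.value = 2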

TensorFlow Serving on Kubernetes

TensorFlow Serving is deployed to Kubernetes as a Deployment. The following is the corresponding Deployment YAML:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: tensorflow-serving
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: "tensorflow-serving"
    spec:
      restartPolicy: Always
      imagePullSecrets:
      - name: harborsecret
      containers:
      - name: tensorflow-serving
        image: registry.vivo.xyz:4443/bigdata_release/tensorflow_serving1.3.0:v0.5
        command: ["/bin/sh", "-c","export CLASSPATH=.:/usr/lib/jvm/java-1.8.0/lib/tools.jar:$(/usr/lib/hadoop-2.6.1/bin/hadoop classpath --glob); /root/tensorflow_model_server --port=8900 --model_name=test_model --model_base_path=hdfs://xx.xx.xx.xx:zz/data/serving_model"]
        ports:
        - containerPort: 8900

Summary

TensorFlow Serving is really great: it lets you serve your models easily, and it provides file-system-based automatic model discovery, multiple model loading policies, support for A/B testing, and more. Deploying it in Kubernetes is also a pleasure. We currently offer self-service TensorFlow Serving on our TaaS platform, so users can easily create a custom TensorFlow Serving instance for their clients to call. In the future we will improve load balancing, elastic scaling, and automatic instance creation for TensorFlow Serving. Interested readers are welcome to get in touch.
