Configure distributed TensorFlow

Training a neural network on a large data set often requires substantial computing resources and can take several days to complete.
TensorFlow can be deployed in a distributed mode that splits a training job into multiple smaller tasks and assigns them to different machines, which carry out the computation cooperatively. Replacing a single machine with a cluster in this way can greatly shorten training time.
The roles and principles of distributed TensorFlow
To configure TensorFlow for distributed training, you need to understand distributed role assignments in TensorFlow.
  • ps: the parameter server of distributed training; it waits for each computing terminal (supervisor) to connect.
  • worker: called a supervisor in the TensorFlow code; it is a computing terminal of distributed training.
  • chief supervisor: among the computing terminals, one must be selected as the main terminal. It is the first of the computing terminals to start, and its job is to combine the learned parameters from each terminal after computation and to save or load them.
The network identifier of each role is unique; that is, the roles are distributed across machines with different IPs (or on the same machine with different ports).
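As a rough sketch of how this role-to-address mapping is usually described, a tf.train.ClusterSpec (TensorFlow 1.x API) can list the ps and worker endpoints. The localhost addresses and ports below are placeholders for illustration only:

```python
import tensorflow as tf

# Minimal sketch of a cluster description; replace the placeholder
# addresses/ports with those of your own machines.
cluster_spec = tf.train.ClusterSpec({
    "ps": ["127.0.0.1:1681"],                        # parameter server
    "worker": ["127.0.0.1:1682", "127.0.0.1:1683"],  # computing terminals
})
```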
In actual operation, the network-construction code of each role must be 100% identical. The division of labor among the three is as follows:
  • The ps server acts as a multi-party coordinator and waits for each computing terminal to connect.
  • The chief supervisor manages the global learning parameters at startup, either initializing them or loading them from a saved model.
  • The other computing terminals are only responsible for fetching their assigned tasks and performing the computation; they do not save checkpoints, nor do they write parameter information such as summary logs for TensorBoard visualization.
All communication in this process takes place over the RPC protocol.
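The sketch below shows one way each role might be started under these assumptions (TensorFlow 1.x; in practice the job name and task index would be passed in as command-line flags, and the addresses are placeholders):

```python
import tensorflow as tf

# Placeholder cluster description; replace addresses with your machines.
cluster_spec = tf.train.ClusterSpec({
    "ps": ["127.0.0.1:1681"],
    "worker": ["127.0.0.1:1682", "127.0.0.1:1683"],
})

job_name = "ps"    # "ps" on the server, "worker" on each computing terminal
task_index = 0     # the worker with task_index 0 acts as the chief supervisor

# Each process creates a server for its own role in the cluster.
server = tf.train.Server(cluster_spec, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()  # the parameter server only waits for terminals to connect
```

The ps process never runs the training loop itself; it simply blocks in server.join() and serves the shared variables to the workers over RPC.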

Two specific ways to deploy TensorFlow in distributed mode
During configuration, you first need to create a server, supplying the IP addresses and ports of the ps and all workers. Next, use managed_session in tf.train.Supervisor to manage the open session. The session is responsible only for running operations; communication and coordination are handed over to the Supervisor.
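A minimal end-to-end sketch of this pattern might look like the following (TensorFlow 1.x; the cluster addresses, the log directory, and the trivial counter op standing in for a real training op are all placeholder assumptions):

```python
import tensorflow as tf

# Sketch only: addresses, log directory and the counter op are placeholders;
# job_name/task_index would normally come from command-line flags.
cluster_spec = tf.train.ClusterSpec({
    "ps": ["127.0.0.1:1681"],
    "worker": ["127.0.0.1:1682", "127.0.0.1:1683"],
})
job_name, task_index = "worker", 0           # this process plays the chief worker
server = tf.train.Server(cluster_spec, job_name=job_name, task_index=task_index)
is_chief = (task_index == 0)

# Variables are placed on the ps; operations run on this worker.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % task_index,
        cluster=cluster_spec)):
    global_step = tf.train.get_or_create_global_step()
    train_op = tf.assign_add(global_step, 1)  # stand-in for a real training op

# Only the chief initializes/restores variables and writes checkpoints.
sv = tf.train.Supervisor(is_chief=is_chief,
                         logdir="/tmp/distributed_demo",
                         global_step=global_step)

# managed_session handles initialization and coordination;
# the session itself only runs operations.
with sv.managed_session(server.target) as sess:
    while not sv.should_stop():
        step = sess.run(train_op)
        if step >= 100:
            sv.request_stop()
```

On startup, the chief (task_index 0) initializes or restores the variables and writes checkpoints to logdir; the other workers wait for the chief to be ready and then run operations through the same managed_session.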
