Configure distributed TensorFlow

Training a neural network on a large data set often requires substantial computing resources and can take several days to complete.
TensorFlow can be deployed in a distributed mode that splits a training job into multiple smaller tasks and assigns them to different machines, which carry out the computation cooperatively. Replacing a single machine with a cluster in this way can greatly shorten training time.
The roles and principles of distributed TensorFlow
To configure TensorFlow for distributed training, you need to understand distributed role assignments in TensorFlow.
  • ps: the parameter server of distributed training; it waits for each computing terminal (supervisor) to connect.
  • worker: called a supervisor in the TensorFlow code; it is a computing terminal of distributed training.
  • chief supervisor: among the computing terminals, one must be selected as the main terminal. It is the first of the computing terminals to start, and its job is to combine the learned parameters from each terminal after computation and to save or load them.
The network identifier of each role is unique; that is, the roles are distributed across machines with different IPs (or on the same machine with different ports).
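As a rough sketch of how this role-to-address mapping is usually described, a tf.train.ClusterSpec (TensorFlow 1.x API) can list the ps and worker endpoints. The localhost addresses and ports below are placeholders for illustration only:

```python
import tensorflow as tf

# Minimal sketch of a cluster description; replace the placeholder
# addresses/ports with those of your own machines.
cluster_spec = tf.train.ClusterSpec({
    "ps": ["127.0.0.1:1681"],                        # parameter server
    "worker": ["127.0.0.1:1682", "127.0.0.1:1683"],  # computing terminals
})
```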
In actual operation, the network-construction code of each role must be 100% identical. The division of labor among the three is as follows:
  • The ps server acts as a multi-party coordinator and waits for each computing terminal to connect.
  • The chief supervisor manages the global learning parameters at startup, either initializing them or loading them from a saved model.
  • The other computing terminals are only responsible for fetching their assigned tasks and performing the computation; they do not save checkpoints, nor do they write parameter information such as summary logs for TensorBoard visualization.
All communication in this process takes place over the RPC protocol.
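The sketch below shows one way each role might be started under these assumptions (TensorFlow 1.x; in practice the job name and task index would be passed in as command-line flags, and the addresses are placeholders):

```python
import tensorflow as tf

# Placeholder cluster description; replace addresses with your machines.
cluster_spec = tf.train.ClusterSpec({
    "ps": ["127.0.0.1:1681"],
    "worker": ["127.0.0.1:1682", "127.0.0.1:1683"],
})

job_name = "ps"    # "ps" on the server, "worker" on each computing terminal
task_index = 0     # the worker with task_index 0 acts as the chief supervisor

# Each process creates a server for its own role in the cluster.
server = tf.train.Server(cluster_spec, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()  # the parameter server only waits for terminals to connect
```

The ps process never runs the training loop itself; it simply blocks in server.join() and serves the shared variables to the workers over RPC.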

Two specific ways to deploy TensorFlow in distributed mode
During configuration, you first need to create a server, supplying the IP addresses and ports of the ps and all workers. Next, use managed_session in tf.train.Supervisor to manage the open session. The session is responsible only for running operations; communication and coordination are handed over to the Supervisor.
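A minimal end-to-end sketch of this pattern might look like the following (TensorFlow 1.x; the cluster addresses, the log directory, and the trivial counter op standing in for a real training op are all placeholder assumptions):

```python
import tensorflow as tf

# Sketch only: addresses, log directory and the counter op are placeholders;
# job_name/task_index would normally come from command-line flags.
cluster_spec = tf.train.ClusterSpec({
    "ps": ["127.0.0.1:1681"],
    "worker": ["127.0.0.1:1682", "127.0.0.1:1683"],
})
job_name, task_index = "worker", 0           # this process plays the chief worker
server = tf.train.Server(cluster_spec, job_name=job_name, task_index=task_index)
is_chief = (task_index == 0)

# Variables are placed on the ps; operations run on this worker.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % task_index,
        cluster=cluster_spec)):
    global_step = tf.train.get_or_create_global_step()
    train_op = tf.assign_add(global_step, 1)  # stand-in for a real training op

# Only the chief initializes/restores variables and writes checkpoints.
sv = tf.train.Supervisor(is_chief=is_chief,
                         logdir="/tmp/distributed_demo",
                         global_step=global_step)

# managed_session handles initialization and coordination;
# the session itself only runs operations.
with sv.managed_session(server.target) as sess:
    while not sv.should_stop():
        step = sess.run(train_op)
        if step >= 100:
            sv.request_stop()
```

On startup, the chief (task_index 0) initializes or restores the variables and writes checkpoints to logdir; the other workers wait for the chief to be ready and then run operations through the same managed_session.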
