CS194 Full Stack Deep Learning (2): Infrastructure and Tooling

0. Preface

  • Related materials
  • Bilibili video (subtitles are auto-generated, but mostly adequate)
  • Introduction:
    • Ideal workflow: get data -> build a prediction system (e.g., a server API or a locally deployed program)
    • Actual workflow: annotate data -> build/test the model -> manage compute resources -> train/test the model -> deploy the model -> monitor results and build the data flywheel
  • To support each stage of the actual workflow, we need to choose a development stack and related tools.
    • This lesson focuses on the middle part.

[Figure: overview map of infrastructure and tooling]

  • The main topics covered include:
    • Choice of programming language and editor

1. Programming language and editor

  • Language: Python, mainly because of its rich ecosystem of third-party packages
  • There are many editor choices; the recommended ones are:
    • VS Code: the recommended editor
    • Jupyter: commonly used in data science
      • Well suited to simple programs such as getting-started tutorials
      • Poorly suited to complex tasks: hard to version-control, hard to test, lacks IDE features (e.g., code navigation), and hard to use for distributed jobs
    • Streamlit: neatly solves the interactive-applet problem, i.e., write plain Python and publish it as a simple web application (replacing Flask plus a hand-written front end)

2. Computing Resources

2.1. Why computing resources are important

  • Requirements can be considered along two dimensions:

| Dimension | Required function | Ideal state | Solution |
| --- | --- | --- | --- |
| Development | Write code, debug models, view results | Develop and train models quickly, preferably with a graphical interface | Desktop system or cloud server with 1-4 GPUs |
| Training/Evaluation | Architecture and hyperparameter search, training large models | Run training quickly and review results | 4-GPU desktop system, or a GPU cluster |
  • In addition, as deep learning develops, models keep getting larger and demand ever more compute.
  • Compute matters a great deal, but new models and new architectures matter too: better methods let you do more with less compute.

2.2. How to choose computing resources

  • This section mainly covers the following parts:
    • Introduction to GPUs
    • Introduction to cloud servers
    • Introduction to physical machines
    • How to choose
  • Introduction to GPUs:
    • NVIDIA hardware is the main option; the fastest accelerator is currently the TPU (only available on Google Cloud Platform), while Intel and AMD are trying to catch up.
    • GPU basics are shown in the table below; the parameters are:
      • Arch is the GPU architecture; from top to bottom is the order of evolution
        • Do not use the Kepler or Maxwell architectures: they are 2-4x slower than Pascal and Volta. Too old to buy, though generally cheaper.
        • Pascal architecture (P100/1080 Ti): if you buy, buy second-hand; mid-range.
        • Volta and Turing are what to buy now; they support mixed precision
          • The 2080 Ti is about 1.3x faster than the 1080 Ti at 32-bit, but about 2x faster at 16-bit
          • The V100 was the fastest card at the time (fall 2019)
      • Use-case is the target audience: servers (Server), enthusiasts (Enthusiast), or ordinary users (Consumer)
      • RAM is the video memory size, currently the most important parameter
      • 32bit TFlops and Tensor TFlops are compute speeds: 32-bit floating-point speed, and Tensor Core speed (Tensor Cores are designed for mixed-precision deep learning)
      • 16bit means pure 16-bit operation (not mixed precision)
        • This matters only for the P100; other cards either do not support 16-bit or have Tensor Cores (mixed precision beats pure 16-bit, so pure 16-bit is not needed)
        • If you only have a P100, use 16-bit: speed roughly doubles and effective memory grows by about 1.5x.

[Figure: GPU comparison table]

  • Cloud service provider comparison:
    • AWS is the most expensive and only offers preset instance types. Keep an eye on Spot instances, which are very cheap.
    • Google Cloud Platform offers flexible GPU attachment (not fixed instance types; you choose yourself) and supports TPUs
    • No one recommends Azure
    • There are also some emerging providers.
  • For a physical machine, you can buy a pre-built system or assemble one yourself
  • Price-comparison approaches:
    • Compute the price of a physical machine versus cloud rental, and see how many months of cloud rental the machine's price would cover (the break-even time).
    • Another framing: we want training to finish as soon as possible, so we launch many instances at once. Total compute (GPU-hours) stays the same, but because training runs in parallel, the wall-clock time is much shorter.
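The break-even idea above can be sketched in a few lines. All prices here are made-up placeholders, not real quotes:

```python
# Break-even between buying a GPU machine and renting cloud GPUs.
# All prices are illustrative placeholders, not real quotes.

def breakeven_months(machine_cost: float, cloud_cost_per_hour: float,
                     hours_per_month: float = 200.0) -> float:
    """Months of cloud rental whose total cost equals buying the machine."""
    monthly_cloud_cost = cloud_cost_per_hour * hours_per_month
    return machine_cost / monthly_cloud_cost

# Example: a $7,000 4-GPU workstation vs. a $3/hour cloud instance,
# assuming ~200 GPU-hours of use per month.
months = breakeven_months(7000, 3.0)
print(f"Break-even after {months:.1f} months")  # Break-even after 11.7 months
```

If your utilization is higher, the machine pays for itself sooner; at low utilization, cloud rental wins.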
  • The actual situation:
    • Cloud servers are expensive, but scaling up is easy.
    • Local servers are cheap, but beyond a certain number, maintenance becomes very troublesome.
  • Suggestions:
    • For solo developers and teams just starting out: develop on a 4-GPU Turing PC and train on the same machine (until the architecture is dialed in); if you need more compute, buy another machine or rent cloud servers.
    • For large companies: equip each ML scientist with a 4-GPU Turing PC or a V100 cloud instance, and use cloud instances for training and evaluation.

3. Resource Management

  • Mainly the management of server compute resources.
  • Requirement: multiple people share multiple servers, and each person needs a different environment.
  • Goal: pool the various resources and run many experiments with minimal friction.
  • Solutions:
    • Manual allocation via a spreadsheet: low-tech, but used by a lot of people.
    • Python scripts that automatically claim idle resources, combined with traditional cluster job scheduling: each job defines the work to perform, and jobs are executed in order.
    • Docker (a lightweight virtual-machine-like container) + Kubernetes (runs many containers across a cluster)
    • Kubeflow, Google's open-source project, mainly handles multi-step ML workflows and has plugins for hyperparameter tuning.
    • Polyaxon, a full-stack machine learning platform; open source, but also offers some paid features.
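The script-based approach above can be illustrated with a toy scheduler that assigns queued jobs to idle GPUs in FIFO order. This is a simplified sketch; real schedulers also handle priorities, failures, and multi-GPU jobs:

```python
from collections import deque

# Toy GPU job scheduler: queued jobs are dispatched to idle GPUs
# in FIFO order. A simplified sketch, not a production scheduler.

class Scheduler:
    def __init__(self, num_gpus):
        self.free_gpus = set(range(num_gpus))
        self.queue = deque()
        self.running = {}  # gpu_id -> job name

    def submit(self, job):
        """Queue a job and try to dispatch immediately."""
        self.queue.append(job)
        self._dispatch()

    def finish(self, gpu_id):
        """Mark the job on gpu_id done, free the GPU, dispatch next job."""
        self.running.pop(gpu_id)
        self.free_gpus.add(gpu_id)
        self._dispatch()

    def _dispatch(self):
        while self.queue and self.free_gpus:
            gpu = min(self.free_gpus)
            self.free_gpus.remove(gpu)
            self.running[gpu] = self.queue.popleft()

sched = Scheduler(num_gpus=2)
for name in ["train-resnet", "eval-bert", "sweep-lr"]:
    sched.submit(name)
# GPUs 0 and 1 are busy; "sweep-lr" waits in the queue.
sched.finish(0)  # GPU 0 frees up, so "sweep-lr" starts on it
```

Docker/Kubernetes and the platforms below replace this hand-rolled logic with containerized, cluster-wide versions of the same idea.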

4. Other

  • Deep learning framework

    • If you have no special requirements, choose either TensorFlow/Keras or PyTorch.
      • The two are converging: both support easy define-by-run development (eager mode in TF) and multi-platform optimized static computation graphs (TorchScript in PyTorch).
    • Currently, many new projects use PyTorch.
    • fast.ai may suit beginners and users who don't need to go deep.
  • Distributed training

    • There are two approaches to distributed training: data parallelism and model parallelism.
    • TF and PyTorch both ship with built-in distributed training.
    • Other solutions include Ray and Horovod.
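Data parallelism, the more common of the two approaches, can be illustrated without any framework: each worker computes the gradient on its own shard of the batch, and averaging the shard gradients reproduces the full-batch gradient (this is what all-reduce computes). A pure-Python sketch for a one-parameter linear model with squared loss:

```python
# Data parallelism in miniature: averaging per-shard gradients of the
# mean squared error equals the full-batch gradient (for equal shard
# sizes). Pure-Python sketch for the model y = w * x.

def grad(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over the given shard."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # true w = 2
w = 0.0

# Full-batch gradient, as a single worker would compute it.
full = grad(w, xs, ys)

# Same batch split across two "workers" with equal shard sizes.
shard_grads = [grad(w, xs[:2], ys[:2]), grad(w, xs[2:], ys[2:])]
averaged = sum(shard_grads) / len(shard_grads)

assert abs(full - averaged) < 1e-12  # identical, up to float error
```

Model parallelism instead splits the model itself across devices, which is harder to get right and usually only needed when the model does not fit in one GPU's memory.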
  • Experiment management

    • Status quo: even running one experiment at a time, you lose track over time, let alone when running many experiments at once.
    • TensorBoard: a good solution for recording a single experiment, but very inconvenient for managing many experiments.
    • Losswise / Comet.ml / Weights & Biases: all similar. Each is installed as a package and called during training much like TensorBoard. The difference is that TensorBoard saves data locally, while these libraries upload it to their servers, where you then view it on the corresponding website.
    • MLflow Tracking: open-source software; you can deploy the platform locally, and it is very powerful.
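The common thread of all these tools is logging parameters and metrics per run. A minimal local tracker might look like the following; the class and method names are hypothetical, not the real API of MLflow or any other tool:

```python
import json
import time

# Minimal experiment tracker: records params and per-step metrics for
# one run and saves them as JSON. A hypothetical sketch, not the real
# API of MLflow, W&B, or any other tool.

class Run:
    def __init__(self, name):
        self.record = {
            "name": name,
            "start_time": time.time(),
            "params": {},
            "metrics": [],
        }

    def log_param(self, key, value):
        self.record["params"][key] = value

    def log_metric(self, key, value, step):
        self.record["metrics"].append({"key": key, "value": value, "step": step})

    def save(self, path):
        with open(path, "w") as f:
            json.dump(self.record, f, indent=2)

run = Run("baseline")
run.log_param("lr", 3e-4)
for step in range(3):
    run.log_metric("loss", 1.0 / (step + 1), step)
run.save("run_baseline.json")
```

The hosted services add the parts that are genuinely hard to build yourself: a shared server, a comparison UI, and team access control.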
  • Hyperparameter tuning:

    • Hyperas, i.e., Keras + Hyperopt
    • SigOpt: not examined closely
    • Ray Tune: worth a follow-up look; it includes many SOTA algorithms
    • Weights & Biases also has related features
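Underneath all of these tools is the same loop: sample a configuration, evaluate it, keep the best. Random search, the usual baseline, fits in a few lines; the `objective` here is a made-up stand-in for a real train-and-evaluate run:

```python
import math
import random

# Random search over a hyperparameter space. The objective is a toy
# stand-in for a real train-and-evaluate run returning a validation score.

def objective(lr, batch_size):
    # Made-up landscape peaking near lr=1e-3, batch_size=64.
    return -((math.log10(lr) + 3) ** 2) - ((batch_size - 64) / 64) ** 2

space = {
    "lr": lambda: 10 ** random.uniform(-5, -1),       # log-uniform sample
    "batch_size": lambda: random.choice([16, 32, 64, 128]),
}

random.seed(0)
best_score, best_cfg = float("-inf"), None
for _ in range(50):
    cfg = {name: sample() for name, sample in space.items()}
    score = objective(**cfg)
    if score > best_score:
        best_score, best_cfg = score, cfg

print(best_cfg, best_score)
```

Tools like Hyperopt and Ray Tune replace the random sampling with smarter strategies (Bayesian optimization, early stopping of bad trials) and distribute the trials across machines.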
  • All-in-one: integrated end-to-end platforms

[Figure: comparison of all-in-one ML platforms]

Origin: blog.csdn.net/irving512/article/details/114650863