How to design large model infrastructure from scratch

This article is compiled from recent learning notes. It explains how to design IT infrastructure for large models, staying at the level of frameworks and methods and avoiding the narrowness that comes from binding the discussion to specific implementation details.

I. Preliminary evaluation

First of all, before starting the design, be clear about the purpose of building a privatized large model, have a realistic estimate of the time and cost you can invest, and make sure the feasibility assessment reveals no obvious showstoppers.

Some basic common sense needs to be restated before the design begins:

  1. Building a large model requires three production factors: IT infrastructure, algorithm models, and data, plus the corresponding technical personnel.
  2. GPU compute cluster resources are not only expensive but also in short supply. The most conservative estimate is that this situation will not change before the end of 2023; personally, I think it is more realistic to expect no major change before the first half of 2024. You will need to overcome real difficulties to secure GPU cards (because of the gap in video memory capacity and in communication bandwidth and latency, the H100 is still the most suitable card for large model training, while the 4090 can only be used for inference; a rough memory estimate follows this list).
  3. The time needed for early data collection and preprocessing, and for later training and fine-tuning, is usually measured in months; the whole process is hard to finish in a few weeks. Be mentally prepared when estimating the project timeline.
  4. Although large models have demonstrated amazing capabilities, privatized large models are usually applied to vertical industries, and initial results can vary widely depending on data quality, industry characteristics, and project expectations.
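
To make the training-versus-inference distinction in item 2 concrete, here is a rough back-of-envelope memory estimate in Python. The 16 bytes per parameter for mixed-precision Adam training (fp16 weights, fp32 master weights, gradients, and two optimizer moments) is a commonly cited approximation, not an exact figure, and activations and KV cache are ignored, so real requirements are higher.

```python
# Back-of-envelope GPU memory estimate for a dense transformer.
# Assumption: mixed-precision Adam training needs roughly 16 bytes/parameter;
# fp16 inference needs roughly 2 bytes/parameter. Activations are ignored.

GiB = 1024 ** 3

def training_memory_gib(n_params: float, bytes_per_param: int = 16) -> float:
    return n_params * bytes_per_param / GiB

def inference_memory_gib(n_params: float, bytes_per_param: int = 2) -> float:
    return n_params * bytes_per_param / GiB

if __name__ == "__main__":
    n = 7e9  # a 7B-parameter model
    print(f"training : ~{training_memory_gib(n):.0f} GiB")   # ~104 GiB
    print(f"inference: ~{inference_memory_gib(n):.0f} GiB")  # ~13 GiB
    # A 24 GiB 4090 can serve fp16 inference for a 7B model, but training
    # even that size needs far more memory than one consumer card offers.
```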

Once the above basics are clear, re-evaluate whether the startup conditions are met. If you have good answers or solutions to the questions above, you can move into the design stage.

II. Overall design process

The overall process divides into three steps: model selection, scaled-up training, and adaptation and deployment.

1. Quickly find the most suitable model at the smallest scale

There are now quite a lot of open-source models available, but for your specific application some will fit and some will not. The goal of this step is to find the optimal model. How do you judge "best"? Generally speaking, the final product of large model training is "an assistive tool that can complete certain tasks"; Copilot, co-pilot, and digital assistant are all vivid descriptions. Evaluation therefore comes down mainly to the accuracy of this tool on those tasks.
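
To make "evaluate by accuracy" concrete, here is a minimal sketch that scores candidate models on a held-out task set by exact match. The `generate` method and the task format are assumptions for illustration, not part of the original article.

```python
def exact_match_accuracy(model, tasks) -> float:
    """Score a model by exact-match accuracy on held-out (prompt, answer) pairs.

    `model.generate` is a stand-in for whatever inference API is in use;
    `tasks` is a list of dicts with "prompt" and "answer" keys.
    """
    correct = sum(
        model.generate(t["prompt"]).strip() == t["answer"].strip() for t in tasks
    )
    return correct / len(tasks)

# Usage sketch: rank candidate models on the same evaluation set.
# scores = {name: exact_match_accuracy(m, tasks) for name, m in candidates.items()}
# best = max(scores, key=scores.get)
```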

The industry uses another term, "hyperparameter search", to describe this model-finding process. One of Lambda's public presentations illustrates it with a very vivid diagram.

The so-called hyperparameters are parameters set before the machine learning process starts. Typical hyperparameters include the learning rate, the batch size (batch_size), the number of iterations (epochs), the regularization coefficients that affect the model's generalization ability, and the structure of the neural network: number of layers, number of neurons, convolution kernel sizes, and so on. Taken together, a specific hyperparameter configuration largely determines a specific model. Usually we need to tune the hyperparameters and select an optimal set to improve the performance of machine learning and obtain the optimal model. In practice, hyperparameter tuning generally means setting the search ranges by hand and then letting a machine search within those ranges; this process is called hyperparameter search. Common basic methods include grid search (GridSearch), random search (RandomizedSearch), and Latin hypercube sampling. There is plenty of public material on the details, so I will not elaborate here.
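
As a minimal sketch of random search under the hand-set ranges described above, the loop below samples configurations, trains, and keeps the best one by validation score. `train_and_evaluate` is a hypothetical stand-in for a full train-validate cycle, not a real API.

```python
import random

# Hand-set search ranges, as described above.
SEARCH_SPACE = {
    "learning_rate": [1e-5, 3e-5, 1e-4, 3e-4],
    "batch_size": [8, 16, 32],
    "num_layers": [12, 24],
}

def train_and_evaluate(config: dict) -> float:
    """Hypothetical stand-in: train with `config`, return validation accuracy."""
    raise NotImplementedError

def random_search(n_trials: int = 20) -> tuple[dict, float]:
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        # Sample one configuration uniformly from each range.
        config = {k: random.choice(v) for k, v in SEARCH_SPACE.items()}
        score = train_and_evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```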

Hyperparameters are essentially parameters about parameters, and they must be configured manually. Every time you change them, the model must be retrained and then validated and evaluated; one such cycle may take anywhere from a few hours to two or three days. The hardware at this stage can be relatively small; you can even start on a gaming laptop with a GPU, though the turnaround will be longer. With, say, three GPU servers, the search period shrinks considerably. After all, compute determines time.

And with resources guaranteed, you can also run different candidate models in parallel, one per server, to get results faster.
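
A minimal sketch of that idea on one multi-GPU host: pin each candidate's training run to its own GPU with the standard CUDA_VISIBLE_DEVICES environment variable and launch the runs as parallel subprocesses. The `train.py` entry point and its `--model` flag are hypothetical.

```python
import os
import subprocess

# Hypothetical candidate models, one training run per GPU.
CANDIDATES = ["llama-style-7b", "mistral-style-7b", "qwen-style-7b"]

procs = []
for gpu_id, model_name in enumerate(CANDIDATES):
    env = os.environ.copy()
    # Restrict this run to one GPU (standard CUDA environment variable).
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    # `train.py --model ...` is a hypothetical training entry point.
    procs.append(subprocess.Popen(
        ["python", "train.py", "--model", model_name],
        env=env,
    ))

# Wait for all runs so their results can be compared afterwards.
for p in procs:
    p.wait()
```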

2. Scale up to the target scale, then train and fine-tune as needed

After finding the most suitable model, it is time to "get serious" and scale up. Whether to expand to dozens of servers or tens of thousands depends on the design expectations, the budget, and the hardware resource plan.
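
Scaling out across servers usually means some form of distributed training. As a minimal sketch, the script below uses PyTorch DistributedDataParallel with a placeholder model and batch, assuming it is started by a launcher such as torchrun that sets the rank environment variables; it is an illustration of the launch pattern, not a full training recipe.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # A launcher such as torchrun sets RANK, WORLD_SIZE and LOCAL_RANK.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; a real run would load the checkpoint chosen in step 1.
    model = torch.nn.Linear(4096, 4096).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

    for _ in range(100):
        x = torch.randn(8, 4096, device=f"cuda:{local_rank}")  # placeholder batch
        loss = model(x).pow(2).mean()                           # placeholder loss
        optimizer.zero_grad()
        loss.backward()   # DDP all-reduces gradients across ranks here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Under these assumptions, each node would start the script with something like `torchrun --nnodes=2 --nproc_per_node=8 train_ddp.py` plus rendezvous options.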


Source: blog.csdn.net/m0_61289673/article/details/133530185