background
LLaMA addresses the problem of training capable large language models with limited computing resources: a series of LLaMA models was trained that matches the performance of much larger industry models while using a relatively small number of parameters.
The main contribution is improving the training speed and efficiency of LLMs, achieving strong performance from models with a comparatively small parameter count.
At the same time, because the models are smaller and structurally simpler, inference speed is greatly improved.
data
The pre-training data is a mixture of publicly available datasets, so the data is relatively transparent.
model structure
The backbone is still the classic Transformer architecture, but with several optimizations. For example, instead of normalizing the output of each sub-layer, normalization is applied to the input of each sub-layer (pre-normalization), and the activation function is replaced, among other changes.
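As a rough illustration, the sketch below (PyTorch; module names are my own, not the reference implementation) shows a pre-normalization Transformer block in which an RMSNorm is applied to the input of each sub-layer and the feed-forward activation is swapped for a SwiGLU gate, two of the changes LLaMA makes relative to the original Transformer layout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: no mean subtraction, no bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        normed = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return normed * self.weight

class SwiGLUFeedForward(nn.Module):
    """Feed-forward sub-layer with a SwiGLU gate instead of ReLU/GELU."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class PreNormBlock(nn.Module):
    """Transformer block that normalizes the *input* of each sub-layer
    (pre-normalization) rather than its output."""
    def __init__(self, dim, n_heads, hidden_dim):
        super().__init__()
        self.attn_norm = RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn_norm = RMSNorm(dim)
        self.ffn = SwiGLUFeedForward(dim, hidden_dim)

    def forward(self, x, attn_mask=None):
        # Norm is applied before attention, the residual is added after.
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + attn_out
        # Same pre-norm pattern for the feed-forward sub-layer.
        x = x + self.ffn(self.ffn_norm(x))
        return x
```

Normalizing the input of each sub-layer rather than its output tends to make training of deep stacks more stable, which is the motivation for this ordering.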
optimizer
Training Acceleration Optimization
Using the idea from "Self-attention Does Not Need O(n²) Memory", the memory cost of self-attention is reduced from O(n²) to O(log n), which greatly lowers the model's memory footprint and effectively improves its ability to process long sequences.
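Below is a minimal sketch of the chunked-accumulation idea behind that paper (PyTorch; the function name, shapes, and chunk_size are illustrative assumptions, not the paper's code): keys and values are processed chunk by chunk while a running maximum and running softmax denominator are maintained per query, so the full n×n score matrix is never materialized.

```python
import torch

def chunked_attention(q, k, v, chunk_size=1024):
    """Memory-efficient attention sketch.
    q: (n, d) queries; k, v: (m, d) keys and values.
    Only one (n, chunk_size) block of scores exists in memory at a time."""
    n, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)                    # running weighted sum of values
    row_max = torch.full((n, 1), float("-inf"))  # running max score per query
    row_sum = torch.zeros(n, 1)                  # running softmax denominator

    for start in range(0, k.shape[0], chunk_size):
        k_chunk = k[start:start + chunk_size]
        v_chunk = v[start:start + chunk_size]
        scores = (q @ k_chunk.T) * scale         # (n, chunk) block of scores

        chunk_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, chunk_max)
        # Rescale previous accumulators to the new max for numerical stability.
        correction = torch.exp(row_max - new_max)
        p = torch.exp(scores - new_max)

        out = out * correction + p @ v_chunk
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        row_max = new_max

    return out / row_sum
```

The trade-off is a small amount of extra rescaling work per chunk in exchange for never storing the quadratic attention matrix, which is what makes long sequences practical.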