Training the 70-billion-parameter Llama 2 accelerated by 195%! Data has become a key factor in improving model effectiveness

Llama 2 is the latest generation of open-source large models officially released by Meta AI, pre-trained on 2 trillion tokens. The fine-tuned Chat model was trained on more than 1 million human-annotated examples. Llama 2 outperforms other open-source language models on many external benchmarks, including tests of reasoning, coding, proficiency, and knowledge.

Llama 2 opens a new chapter in the worldwide sharing of large AI models. The release includes model weights and starter code for pre-training and fine-tuning Llama language models ranging from 7 billion to 70 billion parameters. Compared with the previous generation, Llama 2 uses more training data and doubles the context length from 2048 to 4096 tokens. In human evaluations covering both single-turn and multi-turn conversations at the 4K context length, Llama 2 also outperforms current mainstream models.

Llama 2 is very similar to the first-generation model in terms of pre-training settings and model architecture.

The Llama-series models all use an autoregressive, decoder-only Transformer architecture, and the two generations of models remain consistent in the following respects (a minimal code sketch of these components follows the list):

Pre-normalization: the input to each Transformer sub-layer is normalized with RMSNorm, which makes training more stable and efficient.

SwiGLU activation function: the feed-forward network (FFN) uses the SwiGLU activation in place of the standard Transformer's ReLU, improving model performance.

Rotary Positional Embeddings (RoPE): RoPE lets the model use relative and absolute position information at the same time, improving generalization and helping it better understand and process sequence information.
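
Below is a minimal sketch of these three components, written in PyTorch. It is not Meta's implementation; the dimensions, the hidden size of the FFN, and the "rotate-half" style of RoPE are illustrative assumptions chosen only to show how the pieces fit together in a pre-norm decoder block.

```python
# Minimal sketch (not Meta's code) of RMSNorm pre-normalization,
# a SwiGLU feed-forward block, and rotary positional embeddings (RoPE).
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root-mean-square normalization applied to each sub-layer input."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * x * rms


class SwiGLUFFN(nn.Module):
    """Feed-forward network using the SwiGLU activation instead of ReLU."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(x W_gate) elementwise-multiplied by (x W_up)
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary positional embeddings to a (batch, seq, heads, head_dim) tensor."""
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos = angles.cos()[None, :, None, :]            # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by a position-dependent angle
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


if __name__ == "__main__":
    x = torch.randn(2, 16, 512)                      # (batch, seq, dim)
    h = RMSNorm(512)(x)                              # pre-normalize the sub-layer input
    q = rope(h.view(2, 16, 8, 64))                   # rotary embeddings on the query heads
    out = x + SwiGLUFFN(512, 1376)(RMSNorm(512)(x))  # pre-norm residual FFN
    print(out.shape, q.shape)
```

The key design point shared by both Llama generations is visible here: normalization happens on the input of each sub-layer (pre-norm) rather than on its output, and position information is injected into the attention inputs via rotation rather than added embeddings.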

Data is key to improving model performance. Compared with the previous-generation Llama 1, Llama 2 not only increases the amount of training data by 40%, but also draws on significantly richer and more diverse sources.

Data quality has a significant impact on Llama 2. Low-quality open-source conversation data leads to poor model performance, whereas higher-quality dialogue data improves it markedly. For this reason, Meta strictly screened its data and selected high-quality dialogue data when training Llama 2.
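
The sketch below illustrates the kind of heuristic screening this describes: keeping dialogue samples that pass simple quality checks before fine-tuning. The field names and thresholds are hypothetical assumptions for illustration, not Meta's actual filtering pipeline.

```python
# Hypothetical dialogue-data screening heuristic (illustrative only).
def keep_dialogue_sample(sample: dict,
                         min_response_chars: int = 30,
                         max_repeat_ratio: float = 0.3) -> bool:
    prompt = sample.get("prompt", "").strip()
    response = sample.get("response", "").strip()

    # Drop samples with an empty prompt or a very short response
    if not prompt or len(response) < min_response_chars:
        return False

    # Drop responses dominated by a single repeated token (a common noise pattern)
    tokens = response.split()
    if tokens and max(tokens.count(t) for t in set(tokens)) / len(tokens) > max_repeat_ratio:
        return False

    return True


raw_data = [
    {"prompt": "Explain RMSNorm.",
     "response": "RMSNorm rescales activations by their root mean square before each sub-layer."},
    {"prompt": "Hi", "response": "ok ok ok ok ok ok"},
]
clean_data = [s for s in raw_data if keep_dialogue_sample(s)]
print(len(clean_data))  # 1
```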

In addition, different data sources can significantly affect fine-tuning results, which further underscores the importance of data quality. To verify this, Meta examined a set of 180 samples and compared model outputs that had been reviewed by human annotators with responses written entirely by humans. The human-reviewed model outputs proved competitive with the human-written data, showing that high-quality data is crucial for training conversational models. Meta therefore invested heavily in collecting high-quality human feedback data when training Llama 2.

Increasing the amount of data, improving its quality, broadening its diversity, and refining its annotation can all significantly improve model performance, helping the model reach its best results and supporting more intelligent, efficient, and accurate AI applications.

Only high-quality data lets the model learn correct language rules and grammar, reducing the risk of bias and misunderstanding. Data from multiple sources and backgrounds increases the model's ability to generalize, allowing it to adapt to different scenarios and language styles. Correct data annotation is also essential: it helps the model understand the meaning and intent of its inputs and therefore generate better outputs.
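
As a purely illustrative example of the role annotation plays, a single annotated training record might look like the following. The schema (instruction, chosen/rejected responses, intent and quality labels) is an assumption made for this sketch, not a documented Llama 2 data format.

```python
# Hypothetical annotated record for dialogue fine-tuning / preference data.
annotated_record = {
    "instruction": "Summarize the main idea of the paragraph about RoPE.",
    "chosen": "RoPE encodes positions by rotating query and key vectors, so the model "
              "captures both relative and absolute position information.",
    "rejected": "RoPE is a kind of activation function used in the FFN.",
    "labels": {
        "intent": "summarization",   # tells the model what the goal of the input is
        "language": "en",
        "quality": "high",           # a human reviewer's judgment
    },
}

# A consistent schema lets annotation, quality inspection, and training
# consume the same records without ad-hoc parsing.
print(annotated_record["labels"]["intent"])
```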

Jinglianwen Technology has extensive experience in text data collection and labeling projects and can provide text-related data collection and annotation services for large AI models. Its in-house data management platform supports natural language processing annotation tasks such as text cleaning, OCR transcription, sentiment analysis, part-of-speech tagging, sentence writing, intent matching, text judgment, text matching, text information extraction, NLU sentence generalization, and machine translation. By closing the data loop, the stages of data distribution, cleaning, annotation, and quality inspection can proceed in an orderly manner, delivering high-quality training data, improving the efficiency of enterprise AI data training, and accelerating the iteration cycle of AI applications.

Jinglianwen Technology|Data Collection|Data Annotation

Promote artificial intelligence technology and empower the intelligent transformation and upgrading of traditional industries

The copyright of the article's graphics and text belongs to Jinglianwen Technology. For commercial reprinting, please contact Jinglianwen Technology for authorization. For non-commercial reprinting, please indicate the source.
