NLP Miscellaneous

A little over a week after arriving in Beijing, I have just about recovered from the illness I arrived with, and I finally got LLaMA and ViT running. I am writing it down here.

I used to work on images, while large-model migration is mostly NLP territory, so much of it differs from what I know in CV. On top of that, large models demand serious compute, and working in the cloud rather than locally is still a little uncomfortable for an engineer used to his own machine, so I brushed up on NLP, large models, cloud computing, domestic frameworks, Linux, Docker, and hardware:

MindFormers is a natural language processing library launched by HUAWEI CLOUD. It provides a rich set of pre-trained models and downstream task applications, covering the full development workflow of large-model training, fine-tuning, evaluation, inference, and deployment. Based on the MindSpore Transformers suite, it provides mainstream industry Transformer pre-trained models and SOTA downstream task applications, and covers a rich set of parallelism features.

The CausalLanguageModelDataset class is used in the MindFormers library to build datasets for causal language models.

In natural language processing, the causal language model (Causal Language Model) is a common model type. It models the dependency structure of a text sequence from left to right, i.e., how a word or phrase influences the words or phrases that follow it (this setup is often used for generation, summarization, and classification tasks).

The CausalLanguageModelDataset class provides a convenient way to create and manipulate datasets for causal language models. It can automatically read data from a specified dataset directory or file, and perform preprocessing, batching, and shuffling as needed. It also supports splitting a dataset into training, validation, and test sets, so that different subsets can be used for evaluation and tuning during training. With the CausalLanguageModelDataset class, it becomes easier to build and train causal language models for better performance and results.
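
At its core, a causal-LM dataset just pairs each position's context with the next token. Here is a minimal sketch of that shift (the token ids below are made up for illustration; a real pipeline such as CausalLanguageModelDataset reads tokenized text from files and adds batching and padding on top):

    # One tokenized sentence; the ids are invented for illustration.
    tokens = [101, 7592, 2088, 2003, 2307, 102]

    input_ids = tokens[:-1]  # the model sees everything up to position t
    labels = tokens[1:]      # and must predict the token at position t + 1

    for x, y in zip(input_ids, labels):
        print(f"after seeing ...{x}, predict {y}")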

Besides causal language models, there are other common types of language models:

  • Statistical Language Model: this type of language model predicts the next word or character from a probability distribution. Such models typically represent text sequences with n-grams and compute the probabilities via maximum likelihood estimation or other methods (see the bigram sketch after this list)

  • Neural Network Language Model: this type of language model uses a neural network to learn a probability distribution over text sequences. It usually consists of an encoder and a decoder: the encoder converts the input sequence into a hidden state, and the decoder generates the output sequence from that hidden state

  • Transformer Language Model: this type of language model is a neural-network architecture based on the self-attention mechanism, and it is widely used in natural language processing tasks such as machine translation and text summarization (a single-head attention sketch also follows the list)
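
For the statistical bullet above, a bigram model with maximum likelihood estimation fits in a few lines; the toy corpus here is invented purely for illustration:

    from collections import Counter

    # Toy corpus; with real data the counts become meaningful.
    corpus = "the cat sat on the mat the cat ate".split()

    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def p_next(word, nxt):
        # Maximum likelihood estimate: count(word, nxt) / count(word)
        return bigrams[(word, nxt)] / unigrams[word]

    print(p_next("the", "cat"))  # 2/3: "the" occurs 3 times, followed by "cat" twice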
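
And for the Transformer bullet, the core of self-attention is a handful of matrix products. A minimal single-head sketch with random weights (NumPy, purely illustrative; no masking or multi-head logic):

    import numpy as np

    np.random.seed(0)
    seq_len, d = 4, 8                    # 4 tokens, 8-dim embeddings
    X = np.random.randn(seq_len, d)      # token embeddings

    Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv     # queries, keys, values

    scores = Q @ K.T / np.sqrt(d)        # pairwise similarity of queries and keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys

    out = weights @ V                    # each token becomes a mix of all values
    print(out.shape)                     # (4, 8)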

Dump data collection and Profiling data collection are both analysis tools, but their application scenarios differ:

  • Dump data collection is mainly used to diagnose problems such as program crashes and memory leaks
  • Profiling data collection is mainly used to analyze a program's performance bottlenecks, such as which functions are called most often and which lines of code take the most time
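
MindSpore ships its own dump and profiling tooling for Ascend; as a language-agnostic illustration of what profiling gives you, here is Python's standard cProfile ranking functions by cumulative time (the slow/fast functions are made up for the demo):

    import cProfile
    import pstats

    def slow(n):
        return sum(i * i for i in range(n))    # O(n) loop

    def fast(n):
        return n * (n - 1) * (2 * n - 1) // 6  # closed form of the same sum

    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(100):
        slow(10_000)
        fast(10_000)
    profiler.disable()

    # Print the five entries with the largest cumulative time.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)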

Running LLaMA in a ModelArts notebook:

    # fetch the MindFormers source (dev branch) and install it
    git clone -b dev https://gitee.com/mindspore/mindformers.git
    cd mindformers
    bash build.sh

    # copy the HCCL rank-table file, then launch distributed training
    # on devices [0,8) with the llama-7b config
    cp /user/config/nbstart_hccl.json ./
    bash run_distribute.sh /home/ma-user/work/mindformers/nbstart_hccl.json /home/ma-user/work/mindformers/configs/llama/run_llama_7b.yaml [0,8] train

    # follow the training log of rank 0
    tail -f ../output/log/rank_0/info.log

Running ViT in a ModelArts notebook:

    # fetch the MindFormers source (dev branch) and install it
    git clone -b dev https://gitee.com/mindspore/mindformers.git
    cd mindformers
    bash build.sh

    # download the dataset object, rename it, and unpack ImageNet2012
    wget https://bj-aicc.obs.cn-north-309.mtgascendic.cn/dataset
    mv dataset imageNet2012.tar
    tar -xvf imageNet2012.tar
    ls

    # check per-process resource usage
    top

    # launch distributed training on devices [0,8) with the ViT-Base/16 224x224 config
    bash run_distribute.sh /home/ma-user/work/mindformers/scripts/nbstart_hccl.json /home/ma-user/work/mindformers/configs/vit/run_vit_base_p16_224_100ep.yaml [0,8] train
