How to build an efficient high-performance computing large model training platform in the SAM era

Keywords: SAM; PCB; SA-1B; Prompt; CV; NLP; PLM; BERT; ZSL; task; zero-shot; data; H100; H800; A100; A800; LLaMA; Transformer; OpenAI; GQA; RMSNorm; SFT; RTX 4090; A6000; AIGC; ChatGLM; LLVM; LLMs; GLM; AGI; HPC; GPU; CPU; CPU+GPU; NVIDIA; Intel; AMD; high-performance computing; high-performance server; Blue Ocean Brain; heterogeneous computing power; large model training; general artificial intelligence; GPU server; GPU cluster; large model training GPU cluster; large language model; deep learning; machine learning; computer vision; generative AI; ML; DLC; ChatGPT; image segmentation; pre-trained language model; machine vision; AI server

Abstract: Segment Anything Model (SAM) is an AI model recently released by Meta for image segmentation tasks in computer vision. Drawing on the learning paradigm of ChatGPT, it combines pre-training with prompt-driven task specification to significantly improve the model's generalization ability. SAM is designed to simplify the image segmentation workflow, reduce reliance on specialist modeling knowledge, and lower the computing resources required for large-scale training.

In the field of computer vision, the SAM model is something of a ChatGPT for the CV field, providing powerful image segmentation capabilities. However, to use the SAM model we need to set up the SAM large-model environment. While configuring a SAM environment may present some challenges, once it is configured we can take full advantage of the power of the SAM model.

To configure the SAM environment, we need to ensure that the server has sufficient computing resources and storage space to support efficient operation of the SAM model, since SAM usually requires large amounts of compute and storage capacity for accurate image segmentation. We also need to pay attention to the impact of deploying SAM locally: running the SAM model may affect the performance and stability of the server.

The Blue Ocean Brain large model training platform provides powerful computing clusters, high-speed storage systems and high-bandwidth network connections to accelerate the model training process. It also uses an efficient distributed computing framework and parallel computing so that training can run simultaneously on multiple computing nodes, greatly shortening training time. It offers task scheduling, resource management and monitoring to improve training efficiency and manageability. In addition, a rich set of tools and libraries is available for model development, debugging and optimization, and support is provided for model deployment and inference: once training is complete, the platform can deploy the trained model to the production environment for practical use.


SAM model: ChatGPT in the CV field


1. What is the SAM model?

The SAM model is an artificial intelligence model launched by Meta. The official website describes it as able to "segment any object in any image with just one click." Building on previous image segmentation models and trained on a huge dataset, it aims to handle multiple downstream tasks and become a general-purpose model.

The core points of this model are:

1. Borrow the inspiration of ChatGPT and adopt a prompt-based learning paradigm to improve learning efficiency;

2. Create the largest image segmentation data set to date, Segment Anything 1-Billion (SA-1B), containing 11 million images and more than 1 billion masks;

3. Construct a general, automatic segmentation model that can be flexibly applied to new tasks and domains under zero-shot conditions, with results better than previous supervised approaches.


SAM model official article

2. Prompt: Applying ChatGPT's learning paradigm in the CV field

SAM takes an advanced technical route to achieve breakthroughs in foundational computer vision technology, with broad versatility and zero-shot transfer capability. It is trained with prompt-based learning, that is, prompts are used as model input. Unlike traditional supervised learning, this approach has been widely adopted since it was popularized by the GPT-3 team.

1. What did models before Prompt do?

A pre-trained language model (PLM) is an advanced natural language processing (NLP) model that plays an important role in human-computer interaction. NLP aims to improve communication and understanding between humans and computers, and PLMs are among the cutting-edge models in this field.

Common algorithms and models for natural language processing (NLP)

Pre-trained models can be divided into four generations according to learning paradigm and development stage:

1) Feature learning: text features are extracted and encoded with hand-crafted rules, as in the TF-IDF model.

2) Structural learning: deep learning is introduced and applied to NLP; the representative model is Word2Vec. What the first- and second-generation pre-trained models have in common is that their output is used as input for downstream tasks, but the model itself does not directly perform those tasks. Later models apply both the pre-training results and the model itself to downstream tasks.


Development stages and characteristics of pre-trained models (PLM)

3) Downstream fine-tuning: the model is pre-trained and then fine-tuned on downstream tasks. Representative models include BERT and GPT.

4) Prompt learning: a further improvement on BERT and GPT that uses prompt-based learning. This method processes the input through a specific template, transforming the task into a form better suited to the pre-trained language model. Representative models include ChatGPT, GPT-3.5, and SAM.

The pre-trained model is like a high school graduate, while the downstream tasks are equivalent to professional courses in college. High school graduates who study courses related to future application fields can become college students equipped with professional skills and knowledge to meet the requirements of professional positions.


Branches of prompt-based learning

2. Advantages of Prompt: Unifying pre-training and downstream tasks

As shown in the figure below (left), the traditional PLM + fine-tuning paradigm suffers from a large gap between upstream and downstream tasks and from mismatches with real applications. The pre-training stage uses auto-regressive or auto-encoding objectives, but downstream fine-tuning still requires plenty of new data to fit different formats and requirements.


Traditional pre-training + fine-tuning model and prompt paradigm

As model parameters grow ever larger, the cost for enterprises to deploy models becomes very high. At the same time, serving different downstream tasks requires fine-tuning for each task specifically, which is also a huge waste. There are two main disadvantages:

1) The number of samples required for fine-tuning is very large

2) The model is highly specific and the deployment cost is high

In response to the above shortcomings, the GPT-3 team proposed that after reading a large amount of unsupervised text, a language model can effectively solve these problems by "cultivating a broad set of skills and pattern recognition capabilities." Experiments show that in few-shot settings, the model can achieve good results without updating any parameters. The pre-training plus fine-tuning paradigm adapts the model to downstream tasks through extensive additional training. Prompting, on the other hand, unifies downstream tasks with the pre-training task via specific templates, organizing downstream task data into natural language form and giving full play to the capabilities of the pre-trained model itself.


The difference between fine-tune and prompt paradigms

Taking the sentiment classification task as an example, the traditional fine-tuning method requires preparing a fine-tuning data set containing reviews of movies or books together with the sentiment assigned by human readers. The fine-tuning data set must be large enough to cover the complexity of the task; in some cases its size may even exceed that of the pre-training data, which defeats the purpose of pre-training.

In contrast, Prompt handles sentiment classification better, making full use of the pre-trained model's capabilities and avoiding heavy fine-tuning data preparation. A prompt template appends a masked slot to the input sentence; the model predicts the word at the [MASK] position, from which the user's attitude toward the work is inferred.


Pre-training + downstream task fine-tuning (PLM + Fine-tuning) handles emotion classification tasks (writing movie reviews)
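To make this concrete, the sketch below shows how a prompt template turns sentiment classification into the masked-word prediction task a model like BERT was pre-trained on. It is only an illustration: the Hugging Face transformers library and the public bert-base-uncased checkpoint are assumptions of this example, not something prescribed by the article.

```python
# A minimal sketch of prompt-based sentiment classification with a masked
# language model. Assumes the Hugging Face `transformers` library and the
# public `bert-base-uncased` checkpoint (illustrative choices only).
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

review = "The plot was predictable and the acting felt flat."
# The prompt template recasts classification as the pre-training task
# (predicting a masked token) instead of training a new classification head.
prompt = f"{review} Overall, the movie was [MASK]."

predictions = unmasker(prompt, targets=["great", "terrible"])
for p in predictions:
    print(p["token_str"], round(p["score"], 4))
# If "terrible" scores higher than "great", the review is labeled negative.
```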

The prompt paradigm has the following advantages:

1) It greatly reduces the number of samples required for model training; training can be done with few or even zero samples.

2) Improve the versatility of the model, reduce costs and improve efficiency in practical applications

At present, large models such as GPT-4 no longer fully open their model parameters, and users can only obtain predictions through the API. The importance of prompt engineering for downstream tasks is therefore self-evident.
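As a hedged illustration of prompting a closed model through an API, the snippet below uses the OpenAI Python client (v1-style interface); the model name and the sentiment-classification prompt are example choices made for this sketch, not recommendations from this article.

```python
# Illustrative only: prompting a closed large model through its API.
# Assumes the `openai` Python package (v1 interface) and an API key in the
# OPENAI_API_KEY environment variable; the model name is an example.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "You are a sentiment classifier. Reply with 'positive' or 'negative' only."},
        {"role": "user",
         "content": "Review: The plot was predictable and the acting felt flat."},
    ],
)
print(response.choices[0].message.content)  # expected: "negative"
```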

3. ZSL: Zero-shot learning reduces costs, increases efficiency, and improves model generalization

1. What is zero-shot learning ability?

Zero-shot Learning (ZSL) is a hard problem in machine learning. Its goal is to enable a model to classify and identify "unknown objects" it has never seen before. The picture below shows the classic zebra example. A "child" has seen many animals at the zoo, such as horses, pandas, lions, and tigers, but has never seen a zebra. From the teacher's description, the "child" learns that a zebra has four legs, black-and-white stripes, and a tail, and in the end the "child" easily identifies the zebra.

Similarly, a model can use zero-shot learning to extract features from categories it has seen (such as "looks like a horse", "has stripes", "black and white") and then recognize categories it has never seen before based on a description of the unknown category's features. In other words, the model applies previously learned knowledge and features to the recognition of unknown objects.

Zero-shot learning (ZSL) example
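The toy sketch below mirrors the zebra example with attribute matching; the classes, attributes, and vectors are invented purely for illustration and do not come from any real model.

```python
# Toy zero-shot classification by attribute matching, mirroring the zebra
# example above. Classes, attributes, and vectors are made up for illustration.
import numpy as np

# Attribute order: [horse-shaped, has stripes, black-and-white, has mane]
seen_prototypes = {
    "horse": np.array([1.0, 0.0, 0.0, 1.0]),
    "panda": np.array([0.0, 0.0, 1.0, 0.0]),
    "tiger": np.array([0.0, 1.0, 0.0, 0.0]),
}
# Description of the unseen class, e.g. given by "the teacher".
unseen_classes = {"zebra": np.array([1.0, 1.0, 1.0, 1.0])}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Attributes extracted from a new image by some visual model (placeholder values).
image_attributes = np.array([0.9, 0.8, 0.9, 0.7])

candidates = {**seen_prototypes, **unseen_classes}
best = max(candidates, key=lambda c: cosine(image_attributes, candidates[c]))
print("predicted class:", best)  # -> zebra, a class never seen during training
```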

2. SAM's zero-shot learning ability is recognized

SAM has a zero-shot segmentation capability that can generate high-quality masks from various prompt inputs (including points, boxes, and text). Many academic papers discuss SAM's ZSL capability. For example, "SAM.MD: Zero-shot medical image segmentation capabilities of the Segment Anything Model" tests SAM's ZSL performance by supplying points and boxes as prompts in image segmentation tasks. The results show that expert users can achieve fast, semi-automatic segmentation in most scenarios with SAM. Although SAM did not show leading fully automatic segmentation performance in these experiments, it can serve as a potential catalyst for developing semi-automatic segmentation tools for clinicians.


Application of SAM's zero-shot learning ability in CT imaging

4. SA-1B: The largest segmentation data set to date, helping to increase model efficiency

1. Data Engine: Use the data engine to generate masks

SAM is trained on this data set, and the data is annotated using SAM's interactive image-annotation workflow. In addition, a novel data collection approach combines the strengths of the model and of human annotators to improve the efficiency and quality of data collection. The process is divided into three stages that make SAM's data engine progressively more complete and efficient.

Schematic diagram of SAM using data engine to progressively collect data

1) Manual stage: In the model-assisted manual annotation stage, the annotator uses the SAM model as an auxiliary tool, generating masks by clicking on the image, drawing a box, or entering text; the model updates the mask in real time based on the annotator's input and offers several candidate masks for the annotator to choose from and modify. This lets annotators segment objects in images quickly and accurately without drawing them by hand. Its purpose is to collect high-quality masks for training and improving the SAM model.

2) Semi-automatic stage: The SAM model already has some segmentation capability and can automatically predict objects in the image. However, because the model is imperfect, the predicted masks may contain errors or omissions. The annotator's main task is to check and correct the model's predictions to ensure the accuracy and completeness of the masks. The goal of this stage is to collect more masks to further improve the performance and generalization of the SAM model.

3) Fully automatic stage: The SAM model has reached a high level and can accurately segment all objects in an image without any manual intervention. The annotator's job therefore shifts to verifying and validating the model's output to ensure there are no errors. This stage aims to use the SAM model's automatic annotation capability to rapidly expand the scale and coverage of the data set.
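The fully automatic stage corresponds to the automatic mask generator shipped with the open-source SAM code. The sketch below shows the typical call pattern; the checkpoint file name matches the released ViT-H weights, while the image path is a placeholder.

```python
# Minimal sketch of fully automatic mask generation with the released SAM
# code. The checkpoint file is the public ViT-H weight; the image path is a
# placeholder you must supply.
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda")  # or "cpu" if no GPU is available

mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)

# Each entry is a dict with keys such as "segmentation" (a boolean mask),
# "area", "bbox" and a predicted quality score.
print(len(masks), "masks found")
```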

2. Data Set: SA-1B, built with the data engine

By progressing through "model-assisted manual annotation, semi-automatic annotation, and fully automatic mask generation", the SAM team successfully created an image segmentation data set named SA-1B. The data set is characterized by unprecedented scale, high quality, rich diversity, and privacy protection.

1) Image quantity and quality: SA-1B contains 11 million diverse, high-resolution, privacy-protected photos that were provided and licensed by a large photo company, comply with the relevant data license requirements, and are available for computer vision research use.

2) Quantity and quality of segmentation masks: SA-1B contains 1.1 billion fine segmentation masks, which are automatically generated by the data engine developed by Meta, demonstrating the engine's powerful automated annotation capabilities.

3) Image resolution and number of Masks: The average resolution of each image is 1500x2250 pixels, and each image contains about 100 masks.

4) Comparison of data set size: SA-1B is more than 400 times larger than existing segmentation data sets; compared with fully manual polygon-based mask annotation (as in the COCO data set), the SAM-assisted method is 6.5 times faster; and it is twice as fast as the largest previous data annotation efforts.


SA-1B is 400 times larger than existing segmented datasets
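For readers who download SA-1B, the masks are stored per image as COCO run-length-encoded (RLE) annotations in JSON files, which can be decoded with pycocotools (installed later in this article). The field names below follow the commonly documented SA-1B layout and should be verified against the files you actually download; the file name is a placeholder.

```python
# Hedged sketch: decoding masks from an SA-1B annotation file with
# pycocotools. Field names follow the commonly documented SA-1B layout
# (per-image JSON with COCO-RLE masks); verify them against your download.
import json
from pycocotools import mask as mask_utils

with open("sa_000000.json") as f:          # placeholder file name
    record = json.load(f)

annotations = record["annotations"]        # ~100 masks per image on average
print("masks in this image:", len(annotations))

first = annotations[0]
binary_mask = mask_utils.decode(first["segmentation"])  # HxW uint8 array
print("mask shape:", binary_mask.shape, "area:", int(binary_mask.sum()))
```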

The goal of the SA-1B dataset is to train a general model that can segment any object from open-world images. This data set not only provides a powerful training basis for the SAM model, but also provides a new research resource and benchmark for the field of image segmentation.

In addition, in the SA-1B paper the authors conduct an RAI (Responsible AI) analysis and point out that the images in this data set have stronger cross-regional representation.

The SA-1B data set has strong cross-regional representation.

5. SAM core advantages: reduce training requirements and improve segmentation performance

The core goal of SAM is general-purpose object segmentation that does not require specialist modeling knowledge, reduces training compute requirements, and labels its own masks. To achieve this goal step by step, SAM adopts the following three methods to build a general segmentation model in the image domain:

1) Data scale and quality

SAM has zero-shot transfer capability and collects a large amount of high-quality image segmentation data (11 million images and 1.1 billion masks) to build the SA-1B data set. This is currently the largest image segmentation data set, far exceeding previous data sets.

2) Model efficiency and flexibility

SAM draws on the Transformer architecture and combines the attention mechanism with convolutional neural networks to build an efficient, promptable image segmentation model. The model can process images of any size and scale, and can generate different segmentation results based on different input prompts.

SAM's promptable segmentation model is divided into three parts

3) Generalization and transfer of tasks

SAM achieves generalization and transfer for image segmentation tasks. By framing segmentation as a promptable task, it builds an image segmentation model capable of zero-shot transfer. This means that SAM can adapt to new image distributions and tasks without requiring additional training data or fine-tuning, which allows it to perform well on multiple image segmentation tasks, even surpassing some supervised models.

Currently, SAM already has the following functions:

Learning object concepts: it understands the concepts and characteristics of objects in images.

Generating masks for unseen objects: it produces accurate masks for objects it has never seen in images or videos.

High versatility: it has a wide range of applications and can adapt to different scenarios and tasks.

Multiple interaction methods: SAM lets users segment images and videos through several interaction modes, such as segment-everything mode, which automatically identifies all objects in an image, and box-selection mode, where segmentation is completed simply by drawing a box around the region of interest.

Box selection segmentation (BOX)

In the field of image segmentation, SAM is a revolutionary model. It introduces a new paradigm and way of thinking, and provides a new perspective and direction for basic model research in the field of computer vision. The emergence of SAM has changed people's perception of image segmentation and brought great progress and breakthroughs in this field.

2. Derivative models built on SAM improve performance

Since the introduction of SAM, the technology has aroused great interest and discussion in the field of artificial intelligence and has spawned a series of related models and applications, such as SEEM and MedSAM. These models are widely used in different fields such as engineering, medical imaging, remote sensing imagery, and agriculture. By drawing on SAM's concepts and methods and further improving and optimizing them, they extend SAM's range of application even further.

1) SEEM: Interaction and semantics are more generalized, and segmentation quality is improved.

SEEM is more generalizable than SAM in both interaction and semantic space

SEEM is a new interactive model based on SAM that leverages SAM's powerful zero-shot generalization ability to segment all objects in any image. The model combines SAM with a detector, using the bounding boxes output by the detector as input prompts to generate the corresponding object masks. SEEM accepts multiple input modalities from the user (such as text, images, and graffiti) and can complete segmentation and object recognition for all content in an image or video in one pass.

This research has been experimented on several public datasets, and its segmentation quality and efficiency are better than SAM. It is worth mentioning that SEEM is the first universal interface to support various user input types, including text, points, graffiti, boxes and images, providing powerful combination capabilities.

SEEM performs image recognition based on user-input dots and graffiti

SEEM has classification and recognition capabilities. It can take a reference image with a specified reference region as input, then segment other images and find objects consistent with the reference region. The model also has zero-shot video segmentation capability and can accurately segment reference objects in videos that are blurry or undergo severe deformation. With inputs such as the first frame and user-provided graffiti, SEEM performs well in applications such as road scenes and sports scenes.

SEEM segments other images based on a reference image

2) MedSAM: Improving perception for medical image segmentation

In order to evaluate the performance of SAM on medical image segmentation tasks, Shenzhen University and several other universities collaborated to create the COSMOS 553K data set, the largest medical image segmentation data set to date, and used it to conduct a comprehensive, multi-angle, large-scale and detailed assessment. This data set covers the diverse imaging modalities, complex boundaries, and wide range of object scales found in medical images, posing greater challenges. Through this evaluation, we can gain a more comprehensive understanding of SAM's performance on medical image segmentation tasks.

Detailed framework for SAM segmentation medical imaging testing

According to the evaluation results, although SAM has the potential to become a general medical image segmentation model, its performance in medical image segmentation tasks is currently not stable enough. Especially in the fully automatic Everything segmentation mode, SAM has poor adaptability to most medical image segmentation tasks, and its ability to perceive medical segmentation targets needs to be improved. Therefore, the application of SAM in the field of medical image segmentation requires further research and improvement.

Dataset COSMOS 553K and segmentation effect to test SAM's medical image segmentation performance


Therefore, in the field of medical image segmentation, research should focus on how to use a small number of medical images to effectively fine-tune the SAM model, improve its reliability, and build a Segment Anything Model suited to medical images. Toward this goal, MedSAM proposes a simple fine-tuning method to adapt SAM to general medical image segmentation tasks. Through comprehensive experiments on 21 3D segmentation tasks and 9 2D segmentation tasks, MedSAM demonstrates that its segmentation performance is better than that of the default SAM model. This research provides an effective method for medical image segmentation, enabling the SAM model to better adapt to the characteristics of medical images and achieve better segmentation results.

MedSAM Schematic Diagram
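The sketch below illustrates, in PyTorch, one plausible fine-tuning recipe in the spirit of MedSAM: freeze the prompt encoder and update the remaining weights with a Dice-style segmentation loss. The data loader and forward pass are placeholders, and this is not MedSAM's actual training code; only the vit_b checkpoint name matches the released weights.

```python
# Rough sketch of fine-tuning SAM for medical images, in the spirit of
# MedSAM: freeze the prompt encoder, update the rest with a Dice loss.
# Data loading, the forward pass and the training loop are placeholders.
import torch
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
sam.train()

# Keep the prompt encoder frozen; tune the rest (one plausible recipe).
for p in sam.prompt_encoder.parameters():
    p.requires_grad = False

trainable = [p for p in sam.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5, weight_decay=0.01)

def dice_loss(pred_logits, target, eps=1e-6):
    """Soft Dice loss on sigmoid logits; a common choice for medical masks."""
    pred = torch.sigmoid(pred_logits)
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

# for images, boxes, gt_masks in medical_loader:    # placeholder loader
#     pred_masks = forward_sam(sam, images, boxes)  # wrap SAM's encoder/decoder
#     loss = dice_loss(pred_masks, gt_masks)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```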

3) SAM-Track: Expand SAM application fields and enhance video segmentation performance 

The latest open source SAM-Track project was developed by researchers from the ReLER Laboratory of Zhejiang University to enhance the capabilities of the SAM model in the field of video segmentation. SAM-Track can segment and track any object and supports various spatiotemporal scenes, such as street view, AR, cells, animation and aerial photography. This project can achieve target segmentation and tracking on a single card, and can track more than 200 objects at the same time, providing users with powerful video editing capabilities.

Compared with traditional video segmentation technology, SAM-Track has higher accuracy and reliability. It can adaptively identify objects in different scenes and perform segmentation and tracking quickly and accurately, allowing users to easily perform video editing and post-production to achieve better visual effects. In general, SAM-Track is a meaningful research result based on SAM, which brings new possibilities to research and applications in the field of video segmentation and tracking. Its emergence brings more opportunities and challenges to video editing, post-production and other fields.

3. SAM and derivative models enable multi-scenario applications

The SAM model is an efficient and accurate image segmentation model with broad application potential in computer vision. It can empower industrial machine vision to reduce costs, increase efficiency, speed up training, and reduce dependence on data. In the AR/VR industry and in autonomous driving and security monitoring, SAM can be used to capture and segment dynamic images. Although this may involve challenges in technology, computing power, and ethics and privacy, its development potential is huge.

In addition, SAM may struggle with segmentation tasks in some specific scenes, but it can be improved through fine-tuning or the use of adapter modules. In the fields of medical imaging and remote sensing image processing, SAM can adapt to segmentation tasks through simple fine-tuning or training with a small amount of annotated data. SAM can also be used in conjunction with other models or systems, such as classifiers for object detection and recognition or generators for image editing and transformation. Such combinations can further improve the accuracy and efficiency of image segmentation and open up more application scenarios across industries.

1) Based on 3D reconstruction, empowering AR and games 

In the field of AR/VR, SAM models combine 3D reconstruction technology and image processing algorithms to provide users with a more realistic and immersive visual experience. Through the SAM model, users can convert 2D images into 3D scenes, and observe and manipulate them on AR or VR devices to simulate and restore the real world. This combination of technologies brings users a highly immersive interactive experience, allowing them to interact with objects in the virtual world and enjoy a more realistic visual experience.

In addition, the SAM model also combines deep learning algorithms to recognize and track user sight and gestures to achieve a more intelligent interaction method. For example, when the user looks at an object, the SAM model can automatically focus and provide more detailed information; when the user makes gesture operations, the SAM model can also respond quickly to adjust and change the scene.

2) Track moving objects and empower security monitoring 

In the field of image segmentation, SAM is an efficient and accurate model that can segment videos and dynamic images, and it has given rise to two derivative applications, SEEM and SAM-Track. These derived models make full use of SAM's zero-shot generalization ability to achieve accurate segmentation of target objects in blurred or severely deformed videos by using reference images and user input such as graffiti and text.

For example, in videos such as parkour, sports, and games, traditional image segmentation algorithms often cannot effectively handle complex backgrounds and fast-moving target objects. However, the SEEM model is not only able to accurately identify reference objects, but also eliminates background interference, thereby improving segmentation accuracy. In short, the SAM model and its related applications show excellent performance and accuracy in handling image segmentation problems with dynamic characteristics.

SEEM can accurately segment reference objects in parkour, sports, and game videos

In addition to applications in sports scenes, SEEM and SAM-Track can also empower fields such as security and video surveillance to accurately segment objects in videos for subsequent identification and processing. SEEM and SAM-Track can accurately judge the target object and perform precise segmentation through the input prompt information.

3) Solve the long tail problem and empower autonomous driving

Although autonomous driving technology has been successfully implemented in more than 90% of road scenarios, there are still 10% of long-tail scenario problems, mainly due to the unpredictability of road conditions and vehicle driving conditions. These long-tail scenarios include extreme situations such as emergencies, complex terrain, and severe weather, such as heavy rainfall, blizzards, and thunder and lightning, which pose a huge challenge to the identification and decision-making capabilities of autonomous driving systems. In addition, in urban traffic, the impact of factors such as non-motorized vehicles, pedestrians and buildings on the autonomous driving system also needs to be considered.

In order to solve the long tail problem, autonomous driving technology needs to integrate more algorithms and sensors, and improve the intelligence level of the system through methods such as data collection and deep learning. For example, the ability to identify and track target objects is improved by integrating data from sensors such as radar, cameras, and lidar. At the same time, deep learning algorithms can be used to simulate and predict complex scenarios. In addition, artificial intelligence technology is introduced to allow the autonomous driving system to continuously learn and optimize in long-tail scenarios to improve its adaptability and generalization capabilities.

There are many long-tail scenes in urban road scenes.

In the field of autonomous driving, image segmentation plays a key role in sensing and understanding the road environment. SAM (Segment Anything Model) can achieve precise scene perception by marking and segmenting different objects and regions in images. Traditional manual annotation methods are time-consuming and error-prone, while SAM's automated segmentation can significantly reduce costs and improve accuracy.

SAM can sense key elements such as road markings, lane lines, pedestrians, and traffic lights in real time in the autonomous driving system. By combining with other deep learning models, such as target detection and path planning models, SAM can accurately understand the surrounding environment and help autonomous driving systems make safe and efficient decisions.
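A hedged sketch of the pattern described above, feeding a detector's bounding boxes into SAM as box prompts: the torchvision Faster R-CNN detector, the confidence threshold, and the image path are arbitrary illustrative choices, not components of any production autonomous-driving stack.

```python
# Hedged sketch: feed detector boxes to SAM as box prompts, the pattern the
# text describes for perception pipelines. The detector is a generic
# torchvision Faster R-CNN, chosen only for illustration.
import cv2
import numpy as np
import torch
import torchvision
from segment_anything import SamPredictor, sam_model_registry

image = cv2.cvtColor(cv2.imread("street_scene.jpg"), cv2.COLOR_BGR2RGB)

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
with torch.no_grad():
    det = detector([torch.from_numpy(image).permute(2, 0, 1).float() / 255.0])[0]
boxes = det["boxes"][det["scores"] > 0.8].numpy()   # keep confident detections

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)                           # embed the image once

masks = []
for box in boxes:                                    # one box prompt per object
    m, _, _ = predictor.predict(box=box[None, :], multimask_output=False)
    masks.append(m[0])
print(f"{len(masks)} instance masks produced")
```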

Taking pedestrian recognition and lane line tracking as examples, SAM's segmentation results, combined with tracking and prediction modules, can help anticipate the movement of pedestrians and vehicles and reduce the risk of potential traffic accidents.

4) Improve segmentation performance and empower remote sensing images

Remote sensing images are an important tool for obtaining earth surface information through remote sensing means such as satellites and aircraft. They have the characteristics of diversity, full coverage and high precision, and play an indispensable role in the development of modern science and technology. Remote sensing images are widely used in fields such as environmental monitoring, natural resource management, urban planning, and disaster early warning.

Remote sensing data includes optical remote sensing data, spectral data, SAR radar data, UAV data and other types. Processing remote sensing data is generally divided into two stages: the first stage processes the received satellite data through the remote sensing ground processing system, including atmospheric correction, color balancing and image cropping, to obtain images that can be further identified and processed; the second stage further processes and interprets the remote sensing images on this basis, mainly to identify objects in the images.

Due to the diversity, complexity and large amount of data in remote sensing images, there are many challenges and difficulties in the processing process.

Image processing goes through three stages:

Manual interpretation stage: completely relying on annotators for image interpretation, but this method is costly and the interpretation efficiency is low;

AI+remote sensing stage: With the support of AI technology and computing power, the difficulty of image interpretation is effectively alleviated and human-machine collaboration is achieved. As the number of observation platforms and satellites such as remote sensing and mapping increases, the combination of AI and remote sensing provides more possibilities for image interpretation;

The era of large remote sensing models: With the release of large neural network models, the interpretation of remote sensing images is expected to enter the large model stage.

Development stage of remote sensing image processing

The large segmentation model SAM is an emerging technology that provides a new approach to remote sensing image processing. Based on deep learning, SAM can efficiently segment, identify and generate remote sensing imagery, significantly improving the efficiency of remote sensing image interpretation. Using the SAM model for remote sensing image segmentation, users can quickly and accurately generate high-quality maps and three-dimensional models, improving the efficiency and accuracy of environmental monitoring and resource management. In addition, the SAM model also supports the fusion of multi-source data, combining remote sensing images with other data to produce more comprehensive and accurate analysis results. Improving the efficiency of remote sensing data processing not only lays a solid foundation for remote sensing applications, but also opens up broader development space for downstream remote sensing applications.

Large models are used in remote sensing image processing

SAM still faces challenges on some difficult remote sensing segmentation tasks, for example low accuracy on shadow segmentation, cover segmentation, and camouflaged-animal localization. Remote sensing segmentation demands stronger perception and recognition capabilities; at present SAM cannot fully "segment everything", particularly in fine details, and there is room for further improvement. However, continued improvement and optimization can raise the SAM model's performance.

In addition, RSPrompter is a prompt-learning method for remote sensing image instance segmentation built on the SAM foundation model, created by a team of researchers after SAM's release. RSPrompter enables SAM to produce semantically discernible segmentation results for remote sensing images without manually crafted prompts: its goal is to generate prompts automatically so that semantic, instance-level masks can be obtained automatically. The approach is not only applicable to SAM but can also be extended to other foundation models.

The SAM model still struggles with difficult remote sensing image segmentation tasks, but its performance can be improved through further improvements and optimization, including introducing more data sets, adopting more advanced neural network architectures, and RSPrompter-style improvements.

Anchor-based prompter

The researchers conducted a series of experiments to verify the effect of RSPrompter. These experiments not only demonstrate the effectiveness of each component of RSPrompter, but also demonstrate its better performance compared to other advanced instance segmentation techniques and SAM-based methods on three public remote sensing datasets.

Large models bring drivers and challenges to the aerospace information industry

The introduction of large models brings new impetus and challenges to the field of remote sensing images. In the application of multimodal spatiotemporal remote sensing data, large models have wide applications in aerial photography based on synthetic aperture radar (SAR), optics, multispectral satellites and drones. With the help of open source large model infrastructure, customized model development is carried out for remote sensing data to achieve one-stop, full-process remote sensing large model construction capabilities. In addition, the large model supports the processing of large-scale model parameters and annotated data volumes, achieving more efficient and accurate remote sensing data processing and analysis, and providing technical support for areas such as intelligent retrieval and push of images, intelligent extraction and compilation of surface objects, and digital twin product lines.

In the future, large model training and small model deployment will be combined to achieve better application results. Traditional image processing methods are difficult to meet the requirements of remote sensing image processing, so using large models to process remote sensing images has become an important direction of current research. The empowerment of the SAM model further enhances the significance and application value of remote sensing images, brings new opportunities and challenges to research and application in this field, and also provides technical support for people to better understand and utilize earth resources.

5) Driven by computing power and applications, machine vision functions fall mainly into four categories: identification, measurement, positioning, and detection.

Identification

By identifying the characteristics of the target object, such as shape, color, character, barcode, etc., it can achieve high-speed and high-accuracy screening. 

Measurement

Convert image pixel information into commonly used measurement units to accurately calculate the geometric dimensions of the target object. Machine vision has advantages in complex morphological measurement and high accuracy. 

Positioning

Obtain the two-dimensional or three-dimensional position information of the target object.

Detection

Mainly for appearance inspection, the content covers a wide range of topics. For example, integrity inspection after product assembly, appearance defect inspection (such as scratches, unevenness).

Four major functions and difficulties of machine vision

Machine vision is called the "eye of intelligent manufacturing" and is widely used in the field of industrial automation. A typical machine vision system includes a light source, lens, camera, and vision control system (including vision processing analysis software and vision controller hardware). According to different technologies, machine vision can be divided into hardware-based imaging technology and software-based visual analysis technology. The development of machine vision is affected by four core driving forces, including imaging, algorithms, computing power and applications. Every aspect plays an important role in promoting the development of machine vision and is indispensable.

Machine vision development history

The development of machine vision technology is influenced by two core drivers.

Application-driven: With the gradual adoption of machine vision technology in traditional manufacturing industries and the rise of emerging industries, the demand for machine vision continues to increase. In the field of intelligent manufacturing, machine vision technology can help companies realize automated production and improve production efficiency and product quality. In the field of intelligent medical care, machine vision technology can assist doctors in diagnosis and treatment, improving medical standards and treatment effects.

Computing power/algorithm drive: With the increase in CPU computing power and the rapid evolution of AI algorithms, especially the application of technologies such as deep learning, machine vision technology has become more efficient and accurate in image processing and analysis. The promotion of high-performance computing equipment and the continuous advancement of algorithms provide strong support for the development of machine vision technology.

The introduction of large AI models has brought major breakthroughs to the machine vision industry. Currently, the field of machine vision uses advanced technologies, including deep learning, 3D processing and analysis, image perception fusion, and hardware-accelerated image processing. These technologies and models have greatly improved the intelligent application capabilities of machine vision, improved the complexity and accuracy of image recognition, while reducing costs and improving efficiency.

AI-based lightweight face recognition network can be used for real-time video analysis, security monitoring, etc.

AI has a wide range of applications in the field of machine vision. Deep learning networks such as CNN are used to detect and identify objects, classify images to understand scenes, improve image quality and recovery effects, achieve real-time analysis and anomaly detection, and perform 3D reconstruction and augmented reality technologies. At the same time, AI gives machine vision the ability to "understand" the images it sees, bringing unlimited innovation and development opportunities to various application scenarios.

Among them, SAM, as an important AI large model in the visual field, can promote innovation and progress in the field of machine vision. For example, SAM can be directly applied in smart cities to improve the efficiency of tasks such as traffic monitoring and face recognition. In the field of smart manufacturing, SAM can enhance visual inspection and quality control capabilities. In addition, SAM can also be combined with OVD technology to automatically generate required information and enhance semantic understanding, thus enhancing the user's interactive experience. To sum up, the application of AI in the field of machine vision and the use of SAM models have brought huge potential and opportunities to various fields.

OVD target detection basic process

SAM large model environment configuration


To deploy the "Segment Anything Model", you need to follow the following steps:

Collect and label training data: Collect image data of the objects that the model will segment and label them.

Perform data preprocessing: Before training, preprocess the images (resize them, crop irrelevant areas, or apply augmentation techniques) to improve the model's accuracy and generalization ability; a minimal preprocessing sketch appears after this list.

Build and train the model: Choose a suitable model and train it using preprocessed data (appropriate network architecture, tuning hyperparameters and optimizing the model's loss function).

Model evaluation and tuning: Evaluate the trained model to ensure its performance on segmentation tasks. Model tuning can be performed, such as adjusting thresholds, adding training data, or using techniques such as transfer learning.

Deployment and inference: Deploy the trained model to the target environment and use new image data for inference.
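The sketch below makes the preprocessing step above concrete with OpenCV; the directory layout, target size, and augmentation are placeholder choices, not SAM requirements.

```python
# Minimal preprocessing sketch: load, convert to RGB, resize, and optionally
# augment. Paths, sizes and augmentations are placeholder choices.
import glob
import cv2
import numpy as np

def preprocess(path, target_size=(1024, 1024), augment=False):
    img = cv2.imread(path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)       # SAM expects RGB input
    img = cv2.resize(img, target_size, interpolation=cv2.INTER_LINEAR)
    if augment and np.random.rand() < 0.5:
        img = cv2.flip(img, 1)                       # random horizontal flip
    return img

images = [preprocess(p, augment=True) for p in glob.glob("data/train/*.jpg")]
print(f"prepared {len(images)} training images")
```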

The following is the specific operation process:

Please ensure that the system meets the following requirements: Python version is greater than or equal to 3.8, PyTorch version is greater than or equal to 1.7, and torchvision version is greater than or equal to 0.8.

You can refer to the official tutorial to operate: https://github.com/facebookresearch/segment-anything

1. The following are several ways to install the main libraries:

1. Install using pip (Git needs to be configured):

pip install git+https://github.com/facebookresearch/segment-anything.git

2. Local installation (Git needs to be configured):

git clone git@github.com:facebookresearch/segment-anything.git

cd segment-anything

pip install -e .

3. Manual download + manual local installation:

Private-message the assistant to obtain the zip archive, unzip it, and run the following commands:

cd segment-anything-main

pip install -e .

2. Install dependent libraries:

In order to install dependent libraries, you can run the following command:

pip install opencv-python pycocotools matplotlib onnxruntime onnx

Please note that if you encounter errors when installing matplotlib, try installing a specific version, such as 3.6.2, with the following command:

pip install matplotlib==3.6.2

3. Download the weight file:

You can download one of the three weight files from the following links:

1. default or vit_h: ViT-H SAM model.

2. vit_l: ViT-L SAM model.

3. vit_b: ViT-B SAM model.

If you find that the download speed is too slow, please private message the assistant to get the weight file.

By downloading and using one of these weight files, you will be able to use the corresponding pretrained model in the "Segment Anything" model.
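Once a weight file is downloaded, a minimal point-prompted inference run looks roughly like the following; the checkpoint name matches the released ViT-H weights, while the image path and click coordinates are placeholders.

```python
# Minimal sketch of point-prompted inference with a downloaded checkpoint.
# The checkpoint name matches the released ViT-H weights; the image path and
# click coordinates are placeholders.
import cv2
import numpy as np
import torch
from segment_anything import SamPredictor, sam_model_registry

device = "cuda" if torch.cuda.is_available() else "cpu"
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to(device)

predictor = SamPredictor(sam)
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)                    # compute the image embedding once

point = np.array([[500, 375]])                # (x, y) pixel the user "clicked"
label = np.array([1])                         # 1 = foreground, 0 = background
masks, scores, _ = predictor.predict(
    point_coords=point,
    point_labels=label,
    multimask_output=True,                    # return several candidate masks
)
best = masks[np.argmax(scores)]
print("best mask covers", int(best.sum()), "pixels")
```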

How to configure the training SAM model server


In the field of computer vision, image segmentation is a key task that involves accurately separating the different objects or regions in an image. As a ChatGPT-like model for the CV field, SAM provides powerful capabilities for image segmentation tasks. However, to use the SAM model, you need to configure a server suited to the SAM environment that meets the model's requirements for computing resources and storage space.

Configuring servers suitable for your SAM environment is key to taking full advantage of the SAM model. In order to meet the SAM model's requirements for computing resources and storage space, it is necessary to ensure that the server has sufficient CPU and GPU resources, storage space and high-performance network connections.

1. Computing resource requirements

Because the SAM model relies on deep learning, it performs large-scale matrix operations and neural network computation, and therefore usually needs substantial computing resources for efficient image segmentation. When configuring the SAM environment, make sure the server has sufficient CPU and GPU resources to support the SAM model's computing requirements. Especially when processing large-scale image data sets, the server needs strong parallel computing capabilities to keep the model running efficiently.

1. GPU

1) GPU memory: the SAM model requires a large amount of memory to store model parameters and intermediate image data, so it is crucial to choose a GPU with sufficient memory capacity (a rough sizing estimate follows this list).

2) GPU computing power: The SAM model relies on deep learning algorithms and requires large-scale matrix operations and neural network training. Therefore, choosing a GPU with higher computing power can improve the running efficiency of the SAM model. For example, choose a GPU with more CUDA cores and a high clock frequency.
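As a rough illustration of the GPU memory point above, the back-of-the-envelope estimate below assumes the ViT-H SAM variant has roughly 0.64 billion parameters, in line with commonly reported figures; the activation and overhead terms are coarse placeholders rather than measurements.

```python
# Back-of-the-envelope GPU memory estimate for serving SAM (ViT-H).
# The parameter count (~0.64B) is an approximate public figure; the
# activation and overhead terms are rough placeholders, not measurements.
params = 0.64e9                      # approx. parameters in SAM ViT-H
bytes_per_param = 4                  # FP32 weights; 2 for FP16/BF16
weights_gb = params * bytes_per_param / 1024**3

activation_gb = 4.0                  # rough allowance for a 1024x1024 image
framework_overhead_gb = 1.5          # CUDA context, workspace, caching

total_gb = weights_gb + activation_gb + framework_overhead_gb
print(f"weights: {weights_gb:.1f} GB, estimated total: {total_gb:.1f} GB")
# Training (gradients + optimizer states) needs several times more memory,
# which is why 40-80 GB data-center GPUs are recommended for fine-tuning.
```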

2. CPU

Although the GPU plays an important role in the SAM model, the CPU is also a component that cannot be ignored in the server configuration. In the SAM model, the CPU is mainly responsible for data preprocessing, model loading and other non-computation-intensive tasks. Therefore, when choosing a CPU, you need to consider the following factors:

1) Number of CPU cores: Since the CPU can process multiple tasks in parallel, choosing a CPU with more cores can improve the overall performance of the SAM model.

2) CPU clock frequency: Preprocessing of SAM models and other non-computationally intensive tasks usually require higher clock frequencies. Therefore, choosing a CPU with a higher clock frequency can speed up the execution of these tasks.

3. Commonly used CPU+GPU recommendations

1)AMD EPYC 7763 + Nvidia A100 80GB

AMD 7763 is a 64-core high-end EPYC chip. The A100 80GB single card memory is up to 80GB, which can support the training of large models.

2) Dual AMD EPYC 7742 + 8 AMD Instinct MI50

The 7742 is AMD's previous-generation 64-core server CPU, so two of them provide 128 cores. The MI50 is one of AMD's higher-end GPUs with 16GB of memory, and eight cards can provide sufficient computing resources.

3) Dual Intel Xeon Platinum 8280 + 8 Nvidia V100 32GB

The 8280 is the 28-core flagship CPU of the Intel Scalable series, with dual CPUs providing 56 cores. Each V100 32GB card provides 32GB of memory.

4) AMD EPYC 7713 + 8 Nvidia RTX A6000

The RTX A6000 is based on the Ampere architecture and has 48GB of memory, which is more economical than the A100 and has large enough memory.

5) Dual Intel Xeon Gold 6300 + 8 AMD Instinct MI100

The Intel Xeon Gold 6300 series provides lower-cost multi-core Xeon CPUs, and MI100 can achieve better cost performance when used together.

6) For the CPU, the AMD EPYC 7003 series is a good choice. This is AMD's third-generation EPYC server CPU, built on TSMC's 7nm process with up to 64 Zen 3 cores, providing powerful multi-threaded performance. Specific models include the 64-core EPYC 7773X and the 64-core EPYC 7713.

For GPUs, NVIDIA's A100 Tensor Core GPU is currently the first choice for training large neural networks. It is based on the Ampere architecture, has 6912 CUDA cores and 432 third-generation Tensor Cores, and delivers 19.5 TFLOPS of FP32 performance (up to 312 TFLOPS of FP16 Tensor performance). 4 to 8 A100 cards can be configured to meet training needs.

In addition, AMD's Instinct MI100 GPU is also a good choice. It uses the CDNA architecture, has 120 compute units, and provides up to 11.5 TFLOPS of double-precision (FP64) and 184.6 TFLOPS of half-precision (FP16) performance. It is more cost-effective than the A100.

4. Storage requirements

When performing image segmentation tasks, the SAM model needs to load and store a large amount of model parameters and image data. Therefore, the server needs to have enough storage space to store the SAM model and related data. In addition, in order to improve the operating efficiency of the SAM model, we can also consider using high-speed storage devices, such as SSD (Solid State Drive), to speed up the reading and writing of data.

5. High-performance network requirements

When performing image segmentation tasks, the SAM model needs to receive and send a large amount of data through the network. Therefore, the server needs to have a high-speed and stable network connection to ensure fast data transmission and real-time response capabilities of the model. Especially when processing real-time image segmentation tasks, the server needs to have a low-latency and high-bandwidth network connection to meet real-time requirements.

Blue Ocean Brain Large Model Training Platform


The Blue Ocean Brain large model training platform provides powerful computing power support, including AI accelerators based on high-speed interconnection of open acceleration modules. It is configured with high-speed memory and supports a fully interconnected topology to meet the communication requirements of tensor parallelism in large model training. It supports high-performance I/O expansion and can scale out to a ten-thousand-GPU AI cluster to meet the communication needs of pipeline and data parallelism in large model training. It features a powerful liquid cooling system, hot-swappable components, and intelligent power management: when the BMC receives a PSU failure or error warning (such as power outage, surge, or overheating), it automatically forces the system's CPUs into ULFM (ultra-low frequency mode) to minimize power consumption. The platform is committed to providing customers with environmentally friendly, "low-carbon and energy-saving" high-performance computing solutions, and is mainly used in deep learning, academic education, biomedicine, earth exploration, meteorology and oceanography, supercomputing centers, AI, big data, and other fields.

1. Why do we need large models?

1. The model effect is better

The effect of large models in various scenes is better than that of ordinary models

2. Stronger creative ability

Large models can perform content generation (AIGC) to facilitate large-scale content production

3. Flexible customization of scenarios

By giving examples, we can customize a large number of application scenarios for large models.

4. Less labeled data

By learning a small amount of industry data, the large model can respond to the needs of specific business scenarios

2. Platform features

1. Heterogeneous computing resource scheduling

A comprehensive solution based on general-purpose servers and dedicated hardware for scheduling and managing multiple heterogeneous computing resources, including CPUs, GPUs, etc. With powerful virtualization management functions, it is possible to easily deploy underlying computing resources and efficiently run various models. At the same time, give full play to the hardware acceleration capabilities of different heterogeneous resources to speed up the running speed and generation speed of the model.

2. Stable and reliable data storage

Supports multiple storage type protocols, including block, file and object storage services. Pool storage resources to realize the free circulation of models and generated data, and improve data utilization. At the same time, data protection mechanisms such as multiple copies, multi-level fault domains, and fault self-recovery are adopted to ensure the safe and stable operation of models and data.

3. High-performance distributed network

Provides networking for compute and storage resources, forwards traffic through a distributed network mechanism, passes physical network performance through transparently, and significantly improves the efficiency and performance of model computation.

4. Comprehensive security guarantee

In terms of model hosting, a strict authority management mechanism is adopted to ensure the security of the model warehouse. In terms of data storage, measures such as privatization deployment and data disk encryption are provided to ensure the security and controllability of data. At the same time, in the process of model distribution and operation, comprehensive account authentication and log audit functions are provided to fully guarantee the security of models and data.

3. Common configurations

Currently, H100, H800, A800, A100 and other GPU graphics cards are commonly used for large model training. The following are some commonly used configurations.

1. Common configurations of H100 server

The NVIDIA H100 is equipped with fourth-generation Tensor Cores and a Transformer Engine (FP8 precision), and can deliver up to 9x faster training for mixture-of-experts (MoE) models compared with the previous generation. By combining fourth-generation NVLink, which provides 900 GB/s of GPU-to-GPU interconnect, NVLink Switch systems that accelerate per-GPU communication across nodes, PCIe 5.0, and NVIDIA Magnum IO™ software, it delivers efficient scalability for everything from small enterprise systems to large-scale unified GPU clusters.

Accelerated servers equipped with the H100 deliver the corresponding compute power, along with 3 TB/s of memory bandwidth per GPU and the scalability of NVLink and NVSwitch, to handle data analytics with high performance and scale out to support large data sets. By combining NVIDIA Quantum-2 InfiniBand, Magnum IO software, GPU-accelerated Spark 3.0 and NVIDIA RAPIDS™, the NVIDIA data center platform can accelerate these large workloads with outstanding performance and efficiency.

CPU: Intel Xeon Platinum 8468 48C 96T 3.80GHz 105MB 350W *2

Memory: DRAM 64GB DDR5 4800MHz*24

Storage: Solid state drive 3.2TB U.2 PCIe 4th generation*4

GPU: NVIDIA Vulcan PCIe H100 80GB *8

Platform: HD210 *1

Cooling: CPU+GPU liquid cooling integrated cooling system*1

Network: NVIDIA IB 400Gb/s single port adapter*8

Power supply: 2000W (2+2) redundant high-efficiency power supply*1

2. Common configurations of A800 server

The deep learning compute of the NVIDIA A800 can reach 312 teraFLOPS (TFLOPS). Its Tensor FLOPS for deep learning training and Tensor TOPS for inference are both 20 times those of NVIDIA Volta GPUs. NVIDIA NVLink delivers twice the throughput of the previous generation; combined with NVIDIA NVSwitch, it can interconnect up to 16 A800 GPUs at interconnect speeds of up to 400 GB/s (the A800's NVLink bandwidth is capped below the A100's 600 GB/s), enabling outstanding application performance on a single server. NVLink is available in two A800 form factors: SXM GPUs connected through an HGX server motherboard, and PCIe GPUs that can bridge up to 2 GPUs through an NVLink bridge.

CPU: Intel Xeon Platinum 8358P 2.6GHz 11.2 UPI 48MB 32C 240W *2

Memory: DDR4 3200 64GB *32

Data disk: 960GB 2.5" SATA 6Gb/s SSD *2

Hard drive: 3.84TB 2.5" E4x4R SSD *2

Network: dual-port 10G fiber network card (module included) *1; dual-port 25G SFP28 fiber network card without modules (MCX512A-ADAT) *1

GPU: HV HGX A800 8-GPU 80GB *1

Power supply: 3500W power module *4

Others: 25G SFP28 multi-mode optical module *2; single-port 200G HDR HCA card (model: MCX653105A-HDAT) *4; 2GB SAS 12Gb 8-port RAID card *1; 16A national-standard power cable, 1.8m *4; mounting rails *1; motherboard reserves PCIe 4.0 x16 interfaces *4; supports 2x M.2 *1; 3-year original factory warranty *1

3. Common configurations of A100 server

The NVIDIA A100 Tensor Core GPU delivers excellent acceleration at different scales for AI, data analytics and HPC workloads, effectively supporting higher-performance elastic data centers. The A100 uses the NVIDIA Ampere architecture, the engine of the NVIDIA data center platform. It delivers up to 20x better performance than the previous generation and can be partitioned into seven GPU instances to adjust dynamically to changing needs. The A100 is available in 40GB and 80GB memory versions; the A100 80GB doubles the GPU memory and provides ultra-fast memory bandwidth (more than 2 terabytes per second [TB/s]) to handle very large models and data sets.

CPU:Intel Xeon Platinum 8358P_2.60 GHz_32C 64T_230W *2

RAM: 64GB DDR4 RDIMM server memory*16

SSD1: 480GB 2.5-inch SATA solid state drive*1

SSD2: 3.84TB 2.5-inch NVMe solid state drive*2

GPU:NVIDIA TESLA A100 80G SXM *8

Network card 1: 100G dual-port network card IB Mellanx*2

Network card 2: 25G CX5 dual-port network card*1

4. Common configurations of H800 server

H800 is NVIDIA's new generation processor, based on the Hopper architecture, which has a significant improvement in efficiency for tasks such as deep recommendation systems, large-scale AI language models, genomics, and complex digital twins. Compared with the A800, the performance of the H800 has been improved by 3 times, and the memory bandwidth has also been significantly improved, reaching 3 TB/s.

Although the H800 is not NVIDIA's most powerful part, the more powerful H100 cannot be supplied to the Chinese market due to US export restrictions. Industry insiders say the main difference between the H800 and the H100 lies in the interconnect transmission rate. Compared with the previous-generation A100, the H800's transmission rate is still slightly lower, but in terms of computing power the H800 is about three times the A100.

CPU: Intel Xeon Platinum 8468 Processor, 48C 96T, 105MB Cache, 2.1GHz, 350W *2

Memory: 64GB 3200MHz RECC DDR4 DIMM *32

System hard drive: Intel D7-P5620 3.2TB NVMe PCIe 4.0 x4 3D TLC U.2 15mm 3 DWPD *4

GPU: NVIDIA Tesla H800 80GB HBM2 *8

GPU network: NVIDIA 900-9x766-003-SQO PCIe 1-Port IB 400 OSFP Gen5 *8

Storage network: Dual-port 200GbE IB *1

Network card: 25G network interface card dual port*1

5. Common configurations of A6000 server

CPU:AMD EPYC 7763 64C 2.45GHz 256MB 280W*2

Memory: 64GB DDR4-3200 ECC REG RDIMM*8

Solid state drive: 2.5" 960GB SATA read-intensive SSD*1

Data disk: 3.5" 10TB 7200RPM SATA HDD*1

GPU:NVIDIA RTX A6000 48GB*8

platform:

Rack-mounted 4U GPU server supports two AMD EPYC 7002/7003 series processors, supports up to 280W TDP, supports up to 32 memory slots, and supports 8 3.5/2.5-inch hot-swappable SAS/SATA/SSD hard disk bays (including 2 NVMe hybrid slots), optional external SAS or RAID card, supports multiple RAID modes, independent IPMI management interface, 11xPCIe 4.0 slot.

2200W (2+2) redundant titanium power supply (96% conversion efficiency), no optical drive, including rails

6. Common configurations of AMD MI210 server

CPU:AMD EPYC 7742 64C 2.25GHz 256MB 225W *2

Memory: 64GB DDR4-3200 ECC REG RDIMM*8

Solid state drive: 2.5" 960GB SATA read-intensive SSD*1

Data disk: 3.5" 10TB 7200RPM SATA HDD*1

GPU:AMD MI210 64GB 300W *8

platform:

Rack-mounted 4U GPU server supports two AMD EPYC 7002/7003 series processors, supports up to 280W TDP, supports up to 32 memory slots, and supports 8 3.5/2.5-inch hot-swappable SAS/SATA/SSD hard disk bays (including 2 NVMe hybrid slots), optional external SAS or RAID card, supports multiple RAID modes, independent IPMI management interface, 11xPCIe 4.0 slot.

2200W (2+2) redundant titanium power supply (96% conversion efficiency), no optical drive, including rails

7. Common configurations of AMD MI250 server

CPU: AMD EPYC™ 7773X 64C 2.2GHz 768MB 280W *2

Memory: 64GB DDR4-3200 ECC REG RDIMM*8

Solid state drive: 2.5" 960GB SATA read-intensive SSD*1

Data disk: 3.5" 10TB 7200RPM SATA HDD*1

GPU:AMD MI250 128GB 560W *6

platform:

Rack-mounted 4U GPU server supports two AMD EPYC 7002/7003 series processors, supports up to 280W TDP, supports up to 32 memory slots, and supports 8 3.5/2.5-inch hot-swappable SAS/SATA/SSD hard disk bays (including 2 NVMe hybrid slots), optional external SAS or RAID card, supports multiple RAID modes, independent IPMI management interface, 11xPCIe 4.0 slot.

2200W (2+2) redundant titanium power supply (96% conversion efficiency), no optical drive, including rails
