A deep learning-powered solution for cryo-electron microscopy and single-cell gene analysis in the life sciences

Life Science | Cryo-EM | Protein Structure

Deep Learning | Gene Sequencing | Convolutional Neural Network

With the rapid development of technologies such as cryo-electron microscopy, proteomics, deep learning, gene sequencing, convolutional neural networks, high-performance computing, single-cell genomics, data mining and analysis, target discovery, crystal structure prediction, and AlphaFold, the life sciences have been attracting growing attention.

The life science industry covers the scientific study of all living things, including microorganisms, animals, and plants, as well as related fields such as bioethics. Life science research contributes greatly to improving the quality of human life. Globally, the development of the life sciences has been on a fast track since the beginning of the 21st century: the Human Genome Project, deepening stem cell research, and the continued development of cloning technology have pushed the field to new heights, and R&D investment in these areas keeps rising. As a field that depends heavily on scientific and information technology, drug development and gene sequencing analysis within the life science industry face problems such as a shortage of computing resources and long development cycles.

Challenges in the life sciences

Yang Tao, director of the biological computing platform at the School of Life Sciences, Tsinghua University, identifies three main challenges that cryo-electron microscopy currently poses for scientific research: data management, research progress, and experimental risk.

1. Data management

Even with maximum compression, a cryo-EM facility generates roughly 4 TB of data per day. To get the most out of the computing equipment, it runs 365 days a year without interruption, so the total annual data volume is enormous and poses a major data management challenge. (A rough estimate of the yearly volume is sketched below.)
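As a back-of-the-envelope illustration of the scale involved (only the 4 TB/day figure comes from the text; the backup factor is an assumed example), the yearly volume works out to well over a petabyte:

```python
# Back-of-the-envelope estimate of yearly cryo-EM data volume.
# Only the 4 TB/day figure comes from the text; the replication
# factor below is an assumed example value.
TB_PER_DAY = 4               # compressed output per day
DAYS_PER_YEAR = 365          # the facility runs without interruption
REPLICATION_FACTOR = 2       # assumed: one backup copy of the raw data

raw_tb = TB_PER_DAY * DAYS_PER_YEAR
stored_tb = raw_tb * REPLICATION_FACTOR

print(f"Raw data per year:    {raw_tb} TB (~{raw_tb / 1024:.2f} PB)")
print(f"With one backup copy: {stored_tb} TB (~{stored_tb / 1024:.2f} PB)")
```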

2. Research progress

Cryo-electron microscopy is now widely recognized, and research institutions everywhere are mobilizing resources to stake out the field, so time efficiency is a real concern. Being even half a day slower than a competing group can mean losing the value of publishing first.

3. Experimental risk

Cryo-EM experiments form a very long pipeline, and every intermediate step carries risk. If a problem is not resolved in time, the output of the entire system drops sharply.

Client needs

The single-cell genome research technology center of a college (hereafter "the center") aims to establish standardized, automated engineering technology, raise the level of single-cell structure analysis, and determine three-dimensional structures at high precision from individual protein molecules up to whole cells. On this basis it seeks to reveal the functions of proteins and their complexes, prepare proteins and antibodies at scale, and build a world-class core facility for protein science research with comprehensive demonstration capabilities.

For a typical life science research project, the amount of data involved starts at hundreds of terabytes; for long-running projects spanning many fields, future data requirements may reach the petabyte level. In addition, the center must support a variety of life science research projects whose applications place different demands on a high-performance computing environment: gene sequencing requires high I/O performance and large amounts of memory, while molecular dynamics research requires not only I/O performance but also high network bandwidth and concurrency. All of this posed challenges for the center in building a high-performance platform:

1. Data volumes have grown more than tenfold, and computing power must keep up

The cryo-electron microscopy technology adopted by the research team has made revolutionary progress in the past two years. In particular, camera technology has taken a leap forward, and data collection throughput has increased by a factor of ten or even a hundred, so the source data for protein structure research is growing geometrically. This requires the center to comprehensively upgrade its data processing and computing capabilities.

2. An urgent need to simplify management and ensure service quality

As life science research projects multiply, a common problem for scientific-research HPC platforms is how to allocate resources according to the individual needs of different projects and researchers, reclaim resources promptly, manage the entire high-performance resource pool centrally and uniformly, simplify maintenance, and reduce the burden on operations staff.

3. TCO remains high

Life science research has quickly become a national strategic priority, driving rapid growth in research projects and interdisciplinary needs. The low utilization of traditional siloed computing and storage resources drives new costs up quickly, and energy consumption has become a "high wall" blocking the expansion of high-performance computing centers.

4. Network performance cannot lag behind

High-performance networks are key to keeping high-performance clusters running and carry the critical interconnect traffic. As per-node computing and storage performance keeps rising, HPC users need 10G, 40G, 100G, and InfiniBand network options to meet different high-performance computing needs.

Solution Features

Based on a converged architecture, Blue Ocean Brain helped the college's single-cell genome research technology center build a distributed high-performance platform with 250 physical computing nodes, 5,000 computing cores, 1.92 PB of total storage, and a theoretical peak of 208 TFLOPS. Centralized, unified management across 20 converged-architecture systems is realized through Lustre technology.

1. Computing density of 4.1 TFLOPS/U, a fourfold performance improvement

Configurations can be tailored to different projects. The high-density computing nodes support 14-core Intel® Xeon® E5-2600 v3 processors, packing 224 computing cores into 2U, so the computing density of a single U of rack space reaches an industry-leading 4.1 TFLOPS. Each enclosure also supports 64 DIMMs of high-density memory to meet high-performance, low-latency requirements, and InfiniBand interfaces are available for workloads that demand ultra-low latency. With this computing power, efficiency improved by a factor of 3 to 4: computing tasks that used to take 4 to 5 days can now be completed in one day. (A sketch of how such peak figures are derived follows below.)
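A minimal sketch of how such theoretical peak figures are derived: peak = cores × clock × FLOPs per cycle. The core counts come from the text; the clock frequencies and the figure of 16 double-precision FLOPs per cycle (AVX2 with FMA on Xeon E5-2600 v3) are assumptions chosen to show how the quoted 4.1 TFLOPS/U and 208 TFLOPS numbers could arise.

```python
# Theoretical peak = cores x clock (GHz) x double-precision FLOPs per cycle.
# Core counts are from the text; 16 FLOPs/cycle (AVX2 + FMA on Haswell-EP)
# and the clock frequencies are assumed example values.
def peak_tflops(cores: int, ghz: float, flops_per_cycle: int = 16) -> float:
    return cores * ghz * flops_per_cycle / 1000.0   # GFLOPS -> TFLOPS

# 16 x 14-core CPUs in one 2U enclosure, assuming a 2.3 GHz base clock.
per_2u = peak_tflops(224, 2.3)
print(f"Per 2U: {per_2u:.1f} TFLOPS (~{per_2u / 2:.1f} TFLOPS per U)")

# 5,000 cores cluster-wide, assuming a 2.6 GHz base clock.
print(f"Cluster: {peak_tflops(5000, 2.6):.0f} TFLOPS")
```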

2. Simplified monitoring and management of high-performance resource pools

Different system configurations can be customized according to project requirements, and 20 FX systems can be monitored and managed centrally through the Chassis Management Controller (CMC). Agentless lifecycle management and one-to-many remote management ensure that BIOS and firmware updates do not affect business stability and improve the efficiency of lifecycle management for the computing nodes. When servers are added, IT staff can push configuration profiles so that systems update their BIOS and firmware automatically, avoiding the tedious, error-prone re-entry of configuration parameters, simplifying operations and maintenance, and reducing management costs. (An illustrative agentless query is sketched below.)
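The text does not name the management interface beyond the CMC, so purely as an illustration of agentless, one-to-many monitoring, the sketch below polls the firmware inventory of many nodes over the standard DMTF Redfish REST API (supported by most modern BMCs); the node addresses and credentials are placeholders.

```python
# Hypothetical sketch: poll firmware inventory from many BMCs over Redfish.
# /redfish/v1/UpdateService/FirmwareInventory is a standard Redfish endpoint;
# the node addresses and credentials below are placeholders.
import requests

NODES = [f"10.0.0.{i}" for i in range(1, 21)]   # 20 BMC addresses (example)
AUTH = ("admin", "password")                    # placeholder credentials

def firmware_inventory(bmc: str) -> list:
    url = f"https://{bmc}/redfish/v1/UpdateService/FirmwareInventory"
    resp = requests.get(url, auth=AUTH, verify=False, timeout=10)
    resp.raise_for_status()
    # Each member links to an individual firmware component resource.
    return [m["@odata.id"] for m in resp.json().get("Members", [])]

for node in NODES:
    try:
        print(node, len(firmware_inventory(node)), "firmware components")
    except requests.RequestException as exc:
        print(node, "unreachable:", exc)
```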

3. TCO is reduced by about 20%

Integrated deployment featuring automation, high density, low energy consumption, and centralized, unified management can reduce the center's TCO by about 20%. Blue Ocean Brain connects servers, storage, and the 10Gb network through the system board and combines them into a converged all-in-one machine with a modular design, providing shared slots for cooling, power, networking, management, and PCIe expansion. This reduces the data center footprint and energy consumption and helps the center get good value for money.

4. A high-speed network guarantees platform I/O performance

Blue Ocean Brain provides the center with a 40G high-performance network that, while maintaining its cost advantage, delivers stable network performance and meets the platform's high-performance, low-latency requirements.

5. Breaking with conventional server cooling: liquid cooling for heat dissipation

The Blue Ocean Brain liquid-cooled server system breaks with the traditional air-cooled model and adopts a hybrid air/liquid cooling design: the main heat source in the server, the CPU, is cooled by a liquid-cooled cold plate, while the remaining heat sources are still air-cooled. This hybrid approach greatly improves the server's heat dissipation efficiency, reduces the power consumed in cooling the CPU, and enhances server reliability. In testing, the annual average PUE of a data center built on the liquid-cooled server infrastructure can be brought below 1.2. (A simple PUE calculation is sketched below.)
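PUE is total facility power divided by IT equipment power, so a value below 1.2 means that cooling and other overheads add less than 20% on top of the IT load. A minimal sketch with assumed example loads (only the 1.2 target comes from the text):

```python
# PUE = total facility power / IT equipment power.
# The 1.2 target is from the text; the kW figures are assumed examples.
def pue(it_kw: float, cooling_kw: float, other_kw: float) -> float:
    return (it_kw + cooling_kw + other_kw) / it_kw

air_cooled    = pue(it_kw=500, cooling_kw=225, other_kw=25)  # typical air cooling
hybrid_liquid = pue(it_kw=500, cooling_kw=75,  other_kw=20)  # cold-plate CPUs, air for the rest

print(f"Air-cooled example:           PUE = {air_cooled:.2f}")
print(f"Hybrid liquid-cooled example: PUE = {hybrid_liquid:.2f}")
```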

Customer benefits

1. The Blue Ocean Brain HPC and AI platform has become a high-performance, multi-functional, professional cutting-edge computing platform, particularly for AI deep learning, providing efficient computing support for biological research both inside and outside the school. It serves multiple research groups in computational biology, deep learning, and gene sequencing, covering offline processing of sequencer output, sequence search and alignment analysis, molecular dynamics simulation, computer-aided drug design and molecular docking, and computation on biological networks.

2. The platform fully supports R&D on deep learning-based molecular graph encoding and a deep learning-based traditional Chinese medicine (TCM) prescription system. Researchers use the HPC and AI platform to develop deep learning code based on three-dimensional molecular graphs and to build deep learning-based TCM diagnostic prescriptions. A multi-task molecular prediction model is built from convolutional or recurrent neural networks; cross-validation is used to tune and validate its parameters, external data is used to test and evaluate it, and key information is mined from the resulting predictive model. In parallel, a large amount of prescription-compatibility information is learned by convolutional or recurrent networks, and adjuvant components obtained through automatic semantic association analysis are generated from the principal drug to compose new prescriptions. The platform's efficient parallel computing resources greatly accelerate model training, so the final task can be completed within a useful time frame. (A minimal sketch of such a multi-task model follows.)
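As a minimal sketch of what such a multi-task model can look like (the architecture, vocabulary size, and the two task heads are illustrative assumptions, not the center's actual model), a shared 1D-CNN encoder over tokenized SMILES feeding separate prediction heads might be written as:

```python
# Minimal sketch of a multi-task molecular property predictor (assumed design):
# a shared 1D-CNN encoder over tokenized SMILES feeding one head per task.
import torch
import torch.nn as nn

class MultiTaskMolNet(nn.Module):
    def __init__(self, vocab_size: int = 64, embed_dim: int = 128, n_tasks: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.encoder = nn.Sequential(
            nn.Conv1d(embed_dim, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),            # one vector per molecule
        )
        # One linear head per prediction task (e.g. activity, toxicity).
        self.heads = nn.ModuleList([nn.Linear(256, 1) for _ in range(n_tasks)])

    def forward(self, token_ids):
        x = self.embed(token_ids).transpose(1, 2)   # (batch, embed_dim, seq_len)
        z = self.encoder(x).squeeze(-1)             # (batch, 256)
        return [head(z) for head in self.heads]

# Example: a batch of 8 tokenized SMILES strings, each padded to length 100.
model = MultiTaskMolNet()
dummy = torch.randint(1, 64, (8, 100))
activity, toxicity = model(dummy)
print(activity.shape, toxicity.shape)   # torch.Size([8, 1]) torch.Size([8, 1])
```

In practice the heads would be trained jointly with task-specific losses, with hyperparameters tuned by cross-validation as described above.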

3. The platform supports fragment-based ab initio (de novo) drug design, which plays an important role in advancing disease treatment and the understanding of biological function. Traditional drug screening is time-consuming and costly, making the overall design and discovery process inefficient. To accelerate it, researchers used the platform to progressively develop de novo molecular design methods and achieved good results. By combining Monte Carlo tree search with neural network models, they searched a huge chemical space, sampled optimal structures, completed a full ab initio drug design workflow quickly, and explored protein pocket characterization and scoring functions. (A generic sketch of the search procedure is shown below.)
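As a generic illustration of how Monte Carlo tree search can drive molecule generation (a plain UCT sketch with placeholder actions and a placeholder reward, not the researchers' actual code; a real system would grow molecular fragments and score candidates with a docking- or property-based function):

```python
# Generic UCT-style Monte Carlo tree search sketch for molecule generation.
# Actions and the reward function are placeholders for fragment additions
# and a docking/property score.
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state          # e.g. a partial SMILES string
        self.parent = parent
        self.children = {}          # action -> Node
        self.visits = 0
        self.value_sum = 0.0

def ucb_score(parent, child, c=1.4):
    if child.visits == 0:
        return float("inf")
    exploit = child.value_sum / child.visits
    explore = c * math.sqrt(math.log(parent.visits) / child.visits)
    return exploit + explore

def search(root, actions, reward_fn, n_iter=200):
    for _ in range(n_iter):
        node = root
        # 1. Selection: descend by UCB until an unexpanded node is reached.
        while node.children:
            node = max(node.children.values(), key=lambda ch: ucb_score(node, ch))
        # 2. Expansion: add one child per allowed action (e.g. fragment).
        for a in actions:
            node.children[a] = Node(node.state + a, parent=node)
        # 3. Simulation: placeholder reward for a randomly chosen child.
        child = random.choice(list(node.children.values()))
        reward = reward_fn(child.state)
        # 4. Backpropagation: update statistics up to the root.
        while child is not None:
            child.visits += 1
            child.value_sum += reward
            child = child.parent
    return max(root.children.values(), key=lambda ch: ch.visits)

# Toy usage: reward longer "molecules" slightly (stand-in for a docking score).
best = search(Node(""), actions=["C", "N", "O"], reward_fn=lambda s: len(s) / 10)
print(best.state, best.visits)
```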

4. Deep learning frameworks are used to build and train deep learning models, including training and testing a deep learning-based scoring function. For the molecules generated by the model, synthesizability, toxicity, and physicochemical properties are analyzed, and clustering is used to select suitable molecules. (A small filtering-and-clustering sketch follows.)
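A small sketch of the filtering and clustering step (RDKit is assumed to be available; the SMILES list, property thresholds, and the 0.4 distance cutoff are illustrative):

```python
# Minimal sketch: filter generated molecules by simple physicochemical rules
# and cluster the survivors by fingerprint similarity (RDKit assumed installed).
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Descriptors
from rdkit.ML.Cluster import Butina

generated_smiles = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1", "CCCCCCCCCCCCCCCC"]

# 1. Keep molecules passing rough drug-likeness thresholds (example values).
mols = []
for smi in generated_smiles:
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        continue
    if Descriptors.MolWt(mol) <= 500 and Descriptors.MolLogP(mol) <= 5:
        mols.append(mol)

# 2. Cluster by Tanimoto distance over Morgan fingerprints (Butina clustering).
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]
dists = []
for i in range(1, len(fps)):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
    dists.extend(1.0 - s for s in sims)
clusters = Butina.ClusterData(dists, len(fps), distThresh=0.4, isDistData=True)

# 3. Keep one representative per cluster for downstream synthesis/toxicity checks.
representatives = [Chem.MolToSmiles(mols[c[0]]) for c in clusters]
print(representatives)
```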
