Shenzhen Technology creates an efficient development platform for scientific researchers based on Serverless containers.

Authors: Li Yangbing, Liu Shan, Mu Huan, Jiu Yu, Ding Yue

Scientific research in the cloud: the new AI for Science paradigm

In the past, scientific research required repeated verification through large numbers of experiments, complex mathematical calculations, and years of trial and error. The maturing of basic cloud computing services and the rise of artificial intelligence (AI) have brought new changes to the field.

In 2019, the Event Horizon Telescope (EHT) team released the first photo of a black hole, the result of more than 30 research institutions around the world collaborating in the cloud. Team members could call on cloud resources around the world, and the data processing cycle was compressed from weeks to days.

Columbia University conducts climate science research in the cloud, building complex Earth System Model (ESM) simulations to understand patterns and make predictions. It monitors growing volumes of environmental data from the continents, oceans and atmosphere, collected by satellites, drones and sensors, to predict natural disasters and assess the state of the planet.

The University of North Carolina at Chapel Hill and Finland's Techila collaborated to shorten the time for graphics reconstruction from one month to 18 hours. A 40 GB MATLAB mathematical experiment that originally took one to two weeks on a local personal computer takes only two or three hours after being moved to 100 nodes in the cloud.

Shenzhen Technology identified the AI for Science track very early and pioneered a new scientific research paradigm of "multi-scale modeling + machine learning + high-performance computing". Its self-developed computing platform brings breakthrough computational simulation and design tools to the fields of drugs and materials.

Shenzhen Technology's scientific computing platform, the Bohrium® scientific research cloud platform, is committed to providing researchers with an out-of-the-box computing environment and supports task submission through both the command line and a graphical interface. By providing microsimulation tools that combine speed and efficiency, Bohrium® helps R&D personnel calculate optical, electrical, magnetic and mechanical properties and study in detail the microstructure, composition and mechanisms of action of materials. High-throughput rational design of alloys, batteries, semiconductors, catalysts and other materials is becoming a reality on the Bohrium® platform.

Bohrium® efficiently pools multi-cloud and multi-supercomputer computing resources, combining the high elasticity of the cloud with the high performance of supercomputers. Through intelligent task scheduling, the platform gives users a "more, faster and more economical" computing experience.

Difficulties and technical overview of the Shenzhen Technology development platform

The Bohrium® scientific computing platform has been built on Alibaba Cloud since 2018. The task training part of its technical architecture has been upgraded to ACK Serverless containers and runs smoothly. As the industry's understanding and use of AI for Science continues to develop, Shenzhen Technology hopes to upgrade the development and debugging part as well and achieve an integrated development-to-training process.

In terms of technical indicators, Shenzhen Technology hopes that the development platform can support hundreds or even thousands of researchers starting online experiments at the same time, and that it has at least the following four capabilities:

  • Supports 2000+ users starting development machines at the same time, with power-on and power-off completing in seconds;
  • If an experimental machine fails, the experiment can be continued without starting over;
  • The platform does not require heavy support staffing, aiming for zero technical support and zero operation and maintenance costs;
  • Both business security and host security are taken into account.

Initially, the development platform's container architecture used a classic Container Service ACK cluster managing ECS cloud servers rather than the Serverless container mode. This old solution had two major problems. First, because researchers' development cycles are long, a machine is powered on and off many times along the way. To let the next startup continue the experiment, the environment had to be saved at each shutdown by packaging it into a container image, and over a long period the image kept expanding until it became too large. Second, there was a small probability of unexpected downtime. Unless the user happened to save manually before the failure, all information and data were lost, so users urgently needed automatic saving at any time.
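A rough model helps show why images committed at every shutdown keep growing. The sketch below is only an illustration under stated assumptions: every shutdown commits one new layer holding whatever was written in that session, and deleting a file only hides it in the lower layers without reclaiming space. The `Session` type and the workload numbers are invented for the example and do not come from the platform.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Session:
    """One power-on/power-off cycle of a development machine (illustrative)."""
    written_gb: float  # data written during the session
    deleted_gb: float  # data deleted during the session


def image_size_after_commits(base_gb: float, sessions: list[Session]) -> float:
    """Committed image size: layers are append-only, so deletes free nothing."""
    return base_gb + sum(s.written_gb for s in sessions)


def live_data_size(base_gb: float, sessions: list[Session]) -> float:
    """Size of the data the user can still see in the environment."""
    size = base_gb
    for s in sessions:
        size = max(0.0, size + s.written_gb - s.deleted_gb)
    return size


if __name__ == "__main__":
    # Assumed workload: each session writes 4 GB of scratch data and deletes 3 GB.
    sessions = [Session(written_gb=4.0, deleted_gb=3.0) for _ in range(10)]
    print("committed image (GB):", image_size_after_commits(10.0, sessions))
    print("live data (GB):", live_data_size(10.0, sessions))
```

Under these assumptions the committed image grows with everything ever written, while the data the user actually keeps grows much more slowly. That gap is the extra storage and pull-time cost described above.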

In summary, Shenzhen Technology's needs translate into the following technical difficulties and challenges:

  • High concurrent requests for large-scale resources

    The development machine must start and shut down quickly (within seconds) and should be usable by 2000+ researchers online at the same time. When a development machine requests cloud CPU and GPU resources, scheduling and restarts across availability zones and regions must be supported without the user noticing.

  • Saving the environment on a midway exit

    Instance resources can be released when the user shuts down, but user container data and temporary data need to be retained for backtracking. Likewise, if the container is restarted or the development machine instance is released because of a user error, the container environment and temporary data from before the restart or release must also be retained. The development machine can then be restarted to continue a previously aborted experiment.

  • Image data bloat

    The data growth of the traditional development machine solution must be solved. The traditional solution saves an image at shutdown and uses that image to restore the environment at power-on. After many power cycles, the image layers keep growing, bringing additional costs.

  • Business data linkage and automatic management

    The ECI-based development machine needs to access not only Alibaba Cloud NAS but also third-party JuiceFS storage, and data must be copied automatically between the two storage systems.

  • Strong isolation and stability in multi-tenant environment

    Strong data and resource isolation must be achieved between development machines; the failure of one development machine or the downtime of the underlying node will not affect other development machines.

  • Operation and maintenance convenience

    The development machine supports continuous iterative upgrades and supports automatic image cache production.
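The first challenge in the list, scheduling across availability zones and regions without the user noticing, can be pictured as an ordered fallback over candidate zones. The following is a minimal sketch rather than the platform's scheduler: the `Zone` type, the zone names and the GPU counts are hypothetical, and real capacity would come from the cloud provider, not from an in-memory list.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Zone:
    region: str     # hypothetical region name
    name: str       # hypothetical availability zone name
    free_gpus: int  # assumed remaining capacity


def candidate_order(zones: list[Zone], preferred: Zone) -> list[Zone]:
    """Preferred zone first, then zones in the same region, then other regions."""
    def rank(zone: Zone) -> int:
        if zone is preferred:
            return 0
        return 1 if zone.region == preferred.region else 2
    # sorted() is stable, so zones with equal rank keep their input order.
    return sorted(zones, key=rank)


def place(gpus: int, zones: list[Zone], preferred: Zone) -> Zone | None:
    """Reserve `gpus` in the first candidate zone with room; None if none fits."""
    for zone in candidate_order(zones, preferred):
        if zone.free_gpus >= gpus:
            zone.free_gpus -= gpus
            return zone
    return None


if __name__ == "__main__":
    zones = [Zone("region-a", "a-1", 1), Zone("region-a", "a-2", 0), Zone("region-b", "b-1", 8)]
    for _ in range(3):
        chosen = place(1, zones, preferred=zones[0])
        print(chosen.name if chosen else "no capacity")
```

The caller only ever asks for a machine; which zone or region ends up serving the request is decided inside `place`, which is the sense in which the fallback is invisible to the user.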

Create an efficient scientific research and development platform based on Serverless containers

After many in-depth discussions, Alibaba Cloud and Shenzhen Technology jointly finalized the following serverless container solution:

In the overall architecture, the development machines adopt a cross-region, multi-cluster K8s solution, which ensures overall reliability and disaster tolerance and allows computing resources in different regions to be scheduled on a wider scale. On the K8s control side, Alibaba Cloud Container Service Serverless edition (ACK Serverless) removes many operation and maintenance burdens: there is no need to maintain node pools, pre-cache images, or maintain the hosted K8s control plane. The underlying ECI elastic container instances start and shut down faster than traditional ECS cloud servers, and they follow the Serverless principle of using and paying on demand. For managing artifacts such as application images and AI model files, Alibaba Cloud Container Registry Enterprise Edition (ACR EE) offers a one-stop solution with global synchronization acceleration and accelerated distribution of large images and of images at large scale, and it integrates seamlessly with Container Service ACK.

It is also worth mentioning that the ACK Fluid solution provided exclusively by Alibaba Cloud can connect seamlessly to third-party storage mounts and provides secure data sharing and isolation among multiple users. ACK Fluid also automates data management, further improving the efficiency and ease of use of the system.
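To make the idea of automated data copying between two storage systems concrete, here is a small standard-library sketch. It is not how ACK Fluid works internally. It assumes both storage systems are already mounted as ordinary directories (the `nas` and `juicefs` directory names are placeholders) and performs a one-way incremental copy based on file size and modification time.

```python
import shutil
import tempfile
from pathlib import Path


def sync_tree(src: Path, dst: Path) -> list:
    """Copy files from src to dst when missing or when size/mtime differ.

    Returns the relative paths that were copied. One-way only: files deleted
    from src are NOT removed from dst.
    """
    copied = []
    for src_file in sorted(p for p in src.rglob("*") if p.is_file()):
        rel = src_file.relative_to(src)
        dst_file = dst / rel
        if dst_file.exists():
            s, d = src_file.stat(), dst_file.stat()
            if s.st_size == d.st_size and int(s.st_mtime) <= int(d.st_mtime):
                continue  # destination looks up to date
        dst_file.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src_file, dst_file)  # copy2 also preserves the mtime
        copied.append(rel)
    return copied


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Placeholder directories standing in for the two mounted storages.
        nas, juicefs = Path(tmp) / "nas", Path(tmp) / "juicefs"
        (nas / "run1").mkdir(parents=True)
        juicefs.mkdir()
        (nas / "run1" / "log.txt").write_text("step 1\n")
        print("first pass:", sync_tree(nas, juicefs))
        print("second pass:", sync_tree(nas, juicefs))
```

Running such a function on a timer would give a crude form of scheduled synchronization. A production setup would also need to handle deletions, partial writes and concurrent access, which this sketch ignores.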

Serverless container solution architecture diagram of scientific research and development platform

For the requirement that the point of failure be recoverable after a restart or release, that is, that the container environment and temporary data remain available for inspection or for continuing experiments, the solution relies on a CRD life cycle design. Throughout the life cycle of a development machine, powering on, shutting down and restarting can each be completed within 20 seconds. In addition, with ACK Fluid the time to mount data into ECI is shortened to less than 5 seconds (for both Alibaba Cloud NAS and third-party storage).
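The actual CRD is not reproduced in this article, so the following is only a guess at its general shape: a small state machine in which every exit from the running state, planned or not, passes through a save step before the instance is released. The phase and event names are hypothetical.

```python
from enum import Enum


class Phase(Enum):
    """Hypothetical life cycle phases of a development machine."""
    STOPPED = "Stopped"
    STARTING = "Starting"
    RUNNING = "Running"
    SAVING = "Saving"


# Allowed transitions, keyed by (current phase, event).
TRANSITIONS = {
    (Phase.STOPPED, "power_on"): Phase.STARTING,
    (Phase.STARTING, "ready"): Phase.RUNNING,
    (Phase.RUNNING, "power_off"): Phase.SAVING,
    (Phase.RUNNING, "crash"): Phase.SAVING,  # an unexpected exit still saves state
    (Phase.SAVING, "saved"): Phase.STOPPED,
}


def step(phase: Phase, event: str) -> Phase:
    """Apply one event, rejecting transitions the life cycle does not allow."""
    try:
        return TRANSITIONS[(phase, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in phase {phase.value}") from None


def run(events, start=Phase.STOPPED) -> Phase:
    """Replay a sequence of events and return the final phase."""
    phase = start
    for event in events:
        phase = step(phase, event)
    return phase


if __name__ == "__main__":
    # A crash mid-experiment followed by a restart ends up running again.
    print(run(["power_on", "ready", "crash", "saved", "power_on", "ready"]))
```

Because both `power_off` and `crash` lead to the same saving phase, a restart always begins from a preserved environment, which matches the requirement that an aborted experiment can be continued.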

After cooperation and communication between the two parties, the AI development platform solution based on Serverless containers achieved the following results:

  • Large-scale Pod elasticity, with 2,000+ development machines started at the same time
  • Resource utilization increased by 30%, with pay-as-you-go billing and an ample supply of available resources
  • After an exception, data is restored to its state at the time of the downtime
  • Based on ECI, machines start within seconds and support saving the environment
  • Compared with the traditional K8s form, ACK Serverless requires no maintenance of servers or image caches
  • Seamless data access to third-party storage through ACK Fluid, with secure multi-user data sharing and isolation and scheduled data synchronization between different storage systems

Looking ahead: moving forward with dreams in mind

At present, the serverless container solution for the development machine has basically met expectations. In the subsequent operation period, we need to keep watching and improving the robustness of the overall project: optimizing its upstream and downstream bottleneck dependencies (such as API call frequency and flow control) and building a complete fallback plan for exceptional situations.

Li Yangbing, technical architect of Shenzhen Technology, said: "Thanks to the professional technical strength and dedication of the Alibaba Cloud team, we targeted business pain points, jointly overcame technical difficulties, explored cutting-edge technical solutions, and used the Serverless container architecture to create an industry-leading scientific research and development platform."

In addition, we will further explore unified management and efficiency optimization of multi-region, multi-type resources under the cloud native architecture: unified management and scheduling of multiple cluster resources based on K8s, and a heterogeneous, cross-region data solution in which Fluid Dataset provides unified access, acceleration and management on top of cloud native storage.

Today, the Bohrium® scientific research cloud platform runs smoothly and has provided good support to many scientific researchers:

Research teams from Wuhan University and Southern University of Science and Technology have made important progress in the field of liquid metals, laying the foundation for atomic-scale, customizable and precise manufacturing of high-entropy alloys, a new class of materials.

The theoretical team of the Institute of Geochemistry, Chinese Academy of Sciences, and their collaborators explored a new mechanism of anisotropy in the Earth's inner core based on Bohrium®, providing a new explanation for the origin of the complex anisotropy and heterogeneous structure of the inner core.

The School of Mathematical Sciences and the School of Materials Science and Engineering at Peking University, the Beijing Institute of Scientific Intelligence, Shenzhen Technology and the CATL 21C Innovation Laboratory used the Deep Potential method to study the phase transitions and structural evolution of silicon-based anodes during lithium deintercalation, and made important progress.

Shenzhen Technology continues to work tirelessly on the new "AI for Science" research paradigm, using artificial intelligence and multi-scale simulation algorithms combined with advanced computing methods to solve important scientific problems, and building a new generation of micro-scale industrial design and simulation platforms for the most fundamental research in biomedicine, energy, materials, and information science and engineering that underpins human civilization.


Origin my.oschina.net/u/3874284/blog/10117792