Improving the reproducibility of scientific research: Heywhale focuses on full life cycle management for AI for Science

In March this year, the Ministry of Science and Technology, together with the National Natural Science Foundation of China, officially launched the special deployment of "AI for Science". Data-driven scientific research has long faced many difficulties. Traditional research workflows rely too heavily on the experience and physical effort of human experts. In response, AI for Science (AI4S) aims to use artificial intelligence methods, supported by scientific data and computing power, to carry out computationally intensive, rapidly iterated scientific exploration and bring new breakthroughs to research work.

However, as research paradigms continue to evolve, traditional infrastructure has gradually become unable to provide the software and hardware support that AI for Science requires. This article focuses on "one-stop, full-process management of research objects, from data and algorithms to models" and introduces the data science collaboration platform ModelWhale to research teams in various fields, in the hope of assisting research driven by artificial intelligence.

1. Scientific research expectations and current situation

Scientific research expectations: a project has a complete life cycle. Completing the project is not the end of it; subsequent reuse of its results continues the project's life.

Actual situation: research follows a "workshop model" rather than "platform-based research". Team work cannot be connected, reproducing results is inefficient, and the project life cycle is short.

"Reuse" here, meaning the reuse of previous research results that gives a project a complete life cycle, is defined very broadly. It can be the reuse of an "intermediate result" from a previous project, such as a single code fragment. It can be the reuse of a "more complete, stage-level result", such as a model or a finished image. It can also be the reuse of a streamlined, standardized "research paradigm" for a certain type of project.

This kind of reuse also happens across time and across people. In practice, however, project results are rarely summarized and managed systematically, and project team members change frequently, so previous results are easily forgotten as time goes by. Even when someone remembers a result and plans to reuse it, it is hard to find it and to reproduce it completely together with its supporting environment. Although everyone agrees that reasonable reuse saves a great deal of time in the long run, most people choose to start over in order to avoid the immediate trouble.

2. Full life cycle management of artificial intelligence-driven scientific research

ModelWhale focuses on one-stop, full-process management of research objects, from data and algorithms to models. It improves the reproducibility of scientific research at the infrastructure level and helps create a healthy ecosystem for organized research.

Producing a project from scratch

01 Multi-source data access and management

The foundation of data-driven research is the data itself, yet data-driven research deployed on traditional infrastructure relies mainly on manual effort for data management. With ModelWhale, while data security is preserved, researchers can create different types of data sources, such as data sets, database connections, object storage connections, NAS space, and annotation data. They can also give each data source an overview, labels, version management, and comments, and distribute and share it. These data access and management functions lay a solid foundation for data-driven research, so that researchers no longer have to spend time on the tedious underlying work of data management.

Data access, management, collaboration, analysis, and other operations in the NAS space

02 Ready to use without packaging

Once the data problem is solved, the project can be produced from scratch, and the first step is usually to package and build the environment. As a cloud data science collaboration platform, ModelWhale provides three cloud analysis environments: interactive Notebook, drag-and-drop Canvas, and CloudIDE. It supports several programming languages, such as Python and R, to suit the different programming needs and habits of researchers. In addition, the platform comes with a variety of general-purpose and subject-specific images, which can be selected directly when creating a new project. It is truly ready to use: open ModelWhale and you can start your research without configuring any environment.

Quickly create a new Notebook and start researching

Multiple built-in images for researchers in different fields

03 Version management supports the exploration of open-ended problems

After the environment is configured, data analysis and modeling can begin. General programming needs little explanation: select the analysis interface, computing power, and image, and start. What is worth mentioning is that data-driven research generally explores problems with uncertain outcomes. When facing a new topic, it is often unclear at the outset which methods will achieve the research goal, so many different attempts are needed. For this, ModelWhale provides lightweight version management that does not follow Git logic. It supports project version comparison and Cell-level version rollback at any time, helping researchers explore from scratch.

Version comparison and version rollback: accept a historical version with one click
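
To make the idea of cell-level comparison and rollback concrete, here is a minimal sketch in plain Python. It is not ModelWhale's implementation or API: it assumes a notebook version can be represented as a list of cell source strings, and the helper names `changed_cells` and `revert_cell` are hypothetical.

```python
import difflib


def changed_cells(old_cells, new_cells):
    """Return (cell index, unified diff) for every cell whose source differs.

    Assumption: each notebook version is a plain list of cell source strings.
    """
    changes = []
    for i in range(max(len(old_cells), len(new_cells))):
        old = old_cells[i] if i < len(old_cells) else ""
        new = new_cells[i] if i < len(new_cells) else ""
        if old != new:
            diff = "\n".join(
                difflib.unified_diff(
                    old.splitlines(),
                    new.splitlines(),
                    fromfile=f"cell {i} (old)",
                    tofile=f"cell {i} (new)",
                    lineterm="",
                )
            )
            changes.append((i, diff))
    return changes


def revert_cell(current_cells, historical_cells, index):
    """Return a copy of the current version with one cell restored from history."""
    restored = list(current_cells)
    restored[index] = historical_cells[index]
    return restored


# Two hypothetical versions of the same notebook; only the second cell changed.
v1 = ["import pandas as pd", "model = fit(data, lr=0.1)"]
v2 = ["import pandas as pd", "model = fit(data, lr=0.01)"]

changes = changed_cells(v1, v2)      # compare the two versions cell by cell
rolled_back = revert_cell(v2, v1, 1)  # restore only the second cell
```

Comparing cell by cell, rather than the whole file, is what lets a single failed attempt be undone without discarding the rest of the notebook.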

04 Model offline training: free up energy and resources

For the large, complex computing tasks that are common in data-driven research, such as deep learning, ModelWhale offers two kinds of support. First, as mentioned above, a NAS directory can be mounted into the analysis environment as a data source, which makes it possible to analyze very large data. Second, ModelWhale supports offline model training: a training task keeps running after the researcher's computer is turned off, freeing up time and energy. It also provides visual comparison of training results to help with efficient model tuning and selection. In short, ModelWhale relieves researchers of complex underlying work at many levels of detail.

Create a new model offline training task

05 Multi-person collaboration and team collaboration

Scientific research is rarely the work of one person. For complex projects, it is common for several people in a group to share the work. ModelWhale is not only a data science platform but also a cloud platform for collaborative innovation, so it is designed to help multiple people work on research together. Put simply, ModelWhale can be thought of as a code-oriented counterpart of mainstream cloud document software, allowing multiple people to edit the same project online. To avoid bugs caused by conflicting code changes, a version must be generated in order to synchronize progress with others. ModelWhale also has a project management tool for task planning: the person in charge can create project tasks, split them into sub-tasks, assign them, and work with the team to complete complex research. Finally, collaboration is not limited to a single project team; cross-industry and cross-field collaboration matters as well. With the Canvas function, domain scholars with less coding experience can work alongside data scientists. The domain scholars lay out the research idea using functional modules, and the data scientists turn it into working code, so that each complements the other and the team achieves more with less effort.

Project management tool: task planning interface

Use Canvas to quickly build an analysis process

Reusing previous research

01 Reuse custom images: no need to reinvent the wheel

ModelWhale has a variety of general-purpose and subject-specific images built in, which can be selected directly when creating a new project. What if these images do not meet the current research needs? In that case, researchers can create a custom image to match. This does not mean that every researcher in the project team has to repeat this step before starting work. Once a custom image that meets the research needs has been created, it can be distributed to any member of the organization for reuse, so nobody has to reinvent the wheel. Apart from the first person, who creates the image, every other researcher in the team can reuse the previously built research environment out of the box.

Customize scientific research images and synchronize them to other researchers in the project team with one click
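
As a rough illustration of why a shared, frozen environment helps reproduction, the sketch below records the exact package versions installed in the current Python environment and checks a recorded list against it. This is only an analogy for what an image pins down, not how ModelWhale builds or distributes images; the function names and the `hypothetical-package` entry are made up for the example.

```python
from importlib import metadata


def snapshot_environment():
    """Return sorted 'name==version' pins for every installed distribution."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:  # skip distributions with incomplete metadata
            pins.add(f"{name}=={version}")
    return sorted(pins, key=str.lower)


def missing_or_mismatched(recorded_pins):
    """Return the recorded pins that the current environment does not satisfy."""
    current = set(snapshot_environment())
    return [pin for pin in recorded_pins if pin not in current]


# Record the environment once, then check it later (or on another machine).
pins = snapshot_environment()

# A deliberately fake pin, to show what drift detection reports.
drift = missing_or_mismatched(pins + ["hypothetical-package==0.0.0"])
```

A full image also fixes the operating system, system libraries, and interpreter version, which a package list alone cannot capture.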

02 Notebook code library: easy reuse of code snippets

ModelWhale Notebook has a code snippet library in its sidebar. Researchers can save code snippets from earlier research that are likely to be reused. When starting a new round of research, they can find those snippets under "My Collection" in the code library. The library also contains some official code. Whether a snippet comes from the "Public Library" or from "My Collection", it can be inserted directly in the new project's interface. Finally, snippets in the code library support permission management and distribution within the organization, so a snippet collected by researcher A can easily be reused in researcher B's project.

Collection and reuse of code snippets

03 Canvas components: create projects from standardized visual Flow templates

Simply put, ModelWhale Canvas is based on visual, model-driven development: an application model is built by dragging and dropping components. That description is broad and abstract, so how does Canvas apply to the reuse of project results in practice? Suppose researchers are carrying out a set of project steps that is tedious but highly procedural, requires no innovation, and will be repeated often in the future. They can pre-build that set of steps from Canvas components and encapsulate it as a commonly used workflow (Flow). When the same set of steps comes up in another project, they can create the project directly from the Canvas template, confirm the component process, and then convert it into a Notebook. At that point the overall framework already exists, and the tedious, procedural steps can be completed with minor code adjustments, which is very convenient.

Create a Canvas project from a template and save it as a Notebook with one click
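
The template-then-fine-tune pattern described above can be sketched in a few lines of Python. This is an illustrative model only, not Canvas's internal format: `Component`, `flow_to_cells`, the step names, and the file names are all hypothetical, and the generated cells are plain strings rather than a real Notebook file.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """One reusable step of a workflow: a name plus a parameterized code template."""
    name: str
    code_template: str
    params: dict = field(default_factory=dict)

    def render(self):
        """Fill the template with this component's parameters."""
        return f"# {self.name}\n" + self.code_template.format(**self.params)


def flow_to_cells(flow, overrides=None):
    """Render a workflow into notebook-style code cells.

    `overrides` maps a component name to parameters that replace the
    template defaults for one particular project.
    """
    overrides = overrides or {}
    cells = []
    for component in flow:
        params = {**component.params, **overrides.get(component.name, {})}
        cells.append(
            Component(component.name, component.code_template, params).render()
        )
    return cells


# A saved template for a routine "load then clean" procedure.
template = [
    Component("load", "df = pd.read_csv({path!r})", {"path": "data.csv"}),
    Component("clean", "df = df.dropna(subset={columns!r})", {"columns": ["value"]}),
]

# Reuse the template in a new project, changing only the input file.
cells = flow_to_cells(template, {"load": {"path": "new_project.csv"}})
```

The template carries the fixed procedure, while per-project overrides carry the small adjustments, which mirrors the "framework first, fine-tune in code" workflow.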

04 Algorithm library: organize, share, and reuse algorithm models

With the algorithm library, researchers can manage the algorithm models produced in earlier research, add text descriptions to them, and organize and share them. When such a result is reused, it can be turned directly into a project or a model service, which removes a great deal of redundant coding and model training and saves time. In addition, some commonly used algorithms are already included in the ModelWhale algorithm library, ready for researchers to call during general data analysis work.

The algorithm library provides managed accumulation and one-click reproduction of algorithm models.

3. Conclusion

Driven by technological change and top-level policy, the scientific research community is paying more and more attention to artificial intelligence. ModelWhale Scientific Research Edition, a data science collaboration platform, focuses on collaborative innovation in data-driven research. It is digital infrastructure with the mission of advancing the AI for Science research paradigm and strengthening organized research. It provides one-stop, full-process management of research objects, from data and algorithms to models, improving the reproducibility of scientific research at the infrastructure level and helping to create a healthy, collaborative research ecosystem. Based on the FAIR principles and open science concepts, it provides a secure and complete public sharing portal and an online interactive workbench for data and other research materials. Its computing power scheduling, with heterogeneous integration, centralized management and control, on-demand allocation, and agile response, makes it possible to call large language models (LLMs) from a personal computer and maximizes the use of computing resources within the organization. It also introduces the ModelOps concept to assist the full life cycle management of large models.

ModelWhale Scientific Research Edition covers earth sciences, biomedicine, humanities and social sciences, and other professional fields, and has been put into practice at national research institutions such as the National Meteorological Information Center and the China Natural Resources Aviation Geophysical Exploration and Remote Sensing Center. We hope to serve and support every pioneer engaged in innovative data research, and their teams. If you have related needs, you are welcome to register and try the product on the ModelWhale official website, or to contact a product consultant.


Origin blog.csdn.net/ModelWhale/article/details/133039957