【Paper Interpretation】Scientific Discovery in the Artificial Intelligence Era

1. Brief introduction

Artificial intelligence (AI) is increasingly being integrated into scientific discovery to augment and accelerate research, helping scientists generate hypotheses, design experiments, collect and interpret large datasets, and gain insights that might not be attainable using traditional scientific methods alone. Here, the paper examines breakthroughs from the past decade, including self-supervised learning, which allows models to be trained on large amounts of unlabeled data, and geometric deep learning, which leverages knowledge about the structure of scientific data to improve model accuracy and efficiency. Generative AI methods can create designs such as small-molecule drugs and proteins by analyzing diverse data modalities, including images and sequences. The paper discusses how these methods help scientists throughout the scientific process and the central questions that remain despite these advances. Both developers and users of AI tools need a better understanding of when such methods require improvement, and challenges posed by poor data quality and stewardship remain. These issues span scientific disciplines and call for the development of foundational algorithmic approaches that can contribute to scientific understanding, or acquire it autonomously, making them a key focus area for AI innovation.

2. Research background

How data are collected, transformed, and understood lays the foundation for developing scientific insights and theories. The rise of deep learning in the early 2010s greatly expanded the scope and ambition of these scientific discovery processes. Artificial intelligence (AI) is increasingly used across scientific disciplines to integrate large datasets, refine measurements, guide experiments, explore the space of theories compatible with the data, and provide actionable, reliable models integrated with scientific workflows.

Data collection and analysis are fundamental to scientific understanding and discovery, two central goals of science, and quantitative methods and emerging technologies, from physical instruments like microscopes to research techniques like bootstrapping, have long been used to achieve these goals. The introduction of digitization in the 1950s paved the way for the widespread use of computing in scientific research. The rise of data science since the 2010s has enabled artificial intelligence to provide valuable guidance by identifying scientifically relevant patterns from large data sets.

Although scientific practices and procedures vary across the stages of scientific research, the development of AI algorithms spans traditionally siloed disciplines (Figure 1). Such algorithms can enhance the design and execution of scientific studies. They are becoming indispensable tools for researchers by optimizing parameters and functions; automating procedures for collecting, visualizing, and processing data; exploring the vast space of candidate hypotheses; and generating hypotheses and estimating their uncertainty to propose relevant experiments.

Since the early 2010s, the capabilities of AI methods have grown thanks to the availability of large datasets, aided by fast and massively parallel computing and storage hardware (GPUs and supercomputers), together with new algorithms. The latter include deep representation learning (Box 1), in particular multilayer neural networks capable of identifying fundamental, compact features that can simultaneously solve many of the tasks underlying scientific problems. Among them, geometric deep learning (Box 1) has been shown to help integrate scientific knowledge, presented as compact mathematical statements of physical relationships, prior distributions, constraints, and other complex descriptors, such as the geometry of atoms in molecules. Self-supervised learning (Box 1) enables neural networks trained on unlabeled data to transfer learned representations to different domains with only a few labeled examples, for example by pre-training a large foundation model and adapting it to solve different tasks across domains. In addition, generative models (Box 1) can estimate the underlying data distribution of a complex system and support new designs. Unlike other uses of AI, reinforcement learning methods (Box 1) find the best strategy for an environment by exploring many possible scenarios and assigning rewards to different actions based on metrics such as expected experimental outcomes.

In AI-driven scientific discovery, scientific knowledge can be incorporated into AI models through appropriate inductive biases (Box 1): assumptions that encode structure, symmetry, constraints, and prior knowledge. However, applying such laws can lead to equations too complex for humans to solve, even with traditional numerical methods. One emerging approach is to incorporate scientific knowledge into AI models by including information about fundamental equations, such as the laws of physics or the principles of molecular structure and binding. Such inductive biases can enhance AI models by reducing the number of training examples required to reach a given accuracy and by extending the analysis to large, unexplored regions of the scientific hypothesis space.

Using AI for scientific innovation and discovery presents unique challenges compared with other areas where AI is applied. One of the biggest is the enormous hypothesis space of scientific questions, which makes systematic exploration infeasible. In biochemistry, for example, there are an estimated 10^60 drug-like molecules to explore. AI systems have the potential to revolutionize scientific workflows by speeding up processes and providing predictions that approach experimental accuracy. However, obtaining reliably annotated datasets for AI models can involve time-consuming and resource-intensive experiments and simulations. Despite these challenges, AI systems enable efficient, intelligent, and highly autonomous experimental design and data collection, in which AI systems can operate under human supervision to assess, evaluate, and act on results. This capability facilitates the development of AI agents that continuously interact with dynamic environments and can, for example, make real-time decisions to navigate a stratospheric balloon. AI systems can also play a valuable role in interpreting scientific datasets and extracting relationships and knowledge from the scientific literature. Recent research shows that unsupervised language models have the potential to capture complex scientific concepts, such as the periodic table of the elements, and to predict applications of functional materials years before their discovery, suggesting that latent knowledge about future discoveries may be embedded in past publications.

Recent research advances, including the solution of a 50-year-old protein-folding problem and AI-driven simulations of molecular systems containing millions of particles, demonstrate the potential of AI to tackle challenging scientific problems. However, this remarkable promise also comes with significant challenges for the emerging field of AI4Science. As with any new technology, the success of AI for science depends on the ability to integrate it into routine practice and to understand its potential and limitations. Barriers to the widespread adoption of AI in scientific discovery include internal and external factors specific to each stage of the discovery process, as well as concerns about the practicality and potential misuse of methods, theory, software, and hardware. The paper explores the development of AI for science and addresses key issues, including the conduct of science, traditional skepticism, and implementation challenges.

3. AI-aided data collection and curation for scientific research

The increasing size and complexity of data sets collected by experimental platforms has led to an increasing reliance on real-time processing and high-performance computing in scientific research to selectively store and analyze data generated at high rates.

Data selection

A typical particle-collision experiment generates more than 100 terabytes of data per second, pushing the limits of existing data transmission and storage technologies. In these physics experiments, more than 99.99% of the raw instrument data represents background events that must be detected and discarded in real time to manage the data rate. To identify rare events for future scientific study, deep learning methods replace pre-programmed hardware event triggers with algorithms that search for anomalous signals, detecting unforeseen or rare phenomena that might otherwise be missed during compression. The background process can be modeled generically using a deep autoencoder (Box 1): the autoencoder returns higher loss values (anomaly scores) for unseen signals (low-probability events) that deviate from the background distribution. Unlike supervised anomaly detection, unsupervised anomaly detection does not require annotations and has been widely used in physics, neuroscience, earth science, oceanography, and astronomy.
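
A minimal sketch of this idea follows: an autoencoder is trained to reconstruct background events only, and at deployment the per-event reconstruction error serves as an anomaly score. All architectures, shapes, and thresholds below are illustrative assumptions, not taken from any real trigger system.

```python
import torch
import torch.nn as nn

# Deep autoencoder trained to reconstruct background events.
class AutoEncoder(nn.Module):
    def __init__(self, dim=64, latent=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, 32), nn.ReLU(), nn.Linear(32, dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

background = torch.randn(10_000, 64)   # stand-in for abundant background events
for _ in range(100):                   # train to reconstruct background only
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(background), background)
    loss.backward()
    opt.step()

# At deployment, reconstruction error acts as an anomaly score: events the
# model has never seen (rare signals) reconstruct poorly.
with torch.no_grad():
    events = torch.randn(1000, 64)
    scores = ((model(events) - events) ** 2).mean(dim=1)
    keep = events[scores > scores.quantile(0.999)]  # retain only the most anomalous events
```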

Data annotation

Training supervised models requires annotated datasets whose labels provide the supervisory signal used to guide model training and to estimate a function or conditional distribution over the target variable from the inputs. Pseudo-labeling and label propagation are attractive alternatives to laborious manual labeling, allowing automatic annotation of large unlabeled datasets based on only a small set of precise annotations. In biology, because labels are difficult to generate experimentally, techniques that assign functional and structural labels to newly characterized molecules are critical for the downstream training of supervised models. For example, despite the proliferation of next-generation sequencing technologies, less than 1% of sequenced proteins are annotated with biological functions. One strategy for data labeling is to use surrogate models trained on human-labeled data to label unlabeled samples and to use these predicted pseudo-labels to supervise downstream predictive models. In contrast, label propagation spreads labels to unlabeled samples through a similarity graph built from feature embeddings (Box 1). Beyond automatic labeling, active learning (Box 1) can identify the most informative data points for humans to label, or the most informative experiments to conduct, allowing models to be trained with fewer expert-provided labels. Another strategy is to develop labeling rules that leverage domain knowledge.
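
A minimal pseudo-labeling sketch, under fully synthetic data: a surrogate model trained on a small labeled set annotates a large unlabeled pool, and only confident predictions are kept to supervise a downstream model. The confidence threshold and models are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(100, 16))
y_labeled = (X_labeled[:, 0] > 0).astype(int)   # stand-in for expert annotations
X_unlabeled = rng.normal(size=(10_000, 16))

# Surrogate model trained on the small, precisely annotated set.
surrogate = LogisticRegression().fit(X_labeled, y_labeled)
probs = surrogate.predict_proba(X_unlabeled)

confident = probs.max(axis=1) > 0.95            # keep only high-confidence pseudo-labels
X_pseudo = X_unlabeled[confident]
y_pseudo = probs[confident].argmax(axis=1)

# The enlarged dataset then supervises a downstream predictive model.
X_train = np.vstack([X_labeled, X_pseudo])
y_train = np.concatenate([y_labeled, y_pseudo])
downstream = LogisticRegression().fit(X_train, y_train)
```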

Data generation

Deep learning performance improves with the quality, diversity, and size of the training dataset. A fruitful way to build better models is to enhance the training dataset with additional synthetic data points generated through automatic data augmentation and deep generative models. Beyond manually designed data augmentations (Box 1), reinforcement learning methods can discover automatic augmentation strategies that are flexible and agnostic to downstream models. Deep generative models, including variational autoencoders, generative adversarial networks, normalizing flows, and diffusion models, learn the underlying data distribution and can sample training points from the optimized distribution. Generative adversarial networks (Box 1) have proven particularly beneficial for scientific imaging because of their ability to synthesize realistic images across many domains, such as particle-collision events, pathology slides, chest X-rays, magnetic resonance contrasts, three-dimensional (3D) material microstructures, protein functions, and gene sequences.

Data refinements

Precision instruments, such as ultra-high-resolution lasers and non-invasive microscopy systems, measure physical quantities directly or indirectly by counting real-world objects, producing highly precise results. AI techniques can significantly improve measurement resolution, reduce noise, and eliminate measurement rounding errors, yielding accuracy that is consistent across sites. Examples of AI applications in scientific experiments include visualizing regions of space-time such as black holes, capturing physical particle collisions, improving the resolution of images of living cells, and better detecting cell types across biological contexts. Deep convolutional methods exploit algorithmic advances such as spectral deconvolution, flexible sparsity, and generative capabilities to convert measurements with poor spatiotemporal resolution into high-quality, super-resolved, structured images. Across scientific disciplines, an important AI task is denoising: distinguishing relevant signal from noise and learning to remove the noise. Denoising autoencoders can project high-dimensional input data into more compact representations of essential features. These autoencoders minimize the difference between an uncorrupted input data point and its reconstruction from the compressed representation of a noise-corrupted version. Other distribution-learning autoencoders, such as variational autoencoders (VAEs; Box 1), are also frequently used. VAEs learn a stochastic representation through latent autoencoding that preserves essential data characteristics while ignoring non-essential sources of variation that likely represent random noise. For example, in single-cell genomics, such autoencoders are used to denoise count-based gene-expression measurements and improve the analysis of protein and RNA expression.
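
A sketch of the denoising-autoencoder objective just described: corrupt each input with noise and train the network to recover the uncorrupted version. The architecture and noise level are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Shallow encoder/decoder standing in for a deeper network.
encoder = nn.Linear(128, 16)
decoder = nn.Linear(16, 128)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

clean = torch.randn(4096, 128)                        # stand-in for clean measurements
for _ in range(200):
    noisy = clean + 0.3 * torch.randn_like(clean)     # noise-corrupted version
    recon = decoder(encoder(noisy))
    loss = nn.functional.mse_loss(recon, clean)       # compare against the *clean* input
    opt.zero_grad()
    loss.backward()
    opt.step()
```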

4. Learning meaningful representations of scientific data

Deep learning can extract meaningful representations of scientific data at different levels of abstraction and optimize them to guide research, often through end-to-end learning (Box 1). A high-quality representation should retain as much information about the data as possible while remaining simple and accessible. Scientifically meaningful representations are compact and discriminative, disentangle the underlying factors of variation, and encode underlying mechanisms that generalize across many tasks. Here, the paper introduces three emerging strategies for meeting these requirements: geometric priors, self-supervised learning, and language modeling.

Geometric priors

Integrating geometric priors into learned representations has proven effective because geometry and structure play a central role in the sciences. Symmetry is a widely studied concept in geometry: it can be described through the invariance and equivariance (Box 1) of a mathematical function, such as a neural feature encoder, under a group of transformations, such as the SE(3) group in rigid-body dynamics. Important structural properties, such as the secondary-structure content of a molecular system, solvent accessibility, residue compactness, and hydrogen-bonding patterns, are invariant to spatial orientation. In the analysis of scientific images, objects do not change identity when translated within the image, so image segmentation masks should be translation-equivariant: when the input pixels are translated, the masks translate accordingly. Incorporating symmetry into models benefits AI applications with limited labeled data, such as 3D RNA and protein structures, and can improve extrapolation to inputs significantly different from those encountered during training.
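
A small sketch of invariance, assuming atom coordinates as the input: the pairwise distance matrix of a 3D point cloud is unchanged under any rigid-body transformation, so a model built on it inherits the symmetry for free. The data are synthetic and purely illustrative.

```python
import numpy as np

def pairwise_distances(coords):
    # Rotation- and translation-invariant descriptor of a point cloud.
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def random_orthogonal(rng):
    # QR decomposition of a Gaussian matrix yields a random orthogonal matrix
    # (a rotation or reflection; both preserve distances).
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    return q * np.sign(np.diag(r))

rng = np.random.default_rng(0)
atoms = rng.normal(size=(20, 3))                       # stand-in for molecular coordinates
transformed = atoms @ random_orthogonal(rng).T + rng.normal(size=3)  # rotate + translate

# The descriptor is identical for the original and transformed coordinates.
assert np.allclose(pairwise_distances(atoms), pairwise_distances(transformed))
```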

Geometric deep learning

Graph neural networks have become a leading approach for deep learning on datasets with underlying geometric and relational structure (Figure 2a). More broadly, geometric deep learning involves discovering relational patterns and equipping neural network models with inductive biases that explicitly exploit encoded local information through neural message-passing algorithms. Depending on the scientific problem, various graph representations have been developed to capture complex systems: directed edges can facilitate physical modeling of glassy systems; hyperedges connecting multiple nodes support the understanding of chromatin structure; multimodal graphs are used to build predictive models in genomics; and sparse, irregular, highly relational graphs have been applied to many Large Hadron Collider physics tasks, including reconstructing particle-detector readings and discriminating physical signals from background processes.
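
A minimal sketch of one neural message-passing step: each node aggregates transformed features from its neighbors via the adjacency matrix. Real geometric-deep-learning libraries add normalization, edge features, and attention; everything here is an illustrative simplification.

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.message = nn.Linear(dim, dim)       # transforms neighbor features
        self.update = nn.Linear(2 * dim, dim)    # combines node state with messages

    def forward(self, h, adj):
        # adj: (n, n) 0/1 adjacency matrix; h: (n, dim) node features.
        msgs = adj @ self.message(h)             # sum messages from neighbors
        return torch.relu(self.update(torch.cat([h, msgs], dim=-1)))

n, dim = 5, 8
adj = (torch.rand(n, n) < 0.4).float()
adj = ((adj + adj.T) > 0).float()                # make the graph undirected
h = torch.randn(n, dim)
layer = MessagePassingLayer(dim)
h = layer(h, adj)   # stacking such layers propagates information along edges
```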

Self-supervised learning

Supervised learning may be insufficient when only a few labeled samples are available for model training, or when labels for a specific task are very expensive to obtain. In such cases, leveraging both labeled and unlabeled data can improve model performance and learning capability. Self-supervised learning is a technique that enables a model to learn general characteristics of a dataset without relying on explicit labels. Effective self-supervised strategies include predicting occluded regions of an image, predicting past or future frames in a video, and using contrastive learning to teach the model to distinguish similar from dissimilar data points (Figure 2b). Self-supervised learning can be a critical pre-training step for learning transferable features from large unlabeled datasets, after which the model is fine-tuned on small labeled datasets for downstream tasks. Such a pre-trained model, with a broad understanding of the scientific domain, is a general-purpose predictor that can be adapted to a variety of tasks, improving label efficiency beyond purely supervised methods.
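
A sketch of a contrastive (InfoNCE-style) loss, assuming two stochastic augmented "views" of each sample: views of the same sample should embed close together, views of different samples far apart. The encoder and augmentation are placeholders.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    # Positives sit on the diagonal of the similarity matrix.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature        # (batch, batch) similarities
    targets = torch.arange(z1.size(0))
    return F.cross_entropy(logits, targets)

encoder = torch.nn.Linear(32, 8)            # stand-in for a deep encoder
x = torch.randn(64, 32)
view1 = encoder(x + 0.1 * torch.randn_like(x))   # two stochastic augmentations
view2 = encoder(x + 0.1 * torch.randn_like(x))
loss = info_nce(view1, view2)               # minimized during pre-training
loss.backward()
```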

Language modeling

Masked language modeling is a popular self-supervised learning method for natural language and biological sequences (Figure 2c). Atoms or amino acids (tokens) are arranged into structures that produce the functions of molecules and organisms, much as letters form words and sentences that define the meaning of a document. As natural-language and biological-sequence processing continue to evolve, they have influenced each other's development. In autoregressive training, the goal is to predict the next token in the sequence, whereas in mask-based training, the self-supervised task is to recover masked tokens in the sequence using bidirectional sequence context. Protein language models can encode amino-acid sequences so as to capture structural and functional properties and assess the evolutionary fitness of viral variants. These representations transfer across a variety of tasks, from sequence design to structure prediction. Chemical language models help efficiently explore vast chemical spaces when dealing with biochemical sequences; they have been used to predict properties, plan multi-step syntheses, and explore the space of chemical reactions.
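
A sketch of the masked-language-modeling setup for protein sequences: randomly hide roughly 15% of tokens and ask the model to recover them from bidirectional context. The vocabulary, masking rate, and example sequence are illustrative assumptions.

```python
import torch

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
vocab = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
MASK_ID = len(vocab)                                  # extra [MASK] token id

def mask_sequence(seq, mask_prob=0.15, seed=0):
    g = torch.Generator().manual_seed(seed)
    ids = torch.tensor([vocab[aa] for aa in seq])
    mask = torch.rand(len(ids), generator=g) < mask_prob
    inputs = ids.clone()
    inputs[mask] = MASK_ID                            # hide the selected tokens
    return inputs, ids, mask                          # model must predict ids at mask

inputs, targets, mask = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
# A Transformer encoder would consume `inputs`, and the training loss is the
# cross-entropy between its predictions and `targets` at the masked positions.
```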

Transformer architectures

The Transformer (Box 1) is a neural architecture that handles token sequences by flexibly modeling interactions between arbitrary pairs of tokens, surpassing earlier efforts to model sequences with recurrent neural networks. Transformers dominate natural language processing and have been successfully applied to a range of problems, including seismic signal detection, DNA and protein sequence modeling, modeling the effects of sequence variation on biological function, and symbolic regression. Although Transformers unify graph neural networks and language models, their runtime and memory footprint scale quadratically with sequence length, leading to efficiency challenges addressed by long-range modeling and linearized attention mechanisms. Consequently, unsupervised or self-supervised generatively pre-trained Transformers, followed by parameter-efficient fine-tuning, are widely used.

Neural operators

Standard neural network models may be unsuitable for scientific applications because they assume a fixed data discretization, which is inappropriate for scientific datasets collected at varying resolutions and on varying grids. Moreover, data are often sampled from underlying physical phenomena defined over continuous domains, such as seismic activity or fluid flow. Neural operators learn representations that are robust to discretization by learning mappings between function spaces. Neural operators are guaranteed to be discretization-invariant, meaning they can handle any discretization of the input and converge to a limit under mesh refinement. Once trained, neural operators can be evaluated at any resolution without retraining. In contrast, the performance of standard neural networks can degrade when the data resolution at deployment differs from that used during model training.

AI-based generation of scientific hypotheses

Testable hypotheses are at the heart of scientific discovery. They can take many forms, from symbolic expressions in mathematics to molecules in chemistry and genetic variants in biology. Formulating meaningful hypotheses can be a laborious process, as it was for Johannes Kepler, who spent four years analyzing data on stars and planets before arriving at the hypothesis that led to the discovery of the laws of planetary motion. AI methods can help at several stages of this process. They can generate hypotheses by identifying candidate symbolic expressions from noisy observations. They can help design objects, such as molecules that bind a therapeutic target or counterexamples that refute a mathematical conjecture, and recommend them for experimental evaluation in the laboratory. Furthermore, AI systems can learn a Bayesian posterior distribution over hypotheses (Box 1) and use it to generate hypotheses compatible with scientific data and knowledge.

Black-box predictors of scientific hypotheses

Identifying promising hypotheses for scientific study requires efficiently examining many candidates and selecting those that maximize the yield of downstream simulations and experiments. In drug discovery, high-throughput screening can evaluate thousands to millions of molecules, and algorithms can prioritize which molecules to investigate experimentally. Models can be trained to predict the utility of an experiment, such as relevant molecular properties or the fit of symbolic formulas to observations. However, ground-truth experimental data for these predictors may be unavailable for many molecules. Weakly supervised learning methods (Box 1) can therefore be used to train these models, with noisy, limited, or imprecise supervision serving as the training signal. Such supervision provides a cost-effective proxy for human expert annotation, expensive in silico computation, or higher-fidelity experiments (Fig. 3a).

AI methods trained with high-fidelity simulations have been used to efficiently screen large molecular libraries, such as candidate organic light-emitting-diode materials and 11 billion synthesis-based candidate ligands. In genomics, Transformer architectures trained to predict gene-expression values from DNA sequences can help prioritize genetic variants. In particle physics, identifying intrinsic charm quarks in the proton involves screening all possible structures and fitting the experimental data for each candidate. To further improve efficiency, AI-selected candidates can be sent to medium- or low-throughput experiments, and the experimental feedback used to continuously refine the candidates. These results can be fed back into AI models via active learning and Bayesian optimization (Box 1), allowing the algorithms to improve their predictions and focus on the most promising candidates.

AI methods have become invaluable when hypotheses involve complex objects such as molecules. For example, in protein folding, AlphaFold can predict the 3D atomic coordinates of a protein from its amino-acid sequence with atomic accuracy, even for proteins whose structures differ from any protein in the training dataset. This breakthrough spurred various AI-driven protein-folding methods, such as RoseTTAFold. Beyond such forward problems, AI methods are increasingly used for inverse problems, which aim to understand the causal factors that produce a set of observations. Inverse problems such as inverse folding, or fixed-backbone design, can predict the amino-acid sequence from the backbone 3D atomic coordinates of a protein using black-box predictors trained on millions of protein structures. However, such black-box AI predictors require large training datasets and, despite reducing reliance on prior scientific knowledge, provide limited interpretability.

Navigating combinatorial hypothesis spaces

Although sampling all hypotheses compatible with the data is daunting, a more manageable goal is to find a good hypothesis, which can be formulated as an optimization problem. Instead of relying on human-designed rules, AI strategies can estimate the reward of each search step and prioritize search directions with higher value. The policy is typically learned by an agent trained with a reinforcement learning algorithm: the agent learns to take actions in the search space that maximize a reward signal, which can be defined to reflect the quality of the generated hypotheses or other relevant criteria.

To solve the optimization problem, the symbolic regression task can be addressed with an evolutionary algorithm that generates random symbolic rules as an initial population. In each generation, the candidate solutions are slightly modified; the algorithm checks whether any modification produces a symbolic law that fits the observations better than the previous solutions and retains the best for the next generation. Increasingly, however, reinforcement learning methods are replacing this standard strategy. Reinforcement learning uses a neural network to generate a mathematical expression sequentially, adding mathematical symbols from a predefined vocabulary and using a learned policy to decide which symbol to add next. Mathematical formulas are represented as parse trees; the learned policy takes a parse tree as input and determines which leaf nodes to expand and which symbols (from the vocabulary) to add (Figure 3b). Another way to use neural networks for mathematical problems is to convert a mathematical formula into a binary sequence of symbols; a neural network policy can then grow the sequence probabilistically, one binary character at a time. By designing a reward that measures the ability to refute a conjecture, this approach can find refutations of mathematical conjectures without prior knowledge of the mathematical problem.
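
A toy sketch of growing a symbolic expression symbol by symbol, in the spirit of the sequential generation described above. Here the "policy" is uniform random over a tiny vocabulary; a learned policy would instead condition on the partial parse tree and be trained to maximize the reward (fit to observations).

```python
import random

UNARY, BINARY, LEAVES = ["sin", "exp"], ["+", "*"], ["x", "1.0"]

def grow_expression(depth=0, max_depth=3, rng=random.Random(0)):
    # At each step, choose which symbol to add next (leaf, unary, or binary).
    if depth >= max_depth or rng.random() < 0.3:
        return rng.choice(LEAVES)
    if rng.random() < 0.5:
        return f"{rng.choice(UNARY)}({grow_expression(depth + 1, max_depth, rng)})"
    left = grow_expression(depth + 1, max_depth, rng)
    right = grow_expression(depth + 1, max_depth, rng)
    return f"({left} {rng.choice(BINARY)} {right})"

# Candidate expressions would be scored by how well they fit the data (the
# reward), and the policy updated to make high-reward expansions likelier.
for _ in range(3):
    print(grow_expression())
```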

Combinatorial optimization also applies to tasks such as discovering molecules with desirable drug properties, where each step in molecule design is a discrete decision. A partially generated molecular graph is given as input to a learned policy, which makes discrete choices about where to add a new atom and which atom to add at the selected position. Performed iteratively, this process generates a range of possible molecular structures that are evaluated for their fit to the target properties. The search space is too large to explore all possible combinations, but reinforcement learning can effectively guide the search by prioritizing the most promising branches. Reinforcement learning methods can also be trained with an objective that encourages the resulting policy to sample from all reasonable solutions (those with high reward) rather than converging on a single good solution, as in standard reward maximization. Such methods have been successfully applied to a variety of optimization problems, including maximizing protein expression, planning hydropower development to reduce adverse impacts on the Amazon basin, and exploring the parameter space of particle accelerators.

Policies learned by AI agents can include far-sighted actions that initially seem unconventional but prove effective. In mathematics, for example, supervised models can identify patterns and relationships between mathematical objects, helping to guide intuition and generate conjectures; such analyses have pointed to previously unknown patterns and relationships. However, reinforcement learning methods may not generalize well to unseen data, because during training the agent can fall into a local optimum once it finds a sequence of actions that works well. To improve generalization, exploration strategies are needed to collect a wider range of search trajectories, helping the agent perform better in new and modified settings.

Optimizing differentiable hypothesis spaces

Scientific hypotheses often take the form of discrete objects, such as symbolic formulas in physics or chemical compounds in pharmaceutical and materials science. While combinatorial optimization techniques have succeeded on such problems, differentiable spaces can also be used for optimization because they lend themselves to gradient-based methods that efficiently find local optima. Two approaches are commonly used to enable gradient-based optimization. The first uses models such as VAEs to map discrete candidate hypotheses to points in a differentiable latent space. The second relaxes the discrete hypotheses into differentiable objects that can be optimized directly; this relaxation can take different forms, such as replacing discrete variables with continuous ones or using soft versions of the original constraints.

Applications of symbolic regression in physics have used grammar VAEs. These models represent discrete symbolic expressions as parse trees generated by a context-free grammar and map the trees into a differentiable latent space; Bayesian optimization is then employed to optimize over the latent space of symbolic rules while ensuring that the expressions remain syntactically valid. In a related line of work, Brunton and colleagues introduced a method that makes symbolic rules differentiable by assigning trainable weights to predefined basis functions. Sparse regression is used to select the linear combination of basis functions that accurately represents the dynamical system while remaining compact. Unlike equivariant neural networks, which use predefined inductive biases to enforce symmetries, symmetries can also be discovered as characteristic behaviors of a domain. For example, Liu and Tegmark expressed the violation of symmetry as a smooth loss function and minimized it to extract previously unknown symmetries. This method was applied to black-hole waveform datasets, revealing hidden symmetries and unexpected space-time structures that had historically been challenging to find.
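
A sketch of sparse regression over a library of basis functions, in the spirit of the Brunton-style approach described above: recover dx/dt = -2x + x^2 from noisy derivative measurements by repeatedly thresholding small coefficients. Data, library, and threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
dxdt = -2 * x + x**2 + 0.01 * rng.normal(size=200)   # "measured" derivatives

# Library of candidate terms; the true dynamics use only two of them.
library = np.column_stack([np.ones_like(x), x, x**2, x**3, np.sin(x)])
names = ["1", "x", "x^2", "x^3", "sin(x)"]

coeffs, *_ = np.linalg.lstsq(library, dxdt, rcond=None)
for _ in range(10):                                   # sequential thresholding
    small = np.abs(coeffs) < 0.1
    coeffs[small] = 0.0
    big = ~small
    coeffs[big], *_ = np.linalg.lstsq(library[:, big], dxdt, rcond=None)

print({n: round(c, 3) for n, c in zip(names, coeffs) if c != 0})
# Expected to recover approximately {'x': -2.0, 'x^2': 1.0}: a compact symbolic law.
```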

In astrophysics, VAEs have been used to estimate gravitational-wave detector parameters based on pre-trained black-hole waveform models. This approach is six orders of magnitude faster than conventional methods, making it possible to capture transient gravitational-wave events. In materials science, thermodynamic rules have been combined with an autoencoder to design an interpretable latent space for identifying phase maps of crystal structures. In chemistry, models such as the SMILES-VAE can convert SMILES strings (Simplified Molecular Input Line Entry System: a discrete-symbol notation for chemical structures that computers can easily parse) into a differentiable latent space that can be optimized with Bayesian optimization (Fig. 3c). By representing molecular structures as points in a latent space, one can design differentiable objectives and use self-supervised learning to predict the properties of molecules from their latent representations. This means discrete molecular structures can be optimized by backpropagating gradients from an AI predictor all the way to the continuous-valued representation of the molecular input; a decoder then converts the optimized representation into an approximately corresponding discrete structure. This approach has been used in the design of proteins and small molecules.
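
A sketch of gradient-based optimization in a learned latent space: a frozen property predictor scores latent points, gradient ascent finds a latent code with a better predicted property, and a decoder maps it back. Both networks below are untrained placeholders standing in for a real VAE and predictor.

```python
import torch
import torch.nn as nn

latent_dim = 16
decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, 128))
predictor = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 1))

z = torch.randn(1, latent_dim, requires_grad=True)  # start from a random latent point
opt = torch.optim.Adam([z], lr=0.05)

for _ in range(100):
    opt.zero_grad()
    loss = -predictor(z).mean()   # ascend the predicted property (e.g., binding affinity)
    loss.backward()               # gradients flow back to the latent code itself
    opt.step()

candidate = decoder(z)            # decode the optimized latent point back into an
                                  # (approximate) discrete candidate structure
```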

Optimization in a latent space allows more flexible modeling of the underlying data distribution than mechanistic approaches in the original hypothesis space. However, extrapolated predictions in sparsely explored regions of the hypothesis space can be poor. In many scientific disciplines, the hypothesis space is far larger than what can be tested experimentally: for example, there are an estimated 10^60 drug-like molecules, while even the largest chemical libraries contain fewer than 10^10 molecules. There is therefore an urgent need for methods that efficiently search for and identify high-quality candidates in these largely unexplored regions.

5. AI-driven experimentation and simulation

Evaluating scientific hypotheses through experiments is central to scientific discovery, but laboratory experiments can be expensive and sometimes impractical. Computer simulations have emerged as a promising alternative, offering more efficient and flexible experimentation. Simulations rely on handcrafted parameters and heuristics to imitate real-world scenarios, and compared with physical experiments they trade accuracy for speed while also requiring an understanding of the underlying mechanisms. With the advent of deep learning, these challenges are being addressed by identifying and optimizing hypotheses for efficient testing and by empowering computer simulations to connect observations with hypotheses.

Efficient evaluation of scientific hypotheses

AI systems provide experimental design and optimization tools that can augment traditional scientific methods, reduce the number of experiments required, and save resources. Specifically, AI can assist with two essential steps of experimental testing: planning and steering. Traditionally, these steps rely on trial and error, which can be inefficient and sometimes even dangerous. AI planning provides a systematic approach to designing experiments, optimizing their efficiency, and exploring unknown territory, while AI steering guides the experimental process toward productive hypotheses, allowing the system to learn from previous observations and adjust course. These AI methods can be model-based, using simulations and prior knowledge, or based purely on machine learning algorithms.

AI systems can help plan experiments by optimizing the use of resources and reducing unnecessary investigation. Unlike hypothesis search, experiment planning concerns the procedures and steps involved in designing a scientific experiment. One example is synthesis planning in chemistry: finding a sequence of steps by which a target compound can be synthesized from available chemicals. AI systems can design synthetic routes that yield the desired compound while reducing the need for human intervention. Active learning is also used for materials discovery and synthesis. Materials synthesis is a complex, resource-intensive process requiring efficient exploration of high-dimensional parameter spaces; active learning iteratively interacts with and learns from experimental feedback to refine hypotheses, using uncertainty estimates to explore the parameter space and reduce uncertainty in as few steps as possible.
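
A sketch of uncertainty-guided experiment selection under synthetic data: a bootstrap ensemble's disagreement approximates predictive uncertainty, and the next experiment is the candidate condition the models disagree about most. The linear models and the toy outcome function are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X_tried = rng.uniform(0, 1, size=(10, 3))                    # conditions tested so far
y_tried = X_tried.sum(axis=1) + 0.05 * rng.normal(size=10)   # measured outcomes

candidates = rng.uniform(0, 1, size=(500, 3))                # untested parameter region

# Bootstrap ensemble of linear models as a cheap uncertainty estimator.
preds = []
for _ in range(20):
    idx = rng.integers(0, len(X_tried), size=len(X_tried))
    w, *_ = np.linalg.lstsq(X_tried[idx], y_tried[idx], rcond=None)
    preds.append(candidates @ w)
uncertainty = np.std(preds, axis=0)

next_experiment = candidates[np.argmax(uncertainty)]         # most informative condition
# Running it, adding the result to (X_tried, y_tried), and repeating shrinks
# uncertainty over the parameter space in as few experiments as possible.
```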

In an ongoing experiment, decisions must often be adjusted in real time. Driven solely by human experience and intuition, this process can be difficult and error-prone. Reinforcement learning offers an alternative: continuously reacting to the evolving environment to maximize the safety and success of the experiment. For example, reinforcement learning has proven effective for the magnetic control of tokamak plasmas, where the algorithm interacts with a tokamak simulator to optimize a control policy (Fig. 4a). In another study, a reinforcement learning agent used real-time feedback, such as wind speed and solar elevation, to control stratospheric balloons and find favorable winds for navigation. In quantum physics, experimental designs must be adjusted dynamically because the optimal choices for later stages of a complex experiment can be counterintuitive. Reinforcement learning methods overcome this by iteratively designing experiments and incorporating experimental feedback; for example, they have been used to optimize the measurement and control of quantum systems, improving experimental efficiency and accuracy.

Deducing observables from hypotheses using simulations

Computer simulations are a powerful tool for deducing observables from hypotheses, making it possible to evaluate hypotheses that cannot be tested directly. However, existing simulation techniques rely heavily on human understanding of the system's underlying mechanisms, which can be suboptimal and inefficient. AI systems can make computer simulations more accurate and efficient by better fitting the key parameters of complex systems, solving the differential equations that govern them, and modeling their states.

Scientists often study complex systems by building models with a parametric form, which requires domain knowledge to specify initial symbolic expressions and parameters. An example is molecular force fields, which are interpretable but limited in the range of functions they can represent and which require strong inductive biases or scientific knowledge to construct. To improve the accuracy of molecular simulations, AI-based neural potentials fitted to expensive but precise quantum-mechanical data have been developed to replace traditional force fields. In addition, uncertainty quantification has been used to locate energy barriers on high-dimensional free-energy surfaces, improving the efficiency of molecular dynamics (Figure 4b). For coarse-grained molecular dynamics, AI models have been used to reduce the computational cost of large systems by learning how much of the system's hidden, complex structure must be retained. In quantum physics, neural networks have replaced manually derived symbolic forms in parameterizing wave functions and density functionals, owing to their flexibility and ability to fit data accurately.

Differential equations are key to modeling the dynamics of complex systems in space and time. AI-based neural solvers integrate data and physics more seamlessly than numerical algebraic solvers; they combine physics with the flexibility of deep learning by grounding neural networks in domain knowledge (Figure 4c). AI methods have been applied to differential equations in many fields, including computational fluid dynamics, predicting the structure of glassy systems, solving stiff chemical kinetics, and solving the Eikonal equation to characterize seismic-wave travel times. In dynamical modeling, continuous time can be modeled with neural ordinary differential equations. Neural networks can use a physics-informed loss to parameterize solutions to the Navier-Stokes equations. However, standard convolutional neural networks have limited capacity to model fine structural features of the solution, a problem addressed by using neural networks to learn mappings between functions. Furthermore, a solver must adapt to different domains and boundary conditions, which can be achieved by combining neural differential equations with graph neural networks over arbitrarily discretized domains.
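
A minimal sketch of a physics-informed loss, solving du/dt = -u with u(0) = 1 (exact solution exp(-t)) rather than the full Navier-Stokes equations: the differential equation itself supplies the training signal, so no solution data are needed. Architecture and training settings are illustrative assumptions.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for _ in range(2000):
    t = torch.rand(256, 1) * 5.0                  # random collocation points in [0, 5]
    t.requires_grad_(True)
    u = net(t)
    du_dt, = torch.autograd.grad(u.sum(), t, create_graph=True)
    physics = ((du_dt + u) ** 2).mean()           # residual of du/dt = -u
    initial = (net(torch.zeros(1, 1)) - 1.0).pow(2).mean()  # enforce u(0) = 1
    loss = physics + initial
    opt.zero_grad()
    loss.backward()
    opt.step()

print(net(torch.tensor([[1.0]])).item())          # should approach exp(-1) ≈ 0.368
```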

Statistical modeling is a powerful tool that provides a complete quantitative description of a complex system by modeling the distribution over its states. Owing to their ability to capture highly complex distributions, deep generative models have recently become a valuable approach for simulating complex systems. A well-known example is the Boltzmann generator, based on normalizing flows (Box 1). A normalizing flow maps any complex distribution to a prior distribution (for example, a simple Gaussian) and back using a series of invertible neural networks. Although computationally expensive (often requiring hundreds or thousands of neural layers), a normalizing flow provides an exact density function, which makes both sampling and training possible. Unlike traditional simulations, a normalizing flow can produce equilibrium states by directly sampling from the prior distribution and applying the neural network at a fixed computational cost. This enhances sampling in lattice field and gauge theories and improves Markov chain Monte Carlo methods that might otherwise fail to converge owing to poor mode mixing.
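
A sketch of the change-of-variables rule underlying normalizing flows: an invertible transform maps samples from a simple Gaussian prior to a target space, and its log-determinant corrects the density. A single hand-fixed affine layer stands in here for a deep stack of learned invertible layers.

```python
import torch

log_scale = torch.tensor([0.5, -0.3])   # would normally be learned parameters
shift = torch.tensor([1.0, 2.0])

def forward(z):
    # Prior sample -> data space; invertible by construction.
    return z * torch.exp(log_scale) + shift

def log_prob(x):
    # Exact density via change of variables: log p(x) = log p(z) - sum(log_scale).
    z = (x - shift) * torch.exp(-log_scale)   # invert the transform
    prior = torch.distributions.Normal(0.0, 1.0)
    return prior.log_prob(z).sum(-1) - log_scale.sum()

z = torch.randn(5, 2)   # sampling has fixed cost: one forward pass per sample
x = forward(z)
print(log_prob(x))      # tractable, exact log-densities
```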

6. Major challenges

To leverage scientific data, models must be built and used in combination with simulation and human expertise. This integration opens up opportunities for scientific discovery, but to further increase the impact of AI across scientific disciplines, significant advances in theory, methods, software, and hardware infrastructure are needed. Interdisciplinary collaboration is critical to achieving a comprehensive, practical approach to advancing science through AI.

Practical considerations

Scientific datasets are often not directly amenable to AI analysis: limitations of measurement techniques can yield incomplete datasets and biased or conflicting readouts, and privacy and security concerns can limit accessibility. Standardized, transparent formats are needed to ease the workload of data processing. Model cards and datasheets are examples of documenting the operating characteristics of scientific datasets and models. In addition, federated learning and cryptographic algorithms can prevent the release of sensitive data with high commercial value into the public domain. Leveraging the open scientific literature, natural language processing, and knowledge-graph techniques can facilitate literature mining to support materials discovery, chemical synthesis, and therapeutic science.

Deep learning poses complex challenges for AI-driven design, discovery, and evaluation. To automate scientific workflows, optimize large-scale simulation codes, and operate instruments, autonomous robotic control can leverage model predictions and conduct experiments on high-throughput synthesis and testing lines, creating self-driving laboratories. Early applications of generative models in materials exploration suggest that millions of possible materials can be identified as possessing desired properties and functions and assessed for synthesizability. For example, King et al. combined logic-based AI and robotics to automatically generate functional-genomics hypotheses about yeast and tested them experimentally using laboratory automation. In chemical synthesis, AI optimizes candidate synthetic routes, and robots then carry out the chemical reactions along the predicted routes.

Implementing AI systems in practice involves complex software and hardware engineering: a series of interdependent steps from data management and processing to algorithm implementation and the design of user and application interfaces. Small implementation differences can lead to large changes in performance and affect the success of integrating AI models into scientific practice, so both data and model standardization need to be considered. The reproducibility of AI methods can suffer from the stochastic nature of model training, varying model parameters, and evolving training datasets; standardized benchmarks and experimental designs can mitigate these issues. Another route to better reproducibility is open-source initiatives that release open models, datasets, and educational programs.

Algorithmic innovations

To promote scientific understanding, or to acquire it autonomously, algorithmic innovation must build an integrated ecosystem that applies the most appropriate algorithms at each stage of the scientific process.

The problem of out-of-distribution generalization is at the forefront of AI research. A neural network trained on data from a specific regime may discover regularities that do not generalize to regions where the underlying distribution shifts (Box 1). Although many scientific laws are not universal, their applicability tends to be broad. The human brain generalizes to modified settings better and faster than state-of-the-art AI. An appealing hypothesis is that this is because humans build not only a statistical model of what they observe but also a causal model: a family of statistical models indexed by all possible interventions (for example, different initial states, agent behaviors, or different regimes). Incorporating causality into AI is still an emerging field, and much work remains. Techniques such as self-supervised learning have great potential for scientific problems because they can exploit large amounts of unlabeled data and transfer knowledge to low-data regimes. However, current transfer-learning schemes can be ad hoc, lacking theoretical guidance and vulnerable to shifts in the underlying distribution. While initial attempts have addressed this challenge, more exploration is needed to systematically measure transferability across domains and prevent negative transfer. Additionally, to address the difficulties scientists care about, AI methods must be developed and evaluated in real-world scenarios, such as synthetically feasible routes in drug design, and include calibrated uncertainty estimators for assessing model reliability before real-world deployment.

Scientific data are multimodal, including images (such as black-hole images in cosmology), natural language (such as scientific documents), time series (such as the thermal yellowing of materials), sequences (such as biological sequences), graphs (such as complex systems), and structures (such as 3D protein-ligand conformations). For example, in high-energy physics, jets are collimated sprays of particles produced by high-energy quarks and gluons; identifying their substructure from radiation patterns can aid the search for new physics. The substructure of a jet can be described with images, sequences, binary trees, general graphs, and sets of tensors. Although processing images with neural networks has been studied extensively, particle images alone are not sufficient, and likewise other representations of jet substructure used in isolation cannot give a holistic, integrated view of a complex system. Integrating multimodal observations remains challenging, but the modular nature of neural networks means that different neural modules can convert different data modalities into a common vector representation.

Scientific knowledge, such as rotational equivariances of molecules, equality constraints in mathematics, disease mechanisms in biology, and multiscale structure in complex systems, can be incorporated into AI models. However, it remains unclear which principles and which knowledge are most helpful in practice. Because AI models require large amounts of data to fit, incorporating scientific knowledge can aid learning when datasets are small or sparsely annotated. Research must therefore establish principled methods for integrating knowledge into AI models and for understanding the trade-off between domain knowledge and learning from measured data.

AI methods often operate as black boxes, meaning users cannot fully explain how an output is produced or which inputs are critical to producing it. Black-box models reduce users' trust in predictions and have limited applicability where model outputs must be understood before real-world deployment, such as human space exploration, or where predictions shape policy, such as climate science. Despite a plethora of interpretability techniques, transparent deep learning models remain elusive. However, the human brain can synthesize high-level explanations that, even if imperfect, can convince others; this offers hope that, by modeling phenomena at a similarly high level of abstraction, future AI models will provide explanations at least as valuable as those the human brain provides. It also suggests that studying higher-level cognition may inspire future deep learning models that combine current deep learning capabilities with the ability to manipulate expressible abstractions, perform causal reasoning, and generalize out of distribution.

Conduct of science and scientific enterprise

Going forward, the demand for AI expertise will be shaped by two forces. First, existing problems stand to benefit from AI applications, such as self-driving laboratories. Second, intelligent tools can advance the state of the art and create new opportunities, such as probing biological, chemical, or physical processes at length and time scales not accessible experimentally. Building on these two forces, the paper anticipates changes in the composition of research teams to include AI experts and software and hardware engineers, along with new forms of collaboration spanning government, educational institutions, and companies. Recent state-of-the-art deep learning models continue to grow: they comprise millions or even billions of parameters, and their size has been increasing roughly tenfold year over year. Training these models involves passing data through complex parametric mathematical operations and updating the parameters to drive the model outputs toward desired values. The computational and data requirements for these updates are enormous, resulting in large energy consumption and high computational cost. As a result, big technology companies have invested heavily in computing infrastructure and cloud services, pushing the limits of scale and efficiency. While for-profit and non-academic organizations have access to extensive computing infrastructure, higher-education institutions may be better positioned to integrate across multiple disciplines. Academic institutions also hold unique historical databases and measurement technologies that may not exist elsewhere but are necessary for AI4Science. These complementary assets foster new models of industry-academia collaboration, which may influence the choice of research questions pursued.

As AI systems match and exceed human performance, using them to substitute for routine laboratory work is becoming feasible. This approach enables researchers to iteratively develop predictive models from experimental data and select the next experiments to improve them, without manually performing laborious, repetitive tasks. To support this paradigm shift, educational programs are emerging to train scientists to design, implement, and apply laboratory automation and AI in scientific research. These programs help scientists understand when the use of AI is appropriate and how to prevent misleading conclusions from AI analyses.

The misuse of AI tools and the misinterpretation of their results can have significant negative consequences, and widespread use exacerbates these risks. However, misuse of AI is not just a technical issue; it also depends on the motivations of those who lead AI innovation and investment in AI implementation. It is critical to establish ethical review procedures and responsible implementation strategies, including a comprehensive account of each method's scope and applicability. Security risks must also be considered, as it has become easier to repurpose dual-use algorithms: because algorithms developed for one purpose are often suitable for a wide range of applications, they can be redirected to other ends, creating vulnerabilities to threats and manipulation.

7. Summary

AI systems can contribute to scientific understanding, offering the ability to study processes and objects that cannot be visualized or probed in any other way, and to systematically generate ideas by building models from data and combining them with simulation and scalable computing. To realize this potential, the safety and security issues raised by AI must be addressed through responsible, thoughtful deployment of the technology. Using AI responsibly in scientific research requires measuring the uncertainty, error, and utility of AI systems; this understanding is critical for accurately interpreting AI outputs and avoiding over-reliance on potentially flawed results. As AI systems continue to evolve, prioritizing reliable implementations with appropriate safeguards is key to minimizing risks and maximizing benefits. AI has the potential to unlock scientific discoveries that were previously out of reach.

Origin blog.csdn.net/INTSIG/article/details/133789493