[Intensive reading of papers] CSET - Big AI Potential of Small Data

【Original title】: Small Data's Big AI Potential

【Author information】: Husanjot Chahal, Helen Toner, Ilya Rahkovsky

Husanjot Chahal is a research analyst at CSET, Helen Toner is director of strategy, and Ilya Rahkovsky is a data scientist.

Link: https://cset.georgetown.edu/publication/small-datas-big-ai-potential/

Blogger keywords: small data, application analysis

Recommended related papers:

- None

Overview:

This issue brief provides an introduction and overview of "small data" AI approaches, that is, methods that help address situations where little or no labeled data is available and that reduce our reliance on massive datasets collected from the real world. According to the conventional understanding of artificial intelligence, data is a key strategic resource, and any meaningful progress in cutting-edge AI requires large amounts of data. This overemphasis on "big data" ignores the existence, and obscures the potential, of the methods described in this paper, which do not require large datasets for training.

We conduct the analysis in two parts. Part I introduces and categorizes the major small data approaches, which we broadly group into five categories—transfer learning, data labeling, artificial data generation, Bayesian methods, and reinforcement learning—and explains why they matter. In doing so, we aim not only to point out the potential benefits of using small data methods, but also to deepen the non-technical reader's understanding of when and how data is useful to AI. Part II draws on original CSET data to present some exploratory findings, assessing the current and projected progress of small data methods in scientific research, which countries are leading the way, and the main sources of funding for this research. Based on our findings, we draw the following four key takeaways:

a) AI is not synonymous with big data; several alternative approaches can be used in different small-data settings.

b) Research on transfer learning is growing especially rapidly (even faster than the larger and better-known field of reinforcement learning), suggesting that this approach may work better and be more widely used in the future than it is today.

c) The United States and China compete closely in small data methods, with the U.S. leading in the two largest categories (reinforcement learning and Bayesian methods), while China holds a smaller but growing lead in transfer learning, the fastest-growing category.

d) For the time being, transfer learning may be a promising target for more U.S. government funding, since government funding accounts for a relatively small share of investment in small data methods compared with its investment pattern across the AI field as a whole.

Introduction:

Conventional wisdom holds that cutting-edge AI relies on vast amounts of data. Under this conception of AI, data is a key strategic resource, and how much data a country (or company) can access is seen as a key indicator of AI progress. This understanding of the role of data in AI is not entirely inaccurate—many current AI systems do use a lot of data. But policymakers go astray if they treat this as an eternal truth for all AI systems. Overemphasizing data ignores the existence, and underestimates the potential, of several AI approaches that do not require large labeled datasets or data collected from real-world interactions. In this paper, we refer to these methods as "small data" methods.

"Small data" is not a precisely defined category, so there is no single, formal, consistent definition. Academic articles discuss small data in relation to the application domain under consideration, often in terms of sample size, such as kilobytes or megabytes versus terabytes of data. Popular media articles attempt to describe small data in terms of various factors, such as its availability and human comprehensibility, or the amount and format of data that makes it accessible, informative, and actionable, especially for business decisions. Many references to data end up treating it as a generic resource. In reality, however, data is not interchangeable: AI systems in different domains require different types of data and different methods, depending on the problem at hand.

This study describes small data from a policymaker's perspective. Government actors are often considered potentially powerful players in the field of AI because of the nature of their interactions with the real world and the vast amounts of data they can therefore collect—e.g., climate monitoring data, geological surveys, border control, social security, voter registration, vehicle and driver records, and more. Most national comparisons of AI competitiveness hold that China has a unique advantage because it has access to more data, citing its large population, strong data collection capabilities, and weak privacy protections. Part of our motivation for writing this paper is to illustrate a range of techniques that make this picture less accurate than is commonly assumed.

Finally, it is sometimes argued that government agencies will only benefit from the AI revolution if they are able to digitize, clean, and label vast amounts of data. While this suggestion has merit, it would be inaccurate to imply that all advances in AI depend on these conditions. Such a belief obscures the fact that the future of AI may not be only about big data, and that AI innovation in government (and beyond) can happen without massive investments in big data infrastructure.

In what follows, our goal is not only to point out the potential benefits of using small data approaches, but also to deepen the non-technical reader's understanding of when and how data is useful. This introduction can be read as a primer on small data methods, that is, methods that minimize reliance on "big data." The analysis is divided into two parts. The first part explains in technical terms what "small data" methods are, which categories they comprise, and why they are important. It provides the conceptual basis for the data-driven analysis in Part II. The second part draws on original CSET data, in particular our merged corpus of scholarly literature covering more than 90 percent of the world's academic output, to present our findings on small data methods across three pillars: research progress, national competitiveness, and funding. Through this analysis we attempt to examine current and projected progress in scientific research, identify which countries are leading the way, and determine the main sources of funding for the research in question. Based on our findings, we summarize four key takeaways.


"Small data" methods are categorized as:

The research in this paper is roughly divided into five categories of "small data" methods: a) transfer learning, b) data labeling, c) artificial data generation, d) Bayesian methods, and e) reinforcement learning. We describe these categories in more detail below, imperfect though they are. Artificial intelligence and machine learning research encompasses a wide range of methods, approaches, and paradigms for solving many different types of problems, making simple categorization difficult. Our goal in describing these categories is to give the reader a rough conceptual sense of the approaches available for training AI systems without large pre-labeled datasets. The categories are not completely separable in practice; they are neither mutually exclusive nor exhaustive.

Transfer learning works by first learning how to perform a task in a data-rich environment, and then "transferring" the knowledge learned there to a task with much less data. This is useful in settings where only a small amount of labeled data is available for the problem of interest, but a large amount of labeled data is available for related problems.

For example, someone developing an app to identify rare bird species may only have a handful of photos of each bird, each tagged with its species. To use transfer learning, they can first train a basic image classifier using a larger, more general image database such as ImageNet, which has millions of images labeled according to thousands of categories. Once the classifier can tell dogs from cats, flowers from fruit, sparrows from swallows, they can feed it smaller datasets of rare birds. The model can then "transfer" what it already knows about how to classify images, using that knowledge to learn a new task (identifying rare bird species) from much less data.
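To make this workflow concrete, here is a minimal transfer learning sketch in Python (PyTorch/torchvision); the dataset folder, number of species, and hyperparameters are invented for illustration and are not taken from the brief:

```python
# A minimal transfer-learning sketch (hypothetical dataset path; not the authors' code).
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

# Start from a classifier pre-trained on ImageNet (the "data-rich" task).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pre-trained feature extractor so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with one sized for the new, data-poor task,
# e.g. 20 rare bird species with only a handful of labeled photos each.
num_rare_species = 20
model.fc = nn.Linear(model.fc.in_features, num_rare_species)

# Small labeled dataset of rare birds (hypothetical folder of images).
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
train_set = datasets.ImageFolder("rare_birds/train", transform=preprocess)
loader = torch.utils.data.DataLoader(train_set, batch_size=8, shuffle=True)

# Fine-tune only the new classification head on the small dataset.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
model.train()
for images, labels in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
```

The key design choice in this pattern is that the pre-trained layers carry over general visual knowledge, so only a small final layer has to be learned from the scarce labeled examples.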

Data labeling methods apply when labeled data is limited but large amounts of unlabeled data are available. Such methods use a range of techniques to make sense of the available unlabeled data, such as automatically generating labels (automatic labeling) or identifying the data points for which labels would be most useful (active learning).

For example, active learning has been used in research on skin cancer diagnosis. An image classification model is initially trained on 100 photos, labeled according to whether they depict skin cancer or healthy skin. The model then has access to a larger set of potential training images from which to select 100 additional photos to label and add to its training data. To learn as much as possible from the limited data, the model was designed to select the additional photos based on which images would be most informative for learning to distinguish photos of healthy skin from photos of skin cancer.
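The selection logic behind this kind of active learning can be sketched as uncertainty sampling; in the toy example below, the features, labels, and "expert" labeling step are randomly generated stand-ins, not the actual skin cancer data or model:

```python
# A minimal active-learning (uncertainty sampling) sketch with made-up data;
# illustrative only, not the study's actual diagnosis pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Tiny labeled seed set and a larger pool of unlabeled examples.
X_labeled = rng.normal(size=(100, 16))
y_labeled = rng.integers(0, 2, size=100)        # 0 = healthy skin, 1 = skin cancer
X_pool = rng.normal(size=(5000, 16))            # unlabeled candidate images (as features)

for round_ in range(3):
    # Train on whatever labels we currently have.
    clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)

    # Score the unlabeled pool by predictive uncertainty (closest to 0.5 = most uncertain).
    probs = clf.predict_proba(X_pool)[:, 1]
    uncertainty = np.abs(probs - 0.5)
    query_idx = np.argsort(uncertainty)[:100]   # pick the 100 most informative points

    # In practice a human expert would label these; here we simulate an oracle.
    new_labels = rng.integers(0, 2, size=len(query_idx))

    X_labeled = np.vstack([X_labeled, X_pool[query_idx]])
    y_labeled = np.concatenate([y_labeled, new_labels])
    X_pool = np.delete(X_pool, query_idx, axis=0)
```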

Artificial data generation is a method that seeks to extract maximum information from small amounts of data through the creation of new data points or other related techniques. This can range from simply making small changes to existing data (e.g. cropping or rotating images in an image classification dataset) to more sophisticated approaches aimed at reasoning about the underlying structure of the available data and extrapolating from it.

As a simple example, computer vision researchers have used computer-aided design (CAD) software—a tool widely used in industries from shipbuilding to advertising—to generate realistic 3D images of everyday objects, and then used these images to augment existing image datasets. Such an approach is more feasible when there is a good source of information about the data of interest (in this case, crowdsourced CAD models). In other cases, a more sophisticated approach may be required. In general, data generation requires making one or another strong assumption about the data in question, and the usefulness of the generated data depends on how valid those assumptions are.
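At the simpler end of this spectrum—small changes such as cropping, flipping, and rotating existing images—a minimal sketch might look like the following (the image path is hypothetical, and this does not reproduce the CAD-based pipeline described above):

```python
# A minimal data-augmentation sketch: generate extra training images by
# randomly cropping, flipping, and rotating the few images we already have.
# (Hypothetical file path; illustrative only.)
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),
])

original = Image.open("rare_birds/train/species_01/photo_001.jpg")

# Each pass through the pipeline yields a slightly different artificial example.
augmented_examples = [augment(original) for _ in range(10)]
```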

The ability to generate additional data is not only useful when dealing with small datasets. In some cases, the details of any single data point may be sensitive (e.g., an individual's health records) while the overall distribution of the data is what interests researchers; synthetic data can then be used to obscure private information by making random changes to the data, making it less identifiable.

Bayesian methods are a large class of methods in machine learning and statistics that share two characteristics. First, they try to explicitly incorporate information about the structure of the problem—so-called "prior" information—into their approach to problem solving. This contrasts with most other machine learning methods, which tend to make minimal assumptions about the problem in question. By incorporating this prior information and then refining it with the available data, Bayesian methods are well suited to certain data-limited situations in which information about the problem can be written down in a useful mathematical form. Second, Bayesian methods focus on producing well-calibrated estimates of the uncertainty in their predictions. This is helpful where data availability is limited, as Bayesian estimates of uncertainty make it easier to identify which data points, if collected, would be most valuable in reducing uncertainty.

As an example of Bayesian work using small data, Bayesian methods have been used to monitor global seismicity, which is relevant both for detecting earthquakes and for verifying nuclear treaties. By building a model that incorporates prior knowledge of seismology, the researchers can leverage existing data to improve the model.
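A toy sketch of the Bayesian pattern described above—encoding a prior, updating it with a handful of observations, and reporting calibrated uncertainty—might look like this (the numbers are invented and unrelated to the seismic monitoring model):

```python
# A minimal Bayesian sketch (toy example, not the seismic-monitoring model):
# estimate a detection rate from very few observations by combining a prior
# with a small amount of data, and keep an explicit measure of uncertainty.
from scipy import stats

# Prior belief about a sensor's detection rate, encoded as a Beta distribution.
# Beta(8, 2) says "probably around 0.8", reflecting domain knowledge.
prior_alpha, prior_beta = 8.0, 2.0

# A small dataset: 5 events were observed, 3 of them detected.
detections, misses = 3, 2

# Conjugate update: the posterior is again a Beta distribution.
post_alpha = prior_alpha + detections
post_beta = prior_beta + misses
posterior = stats.beta(post_alpha, post_beta)

print("posterior mean:", posterior.mean())                 # point estimate
print("95% credible interval:", posterior.interval(0.95))  # calibrated uncertainty
```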

The family of Bayesian methods is a large one, and does not consist solely of methods that are particularly good at working with small datasets. For simplicity, we err on the side of inclusiveness in this study, although this may mean that some of the studies included in this category used large datasets.

Reinforcement learning is a broad term referring to machine learning methods in which an agent (computer system) learns how to interact with its environment through trial and error. Reinforcement learning is commonly used to train gaming systems, robots and self-driving cars.

For example, reinforcement learning has been used to train AI systems that learn to play video games, from simple arcade games like Pong to strategy games like StarCraft. In each case, the system starts out knowing little (or nothing) about how to play the game, but gradually learns by trying actions and observing which ones produce positive reward signals. (In the case of video games, the reward signal usually comes in the form of the player's score.)
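For a rough sense of how such a system learns by trial and error, the toy Q-learning sketch below trains an agent in an invented one-dimensional "corridor" environment rather than an actual video game:

```python
# A minimal reinforcement-learning sketch: tabular Q-learning on a toy
# 1-D corridor, where the agent generates its own data by trial and error.
# (Toy environment invented for illustration.)
import numpy as np

n_states, n_actions = 5, 2        # positions 0..4; actions: 0 = left, 1 = right
goal = n_states - 1               # reaching the rightmost cell gives reward 1
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

for episode in range(500):
    state = 0
    while state != goal:
        # Epsilon-greedy: mostly exploit what we know, sometimes explore.
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(Q[state]))

        next_state = max(0, state - 1) if action == 0 else min(goal, state + 1)
        reward = 1.0 if next_state == goal else 0.0

        # Q-learning update from experience generated during training.
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

print(np.argmax(Q, axis=1))  # learned policy: move right toward the goal
```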

Reinforcement learning systems often end up learning from large amounts of data and require substantial computing resources, so they may seem like a counterintuitive category to include. Nevertheless, we include them because the data they use is typically generated while the system trains—often in a simulated environment—rather than collected and labeled beforehand. In reinforcement learning problems, the agent's ability to interact with its environment is what matters most.

Figure 1 shows how these different areas are interrelated. Each point represents a research cluster (i.e., a group of papers) that we identified as falling into one of the above categories (see the appendices for methodological details). The thickness of the line connecting one research cluster to another indicates the strength of the citation links between the two; no line indicates no citation link. While clusters do tend to be most connected to other clusters of the same category, there are also many connections between clusters of different categories. The figure also shows that the clusters we identified under "reinforcement learning" form a particularly coherent grouping, while the "artificial data" clusters are more diffuse.


Figure 1. Network diagram of small data research clusters

Significance of small data methods:

AI methods that do not rely on large pre-collected labeled datasets have many advantages over data-intensive methods. Among other factors, these methods can:

Reduce capability differences between large and small entities

The growing value of large datasets for many AI applications has raised concerns about differences in organizations' ability to collect, store, and process the required data. This dynamic has the potential to create a gap between AI "haves" (such as big tech companies) and "have-nots," depending on who has the capacity to meet those needs. If methods such as transfer learning, automatic labeling, and Bayesian methods can make AI work with less data, the data-related barriers to entry for small organizations will be lower, helping to narrow the capability gap between large and small entities.

Reduce the incentive to collect large amounts of personal data

Some surveys show that most Americans believe artificial intelligence will greatly reduce the space for personal privacy. This concern stems from the view that large technology companies keep collecting more and more personally identifiable consumer data to train their AI algorithms. Certain small data approaches could ease this concern by reducing the need to collect real-world data for training machine learning models. In particular, methods that artificially generate new data (such as synthetic data generation), or that train algorithms in simulation, either do not rely on personally generated data or can synthesize data in ways that strip out sensitive personally identifiable attributes. This does not mean that all privacy concerns would be resolved, but by reducing the need to collect large amounts of real-world data, these approaches could make the use of machine learning somewhat less dependent on the large-scale collection, use, or disclosure of consumer data.

Making progress in areas where little data is available

Many recent advances in artificial intelligence have been made possible by the explosion of available data. For many important problems, however, there may be little or no data to feed into an AI system. Imagine, for example, building an algorithm that predicts disease risk for a population without electronic health records, or that predicts the likelihood of eruption for a volcano with long intervals between eruptions. Small data approaches can provide a principled way to deal with this scarcity or absence of data. They can do so by leveraging labeled and unlabeled data to transfer knowledge from related problems. Small data methods can also help us create more data points from the few we have at hand, draw on prior knowledge about related domains, or venture into a new domain by building simulations or encoding structural assumptions.

Avoid dirty data problems

Certain small data approaches can also benefit large organizations for which data may exist but is far from clean, well-structured, and ready for analysis. For example, because of siloed data infrastructure and legacy systems, the U.S. Department of Defense holds large amounts of "dirty data" that require time-consuming and labor-intensive cleaning, labeling, and organization. Methods in the data labeling category, for instance, can automatically generate labels to handle large amounts of unlabeled data more easily. Transfer learning, Bayesian approaches, and artificial data approaches—which rely on related datasets, structured models, and synthetic data, respectively—can significantly shrink the dirty data problem by reducing the amount of data that needs to be cleaned.

More generally, we believe it is important for policymakers whose work touches on AI to have a clear understanding of the role that data does (and does not) play in AI development. The factors above do not apply to all the methods we describe. For example, reinforcement learning often requires large amounts of data, but this data is generated during training (for example, as an AI system moves a robotic arm or navigates a virtual environment) rather than collected beforehand.

Findings:

To explore how research on small data methods is developing, we use CSET's research cluster dataset to identify research related to the five categories above (transfer learning, data labeling, artificial data generation, Bayesian methods, and reinforcement learning). A research cluster is a group of scientific articles connected by citation links, indicating instances in which researchers are sharing the ideas, methods, or results they use, or otherwise building on one another's work.

For our analysis, we identified 150 research clusters, each belonging to one of our five categories. For comparison, the dataset includes 735 AI clusters. The 150 identified clusters comprise approximately 80,324 papers drawn from CSET's merged corpus of scholarly literature, which covers more than 90 percent of the world's scholarly output. To determine which papers fall into our "small data" categories, we first worked with technical experts to define a set of keywords related to the five categories. Next, we searched for clusters whose top extracted phrases (drawn from the papers in each cluster) contained any of these keywords. Finally, we manually excluded clusters that were clearly irrelevant to small data. Once we had identified the 150 clusters, each associated with one of our five categories, we assigned all papers within these research clusters to the corresponding category. In following this approach we tried to balance accuracy and inclusiveness: it is quite possible that we missed some relevant papers whose authors do not cite heavily within these research communities, or included some papers that are connected to a cluster but not directly related to the topic under consideration. We therefore encourage readers to treat the analyses in the following sections as exploratory rather than conclusive. See Appendix A for more details on our approach.
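For readers who want a concrete picture of the keyword-matching step, the sketch below mimics the selection logic described above; the category keywords, cluster records, and field names are invented for illustration and are not CSET's actual pipeline, which also relied on expert review:

```python
# A minimal sketch of keyword-based cluster selection (invented data structures
# and keywords; not CSET's actual pipeline, which also involved manual review).
small_data_keywords = {
    "transfer learning": ["transfer learning", "domain adaptation", "fine-tuning"],
    "data labeling": ["active learning", "semi-supervised", "automatic labeling"],
    "artificial data": ["data augmentation", "synthetic data"],
    "bayesian methods": ["bayesian inference", "gaussian process", "prior distribution"],
    "reinforcement learning": ["reinforcement learning", "q-learning", "policy gradient"],
}

# Each research cluster is summarized by the top phrases extracted from its papers.
clusters = [
    {"id": 101, "top_phrases": ["domain adaptation", "image classification"]},
    {"id": 202, "top_phrases": ["protein folding", "molecular dynamics"]},
]

def match_categories(cluster):
    """Return the small data categories whose keywords appear in a cluster's top phrases."""
    phrases = " | ".join(cluster["top_phrases"]).lower()
    return [cat for cat, kws in small_data_keywords.items()
            if any(kw in phrases for kw in kws)]

for c in clusters:
    print(c["id"], match_categories(c))   # candidate matches, then reviewed manually
```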

In the subsections below, we present findings for all papers identified in the relevant research clusters across three pillars: research progress, national competitiveness, and funding. Through this analysis we examine the current and projected progress of scientific research on these methods, which countries are leading, and the main sources of funding for this research.

Research progress:

In terms of research volume, our five categories of "small data" methods have followed very different trajectories over the past decade. As shown in Figure 2, reinforcement learning and Bayesian methods are the two categories with the largest number of papers. While the number of papers in the Bayesian-method clusters has grown steadily over the past decade, the reinforcement learning clusters only began growing in 2015 and then grew particularly rapidly between 2017 and 2019. This may be because deep reinforcement learning faced technical challenges until around 2015, when it achieved breakthrough progress. In contrast, the number of papers published annually in the artificial data generation and data labeling clusters has remained fairly low over the past decade. Finally, the transfer learning category started small in 2010 but had grown substantially by 2020.


Figure 2. Trends in small data publications, 2010–2020

Of course, the sheer number of publications says nothing about paper quality. We consider two metrics to assess the quality of the papers in each category's clusters: the h-index and age-corrected citations. The h-index is a commonly used metric that captures the publication activity and total citation impact of a collection of papers—in our case, the group of papers attributed to each category. One limitation of the h-index, however, is that it does not account for the age of papers (older papers have had more time to accumulate citations). The h-index therefore undervalues groups whose most influential papers are newer and have not yet had time to accumulate citations. To adjust for this, Figure 3 also depicts age-corrected citations. The figure shows that on the h-index alone, reinforcement learning and Bayesian methods are roughly equal, but after accounting for paper age, reinforcement learning comes out on top. This suggests that, for the research clusters we identified, the cumulative impact of Bayesian methods appears higher, while reinforcement learning stands out for its relatively recent surge in paper output and citation impact.
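For readers unfamiliar with the metric, the following minimal sketch shows how an h-index can be computed from per-paper citation counts; CSET's age-corrected citation measure is a separate calculation not reproduced here:

```python
# A minimal sketch of computing the h-index for a set of papers from their
# citation counts (illustrative only).
def h_index(citation_counts):
    """Largest h such that at least h papers have at least h citations each."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3, 2]))  # -> 4
```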


Figure 3. H-index and age-corrected citations by category, 2010–2020

However, it would be wrong to conclude that reinforcement learning has grown the most over the past decade. Looking more closely at the growth of each category over time, Figure 4 shows that transfer learning saw the most consistent growth between 2011 and 2020, posting the highest growth rate in all but two years. The figure also shows the growth that artificial data generation has experienced over the past five years, which is less evident in Figure 2 because of the low total number of papers in this category. However, this category also saw the largest decline in growth between 2012 and 2015, making it difficult to draw firm conclusions about its growth trajectory.


Figure 4. Year-on-year growth by category, 2011–2020

Figure 5 compares three-year growth forecasts for each category based on a forecasting model developed by CSET, with "AI overall" papers included as an additional category to serve as a benchmark. As the figure shows, transfer learning is the only category projected to grow faster than AI research overall, far outpacing all other categories—consistent with its steady growth in previous years.


Figure 5. Growth forecast for 2023 by category

Note: The future growth index is calculated based on CSET's forecast of research cluster growth. For more details on the methodology, see Appendix A.

National Competitiveness:

In this section, we explore country competitiveness in small data methods by looking at the research output of the top 10 countries in these methods. We use simple measures such as the number of papers published and age-corrected citations to get an initial picture of countries' relative standing in each category, but we encourage readers to explore other indicators to fully understand a country's potential in small data methods.

Table 1 shows the total number of papers published by category in the top 10 countries for small data publications. Consistent with the overall results for AI research, China and the US are the top two producers of papers in the clusters we identified containing small data-related research, followed by the UK. China leads in the volume of academic papers on data labeling and transfer learning methods, while the US leads in Bayesian methods, reinforcement learning, and artificial data generation.


Table 1. Number of publications by category for top 10 countries globally

It is worth noting that, with the exception of the United States itself and China, all of the top 10 countries in small data research are U.S. allies or partners, and countries such as Russia are notably absent from the list. However, the trends here may also reflect the fact that we count papers with authors from multiple countries once for each country, so papers co-authored by researchers from the United States and its allies show up in higher counts for each of them through this double counting. Our analysis of the co-authorship of these papers supports this assessment.

Paper citations are often used as a measure of research quality and impact, and our findings suggest that China's large volume of research may not be of uniformly high quality across the small data categories. As shown in Table 2, when looking at age-corrected citations (roughly, citations per year), China ranks below the United States in all methods. China ranks second in age-corrected citations across all small data categories except Bayesian methods, where it slips to seventh. This suggests that although China has published a large number of papers on Bayesian methods, the quality and impact of its research in this area lag the most relative to the other methods. The United States leads the world in age-corrected citations across all methods.

Table 2. Number of age-corrected citations by category for top 10 countries globally

Figure 6 shows the three-year growth forecast by country. The most notable finding here is how much higher growth in transfer learning methods is expected to be in China relative to the US and the rest of the world. If accurate, this prediction would mean that China could move further ahead in transfer learning, at least in terms of the number of papers published.

Figure 6. Growth forecast for 2023 by category for the United States, China, and the rest of the world (ROW)

Note: The future growth index is calculated based on CSET's forecast of cluster growth. See Appendix A for more details on the methodology.

Funding:

We analyzed the available funding data for small data methods to estimate the types of entities that funded the papers in the research clusters we identified. An important caveat is that we have funding information for only about 20–30 percent of papers, although we have no reason to believe there is a systematic difference between papers with and without funding data.

Across disciplines, whether the funders are governments, corporations, academic institutions, or nonprofit organizations, government actors are usually the largest funders of research, and authors are most often affiliated with academia. With this in mind, we compared the funding of small data research with that of AI research in general to see which types of funders account for a larger share in these clusters than in the overall AI field. As shown in Figure 7, across all five categories, the share of government funding is disproportionately high compared with the overall funding allocation for AI research. We also observe that nonprofits account for a smaller share of funding in small data research than they typically do in other areas of AI. The funding pattern for Bayesian methods most closely resembles the overall pattern for AI.


Figure 7. Funding sources for small data approaches relative to AI overall

Figure 8 further breaks down government funding by country. Our results show that, despite the overall trend of government funding being overrepresented in small data, the U.S. government's share of funding for small data research is lower than its share for AI research overall. Conversely, private sector companies tend to provide a larger share of funding for small data research in the United States than for AI research as a whole (see Figure 9 in Appendix B for details).


Figure 8. Government funding for small data approaches relative to AI overall, by China, the United States, and the rest of the world (ROW)

This trend is almost reversed when we look at the rest of the world, where government agencies provide a much higher share of funding for small data research, especially compared with the private sector. Interestingly, nonprofit organizations in the rest of the world, such as research trusts and foundations, are less inclined to fund small data papers than AI papers in general (see Appendix B, Figure 10 for details).

In China, with the exception of artificial data generation, the share of government funding for small data methods is generally smaller than for AI overall, although the difference is not as large as in the United States.

Key takeaways:

This article introduces and outlines a range of "small data" approaches to artificial intelligence. Finally, based on our findings, we make the following main points:

AI is not synonymous with big data, and especially not with large pre-labeled datasets. The role of big data in the AI boom of the past decade is undeniable, but treating large-scale data collection and labeling as a prerequisite for AI progress would lead policymakers astray. There are various approaches to choose from, suited to different situations: if the problem at hand is data-scarce but related problems are data-rich, transfer learning may be useful; if the problem can be framed as an environment in which an agent learns by trial and error rather than from pre-collected data, reinforcement learning may be appropriate; and so on.

Research on transfer learning is growing especially rapidly—even faster than the larger and better-known field of reinforcement learning. The implication is that this method may work better and be more widely used in the future than it does now. Therefore, if policymakers are faced with a lack of data for a problem of interest, it would be helpful to seek to identify relevant datasets that might serve as a starting point for transfer learning-based approaches.

According to our cluster-based research approach, the United States and China are highly competitive in small data approaches, and they are the top two countries (by number of research papers) in all five categories we consider. While the U.S. has a large lead in the two largest categories (reinforcement learning and Bayesian methods), China has a smaller but growing lead in transfer learning (the fastest-growing category).

For the time being, transfer learning may be a promising target for more U.S. government funding. U.S. government funding represents a relatively small share of funding for small data approaches relative to its investment pattern in the AI field as a whole. This may be because the U.S. government does not prioritize research in these areas, or because U.S. private sector actors tend to allocate a higher proportion of funding to these methods. Either way, given that transfer learning is a rapidly emerging field, it may represent a promising opportunity for increased funding from U.S. government sources.
