Thoughts on neural networks and data sets: The bigger the data set, the better the performance?

Generally speaking, the relationship between neural network performance and big data is complicated. The influencing factors include model size, dataset size, and computing performance, as well as practical constraints such as manpower and time. The following is a summary of existing work on the topic:

1. Data volume VS network performance 

1. Overview

In "Revisiting Unreasonable Effectiveness of Data in Deep Learning Era", Sun et al. attributed the major success of computer vision technology in the past 10 years to: 1> more complex models, 2> the improvement of computing performance [ Reference 1 , Reference 2 ], 3> The emergence of large-scale labeled data sets. 

Regarding the first factor, model complexity has visibly grown year by year: from the 8-layer AlexNet in 2012 to the 101-layer ResNet in 2015, and on to Transformer architectures with far larger parameter counts.

For the second factor, see references 1 and 2. In general, researchers have found that simply upgrading to each year's newest GPU can bring a larger performance gain than updating the model itself; the extra computing power also makes model inference faster and more efficient.

Regarding the third factor, we all know that deep learning is data-driven. So if the training set is expanded 10x or 100x, does accuracy double, or is there a bottleneck? The rest of this section focuses on this question.

2. Research objectives

The paper points out that in recent years model size and GPU performance have kept growing while dataset size has stayed essentially unchanged, so the authors built a dataset of 300 million images for experimental verification. Their research goals were:

1) Using current algorithms, whether visual representations can still be improved if more and more images with noisy labels are provided;

2) For standard vision tasks such as classification, object detection, and image segmentation, what is the relationship between data and performance;

3) Utilize large-scale learning techniques to develop state-of-the-art models for a variety of computer vision tasks.

3. Data construction

The first problem is how to construct such a dataset. Fortunately, Google has long been building datasets of this kind internally to optimize its computer vision algorithms. Through the efforts of Geoff Hinton, Francois Chollet, and others, Google built an internal dataset containing 300 million images labeled into 18,291 categories, named JFT-300M (not open source).

The dataset's labels were produced by an algorithm that blends raw web signals, connections between web pages, and user feedback. Through this method, the 300 million images received more than 1 billion labels (one image can carry multiple labels). Of these 1 billion labels, approximately 375 million were selected algorithmically to maximize label precision for the selected images. Even so, the labels remain noisy: roughly 20% of the selected labels are wrong, i.e., on the order of 75 million erroneous labels. Simply put, the larger the dataset, the more noise it contains, and the harder the model is to train.

4. Core Experimental Results

The authors report the following experimental results:

* Better representation learning can help.

The first observation we make is that large-scale data facilitates representation learning that improves performance on all the vision tasks we studied. Our findings demonstrate the importance of building large-scale datasets for pre-training, and suggest that unsupervised and semi-supervised representation learning methods have good prospects. It appears that the sheer scale of the data can overpower the noise present in the labels.

* As the magnitude of training data increases, task performance increases logarithmically.

Perhaps the most surprising finding is the relationship between vision-task performance and the logarithm of the training-data volume: performance grows linearly in the log of the data size. Even at 300 million training images we observed no plateau in the improvement; a sketch of this log-linear trend follows.
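To make the log-linear claim concrete, here is a minimal, self-contained sketch that fits performance = a * log10(N) + b with NumPy. The data points are invented for illustration only; they are not the paper's actual numbers.

```python
import numpy as np

# Hypothetical (dataset size, score) pairs for illustration only --
# NOT the measured JFT-300M results from the paper.
sizes = np.array([1e6, 1e7, 1e8, 3e8])       # number of training images
scores = np.array([31.2, 33.9, 36.5, 37.4])  # e.g. COCO detection score

# Fit score = a * log10(N) + b; a straight line in the log domain
# is exactly the "performance grows logarithmically" claim.
a, b = np.polyfit(np.log10(sizes), scores, deg=1)
print(f"score ≈ {a:.2f} * log10(N) + {b:.2f}")

# Extrapolate (with the usual caveats) to 1 billion images.
print(f"predicted score at N=1e9: {a * 9 + b:.1f}")
```

If the relationship truly stays linear in log(N), such a fit would predict continued gains well past 300M images, which is exactly the open question the paper raises.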

* Model capacity is critical.

We observed that if we want to fully utilize the 300 million image dataset, we need larger (deeper) models.

For example, with ResNet-50 the gain in COCO object-detection score is limited, only 1.87 points, while with ResNet-152 the gain reaches 3 points.

* Long tail training.

Our data has a very long tail, yet representation learning still works. The long tail does not seem to adversely affect the stochastic training of ConvNets (training still converges).

* New state-of-the-art results.

Our paper uses JFT-300M to train models, and many of the resulting scores reach the state of the art. For example, a single model now achieves 37.4 AP on COCO object detection, up from the previous 34.3 AP.

It should be pointed out that the training infrastructure, learning schedule, and hyperparameters we used are based on prior experience training ConvNets on the 1M images of ImageNet.

Since we did not search for optimal hyperparameters in this work (which would require considerable computation), it is possible that we have not yet achieved the best results this dataset allows. The reported quantitative gains may therefore underestimate the true impact of the data.

This work did not focus on task-specific data, e.g., studying whether more bounding boxes affect model performance. We believe that, despite the challenges, obtaining large-scale datasets for specific tasks should be a focus of future research.

Furthermore, building a dataset containing 300M images is not the ultimate goal. We should explore whether the model can continue to improve with a larger data set (more than 1 billion images).

5. Other experimental results

* Fine-tuning of pre-trained weights is very important.
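The blog does not elaborate on this point, so here is a hedged PyTorch-style sketch of the usual fine-tuning recipe: load pre-trained weights, replace the classification head, and train with a small learning rate. The model choice, class count, and hyperparameters below are illustrative assumptions, not the paper's actual setup (JFT-300M checkpoints are not public).

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet pre-trained backbone (a stand-in for a JFT-300M
# checkpoint, which is not publicly available).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Replace the classification head for the downstream task.
num_classes = 10  # illustrative target-task class count
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Fine-tune the whole network with a small learning rate so the
# pre-trained representations are adjusted gently, not destroyed.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```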

 

2. Pre-training weights VS performance

1. Overview

Google researchers published a paper called "Big Transfer (BiT): General Visual Representation Learning", which explores how to pre-train models effectively at unusually large image-data scales and conducts a systematic, in-depth study of the training process.

To explore the impact of data scale on model performance, they revisited the commonly used pre-training configurations (including activation functions, weight normalization, model width and depth, and training strategies) on three datasets of different scales: ILSVRC-2012 (1.28 million images, 1,000 categories), ImageNet-21k (14 million images, 21,000 categories), and JFT (300 million images, 18,000 categories). Importantly, this let the researchers explore data scales that had rarely been touched before.

2. Research content

* The relationship between data set size and model capacity

The authors trained different ResNet variants, from the standard-size "R50x1" up to 4x-wide versions and the deeper 152-layer "R152x4", on each of the datasets above. The key finding: to take full advantage of big data, model capacity must be increased as well.

In the paper's figure, the left half shows that as the amount of data grows, model capacity must grow with it: the red arrows indicate that small architectures get worse on larger datasets while large architectures improve. The right half shows that pre-training on a larger dataset does not automatically help; training time and computational budget must also be increased to realize the benefit of the extra data.

Training time therefore plays a critical role in model performance. If the computational budget is not scaled up when pre-training on a large dataset, performance drops significantly (the move from red points to blue points in the paper's figure); with an appropriately longer training schedule, the same setup yields significant performance improvements.

* Appropriate normalization (replacing BN) can effectively improve performance

Replacing batch normalization (BN) with group normalization (GN) can effectively improve the performance of pre-trained models on large-scale datasets, mainly for two reasons (a code sketch follows the list):

  • First, BN's state (its running statistics) must be adjusted when transferring from pre-training to the target task, whereas GN is stateless and avoids this difficulty;
  • Second, BN relies on per-batch statistics, which become unreliable when the per-device batch is small, and multi-device training is unavoidable for large models. Since GN does not compute per-batch statistics, it sidesteps this problem as well;
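As a concrete illustration, here is a minimal sketch of swapping BN layers for GN in a PyTorch model. This is not BiT's actual code (BiT additionally combines GN with weight standardization); it only demonstrates the replacement described above.

```python
import torch.nn as nn
from torchvision import models

def bn_to_gn(module: nn.Module, num_groups: int = 32) -> nn.Module:
    """Recursively replace every BatchNorm2d with GroupNorm.

    GroupNorm normalizes over channel groups within each sample, so it
    keeps no running statistics and does not need large batches.
    Assumes channel counts are divisible by num_groups (true for
    ResNet-50, where all BN layers have multiples of 32 channels).
    """
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            gn = nn.GroupNorm(
                num_groups=min(num_groups, child.num_features),
                num_channels=child.num_features,
            )
            setattr(module, name, gn)
        else:
            bn_to_gn(child, num_groups)
    return module

model = bn_to_gn(models.resnet50(weights=None))
```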

* Transfer learning

Following a recipe analogous to how BERT is fine-tuned in NLP, the researchers tuned the BiT model on a series of downstream tasks using only very limited data in the tuning process; the pre-trained model already has a good understanding of visual features.

By data size, ILSVRC < ImageNet-21k < JFT-300M. When performing few-shot transfer learning with BiT, the researchers found that as the pre-training data volume and architecture capacity increase, the transferred model's performance also rises significantly. When model capacity is increased but pre-training stays on the smaller ILSVRC dataset, the gains when transferring to CIFAR are small in both the 1-shot and 5-shot settings (the green line in the paper's figure). With pre-training on the large-scale JFT dataset, increasing model capacity brings significant gains (the red-brown line). BiT-L reaches 64% and 95% accuracy in the 1-shot and 5-shot settings, respectively.
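To make "few-shot transfer" concrete, here is a hedged sketch of one common evaluation style: freeze a pre-trained backbone and train only a linear classifier on k examples per class. This is illustrative only; BiT's actual protocol fine-tunes the whole network using its BiT-HyperRule, and the backbone below is an ImageNet stand-in.

```python
import torch
import torch.nn as nn
from torchvision import models

# Frozen pre-trained backbone as a feature extractor (illustrative:
# an ImageNet ResNet-50 stands in for a BiT checkpoint).
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = nn.Identity()          # expose the 2048-d features
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

num_classes, k = 10, 5               # 5-shot: k labeled images per class
head = nn.Linear(2048, num_classes)  # only this linear probe is trained
optimizer = torch.optim.Adam(head.parameters(), lr=1e-2)

def few_shot_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    with torch.no_grad():            # backbone stays frozen
        feats = backbone(images)
    loss = nn.functional.cross_entropy(head(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```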

3. Conclusion

This study found that with large-scale, general-purpose pre-training, a simple transfer strategy can achieve impressive results: whether the downstream task has big data, few samples, or even a single sample per class, a large-scale pre-trained model brings significant performance improvements. BiT pre-trained models give vision researchers a new alternative to ImageNet pre-trained weights.

3. Two excellent answers from Zhihu

1. Angle 1

It is recommended to read Section 3.2 of the classic textbook PRML (of course, to understand 3.2 you must first read Chapter 1), which explains in detail what happens as the amount of data increases. The core conclusion is as follows:

When the amount of data is fixed, there is a tradeoff between bias and variance: when one decreases, the other increases. As the amount of data grows, the sum of these two terms can be driven down further, but the noise term can never be eliminated.
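In the standard notation (the textbook decomposition of the expected squared error, reproduced from the well-known result rather than quoted from PRML), this reads:

$$\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}$$

More data shrinks the variance term (and permits richer models, which shrinks the bias term), but the irreducible noise $\sigma^2$ remains no matter how much data is collected.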

Therefore, this problem has the following simple and preliminary conclusions:

  • If the amount of data is unlimited and accurately annotated, then in theory a machine learning model can fit the perfect function, provided the model has sufficient capacity. Incidentally, infinitely accurate data is possible in some scenarios, such as data generated by a virtual engine: although a model cannot consume unlimited data in finite time, the amount of available data grows without bound as training proceeds.
  • If the amount of data is unlimited but not accurately labeled, the accuracy of the final model will be limited by the label noise.
  • If the amount of data is limited, the model must trade bias against variance: to stay stable across a wide range of test data (small variance), the model generally sacrifices prediction accuracy (larger bias); to perform well on some subset of the test data (small bias), it must sacrifice accuracy on other possible test data.

2. Angle 2

In particular, when you have enough training data, the generalization error is likely to be very small. This can be concluded from classic machine learning theory:

If $\mathcal{H}$ is a finite hypothesis space and $0 < \delta < 1$, then for any $h \in \mathcal{H}$:

$$P\left(|E(h)-\widehat{E}(h)| \leqslant \sqrt{\frac{\ln |\mathcal{H}|+\ln (2 / \delta)}{2 m}}\right) \geqslant 1-\delta$$

Here $m$ is the number of training samples. As $m$ tends to infinity, i.e., with enough training samples, the empirical error $\widehat{E}(h)$ of the learned classifier converges to the generalization error $E(h)$ of the ideal classifier. The formula says that the probability that the gap between the two stays below the (small) square-root term is at least $1-\delta$; in other words, this event happens with very high probability.
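As a quick numeric illustration (my own worked example, not part of the original answer), plugging concrete values into the bound shows how the gap shrinks as $m$ grows:

```python
import math

def generalization_gap_bound(h_size: int, delta: float, m: int) -> float:
    """Upper bound on |E(h) - E_hat(h)| holding with prob. >= 1 - delta."""
    return math.sqrt((math.log(h_size) + math.log(2 / delta)) / (2 * m))

# A finite hypothesis space of one million hypotheses, 95% confidence.
for m in (1_000, 100_000, 10_000_000):
    gap = generalization_gap_bound(h_size=10**6, delta=0.05, m=m)
    print(f"m = {m:>10,}: gap <= {gap:.4f}")
# Output decays like 1/sqrt(m): roughly 0.094, 0.009, 0.0009 --
# more data, smaller gap between training and generalization error.
```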

Again, would such a network overfit? Overfitting is an important issue in machine learning, closely related to the network structure, the training method, the difficulty of the data, and so on. For example, if you have a lot of data but the samples are almost all alike (high similarity between samples), the network will actually underfit, because real-world data is more complex than your training set. At the other extreme, if your data is not only plentiful but also highly varied, that can of course lead to overfitting.

Finally, when the amount of data is huge, the network may become saturated. But as with the previous question, it still depends on the quality of your data. In the extreme, if one network is asked to fit all the known data in the world, it will of course saturate.

Note: the theory above assumes the hypothesis space is finite. We cannot be sure the hypothesis space really is finite when training data is abundant, so strictly speaking the question remains open. After all, machine learning is a data-driven science.

4. Personal thoughts

Key factors: data volume, pre-trained model weights, data quality, and neural network capacity.

More data generally means better predictions, but when the training set is large, a network with too few layers and insufficient feature-extraction capacity will underfit. So "the bigger the dataset, the better the result" holds only if the network's feature-extraction ability is not too weak (the network-capacity problem).

The larger the dataset, the higher the demands on data quality: it must not contain too much noise or too many near-duplicate samples, both of which harm the network's learning.

Weights pre-trained on a large dataset transfer well to other datasets.

References:

https://blog.csdn.net/emprere/article/details/98858910

https://zhuanlan.zhihu.com/p/144254628

https://www.zhihu.com/question/525413729/answer/2419093179
