Struggling to start academic research? 27 machine learning pitfalls to avoid, so your papers can be published without detours

Content overview: If you are new to machine learning and plan to carry out academic research in this field, don't miss this guide to avoiding pitfalls, tailored for you.

Keywords: machine learning, research practice, academic research

As a newcomer to academic machine learning, how do you avoid pitfalls gracefully and get your papers published smoothly?

Associate Professor Michael A. Lones of the School of Mathematical and Computer Sciences at Heriot-Watt University, Scotland, published a paper in 2021 – "How to avoid machine learning pitfalls: a guide for academic researchers" – which discusses exactly this question in detail.

Michael A. Lones' main research interests include optimization, machine learning, data science, complex systems, and non-standard computing, with applications in biology, medicine, robotics, and security.

In the paper, the author draws on his own research and teaching experience to collect, from an academic-research perspective, the problems that most frequently arise and demand special attention across the five stages of the machine learning workflow, and puts forward corresponding solutions.

Intended audience:

Students and scholars who are relatively new to ML and have only basic ML knowledge

Warm reminder: this article focuses on issues of general concern in the academic community, such as how to rigorously evaluate and compare models so that papers can be published smoothly.

Next, we follow the complete process of training an ML model and describe it stage by stage.

Phase 1: Before creating the model

Many students are eager to train and evaluate a model right away, often neglecting the more important "homework" first. That homework includes:

  • What is the goal of the project?
  • What kind of data is needed to achieve this goal?
  • Will the data have limitations, and if so, how can they be addressed?
  • How has research in this area progressed, and what has already been done?

If this preparatory work is not done well and you rush straight into running models, it is very likely that the model will not support the intended conclusions, and the research will not be publishable.

1.1 Understand and analyze data

Reliable data sources, scientific collection methods, and high-quality data greatly benefit the publication of papers. Note that the most widely used datasets are not necessarily of good quality; they may simply be easy to obtain. Before settling on a dataset, carry out some exploratory data analysis to uncover and work around its limitations.
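As a hedged illustration, a few lines of pandas are often enough to surface basic limitations such as missing values, duplicates, and class imbalance; the file name and the "label" column below are hypothetical, not from the paper.

```python
# Minimal exploratory-data-analysis sketch (assumes pandas is installed).
import pandas as pd

df = pd.read_csv("my_dataset.csv")               # hypothetical data file

print(df.shape)                                  # how much data is there?
print(df.describe(include="all"))                # ranges, means, obvious outliers
print(df.isna().sum())                           # missing values per column
print(df.duplicated().sum())                     # duplicate rows
print(df["label"].value_counts(normalize=True))  # class balance ("label" is assumed)
```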

1.2 Don't look at all the data: set the test data aside before starting

Leakage of information from the test set into the training process is a common reason why machine learning models fail to generalize. To avoid this, do not look at the test data too closely during exploratory analysis, so that you avoid intentionally or unintentionally making untestable assumptions that limit the generality of the model.

Note: It is okay to make assumptions, but these assumptions should only be incorporated into the training of the model, not the testing.
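A minimal sketch of this advice, assuming scikit-learn and an already-loaded feature matrix `X` and label vector `y`:

```python
# Split off the test set first; all exploratory analysis then uses the
# training portion only.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # stratify keeps class ratios
)
# Explore X_train / y_train from here on; X_test / y_test stay untouched
# until the final evaluation.
```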

1.3 Prepare sufficient data

Insufficient data can reduce a model's ability to generalize; how much data is enough depends on the signal-to-noise ratio (SNR) of the dataset. Insufficient data is a common problem in machine learning research; in that situation, techniques such as cross-validation and data augmentation can be used to make the most of the data you have.

1.4 Actively seek advice from experts in the field

Experts in the field have rich research experience: they can help identify the problem worth solving, the most suitable feature set and machine learning model, and advise on where to publish the results, saving a great deal of wasted effort.

1.5 Do your literature research

Academic progress is an iterative process, with each study providing information that can guide the next. Ignoring previous research risks missing out on valuable information. Rather than racking your brains at writing-up time to explain why you studied an already-studied topic without building on existing results, do a literature review before starting work.

1.6 Think ahead about model deployment

If the ultimate goal of the research is to create a machine learning model that can be deployed in the real world, deployment issues need to be considered early: the impact of environmental constraints on model complexity, whether there are time limits, how the model will integrate with the wider software system, and so on.

Phase 2: Creating models reliably

It is very important to create models in an organized way: this lets us use the data correctly and give due consideration to the choice of model.

2.1 Keep test data out of the model training process

Once test data participates in the configuration, training, or selection of a model, the reliability and generality of the evaluation are greatly compromised. This is a common reason why published machine learning models often fail on real-world data.

❎ Error examples (take care to avoid these):

  • During data preparation, scaling variables using the mean and range computed over the entire dataset (the correct way is to compute these statistics on the training data only)

  • Performing feature selection before splitting the data

  • Assessing the generalizability of multiple models with the same test data

  • Applying data augmentation before splitting off the test data

To avoid these problems, the best approach is to partition off a subset of the data at the very start of the project, and use this independent test set only once, to test the generality of a single model at the end of the project.
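A sketch of the first error example above, assuming scikit-learn and the train/test split from earlier; a Pipeline makes this kind of leakage hard to commit by accident:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Wrong: the scaler sees test rows, so their mean/variance leak into training.
# scaler = StandardScaler().fit(X)   # X includes the test set - leakage!

# Right: preprocessing statistics come from the training data only.
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)          # scaler is fitted on X_train alone
score = model.score(X_test, y_test)  # test set used once, at the very end
```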

Warm reminder: take special care when handling time series data, where randomly splitting the data can easily cause leakage and overfitting.

2.2 Try several different models

There is no universally best machine learning model; the research task is to find the model best suited to a specific problem. With modern machine learning libraries in languages such as Python, R, and Julia, you can try many models and find the most effective one with only small changes to the code (see the sketch after the tips below).

Kind tips:

  • Don't use unsuitable models, and use a validation set rather than the test set to evaluate them
  • When comparing models, optimize each model's hyperparameters, perform multiple evaluations, and correct for multiple comparisons when publishing the results.
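A hedged sketch of trying several model families, assuming scikit-learn and a held-out validation split (`X_val`, `y_val`); the candidate models are illustrative:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_val, y_val))  # validation set, not the test set
```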

2.3 Don't use inappropriate models

Modern machine learning libraries lower the threshold for implementing machine learning, but they also make it easy to choose inappropriate models, such as applying a model intended for categorical features to a dataset of numerical features, or using a regression model where a classification model is called for. When choosing a model, pick the one that fits your use case as closely as possible.

2.4 Deep learning is sometimes not the optimal solution

Although deep neural networks (DNNs) perform well on some tasks, that does not make them suitable for every problem. In particular, when data is limited, the underlying pattern is fairly simple, or the model needs to be interpretable, a DNN may perform worse than old-fashioned machine learning models such as random forests or SVMs.

2.5 Optimize the hyperparameters of the model

Hyperparameters have a huge impact on a model's performance and often need to be tuned to the specific dataset. Trying values aimlessly is unlikely to find good hyperparameters; it is better to use hyperparameter optimization strategies such as random search and grid search.

Reminder: these basic strategies scale poorly to models with many hyperparameters or high training costs; in that case, consider AutoML techniques and data-mining pipelines to optimize the choice of model and its hyperparameters.
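A sketch of random search with scikit-learn; the model and parameter ranges are illustrative, not recommendations from the paper:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 5, 10, 20],
    "min_samples_leaf": [1, 2, 5],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=20,        # number of random configurations to try
    cv=5,             # inner cross-validation on the training data
    random_state=0,
)
search.fit(X_train, y_train)  # tune on the training data only
print(search.best_params_, search.best_score_)
```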

2.6 Additional care is required when optimizing hyperparameters and selecting features

Hyperparameter optimization and feature selection are part of model training. Do not perform feature selection on the entire dataset before training begins; that leaks information from the test set into the training process. A common technique for optimizing a model's hyperparameters or features using exactly the same data used to train it is nested cross-validation (also known as double cross-validation).
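A minimal nested cross-validation sketch, assuming scikit-learn; the inner loop tunes hyperparameters while the outer loop estimates performance, so no fold is used for both at once:

```python
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3)         # inner loop: tuning
outer_scores = cross_val_score(inner, X_train, y_train, cv=5)  # outer loop: evaluation
print(outer_scores.mean(), outer_scores.std())
```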

Phase 3: Evaluating models robustly

Unsound model evaluation is very common and can hold back academic progress. Careful thought therefore needs to be given to how data is used in experiments, how the true performance of a model is measured, and how that performance is reported.

3.1 Using an appropriate test set

Use a test set to measure the generalizability of the machine learning model, and make sure the test data is appropriate: the test set should not overlap with the training set, and it needs to cover a sufficiently wide range of conditions. For example, with a photographic dataset of objects, if both the training and test images were collected outdoors on sunny days, the test set is not independent, because it fails to capture wider weather conditions.

3.2 Do not perform data augmentation before splitting the data

Data augmentation helps balance a dataset and improves the generality and robustness of a machine learning model. Note, however, that augmentation should be applied only to the training set, never to the test data, to avoid leakage and over-optimistic results.
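As one concrete example (oversampling with SMOTE from the third-party imbalanced-learn package, an assumption rather than a method named in the paper), augmentation happens strictly after the split:

```python
from imblearn.over_sampling import SMOTE

# Resample the training data only; no synthetic points derive from test samples.
X_train_aug, y_train_aug = SMOTE(random_state=0).fit_resample(X_train, y_train)
# X_test / y_test are left exactly as collected.
```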

3.3 Using the validation set

Measure model performance during development with a separate validation set: a set of samples that is not used directly for training, but is used to guide it. Another benefit of a validation set is that it enables early stopping.
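A sketch of carving a validation set out of the training data (the test split from earlier stays untouched); the names and the 0.25 fraction are illustrative:

```python
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0
)
# Some estimators manage such a held-out fraction internally for early
# stopping, e.g. scikit-learn's MLPClassifier(early_stopping=True,
# validation_fraction=0.1).
```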

3.4 Evaluating the model multiple times

A single evaluation of a model is unreliable: it may underestimate or overestimate the model's true performance. Evaluate the model multiple times instead, typically by training it repeatedly on different subsets of the training data. Cross-validation is a particularly popular approach with many variants, such as ten-fold cross-validation.

Reminder: when reporting the mean and standard deviation over multiple evaluations, it is advisable to keep the individual scores too, for later comparison of models using statistical tests.
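A sketch of ten-fold cross-validation with scikit-learn, keeping the per-fold scores as the reminder above suggests; the model is illustrative:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

scores = cross_val_score(RandomForestClassifier(random_state=0),
                         X_train, y_train, cv=10)
print(f"mean={scores.mean():.3f} std={scores.std():.3f}")
print("per-fold scores:", scores)  # keep these for statistical tests later
```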

3.5 Keep some data to evaluate the final model instance

Perhaps the most reliable way to assess the generality of a final model instance is to use yet another test set. So, if the amount of data is large enough, it is best to hold some back and use it for an unbiased evaluation of the finally selected model instance.

3.6 Do not use accuracy for unbalanced datasets

Choose the metrics for evaluating machine learning models carefully. For example, the most commonly used metric for classification models is accuracy, which works well if the dataset is balanced (each class is represented by a similar number of samples), but can be very misleading on an imbalanced dataset.

In that case, it is better to use metrics that are insensitive to class imbalance, such as the F1 score, Cohen's kappa coefficient (κ), or the Matthews correlation coefficient (MCC).
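All of these metrics are available in scikit-learn; a sketch assuming a fitted binary classifier `model` and the earlier test split:

```python
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             f1_score, matthews_corrcoef)

y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))  # misleading if imbalanced
print("F1:", f1_score(y_test, y_pred))              # binary labels assumed here
print("Cohen's kappa:", cohen_kappa_score(y_test, y_pred))
print("MCC:", matthews_corrcoef(y_test, y_pred))
```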

Phase 4: Comparing models fairly

Comparing models is fundamental to academic research, but comparing them unfairly, and publishing the result, can mislead other researchers. Make sure different models are evaluated under the same conditions and that statistical tests are used correctly.

4.1 A higher number does not mean a better model

Statements like "the accuracy in previous studies was 94%, and the accuracy of this model is as high as 95%, so it is better" often appear in papers. There are many reasons why a higher number does not equate to a better model. If the models were trained or evaluated on different partitions of the same dataset, performance differences may be small; if completely different datasets were used, the differences can be huge. Performing different amounts of hyperparameter optimization will also produce differences in model performance.

Therefore, to scientifically compare the performance of two models, the models should be optimized to the same degree, evaluated multiple times, and then statistically tested to determine whether the performance difference is significant.

4.2 Comparing models with statistical tests

Statistical tests are recommended for comparing the performance of two models. Broadly speaking, tests for comparing machine learning models fall into two categories: the first compares specific model instances, for example using McNemar's test to compare two trained decision trees; the second covers more general comparisons, for example using the Mann-Whitney U test to decide whether decision trees or neural networks are better suited to a problem.
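A sketch of both kinds of test, assuming scipy and statsmodels; all counts and scores below are illustrative placeholders, not real results:

```python
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.contingency_tables import mcnemar

# McNemar's test: 2x2 table of (model A correct?, model B correct?) counts
# on the same test set.
table = np.array([[50, 6],   # placeholder counts
                  [2, 42]])
print(mcnemar(table, exact=True).pvalue)

# Mann-Whitney U test on per-fold scores of two model *types*.
scores_tree = [0.81, 0.79, 0.84, 0.80, 0.83]  # placeholder scores
scores_nn = [0.85, 0.82, 0.86, 0.84, 0.88]
print(mannwhitneyu(scores_tree, scores_nn).pvalue)
```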

4.3 Correcting for multiple comparisons

Comparing more than two models with statistical tests is more complicated: carrying out multiple pairwise tests is like using the test set multiple times and can lead to overly optimistic interpretations of significance.

A multiple-test correction, such as the Bonferroni correction, is recommended to address this.
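A sketch of applying the correction with statsmodels; the p-values are placeholders:

```python
from statsmodels.stats.multitest import multipletests

pvals = [0.01, 0.04, 0.03]  # placeholder p-values from three pairwise tests
reject, p_adjusted, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
print(reject, p_adjusted)   # Bonferroni multiplies each p by the number of tests
```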

4.4 Don't put too much faith in the results of community benchmarks

For problems in certain fields, many people evaluate new machine learning models on benchmark datasets: because everyone trains and tests on the same data, comparisons become more direct. However, this approach has some major drawbacks.

First, if access to the test set is unrestricted, there is no guarantee that others have not used it as part of their training process, which can lead to overly optimistic results. Moreover, even if each person uses the test set only once, the community as a whole uses it many times, which can lead to models that overfit it. For these reasons, interpret results on benchmark datasets cautiously and judge performance improvements soberly.

Phase 5: Reporting Results

Scholarly research needs to contribute to knowledge, and that requires reporting the full picture of the work, including what succeeded and what failed. Machine learning often involves trade-offs, and it is rare for one model to be better than another in every respect; reporting should reflect this.

5.1 Reporting needs to be transparent

Sharing all of your research work transparently makes it easier for others to replicate the experiments and compare models. Documenting experiments clearly and writing clean code benefits both yourself and others. The machine learning community pays increasing attention to reproducibility, and a poorly documented workflow may jeopardize subsequent publication.

5.2 Report performance in multiple ways

A more rigorous approach when evaluating model performance is to use multiple datasets, which helps overcome any pitfalls associated with a single dataset and gives a more comprehensive picture of model performance. It is also good practice to report multiple metrics per dataset, since different metrics can present different results, increasing the transparency of the work.

5.3 Don't generalize beyond the data

Don't draw invalid conclusions that lead other researchers astray. A common mistake is publishing general conclusions that are not supported by the data used to train and evaluate the models: a model performing well on one dataset does not mean it will perform well on others. While multiple datasets yield more reliable insights, there are always limits to what an experiment can establish. Don't exaggerate your findings; be aware of their limitations.

5.4 Report significant differences with caution

The statistical tests discussed above help examine differences between models. However, statistical tests are imperfect: they may underestimate or overestimate significance, producing false positives or false negatives. In addition, a growing number of statisticians advocate abandoning confidence thresholds and instead reporting the p-value directly when judging significance.

Beyond statistical significance, another consideration is whether the difference between two models is practically meaningful: with enough samples, a significant difference can always be found, even when the actual performance gap is negligible. To judge importance, measure the effect size, for example with Cohen's d (the most common choice) or the Kolmogorov-Smirnov statistic (often preferable).
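A sketch of both effect-size measures, with Cohen's d hand-rolled (pooled standard deviation) and the Kolmogorov-Smirnov statistic from scipy; the scores are placeholders:

```python
import numpy as np
from scipy.stats import ks_2samp

def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled_std = np.sqrt(((len(a) - 1) * a.std(ddof=1) ** 2 +
                          (len(b) - 1) * b.std(ddof=1) ** 2) /
                         (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled_std

scores_a = [0.81, 0.79, 0.84, 0.80, 0.83]  # placeholder per-fold scores
scores_b = [0.85, 0.82, 0.86, 0.84, 0.88]
print("Cohen's d:", cohens_d(scores_a, scores_b))
print("KS statistic:", ks_2samp(scores_a, scores_b).statistic)
```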

5.5 Pay attention to how the model works

A trained model contains a great deal of useful information, yet many authors report only its performance metrics without explaining how it works. The purpose of research is not to squeeze out slightly higher accuracy than everyone else, but to generate and share knowledge with the research community, which also increases the chance of the work being published. For simple models such as decision trees, provide a visualization; for complex models such as deep neural networks, consider using explainable AI (XAI) techniques to extract relevant information.
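For the simple-model case, scikit-learn can print a decision tree's learned rules directly; a minimal sketch using the earlier training split:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(export_text(tree))  # human-readable if/else rules learned by the model
# For complex models, post-hoc XAI tools (e.g. feature importances, or the
# third-party SHAP package) can serve a similar explanatory role.
```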

That is the complete "Guide to Avoiding Pitfalls". May every student new to machine learning keep it close, read it often, find a research direction, choose a topic, and publish as soon as possible!

Looking forward to your good news~

Reference: How to avoid machine learning pitfalls: a guide for academic researchers
