How to split medical image datasets properly

  In medical image classification, how you split the dataset is an important question. The answer depends on the characteristics of your data and the goals of your experiment. Generally speaking, there are two common ways to split the data: by proportion and by case.

Splitting by proportion

  Splitting by proportion is a common method: pool all the data, shuffle it, and randomly divide it into a training set, a validation set and a test set according to a fixed ratio (such as 80%:10%:10% or 70%:15%:15%).
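
  The procedure can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `split_by_proportion`, the fixed seed, and the `img_0000.png` file names are assumptions made for the example.

```python
import random

def split_by_proportion(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle items and cut them into train/validation/test lists."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_train = round(len(shuffled) * ratios[0])
    n_val = round(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test

# Hypothetical file names standing in for 1000 images.
images = [f"img_{i:04d}.png" for i in range(1000)]
train, val, test = split_by_proportion(images)
print(len(train), len(val), len(test))
```

  Note that this version treats every image as independent of the others; it knows nothing about which case an image came from.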

Advantages:

  • Representative: Because the split is random, each set (training set, validation set, test set) is likely to contain every type of data, which keeps each set representative of the whole dataset.
  • Simple and intuitive: The method is easy to carry out. You only need to shuffle the data and then cut it according to the ratio.

Disadvantages:

  • Data leakage: If the samples are correlated (for example, different slices of the same case), this approach may lead to data leakage, that is, information from the training set also appears in the validation set or test set. The model is then evaluated on data that closely resembles what it was trained on, so the scores look better than they should and overfitting can go unnoticed.
  • Poor stability: Because the split is random, each split may give different results, which makes the measured performance of the model unstable.
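
  The leakage problem is easy to see on toy data. The sketch below assumes each sample is stored as a `(case_id, slice_index)` pair, which is a layout chosen for this example; the helper name `cases_shared_between` and the 20-case dataset are likewise hypothetical.

```python
import random

def cases_shared_between(split_a, split_b):
    """Return the case IDs that occur in both splits."""
    return {case for case, _ in split_a} & {case for case, _ in split_b}

# Hypothetical data: 20 cases with 30 slices each.
slices = [(f"case_{c:02d}", s) for c in range(20) for s in range(30)]

# Slice-level random split: 80% train, 20% test.
shuffled = list(slices)
random.Random(0).shuffle(shuffled)
n_train = round(0.8 * len(shuffled))
train, test = shuffled[:n_train], shuffled[n_train:]

leaked = cases_shared_between(train, test)
print(f"{len(leaked)} of 20 cases have slices in both train and test")
```

  With this many slices per case, a slice-level shuffle will almost always scatter each case across both sets, which is exactly the situation described above.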

Solutions:

  • Data leakage: To avoid data leakage, group the data of each case together before splitting the dataset, and then split randomly at the case level. This ensures that data from the same case never appears in the training set and in the validation set or test set at the same time.
  • Poor stability: To improve stability, use cross-validation. Cross-validation partitions the data into several smaller subsets (folds). The model is trained and evaluated several times, each time holding out a different fold, and the results are averaged, so the final estimate depends less on any single split.
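
  Both fixes can be combined by building the cross-validation folds at the case level. The following is a rough sketch under the same assumed `(case_id, slice_index)` layout; `case_level_folds` is a hypothetical helper written for this example, and the training step is left as a comment.

```python
import random
from collections import defaultdict

def case_level_folds(samples, k=5, seed=0):
    """Yield (train, val) sample lists for k folds, keeping each case in one fold."""
    by_case = defaultdict(list)
    for sample in samples:
        by_case[sample[0]].append(sample)  # sample[0] is the case ID
    case_ids = sorted(by_case)
    random.Random(seed).shuffle(case_ids)
    folds = [case_ids[i::k] for i in range(k)]  # k disjoint groups of cases
    for held_out in folds:
        held_out_set = set(held_out)
        train = [s for c in case_ids if c not in held_out_set for s in by_case[c]]
        val = [s for c in held_out for s in by_case[c]]
        yield train, val

# Hypothetical data: 20 cases with 30 slices each.
samples = [(f"case_{c:02d}", s) for c in range(20) for s in range(30)]
fold_sizes = []
for train, val in case_level_folds(samples, k=5):
    # Train on `train` and score on `val` here, then average the k scores.
    fold_sizes.append((len(train), len(val)))
print(fold_sizes)
```

  Each case is held out exactly once, so every sample contributes to validation in one fold and to training in the others.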

Splitting by case

  Splitting by case is the other common method. It treats the data of each case as a single unit and assigns whole cases to the training set, validation set and test set according to a fixed ratio.
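
  As a minimal sketch, a case-level split might look like the code below. It assumes samples are `(case_id, slice_index)` pairs; the function name `split_by_case` and the toy dataset are made up for illustration.

```python
import random

def split_by_case(samples, ratios=(0.7, 0.15, 0.15), seed=0):
    """Assign whole cases to train/val/test; samples are (case_id, item) pairs."""
    case_ids = sorted({case_id for case_id, _ in samples})
    random.Random(seed).shuffle(case_ids)
    n_train = round(len(case_ids) * ratios[0])
    n_val = round(len(case_ids) * ratios[1])
    # Decide the destination of every case first...
    assignment = {}
    for i, case_id in enumerate(case_ids):
        if i < n_train:
            assignment[case_id] = "train"
        elif i < n_train + n_val:
            assignment[case_id] = "val"
        else:
            assignment[case_id] = "test"
    # ...then route each sample to the set its case belongs to.
    splits = {"train": [], "val": [], "test": []}
    for sample in samples:
        splits[assignment[sample[0]]].append(sample)
    return splits

# Hypothetical data: 20 cases with 30 slices each.
samples = [(f"case_{c:02d}", s) for c in range(20) for s in range(30)]
splits = split_by_case(samples)
print({name: len(items) for name, items in splits.items()})
```

  The ratio here applies to the number of cases, not the number of images, so the image counts per set can drift from the target ratio when cases differ in size.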

Advantages:

  • Avoids data leakage: Because whole cases are assigned to one set, data from the same case cannot appear in the training set and in the validation set or test set at the same time.
  • Respects data correlations: If the samples are correlated (for example, different slices from the same case), splitting by case keeps the correlated samples together and so accounts for this correlation.

Disadvantages:

  • Poor representativeness: If the differences between cases are large, splitting by case may leave some sets without certain types of data, which makes those sets less representative.
  • Complex operation: You need to track which case each sample belongs to, so the procedure is more involved.

Solutions:

  • Poor representativeness: To improve representativeness, use stratified sampling when splitting the dataset, so that each set contains every type of data.
  • Complex operation: Although splitting by case is more involved, you can simplify the process by writing scripts or using data processing tools.
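
  A small script can handle both points at once. The sketch below stratifies at the case level and assumes each case carries exactly one class label, which may not hold for every dataset; `stratified_case_split` and the benign/malignant cohort are hypothetical names and numbers chosen for the example.

```python
import random
from collections import defaultdict

def stratified_case_split(case_labels, ratios=(0.7, 0.15, 0.15), seed=0):
    """Split case IDs so every label is spread across train/val/test.

    case_labels maps case_id -> class label (assumed: one label per case).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for case_id in sorted(case_labels):
        by_label[case_labels[case_id]].append(case_id)
    splits = {"train": [], "val": [], "test": []}
    for label in sorted(by_label):
        cases = by_label[label]
        rng.shuffle(cases)
        # Apply the same ratio inside each label group.
        n_train = round(len(cases) * ratios[0])
        n_val = round(len(cases) * ratios[1])
        splits["train"] += cases[:n_train]
        splits["val"] += cases[n_train:n_train + n_val]
        splits["test"] += cases[n_train + n_val:]
    return splits

# Hypothetical cohort: 20 malignant cases and 80 benign cases.
case_labels = {f"case_{i:03d}": ("malignant" if i < 20 else "benign") for i in range(100)}
splits = stratified_case_split(case_labels)
for name, cases in splits.items():
    n_malignant = sum(case_labels[c] == "malignant" for c in cases)
    print(name, len(cases), n_malignant)
```

  Once the case IDs are assigned, every image simply follows its case into the corresponding set, so leakage is still avoided while each set keeps roughly the same class mix.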

Conclusion

  When choosing how to split your data, decide based on the characteristics of the data and the goals of the experiment. If your samples are correlated, splitting by case is likely the better choice. If your samples are independent and identically distributed, splitting by proportion may be enough. In addition, methods such as cross-validation can make your evaluation of the model more robust and reliable. I hope this blog post helps you!

Origin blog.csdn.net/qq_50993557/article/details/134650591