[Paper analysis] TLDFP, TrAdaBoost, Ji Zhang, ICPP, 2019

Transfer Learning based Failure Prediction for Minority Disks in Large Data Centers of Heterogeneous Disk Systems, ICPP, 2019

Author: Ji Zhang, Ph.D., Huazhong University of Science and Technology

Note: The paper has been uploaded to the resources section, and anyone who needs it can download it for free.

The paper proposes TLDFP, a model for minority disk failure prediction based on transfer learning. It mainly adopts the idea of TrAdaBoost, a transfer learning method that works by adjusting sample weights.
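To make the weight-adjustment idea concrete, here is a minimal sketch of one reweighting round from the standard TrAdaBoost algorithm (Dai et al., 2007). It is not the TLDFP implementation, and this post does not show how the paper adapts it. The function name `tradaboost_update` and its parameters are my own, and treating majority disk samples as the source and minority disk samples as the target is an assumption.

```python
import math


def tradaboost_update(src_w, tgt_w, src_wrong, tgt_wrong, n_rounds):
    """One TrAdaBoost-style reweighting round (illustrative sketch).

    src_w, tgt_w: current sample weights for source and target samples.
    src_wrong, tgt_wrong: 1 where the current weak learner misclassified
    the sample, 0 where it was correct.
    n_rounds: total number of boosting rounds planned.
    """
    n_src = len(src_w)

    # The weighted error is measured on the target samples only.
    eps = sum(w * e for w, e in zip(tgt_w, tgt_wrong)) / sum(tgt_w)
    eps = min(max(eps, 1e-10), 0.499)  # keep beta_t finite and below 1

    beta_t = eps / (1.0 - eps)
    beta_src = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n_src) / n_rounds))

    # Misclassified source samples lose weight (they look unlike the target).
    new_src = [w * beta_src ** e for w, e in zip(src_w, src_wrong)]
    # Misclassified target samples gain weight, as in ordinary AdaBoost.
    new_tgt = [w * beta_t ** (-e) for w, e in zip(tgt_w, tgt_wrong)]
    return new_src, new_tgt


# Toy example: 4 source samples and 4 target samples, one error in each group.
new_src, new_tgt = tradaboost_update(
    [1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0],
    [1, 0, 0, 0], [1, 0, 0, 0], n_rounds=10,
)
```

Over many rounds, source samples that keep disagreeing with the target data fade out, while the remaining source samples effectively enlarge the small target training set.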

1. Summary and introduction

1. Background

Storage systems in large data centers are usually built on thousands or even millions of disks, and disk failures occur from time to time. If the lost data cannot be recovered, a disk failure may cause severe data loss, leading to system unavailability or even catastrophic consequences. In a large-scale storage system, a large number of new disks gradually enter the system over time to replace failed disks. As a result, the storage system comes to consist of heterogeneous disks: disks from different vendors, and different models from the same vendor.

2. Research object

Minority disks: a small number of new disks from different vendors, or of different models from the same vendor

3. Goal

Reduce the risk of data loss while also reducing the cost of recovering data from failed disks.

4. Motivation

1) For minority disks there is not enough training data, so traditional machine learning methods overfit and cannot provide satisfactory prediction performance in an evolving storage system made up of heterogeneous disks.
2) The disk's built-in Self-Monitoring, Analysis and Reporting Technology (SMART) uses a "threshold method", but it achieves only a 3%-10% failure detection rate (FDR) at a 0.1% false alarm rate (FAR).
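For readers new to these two metrics, the sketch below shows how FDR and FAR are usually computed from per-disk labels: FDR is the fraction of failed disks that were flagged, and FAR is the fraction of healthy disks that were flagged by mistake. The function name `fdr_far` and the toy labels are made up for illustration.

```python
def fdr_far(actual_failed, predicted_failed):
    """Return (FDR, FAR) from parallel lists of 0/1 labels, one entry per disk."""
    pairs = list(zip(actual_failed, predicted_failed))
    tp = sum(1 for a, p in pairs if a and p)          # failed, flagged
    fn = sum(1 for a, p in pairs if a and not p)      # failed, missed
    fp = sum(1 for a, p in pairs if not a and p)      # healthy, flagged
    tn = sum(1 for a, p in pairs if not a and not p)  # healthy, not flagged

    fdr = tp / (tp + fn) if (tp + fn) else 0.0
    far = fp / (fp + tn) if (fp + tn) else 0.0
    return fdr, far


# Toy example: 2 failed disks (1 detected) and 4 healthy disks (1 false alarm).
actual = [1, 1, 0, 0, 0, 0]
predicted = [1, 0, 1, 0, 0, 0]
fdr, far = fdr_far(actual, predicted)
```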

5. Method

1) Propose TLDFP, a minority disk failure prediction model based on transfer learning.
2) Propose a new method that uses Kullback-Leibler divergence (KLD) values to select an appropriate majority disk model.
3) Use this KLD-based method to support failure prediction for minority disk models. Because different disk models are gradually put into real storage systems to replace failed disks, this has important practical value.
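The KLD-based selection step can be sketched as follows. This is only an illustration under several assumptions of mine: it compares histograms of a single SMART attribute, uses the direction D(minority || majority), applies add-one smoothing, and picks the majority disk model with the smallest divergence. The paper's exact features and procedure are not described in this post, and all function names and sample values are hypothetical.

```python
import math


def histogram(values, lo, hi, bins):
    """Smoothed histogram of `values` over [lo, hi], returned as probabilities."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        counts[min(max(idx, 0), bins - 1)] += 1
    # Add-one smoothing keeps every bin non-zero so the log below is defined.
    total = len(values) + bins
    return [(c + 1) / total for c in counts]


def kl_divergence(p, q):
    """D(p || q) for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def pick_source_model(minority_values, majority_by_model, lo=0.0, hi=253.0, bins=10):
    """Return the majority disk model closest to the minority data, plus all scores."""
    p = histogram(minority_values, lo, hi, bins)
    scores = {
        name: kl_divergence(p, histogram(vals, lo, hi, bins))
        for name, vals in majority_by_model.items()
    }
    return min(scores, key=scores.get), scores


# Made-up normalized values of one SMART attribute for each disk model.
minority = [100, 102, 98, 101]
majority = {
    "model_A": [99, 100, 103, 97, 101],
    "model_B": [200, 210, 205, 220],
}
best, scores = pick_source_model(minority, majority)
```

Here `best` is `"model_A"`, because its value distribution overlaps the minority model's far more than that of `model_B`.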

6. Experiments and results

Evaluation on two real-world datasets, Backblaze and Tencent, shows that TLDFP gives more accurate predictions than four popular traditional machine learning algorithms, GBRT (gradient boosted regression trees), RGF (regularized greedy forest), SVM (support vector machine) and RNN (recurrent neural network), and than two recent transfer learning prediction models, SSDB and TLBN (I could not find these by searching).

2. Main content

1. Research history


2. Related background knowledge

Each SMART attribute consists of five elements and can be described as a tuple.

• ID: the serial number assigned to the SMART attribute.

• Normalized value: the current or most recent normalized value. In most cases it lies between 1 (worst) and 253 (best) and is computed from the raw value by a manufacturer-specific algorithm.

• Raw value: the raw count or physical state reported by the sensor and the vendor.

• Threshold: the limit at which the disk raises a failure alarm.

• Worst: the lowest (worst) value recorded for the attribute.
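The five-element tuple above maps naturally onto a small record type. The sketch below uses field names of my own choosing and made-up readings. The alarm rule (normalized value at or below the threshold) is the common convention for the SMART threshold method and is an assumption here, since vendors differ in the details.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmartAttribute:
    attr_id: int     # ID of the SMART attribute
    normalized: int  # current normalized value (higher is healthier)
    raw: int         # raw count or physical state reported by the disk
    threshold: int   # alarm limit for the normalized value
    worst: int       # lowest normalized value seen so far

    def trips_threshold(self) -> bool:
        # Assumed convention: alarm when the normalized value has dropped
        # to or below the vendor-defined threshold.
        return self.normalized <= self.threshold


# Made-up readings for attribute 5 (Reallocated Sectors Count).
healthy = SmartAttribute(attr_id=5, normalized=100, raw=0, threshold=36, worst=100)
degraded = SmartAttribute(attr_id=5, normalized=30, raw=1200, threshold=36, worst=30)
```

With these readings, `healthy.trips_threshold()` is `False` and `degraded.trips_threshold()` is `True`.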

3. Research goals

(1) What: How is a minority disk dataset defined for the purposes of failure prediction?

(2) Why: Why use transfer learning to predict failures of minority disks?

(3) How: How can transfer learning be used to predict failures of minority disks?
(4) When: When should transfer learning be used to predict failures of minority disks?

4. Experimental part


My research area is time series anomaly detection, and I welcome discussion. I am currently working on anomaly detection for disk time series and am reproducing this TrAdaBoost-based paper using Backblaze disk data. My knowledge is limited, so corrections are welcome.

Origin blog.csdn.net/qq_16488989/article/details/109283500