TensorFlow: resuming training from checkpoints

2019-09-07

As the name suggests, resuming training means that when training is interrupted for some reason before the model has finished, the next run can continue from the previously trained state instead of starting from scratch. This is especially valuable for models that take a long time to train.

To resume training, two conditions must be met:

(1) A snapshot of the model's training state has been saved locally (saving the checkpoint data);

(2) The training environment can be restored by reading that snapshot (restoring the checkpoint data).

Both operations use TensorFlow's tf.train.Saver class.

 

1. The tf.train.Saver class

__init__(
    var_list=None,
    reshape=False,
    sharded=False,
    max_to_keep=5,
    keep_checkpoint_every_n_hours=10000.0,
    name=None,
    restore_sequentially=False,
    saver_def=None,
    builder=None,
    defer_build=False,
    allow_empty=False,
    write_version=tf.train.SaverDef.V2,
    pad_step_number=False,
    save_relative_paths=False,
    filename=None
)
Rather than describing every parameter, only the commonly used one is introduced here:

max_to_keep: the maximum number of model checkpoints to keep; the default is 5. Once more than that many have been saved, the oldest checkpoint is deleted automatically so that at most max_to_keep checkpoints exist at a time. If set to 0 or None, every checkpoint is kept, which rarely makes sense beyond filling up your disk.

The other parameters can generally be left at their default values.
saver = tf.train.Saver(max_to_keep=10)

The use of the other parameters may be covered later.
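The rotation behavior of max_to_keep can be illustrated without TensorFlow. The sketch below (plain Python, not the Saver implementation) keeps only the newest N checkpoint names, the way Saver keeps only the newest N checkpoint files:

```python
from collections import deque

def keep_newest(paths, max_to_keep=5):
    """Mimic Saver's max_to_keep: keep only the newest checkpoints."""
    kept = deque(maxlen=max_to_keep)  # oldest entries fall off automatically
    for p in paths:
        kept.append(p)
    return list(kept)

# Save 7 checkpoints with max_to_keep=5: the 2 oldest are dropped.
saved = keep_newest([f"model-{i}" for i in range(1, 8)], max_to_keep=5)
print(saved)  # ['model-3', 'model-4', 'model-5', 'model-6', 'model-7']
```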

2. Saving checkpoint data

Use the Saver object's save method to save the model:

save(
    sess,
    save_path,
    global_step=None,
    latest_filename=None,
    meta_graph_suffix='meta',
    write_meta_graph=True,
    write_state=True,
    strip_default_attrs=False,
    save_debug_info=False
)

Common parameters:

sess: the session to save, usually the sess of our program;

save_path: the path and name of the saved model file, such as "ckpt/my_model"; note that if you want to save into the ckpt folder, you need the slash / after ckpt;

global_step: the training step; Saver automatically appends this value to the saved file name.

saver.save(sess, "my_model", global_step=1)
saver.save(sess, "my_model", global_step=100)
saver.save(sess, "ckpt/my_model", global_step=1)

Lines 1, 2 and 3 above respectively:

1: generate checkpoint files named with the prefix "my_model-1" in the current path;

2: generate checkpoint files named with the prefix "my_model-100" in the current path;

3: generate checkpoint files named with the prefix "my_model-1" in the ckpt folder.
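The naming rule above can be sketched in plain Python (an illustration of how the prefix is formed, not the Saver implementation):

```python
def checkpoint_prefix(save_path, global_step=None):
    # Saver appends "-<global_step>" to save_path when global_step is given
    if global_step is None:
        return save_path
    return f"{save_path}-{global_step}"

print(checkpoint_prefix("my_model", 1))         # my_model-1
print(checkpoint_prefix("my_model", 100))       # my_model-100
print(checkpoint_prefix("ckpt/my_model", 1))    # ckpt/my_model-1
```

The actual files on disk carry further suffixes (.index, .data-…, .meta), but they all share this prefix.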

The most common usage:

for epoch in range(n_iter):
    '''
    training process
    '''
    saver.save(sess, ckpt_dir + "model_name", global_step=epoch)

where ckpt_dir is the path in which the checkpoint data is stored.
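The same save-per-epoch pattern can be exercised end to end without TensorFlow. The sketch below uses pickle in place of Saver purely to illustrate the round trip (the paths and the save_snapshot helper are made up for this example):

```python
import os
import pickle
import tempfile

def save_snapshot(state, ckpt_dir, name, step):
    # Analogue of saver.save(sess, ckpt_dir + name, global_step=step)
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, f"{name}-{step}")
    with open(path, "wb") as f:
        pickle.dump(state, f)
    return path

ckpt_dir = tempfile.mkdtemp()
for epoch in range(3):
    # stand-in for "training process": the state evolves each epoch
    state = {"epoch": epoch, "weights": [epoch * 0.1]}
    save_snapshot(state, ckpt_dir, "model_name", epoch)

# Recover the last snapshot, as a resumed run would
with open(os.path.join(ckpt_dir, "model_name-2"), "rb") as f:
    restored = pickle.load(f)
print(restored["epoch"])  # 2
```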

 

3. Restoring checkpoint data

First build the same model as before; then check whether checkpoint data exists, and if so, restore it.

ckpt_dir = "ckpt/"
# create the Saver object
saver = tf.train.Saver()
# if a checkpoint file exists, read the most recent one
ckpt = tf.train.latest_checkpoint(ckpt_dir)

if ckpt is not None:
    saver.restore(sess, ckpt)

There is no need to provide the model name: tf.train.latest_checkpoint(ckpt_dir) automatically looks in the ckpt_dir folder for the most recent checkpoint file.
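The "find the newest checkpoint" step can be sketched as a directory scan in plain Python. This is only a rough analogue for illustration; the real tf.train.latest_checkpoint reads the "checkpoint" state file that Saver maintains rather than scanning the directory:

```python
import os
import re
import tempfile

def find_latest_checkpoint(ckpt_dir):
    """Rough analogue of tf.train.latest_checkpoint: pick the highest step."""
    best_step, best = -1, None
    for name in os.listdir(ckpt_dir):
        m = re.match(r"^(.+)-(\d+)$", name)  # files look like "<name>-<step>"
        if m and int(m.group(2)) > best_step:
            best_step, best = int(m.group(2)), os.path.join(ckpt_dir, name)
    return best  # None if the directory holds no checkpoints

# Demo: three fake checkpoint files, saved out of order
ckpt_dir = tempfile.mkdtemp()
for step in (1, 5, 3):
    open(os.path.join(ckpt_dir, f"my_model-{step}"), "w").close()

latest = find_latest_checkpoint(ckpt_dir)
print(os.path.basename(latest))  # my_model-5
```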


Origin www.cnblogs.com/sienbo/p/11482878.html