Sunday, September 23, 2018

TensorFlow eager: choose checkpoint max to keep


I'm writing a process-based implementation of A3C with TensorFlow in eager mode. After every gradient update, my general model writes its parameters as checkpoints to a folder. The workers then update their parameters by loading the latest checkpoint from this folder. However, there is a problem.

Oftentimes, while a worker is reading the last available checkpoint from the folder, the master network writes new checkpoints and sometimes erases the very checkpoint the worker is reading. A simple solution would be to raise the maximum number of checkpoints to keep. However, tfe.Checkpoint and tfe.Saver don't have a parameter for choosing the max to keep.

Is there a way to achieve this?
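For reference, a minimal sketch of the save/restore pattern described above (the model, optimizer, and checkpoint directory here are placeholders, not the actual A3C code):

import tensorflow as tf
import tensorflow.contrib.eager as tfe

tf.enable_eager_execution()

# placeholder network and optimizer standing in for the real A3C models
model = tf.keras.layers.Dense(4)
optimizer = tf.train.AdamOptimizer()
checkpoint = tfe.Checkpoint(optimizer=optimizer, model=model)

# master process: write a new checkpoint after every gradient update
checkpoint.save('/tmp/a3c_checkpoints/ckpt')

# worker process: load whatever the latest checkpoint currently is
latest = tf.train.latest_checkpoint('/tmp/a3c_checkpoints')
checkpoint.restore(latest)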

2 Answers

Answer 1

For tf.train.Saver you can specify max_to_keep:

tf.train.Saver(max_to_keep=10)

and max_to_keep seems to be present in both tfe.Saver and its underlying tf.train.Saver.

I haven't tried whether it works, though.
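For what it's worth, a minimal graph-mode sketch of how max_to_keep behaves with tf.train.Saver (only the N most recent checkpoints are kept, older ones are deleted); whether tfe.Saver accepts the same argument in eager mode is, as noted above, untested, and the paths below are placeholders:

import tensorflow as tf

step_var = tf.Variable(0, name='step')
increment = tf.assign_add(step_var, 1)
saver = tf.train.Saver(max_to_keep=10)  # keep only the 10 most recent checkpoints

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(20):
        sess.run(increment)
        # checkpoints older than the last 10 are deleted automatically
        saver.save(sess, '/tmp/model/ckpt', global_step=step)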

Answer 2

It seems the suggested way of handling checkpoint deletion is to use the CheckpointManager.

import tensorflow as tf

checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model)
manager = tf.contrib.checkpoint.CheckpointManager(
    checkpoint, directory="/tmp/model", max_to_keep=5)
status = checkpoint.restore(manager.latest_checkpoint)
while True:
    # train
    manager.save()
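In more recent TensorFlow versions (1.13+ and 2.x) the manager has moved out of contrib to tf.train.CheckpointManager; if that symbol is available in your installation, the same pattern looks roughly like this:

import tensorflow as tf

checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model)
# tf.train.CheckpointManager has the same max_to_keep semantics as the contrib version
manager = tf.train.CheckpointManager(
    checkpoint, directory="/tmp/model", max_to_keep=5)
checkpoint.restore(manager.latest_checkpoint)
manager.save()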