I want to implement the Asynchronous Advantage Actor-Critic (A3C) model for reinforcement learning on my local machine (1 CPU, 1 CUDA-compatible GPU). In this algorithm, several "learner" networks interact with copies of an environment and periodically update a central model.
I've seen implementations that create n "worker" networks and one "global" network inside the same graph and use threading to run them. In these approaches, the global net is updated by applying the workers' gradients to the trainable parameters under the "global" scope.
However, I recently read a bit about distributed TensorFlow and now I'm a bit confused. Would it be easier/faster/better to implement this using the distributed TensorFlow API? The documentation and talks always make explicit mention of using it in multi-device environments, so I don't know if it's overkill for a local async algorithm.
I would also like to ask, is there a way to batch the gradients calculated by every worker to be applied together after n steps?
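To make the batching idea concrete, here is a minimal sketch in plain Python rather than TensorFlow. It assumes a toy quadratic loss and a list of floats standing in for the global network's weights. All names (`GlobalParams`, `step_gradient`, `worker`, `run_demo`) are hypothetical and only illustrate the pattern: each worker thread copies the global weights, accumulates gradients over n steps, and then applies them to the global weights in a single update.

```python
import threading


class GlobalParams:
    """Hypothetical stand-in for the weights of the "global" network."""

    def __init__(self, values):
        self.values = list(values)
        self.lock = threading.Lock()
        self.num_updates = 0

    def snapshot(self):
        # Copy the current global weights into a worker-local list.
        with self.lock:
            return list(self.values)

    def apply_gradients(self, grads, lr):
        # One gradient-descent update on the shared weights.
        with self.lock:
            self.values = [w - lr * g for w, g in zip(self.values, grads)]
            self.num_updates += 1


def step_gradient(weights, sample):
    # Gradient of the toy loss sum((w - x)^2) for one step's sample x.
    return [2.0 * (w - x) for w, x in zip(weights, sample)]


def worker(global_params, samples, num_updates, lr):
    n_steps = len(samples)
    for _ in range(num_updates):
        local = global_params.snapshot()  # sync local copy with global
        accumulated = [0.0] * len(local)
        for sample in samples:  # n steps with the local weights held fixed
            grads = step_gradient(local, sample)
            accumulated = [a + g for a, g in zip(accumulated, grads)]
        mean_grads = [a / n_steps for a in accumulated]
        global_params.apply_gradients(mean_grads, lr)  # one batched update


def run_demo(num_workers=4, num_updates=50, lr=0.1):
    target = [1.0, -2.0]
    offsets = [-1.0, 1.0, -0.5, 0.5]  # per-step noise that averages to zero
    samples = [[value + offset for value in target] for offset in offsets]
    params = GlobalParams([0.0, 0.0])
    threads = [
        threading.Thread(target=worker, args=(params, samples, num_updates, lr))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return params.values, params.num_updates
```

Calling `run_demo()` returns the final shared weights and the number of global updates. In a TensorFlow version the accumulation would presumably happen on gradient tensors instead of Python lists, but the control flow (snapshot, accumulate over n steps, apply once) would be the same.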
1 Answer
I found threading simpler than the distributed TensorFlow API, but it also runs slower. The more CPU cores you use, the faster distributed TensorFlow becomes compared to threads.
However, this only holds for asynchronous training. If the number of available CPU cores is limited and you want to make use of a GPU, you might want to use synchronous training with multiple workers instead, as OpenAI does in their A2C implementation. There, only the environments are parallelized (through multiprocessing), and TensorFlow uses the GPU without any graph parallelization. OpenAI reported that their results were better with synchronous training than with A3C.
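The synchronous data flow can be sketched roughly as follows. This is a toy illustration, not OpenAI's code: the environments are stepped in-process instead of through multiprocessing, the "model" is a single weight fitted by regression rather than an actor-critic network, and the names `ToyEnv`, `VecEnv`, and `train_sync` are hypothetical. What it shows is the structure: all environment copies advance in lockstep, their observations form one batch, and a single model performs one update per batch.

```python
import random


class ToyEnv:
    """Hypothetical environment: emits an observation x and a target 2 * x."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.state = self.rng.uniform(-1.0, 1.0)

    def step(self):
        obs = self.state
        target = 2.0 * obs
        self.state = self.rng.uniform(-1.0, 1.0)  # move to a new state
        return obs, target


class VecEnv:
    """Steps every environment copy once and returns the batched results."""

    def __init__(self, envs):
        self.envs = envs

    def step(self):
        results = [env.step() for env in self.envs]
        observations = [obs for obs, _ in results]
        targets = [target for _, target in results]
        return observations, targets


def train_sync(num_envs=8, iterations=200, lr=0.5):
    vec_env = VecEnv([ToyEnv(seed=i) for i in range(num_envs)])
    w = 0.0  # the single shared model weight
    for _ in range(iterations):
        observations, targets = vec_env.step()  # all envs in lockstep
        predictions = [w * x for x in observations]  # one batched forward pass
        # Gradient of the mean squared error over the batch with respect to w.
        grad = sum(
            2.0 * (p - t) * x
            for p, t, x in zip(predictions, targets, observations)
        ) / num_envs
        w -= lr * grad  # one synchronous update per batch
    return w
```

Because there is only one model and one update per batch, no locking or parameter synchronization between workers is needed, and in a real implementation the batched forward and backward passes are the part that would run on the GPU.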