I've used TensorFlow but am new to distributed TensorFlow for training models. My understanding is that current best practices favor the data parallel model with asynchronous updates:
A paper published by the Google Brain team in April 2016 benchmarked various approaches and found that data parallelism with synchronous updates using a few spare replicas was the most efficient, not only converging faster but also producing a better model. -- Chapter 12 of Hands-On Machine Learning with Scikit-Learn and TensorFlow.
Now, after reading further about this architecture, my confusion is about which component applies the parameter updates: the workers or the parameter server?
In my illustration below, it's clear to me that the workers compute the gradients dJ/dw (the gradient of the loss J with respect to the parameter weights w). But who applies the gradient descent update rule?
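To be concrete, by "the gradient descent update rule" I mean the plain SGD step w := w - η * dJ/dw (with learning rate η); my question is which process actually executes that assignment to w.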
What's a bit confusing is that this O'Reilly article on Distributed TensorFlow states the following:
In the more centralized architecture, the devices send their output in the form of gradients to the parameter servers. These servers collect and aggregate the gradients. In synchronous training, the parameter servers compute the latest up-to-date version of the model, and send it back to devices. In asynchronous training, parameter servers send gradients to devices that locally compute the new model. In both architectures, the loop repeats until training terminates.
The above paragraph suggests that in asynchronous training:
- The workers compute gradients and send them to the parameter server.
- The parameter server broadcasts the gradients to the workers.
- Each worker receives the broadcasted gradients and applies the update rule.
Is my understanding correct? If it is, then that doesn't seem very asynchronous to me because the workers have to wait for the parameter server to broadcast the gradients. Any explanation would be appreciated.
1 Answer
Usually, the parameter servers only store the global parameters, and the workers apply their gradients directly to those global parameters (which are stored on the parameter servers). In asynchronous training no broadcasting takes place! Each worker does the following in a loop:
1. Get the current global parameters from the PS.
2. Compute the gradient of the loss with respect to those parameters.
3. Apply the gradient to the global parameters (when you apply gradients to variables that are stored on the parameter servers, TensorFlow sends the gradients to the parameter servers and applies them there).
Between steps 1 and 3, the global parameters may change because other workers are applying their own gradients. The gradient application is usually Hogwild-style, i.e. lock-free.
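Concretely, with the TF 1.x tf.train API, asynchronous between-graph replication looks roughly like the sketch below. This is only an illustration: the cluster hostnames, the toy linear model, and the random batches are stand-ins, not anything from your setup.

```
import numpy as np
import tensorflow as tf

# Hypothetical two-worker, one-PS cluster; the hostnames are placeholders.
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# In a real job these would come from flags or TF_CONFIG; hard-coded for brevity.
job_name, task_index = "worker", 0
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()  # a PS task just hosts variables and serves requests
else:
    # replica_device_setter places variables on the PS tasks and keeps
    # the compute ops on this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index,
            cluster=cluster)):
        x = tf.placeholder(tf.float32, [None, 10])
        y = tf.placeholder(tf.float32, [None, 1])
        w = tf.Variable(tf.zeros([10, 1]))   # lives on the PS
        b = tf.Variable(tf.zeros([1]))       # lives on the PS
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w) + b - y))

        global_step = tf.train.get_or_create_global_step()
        # Plain optimizer, no SyncReplicasOptimizer: each sess.run(train_op)
        # computes the gradient locally and applies it to the PS-hosted
        # variables, independently of the other workers (asynchronous).
        train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
            loss, global_step=global_step)

    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=(task_index == 0)) as sess:
        for _ in range(1000):
            # Random batches stand in for a real input pipeline.
            batch_x = np.random.rand(32, 10).astype(np.float32)
            batch_y = np.random.rand(32, 1).astype(np.float32)
            sess.run(train_op, feed_dict={x: batch_x, y: batch_y})
```

The key point is that w and b live on the PS, so minimize() issues the apply-gradient ops against those remote variables; each worker's update reaches the PS directly, and nothing is broadcast back.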
In asynchronous training, parameter servers send gradients to devices
I don't think this happens in any asynchronous implementation; I'm not sure what the author meant here.
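For contrast, the synchronous behaviour the article describes (the PS aggregating gradients from several workers before updating the model) is what you opt into by wrapping the optimizer in tf.train.SyncReplicasOptimizer. A rough sketch, reusing loss, global_step and task_index from the example above:

```
# Synchronous variant (sketch): gradients from replicas_to_aggregate workers
# are collected and averaged before the PS variables are updated once.
opt = tf.train.SyncReplicasOptimizer(
    tf.train.GradientDescentOptimizer(0.01),
    replicas_to_aggregate=2,
    total_num_replicas=2)
train_op = opt.minimize(loss, global_step=global_step)
sync_hook = opt.make_session_run_hook(is_chief=(task_index == 0))
# Pass hooks=[sync_hook] to MonitoredTrainingSession; without this wrapper,
# updates stay asynchronous as described above.
```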