Monday, June 19, 2017

Why should preprocessing be done on CPU rather than GPU?


The TensorFlow performance guide advises doing the preprocessing on the CPU rather than on the GPU. The listed reasons are:

  1. This prevents the data from going from CPU to GPU to CPU and back to GPU again.
  2. This frees the GPU from these tasks so it can focus on training.

I am not sure I understand either argument.

  1. Why would preprocessing send its result back to the CPU, especially if all nodes are placed on the GPU? Why preprocessing operations and not any other operation in the graph: why are they, or why should they be, special?
  2. Even though I understand the rationale behind putting the CPU to work rather than keeping it idle, compared to the huge convolutions and gradient backpropagation a training step has to do, I would have assumed that random cropping, flipping and the other standard preprocessing steps on the input images would be nowhere near as demanding in terms of computation and would execute in a fraction of the time. Even if we think of preprocessing as mostly moving things around (crops, flips), I would expect GPU memory to be faster for that. Yet doing preprocessing on the CPU can yield a 6+-fold increase in throughput according to the same guide.

I am assuming, of course, that preprocessing does not result in a drastic decrease in the size of the data (e.g. subsampling or cropping to a much smaller size), in which case the gain in transfer time to the device is obvious. I suppose these are rather extreme cases and do not constitute the basis for the above recommendation.

Can somebody make sense out of this?

2 Answers

Answer 1

It comes down to how CPUs and GPUs work. A GPU is very good at repetitive, parallelisable tasks, whereas a CPU is better at general-purpose computations that require more complex control flow.

For example, consider a program which accepts two integers as input from the user and runs a for-loop one million times to sum the two numbers.

How can we achieve this with a combination of CPU and GPU processing?

We handle the initial data (the two user-provided integers) on the CPU and then send the two numbers to the GPU, where the for-loop that sums them runs, because that is the repetitive, parallelizable yet simple part of the computation, which the GPU is better at. [Although this example is not exactly related to TensorFlow, the concept is at the heart of all CPU and GPU processing. Regarding your query: preprocessing steps like random cropping, flipping and other standard image operations might not be computationally intensive, but the GPU does not excel at this kind of interrupt-driven, control-heavy work either.]
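As a toy illustration of that division of labour (my own sketch, not part of the original argument, written in TensorFlow 1.x style purely for the sake of example): the host code on the CPU collects the two integers, while the repetitive arithmetic is expressed as a single vectorized op placed on the GPU.

```python
import tensorflow as tf  # assumes TensorFlow 1.x; this is only an illustrative sketch

# CPU / host side: collect the two integers from the user.
a_value = int(input("First integer: "))
b_value = int(input("Second integer: "))

# GPU side: the repetitive, parallelizable arithmetic. Instead of a Python
# for-loop running one million times, the same work is expressed as one
# vectorized op pinned to the GPU.
with tf.device('/gpu:0'):
    a = tf.placeholder(tf.int64, shape=[])
    b = tf.placeholder(tf.int64, shape=[])
    total = tf.reduce_sum(tf.fill([1000000], a + b))

# allow_soft_placement falls back to the CPU if no GPU (or GPU kernel) is available.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    print(sess.run(total, feed_dict={a: a_value, b: b_value}))
```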

Another thing to keep in mind is that the latency between the CPU and the GPU also plays a key role here. Copying data back and forth between the CPU and the GPU is expensive compared to moving data between different cache levels inside the CPU.

As Dey (2014) [1] mentions:

When a parallelized program is computed on the GPGPU, first the data is copied from the memory to the GPU, and after computation the data is written back to the memory from the GPU using the PCI-e bus (refer to Fig. 1.8). Thus for every computation, data has to be copied to and fro between device and host memory. Although the computation is very fast on the GPGPU, because of the gap between device and host memory due to communication via PCI-e, a bottleneck in performance is generated.

[Figure: data flow between host memory and the GPU over the PCI-e bus (Fig. 1.8 in Dey, 2014)]

For this reason it is advisable that:

You do the preprocessing on the CPU: the CPU does the initial computation, prepares the data, and sends the repetitive, parallelisable tasks to the GPU for further processing.
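For illustration, a minimal TensorFlow 1.x sketch of this placement could look like the following (the file names, crop size and the stand-in model layer are hypothetical placeholders, not something prescribed by the guide):

```python
import tensorflow as tf  # assumes TensorFlow 1.x with a queue-based input pipeline

# Everything under /cpu:0 is the "preprocessing" part of the graph:
# reading, decoding and augmenting stay on the host.
with tf.device('/cpu:0'):
    filename_queue = tf.train.string_input_producer(
        ['data/train-0.jpg', 'data/train-1.jpg'])    # hypothetical file names
    reader = tf.WholeFileReader()
    _, raw = reader.read(filename_queue)
    image = tf.image.decode_jpeg(raw, channels=3)
    image = tf.random_crop(image, [224, 224, 3])      # standard augmentations...
    image = tf.image.random_flip_left_right(image)    # ...run on the CPU
    image = tf.image.per_image_standardization(image)
    images = tf.train.shuffle_batch([image], batch_size=32, capacity=2000,
                                    min_after_dequeue=1000, num_threads=4)

# Only the already-preprocessed batch crosses PCI-e; the heavy math runs on the GPU.
with tf.device('/gpu:0'):
    # Stand-in for a real network: one conv layer just to show where the model goes.
    logits = tf.layers.conv2d(images, filters=64, kernel_size=3)
```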

I once developed a buffering mechanism to increase the data throughput between the CPU and the GPU, and thereby reduce the negative effects of the latency between them. Have a look at this thesis to gain a better understanding of the issue:

EFFICIENT DATA INPUT/OUTPUT (I/O) FOR FINITE DIFFERENCE TIME DOMAIN (FDTD) COMPUTATION ON GRAPHICS PROCESSING UNIT (GPU)

Now, to answer your question:

Why would preprocessing send the result back to the CPU, esp. if all nodes are on GPU?

As quoted from the TensorFlow performance guide [2]:

When preprocessing occurs on the GPU the flow of data is CPU -> GPU (preprocessing) -> CPU -> GPU (training). The data is bounced back and forth between the CPU and GPU.

If you recall the CPU-memory-GPU dataflow diagram mentioned above, doing the preprocessing on the CPU improves performance because:

  • After the GPU finishes computing its nodes, the data is sent back to host memory and the CPU fetches it for further processing. The GPU does not have enough on-board memory to keep all the data resident for the whole computation, so some back-and-forth of data is inevitable. To optimise this data flow, you do the preprocessing on the CPU; the data prepared for the parallelizable training tasks is then placed in memory, and the GPU fetches that preprocessed data and works on it.

The performance guide also mentions that by doing this, and by having an efficient input pipeline, you avoid starving either the CPU or the GPU (or both), which supports the reasoning above. In the same performance doc, you will also see the following:

If your training loop runs faster when using SSDs vs HDDs for storing your input data, you could be I/O bottlenecked. If this is the case, you should pre-process your input data, creating a few large TFRecord files.

This again points to the same CPU-memory-GPU performance bottleneck mentioned above.
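For concreteness, writing preprocessed input into a few large TFRecord files could look roughly like this minimal TensorFlow 1.x sketch (the function name and feature keys are my own choices, not from the guide):

```python
import tensorflow as tf  # assumes TensorFlow 1.x

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def write_tfrecord(image_paths, labels, output_path):
    """Pack (already preprocessed) images and labels into one large TFRecord file."""
    with tf.python_io.TFRecordWriter(output_path) as writer:
        for path, label in zip(image_paths, labels):
            with open(path, 'rb') as f:
                encoded = f.read()  # this is also the place to resize/re-encode offline
            example = tf.train.Example(features=tf.train.Features(feature={
                'image/encoded': _bytes_feature(encoded),
                'image/label': _int64_feature(label),
            }))
            writer.write(example.SerializeToString())
```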

Hope this helps and in case you need more clarification (on CPU-GPU performance), do not hesitate to drop a message!

References:

[1] Somdip Dey, EFFICIENT DATA INPUT/OUTPUT (I/O) FOR FINITE DIFFERENCE TIME DOMAIN (FDTD) COMPUTATION ON GRAPHICS PROCESSING UNIT (GPU), 2014

[2] Tensorflow Performance Guide: https://www.tensorflow.org/performance/performance_guide

Answer 2

I will first quote two arguments from the performance guide; I think your two questions concern these two arguments respectively.

The data is bounced back and forth between the CPU and GPU. ... Another benefit is preprocessing on the CPU frees GPU time to focus on training.

(1) Operations like file readers, enqueue and dequeue can only run on the CPU, while operations like reshape, cast and per_image_standardization can run on either the CPU or the GPU. So a wild guess for your first question: if the code doesn't specify /cpu:0, the program will run the readers on the CPU, then preprocess the images on the GPU, and finally enqueue and dequeue on the CPU. (Not sure I am correct; waiting for an expert to verify...)
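One way to check that guess is to let TensorFlow log where each op actually ends up; this is a standard TensorFlow 1.x session option, sketched here only as a suggestion:

```python
import tensorflow as tf  # assumes TensorFlow 1.x

# log_device_placement=True prints the device chosen for every op, so you can see
# whether readers/queues stayed on the CPU and where the preprocessing ops landed.
config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=True)
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    # ...run a training or input-pipeline step here and inspect the placement log
```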

(2) For the second question, I have the same doubt. When you train a large network, most of the time is spent on the huge convolutions and the gradient computation, not on preprocessing images. However, where they mention a 6X+ increase in samples/sec processed, I think they mean training on MNIST, where a small network is usually used. That makes sense: smaller convolutions take much less time, so the time spent on preprocessing is relatively large, and a 6X+ increase is plausible in that case. In any case, "preprocessing on the CPU frees GPU time to focus on training" is a reasonable explanation.
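To make "frees GPU time to focus on training" concrete: the input pipeline can prepare the next batches on CPU threads while the GPU works on the current one. A rough sketch with the tf.data API (assuming a TensorFlow release that includes it; the parse_and_augment function and the TFRecord file name are hypothetical):

```python
import tensorflow as tf  # assumes a TensorFlow release with the tf.data API

def parse_and_augment(serialized):
    """Hypothetical per-example preprocessing: decode, crop, flip, standardize."""
    features = tf.parse_single_example(serialized, {
        'image/encoded': tf.FixedLenFeature([], tf.string),
        'image/label': tf.FixedLenFeature([], tf.int64),
    })
    image = tf.image.decode_jpeg(features['image/encoded'], channels=3)
    image = tf.random_crop(image, [224, 224, 3])
    image = tf.image.random_flip_left_right(image)
    image = tf.image.per_image_standardization(image)
    return image, features['image/label']

dataset = (tf.data.TFRecordDataset(['train-00000-of-00001.tfrecord'])
           .map(parse_and_augment, num_parallel_calls=4)  # CPU threads do the preprocessing
           .shuffle(1000)
           .batch(32)
           .prefetch(2))  # prepare upcoming batches while the GPU trains on the current one
images, labels = dataset.make_one_shot_iterator().get_next()
```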

Hope this helps.
