The following script runs very slowly. I just want to count the total number of lines in the Twitter follower graph (a text file of ~26 GB).
I need to perform a machine learning task. This is just a test of accessing data on HDFS from TensorFlow.
import tensorflow as tf
import time

filename_queue = tf.train.string_input_producer(["hdfs://default/twitter/twitter_rv.net"], num_epochs=1, shuffle=False)

def read_filename_queue(filename_queue):
    reader = tf.TextLineReader()
    _, line = reader.read(filename_queue)
    return line

line = read_filename_queue(filename_queue)

session_conf = tf.ConfigProto(intra_op_parallelism_threads=1500, inter_op_parallelism_threads=1500)

with tf.Session(config=session_conf) as sess:
    sess.run(tf.initialize_local_variables())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    start = time.time()
    i = 0
    while True:
        i = i + 1
        if i%100000 == 0:
            print(i)
            print(time.time() - start)

        try:
            sess.run([line])
        except tf.errors.OutOfRangeError:
            print('end of file')
            break
    print('total number of lines = ' + str(i))
    print(time.time() - start)
The process needs about 40 seconds for the first 100,000 lines. I tried setting intra_op_parallelism_threads and inter_op_parallelism_threads to 0, 4, 8, 40, 400 and 1500, but it didn't affect the execution time significantly.
Can you help me?
System specs:
- 16 GB RAM
- 4 CPU cores
2 Answers
Answer 1
You can split the big file into smaller ones; it may help. Also set intra_op_parallelism_threads and inter_op_parallelism_threads to 0.
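As a rough illustration of the splitting step, here is a minimal sketch in plain Python. It assumes the file is available on a local file system (for a file stored on HDFS you would first have to copy it or split it with other tools), and the function name split_by_lines, the part-NNNNN.txt naming scheme and the default chunk size are hypothetical choices, not anything from the question.

```python
import os


def split_by_lines(src_path, out_dir, lines_per_chunk=1000000):
    """Split a text file into numbered chunks without breaking any line.

    Hypothetical helper: the name, the output naming scheme and the
    default chunk size are illustrative assumptions.
    """
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    out = None
    try:
        # Binary mode copies the bytes unchanged and skips text decoding.
        with open(src_path, "rb") as src:
            for line_no, line in enumerate(src):
                # Start a new chunk file every lines_per_chunk lines.
                if line_no % lines_per_chunk == 0:
                    if out is not None:
                        out.close()
                    path = os.path.join(out_dir, "part-%05d.txt" % len(chunk_paths))
                    out = open(path, "wb")
                    chunk_paths.append(path)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return chunk_paths
```

The returned list of paths could then be passed to tf.train.string_input_producer in place of the single file name. Whether this actually speeds up the read is not guaranteed; it only makes it possible to read several files in parallel.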
Answer 2
Try this and it should improve your timing:
session_conf = tf.ConfigProto(intra_op_parallelism_threads=0, inter_op_parallelism_threads=0)
It is not a good idea to set these values yourself when you do not know what the optimum is. A value of 0 lets TensorFlow pick an appropriate number of threads on its own.