Wednesday, April 13, 2016

Using a graph-tool efficiently

Leave a Comment

After a long thought, I finally decided to post this question here. Few days back I started using graph-tool to do various things. I have been using Networkx before that. I have already seen the impressive performance comparision and thought that everything would be simple enough. However, I immediately ran into the speed issue and asked a question related to a particular aspect of it. I got a quick answer that satisfied me. However, now this speed issue is bugging me every now and then and I can't find any documentation about graph-tool that is related to efficiently using it. For example, from the answer to my last question, I came to realize that it is better to add all edges together instead of one by one which is a very important point to note but hasn't been mentioned anywhere! I am now having two more similar issues:

(1) How do I choose a random neighbour of a given node? I can only see the following solution:

nbr = np.random.choice(list(v.all_neighbours())) 

since v.all_neighbours() is a generator, I must convert it into the list to choose a random element. This slows down the code but I don't see any better way.

(2) I want to assign a 1d vector (is list okay?) to each of the vertices in the graph and later I am exchanging and modifying them in a particular way. This is simply a property map and I would like to see some documentation about how to use this efficiently. However, I can't locate anything.

(3) I am trying to simulate a triadic closure in some network which itself is changing with time. Thus, at every time step, I need an information about the neighbours of each vertex in the graph. Again, I must create a list (or numpy array):

nbrs = [w for w in v.neighbours()] 

which decreases the speed of my code substantially. This means that I am not doing this correctly but I couldn't find any documentation that would tell me how to use neighbours efficiently in graph-tool.

Somehow Networkx programs that I have written for same tasks have completely outperformed the graph-tool codes I simply can't buy this.

This list might increase and hence I would be very glad if somebody could point me to some documentation about using graph-tool efficiently apart from answering the above mentioned specific questions.

Thanks in advance.

2 Answers

Answers 1

  1. To choose a random item from a list of items, you can use numpy.random.choice. It also allows for the definition of value for probability of occurrence. See the documentations here. The alternative would be to get a random number between the boundaries of your data and use it as an index against your list / generator. This can be done using numpy.random.randint. Read more on this here.

    data = range(100) max_data = max(data) min_data = min(data) index = np.randint(min_data, max_data) random_choice = data[index] 
  2. From what I understand, I think your best option is to use a non-relational table, either in the memory, or a CSV file, or a SQLite database. You may store functions and variables (serialise them) using pickle. You can find out more about it from here.

  3. One way to avoid iterating through your data and increase the speed of your code is as follows:

    data = [1, 2, 3, 4, 5, 6, 7]  # or: data = range(1, 8) query = [1,4, 9]  answer = list(set(query) & set(data)) print(answer) 

    which returns:

    [1, 4] 

    Note that this works for a generator objects too and takes a fraction of the time otherwise spent iterating. The alternative, however, would be to use the map function, which is explained here.

    Another alternative is to use the numpy.in1d function:

    data = np.array(range(100)) query = np.array([1, 52, 121])  answer = np.in1d(data, query)  # Returns a logical array. new_array = data[answer]  print(new_array) 

    which returns:

    array([ 1, 52]) 

    This is the most efficient method, as it uses vectorisation algorithms. For this set of data, it takes an average of 11.7 µs for 100k loops; whereas the one using set takes 12.4 µs. Bear in mind that both set and map cache the data, and do not perform operations until the answer is requested. NumPy, on the other hand, performs all operations immediately and as such, may be beneficial in some circumstances. Although if you're using generators, other methods are probably just as valid.

You indicate in your question that networkx outperforms graph-tools, and that you "don't buy" this. I take it you say this because the latter is chiefly implemented in C++. If it is, here is why: Whilst this is a valid argument, it is often taken for granted that anything implemented in C is necessarily better in performance when compared to Python. This is generally true, however, when it comes to mathematical operations that do NOT involve iterations, Python might actually be faster. The reason for this is that unlike many well-known languages (namely C, C++ and Java), Python takes advantage of vectorisation (as do, for instance, Julia, MATLAB, and R). This means that a mathematical operation is applied collectively to an entire array, as opposed to inducing an iteration that goes through individual elements. A single line of operation in Python using NumPy can require pages and pages in Java for this very reason.

I hope this helps to an extent. To give you a tailored help, however, I would need to see your code.

Answers 2

I'll try to make more "graph-tool-specific" answers:

1) well actually this one is related to python, so you might want to use the solution from this thread using random.shuffle However, if you are going to do this repeatedly, (and if you have enough available memory), I think it might be better to get the adjacency matrix as a scipy sparse matrix then use that matrix:

import graph_tool from graph_tool.spectral import adjacency import numpy as np  g = graph_tool.Graph() g.add_vertex(100) edges = np.random.randint(0, 100, (500,2)) g.add_edge_list(edges)  mat = adjacency(g).tolil() rows = mat.rows.copy() # mat.rows = in-neighbours for each vertex map(np.random.shuffle, rows) rnd_neighbours = { i:row[0] for i, row in enumerate(rows) if row } 

Where rnd_neighbours contains the index of one random neighbour for each vertex of nonzero out-degree.

2) reading the graph-tool documentation regarding PropertyMaps and the detailed version, lists are accepted as python::object or simply object. You can then access them as elements in the PropertyMap.

3) For triadic closure, look at the clustering module and at this thread on the mailing list.

EDIT: by the way, I forgot to mention it, but you can access and change the number of OpenMP threads with openmp_enabled, openmp_get_num_threads, and openmp_set_num_threads in graph-tool. Can't find it in the doc, though... But here is the source.

If You Enjoyed This, Take 5 Seconds To Share It

0 comments:

Post a Comment