Saturday, December 23, 2017

Calculate the value of a double summation of a function over the pairs of vertices of a graph


Given a directed graph G and a node pair (p, q), we have the recurrence

R_{k+1}(p,q) = C \left( \frac{|L(p) \cap L(q)|}{|L(p) \cup L(q)|} + \frac{\sum_{p' \in L(p) \setminus L(q)} \sum_{q' \in L(q)} R_k(p',q')}{|L(p) \cup L(q)| \cdot |L(q)|} + \frac{\sum_{q' \in L(q) \setminus L(p)} \sum_{p' \in L(p)} R_k(p',q')}{|L(p) \cup L(q)| \cdot |L(p)|} \right)

I want to compute the value of this recursive function, where L(p) denotes the set of undirected link neighbors of node p; the formula gives the value at step k+1 in terms of the values at step k.

I know how to calculate L(p) and L(q).

Here is my attempt:

from __future__ import division
import copy

import numpy as np
import scipy
import scipy.sparse as sp
import time
import warnings


class Algo():
    def __init__(self, c=0.8, max_iter=5, is_verbose=False, eps=1e-3):
        self.c = c
        self.max_iter = max_iter
        self.is_verbose = is_verbose
        self.eps = eps

    def compute_similarity(self, a):
        '''
        Note: The adjacency matrix is a (0,1)-matrix with zeros on its
        diagonal.
        '''
        a_transpose = np.transpose(a)
        a_undirected = np.add(a, a_transpose)

        self.num_nodes = np.shape(a)[0]
        current_sim = np.eye(self.num_nodes)
        prev_sim = np.zeros(shape=(self.num_nodes, self.num_nodes))

        #Determine the set of Neighbors for each node
        nbs = []
        for i in range(self.num_nodes):
            nbs.append(np.nonzero(a_undirected[:, i])[0])

        for itr in range(self.max_iter):
            if scipy.allclose(current_sim, prev_sim, atol=self.eps):
                break
            prev_sim = copy.deepcopy(current_sim)
            for i in range(self.num_nodes):
                for j in range(self.num_nodes):
                    if i == j:
                        continue
                    if len(nbs[i]) == 0 or len(nbs[j]) == 0:
                        current_sim[i][j] = 0
                    else:
                        #commonSet=0.0
                        differenceSetofp_ij = 0.0
                        differenceSetofq_ij = 0.0
                        commonSet = len(np.intersect1d(nbs[i], nbs[j])) / len(np.union1d(nbs[i], nbs[j]))

                        for x in np.setdiff1d(nbs[i], nbs[j]):
                            for y in nbs[j]:
                                differenceSetofp_ij += prev_sim[x][y]
                                differenceSetofp_ij *= 1 / (len(np.union1d(nbs[i], nbs[j])) * len(nbs[j]))
                                #print differenceSetofp

                        for x in np.setdiff1d(nbs[j], nbs[i]):
                            for y in nbs[i]:
                                differenceSetofq_ij += prev_sim[x][y]
                                differenceSetofq_ij *= 1 / (len(np.union1d(nbs[j], nbs[i])) * len(nbs[i]))

                        current_sim[i][j] = self.c * (commonSet + differenceSetofp_ij + differenceSetofq_ij)

        if self.is_verbose:
            print('SimRank: converged after {0} iterations'.format(itr))
        return current_sim


if __name__ == "__main__":
    a = np.array([[0, 1, 1, 0, 0, 1, 0, 0, 0],
                  [0, 0, 0, 0, 1, 0, 0, 0, 0],
                  [0, 0, 0, 1, 0, 0, 0, 0, 0],
                  [0, 0, 0, 0, 0, 0, 0, 0, 0],
                  [0, 0, 0, 0, 0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 0, 0, 0, 0, 0],
                  [0, 0, 0, 0, 0, 1, 0, 0, 0],
                  [0, 0, 0, 0, 0, 0, 1, 0, 1],
                  [0, 0, 0, 0, 0, 0, 0, 0, 0]])

    obj = Algo(is_verbose=True, eps=1e-2)
    start_time = time.time()
    S = obj.compute_similarity(a)
    print(S)
    print('Runtime: ', time.time() - start_time, '\n')

The output I am getting is not correct.

2 Answers

Answer 1

The expected answers R_inf[1,2] = 0.8 and R_inf[6,8] = 0.8 for OP's example a are what the algorithm produces when applied to the directed graph a.T, not to the undirected one.

Here is R_2 computed with OP's code (with some minor fixes) and with my code, followed by R_inf computed with my code:

after 2 iterations with pp_loop:
0.8000 0      0      0      0      0      0      0.8000 0
0      0.8000 0.8000 0      0      0.4000 0.3200 0      0.3200
0      0.8000 0.8000 0      0      0.4000 0.3200 0      0.3200
0      0      0      0.8000 0.4800 0      0      0      0
0      0      0      0.4800 0.8000 0      0      0      0
0      0.4000 0.4000 0      0      0.8000 0.1600 0      0.1600
0      0.3200 0.3200 0      0      0.1600 0.8000 0      0.8000
0.8000 0      0      0      0      0      0      0.8000 0
0      0.3200 0.3200 0      0      0.1600 0.8000 0      0.8000
after 2 iterations with OP:
0.8000 0      0      0      0      0      0      0.8000 0
0      0.8000 0.8000 0      0      0.4000 0.3200 0      0.3200
0      0.8000 0.8000 0      0      0.4000 0.3200 0      0.3200
0      0      0      0.8000 0.4800 0      0      0      0
0      0      0      0.4800 0.8000 0      0      0      0
0      0.4000 0.4000 0      0      0.8000 0.1600 0      0.1600
0      0.3200 0.3200 0      0      0.1600 0.8000 0      0.8000
0.8000 0      0      0      0      0      0      0.8000 0
0      0.3200 0.3200 0      0      0.1600 0.8000 0      0.8000
in fixed point:
0.8000 0      0      0      0      0      0      0.8000 0
0      0.8000 0.8000 0      0      0.4000 0.3200 0      0.3200
0      0.8000 0.8000 0      0      0.4000 0.3200 0      0.3200
0      0      0      0.8000 0.4800 0.0960 0.0256 0      0.0256
0      0      0      0.4800 0.8000 0.1280 0      0      0
0      0.4000 0.4000 0.0960 0.1280 0.8000 0.1600 0      0.1600
0      0.3200 0.3200 0.0256 0      0.1600 0.8000 0      0.8000
0.8000 0      0      0      0      0      0      0.8000 0
0      0.3200 0.3200 0.0256 0      0.1600 0.8000 0      0.8000

My solution is faster and can compute the fixed point for networks up to ~1000 nodes.

A bit of explanation: the strategy is to express the recurrence in matrix algebra, because then we can leverage the highly optimized, BLAS-powered linear algebra that comes with numpy/scipy. For example, the inner sums (all of them in one go) can be computed by multiplying the matrix R_k of current values with the transposed adjacency matrix. This may seem wasteful considering all the zeros in the adjacency matrix, but BLAS is just very, very fast. Similarly, we can compute the sizes of all pairwise intersections by multiplying the adjacency matrix with its transpose, and so on.
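To make that concrete, here is a minimal, hedged sketch of the idea with made-up names (A for a toy adjacency matrix, R_k for the current values); it is not the production code below, just the matrix identities it relies on:

import numpy as np

# Toy 0/1 adjacency matrix A and current values R_k (illustrative only).
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]])
R_k = np.eye(3)

n_neigh = A.sum(axis=1)                              # |L(p)| for every p
n_inter = A @ A.T                                    # |L(p) & L(q)| for all pairs at once
n_union = np.add.outer(n_neigh, n_neigh) - n_inter   # |L(p) | L(q)| for all pairs

# All inner double sums in one go:
# inner[p, q] = sum over x in L(p), y in L(q) of R_k[x, y]
inner = A @ R_k @ A.T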

Assuming that the ultimate goal is not so much the sequence of R's as the fixed point, we can save some more time by using a proper linear solver, because the recurrence is linear (plus an offset). Since writing out the n^2 x n^2 matrix of this operation is not efficient, we use an iterative solver that accepts a LinearOperator in place of a matrix.
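As a hedged, self-contained illustration of that pattern (the toy map W below stands in for the actual recurrence, it is not the SimRank operator): a fixed point x = M(x) + b solves (I - M) x = b, and (I - M) can be handed to an iterative solver as a LinearOperator, so the big matrix is never materialized.

import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

# Toy linear map M(x) = W @ x with spectral radius < 1, plus an offset b.
n = 4
W = 0.2 * np.array([[0., 1., 0., 1.],
                    [1., 0., 1., 0.],
                    [0., 1., 0., 1.],
                    [1., 0., 1., 0.]])
b = np.ones(n)

def apply_M(x):
    # Stand-in for "apply the recurrence once"; only matrix-vector products.
    return W @ x

# Fixed point x = M(x) + b  <=>  (I - M) x = b
A = LinearOperator((n, n), matvec=lambda x: x - apply_M(x))
x, info = gmres(A, b)                          # info == 0 means the solver converged

print(info, np.abs(x - (apply_M(x) + b)).max())   # residual of the fixed-point equation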

If the tolerances are set tight enough, both approaches yield the same fixed point.

Code, including OP's version with fixes:

import numpy as np
from scipy import sparse
from scipy.sparse import linalg

def mock_data(n, p):
    data = (np.random.random((n, n)) < p).astype(int)
    np.einsum('ii->i', data)[:] = 0
    return data

import copy

import scipy
import scipy.sparse as sp
import time
import warnings

class Algo():
    def __init__(self, c=0.8, max_iter=5, is_verbose=False, eps=1e-3):
        self.c = c
        self.max_iter = max_iter
        self.is_verbose = is_verbose
        self.eps = eps

    def compute_similarity(self, a, R0=None, use_undirected=False,
                           no_neighbor_policy='10'):
        '''
        Note: The adjacency matrix is a (0,1)-matrix with zeros on its
        diagonal.
        '''
        a_transpose = np.transpose(a)
        a_undirected = np.add(a, a_transpose)
        a_use = a_undirected if use_undirected else a

        self.num_nodes = np.shape(a)[0]
        current_sim = np.eye(self.num_nodes) if R0 is None else R0
        prev_sim = np.zeros(shape=(self.num_nodes, self.num_nodes))

        #Determine the set of Neighbors for each node
        nbs = []
        for i in range(self.num_nodes):
            nbs.append(np.nonzero(a_use[i])[0])
        R = []

        for itr in range(self.max_iter):
            if scipy.allclose(current_sim, prev_sim, atol=self.eps):
                break
            prev_sim = copy.deepcopy(current_sim)
            R.append(prev_sim)
            for i in range(self.num_nodes):
                for j in range(self.num_nodes):
                    if len(nbs[i]) == 0 and len(nbs[j]) == 0:
                        current_sim[i][j] = (
                            0 if (no_neighbor_policy=='00' or
                                  (no_neighbor_policy=='10' and i!=j))
                            else self.c)
                    elif i == j:
                        current_sim[i][j] = self.c
                    elif len(nbs[i]) == 0 or len(nbs[j]) == 0:
                        current_sim[i][j] = 0
                    else:
                        #commonSet=0.0
                        differenceSetofp_ij = 0.0
                        differenceSetofq_ij = 0.0
                        commonSet = len(np.intersect1d(nbs[i], nbs[j])) / np.maximum(len(np.union1d(nbs[i], nbs[j])), 1)

                        for x in np.setdiff1d(nbs[i], nbs[j]):
                            for y in nbs[j]:
                                differenceSetofp_ij += prev_sim[x][y]
                        differenceSetofp_ij *= 1 / np.maximum(len(np.union1d(nbs[i], nbs[j])) * len(nbs[j]), 1)
                        #print differenceSetofp

                        for x in np.setdiff1d(nbs[j], nbs[i]):
                            for y in nbs[i]:
                                differenceSetofq_ij += prev_sim[x][y]
                        differenceSetofq_ij *= 1 / np.maximum(len(np.union1d(nbs[j], nbs[i])) * len(nbs[i]), 1)

                        current_sim[i][j] = self.c * (commonSet + differenceSetofp_ij + differenceSetofq_ij)

        if self.is_verbose:
            print('SimRank: converged after {0} iterations'.format(itr))
        R.append(current_sim)
        return np.array(R)

def f_OP(adj, c, max_n, R0=None, use_undirected=False, no_neighbor_policy='10'):
    return Algo(c, max_n).compute_similarity(adj, R0, use_undirected,
                                             no_neighbor_policy)

def f_pp(adj, c, max_n, R0=None, use_undirected=False,
         no_neighbor_policy='10', solver=linalg.gcrotmk):
    n, n = adj.shape
    if use_undirected:
        adj = adj | adj.T
    aind = np.where(adj.T)
    n_neigh = adj.sum(axis=1)
    n_inter = adj @ adj.T
    n_union = np.add.outer(n_neigh, n_neigh) - n_inter
    r_union = c / np.maximum(n_union, 1)
    if no_neighbor_policy == '11':
        n_inter[n_union==0] = 1
        r_union[n_union==0] = c
    elif no_neighbor_policy == '10':
        np.einsum('ii->i', n_inter)[np.einsum('ii->i', n_union)==0] = 1
        np.einsum('ii->i', r_union)[np.einsum('ii->i', n_union)==0] = c
    elif no_neighbor_policy != '00':
        raise ValueError
    avg_ngh = adj.T / np.maximum(n_neigh, 1)
    if solver is None:
        R = np.empty((max_n+1, n, n))
        R[0] = 0 if R0 is None else R0
        if R0 is None:
            np.einsum('ii->i', R[0])[:] = 1
        for i in range(max_n):
            rowsums = R[i] @ avg_ngh
            rowsums[aind] = 0
            rec_sum = adj @ rowsums
            R[i+1] = r_union * (n_inter + rec_sum + rec_sum.T)
            if np.allclose(R[i+1], R[i], rtol=1e-7, atol=1e-10):
                return R[:i+2]
        return R
    else:
        def LO(x):
            R = x.reshape(n, n)
            rowsums = R @ avg_ngh
            rowsums[aind] = 0
            rec_sum = adj @ rowsums
            return (r_union * (rec_sum + rec_sum.T) - R).ravel()
        R0 = np.identity(n) if R0 == None else R0
        LO = linalg.LinearOperator((n*n, n*n), matvec=LO)#, rmatvec=LO)
        res = solver(LO, (-r_union * n_inter).ravel(), R0.ravel(), tol=1e-9)
        assert res[1]==0
        return res[0].reshape(n, n)

def f_pp_loop(adj, c, max_n, R0=None, use_undirected=False,
              no_neighbor_policy='10'):
    return f_pp(adj, c, max_n, R0, use_undirected, no_neighbor_policy, None)


# test on OP's example:
a = np.array([[0, 1, 1, 0, 0, 1, 0, 0, 0],
              [0, 0, 0, 0, 1, 0, 0, 0, 0],
              [0, 0, 0, 1, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 1, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 1, 0, 1],
              [0, 0, 0, 0, 0, 0, 0, 0, 0]])

a = a.T
m, c, nnp = 2, 0.8, '11'

sol = {}
for f in f_pp_loop, f_OP:
    name = f.__name__[2:]
    sol[name] = f(a, c, m, None, False, nnp)
    print(f"after {len(sol[name])-1} iterations with {name}:")
    print("\n".join(" ".join("0     " if i == 0 else f"{i:6.4f}" for i in row) for row in sol[name][-1]))
print("in fixed point:")
print("\n".join(" ".join("0     " if i == 0 else f"{i:6.4f}" for i in row) for row in f_pp(a, c, 64, None, False, nnp)))

print()

# demo large network
n, m, p, c, nnp = 1000, 256, 0.02, 0.9, '11'
adj = mock_data(n, p)
print(f'size {n}, maxiter {m}, connection prob {p}, c {c}')
from time import time
t = time()
spp_loop = f_pp_loop(adj, c, m, None, False, nnp)
print('pp_loop: {:8.5f} sec'.format(time()-t))
print('squared rel err after {} iterations: {}'.format(len(spp_loop)-1, np.nanmax(((spp_loop[-1]-spp_loop[-2])/(spp_loop[-1]+spp_loop[-2]))**2)))
t = time()
spp = f_pp(adj, c, m, None, False, nnp)
print('pp: {:8.5f} sec'.format(time()-t))
def comp(a, b):
    print(np.allclose(a, b))
    print('abserr', np.nanmax(np.abs(a-b)), 'relerr', np.nanmax(2 * np.abs(a-b)/(a+b)))
print('fixed points equal:', end=' ')
comp(spp_loop[-1], spp)

Answer 2

I'll use a NumPy array to hold the output (for ease of creation and indexing) and networkx to generate an example graph and provide sets of neighbors. Neither module is essential, as I'm not using any of their advanced features. Here is the setup:

import numpy as np
import networkx as nx

G = nx.petersen_graph()    # example graph
vlist = list(G.nodes)      # list of vertices
V = len(vlist)             # number of vertices
kmax = 4                   # maximal k
C = 2                      # mysterious constant C
R = np.zeros((kmax, V, V)) # array of results, R[k, i, j]
np.fill_diagonal(R[0], 1)  # initialize R[0] by putting 1 on its diagonal

And here is the loop calculating R[k+1, i, j]. The main characters are Lp and Lq, the sets of neighbors of p and q. Python sets are intersected with Lp & Lq, joined with Lp | Lq, and subtracted with Lp - Lq. The lines with Rij assignments are pretty much verbatim copies of the given formulas.

for k in range(kmax-1):
  for i in range(V):
    for j in range(V):
      Lp, Lq = set(G[vlist[i]]), set(G[vlist[j]])
      Rij = len(Lp & Lq) / len(Lp | Lq)
      Rij += sum([R[k, p1, q1] for q1 in Lq for p1 in Lp - Lq]) / (len(Lp | Lq) * len(Lq))
      Rij += sum([R[k, p1, q1] for p1 in Lp for q1 in Lq - Lp]) / (len(Lp | Lq) * len(Lp))
      R[k+1, i, j] = C*Rij
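After the loop, the values can simply be read off the array; a small illustrative example using nothing beyond standard NumPy indexing:

print(R[kmax-1].round(3))   # values after the last computed iteration, all vertex pairs
print(R[kmax-1, 0, 1])      # value for the pair (vlist[0], vlist[1])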