Given a directed graph G with node pair (p,q) we have
I want to calculate the value of this recursive function where L(p) denotes the set of undirected link neighbors of node p. This function is for the (k+1)th value.
I know how to calculate L(p) and L(q).
Here is my attempt-
from __future__ import division import copy import numpy as np import scipy import scipy.sparse as sp import time import warnings class Algo(): def __init__(self, c=0.8, max_iter=5, is_verbose=False, eps=1e-3): self.c = c self.max_iter = max_iter self.is_verbose = is_verbose self.eps = eps def compute_similarity(self, a): ''' Note: The adjacency matrix is a (0,1)-matrix with zeros on its diagonal. ''' a_transpose=np.transpose(a) a_undirected=np.add(a,a_transpose) self.num_nodes = np.shape(a)[0] current_sim = np.eye(self.num_nodes) prev_sim = np.zeros(shape=(self.num_nodes, self.num_nodes)) #Determine the set of Neighbors for each node nbs = [] for i in range(self.num_nodes): nbs.append(np.nonzero(a_undirected[:, i])[0]) for itr in range(self.max_iter): if scipy.allclose(current_sim, prev_sim, atol=self.eps): break prev_sim = copy.deepcopy(current_sim) for i in range(self.num_nodes): for j in range(self.num_nodes): if i == j: continue if len(nbs[i]) == 0 or len(nbs[j]) == 0: current_sim[i][j] = 0 else: #commonSet=0.0 differenceSetofp_ij=0.0 differenceSetofq_ij=0.0 commonSet=len(np.intersect1d(nbs[i],nbs[j]))/len(np.union1d(nbs[i],nbs[j])) for x in np.setdiff1d(nbs[i],nbs[j]): for y in nbs[j]: differenceSetofp_ij+=prev_sim[x][y] differenceSetofp_ij*=1/(len(np.union1d(nbs[i],nbs[j]))*len(nbs[j])) #print differenceSetofp for x in np.setdiff1d(nbs[j],nbs[i]): for y in nbs[i]: differenceSetofq_ij+=prev_sim[x][y] differenceSetofq_ij*=1/(len(np.union1d(nbs[j],nbs[i]))*len(nbs[i])) current_sim[i][j] = self.c*(commonSet+differenceSetofp_ij+differenceSetofq_ij) if self.is_verbose: print('SimRank: converged after {0} iterations'.format(itr)) return current_sim if __name__ == "__main__": a = np.array([ [0, 1, 1, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1, 0, 1], [0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1, 0, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0]]) obj = Algo(is_verbose=True, eps=1e-2) start_time = time.time() S = obj.compute_similarity(a) print(S) print('Runtime: ', time.time() - start_time, '\n')
The output I am getting is not correct.
2 Answers
Answers 1
The expected answer R_inf[1,2] = 0.8
and R_inf[6,8] = 0.8
of OP's example a
is the result of the algorithm applied to the directed graph a.T
, not the undirected.
Here is R_2
computed with OP's code (with some minor fixes) and with my code. And R_inf also computed with my code:
after 2 iterations with pp_loop: 0.8000 0 0 0 0 0 0 0.8000 0 0 0.8000 0.8000 0 0 0.4000 0.3200 0 0.3200 0 0.8000 0.8000 0 0 0.4000 0.3200 0 0.3200 0 0 0 0.8000 0.4800 0 0 0 0 0 0 0 0.4800 0.8000 0 0 0 0 0 0.4000 0.4000 0 0 0.8000 0.1600 0 0.1600 0 0.3200 0.3200 0 0 0.1600 0.8000 0 0.8000 0.8000 0 0 0 0 0 0 0.8000 0 0 0.3200 0.3200 0 0 0.1600 0.8000 0 0.8000 after 2 iterations with OP: 0.8000 0 0 0 0 0 0 0.8000 0 0 0.8000 0.8000 0 0 0.4000 0.3200 0 0.3200 0 0.8000 0.8000 0 0 0.4000 0.3200 0 0.3200 0 0 0 0.8000 0.4800 0 0 0 0 0 0 0 0.4800 0.8000 0 0 0 0 0 0.4000 0.4000 0 0 0.8000 0.1600 0 0.1600 0 0.3200 0.3200 0 0 0.1600 0.8000 0 0.8000 0.8000 0 0 0 0 0 0 0.8000 0 0 0.3200 0.3200 0 0 0.1600 0.8000 0 0.8000 in fixed point: 0.8000 0 0 0 0 0 0 0.8000 0 0 0.8000 0.8000 0 0 0.4000 0.3200 0 0.3200 0 0.8000 0.8000 0 0 0.4000 0.3200 0 0.3200 0 0 0 0.8000 0.4800 0.0960 0.0256 0 0.0256 0 0 0 0.4800 0.8000 0.1280 0 0 0 0 0.4000 0.4000 0.0960 0.1280 0.8000 0.1600 0 0.1600 0 0.3200 0.3200 0.0256 0 0.1600 0.8000 0 0.8000 0.8000 0 0 0 0 0 0 0.8000 0 0 0.3200 0.3200 0.0256 0 0.1600 0.8000 0 0.8000
My solution is faster and can compute the fixed point for networks up to ~1000
nodes.
A bit of explanation: The strategy is to express the recurrence relation by matrix algebra, because then we can leverage the highly optimized blas
powered linear algebra that comes with numpy/scipy. For example, the inner sums - all of them in one go - can be done by multiplying the matrix R_k of current values with the transposed adjacency matrix. This may seem wasteful considering all the zeros in the adjacency matrix but blas
is just very, very fast. Similarly, we can compute the sizes of the intersections by multiplying the adjacency matrix with its transpose, etc.
Assuming that the ultimate goal is not so much the sequence of R's but the fixed point we can save some more time by using a proper linear solver, because the recurrence is linear (+ offset). As writing the n^2xn^2
matrix of this operation is not efficient we use an iterative solver that can use a LinearOperator
in place of a matrix.
If tolerances are set tight enough both approaches yield the same fixed point.
Code including OP's with fixes:
import numpy as np from scipy import sparse from scipy.sparse import linalg def mock_data(n, p): data = (np.random.random((n, n)) < p).astype(int) np.einsum('ii->i', data)[:] = 0 return data import copy import scipy import scipy.sparse as sp import time import warnings class Algo(): def __init__(self, c=0.8, max_iter=5, is_verbose=False, eps=1e-3): self.c = c self.max_iter = max_iter self.is_verbose = is_verbose self.eps = eps def compute_similarity(self, a, R0=None, use_undirected=False, no_neighbor_policy='10'): ''' Note: The adjacency matrix is a (0,1)-matrix with zeros on its diagonal. ''' a_transpose=np.transpose(a) a_undirected=np.add(a,a_transpose) a_use = a_undirected if use_undirected else a self.num_nodes = np.shape(a)[0] current_sim = np.eye(self.num_nodes) if R0 is None else R0 prev_sim = np.zeros(shape=(self.num_nodes, self.num_nodes)) #Determine the set of Neighbors for each node nbs = [] for i in range(self.num_nodes): nbs.append(np.nonzero(a_use[i])[0]) R = [] for itr in range(self.max_iter): if scipy.allclose(current_sim, prev_sim, atol=self.eps): break prev_sim = copy.deepcopy(current_sim) R.append(prev_sim) for i in range(self.num_nodes): for j in range(self.num_nodes): if len(nbs[i]) == 0 and len(nbs[j]) == 0: current_sim[i][j] = ( 0 if (no_neighbor_policy=='00' or (no_neighbor_policy=='10' and i!=j)) else self.c) elif i == j: current_sim[i][j] = self.c elif len(nbs[i]) == 0 or len(nbs[j]) == 0: current_sim[i][j] = 0 else: #commonSet=0.0 differenceSetofp_ij=0.0 differenceSetofq_ij=0.0 commonSet=len(np.intersect1d(nbs[i],nbs[j]))/np.maximum(len(np.union1d(nbs[i],nbs[j])), 1) for x in np.setdiff1d(nbs[i],nbs[j]): for y in nbs[j]: differenceSetofp_ij+=prev_sim[x][y] differenceSetofp_ij*=1/np.maximum(len(np.union1d(nbs[i],nbs[j]))*len(nbs[j]), 1) #print differenceSetofp for x in np.setdiff1d(nbs[j],nbs[i]): for y in nbs[i]: differenceSetofq_ij+=prev_sim[x][y] differenceSetofq_ij*=1/np.maximum(len(np.union1d(nbs[j],nbs[i]))*len(nbs[i]), 1) current_sim[i][j] = self.c*(commonSet+differenceSetofp_ij+differenceSetofq_ij) if self.is_verbose: print('SimRank: converged after {0} iterations'.format(itr)) R.append(current_sim) return np.array(R) def f_OP(adj, c, max_n, R0=None, use_undirected=False, no_neighbor_policy='10'): return Algo(c, max_n).compute_similarity(adj, R0, use_undirected, no_neighbor_policy) def f_pp(adj, c, max_n, R0=None, use_undirected=False, no_neighbor_policy='10', solver=linalg.gcrotmk): n, n = adj.shape if use_undirected: adj = adj | adj.T aind = np.where(adj.T) n_neigh = adj.sum(axis=1) n_inter = adj @ adj.T n_union = np.add.outer(n_neigh, n_neigh) - n_inter r_union = c / np.maximum(n_union, 1) if no_neighbor_policy == '11': n_inter[n_union==0] = 1 r_union[n_union==0] = c elif no_neighbor_policy == '10': np.einsum('ii->i', n_inter)[np.einsum('ii->i', n_union)==0] = 1 np.einsum('ii->i', r_union)[np.einsum('ii->i', n_union)==0] = c elif no_neighbor_policy != '00': raise ValueError avg_ngh = adj.T / np.maximum(n_neigh, 1) if solver is None: R = np.empty((max_n+1, n, n)) R[0] = 0 if R0 is None else R0 if R0 is None: np.einsum('ii->i', R[0])[:] = 1 for i in range(max_n): rowsums = R[i] @ avg_ngh rowsums[aind] = 0 rec_sum = adj @ rowsums R[i+1] = r_union * (n_inter + rec_sum + rec_sum.T) if np.allclose(R[i+1], R[i], rtol=1e-7, atol=1e-10): return R[:i+2] return R else: def LO(x): R = x.reshape(n, n) rowsums = R @ avg_ngh rowsums[aind] = 0 rec_sum = adj @ rowsums return (r_union * (rec_sum + rec_sum.T) - R).ravel() R0 = np.identity(n) if R0 == None else R0 LO = linalg.LinearOperator((n*n, n*n), matvec=LO)#, rmatvec=LO) res = solver(LO, (-r_union * n_inter).ravel(), R0.ravel(), tol=1e-9) assert res[1]==0 return res[0].reshape(n, n) def f_pp_loop(adj, c, max_n, R0=None, use_undirected=False, no_neighbor_policy='10'): return f_pp(adj, c, max_n, R0, use_undirected, no_neighbor_policy, None) # test on OP's example: a = np.array([ [0, 1, 1, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1, 0, 1], [0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1, 0, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0]]) a = a.T m, c, nnp = 2, 0.8, '11' sol = {} for f in f_pp_loop, f_OP: name = f.__name__[2:] sol[name] = f(a, c, m, None, False, nnp) print(f"after {len(sol[name])-1} iterations with {name}:") print("\n".join(" ".join("0 " if i == 0 else f"{i:6.4f}" for i in row) for row in sol[name][-1])) print("in fixed point:") print("\n".join(" ".join("0 " if i == 0 else f"{i:6.4f}" for i in row) for row in f_pp(a, c, 64, None, False, nnp))) print() # demo large network n, m, p, c, nnp = 1000, 256, 0.02, 0.9, '11' adj = mock_data(n, p) print(f'size {n}, maxiter {m}, connection prob {p}, c {c}') from time import time t = time() spp_loop = f_pp_loop(adj, c, m, None, False, nnp) print('pp_loop: {:8.5f} sec'.format(time()-t)) print('squared rel err after {} iterations: {}'.format(len(spp_loop)-1, np.nanmax(((spp_loop[-1]-spp_loop[-2])/(spp_loop[-1]+spp_loop[-2]))**2))) t = time() spp = f_pp(adj, c, m, None, False, nnp) print('pp: {:8.5f} sec'.format(time()-t)) def comp(a, b): print(np.allclose(a, b)) print('abserr', np.nanmax(np.abs(a-b)), 'relerr', np.nanmax(2 * np.abs(a-b)/(a+b))) print('fixed points equal:', end=' ') comp(spp_loop[-1], spp)
Answers 2
I'll use a NumPy array to hold the output (for ease of creation and indexing) and networkx for generating an example of a graph and providing sets of neighbors. Neither module is essential, as I'm not using any advanced features of them. Here is the setup:
import numpy as np import networkx as nx G = nx.petersen_graph() # example graph vlist = list(G.nodes) # list of vertices V = len(vlist) # number of vertices kmax = 4 # maximal k C = 2 # mysterious constant C R = np.zeros((kmax, V, V)) # array of results, R[k, i, j] np.fill_diagonal(R[0], 1) # initialize R[0] by putting 1 on its diagonal
And here is the loop calculating R[k+1, i, j]. The main characters are Lp and Lq, the sets of neighbors of p and q. Python sets are intersected with Lp & Lq
, joined with Lp | Lq
, and subtracted with Lp - Lq
. The lines with Rij assignments are pretty much verbatim copies of the given formulas.
for k in range(kmax-1): for i in range(V): for j in range(V): Lp, Lq = set(G[vlist[i]]), set(G[vlist[j]]) Rij = len(Lp & Lq) / len(Lp | Lq) Rij += sum([R[k, p1, q1] for q1 in Lq for p1 in Lp - Lq]) / (len(Lp | Lq) * len(Lq)) Rij += sum([R[k, p1, q1] for p1 in Lp for q1 in Lq - Lp]) / (len(Lp | Lq) * len(Lp)) R[k+1, i, j] = C*Rij
0 comments:
Post a Comment