Python, Scipy: Building triplets using large adjacency matrix

2023-03-24 20:23 问答作者：

I am using an adjacency matrix to represent a network of friends which can be visually interpreted as

Mary     0        1      1      1

Joe      1        0      1      1

Bob      1        1      0      1

Susan    1        1      1      0 

         Mary     Joe    Bob    Susan

Using this matrix, I want to compile a list of all possible friendship triangles with the condition that user 1 is friends with user 2, and user 2 is friends with user 3. For my list, it is not required that user 1 is friends with user 3.

(joe, mary, bob)
(joe, mary, susan)
(bob, mary, susan)
(bob, joe, susan)

I have a bit of code that works well with small triangles, but I need it to scale for very large sparse matrices.

from numpy import *
from scipy import *

def buildTriangles(G):
    # G is a sparse adjacency matrix
    start = time.time()
    ctr = 0
    G = G + G.T          # I do this to make sure it is symmetric
    triples = []
    for i in arange(G.shape[0] - 1):  # for each row but the last one
        J,J = G[i,:].nonzero()        # J: primary friends of user i
                                      # I do J,J because I do not care about the row values
        J = J[ J < i ]                # only computer the lower triangle to avoid repetition
        for j in J:
            K, buff = G[:,j].nonzero() # K: secondary friends of user i
            K = K[ K > i ]             # only compute below i to avoid repetition
            for k in K:
                ctr = ctr + 1
                triples.append( (i,j,k) )
    print("total number of triples: %d" % ctr)
    开发者_如何学JAVAprint("run time is %.2f" % (time.time() - start())
    return triples

I was able to run the code on a csr_matrix in approximately 21 minutes. The matrix was 1032570 x 1032570 and contained 88910 stored elements. There were a total of 2178893 triplets generated.

I need to be able to do something similar with a 1968654 x 1968654 sparse matrix with 9428596 stored elements.

I'm very new to python (little less than a month of experience) and not the greatest at linear algebra, which is why my code does not take advantage of matrices operations. Can anyone make any suggestions for improvement or let me know if my objective is even realistic?

I think you can find triangles only in rows or columns. for example:

Susan    1        1      1      0 
        Mary     Joe    Bob    Susan

this means Mary, Joe, Bob are all friends of Susan, so, use combinations to choose two person from [Mary, Joe, Bob], and combine it with Susan will get one triangle. itertools.combinations() do this quickly.

Here is the code:

import itertools
import numpy as np

G = np.array(   # clear half of the matrix first
    [[0,0,0,0],
     [1,0,0,0],
     [1,1,0,0],
     [1,1,1,0]])
triples = []     
for i in xrange(G.shape[0]):
    row = G[i,:]
    J = np.nonzero(row)[0].tolist() # combinations() with list is faster than NumPy array.
    for t1,t2 in itertools.combinations(J, 2):
        triples.append((i,t1,t2))
print triples

Here's some suggestions for optimization:

K = K[ K > i ]             # only compute below i to avoid repetition
for k in K:
    ctr = ctr + 1
    triples.append( (i,j,k) )

Don't increment in a loop, it's terribly slow. Just ctr += K.shape[0] will do. Then, eliminate the most deeply nested loop altogether by replacing the append with

triples += ((i, j, k) for k in K[K > i])

Now, if you want real performance on this task, you will have to get into some linear algebra. "I want to compile a list of all possible friendship triangles" means that you want to square the adjacency matrix, which you can do with a simple **2.

Then realize that 1.968.654² means a very big matrix, and even though it's very sparse, its square will be much less so and will take a lot of memory. (I once tackled a similar problem where I considered links between Wikipedia articles at distance two, which took 20 minutes to solve, on a supercomputer cluster node, in C++. This is not a trivial problem. The Wikipedia adjacency matrix was a few orders of magnitude denser, though.)

继续阅读：adjacency-matrix data-mining numpy python scipy

Python, Scipy: Building triplets using large adjacency matrix

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？