开发者

Challenge,how to implement an algorithm for six degree of separation?

UserA-UserB-UserC-UserD-UserF

Users connected by '-' know each other.

And I need an algorithm for these 2 tasks:

  1. Calculate the path from UserX to UserY
  2. For UserX,calculate all users that is no more than 3 steps away.

Is there an efficient solution?

EDIT

开发者_StackOverflow社区My purpose is not to prove it right or wrong,but to calculate the result real time when necessary.

Plus,I think the most expressive way is code,even pseudo ones.

EDIT AGAIN

I've decided that this kind of job must be done inside database,so it must be a sql solution!


Represent this list of users by a graph

  • Each user is a node
  • There is an edge between any two users who know each other
  1. Calculate the path from UserX to UserY
  2. For UserX,calculate all users that is no more than 3 steps away.

These questions are so closely related that the same algorithm actually solves both of them. You can call it "Dijkstra's algorithm with all edges having weight 1," or "breadth-first search."

Essentially, starting at the first node, visit all its relatives; then mark them all as visited, record the shortest path to each (the shortest path to them + the edge you just traversed), and repeat for each of them. Stop after you've reached your destination for Problem #1, stop after the shortest path is > 3 for Problem #2.

This will run in O(n) time. No, there is no faster way.

The fastest O(n) algorithm for six-degrees of separation would probably be finding the sets of all users 1-step away from UserX and UserY, and finding the intersection of those two sets. If none, then add the users 2-steps from UserX and intersect; then add users 2-steps from UserY and intersect; etc. up to 3.

If each person has an average of 100 friends, this could require looking at up to about 2,020,200 users, as opposed to the 1,010 billion for Dijkstra's algorithm. In practice, these numbers would be much smaller, since often two of your friends are also friends with each other.

This is the only method of solving six-degrees of separation (of those mentioned so far) that will work in practice.


Graph algorithms can help you here. Learning about them is fun too!

  • Graph connectivity for the connectivity.
  • Dijkstra (A*) for paths between users
  • Simple DFS for finding all users N nodes away from your user

Dijkstra or similar should be used if you want to find the shortest path between two users. There may be several paths between them, and Dijkstra takes care of noting when it found a shorted path than another that was found before.

To find the shortest paths between all users you'll have to use something like Floyd-Warshall. It is a nice example of dynamic programming and is quite simple to implement. The pseudocode from Wikipedia is:

 1 /* Assume a function edgeCost(i,j) which returns the cost of the edge from i to j
 2    (infinity if there is none).
 3    Also assume that n is the number of vertices and edgeCost(i,i) = 0
 4 */
 5
 6 int path[][];
 7 /* A 2-dimensional matrix. At each step in the algorithm, path[i][j] is the shortest path
 8    from i to j using intermediate vertices (1..k−1).  Each path[i][j] is initialized to
 9    edgeCost(i,j) or infinity if there is no edge between i and j.
10 */
11
12 procedure FloydWarshall ()
13    for k := 1 to n
14       for i := 1 to n
15          for j := 1 to n
16             path[i][j] = min ( path[i][j], path[i][k]+path[k][j] );


I have a suggestion which is quite different from those you already have. If you have to stick with an SQL database and you don't know any java this suggestion won't be of much use.

You problem is specifically a graph problem, so I would suggest that while using an SQL database to store the graph will work, a different approach would be to use a solution that is intended specifically for graph problems.

The Neo4j project provides a disk-based graph database, together will lots of graph algorithms to work with it. Quote:

Neo4j is a graph database. It is an embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables. A graph (mathematical lingo for a network) is a flexible data structure that allows a more agile and rapid style of development.

An appropriate example of using Neo4j on their wiki demonstrates a degrees-of-separation web application using IMDB data. The example illustrates shortest-path calculations between any actor and Kevin Bacon.

I like this example since it talks a lot about modeling the domain your graph will represent. Modeling your domain ensures you have thought about things like:

  1. Directed vs Undirected
  2. Edge types/relationships
  3. Attributes such as edge weights

As has been mentioned in other posts, there are a number of algorithms for calculating shortest-paths, such as Dijkstra, Floyd Warshall or BFS. All these have been implemented in Neo4j and some examples are provided here.


Assuming source data is in a table: Connections:(PersonID, KnowsPersonID)

1) This will have to use a breadth first approach. Your potential for good performance is limited due to the exponential nature of the problem (although this exponential nature is why theoretically you only need 6 degrees :D). Make sure you limit the depth of your search. Whichever flavour of SQL you choose you'll probably be better off using its iterative extensions as opposed to a pure set based solution.

The following would be the basic approach using Microsoft's T-SQL:

CREATE PROCEDURE FindPath (@UserX int, @UserY int)

CREATE TABLE #SixDegrees(
  ConnectedUser int,
  Depth int,
  Path varchar(100),
  CONSTRAINT PK_SixDegrees PRIMARY KEY CLUSTERED (
    ConnectedUser
  )
)

DECLARE @Depth int,
        @PathFound varchar(100)
SET @Depth = 0

INSERT INTO #SixDegrees (@UserX, 0, CAST(@UserX as varchar))
/*Next line just in case X & Y are the same*/
SET @PathFound = (SELECT Path 
                  FROM #SixDegrees 
                  WHERE ConnectedUser = @UserY)

WHILE @Depth < 6 AND @PathFound IS NULL
BEGIN
  SET @Depth = @Depth + 1
  INSERT INTO #SixDegrees
  SELECT  k.KnowsPersonID, @Depth, (SELECT Path 
                                    FROM #SixDegrees 
                                    WHERE ConnectedUser = k.Link) + ',' + CAST(k.KnowsPersonID AS varchar)
  FROM (
      SELECT  MIN(ConnectedUser) Link, KnowsPersonID
      FROM    #SixDegrees
              JOIN Connections ON
                PersonID = ConnectedUser
      WHERE   Depth = @Depth
              /*EDIT: Added the following*/
              AND KnowsPersonID NOT IN (
                  SELECT  ConnectedUser
                  FROM    #SixDegrees
                  )
      GROUP BY KnowsPersonID
      ) k

  SET @PathFound = (SELECT Path 
                    FROM #SixDegrees 
                    WHERE ConnectedUser = @UserY)
END

IF @Path IS NULL
  PRINT 'No path found'
ELSE
  PRINT @Path
GO

EDIT: In the above solution I originally forgot to exclude users already in the #SixDegrees temp table.

2) A little tweak on the above to always loop to a depth of 3 would leave you with #SixDegrees containing all the Users you're interested in.

However, the following pure set based solution should be more efficient:

SELECT  DISTINCT KnowsPersonID
FROM    Connections
WHERE   PersonID IN (
    SELECT  DISTINCT KnowsPersonID
    FROM    Connections
    WHERE   PersonID IN (
        SELECT  KnowsPersonID
        FROM    Connections
        WHERE   PersonID = @UserX
        ) l1
    ) l2


The following scripts are written in sybase sql. You might have to go through minor modifications according to your db server.

Problem 1.

create table #connections (
    my_user  varchar(10)  not null  ,
    knows varchar(10)  not null  ,
        CONSTRAINT connection_pk PRIMARY KEY CLUSTERED ( my_user, knows)   
) 

create table #traversed (id varchar(10) primary key)

insert into #connections VALUES ('UserA','UserB')
insert into #connections VALUES ('UserB','UserA')
insert into #connections VALUES ('UserB','UserC')
insert into #connections VALUES ('UserC','UserB')
insert into #connections VALUES ('UserC','UserD')
insert into #connections VALUES ('UserD','UserC')
insert into #connections VALUES ('UserD','UserF')
insert into #connections VALUES ('UserF','UserD')

DECLARE @str_sql   varchar(200)               
DECLARE @str_order varchar(60)

declare @start varchar(10)
set @start = ('UserD')
declare @end varchar(10)
set @end = ('UserA')

if (@start >= @end)
    set @str_order = " order by id desc"
else
    set @str_order = " order by id asc"


INSERT INTO #traversed VALUES (@start)

WHILE (select count(*) from #traversed where id = @end) = 0    
BEGIN     
  INSERT INTO #traversed (id)    
  SELECT DISTINCT knows  
  FROM #connections e JOIN #traversed p ON p.id = e.my_user  
  WHERE e.knows NOT IN (SELECT id FROM #traversed)     
  AND e.knows between (select case when @start < @end then @start else @end end)  
      and (select case when @start < @end then @end  else @start end) 
END

set @str_sql = "SELECT #traversed.id FROM #traversed" + @str_order 
exec (@str_sql)

Problem 2.

create table #connections (
    my_user  varchar(10)  not null  ,
    knows varchar(10)  not null  ,
        CONSTRAINT connection_pk PRIMARY KEY CLUSTERED ( my_user, knows)   
) 

create table #traversed (id varchar(10) primary key)

insert into #connections VALUES ('UserA','UserB')
insert into #connections VALUES ('UserB','UserA')
insert into #connections VALUES ('UserB','UserC')
insert into #connections VALUES ('UserC','UserB')
insert into #connections VALUES ('UserC','UserD')
insert into #connections VALUES ('UserD','UserC')
insert into #connections VALUES ('UserD','UserF')
insert into #connections VALUES ('UserF','UserD')

declare @start varchar(10)
set @start = ('UserB')

declare @higher_counter int
declare @lower_counter int

set @higher_counter = 0
set @lower_counter = 0

INSERT INTO #traversed VALUES (@start)

WHILE (@higher_counter < 3)
BEGIN     
  INSERT INTO #traversed (id)    
  SELECT DISTINCT knows  
  FROM #connections e JOIN #traversed p ON p.id = e.my_user  
  WHERE e.knows NOT IN (SELECT id FROM #traversed)     
  AND e.knows > @start 

  set @higher_counter = @higher_counter +1
END  

WHILE (@lower_counter < 3)
BEGIN     
  INSERT INTO #traversed (id)    
  SELECT DISTINCT knows  
  FROM #connections e JOIN #traversed p ON p.id = e.my_user  
  WHERE e.knows NOT IN (SELECT id FROM #traversed)     
  AND e.knows < @start 

  set @lower_counter = @lower_counter +1
END   

SELECT #traversed.id FROM #traversed


I had a look at this a while ago and could not come up with an efficient solution for a web application.

I ended up with 5 levels rather than six

See here for my google group post there is a SQL and a C# solution.

Note: that you should google for "Dijkstra's algorithm" as it is known to be a good algorithm to find a shortest path.

Edit: Try this link

BTW The CLR method executed the fastest.


The first question can be solved using dijkstra's algorithm. The second on using a DFS algorithm. This has already been said by the other guys, just wanted to point out, that the most efficient solution for both problems is not available in one algorithm.

pseudocode can be found at:

[Wikipedia][1]

for dijkstra and one in python for DFS at:

http://en.wikipedia.org/wiki/Depth-first_search


For task 2, you're not going to do better than breadth-first search, except maybe by caching.

For task 1, apply your solution for task 2. Find all users no more than 3 hops away from user X. Of course if user Y is in that set, you're done. If not, do a breadth-first search starting from user Y, and stop as soon as you reach any user that you already know is reachable from X.

(If you cache a little information about how you reached each user during task 2, then it will be easy to reconstruct the exact path when you find a link in task 1.)


(This answer is equivalent to Djikstra's. It is basically an implementation detail.)

To answer #2, you can use boolean matrix multiplication to determine connectivity up to degree P.

Assuming you have a boolean matrix M where:

M(A, B)= A is directly connected to B

Then

(M(A, B))^P= A is connected to B within P links.

The matrix multiplication should use AND for multiply and OR for add:

You can optimize this a lot by only doing the multiplication for entries that were previously false as well as realizing that the matrix is symmetric. That is left as an exercise to the reader.


There is actually a reasonably efficient way to do this with MariaDB using OQGraph.

Assuming the data is contained in two tables:

CREATE TABLE `entity` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `type` enum('ACTOR','MOVIE','TV MOVIE','TV MINI','TV SERIES','VIDEO MOVIE','VIDEO GAME','VOICE','ARCHIVE') NOT NULL,
  `name` varchar(128) COLLATE utf8_unicode_ci NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `type` (`type`,`name`) USING BTREE
) ENGINE=InnoDB;

CREATE TABLE `link` (
  `rel_id` int(11) NOT NULL AUTO_INCREMENT,
  `link_from` int(11) NOT NULL,
  `link_to` int(11) NOT NULL,
  PRIMARY KEY (`rel_id`),
  KEY `link_from` (`link_from`,`link_to`),
  KEY `link_to` (`link_to`)
) ENGINE=InnoDB;

The OQGraph virtual table can then be declared as:

CREATE TABLE movie_graph (
  latch SMALLINT UNSIGNED NULL,
  origid BIGINT UNSIGNED NULL,
  destid BIGINT UNSIGNED NULL,
  weight DOUBLE NULL,
  seq BIGINT UNSIGNED NULL,
  linkid BIGINT UNSIGNED NULL,
  KEY (latch, origid, destid) USING HASH,
  KEY (latch, destid, origid) USING HASH
) ENGINE=OQGRAPH 
  data_table='link' origid='link_from' destid='link_to';

Then the data can be queried:

MariaDB [imdb]> SELECT
             -> GROUP_CONCAT(name ORDER BY seq SEPARATOR ' -> ') AS path
             -> FROM imdb_graph JOIN entity ON (id=linkid)
             -> WHERE latch=1
             -> AND origid=(SELECT a.id FROM entity a
             ->             WHERE name='Kevin Bacon')
             -> AND destid=(SELECT b.id FROM entity b
                            WHERE name='N!xau')\G
*************************** 1. row ***************************
path: Kevin Bacon -> The 45th Annual Golden Globe Awards (1988) -> Richard Attenborough -> In Darkest Hollywood: Cinema and Apartheid (1993) -> N!xau
1 row in set (10 min 6.55 sec)

Graph of approximately 3.7 million nodes with 30 million edges. Tables are about 3.5GB and InnoDB configured for 512MB buffer pool on laptop hard disk . Approximately 16 million secondary key reads. Cold, no data preloaded into buffer pool. 2010 MacBook Pro.

Of course, it's a lot faster if the table could be held in the buffer pool.

This example comes from: https://www.slideshare.net/AntonyTCurtis/oqgraph-scale2013-17332168/21


Google it and you'll find a lot.

I doubt you can find a pseudo code (at least I've yet to). Here are some of the interesting reads:

"Six Degrees of Separation" Explained In New Computer Algorithm
CU computer scientist helps explain how 'six degrees of separation' works
SO - How can I prove the “Six Degrees of Separation” concept programmatically?


I'm responding to the SQL solution only. This gives all the paths 3 steps away though it may not be "efficient" for large datasets. The tables "KNOW", "KNOW_1", etc. are all identical and have two fields P1 and P2. It has an entry only if 1) P1 knows P2 or 2) P2 knows P1. The data in P1 and P2 can be arbitrary strings corresponding to each person.

This Acccess SQL query should yield all the paths where a knows b knows c knows d without cycles (e.g a knows b knows c knows a). You still need to eliminate duplicates (abcd = dcba), but should be able to do it easily in a second step. The cycles are eliminated by preventing repeats of previous people in the Where statements.

SELECT Know.P1, Know.P2, Know_1.P2, Know_2.P2

FROM (Know INNER JOIN Know AS Know_1 ON Know.P2 = Know_1.P1)

INNER JOIN Know AS Know_2 ON Know_1.P2 = Know_2.P1

WHERE (((Know_1.P2)<>[Know].[P1]) AND ((Know_2.P2)<>[Know].[P1] And (Know_2.P2)<>[Know].[P2]))

ORDER BY Know.P1, Know.P2, Know_1.P2, Know_2.P2;

Not as elegant as the previous solutions, but it seems to work OK. We've had some experience with doing similar work with constraint programming and found that the SQL process is faster for certain problems.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜