Trying to group values?
I have some data like this:
1 2
3 4
5 9
2 6
3 7
and am looking for an output like this (group-id and the members of that group):
1: 1 2 6
2: 3 4 7
3: 5 9
First row because 1 is "connected" to 2 and 2 is connected to 6. Second row because 3 is connected to 4 and 3 is connected to 7
This looked to me like a graph traversal but the final order does not matter so I was wondering if someone can suggest a simpler solution that I can use on a large dat开发者_如何转开发aset (billions of entries).
From the comments:
- The problem is to find the set of disjoint sub-graphs given a set of edges.
- The edges are not directed; the line '1 2' means that 1 is connected to 2 and 2 is connected to 1.
- The '1:' in the sample output could be 'A:' without changing the meaning of the answer.
EDIT 1:
Problem looks solved now. Thanks to everyone for their help. I need some more help picking the best solution that can be used on billions of such entries.
EDIT 2:
Test Input file:
1 27
1 134
1 137
1 161
1 171
1 275
1 309
1 413
1 464
1 627
1 744
2 135
2 398
2 437
2 548
2 594
2 717
2 738
2 783
2 798
2 912
5 74
5 223
7 53
7 65
7 122
7 237
7 314
7 701
7 730
7 755
7 821
7 875
7 884
7 898
7 900
7 930
8 115
9 207
9 305
9 342
9 364
9 493
9 600
9 676
9 830
9 941
10 164
10 283
10 380
10 423
10 468
10 577
11 72
11 132
11 276
11 306
11 401
11 515
11 599
12 95
12 126
12 294
13 64
13 172
13 528
14 396
15 35
15 66
15 210
15 226
15 360
15 588
17 263
17 415
17 474
17 648
17 986
21 543
21 771
22 47
23 70
23 203
23 427
23 590
24 286
24 565
25 175
26 678
27 137
27 161
27 171
27 275
27 309
27 413
27 464
27 627
27 684
27 744
29 787
Benchmarks:
I tried out everything and the version posted by TokenMacGuy is the fastest on the sample dataset that I tried. The dataset has about 1 million entries for which it took me about 6 seconds on a Dual Quad-Core 2.4GHz machine. I haven't gotten a chance to run it on the entire dataset yet but I will post the benchmark as soon as it is available.
I've managed O(n log n).
Here is a (somewhat intense) C++ implementation:
#include <boost/pending/disjoint_sets.hpp>
#include <boost/property_map/property_map.hpp>
#include <map>
#include <set>
#include <iostream>
typedef std::map<int, int> rank_t;
typedef std::map<int, int> parent_t;
typedef boost::associative_property_map< rank_t > rank_pmap_t;
typedef boost::associative_property_map< parent_t > parent_pmap_t;
typedef boost::disjoint_sets< rank_pmap_t, parent_pmap_t > group_sets_t;
typedef std::set<int> key_set;
typedef std::map<int, std::set<int> > output;
With some typedefs out of the way, here's the real meat. I'm using boost::disjoint_sets, which is just happens to be an exceptionally good representation for the problem. This first function checks to see if either of the keys given have been seen before, and adds them to the collections if needed. the important part is really the union_set(a, b)
which links the two sets together. If one or the other of the sets are already in the groups
collection, they get linked too.
void add_data(int a, int b, group_sets_t & groups, key_set & keys)
{
if (keys.count(a) < 1) groups.make_set(a);
if (keys.count(b) < 1) groups.make_set(b);
groups.union_set(a, b);
keys.insert(a);
keys.insert(b);
}
This isn't too exciting, it just iterates through all of the keys we've seen and gets the representative key for that key, then adds the pair (representative, key) to a map. Once that's done, print out the map.
void build_output(group_sets_t & groups, key_set & keys)
{
output out;
for (key_set::iterator i(keys.begin()); i != keys.end(); i++)
out[groups.find_set(*i)].insert(*i);
for (output::iterator i(out.begin()); i != out.end(); i++)
{
std::cout << i->first << ": ";
for (output::mapped_type::iterator j(i->second.begin()); j != i->second.end(); j++)
std::cout << *j << " ";
std::cout << std::endl;
}
}
int main()
{
rank_t rank;
parent_t parent;
rank_pmap_t rank_index(rank);
parent_pmap_t parent_index(parent);
group_sets_t groups( rank_index, parent_index );
key_set keys;
int a, b;
while (std::cin >> a)
{
std::cin >> b;
add_data(a, b, groups, keys);
}
build_output(groups, keys);
//std::cout << "number of sets: " <<
// groups.count_sets(keys.begin()), keys.end()) << std::endl;
}
I stayed up many hours learning how to use boost::disjoint_sets
on this problem. There doesn't seem to be much of any documentation on it.
About the performance. The disjoint_sets
structure is O(α(n) ) for its key operations (make_set
, find_set
and union_set
) which is pretty close to constant, and so if it were just a matter of building the structure, the whole algorithm would be O(n α(n) ) (which is effectively O(n) ) but we have to print it out. That means we have to build up some associative containers, which cannot perform better than O(n log n). It might be possible to get a constant factor speedup by choosing a different associative containers (say, hash_set
etc), since once you populate the initial list, you can reserve an optimal amount of space.
Ok so I got something working in parallel to the other solution posted by @Jonathan (first of all, many thanks for your time). My solution looks deceptively simple but would love some suggestions on whether this is correct (maybe I'm missing a corner case somewhere?) because it seems to produce the output I wanted but I'll have to parse it in a second pass to group the same group numbers which is trivial. The logic is that everytime it finds a new number not in the array it increments a group_id counter:
My code in PHP:
<?php
//$fp = fopen("./resemblance.1.out", "r");
$fp = fopen("./wrong", "r");
$groups = array();
$group["-1"] = 1;
$groups[] = $group;
$map = array();
//Maintain a count
$group = 1;
while(!feof($fp)) {
$source = trim(fgets($fp, 4096));
//echo $source."\n";
$source = explode(" ", $source);
if(array_key_exists($source[0], $map) && !array_key_exists($source[1], $map)) {
$map[$source[1]] = $map[$source[0]];
} else if(array_key_exists($source[1], $map) && !array_key_exists($source[0], $map)) {
$map[$source[0]] = $map[$source[1]];
} else if(array_key_exists($source[1], $map) && array_key_exists($source[0], $map) && $map[$source[1]] != $map[$source[0]]) {
// Adjust the groups - change the groups of one of the elements to the other
$keys = array_keys($map, $map[$source[1]]);
print_r($keys);
foreach($keys as $key) {
$map[$key] = $map[$source[0]];
}
} else {
$group++;
$map[$source[0]] = $group;
$map[$source[1]] = $group;
}
}
print_r($map);
?>
Output:
Array
(
[1] => 2
[2] => 2
[3] => 3
[4] => 3
[5] => 4
[9] => 4
[6] => 2
[7] => 3
[] => 5
)
EDIT: Fixed the bug that was mentioned in the comment. Just playing around out of curiosity :) Feel free to point out any other bugs. I am currently testing out which solution is faster.
Here is a sample Perl solution that works on the original data set:
1 2
3 4
5 9
2 6
3 7
Group 1: 1 2 6
Group 2: 3 4 7
Group 3: 5 9
On the big data set, it produces the output:
Group 1: 1 27 134 137 161 171 275 309 413 464 627 684 744
Group 2: 2 135 398 437 548 594 717 738 783 798 912
Group 3: 5 74 223
Group 4: 7 53 65 122 237 314 701 730 755 821 875 884 898 900 930
Group 5: 8 115
Group 6: 9 207 305 342 364 493 600 676 830 941
Group 7: 10 164 283 380 423 468 577
Group 8: 11 72 132 276 306 401 515 599
Group 9: 12 95 126 294
Group 10: 13 64 172 528
Group 11: 14 396
Group 12: 15 35 66 210 226 360 588
Group 13: 17 263 415 474 648 986
Group 14: 21 543 771
Group 15: 22 47
Group 16: 23 70 203 427 590
Group 17: 24 286 565
Group 18: 25 175
Group 19: 26 678
Group 20: 29 787
Whether it is efficient enough is a separate matter...
use strict;
use warnings;
my %cache = ();
while (<>)
{
chomp;
my($x,$y) = split /\s+/;
#print "$x $y\n";
$cache{$x}{$y} = 1;
$cache{$y}{$x} = 1;
}
my $grp = 1;
foreach my $key (sort { $a <=> $b } keys %cache)
{
#print "key: $key\n";
if (defined $cache{$key})
{
my %result = ();
subkey_search(\%result, $key);
print "Group $grp:";
$grp++;
foreach my $val (sort { $a <=> $b } keys %result)
{
print " $val";
}
print "\n";
}
}
sub subkey_search
{
my($resultref, $key) = @_;
my %hash = %{$cache{$key}};
delete $cache{$key};
$resultref->{$key} = 1;
foreach my $subkey (sort keys %hash)
{
#print "subkey: $subkey\n";
subkey_search($resultref, $subkey) if (defined $cache{$subkey});
}
}
Here's a slightly different version in Python, which builds a graph containing the edges specified, then converts that to a list of connected subgraphs.
I might want to use this later so I wrote it as a general-purpose version that doesn't do input from a file or output with print statements, just converting data structures.
def graph_to_connected_subgraphs(graph):
trees = []
for start in graph.keys():
if start in graph:
list = [start]
append_tree_from(graph, start, list)
trees.append(list)
return trees
def append_tree_from(graph, node, list):
if node in graph:
for endpoint in graph[node]:
list.append(endpoint)
append_tree_from(graph, endpoint, list)
del graph[node]
return list
def add_edge(graph, f, s):
if s < f: # ensure f < s to handle cyclic graphs
f, s = s, f
if f not in graph:
graph[f] = [s]
else:
graph[f].append(s)
graph = {}
add_edge(graph, 1,2)
add_edge(graph, 2,6)
add_edge(graph, 3,4)
add_edge(graph, 5,9)
add_edge(graph, 3,7)
print graph_to_connected_subgraphs(graph)
Output
[[1, 2, 6], [3, 4, 7], [5, 9]]
This is a typical application of DFS (Depth First Search) algorithm performed on graphs. Try read this dfs Complexity of this algorithm is O(|V|+|E|), where V - number of vertices and E - number of edges
Here's my stab at an answer. Python.
groups = []
infile = open("so2.txt")
for line in infile.readlines():
newset = set(line.split())
matchgroups = []
excludegroups = []
for group in groups:
if len(newset & group):
newset |= group
else:
excludegroups.append(group)
groups = excludegroups
groups.append( newset)
for i, s in enumerate(groups):
print "%d: %s"%(i, " ".join(s))
The Idea here is that forming graphs is not really right. Each pair of numbers in the input is a set. The rule is to return only disjoint sets. So I read each line and convert them to sets, then I check all of the existing sets for intersections, and merge those into the new set. Nonintersecting sets are just added to the new list of sets, and once I'm done I add the new, merged set into the new list of sets. This way I can be sure that only disjoint sets make it into the list.
My version in PHP actually is only a refactoring of your code. It fixes one issue in your code (you have one group too much) and is implemented slightly more efficient (Exec time drops from 0.0035 to 0.0020 on slow machine):
$group = 0;
$map = array();
do {
list($a, $b) = explode(' ', fgets($file));
$a = (int) $a;
$b = (int) $b;
if (!isset($map[$a]) && !isset($map[$b])) {
$map[$a] = $map[$b] = ++$group;
} elseif (!isset($map[$b])) {
$map[$b] = $map[$a];
} elseif (!isset($map[$a])) {
$map[$a] = $map[$b];
} elseif ($map[$a] != $map[$b]) {
// move one group to the other
foreach ($map as $n => $g) {
if ($g == $map[$b]) {
$map[$n] = $map[$a];
}
}
}
} while (!feof($file));
// print results
$results = array();
foreach ($map as $val => $group) {
$results[$group][] = $val;
}
echo '<pre>';
$i = 0;
foreach ($results as $result) {
sort($result);
echo 'Group ', ++$i, ': ', implode(' ', $result), "\n";
}
If the input is sorted, like your huge test set, you can do it in Θ(input) time and Ο(input) space. If the input is unsorted, you could modify this fairly easily and get Ο(input log input) time and Θ(input) space.
This is very quick because :
- it stores a hash map from a node to its "ultimate owner" (which is the node with the lowest ID in the connected component) which allows Ο(1) "ultimate owner" lookup and insertion.
- it stores a hash map from each "ultimate owner" to a result line which allows Ο(1) result line lookup and insertion.
- each result line is a linked list which allows Ο(1) appending.
- it stores a linked list of result lines which allows :
- Ο(1) appending.
- access to it without going through the expense of asking the prior hash map for all its values
import java.util.Scanner;
import java.util.Map;
import java.util.HashMap;
import java.util.List;
import java.util.LinkedList;
public final class Solver {
public static void main(String[] args) {
Scanner in = new Scanner(System.in);
final Map<Integer, Integer> ultimateOwners = new HashMap<Integer, Integer>();
final Map<Integer, List<Integer>> ownerToOwned = new HashMap<Integer, List<Integer>>();
final List<List<Integer>> results = new LinkedList<List<Integer>>();
while (in.hasNextInt()) {
// Get ultimate owner.
int owner = in.nextInt();
if (ultimateOwners.containsKey(owner)) owner = ultimateOwners.get(owner);
// Get owned and register its ultimate owner.
final int owned = in.nextInt();
ultimateOwners.put(owned, owner);
// Add owned to result.
if (ownerToOwned.containsKey(owner)) ownerToOwned.get(owner).add(owned);
else {
final List<Integer> resultLine = new LinkedList<Integer>();
resultLine.add(owner);
resultLine.add(owned);
ownerToOwned.put(owner, resultLine);
results.add(resultLine);
}
}
int lineNumber = 1;
for (final List<Integer> line : results) {
System.out.printf("%d: ", lineNumber++);
for (final Integer value : line) {
System.out.printf("%d ", value);
}
System.out.println();
}
}
}
After not being completely satisfied with my first 2 attempts at this, and some research, I came across this recipe for disjoint sets in Python, with Raymond Hettinger's blessing and input. (Raymond Hettinger is a long-time very active Python core developer.)
Here is an adaptation of that recipe that's very close to my first 2 attempts, but the recipe itself may be a better reference.
Collecting should be as efficient as possible for very large sets of data, as much of the set operations in Python are implemented in C. The input data does not have to be sorted. For printing, I sorted outputs solely for readability, but if this affects performance, connections can be printed without sorting.
# ~~~~~
# data, setup
input = '''
1 2
3 4
2 3
''' # etc.
def lgen():
for l in input.splitlines():
l = l.strip()
if l:
yield tuple(int(i) for i in l.split())
# ~~~~~
# collect
connections = {} # this is a mapping of values to the connections they are in
# each node will map to a shared object instance of the connection it is in
# e.g. {1: set([1,2]), 2: set([1,2])}, where the 2 sets are the same object
for x, y in lgen():
cx = connections.setdefault(x, set([x])) # if not found, create new connection with this single value
cy = connections.get(y) # defaults to None if not found
if not cy: # if we haven't come across this value yet...
cx.add(y) # ...add it to the current connection...
connections[y] = cx # ...and update the reference
elif cy is not cx: # if the cy connection is not the exact same object as the cx connection...
if len(cy) > len(cx): # \
cx, cy = cy, cx # >... merge them ...
cx |= cy # /
connections[y] = cx # ...and update the reference
# ~~~~~
# print
seen = set()
for key in sorted(connections.keys()):
if key not in seen:
c = connections[key]
print sorted(c)
seen |= c
This actually is a very classic Union-find.
If M is the number of edges, N the number of nodes, the time complexity is O(M * α(M)) which is a O(M) for all practical M and the space complexity if O(N) with N the number of nodes.
The algorithm is online and does not need to know in advance all the edges (compared to other graph-traversal solutions) and hence can scale very well. Also there is no need to order the edges, they can be given in any order.
For graph with billions of nodes you'll need 64 bits / long int and a lot of RAM, but it should handle millions of nodes and billions of edges very well.
The implementation is in C++ but only use vector/map that you can find in almost any language you might want to use.
But since you have unique id for each element we need to map these id to (contiguous) integers.
First version without mapping (suppose that all nodes between 1 and N exists):
#include <cstdio>
#include <algorithm>
#include <vector>
using namespace std;
const int MAX_N = 1000*1000;
int p[MAX_N],f[MAX_N];
int parent(int a) {
return p[a] == a ? a : p[a] = parent(p[a]);
}
bool join(int a, int b) {
p[a = parent(a)] = parent(b);
return p[a] != a;
}
int main()
{
// First integer in the file : number of nodes in the graph
int N;
scanf("%d",&N);
// Union-find in O(M * alpha(M)) ~= O(M)
// M = number of lines in the file
for(int i = 1; i <= N ; i++)
{
p[i] = i;
f[i] = -1;
}
int a,b;
while(scanf("%d%d",&a,&b) != EOF)
join(a,b);
// Determine the number of groups : O(M)
int nG = 0;
for(int i = 1 ; i <= N ; i++)
{
p[i] = parent(p[i]);
if(f[p[i]] == -1)
f[p[i]] = nG++;
}
// Build groups : O(M)
vector< vector<int> > Groups(N+1);
for(int i = 1 ; i <= N ; i++)
Groups[ f[p[i]] ].push_back(i);
// Output result
for(int i = 0 ; i < Groups.size() ; i++)
{
if(!Groups[i].size())
continue;
printf("%d : ",i);
for(int j = 0 ; j < Groups[i].size() ; j++)
printf("%d ",Groups[i][j]);
printf("\n");
}
}
Version with mapping : for that version we need to build the mapping. Since I don't know anything about your data, I'm using a classical map
to build it in O(M log(N)), if you can send the id of all nodes at the begin of the input file, it can be O(N log(N)) or even better if your are using a Hash Map ( O(N)) or if you can build the mapping yourself, with some knowing of the graph.
Anyway, here is the code :
#include <cstdio>
#include <algorithm>
#include <vector>
#include <map>
using namespace std;
const int MAX_N = 1000*1000;
int p[MAX_N],f[MAX_N];
int parent(int a) {
return p[a] == a ? a : p[a] = parent(p[a]);
}
bool join(int a, int b) {
p[a = parent(a)] = parent(b);
return p[a] != a;
}
// Mapping
int N = 0;
map<int,int> indMap,invMap;
int IND(int x) {
if(indMap.find(x) == indMap.end())
{
N++;
p[N] = N;
f[N] = -1;
indMap[x] = N;
}
invMap[ indMap[x] ] = x;
return indMap[x];
}
int main()
{
// Union-find in O(M * alpha(M)) ~= O(M)
// M = number of lines in the file
int a,b;
while(scanf("%d%d",&a,&b) != EOF)
join(IND(a),IND(b));
// Determine the number of groups : O(M)
int nG = 0;
for(int i = 1 ; i <= N ; i++)
{
p[i] = parent(p[i]);
if(f[p[i]] == -1)
f[p[i]] = nG++;
}
// Build groups : O(M)
vector< vector<int> > Groups(N+1);
for(int i = 1 ; i <= N ; i++)
Groups[ f[p[i]] ].push_back(i);
// Output result
for(int i = 0 ; i < Groups.size() ; i++)
{
if(!Groups[i].size())
continue;
printf("%d : ",i+1);
for(int j = 0 ; j < Groups[i].size() ; j++)
printf("%d ", invMap[ Groups[i][j] ]);
printf("\n");
}
}
精彩评论