Trying to group values?

2023-01-19 00:42 问答作者：

I have some data like this:

and am looking for an output like this (group-id and the members of that group):

1: 1 2 6
2: 3 4 7
3: 5 9

First row because 1 is "connected" to 2 and 2 is connected to 6. Second row because 3 is connected to 4 and 3 is connected to 7

This looked to me like a graph traversal but the final order does not matter so I was wondering if someone can suggest a simpler solution that I can use on a large dat开发者_如何转开发aset (billions of entries).

From the comments:

The problem is to find the set of disjoint sub-graphs given a set of edges.
The edges are not directed; the line '1 2' means that 1 is connected to 2 and 2 is connected to 1.
The '1:' in the sample output could be 'A:' without changing the meaning of the answer.

EDIT 1:

Problem looks solved now. Thanks to everyone for their help. I need some more help picking the best solution that can be used on billions of such entries.

EDIT 2:

Test Input file:

Benchmarks:

I tried out everything and the version posted by TokenMacGuy is the fastest on the sample dataset that I tried. The dataset has about 1 million entries for which it took me about 6 seconds on a Dual Quad-Core 2.4GHz machine. I haven't gotten a chance to run it on the entire dataset yet but I will post the benchmark as soon as it is available.

I've managed O(n log n).

Here is a (somewhat intense) C++ implementation:

#include <boost/pending/disjoint_sets.hpp>
#include <boost/property_map/property_map.hpp>

#include <map>
#include <set>
#include <iostream>


typedef std::map<int, int> rank_t;
typedef std::map<int, int> parent_t;

typedef boost::associative_property_map< rank_t > rank_pmap_t;
typedef boost::associative_property_map< parent_t > parent_pmap_t;

typedef boost::disjoint_sets< rank_pmap_t, parent_pmap_t > group_sets_t;

typedef std::set<int> key_set;
typedef std::map<int, std::set<int> > output;

With some typedefs out of the way, here's the real meat. I'm using boost::disjoint_sets, which is just happens to be an exceptionally good representation for the problem. This first function checks to see if either of the keys given have been seen before, and adds them to the collections if needed. the important part is really the union_set(a, b) which links the two sets together. If one or the other of the sets are already in the groups collection, they get linked too.

void add_data(int a, int b, group_sets_t & groups, key_set & keys)
{
  if (keys.count(a) < 1) groups.make_set(a);
  if (keys.count(b) < 1) groups.make_set(b);
  groups.union_set(a, b);
  keys.insert(a);
  keys.insert(b);
}

This isn't too exciting, it just iterates through all of the keys we've seen and gets the representative key for that key, then adds the pair (representative, key) to a map. Once that's done, print out the map.

void build_output(group_sets_t & groups, key_set & keys)
{
  output out;
  for (key_set::iterator i(keys.begin()); i != keys.end(); i++)
    out[groups.find_set(*i)].insert(*i);

  for (output::iterator i(out.begin()); i != out.end(); i++)
  {
    std::cout << i->first << ": ";
    for (output::mapped_type::iterator j(i->second.begin()); j != i->second.end(); j++)
      std::cout << *j << " ";
    std::cout << std::endl;
  }
}

int main()
{

  rank_t rank;
  parent_t parent;
  rank_pmap_t rank_index(rank);
  parent_pmap_t parent_index(parent);


  group_sets_t groups( rank_index, parent_index );
  key_set keys;

  int a, b;
  while (std::cin >> a)
  {
    std::cin >> b;
    add_data(a, b, groups, keys);
  }  


  build_output(groups, keys);
  //std::cout << "number of sets: " << 
  //  groups.count_sets(keys.begin()), keys.end()) << std::endl;

}

I stayed up many hours learning how to use boost::disjoint_sets on this problem. There doesn't seem to be much of any documentation on it.

About the performance. The disjoint_sets structure is O(α(n) ) for its key operations (make_set, find_set and union_set) which is pretty close to constant, and so if it were just a matter of building the structure, the whole algorithm would be O(n α(n) ) (which is effectively O(n) ) but we have to print it out. That means we have to build up some associative containers, which cannot perform better than O(n log n). It might be possible to get a constant factor speedup by choosing a different associative containers (say, hash_set etc), since once you populate the initial list, you can reserve an optimal amount of space.

Ok so I got something working in parallel to the other solution posted by @Jonathan (first of all, many thanks for your time). My solution looks deceptively simple but would love some suggestions on whether this is correct (maybe I'm missing a corner case somewhere?) because it seems to produce the output I wanted but I'll have to parse it in a second pass to group the same group numbers which is trivial. The logic is that everytime it finds a new number not in the array it increments a group_id counter:

My code in PHP:

<?php

//$fp = fopen("./resemblance.1.out", "r");
$fp = fopen("./wrong", "r");

$groups = array();
$group["-1"] = 1;
$groups[] = $group;

$map = array();

//Maintain a count
$group = 1;

while(!feof($fp)) {
        $source = trim(fgets($fp, 4096));
        //echo $source."\n";

        $source = explode(" ", $source);

        if(array_key_exists($source[0], $map) && !array_key_exists($source[1], $map)) {
                $map[$source[1]] = $map[$source[0]];
        } else if(array_key_exists($source[1], $map) && !array_key_exists($source[0], $map)) {
                $map[$source[0]] = $map[$source[1]];
        } else if(array_key_exists($source[1], $map) && array_key_exists($source[0], $map) && $map[$source[1]] != $map[$source[0]]) {
                // Adjust the groups - change the groups of one of the elements to the other
                $keys = array_keys($map, $map[$source[1]]);
                print_r($keys);
                foreach($keys as $key) {
                        $map[$key] = $map[$source[0]];
                }
        } else {
                $group++;
                $map[$source[0]] = $group;
                $map[$source[1]] = $group;
        }
}

print_r($map);
?>

Output:

Array
(
    [1] => 2
    [2] => 2
    [3] => 3
    [4] => 3
    [5] => 4
    [9] => 4
    [6] => 2
    [7] => 3
    [] => 5
)

EDIT: Fixed the bug that was mentioned in the comment. Just playing around out of curiosity :) Feel free to point out any other bugs. I am currently testing out which solution is faster.

Here is a sample Perl solution that works on the original data set:

1 2
3 4
5 9
2 6
3 7

Group 1: 1 2 6
Group 2: 3 4 7
Group 3: 5 9

On the big data set, it produces the output:

Group 1: 1 27 134 137 161 171 275 309 413 464 627 684 744
Group 2: 2 135 398 437 548 594 717 738 783 798 912
Group 3: 5 74 223
Group 4: 7 53 65 122 237 314 701 730 755 821 875 884 898 900 930
Group 5: 8 115
Group 6: 9 207 305 342 364 493 600 676 830 941
Group 7: 10 164 283 380 423 468 577
Group 8: 11 72 132 276 306 401 515 599
Group 9: 12 95 126 294
Group 10: 13 64 172 528
Group 11: 14 396
Group 12: 15 35 66 210 226 360 588
Group 13: 17 263 415 474 648 986
Group 14: 21 543 771
Group 15: 22 47
Group 16: 23 70 203 427 590
Group 17: 24 286 565
Group 18: 25 175
Group 19: 26 678
Group 20: 29 787

Whether it is efficient enough is a separate matter...

use strict;
use warnings;
my %cache = ();
while (<>)
{
    chomp;
    my($x,$y) = split /\s+/;
    #print "$x $y\n";
    $cache{$x}{$y} = 1;
    $cache{$y}{$x} = 1;
}

my $grp = 1;
foreach my $key (sort { $a <=> $b } keys %cache)
{
    #print "key: $key\n";
    if (defined $cache{$key})
    {
        my %result = ();
        subkey_search(\%result, $key);
        print "Group $grp:";
        $grp++;
        foreach my $val (sort { $a <=> $b } keys %result)
        {
            print " $val";
        }
        print "\n";
    }
}

sub subkey_search
{
    my($resultref, $key) = @_;
    my %hash = %{$cache{$key}};
    delete $cache{$key};
    $resultref->{$key} = 1;
    foreach my $subkey (sort keys %hash)
    {
        #print "subkey: $subkey\n";
        subkey_search($resultref, $subkey) if (defined $cache{$subkey});
    }
}

Here's a slightly different version in Python, which builds a graph containing the edges specified, then converts that to a list of connected subgraphs.

I might want to use this later so I wrote it as a general-purpose version that doesn't do input from a file or output with print statements, just converting data structures.

def graph_to_connected_subgraphs(graph):
    trees = []
    for start in graph.keys():
        if start in graph:
            list = [start]
            append_tree_from(graph, start, list)
            trees.append(list)
    return trees

def append_tree_from(graph, node, list):
    if node in graph:
        for endpoint in graph[node]:
            list.append(endpoint)
            append_tree_from(graph, endpoint, list)
        del graph[node]
    return list

def add_edge(graph, f, s):
    if s < f: # ensure f < s to handle cyclic graphs
        f, s = s, f
    if f not in graph:
        graph[f] = [s]
    else:
        graph[f].append(s)

graph = {}

add_edge(graph, 1,2)
add_edge(graph, 2,6)
add_edge(graph, 3,4)
add_edge(graph, 5,9)
add_edge(graph, 3,7)

print graph_to_connected_subgraphs(graph)

Output

[[1, 2, 6], [3, 4, 7], [5, 9]]

This is a typical application of DFS (Depth First Search) algorithm performed on graphs. Try read this dfs Complexity of this algorithm is O(|V|+|E|), where V - number of vertices and E - number of edges

Here's my stab at an answer. Python.

groups = []

infile = open("so2.txt")

for line in infile.readlines():
  newset = set(line.split())
  matchgroups = []
  excludegroups = []
  for group in groups:
    if len(newset & group):
      newset |= group
    else:
      excludegroups.append(group)
  groups = excludegroups
  groups.append( newset)

for i, s in enumerate(groups):
  print "%d: %s"%(i, " ".join(s))

The Idea here is that forming graphs is not really right. Each pair of numbers in the input is a set. The rule is to return only disjoint sets. So I read each line and convert them to sets, then I check all of the existing sets for intersections, and merge those into the new set. Nonintersecting sets are just added to the new list of sets, and once I'm done I add the new, merged set into the new list of sets. This way I can be sure that only disjoint sets make it into the list.

My version in PHP actually is only a refactoring of your code. It fixes one issue in your code (you have one group too much) and is implemented slightly more efficient (Exec time drops from 0.0035 to 0.0020 on slow machine):

$group = 0;
$map = array();

do {
    list($a, $b) = explode(' ', fgets($file));
    $a = (int) $a;
    $b = (int) $b;

    if (!isset($map[$a]) && !isset($map[$b])) {
        $map[$a] = $map[$b] = ++$group;
    } elseif (!isset($map[$b])) {
        $map[$b] = $map[$a];
    } elseif (!isset($map[$a])) {
        $map[$a] = $map[$b];
    } elseif ($map[$a] != $map[$b]) {
        // move one group to the other
        foreach ($map as $n => $g) {
            if ($g == $map[$b]) {
                $map[$n] = $map[$a];
            }
        }
    }
} while (!feof($file));

// print results
$results = array();
foreach ($map as $val => $group) {
    $results[$group][] = $val;
}

echo '<pre>';
$i = 0;
foreach ($results as $result) {
    sort($result);
    echo 'Group ', ++$i, ': ', implode(' ', $result), "\n";
}

If the input is sorted, like your huge test set, you can do it in Θ(input) time and Ο(input) space. If the input is unsorted, you could modify this fairly easily and get Ο(input log input) time and Θ(input) space.

This is very quick because :

it stores a hash map from a node to its "ultimate owner" (which is the node with the lowest ID in the connected component) which allows Ο(1) "ultimate owner" lookup and insertion.
it stores a hash map from each "ultimate owner" to a result line which allows Ο(1) result line lookup and insertion.
each result line is a linked list which allows Ο(1) appending.
it stores a linked list of result lines which allows :
- Ο(1) appending.
- access to it without going through the expense of asking the prior hash map for all its values

import java.util.Scanner;
import java.util.Map;
import java.util.HashMap;
import java.util.List;
import java.util.LinkedList;

public final class Solver {
  public static void main(String[] args) {
    Scanner in = new Scanner(System.in);
    final Map<Integer, Integer>       ultimateOwners = new HashMap<Integer, Integer>();
    final Map<Integer, List<Integer>> ownerToOwned   = new HashMap<Integer, List<Integer>>();
    final List<List<Integer>>         results        = new LinkedList<List<Integer>>();

    while (in.hasNextInt()) {
      // Get ultimate owner.
      int owner = in.nextInt();
      if (ultimateOwners.containsKey(owner)) owner = ultimateOwners.get(owner);

      // Get owned and register its ultimate owner.
      final int owned = in.nextInt();
      ultimateOwners.put(owned, owner);

      // Add owned to result.
      if (ownerToOwned.containsKey(owner)) ownerToOwned.get(owner).add(owned);
      else {
        final List<Integer> resultLine = new LinkedList<Integer>();
        resultLine.add(owner);
        resultLine.add(owned);
        ownerToOwned.put(owner, resultLine);
        results.add(resultLine);
      }
    }

    int lineNumber = 1;
    for (final List<Integer> line : results) {
      System.out.printf("%d: ", lineNumber++);
      for (final Integer value : line) {
        System.out.printf("%d ", value);
      }
      System.out.println();
    }
  }
}

After not being completely satisfied with my first 2 attempts at this, and some research, I came across this recipe for disjoint sets in Python, with Raymond Hettinger's blessing and input. (Raymond Hettinger is a long-time very active Python core developer.)

Here is an adaptation of that recipe that's very close to my first 2 attempts, but the recipe itself may be a better reference.

Collecting should be as efficient as possible for very large sets of data, as much of the set operations in Python are implemented in C. The input data does not have to be sorted. For printing, I sorted outputs solely for readability, but if this affects performance, connections can be printed without sorting.

# ~~~~~
# data, setup

input = '''
    1 2
    3 4
    2 3
    ''' # etc.

def lgen():
    for l in input.splitlines():
        l = l.strip()
        if l:
            yield tuple(int(i) for i in l.split())

# ~~~~~
# collect

connections = {} # this is a mapping of values to the connections they are in
                 # each node will map to a shared object instance of the connection it is in
                 # e.g. {1: set([1,2]), 2: set([1,2])}, where the 2 sets are the same object

for x, y in lgen():
    cx = connections.setdefault(x, set([x]))      # if not found, create new connection with this single value

    cy = connections.get(y)                 # defaults to None if not found
    if not cy:                              # if we haven't come across this value yet...
        cx.add(y)                           # ...add it to the current connection...
        connections[y] = cx                 # ...and update the reference
    elif cy is not cx:                      # if the cy connection is not the exact same object as the cx connection...
        if len(cy) > len(cx):               # \
            cx, cy = cy, cx                 #  >... merge them ...
        cx |= cy                            # /
        connections[y] = cx                 # ...and update the reference

# ~~~~~
# print

seen = set()
for key in sorted(connections.keys()):
    if key not in seen:
        c = connections[key]
        print sorted(c)
        seen |= c

This actually is a very classic Union-find.

If M is the number of edges, N the number of nodes, the time complexity is O(M * α(M)) which is a O(M) for all practical M and the space complexity if O(N) with N the number of nodes.

The algorithm is online and does not need to know in advance all the edges (compared to other graph-traversal solutions) and hence can scale very well. Also there is no need to order the edges, they can be given in any order.

For graph with billions of nodes you'll need 64 bits / long int and a lot of RAM, but it should handle millions of nodes and billions of edges very well.

The implementation is in C++ but only use vector/map that you can find in almost any language you might want to use.

But since you have unique id for each element we need to map these id to (contiguous) integers.

First version without mapping (suppose that all nodes between 1 and N exists):

#include <cstdio>
#include <algorithm>
#include <vector>
using namespace std;

const int MAX_N = 1000*1000;
int p[MAX_N],f[MAX_N];

int parent(int a) {
  return p[a] == a ? a : p[a] = parent(p[a]);
}
bool join(int a, int b) {
  p[a = parent(a)] = parent(b);
  return p[a] != a;
}

int main()
{
    // First integer in the file : number of nodes in the graph
    int N;
    scanf("%d",&N);
    // Union-find in O(M * alpha(M)) ~= O(M)
    // M = number of lines in the file
    for(int i = 1; i <= N ; i++)
    {
        p[i] = i;
        f[i] = -1;
    }
    int a,b;
    while(scanf("%d%d",&a,&b) != EOF)
        join(a,b);
    // Determine the number of groups : O(M)
    int nG = 0;
    for(int i = 1 ; i <= N ; i++)
    {
        p[i] = parent(p[i]);
        if(f[p[i]] == -1)
            f[p[i]] = nG++;
    }
    // Build groups : O(M)
    vector< vector<int> > Groups(N+1);
    for(int i = 1 ; i <= N ; i++)
        Groups[ f[p[i]] ].push_back(i);
    // Output result
    for(int i = 0 ; i < Groups.size() ; i++)
    {
        if(!Groups[i].size())
            continue;
        printf("%d : ",i);
        for(int j = 0 ; j < Groups[i].size() ; j++)
            printf("%d ",Groups[i][j]);
        printf("\n");
    }

}

Version with mapping : for that version we need to build the mapping. Since I don't know anything about your data, I'm using a classical map to build it in O(M log(N)), if you can send the id of all nodes at the begin of the input file, it can be O(N log(N)) or even better if your are using a Hash Map ( O(N)) or if you can build the mapping yourself, with some knowing of the graph.

Anyway, here is the code :

#include <cstdio>
#include <algorithm>
#include <vector>
#include <map>
using namespace std;

const int MAX_N = 1000*1000;
int p[MAX_N],f[MAX_N];

int parent(int a) {
  return p[a] == a ? a : p[a] = parent(p[a]);
}
bool join(int a, int b) {
  p[a = parent(a)] = parent(b);
  return p[a] != a;
}
// Mapping
int N = 0;
map<int,int> indMap,invMap;
int IND(int x) {
    if(indMap.find(x) == indMap.end())
    {
        N++;
        p[N] = N;
        f[N] = -1;
        indMap[x] = N;
    }
    invMap[ indMap[x] ] = x;
    return indMap[x];
}


int main()
{
    // Union-find in O(M * alpha(M)) ~= O(M)
    // M = number of lines in the file
    int a,b;
    while(scanf("%d%d",&a,&b) != EOF)
        join(IND(a),IND(b));
    // Determine the number of groups : O(M)
    int nG = 0;
    for(int i = 1 ; i <= N ; i++)
    {
        p[i] = parent(p[i]);
        if(f[p[i]] == -1)
            f[p[i]] = nG++;
    }
    // Build groups : O(M)
    vector< vector<int> > Groups(N+1);
    for(int i = 1 ; i <= N ; i++)
        Groups[ f[p[i]] ].push_back(i);
    // Output result
    for(int i = 0 ; i < Groups.size() ; i++)
    {
        if(!Groups[i].size())
            continue;
        printf("%d : ",i+1);
        for(int j = 0 ; j < Groups[i].size() ; j++)
            printf("%d ", invMap[ Groups[i][j] ]);
        printf("\n");
    }

}

继续阅读：algorithm graph php python

Trying to group values?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？