开发者

Merging HashMaps from n lists

I'll try and describe my problem as best as I can, but please ask if there are things that make no sense.

  • I have a finite number of lists
  • Each list contains a finite number of contacts
  • Each contact is represented as a HashMap
  • Each list is linked to a provider
  • The same contact may be present in multiple providers (and hence multiple lists).
  • I need a 'master' list that contains all the unique entries in the other lists

I'm looking for an efficient way to merge these lists into a master list without duplicates. For example if the same contact appears in multiple lists (multiple HashMaps corresponding to the same p开发者_StackOverflowhysical person) I want to merge all the HashMaps into a single one, and put the merged HashMap into the master list. A simple 'putall' here won't do since I need to re-key the contents to efficiently access them (eg. provider one gives me a list of email addresses keyed as 'emails' and provider 2 gives me the same info keyed as 'emailList').

Merging the individual HashMaps is the easier of two problems since I know these keys and can easily merge them.

The problem that has me scratching my head is efficient scanning of the lists ... is there no other way than linearly going through each list in a nested loop, grabbing the next HashMap, checking if it already exists in the mater list and merging/creating a new one ... ?


First observation - using a HashMap to represent your contacts smells of "object denial".

You need to design and implement a Contact class to represent a contact. Without this class, your task is a whole bunch harder than it needs to be.

The class needs getters for all of the contact key fields, and it needs to implement equals, hashcode and Comparable based on the key fields. Getters (and optionally setters) are also needed for non-key fields.

With that, the merging process becomes (pseudo-code):

// If you haven't already done so
convert the master list of HashMaps to a list of Contact objects and sort it.
create an empty "new master" list

for each list of contact HashMaps:
    convert the list of HashMaps to a merge list of Contact objects
    sort the merge list
    iterate the sorted master and merge lists in parallel:
        if a master Contact matches a merge Contact:
            merge the two Contacts and add to the new master list
            advance both iterators
        if a master Contact has no corresponding merge Contact:
            copy the master Contact to the new master list
            advance the master iterator.
        if a merge Contact has no corresponding master Contact:
            add the merge Contact to the new master list.
            advance the merge iterator

The performance characteristics of the various phases should be:

  • Conversion of N HashMaps to Contact objects - O(N).
  • Creation of list of N Contacts - O(N)
  • Sort list of N Contacts - O(NlogN)
  • Merging of 2 sorted lists - O(M + N).

The overall performance should be better than O(NlogN) where N is the total number of master and merge Customer objects.


Make a Map<String,Contact> using a class like the one below. Although, I'm still not sure what you mean by Provider. Perhaps you could provide some more detail on that.

class Contact {

    enum ContactMethod {
        email,
        phone,
        address
    }

    String name;
    Map<ContactMethod,Set<String>> contactInfo;

    Contact(String name) {
        this.name = name;
        this.contactInfo = new HashMap<ContactMethod,Set<String>>();
    }

    void consume(Map<ContactMethod,String> info) {
        for(ContactMethod method : info.keySet()) {
            Set<String> modes = contactInfo.get(method);
            if(modes == null) {
                modes = new HashSet<String>();
                contactInfo.put(method,modes);
            }
            modes.add(info.get(method));
        }
    }
}


For your internal master list, can you use a class that you can define a meaningful equals() on to encapsulate the HashMap, instead of just storing straight HashMaps? If you did that, you could switch to using a Collection implementation that has constant-time lookups (e.g., HashSet) for the Master List. That will eliminate the nested iteration and you'll just have to check each Contact from a Provider once. It's trial and error to determine if your number of contacts is large enough that it's an improvement.


If your lists are sorted, consider this:

Create a "merging" iterator that consumes 2 iterators from your lists.
If the 2 heads are the same, toss one. Otherwise present the smaller of the two.
If one head is from an exhausted (empty) iterator, simply present the other.

Now you have an iterator that produces a unique sorted sequence from 2 iterators.

You can pile these up as many as needed to get a unique iterator for all your lists.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜