
Efficiently finding duplicates in a constrained many-to-many dataset?

I have to write a bulk operation version of something our webapp lets you do on a more limited basis from the UI. The desired operation is to assign objects to a category. A category can have multiple objects but a given object can only be in one category.

The workflow for the task is:

1) Using the browser, a file of the following form is uploaded:

# ObjectID, CategoryID
Oid1, Cid1
Oid2, Cid1
Oid3, Cid2
Oid4, Cid2
[etc.]

The file will most likely have tens to hundreds of lines, but definitely could have thousands of lines.

In an ideal world a given object id would only occur once in the file (reflecting the fact that an object can only be assigned to one category). But since the file is created outside of our control, there's no guarantee that's actually true, and the processing has to deal with that possibility.

2) The server will receive the file, parse it, pre-process it and show a page something like:

723 objects to be assigned to 126 categories
142 objects not found
 42 categories not found

Do you want to continue?

[Yes]     [No]

3) If the user clicks the Yes button, the server will actually do the work.

Since I don't want to parse the file in both steps (2) and (3), as part of (2) I need to build a container that will live across requests and hold a useful representation of the data: one that lets me easily populate the "preview" page and also lets me efficiently do the actual work. (While obviously we have sessions, we normally keep very little in-memory session state.)

There is an existing

assignObjectsToCategory(Set<ObjectId> objectIds, CategoryId categoryId)

function that is used when assignment is done through the UI. It is highly desirable for the bulk operation to also use this API, since it does a bunch of other business logic in addition to the simple assignment, and we need that same business logic to run when this bulk assign is done.

Initially, it was going to be acceptable that if the file "illegally" specified multiple categories for a given object, the object would simply be assigned arbitrarily to one of the categories the file associated it with.

So I was initially thinking that in step (2) as I went through the file I would build up and put into the cross-request container a Map<CategoryId, Set<ObjectId>> (specifically a HashMap for quick lookup and insertion) and then when it was time to do the work I could just iterate on the map and for each CategoryId pull out the associated Set<ObjectId> and pass them into assignObjectsToCategory().
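Roughly, that plan would have looked something like the sketch below (parseObjectId() and parseCategoryId() are hypothetical stand-ins for whatever the real line parsing does):

Map<CategoryId, Set<ObjectId>> objectsByCategory = new HashMap<>();

// Step (2): build the map while reading the file.
for (String line : fileLines) {
    ObjectId objectId = parseObjectId(line);       // hypothetical parsing helper
    CategoryId categoryId = parseCategoryId(line); // hypothetical parsing helper
    objectsByCategory
        .computeIfAbsent(categoryId, k -> new HashSet<>())
        .add(objectId);
}

// Step (3): one call per category, reusing the existing business-logic API.
for (Map.Entry<CategoryId, Set<ObjectId>> entry : objectsByCategory.entrySet()) {
    assignObjectsToCategory(entry.getValue(), entry.getKey());
}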

However, the requirement on how to handle duplicate ObjectIds changed. And they are now to be handled as follows:

  • If an ObjectId appears multiple times in the file and all times is associated with the same CategoryId, assign the object to that category.
  • If an ObjectId appears multiple times in the file and is associated with different CategoryIds, consider that an error and make mention of it on the "preview" page.

That seems to mess up my Map<CategoryId, Set<ObjectId>> strategy since it doesn't provide a good way to detect that the ObjectId I just read out of the file is already associated with a CategoryId.

So my question is how to most efficiently detect and track these duplicate ObjectIds?

What came to mind is to use both "forward" and "reverse" maps:

public class CrossRequestContainer
{
    ...

    Map<CategoryId, Set<ObjectId>> objectsByCategory;  // HashMap
    Map<ObjectId, List<CategoryId>> categoriesByObject; // HashMap
    Set<ObjectId> illegalDuplicates;

    ...
}

Then as each (ObjectId, CategoryId) pair was read in, it would get put into both maps. Once the file was completely read in, I could do:

for (Map.Entry<ObjectId, List<CategoryId>> entry : categoriesByObject.entrySet()) {
    List<CategoryId> categories = entry.getValue();
    if (categories.size() > 1) {
        ObjectId object = entry.getKey();
        if (!all_categories_are_equal(categories)) {
            illegalDuplicates.add(object);
            // Since this is an "illegal" duplicate I need to remove it
            // from every category that it appeared with in the file.
            for (CategoryId category : categories) {
                objectsByCategory.get(category).remove(object);
            }
        }
    }
}

When this loop finishes, objectsByCategory will no longer contain any "illegal" duplicates, and illegalDuplicates will contain all the "illegal" duplicates to be reported back as needed. I can then iterate over objectsByCategory, get the Set<ObjectId> for each category, and call assignObjectsToCategory() to do the assignments.
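For reference, the reading step that populates both maps (described above) would be something along these lines, again with parseObjectId()/parseCategoryId() as stand-ins for the real parsing:

for (String line : fileLines) {
    ObjectId objectId = parseObjectId(line);
    CategoryId categoryId = parseCategoryId(line);

    // "forward" map: category -> objects listed under it
    objectsByCategory
        .computeIfAbsent(categoryId, k -> new HashSet<>())
        .add(objectId);

    // "reverse" map: object -> every category the file mentioned for it
    categoriesByObject
        .computeIfAbsent(objectId, k -> new ArrayList<>())
        .add(categoryId);
}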

But while I think this will work, I'm worried about storing the data twice, especially when the input file is huge. And I'm also worried that I'm missing something re: efficiency and this will go very slowly.

Are there ways to do this that won't use double memory but can still run quickly? Am I missing something that even with the double memory use will still run a lot slower than I'm expecting?


Given the constraints you've described, I don't think there's a way to do this using a lot less memory.

One possible optimization, though, is to only maintain lists of categories for objects that are listed in multiple categories, and otherwise just map each object directly to its category, i.e.:

Map<CategoryId, Set<ObjectId>> objectsByCategory;  // HashMap
Map<ObjectId, CategoryId> categoryByObject; // HashMap
Map<ObjectId, Set<CategoryId>> illegalDuplicates;  // HashMap

Yes, this adds yet another container, but it will contain (hopefully) only a few entries; also, the memory requirements of the categoryByObject map are reduced (cutting out one list overhead per entry).

The logic is a little more complicated of course. When a duplicate is initially discovered, the object should be removed from the categoryByObject map and added into the illegalDuplicates map. Before adding any object into the categoryByObject map, you will need to first check the illegalDuplicates map.

Finally, it probably won't hurt performance to build the objectsByCategory map in a separate loop after building the other two maps, and it will simplify the code a bit.
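Roughly, the per-pair logic plus that final separate loop could look something like this (a sketch only; parseObjectId()/parseCategoryId() are placeholders for your real parsing, and you'd adapt the types to your application):

Map<ObjectId, CategoryId> categoryByObject = new HashMap<>();
Map<ObjectId, Set<CategoryId>> illegalDuplicates = new HashMap<>();

for (String line : fileLines) {
    ObjectId objectId = parseObjectId(line);
    CategoryId categoryId = parseCategoryId(line);

    Set<CategoryId> conflicting = illegalDuplicates.get(objectId);
    if (conflicting != null) {
        // Already known to be an illegal duplicate; just record the extra category.
        conflicting.add(categoryId);
        continue;
    }

    CategoryId existing = categoryByObject.get(objectId);
    if (existing == null || existing.equals(categoryId)) {
        // First sighting, or a harmless repeat with the same category.
        categoryByObject.put(objectId, categoryId);
    } else {
        // Conflict: move the object out of categoryByObject into illegalDuplicates.
        categoryByObject.remove(objectId);
        Set<CategoryId> categories = new HashSet<>();
        categories.add(existing);
        categories.add(categoryId);
        illegalDuplicates.put(objectId, categories);
    }
}

// Separate pass: build the per-category sets from the clean assignments only.
Map<CategoryId, Set<ObjectId>> objectsByCategory = new HashMap<>();
for (Map.Entry<ObjectId, CategoryId> entry : categoryByObject.entrySet()) {
    objectsByCategory
        .computeIfAbsent(entry.getValue(), k -> new HashSet<>())
        .add(entry.getKey());
}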
