inferring shifts from lists of equipment assignments over time

2023-03-04 01:33 Q&A author:

In my small problem, I have n users and m pieces of equipment (m and n are each around 50,000). A user can use only one piece of equipment at a time.

I have a list of records in the format [u, e, t], sorted by t (time) in ascending order. Each record means user u is using equipment e at time t. The number of records is around 500 million. Assume that two adjacent records with the same u and e mean that u has been using e continuously between them. For example:

1, 2, 1  
3, 4, 1    
1, 2, 3
1, 2, 4 
1, 2, 5 
2, 6, 6
3, 2, 6  
3, 2, 8

would mean user 1 uses equipment 2 from 1 to 5.

What I want to do is infer, from this list, the shift times in this format: [u, e, st, et], which means user u uses equipment e from start time st to end time et.

Result for the sample data would be:

1, 2, 0, 5  
3, 4, 0, 6  
3, 2, 6, 8

(assuming time starts at 0 and ends at max(t), and that when a pair (u, e) is first seen, u has already been using e since time 0. Similarly for the last records.)

Given the big list (500 million records) but the comparatively small m and n, how can I do this most efficiently?

@Edit: Possible data inconsistencies:

1: If there's only 1 record (so no end time) such as the case of [2, 6, 6] in the sample data:

--- If that's the only time user 2 and equipment 6 appear in the data set, then ignore the data point.

--- If after that record, user 2 uses another piece of equipment, let's say 7 at 10, then 2 uses 6 from 6 to 10.

--- If after that record, equipment 6 is used by another user, let's say 10 at 11, then 2 uses 6 from 6 to 11.

Define two structures (I know this is Java, but let's assume a generic algorithm):

struct user_record {
    int machine_idx;
    int start_time;
}

struct machine_record {
    int user_idx;
    int start_time;
}

Given that a user cannot be using more than one piece of equipment at the same time, you could create an array/vector of user_records, one for each user (you said this is ~ 50k, so this should be tractable), and an array/vector of machine_records, one for each machine. Initialise all elements' idx members to -1 (to indicate not currently active).

Then every time you encounter an input record, check the state of the corresponding idx fields in the user_record and machine_record arrays. There are three possibilities:

1. Both are -1. This is a start point, so set those elements to "point" at each other, and record start_time in each one.
2. Both are not -1, and consistent. This is an end-point, so simply create an output record, and reset those elements' idx fields back to -1.
3. At least one is not -1, but they are inconsistent. You will need to create two output records, overwrite the elements with the new values, and also set the corresponding old machine/user indices to -1.

This is O(N) time (where N is the number of input records).

Note: The output will be sorted by end-times.
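
A minimal Python sketch of the bookkeeping described above, assuming user and machine ids are small non-negative integers that can index a list directly. The names UserRecord, MachineRecord, make_tables and classify are illustrative and not from the answer. classify only reports which of the three cases a record falls into; emitting output records and updating the tables is left to the caller.

```python
from dataclasses import dataclass

NONE = -1  # sentinel meaning "not currently active"


@dataclass
class UserRecord:
    machine_idx: int = NONE
    start_time: int = 0


@dataclass
class MachineRecord:
    user_idx: int = NONE
    start_time: int = 0


def make_tables(n_users, n_machines):
    """Allocate one record per user and per machine, all inactive."""
    users = [UserRecord() for _ in range(n_users)]
    machines = [MachineRecord() for _ in range(n_machines)]
    return users, machines


def classify(users, machines, u, e):
    """Return which of the three cases an input record for (u, e) falls into."""
    ur, mr = users[u], machines[e]
    if ur.machine_idx == NONE and mr.user_idx == NONE:
        return "start"      # neither side is active
    if ur.machine_idx == e and mr.user_idx == u:
        return "end"        # both active and pointing at each other
    return "conflict"       # at least one is active, but they disagree
```

With around 50,000 users and machines, both tables are small lists that stay in memory while the 500 million records are streamed past them once.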

@Oli-Charlesworth is on the right track but with insufficient detail.

What you need is to have two vectors, one for users and one for machines. The machine one points at the user. The user one points at the machine. One of those vectors (it doesn't matter which, so I'll make it the user one) also has to track the first time and the last time the pair was associated.

Initialize everything so that each user is pointing at -1 (no machine), and each machine is pointing at -1 (no user). Here is pseudo-code for how to process your records:

for (user, machine, time) in records:
    # user.machine is the machine the user currently points at (-1 if none)
    if user.machine != machine:
        # Close the user's previous shift, if there is one
        output_record_and_clear(user)
        # Close the shift of whoever was on this machine, if anyone
        if machine.user != -1:
            output_record_and_clear(machine.user)
        user.machine = machine
        machine.user = user
        user.start_time = time
    user.end_time = time

def output_record_and_clear(user):
    if user.machine == -1:
        return  # nothing to close
    # A single record has start_time == end_time and is ignored
    if user.start_time < user.end_time:
        emit(user, user.machine, user.start_time, user.end_time)
    user.machine.user = -1
    user.machine = -1
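
For reference, here is one way the pseudo-code above could be made runnable in Python. This is a sketch under a few assumptions: ids are small non-negative integers used as list indices, the function name infer_shifts and the helper close are made up for illustration, and the final loop that flushes shifts still open at the end of the input is an addition that the pseudo-code does not contain. It reports the first and last times a pair was seen, so it does not apply the question's convention of starting at time 0 or extending the end time to the next event.

```python
def infer_shifts(records, n_users, n_machines):
    """Return (user, machine, start, end) tuples from time-sorted (u, e, t) records."""
    user_machine = [-1] * n_users      # machine each user currently points at
    machine_user = [-1] * n_machines   # user each machine currently points at
    start = [0] * n_users              # first time of the user's current shift
    end = [0] * n_users                # last time of the user's current shift
    out = []

    def close(u):
        """Emit the current shift of user u (if it spans time) and clear it."""
        m = user_machine[u]
        if m == -1:
            return
        if start[u] < end[u]:
            out.append((u, m, start[u], end[u]))
        machine_user[m] = -1
        user_machine[u] = -1

    for u, m, t in records:
        if user_machine[u] != m:
            close(u)                       # user moved to another machine
            if machine_user[m] != -1:
                close(machine_user[m])     # machine taken over from another user
            user_machine[u] = m
            machine_user[m] = u
            start[u] = t
        end[u] = t

    # Assumption: flush shifts that are still open when the input ends.
    for u in range(n_users):
        close(u)
    return out
```

On the sample data from the question, called as infer_shifts(records, 4, 7), this returns (1, 2, 1, 5) and (3, 2, 6, 8). The single-record pairs (3, 4) and (2, 6) are dropped because their start and end times coincide.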

I would approach this by sorting the file in place by user or machine (since the assignment is one-to-one, it shouldn't matter which) and then by time. After that the problem is easy: just go line by line and output the shifts.
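
A small Python sketch of the sort-then-scan idea, sorting by user and then by time. The function name shifts_by_sorting is made up for illustration, and the in-memory sorted() call stands in for the in-place file sort; 500 million records would most likely need an external sort instead. Because it groups by user only, it does not notice a machine being taken over by another user.

```python
from itertools import groupby


def shifts_by_sorting(records):
    """Group (u, e, t) records into (u, e, first_t, last_t) runs per user."""
    # Sort by user, then by time, so each user's records are contiguous
    ordered = sorted(records, key=lambda r: (r[0], r[2]))
    shifts = []
    for u, user_rows in groupby(ordered, key=lambda r: r[0]):
        # Within one user, consecutive rows on the same machine form one run
        for e, run in groupby(user_rows, key=lambda r: r[1]):
            times = [t for _, _, t in run]
            if times[0] < times[-1]:   # skip runs made of a single time
                shifts.append((u, e, times[0], times[-1]))
    return shifts
```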

You could also do it in memory, line by line, by keeping a hash table of the machines in use and the last time each was used. When you see a machine again, output the time used and update the table. With only 50,000 machines, this should work fine.
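
A rough Python sketch of the hash-table variant, assuming the table maps each machine to its current user plus the first and last times that user was seen on it. The name shifts_by_machine_table and the table layout are illustrative guesses at what the answer intends, and the final flush of entries still in the table is an added assumption.

```python
def shifts_by_machine_table(records):
    """Stream time-sorted (u, e, t) records, tracking one entry per machine."""
    in_use = {}   # machine -> [user, first_time, last_time]
    shifts = []
    for u, e, t in records:
        entry = in_use.get(e)
        if entry is not None and entry[0] == u:
            entry[2] = t               # same user still on this machine
            continue
        # Machine is new or changed hands: output the previous user's time
        if entry is not None and entry[1] < entry[2]:
            shifts.append((entry[0], e, entry[1], entry[2]))
        in_use[e] = [u, t, t]
    # Assumption: flush whatever is still in the table at the end of input
    for e, (u, first, last) in in_use.items():
        if first < last:
            shifts.append((u, e, first, last))
    return shifts
```

Since the table is keyed by machine only, a user who moves to another machine is not closed out until the old machine is seen again or the input ends.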
