Inferring shifts from lists of equipment assignments over time
In my problem, I have n users and m pieces of equipment (m and n are each around 50,000). A user can use one and only one piece of equipment at a time.
I have a list of records in the format [u, e, t], sorted by t (time) ascending. Each record means user u is using equipment e at time t. There are around 500 million records. Assume that two adjacent records with the same u and e mean that u is using e continuously. For example:
1, 2, 1
3, 4, 1
1, 2, 3
1, 2, 4
1, 2, 5
2, 6, 6
3, 2, 6
3, 2, 8
would mean user 1 uses equipment 2 from 1 to 5.
What I want to do is, from this list, infer the shift times in this format: [u, e, st, et], which means user u uses equipment e from start time st to end time et.
Result for the sample data would be:
1, 2, 0, 5
3, 4, 0, 6
3, 2, 6, 8
(assuming time starts at 0 and ends at max(t), and that when a (u, e) pair is first seen, u has already been using e since time 0. The same applies to the last records.)
Given the big list (500 million records) but small enough m and n, how could I do this most efficiently?
@Edit: Possible data inconsistencies:
1: If there's only one record (so no end time), such as [2, 6, 6] in the sample data:
- If that's the only time user 2 and equipment 6 appear in the data set, ignore the data point.
- If, after that record, user 2 uses another piece of equipment, say 7 at time 10, then 2 uses 6 from 6 to 10.
- If, after that record, equipment 6 is used by another user, say 10 at time 11, then 2 uses 6 from 6 to 11.

Define two structures (I know this is Java, but let's assume a generic algorithm):
struct user_record {
    int machine_idx;
    int start_time;
}
struct machine_record {
    int user_idx;
    int start_time;
}
Given that a user cannot be using more than one piece of equipment at the same time, you could create an array/vector of user_records, one for each user (you said this is ~ 50k, so this should be tractable), and an array/vector of machine_records, one for each machine. Initialise all elements' idx members to -1 (to indicate not currently active).
Then every time you encounter an input record, check the state of the corresponding idx fields in the user_record and machine_record arrays. There are three possibilities:
- Both are -1. This is a start point, so set those elements to "point" at each other, and record start_time in each one.
- Both are not -1, and consistent. This is an end-point, so simply create an output record, and reset those elements' idx fields back to -1.
- At least one is not -1, but they are inconsistent. You will need to create two output records, overwrite the elements with the new values, and also set the corresponding old machine/user indices to -1.
This is O(N) time (where N is the number of input records).
Note: The output will be sorted by end-times.
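For illustration, here is a minimal Python sketch of one literal reading of the three cases above. The function name `infer_shifts_three_cases` and the parallel lists standing in for the two struct arrays are my own assumptions, and ids are assumed to be small non-negative integers usable as list indices. Only one start-time list is kept, since both structs would hold the same value.

```python
from typing import Iterable, List, Tuple

Record = Tuple[int, int, int]       # (user, machine, time)
Shift = Tuple[int, int, int, int]   # (user, machine, start, end)

def infer_shifts_three_cases(records: Iterable[Record],
                             n_users: int, n_machines: int) -> List[Shift]:
    user_machine = [-1] * n_users      # user_record.machine_idx
    machine_user = [-1] * n_machines   # machine_record.user_idx
    user_start = [0] * n_users         # start_time of the current pairing
    out: List[Shift] = []
    for u, m, t in records:
        cur_m, cur_u = user_machine[u], machine_user[m]
        if cur_m == -1 and cur_u == -1:
            # Case 1: both idle, so this is a start point.
            user_machine[u], machine_user[m], user_start[u] = m, u, t
        elif cur_m == m and cur_u == u:
            # Case 2: consistent, treated as an end point.
            out.append((u, m, user_start[u], t))
            user_machine[u] = machine_user[m] = -1
        else:
            # Case 3: inconsistent. Close whatever was open (up to two
            # output records), then start the new pairing.
            if cur_m != -1:
                out.append((u, cur_m, user_start[u], t))
                machine_user[cur_m] = -1
            if cur_u != -1:
                out.append((cur_u, m, user_start[cur_u], t))
                user_machine[cur_u] = -1
            user_machine[u], machine_user[m], user_start[u] = m, u, t
    return out

sample = [(1, 2, 1), (3, 4, 1), (1, 2, 3), (1, 2, 4),
          (1, 2, 5), (2, 6, 6), (3, 2, 6), (3, 2, 8)]
print(infer_shifts_three_cases(sample, 7, 7))
# [(1, 2, 1, 3), (1, 2, 4, 5), (3, 4, 1, 6), (3, 2, 6, 8)]
```

Note that under this literal reading, a run of repeated records for the same pair alternates between start and end, so user 1's time on equipment 2 comes out as two pieces rather than one shift from 1 to 5.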
@Oli-Charlesworth is on the right track but with insufficient detail.
What you need is to have two vectors, one for user and one for machine. The machine one points at the user. The user one points at the machine. One of those vectors, it doesn't matter which one so I'll make it the user, has to also track the first time they were associated, and the last time they were associated.
Initialize everything so that each user is pointing at -1 (no machine), and each machine is pointing at -1 (no user). Here is pseudo-code for how to process your records:
for (user, machine, time) in records:
    # By user.machine I mean look up the machine the user currently points at
    if user.machine <> machine:
        output_record_and_clear(user)
    if machine.user <> user:
        output_record_and_clear(machine.user)
        # A new association starts here, so only now reset start_time
        user.machine = machine
        machine.user = user
        user.start_time = time
    user.end_time = time

def output_record_and_clear (user):
    # Nothing to do for "no user" (-1) or for a user with no machine
    if -1 == user or -1 == user.machine:
        return
    # A single record has no duration, so it produces no shift
    if user.start_time < user.end_time:
        emit(user, user.machine, user.start_time, user.end_time)
    user.machine.user = -1
    user.machine = -1
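A runnable Python sketch of the same idea, under a few assumptions of mine: dictionaries stand in for the two vectors (a missing key plays the role of -1), the names `infer_shifts` and `flush` are hypothetical, and a final pass flushes pairings still open at the end of the input, which the pseudo-code does not spell out. It uses only the observed timestamps and does not extend shifts to time 0 or to max(t).

```python
from typing import Dict, Iterable, List, Tuple

Shift = Tuple[int, int, int, int]  # (user, machine, start, end)

def infer_shifts(records: Iterable[Tuple[int, int, int]]) -> List[Shift]:
    user_machine: Dict[int, int] = {}  # user -> machine currently pointed at
    machine_user: Dict[int, int] = {}  # machine -> user currently pointed at
    start: Dict[int, int] = {}         # first time of the user's current pairing
    end: Dict[int, int] = {}           # last time of the user's current pairing
    out: List[Shift] = []

    def flush(user: int) -> None:
        machine = user_machine.pop(user, None)
        if machine is None:
            return
        # A pairing seen only once has no duration and is dropped.
        if start[user] < end[user]:
            out.append((user, machine, start[user], end[user]))
        del machine_user[machine]

    for user, machine, time in records:
        if user_machine.get(user) != machine:
            flush(user)                       # user leaves the old machine
            previous = machine_user.get(machine)
            if previous is not None:
                flush(previous)               # machine leaves its old user
            user_machine[user] = machine
            machine_user[machine] = user
            start[user] = time
        end[user] = time

    # Assumption: pairings still open when the input ends are emitted too.
    for user in list(user_machine):
        flush(user)
    return out

sample = [(1, 2, 1), (3, 4, 1), (1, 2, 3), (1, 2, 4),
          (1, 2, 5), (2, 6, 6), (3, 2, 6), (3, 2, 8)]
print(infer_shifts(sample))  # [(1, 2, 1, 5), (3, 2, 6, 8)]
```

With ~50,000 users and machines, plain lists indexed by id would work just as well as dictionaries here.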
I would approach this by sorting the file in place by user or machine (since it's one to one, it shouldn't matter) and then by time. After that the problem is easy: just go line by line and output the shift.
You could also do it in memory, line by line, by keeping a hash table of the machines in use and the last time each was used. When you see a machine again, output the time used and update the table. With only 50,000 machines, this should work fine.
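A small Python sketch of the sorting idea, assuming the records fit in memory (for 500 million records you would need an external sort instead). The name `shifts_by_sorting` is hypothetical. It sorts by user and then time, and treats each run of consecutive records with the same user and machine as one shift.

```python
from itertools import groupby
from typing import Iterable, List, Tuple

def shifts_by_sorting(
    records: Iterable[Tuple[int, int, int]]
) -> List[Tuple[int, int, int, int]]:
    # Sort by user, then by time; sorted() is stable, so ties keep input order.
    ordered = sorted(records, key=lambda r: (r[0], r[2]))
    out = []
    # groupby only merges adjacent rows, so a user returning to the same
    # machine after using another one yields a separate run.
    for (user, machine), run in groupby(ordered, key=lambda r: (r[0], r[1])):
        times = [t for _, _, t in run]
        if times[0] < times[-1]:          # skip single-record runs
            out.append((user, machine, times[0], times[-1]))
    return out

sample = [(1, 2, 1), (3, 4, 1), (1, 2, 3), (1, 2, 4),
          (1, 2, 5), (2, 6, 6), (3, 2, 6), (3, 2, 8)]
print(shifts_by_sorting(sample))  # [(1, 2, 1, 5), (3, 2, 6, 8)]
```

One limitation of sorting by user alone: if another user briefly takes the machine in between, this sketch will not split the first user's run.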
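A streaming sketch of the hash-table variant in Python, keyed on machine only. The name `shifts_by_machine` and the `[user, first_time, last_time]` table entries are my own assumptions. Because only machines are tracked, a user moving to a different machine is not noticed until the old machine is used by someone else.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

def shifts_by_machine(
    records: Iterable[Tuple[int, int, int]]
) -> Iterator[Tuple[int, int, int, int]]:
    in_use: Dict[int, List[int]] = {}  # machine -> [user, first_time, last_time]
    for user, machine, time in records:
        entry = in_use.get(machine)
        if entry is not None and entry[0] == user:
            entry[2] = time            # same user still on it: extend the shift
            continue
        if entry is not None and entry[1] < entry[2]:
            # Machine changed hands: output the previous user's shift.
            yield (entry[0], machine, entry[1], entry[2])
        in_use[machine] = [user, time, time]
    # Assumption: shifts still open at end of input are output as well.
    for machine, (user, first, last) in in_use.items():
        if first < last:
            yield (user, machine, first, last)

sample = [(1, 2, 1), (3, 4, 1), (1, 2, 3), (1, 2, 4),
          (1, 2, 5), (2, 6, 6), (3, 2, 6), (3, 2, 8)]
print(list(shifts_by_machine(sample)))  # [(1, 2, 1, 5), (3, 2, 6, 8)]
```

Since it is a generator, the table never holds more than one entry per machine, regardless of how many records are read.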