Elegant way of parsing Data files for Simulation
I am working on this project where I need to read in a lot of data from .dat files and use the data to perform simulations. The data in my .dat file looks as follows:
DeviceID InteractingDeviceID InteractionStartTime InteractionEndTime
1 2 1101 1105
The four fields are tab-delimited. This line means Device 1 started interacting with Device 2 at 1101 ms and ended the interaction at 1105 ms.
I have trace data sets that record thousands of such interactions, and my job is to analyze them.
The first step is to parse the file. The language of choice is C++. The approach I was thinking of taking is to read the file and, for every line that is read, create a Device object. This Device object will contain the property DeviceId and an array/vector of structs listing all the devices the given DeviceId interacted with over the course of the simulation. Each struct will contain the Interacting Device Id, Interaction Start Time and Interaction End Time.
I have a two-fold question here:
Is my approach correct?
If I am on the right track, how do I rapidly parse these tab delimited data files and create Device objects without excessive memory overhead using C++?
A push in the right direction will be much appreciated.
Thanks
Your approach seems to be correct given the information you've provided.
I'm assuming you'd be creating a class something like:
class device {
public:
    int id;
    vector<interaction> interactions;
    void add_interaction(interaction add_me); // uses vector::insert
};
with
typedef struct interaction_t {
    int other_device_id;
    int start_time;
    int end_time;
} interaction;
At that point, you should be able to read in the file, one line at a time, and pull out the data.
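device* pDev = NULL;
interaction new_interaction;
ifstream ifs("data.dat");
char temp[MAX_LINE_LENGTH + 1];
int id;
while (ifs.getline(temp, MAX_LINE_LENGTH)) {
    // %d reads plain decimal; %i would treat a leading 0 as octal.
    // Skip the header line and any line that does not hold four integers.
    if (sscanf(temp, "%d\t%d\t%d\t%d",
               &id,
               &new_interaction.other_device_id,
               &new_interaction.start_time,
               &new_interaction.end_time) != 4)
        continue;
    pDev = find_device_by_id(id);
    if (pDev != NULL)  // guard against an id that could not be resolved
        pDev->add_interaction(new_interaction);
}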
Code is untested and for illustration purposes only, but you can get the idea. The trick would be writing the find_device_by_id function (it would return a pointer to the device object with a matching id field). This shouldn't require too heavy of a memory overhead per input line; if your input files are huge, you may not be able to store the data in memory and may have to store it in a database instead.
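One possible way to fill in the lookup is a hash map keyed by device id. The following is only a sketch under that assumption: device_table, find_or_create_device and load_interactions are made-up names, the lookup returns a reference and creates missing devices rather than returning a pointer, and stream extraction is swapped in for sscanf.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

struct interaction {
    int other_device_id;
    int start_time;
    int end_time;
};

struct device {
    int id;
    std::vector<interaction> interactions;
};

// Hypothetical container: every device seen so far, keyed by its id.
using device_table = std::unordered_map<int, device>;

// Returns the device with the given id, creating an empty one on first use.
device& find_or_create_device(device_table& devices, int id) {
    auto it = devices.find(id);
    if (it == devices.end()) {
        it = devices.emplace(id, device{id, {}}).first;
    }
    return it->second;
}

// Reads whitespace- or tab-separated records from any input stream.
// Returns the number of lines that could not be parsed as four integers.
std::size_t load_interactions(std::istream& in, device_table& devices) {
    std::size_t skipped = 0;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        int id;
        interaction rec;
        if (fields >> id >> rec.other_device_id >> rec.start_time >> rec.end_time) {
            find_or_create_device(devices, id).interactions.push_back(rec);
        } else {
            ++skipped;  // e.g. the header line or a blank line
        }
    }
    return skipped;
}
```

Passing a std::ifstream opened on the .dat file to load_interactions should work the same way as any other input stream. Memory use is roughly one interaction struct (three ints) per input line plus the map's per-device overhead.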
Addressing the question of designing the in memory storage and linking:
You haven't told us enough. The necessary structure of your data depends on how you need to use the data.
- If you are going to walk sequentially through all or part of the data by start time, you need to be able to visit events in start-time order. If you are going to jump into the middle of the stream, you also need to be able to search efficiently by start time.
- If you want to examine the events active during a certain interval, you also need to be able to search efficiently by end time.
- If you want to examine all the interactions of a single device, you need to be able to select events by device (which your proposed structure does nicely).
- ... what other use cases do you have...
If you don't need the best performance possible (i.e. good performance will do) a relational DB might be in order. Or you can build in memory structures with all the characteristics you need, but they may be moderately complicated...
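As a rough illustration of the simplest in-memory option, the sketch below keeps all events in one vector sorted by start time and uses binary search to jump into the stream. The event struct and the function names are assumptions made for this example. The "active at time t" query still scans every event that started by t, so a proper interval index would be needed if that query dominates.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical flat record: one entry per line of the trace file.
struct event {
    int device_id;
    int other_device_id;
    int start_time;
    int end_time;
};

// Sorts events by start time so they can be walked in order or binary searched.
void sort_by_start(std::vector<event>& events) {
    std::sort(events.begin(), events.end(),
              [](const event& a, const event& b) { return a.start_time < b.start_time; });
}

// Returns the first event whose start_time is >= t (events must be sorted).
std::vector<event>::const_iterator
first_starting_at_or_after(const std::vector<event>& events, int t) {
    return std::lower_bound(events.begin(), events.end(), t,
                            [](const event& e, int value) { return e.start_time < value; });
}

// Collects the events active at time t: started at or before t and not yet ended.
std::vector<event> active_at(const std::vector<event>& events, int t) {
    std::vector<event> result;
    // Only events before this iterator can have started by time t.
    auto stop = std::upper_bound(events.begin(), events.end(), t,
                                 [](int value, const event& e) { return value < e.start_time; });
    for (auto it = events.begin(); it != stop; ++it) {
        if (it->end_time >= t) result.push_back(*it);
    }
    return result;
}
```

Sorting once costs O(n log n); after that, jumping to a start time is O(log n), while active_at is linear in the number of events that started before t.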
I have done a similar thing with interacting people. For future expandability, I would do the following: have a Device class that holds the id and a vector of pointers to Interaction objects. The Devices could be kept in a map (or hash map) for easy lookup. The Interaction class would contain the rest of the information from the file. This will allow you to create polymorphic devices and interactions, in case you ever have multiple kinds of devices or interactions. You might also want to have factories for the devices and interactions, to facilitate this.
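A minimal sketch of that layout might look like the following. Everything here is illustrative: TransferInteraction, make_interaction, device_for and the "kind" strings are invented for the example, and std::unique_ptr is my assumption for how the pointers would be owned (it needs C++14 for std::make_unique).

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Base class for anything recorded between two devices.
class Interaction {
public:
    Interaction(int other_id, int start, int end)
        : other_id_(other_id), start_(start), end_(end) {}
    virtual ~Interaction() = default;

    int other_id() const { return other_id_; }
    int duration() const { return end_ - start_; }
    virtual std::string kind() const { return "generic"; }

private:
    int other_id_, start_, end_;
};

// Hypothetical specialised interaction, only here to show the polymorphism.
class TransferInteraction : public Interaction {
public:
    using Interaction::Interaction;
    std::string kind() const override { return "transfer"; }
};

class Device {
public:
    explicit Device(int id) : id_(id) {}
    int id() const { return id_; }
    void add(std::unique_ptr<Interaction> i) { interactions_.push_back(std::move(i)); }
    const std::vector<std::unique_ptr<Interaction>>& interactions() const {
        return interactions_;
    }

private:
    int id_;
    std::vector<std::unique_ptr<Interaction>> interactions_;
};

// Minimal factory: picks the concrete Interaction type from a kind string.
std::unique_ptr<Interaction> make_interaction(const std::string& kind,
                                              int other_id, int start, int end) {
    if (kind == "transfer")
        return std::make_unique<TransferInteraction>(other_id, start, end);
    return std::make_unique<Interaction>(other_id, start, end);
}

using DeviceMap = std::map<int, Device>;

// Looks up a device by id, inserting a new empty Device if it is not there yet.
Device& device_for(DeviceMap& devices, int id) {
    return devices.emplace(id, Device(id)).first->second;
}
```

The trade-off is one heap allocation per interaction, which costs more memory and time than a plain vector of structs. That is probably only worth paying if several interaction types are actually expected.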
Take a look at Boost.Spirit. It's a decent parser framework.