Parsing tab separated data

I have a text file (~10GB) with the following format:

data1<TAB>data2<TAB>data3<TAB>data4<NEWLINE>

I want to scan through it and do processing only on data2. What is the best (fastest) way to extract data2 in C++?

EDIT: Added NEWLINE


Read the file line by line. For each line, split on the tab. That will leave you with an array containing the fields, allowing you to work with the second field (data2).


This sounds like a job for a higher-level tool, such as the shell utilities:

cut -f2           # from stdin
cut -f2 <my_file  # from file

But nonetheless, you can do that with C++ as well:

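void parse(std::istream& in)
{
    std::string skip, data2;
    // Test the extraction itself so that a failed read at end of file is
    // never processed. operator>> splits on any whitespace, so this assumes
    // the fields contain no spaces.
    while( in >> skip >> data2 ) {  // throwaway 1, then data2
        process(data2);
        in >> skip >> skip;  // throwaway 3 and 4
    }
}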

// ...
parse(std::cin);
std::ifstream file("my_file");
parse(file);


Read the file a line at a time. Parsing out the tabs from there is pretty straightforward. You could use something like strtok() or a similar routine.
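A rough sketch of the strtok() route, assuming each line has already been read into a writable char buffer. The name second_field_strtok is hypothetical. One caveat to be aware of: strtok treats consecutive delimiters as one, so an empty field is silently skipped, and it is not thread-safe.

```cpp
#include <cstring>

// Hypothetical helper: returns a pointer to the second tab-separated field
// of `line`, or nullptr if the line has fewer than two non-empty fields.
// strtok overwrites delimiters in `line` with '\0', so the buffer is modified.
const char* second_field_strtok(char* line) {
    if (std::strtok(line, "\t\n") == nullptr) return nullptr;  // data1
    return std::strtok(nullptr, "\t\n");                        // data2
}
```

If empty fields can occur in your data, searching for the tabs yourself (for example with strchr) is the safer choice.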


Well, open a file stream (which should be able to handle 10 GB files), skip to just past the first tab ('\t'), read your data, then skip to the next newline and repeat.

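#include <fstream>
#include <limits>
#include <string>

int main(){
  std::ifstream fin("your_file.txt");
  // no fixed limit on how far to skip, so long fields are handled too
  const std::streamsize max_skip = std::numeric_limits<std::streamsize>::max();
  std::string data2;

  // skip to first tab, then read data2 up to (and consuming) the second tab;
  // the loop ends as soon as either step fails at end of file
  while(fin.ignore(max_skip,'\t') && std::getline(fin, data2, '\t')){
    // do stuff with data2

    // skip to next line
    fin.ignore(max_skip,'\n');
  }
}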


Since the file is of a considerable size, you might consider a technique that lets you overlap your I/O with your processing. In response to a comment, you mentioned you were working on Linux. Provided you are using kernel 2.6 or later, you might consider using Linux asynchronous I/O (AIO). Specifically, you would use aio_read to queue up some read requests, then use aio_suspend to wait for one (or more) of the requests to complete. As requests complete, you would scan through the buffers using a plain char* to locate the data you are interested in. For each piece of data you find, you could at that point create a std::string (although avoiding copying may be beneficial) and process it. Once you have scanned a block, you would requeue it to read another block from the file. You continue doing this until you have processed every block in the file.

The code for this method will be more complex than reading the file line by line, but it may be considerably faster.


You could use iostream as others have suggested. Another way to go would be to simply use fscanf. For example:

#include <stdio.h>

...

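FILE* fp = fopen(path_to_file, "r");
char data[256];

/* %*s reads a field and discards it; %255s stores data2 without overflowing
   the buffer. %s splits on any whitespace, so this assumes the fields contain
   no spaces. fscanf returns the number of fields stored, and EOF (nonzero)
   at end of file, so compare the result against 1. */
while(fp && fscanf(fp, "%*s %255s %*s %*s", data) == 1)
{
   /* do what you want with your data */
}
if(fp) fclose(fp);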