Parsing tab separated data

2023-03-04 08:09 Q&A author:

I have a text file (~10GB) with the following format:

data1<TAB>data2<TAB>data3<TAB>data4<NEWLINE>

I want to scan through it and do processing only on data2. What is the best (fastest) way to extract data2 in C++?

EDIT: Added NEWLINE

Read the file line by line. For each line, split on the tab. That will leave you with an array containing the fields, allowing you to work with the second field (data2).
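A minimal sketch of this approach, assuming each line has the four-field layout from the question. The names second_field and for_each_data2 are hypothetical helpers, not from the original post. Rather than building the full array of fields, it locates the first two tabs with std::string::find and copies only data2:

```cpp
#include <istream>
#include <string>

// Hypothetical helper: returns the second tab-separated field of `line`,
// or an empty string if the line contains no tab at all.
std::string second_field(const std::string& line)
{
    const std::string::size_type first = line.find('\t');
    if (first == std::string::npos) return std::string();
    const std::string::size_type second = line.find('\t', first + 1);
    // If there is no second tab, substr() runs to the end of the line.
    return line.substr(first + 1,
                       second == std::string::npos ? std::string::npos
                                                   : second - first - 1);
}

// Reads `in` line by line and passes data2 of each line to `handle`.
template <typename Handler>
void for_each_data2(std::istream& in, Handler handle)
{
    std::string line;
    while (std::getline(in, line)) {
        handle(second_field(line));
    }
}
```

With a std::ifstream opened on the file, for_each_data2(file, process) would call process once per line. Unlike operator>>, this keeps spaces inside a field and returns empty fields as empty strings.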

This sounds like a job for a higher level tool like shell utilities:

cut -f2           # from stdin
cut -f2 <my_file  # from file

But nonetheless, you can do that with C++ as well:

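#include <fstream>
#include <iostream>
#include <string>

void process(const std::string& data2);  // supplied by the caller

// operator>> splits on any whitespace, so this assumes fields contain no spaces.
void parse(std::istream& in)
{
    std::string skip, data2;
    while( in >> skip >> data2 ) {  // throwaway 1, then data2
        process(data2);
        in >> skip >> skip;         // throwaway 3 and 4
    }
}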

// ...
parse(std::cin);
std::ifstream file("my_file");
parse(file);

Read the file a line at a time. From there it's pretty straightforward to pick out the tab-separated fields. You could use strtok() or a similar routine.
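As a rough sketch of the strtok() route, assuming the line has already been read into a std::string (second_token is a hypothetical name used only for illustration):

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: returns the second tab-separated token of `line`.
// std::strtok writes '\0' into the buffer it scans, so the line is taken
// by value and the caller's copy is left untouched.
// Caveat: strtok treats consecutive tabs as a single delimiter, so empty
// fields are skipped rather than returned.
std::string second_token(std::string line)
{
    if (line.empty()) return std::string();
    char* token = std::strtok(&line[0], "\t");   // data1
    if (token != nullptr) {
        token = std::strtok(nullptr, "\t");      // data2
    }
    return token != nullptr ? std::string(token) : std::string();
}
```

Because std::strtok keeps its position in hidden static state, this sketch is not safe to call from several threads at once. If the data can contain empty fields, a search with std::string::find or std::memchr avoids the skipped-field behaviour.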

Well, open a file stream (which should be able to handle 10 GB files), jump to just after the first tab ('\t'), read your data, then skip to the next newline and repeat.

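#include <fstream>
#include <limits>
#include <string>

int main(){
  std::ifstream fin("your_file.txt");
  // no fixed limit on how many characters ignore() may skip
  const std::streamsize max_skip = std::numeric_limits<std::streamsize>::max();
  std::string data2;

  // skip to first tab, then read data2 up to the second tab
  while(fin.ignore(max_skip, '\t') && std::getline(fin, data2, '\t')){
    // do stuff with data2

    // skip to next line
    fin.ignore(max_skip, '\n');
  }
}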

Since the file is of considerable size, you might consider a technique that lets you overlap your I/O with your processing. In response to a comment, you mentioned you were working on Linux. Provided you are using kernel 2.6 or later, you might consider using Linux asynchronous I/O (AIO). Specifically, you would use aio_read to queue up some read requests, then use aio_suspend to wait for one (or more) of the requests to complete. As requests complete, you would scan through the buffers using a plain char* to locate the data you are interested in. For each piece of data you find, you could at that point create a std::string (although avoiding the copy may be beneficial) and process it. Once you have scanned a block, you would requeue it to read another block from the file. You continue doing this until you have processed every block in the file.

The code for this method will be more complex than reading the file line by line, but it may be considerably faster.
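The scanning step can be sketched on its own, without the AIO plumbing. The helper below (scan_block is a hypothetical name) walks a completed buffer with plain char pointers and std::memchr, and hands each data2 to a handler as a pointer plus length so that no std::string is created unless the handler needs one. It assumes the buffer starts at the beginning of a line. Queueing the reads with aio_read and aio_suspend, and carrying a partial last line over to the next block, are left out.

```cpp
#include <cstddef>
#include <cstring>

// Scans [buf, buf + len) for complete lines and calls handle(ptr, length)
// with the second tab-separated field of each one.
// Returns the number of bytes consumed; any bytes after the last '\n' are
// a partial line that the caller must prepend to the next block.
template <typename Handler>
std::size_t scan_block(const char* buf, std::size_t len, Handler handle)
{
    const char* p = buf;
    const char* const end = buf + len;
    while (p < end) {
        const char* eol =
            static_cast<const char*>(std::memchr(p, '\n', end - p));
        if (eol == nullptr) break;  // incomplete line: stop here
        const char* tab1 =
            static_cast<const char*>(std::memchr(p, '\t', eol - p));
        if (tab1 != nullptr) {
            const char* start = tab1 + 1;
            const char* tab2 =
                static_cast<const char*>(std::memchr(start, '\t', eol - start));
            // With no second tab, data2 runs to the end of the line.
            const char* stop = (tab2 != nullptr) ? tab2 : eol;
            handle(start, static_cast<std::size_t>(stop - start));
        }
        p = eol + 1;
    }
    return static_cast<std::size_t>(p - buf);
}
```

The same function works on buffers filled by ordinary blocking reads, which makes it easy to check the scanning logic before adding asynchronous I/O.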

You could use iostream as others have suggested. Another way to go would be to simply use fscanf. For example:

#include <stdio.h>

...

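FILE* fp = fopen(path_to_file, "r");
char data[256];  /* data2 must be shorter than 256 characters */

/* %*s reads and discards a field; %255s stores data2 without overflowing.
   Whitespace in the format matches the tabs and the newline. Note that %s
   stops at any whitespace, so fields must not contain spaces. */
while(fp && fscanf(fp, "%*s %255s %*s %*s", data) == 1)
{
   /* do what you want with your data */
}
if (fp) fclose(fp);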
