
Tokenizer for full-text

This should be an ideal case of not re-inventing the wheel, but so far my search has been in vain.

Instead of writing one myself, I would like to use an existing C++ tokenizer. The tokens are to be used in an index for full-text searching. Performance is very important; I will be parsing many gigabytes of text.

Edit: Please note that the tokens are to be used in a search index. Creating such tokens is not an exact science (AFAIK) and requires some heuristics. This has been done a thousand times before, and probably in a thousand different ways, but I can't even find one of them :)

Any good pointers?

Thanks!


The C++ String Toolkit Library (StrTk) has the following solution to your problem:

#include <iostream>
#include <string>
#include <deque>
#include "strtk.hpp"

int main()
{
   std::deque<std::string> word_list;
   strtk::for_each_line("data.txt",
                        [&word_list](const std::string& line)
                        {
                           const std::string delimiters = "\t\r\n ,.;:'\""
                                                          "!@#$%^&*_-=+`~/\\"
                                                          "()[]{}<>";
                           strtk::parse(line,delimiters,word_list);
                        });

   std::cout << strtk::join(" ",word_list) << std::endl;

   return 0;
}

More examples can be found here.


If performance is a main issue, you should probably stick to good old strtok, which is sure to be fast (though note that it modifies its input in place and is not reentrant):

/* strtok example */
#include <stdio.h>
#include <string.h>

int main ()
{
  char str[] ="- This, a sample string.";
  char * pch;
  printf ("Splitting string \"%s\" into tokens:\n",str);
  pch = strtok (str," ,.-");
  while (pch != NULL)
  {
    printf ("%s\n",pch);
    pch = strtok (NULL, " ,.-");
  }
  return 0;
}


A regular expression library might work well if your tokens aren't too difficult to parse.
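For instance, here is a sketch using the standard <regex> header (C++11), treating runs of alphanumeric characters as tokens; the character class is just an illustrative choice, not a complete definition of a "word". Be aware that std::regex implementations are often slow, so benchmark before relying on this for gigabytes of text:

```cpp
#include <regex>
#include <string>
#include <vector>

// Extract runs of alphanumeric characters as tokens; everything
// else (punctuation, whitespace) acts as a separator.
std::vector<std::string> regex_tokenize(const std::string& text)
{
    static const std::regex word_re("[A-Za-z0-9]+");
    std::vector<std::string> tokens;
    for (std::sregex_token_iterator it(text.begin(), text.end(), word_re), end;
         it != end; ++it)
    {
        tokens.push_back(*it);
    }
    return tokens;
}
```

Calling `regex_tokenize("- This, a sample string.")` yields the four tokens This, a, sample, string.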


I wrote my own tokenizer as part of the open-source SWISH++ indexing and search engine.

There's also the ICU tokenizer, which handles Unicode.


I might look into std::stringstream from <sstream>. C-style strtok has a number of usability problems, and C-style strings are just troublesome.

Here's an ultra-trivial example of it tokenizing a sentence into words:

#include <sstream>
#include <iostream>
#include <string>

int main(void) 
{
   std::stringstream sentence("This is a sentence with a bunch of words");
   std::string word;
   // Test the extraction itself, so the loop ends cleanly at
   // end-of-stream instead of emitting a trailing empty token.
   while (sentence >> word)
   {
      std::cout << "Got token: " << word << std::endl;
   }
}

janks@phoenix:/tmp$ g++ tokenize.cc && ./a.out
Got token: This
Got token: is
Got token: a
Got token: sentence
Got token: with
Got token: a
Got token: bunch
Got token: of
Got token: words

The std::stringstream class is "bi-directional", in that it supports input and output. You'd probably want to do just one or the other, so you'd use std::istringstream or std::ostringstream.

The beauty of them is that they are also a std::istream and a std::ostream respectively, so you can use them as you'd use std::cin or std::cout, which are hopefully familiar to you.
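As a minimal sketch of the input-only variant, here's a small helper (the name `split_ws` is just my own choice) that collects whitespace-separated tokens into a vector:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split on whitespace using std::istringstream, which supports
// extraction only -- all a tokenizer needs.
std::vector<std::string> split_ws(const std::string& text)
{
    std::istringstream in(text);
    std::vector<std::string> tokens;
    std::string word;
    while (in >> word)
        tokens.push_back(word);
    return tokens;
}
```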

Some might argue these classes are expensive to use; std::strstream from <strstream> is mostly the same thing, but built on top of cheaper C-style 0-terminated strings. It might be faster, but it has been deprecated since C++98, so I wouldn't reach for it. In any case, I wouldn't worry about performance right away. Get something working, and then benchmark it. Chances are you can get enough speed by simply writing well-written C++ that minimizes unnecessary object creation and destruction. If it's still not fast enough, you can look elsewhere. These classes are probably fast enough, though: your CPU can waste thousands of cycles in the time it takes to read a block of data from a hard disk or network.


You can use the Ragel State Machine Compiler to create a tokenizer (or a lexical analyzer).

The generated code has no external dependencies.

I suggest you look at the clang.rl sample for a relevant example of the syntax and usage.


Well, I would begin by searching Boost and... hop: Boost.Tokenizer

The nice thing? By default it breaks on whitespace and punctuation, because it's meant for text, so you won't forget a symbol.

From the introduction:

#include<iostream>
#include<boost/tokenizer.hpp>
#include<string>

int main(){
   using namespace std;
   using namespace boost;
   string s = "This is,  a test";
   tokenizer<> tok(s);
   for(tokenizer<>::iterator beg=tok.begin(); beg!=tok.end();++beg){
       cout << *beg << "\n";
   }
}

// prints
This
is
a
test

// note how the ',' and ' ' were nicely removed

And there are additional features:

  • it can escape characters
  • it is compatible with iterators, so you can use it with an istream directly... and thus with an ifstream

and a few options (like keeping empty tokens, etc.)

Check it out!
