
Tokenizer for full-text

This should be an ideal case of not re-inventing the wheel, but so far my search has been in vain.

Instead of writing one myself, I would like to use an existing C++ tokenizer. The tokens are to be used in an index for full-text searching. Performance is very important; I will be parsing many gigabytes of text.

Edit: Please note that the tokens are to be used in a search index. Creating such tokens is not an exact science (AFAIK) and requires some heuristics. This has been done a thousand times before, and probably in a thousand different ways, but I can't even find one of them :)

Any good pointers?

Thanks!


The C++ String Toolkit Library (StrTk) has the following solution to your problem:

#include <iostream>
#include <string>
#include <deque>
#include "strtk.hpp"

int main()
{
   std::deque<std::string> word_list;
   strtk::for_each_line("data.txt",
                        [&word_list](const std::string& line)
                        {
                           const std::string delimiters = "\t\r\n ,.;:'\""
                                                          "!@#$%^&*_-=+`~/\\"
                                                          "()[]{}<>";
                           strtk::parse(line,delimiters,word_list);
                        });

   std::cout << strtk::join(" ",word_list) << std::endl;

   return 0;
}

More examples can be found here.


If performance is a main issue, you should probably stick to good old strtok, which is sure to be fast (though note that it modifies its input in place and is not reentrant):

/* strtok example */
#include <stdio.h>
#include <string.h>

int main ()
{
  char str[] ="- This, a sample string.";
  char * pch;
  printf ("Splitting string \"%s\" into tokens:\n",str);
  pch = strtok (str," ,.-");
  while (pch != NULL)
  {
    printf ("%s\n",pch);
    pch = strtok (NULL, " ,.-");
  }
  return 0;
}


A regular expression library might work well if your tokens aren't too difficult to parse.
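For instance, here is a sketch using the standard <regex> header (C++11), treating runs of alphanumeric characters as tokens; the character class is just an illustrative choice, not a complete definition of a "word". Be aware that std::regex implementations are often slow, so benchmark before relying on this for gigabytes of text:

```cpp
#include <regex>
#include <string>
#include <vector>

// Extract runs of alphanumeric characters as tokens; everything
// else (punctuation, whitespace) acts as a separator.
std::vector<std::string> regex_tokenize(const std::string& text)
{
    static const std::regex word_re("[A-Za-z0-9]+");
    std::vector<std::string> tokens;
    for (std::sregex_token_iterator it(text.begin(), text.end(), word_re), end;
         it != end; ++it)
    {
        tokens.push_back(*it);
    }
    return tokens;
}
```

Calling `regex_tokenize("- This, a sample string.")` yields the four tokens This, a, sample, string.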


I wrote my own tokenizer as part of the open-source SWISH++ indexing and search engine.

There's also the ICU tokenizer, which handles Unicode.


I might look into std::stringstream from <sstream>. C-style strtok has a number of usability problems, and C-style strings are just troublesome.

Here's an ultra-trivial example of it tokenizing a sentence into words:

#include <sstream>
#include <iostream>
#include <string>

int main(void) 
{
   std::stringstream sentence("This is a sentence with a bunch of words");
   std::string word;
   // Test the extraction itself, so the loop ends cleanly at
   // end-of-stream instead of emitting a trailing empty token.
   while (sentence >> word)
   {
      std::cout << "Got token: " << word << std::endl;
   }
}

janks@phoenix:/tmp$ g++ tokenize.cc && ./a.out
Got token: This
Got token: is
Got token: a
Got token: sentence
Got token: with
Got token: a
Got token: bunch
Got token: of
Got token: words

The std::stringstream class is "bi-directional", in that it supports input and output. You'd probably want to do just one or the other, so you'd use std::istringstream or std::ostringstream.

The beauty of them is that they are also a std::istream and a std::ostream respectively, so you can use them as you'd use std::cin or std::cout, which are hopefully familiar to you.
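As a minimal sketch of the input-only variant, here's a small helper (the name `split_ws` is just my own choice) that collects whitespace-separated tokens into a vector:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split on whitespace using std::istringstream, which supports
// extraction only -- all a tokenizer needs.
std::vector<std::string> split_ws(const std::string& text)
{
    std::istringstream in(text);
    std::vector<std::string> tokens;
    std::string word;
    while (in >> word)
        tokens.push_back(word);
    return tokens;
}
```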

Some might argue these classes are expensive to use; std::strstream from <strstream> is mostly the same thing, but built on top of cheaper C-style 0-terminated strings. It might be faster, but it has been deprecated since C++98, so I wouldn't reach for it. In any case, I wouldn't worry about performance right away. Get something working, and then benchmark it. Chances are you can get enough speed by simply writing well-written C++ that minimizes unnecessary object creation and destruction. If it's still not fast enough, you can look elsewhere. These classes are probably fast enough, though: your CPU can waste thousands of cycles in the time it takes to read a block of data from a hard disk or network.


You can use the Ragel State Machine Compiler to create a tokenizer (or a lexical analyzer).

The generated code has no external dependencies.

I suggest you look at the clang.rl sample for a relevant example of the syntax and usage.


Well, I would begin by searching Boost and... hop: Boost.Tokenizer

The nice thing? By default it breaks on whitespace and punctuation, because it's meant for text, so you won't forget a symbol.

From the introduction:

#include<iostream>
#include<boost/tokenizer.hpp>
#include<string>

int main(){
   using namespace std;
   using namespace boost;
   string s = "This is,  a test";
   tokenizer<> tok(s);
   for(tokenizer<>::iterator beg=tok.begin(); beg!=tok.end();++beg){
       cout << *beg << "\n";
   }
}

// prints
This
is
a
test

// note how the ',' and ' ' were nicely removed

And there are additional features:

  • it can escape characters
  • it is compatible with iterators, so you can use it with an istream directly... and thus with an ifstream

and a few options (like keeping empty tokens, etc.)

Check it out!
