Storing tokenized text in the db?
I have a simple question. I'm doing some light crawling, so new content arrives every few days. I've written a tokenizer and would like to use it for some text mining purposes. Specifically, I'm using Mallet's topic modeling tool, and one of its pipes tokenizes the text before further processing can be done. With the amount of text in my database, tokenizing takes a substantial amount of time (I'm using regex here).
As such, is it normal to store the tokenized text in the database so that the tokenized data is readily available and tokenizing can be skipped when I need it for other text mining purposes, such as topic modeling or POS tagging? What are the cons of this approach?
Caching Intermediate Representations
It's pretty normal to cache the intermediate representations created by the slower components of your document processing pipeline. For example, if you needed dependency parse trees for all the sentences in each document, it would be pretty crazy to do anything except parse the documents once and then reuse the results.
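A minimal sketch of that idea, assuming documents keyed by an id and a hypothetical slow_parse() function standing in for whatever expensive step you want to cache:

    import os
    import pickle

    CACHE_DIR = "parse_cache"  # assumed location; adjust to taste
    os.makedirs(CACHE_DIR, exist_ok=True)

    def cached_parse(doc_id, text, slow_parse):
        """Return the parsed representation of `text`, reusing a cached copy if one exists."""
        path = os.path.join(CACHE_DIR, "%s.pkl" % doc_id)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)
        result = slow_parse(text)  # the expensive step (parsing, tagging, ...)
        with open(path, "wb") as f:
            pickle.dump(result, f)
        return result

Whether you cache to flat files like this or to a database table is mostly a matter of how you want to query the results later.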
Slow Tokenization
However, I'm surprised that tokenization is really slow for you, since the stuff downstream from tokenization is usually the real bottleneck.
What package are you using to do the tokenization? If you're using Python and you wrote your own tokenization code, you might want to try one of the tokenizers included in NLTK (e.g., TreebankWordTokenizer).
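For example, swapping a hand-rolled regex tokenizer for NLTK's TreebankWordTokenizer is only a couple of lines (assuming NLTK is installed via pip install nltk):

    from nltk.tokenize import TreebankWordTokenizer

    tokenizer = TreebankWordTokenizer()
    tokens = tokenizer.tokenize("New content arrives every few days.")
    print(tokens)
    # ['New', 'content', 'arrives', 'every', 'few', 'days', '.']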
Another good tokenizer, albeit one that is not written in Python, is the PTBTokenizer included with the Stanford Parser and the Stanford CoreNLP end-to-end NLP pipeline.
I store tokenized text in a MySQL database. While I don't always like the overhead of communication with the database, I've found that there are lots of processing tasks that I can ask the database to do for me (like search the dependency parse tree for complex syntactic patterns).
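As a rough sketch of what storing tokens can look like (using SQLite here so the example is self-contained; the same schema works in MySQL, and the table and column names are just illustrative):

    import json
    import sqlite3

    conn = sqlite3.connect("corpus.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS tokenized_docs (
            doc_id INTEGER PRIMARY KEY,
            tokens TEXT  -- JSON-encoded list of tokens
        )
    """)

    def save_tokens(doc_id, tokens):
        conn.execute(
            "INSERT OR REPLACE INTO tokenized_docs (doc_id, tokens) VALUES (?, ?)",
            (doc_id, json.dumps(tokens)),
        )
        conn.commit()

    def load_tokens(doc_id):
        row = conn.execute(
            "SELECT tokens FROM tokenized_docs WHERE doc_id = ?", (doc_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None

Storing the token list as JSON keeps things simple; if you want the database to do per-token queries for you, a separate row per token (doc_id, position, token) is the more flexible layout.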