
NLTK - when to normalize the text?

I've finished gathering the data I plan to use for my corpus, but I'm a bit confused about whether I should normalize the text. I plan to tag & chunk the corpus in the future. Some of NLTK's corpora are all lower case and others aren't.

Can anyone shed some light on this subject, please?


By "normalize" do you just mean making everything lowercase?

The decision about whether to lowercase everything really depends on what you plan to do. For some purposes, lowercasing everything is better because it lowers the sparsity of the data: capitalized forms are rarer than their lowercase counterparts and count as separate word types, which might confuse the system unless you have a corpus massive enough that the statistics on capitalized words are decent. In other tasks, case information might be valuable.
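To make the sparsity point concrete, here is a minimal sketch using only the standard library. The toy sentence is made up for illustration; with a real corpus you would feed in your own tokens.

```python
from collections import Counter

# Hypothetical toy corpus, pre-split on whitespace for simplicity.
tokens = "The cat saw the dog . The dog saw Paris .".split()

# Count word types with case preserved and with everything lowercased.
cased = Counter(tokens)
lowered = Counter(t.lower() for t in tokens)

print(len(cased), len(lowered))        # number of distinct types in each view
print(cased["The"], cased["the"])      # "The" and "the" are separate types
print(lowered["the"])                  # merged into one type after lowercasing
```

Lowercasing merges "The" and "the" into a single, better-attested type, but it also turns "Paris" into "paris", which is exactly the kind of case information that might matter for tagging or chunking.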

There are other, similar decisions you'll have to make. For example, should "can't" be treated as ["can't"], ["can", "'t"], or ["ca", "n't"] (I've seen all three in different corpora)? What about 7-year-old? Is it one long word, or three words that should be separated?
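The three conventions for "can't" and the two options for "7-year-old" can be sketched as below. This is only an illustration: the function names and the style labels ("whole", "apostrophe", "treebank") are made up here, and the "treebank" label just reflects that the ["ca", "n't"] split resembles Penn Treebank style tokenization.

```python
import re

def split_contraction(word, style):
    """Split an n't contraction under one of three conventions (hypothetical helper)."""
    m = re.fullmatch(r"(\w+)n't", word)
    if m is None or style == "whole":
        return [word]                      # can't -> ["can't"]
    if style == "apostrophe":
        return [m.group(1) + "n", "'t"]    # can't -> ["can", "'t"]
    if style == "treebank":
        return [m.group(1), "n't"]         # can't -> ["ca", "n't"]
    raise ValueError(f"unknown style: {style}")

def split_hyphenated(word, keep_whole):
    """Keep a hyphenated compound as one token or split it on hyphens (hypothetical helper)."""
    return [word] if keep_whole else word.split("-")

for style in ("whole", "apostrophe", "treebank"):
    print(style, split_contraction("can't", style))

print(split_hyphenated("7-year-old", keep_whole=True))
print(split_hyphenated("7-year-old", keep_whole=False))
```

Whichever convention you pick, the main thing is to apply the same one to your corpus and to any tagger or chunker training data you use with it.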

That said, there's no reason to reformat the corpus. You can just have your code make these changes on the fly. That way the original information is still around later if you ever need it.
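A minimal sketch of what "on the fly" could look like: a small generator that yields a normalized view of the tokens and leaves the stored data alone. The name `normalized` and the sample tokens are assumptions for illustration, not part of NLTK.

```python
def normalized(tokens, lowercase=True):
    """Yield a normalized view of the tokens without modifying the source."""
    for tok in tokens:
        yield tok.lower() if lowercase else tok

# Hypothetical tokens standing in for whatever your corpus provides.
raw_tokens = ["The", "NLTK", "corpus", "is", "Cased"]

lower_view = list(normalized(raw_tokens))
cased_view = list(normalized(raw_tokens, lowercase=False))

print(lower_view)   # lowercased copy for tasks that want less sparsity
print(raw_tokens)   # original tokens are unchanged
```

Because the normalization lives in code rather than in the files, you can switch it off (or change it) per task without regathering or reformatting anything.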
