
How to automatically classify words in the dictionary?

I have a large dictionary file, dic.txt (it's actually SOWPODS), with one English word per line. I want to automatically split this file into three files: easy_dic.txt (the most common everyday words, roughly the vocabulary of a 16-year-old), medium_dic.txt (words that are less common but still known to many people, roughly the knowledge of a 30-year-old, minus the words already in easy_dic.txt), and hard_dic.txt (very esoteric words that only professional Scrabble players would know). What's the easiest way to accomplish this? You can use any resources from the internet.


Google has the right tool :), and shares its DB!

The Ngram Viewer is a tool for checking and comparing how frequently words appear in literature, magazines, etc.

You can download the underlying dataset and build your dictionaries from its word counts.

HTH!

BTW, the tool is VERY fun to use for discovering when a word first appeared and when it fell out of use.
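A minimal sketch of turning downloaded n-gram data into one total count per word. The column layout (word, year, match count, volume count, separated by tabs) is an assumption here and should be checked against the documentation of the dataset you actually download; `total_counts` and the sample lines are made up for illustration.

```python
from collections import Counter

def total_counts(lines):
    """Sum per-year counts into one total per lowercase word.

    Assumed line layout (verify against the real dataset):
        word<TAB>year<TAB>match_count<TAB>volume_count
    """
    totals = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue  # skip malformed lines
        word, count = parts[0].lower(), parts[2]
        # Keep purely alphabetic entries only; anything with digits,
        # punctuation, or underscores is skipped.
        if word.isalpha() and count.isdigit():
            totals[word] += int(count)
    return totals

# Made-up sample lines in the assumed layout.
sample = [
    "house\t1999\t120\t40\n",
    "House\t2000\t30\t10\n",
    "zax\t2000\t2\t1\n",
]
counts = total_counts(sample)
```

With a real file you would pass the open file object instead of `sample`. The resulting totals can then be looked up for every word in dic.txt.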


  • Take some books (preferably from your three categories) that are available in a computer-readable form.
  • Create histograms for all words from those books.
  • Merge the histograms for all books from each category.
  • When processing your dictionary, check in which category's histogram the word has the highest count and put the word in this category.

Instead of the last step, you could also simply process your histograms and remove each word from every histogram except the one where it has the highest count. Then you already have three word lists without using an external dictionary file.
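The histogram steps above could be sketched roughly like this. The book texts are tiny invented strings, and sending words that never occur in any book to the "hard" list is my own assumption, not something the steps specify.

```python
import re
from collections import Counter

def histogram(text):
    """Count lowercase alphabetic words in one book's text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def classify(words, histograms, default="hard"):
    """Put each word in the category whose histogram counts it most.

    Words found in no histogram go to `default` (an assumption).
    On a tie, the category listed first wins.
    """
    result = {name: [] for name in histograms}
    result.setdefault(default, [])
    for word in words:
        best = max(histograms, key=lambda name: histograms[name][word])
        if histograms[best][word] == 0:
            best = default
        result[best].append(word)
    return result

# Invented stand-ins for real books, grouped by category.
books = {
    "easy": ["The cat sat on the mat. The cat ran."],
    "medium": ["The committee will adjourn; the cat objected."],
    "hard": ["A zax is a tool. The zax cuts slate."],
}

# Merge the per-book histograms of each category.
histograms = {}
for name, texts in books.items():
    merged = Counter()
    for text in texts:
        merged += histogram(text)
    histograms[name] = merged

groups = classify(["cat", "adjourn", "zax", "qi"], histograms)
```

Raw counts favor whichever category has the most text, so with real books it may be worth dividing each count by the category's total word count before comparing.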


Download a Wikipedia dump and compute word frequencies with one of the LingPipe tools (it has efficient data structures for this). Then look at the frequency distribution of the words in your dictionary and split them into three groups.
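Once you have a word-to-count mapping from whatever corpus you choose, the final split could look something like this. The function name, the counts, and the two thresholds are all hypothetical; in practice you would pick the cut-offs by looking at the frequency distribution of your own data.

```python
def split_by_frequency(words, counts, easy_min, medium_min):
    """Split words into (easy, medium, hard) lists by corpus count.

    `counts` maps lowercase words to corpus counts; words missing
    from it are treated as count 0 and therefore end up in `hard`.
    """
    easy, medium, hard = [], [], []
    for word in words:
        n = counts.get(word.lower(), 0)
        if n >= easy_min:
            easy.append(word)
        elif n >= medium_min:
            medium.append(word)
        else:
            hard.append(word)
    return easy, medium, hard

# Made-up counts and thresholds, for illustration only.
corpus_counts = {"house": 150000, "adjourn": 900, "zax": 2}
dictionary = ["HOUSE", "ADJOURN", "ZAX", "QI"]

easy, medium, hard = split_by_frequency(
    dictionary, corpus_counts, easy_min=10000, medium_min=100
)
```

Each list can then be written out one word per line to easy_dic.txt, medium_dic.txt and hard_dic.txt.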
