How to automatically classify words in the dictionary?

2023-02-28 14:21

I have a large dictionary file, dic.txt (it's actually SOWPODS), with one English word per line. I want to automatically split this file into three files: easy_dic.txt (the most common everyday words, roughly the vocabulary of a 16-year-old), medium_dic.txt (words that are not in common usage but are still known to many people, roughly the knowledge of a 30-year-old, minus the words found in easy_dic.txt), and hard_dic.txt (very esoteric words that only professional Scrabble players would know). What's the easiest way to accomplish this? You can use any resources from the internet.

Google has the right tool :), and shares its DB!

The Ngram viewer is a tool to check out and compare the frequency of appearance of words in literature, magazines, etc.

You can download the DB and train your dictionaries from it.

HTH!

BTW, the tool is VERY fun to use: you can discover when a word first appeared and when it fell out of use.
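If you go the Ngram route, the first step is to collapse the per-year rows into one total per word. Here is a minimal sketch, assuming the 1-gram files have tab-separated rows of the form word, year, match_count, volume_count (check the format of the release you download, since it differs between versions). The function name and the `min_year` cutoff are my own choices for illustration, not part of Google's data.

```python
from collections import Counter

def aggregate_ngram_counts(lines, min_year=1980):
    """Sum match counts per lower-cased word from 1-gram rows.

    Assumed row layout (an assumption, verify against your download):
        word <TAB> year <TAB> match_count <TAB> volume_count
    """
    totals = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed rows
        word, year, match_count = fields[0], fields[1], fields[2]
        if not (year.isdigit() and match_count.isdigit()):
            continue  # skip rows whose numeric fields are not plain integers
        # Keep purely alphabetic tokens only; this also drops entries
        # such as "x_NOUN" that carry a part-of-speech suffix.
        if not word.isalpha() or int(year) < min_year:
            continue
        totals[word.lower()] += int(match_count)
    return totals
```

Because the function only iterates over lines, you can pass it an open file handle directly, e.g. `aggregate_ngram_counts(open(path, encoding="utf-8"))`, where `path` is wherever you saved the file. Restricting to recent years is a guess at what "common today" means; set `min_year=0` to count everything.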

Take some books (preferably from your three categories) that are available in a computer-readable form.
Create histograms for all words from those books.
Merge the histograms for all books from each category.
When processing your dictionary, check in which category's histogram the word has the highest count and put the word in this category.

Instead of the last step, you could also simply process your histograms and remove each word from all histograms except the one with the highest number of hits. Then you already have a word list without using an external dictionary file.
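The histogram steps above can be sketched roughly as follows. This is only an illustration: the function names, the regex tokenizer, and the rule that words never seen in any book fall into a default category are my assumptions, not something the steps prescribe. Ties go to whichever category is listed first.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z]+")  # crude tokenizer: runs of ASCII letters

def histogram(texts):
    """Build one merged word histogram from all texts of a category."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    return counts

def classify(words, histograms, default="hard"):
    """Put each word into the category whose histogram counts it most often.

    Words with a zero count everywhere go to `default` (an assumption).
    """
    result = {name: [] for name in histograms}
    result.setdefault(default, [])
    for word in words:
        w = word.lower()
        # max() returns the first category in case of a tie
        best = max(histograms, key=lambda name: histograms[name][w])
        if histograms[best][w] == 0:
            best = default
        result[best].append(word)
    return result
```

Note that comparing raw counts favors whichever category has the most text. If your three book collections differ a lot in size, dividing each count by the total number of words in that category before comparing is probably fairer.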

Download a Wikipedia dump and learn word frequencies with some LingPipe tool (optimal data structures). Check the frequency distribution of the words from your dictionary, then split them into 3 groups.
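Once you have a word-to-count mapping from any corpus, the final split could look something like this. It is a plain-Python sketch and does not use LingPipe. The group sizes are placeholder assumptions that you would tune by looking at where the words start to feel unfamiliar.

```python
def split_by_frequency(words, freq, easy_size=10_000, medium_size=30_000):
    """Rank dictionary words by corpus frequency and cut into three groups.

    `freq` maps lower-cased words to counts; words missing from it count
    as 0 and therefore end up at the rare end. The two size parameters
    are arbitrary defaults (an assumption), not recommended values.
    """
    # Sort by descending count, then alphabetically to keep ties stable.
    ranked = sorted(words, key=lambda w: (-freq.get(w.lower(), 0), w))
    easy = ranked[:easy_size]
    medium = ranked[easy_size:easy_size + medium_size]
    hard = ranked[easy_size + medium_size:]
    return easy, medium, hard

def write_word_list(path, words):
    """Write one word per line, matching the layout of dic.txt."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.writelines(word + "\n" for word in words)
```

For example, reading dic.txt into a list, calling `split_by_frequency`, and passing the three results to `write_word_list` with the names easy_dic.txt, medium_dic.txt and hard_dic.txt would produce the files asked for in the question. Fixed group sizes are the simplest option; cutting at count thresholds instead would work just as well.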

Tags: classification data-mining language-agnostic
