
Simple filtering out of common words from a text description

Words like "a", "the", "best", "kind". I am pretty sure there are good ways of achieving this.

Just to be clear, I am looking for:

  1. The simplest solution that can be implemented, preferably in Ruby.
  2. I have a high tolerance for errors.
  3. If a library of common phrases is what I need, I'm perfectly happy with that too.


These common words are known as "stop words" - there is a similar Stack Overflow question about this here: "Stop words" list for English?

To summarize:

  • If you have a large amount of text to deal with, it would be worth gathering statistics about the frequency of words in that particular data set, and taking the most frequent words for your stop word list. (That you include "kind" in your examples suggests to me that you might have quite an unusual set of data, e.g. with lots of colloquial expressions like "kind of", so perhaps you would need to do this.)
  • Since you say you don't mind much about errors, then it may be sufficient to just use a list of stop words for English that someone else has produced, e.g. the fairly long one used by MySQL or anything else that Google turns up.
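As a rough sketch of the frequency-based approach: count word occurrences in your own data set and take the most frequent words as stop words. The corpus string and the cutoff of three words below are made-up examples.

```ruby
# Count word frequencies in a (hypothetical) corpus and take the
# N most frequent words as candidate stop words.
corpus = "the cat sat on the mat and the dog sat on the log"

freq = Hash.new(0)
corpus.downcase.scan(/\w+/).each { |w| freq[w] += 1 }

stop_words = freq.sort_by { |_, n| -n }.first(3).map(&:first)
# "the" (4 occurrences) comes first; "sat" and "on" (2 each) fill the rest
```

On real data you would use a much larger corpus and cutoff, and probably eyeball the resulting list before using it.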

If you just put these words into a hash in your program it should be easy to filter any list of words.


  Common = %w{ a and or to the is in be }
Uncommon = %{
  To be, or not to be: that is the question: 
  Whether 'tis nobler in the mind to suffer
  The slings and arrows of outrageous fortune,
  Or to take arms against a sea of troubles,
  And by opposing end them? To die: to sleep;
  No more; and by a sleep to say we end
  The heart-ache and the thousand natural shocks
  That flesh is heir to, 'tis a consummation
  Devoutly to be wish'd. To die, to sleep;
  To sleep: perchance to dream: ay, there's the rub;
  For in that sleep of death what dreams may come
}.split /\b/
ignore_me, result = {}, []
  Common.each { |w| ignore_me[w.downcase] = :Common          }
Uncommon.each { |w| result << w unless ignore_me[w.downcase[/\w*/]] }
puts result.join


 ,  not  : that   question: 
Whether 'tis nobler   mind  suffer
 slings  arrows of outrageous fortune,
  take arms against  sea of troubles,
 by opposing end them?  die:  sleep;
No more;  by  sleep  say we end
 heart-ache   thousand natural shocks
That flesh  heir , 'tis  consummation
Devoutly   wish'd.  die,  sleep;
 sleep: perchance  dream: ay, there's  rub;
For  that sleep of death what dreams may come


This is a variation on DigitalRoss's answer.

str=<<EOF
To be, or not to be: that is the question: 
  Whether 'tis nobler in the mind to suffer
  The slings and arrows of outrageous fortune,
  Or to take arms against a sea of troubles,
  And by opposing end them? To die: to sleep;
  No more; and by a sleep to say we end
  The heart-ache and the thousand natural shocks
  That flesh is heir to, 'tis a consummation
  Devoutly to be wish'd. To die, to sleep;
  To sleep: perchance to dream: ay, there's the rub;
  For in that sleep of death what dreams may come
EOF

common = {}
%w{ a and or to the is in be }.each{|w| common[w] = true}
puts str.gsub(/\b\w+\b/){|word| common[word.downcase] ? '': word}.squeeze(' ')

Also relevant: What's the fastest way to check if a word from one string is in another string?
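On that point, Ruby's standard-library `Set` gives constant-time membership tests (unlike `Array#include?`, which scans the whole array), so it is a good container for a stop word list. A small sketch, with an illustrative word list and sample text:

```ruby
require 'set'

# A Set gives O(1) lookups; the word list and text here are just examples.
stop_words = Set.new(%w{ a and or to the is in be })

text = "To be or not to be"
kept = text.scan(/\w+/).reject { |w| stop_words.include?(w.downcase) }
kept.join(' ')  # => "not"
```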


Hold on, you need to do some research before you take out stopwords (aka noise words, junk words). Index size and processing resources aren't the only issues. A lot depends on whether end-users will be typing queries, or you will be working with long automated queries.

All search log analysis shows that people tend to type one to three words per query. When that's all a search has to work with, we can't afford to lose anything. For example, a collection might have the word "copyright" on many documents -- making it very common -- but if the word isn't in the index, it's impossible to do exact phrase searches or proximity relevance ranking. In addition, there are perfectly legitimate reasons to search for the most common words: people may be looking for "The Who", or worse, "The The".

So while there are technical issues to consider, and taking out stopwords is one solution, it may not be the right solution for the overall problem that you are trying to solve.


If you have an array of words to remove named stop_words, then you get the result from this expression:

description.scan(/\w+/).reject do |word|
  stop_words.include? word
end.join ' '

If you want to preserve the non-word characters between each word (note the `\W*` rather than `\W+`, so the final word isn't dropped when the string ends in a word character):

description.scan(/(\w+)(\W*)/).reject do |(word, other)|
  stop_words.include? word
end.flatten.join
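A quick illustration of the first expression; the `stop_words` array and `description` string here are made-up inputs.

```ruby
# Hypothetical inputs showing the scan/reject expression in action.
stop_words = %w{ a and or to the is in be }
description = "to be or not to be: that is the question"

result = description.scan(/\w+/).reject do |word|
  stop_words.include? word
end.join ' '
# => "not that question"
```

Note that `include?` is case-sensitive, so you may want to downcase each word before the lookup if your text is mixed-case.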
