Detect and remove noise text [closed]

2022-12-30 08:17 Q&A author:

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center. Closed 10 years ago.

Given a database table with a huge amount of data in it, what is the best practice for removing noise text such as:

fghfghfghfg
qsdqsdqsd
rtyrtyrty

The noise is stored in the "name" field.

I'm working on data with Java standard structures.

Removing stuff like that isn't as easy as it might seem.

For us humans, it's easy to see that "djkhfkjh" doesn't make any sense. But how would a computer detect this kind of noise? How would it know whether "Eyjafjallajökull" is just someone smashing his keyboard or the most over-hyped mountain of the last couple of years?

You can't do this reliably without many false positives, so in the end you are back to separating the false positives from the true positives by hand.

Well, you can build a classifier using NLP methods, and train it on examples of noise and not-noise. One ready-made example of this is the language detector from Apache Tika. If the language detector says 'beats me', that might be good enough.

Get a dictionary with as many names as you can find and filter your data to display the ones that are not in the dictionary. Then you have to delete them one by one to make sure you do not delete valid data. Sorting the list by name can help you delete more rows at a time.

If the rest of the text is English, you could use a word list. If more than a given percentage (say, 50%) of the words in the text are not in the word list, it is probably noise.

You may want to apply the check only to texts of at least, say, 5 words, to avoid deleting posts like 'LOL'.
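The word-list check can be sketched in Java roughly as follows. This is only an illustration: the class name, method name and parameters are made up for the example, the 50% and 5-word figures are just the ones suggested above, and the tiny word set in `main` stands in for a real word list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class WordListNoiseCheck {

    /**
     * Returns true when the text has at least minWords words and more than
     * maxUnknownFraction of them are missing from knownWords.
     * knownWords is assumed to hold lower-case entries.
     */
    static boolean looksLikeNoise(String text, Set<String> knownWords,
                                  int minWords, double maxUnknownFraction) {
        int total = 0;
        int unknown = 0;
        // Split on anything that is not a letter or an apostrophe.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}']+")) {
            if (word.isEmpty()) {
                continue; // a leading separator yields an empty token
            }
            total++;
            if (!knownWords.contains(word)) {
                unknown++;
            }
        }
        if (total < minWords) {
            return false; // too short to judge, e.g. "LOL"
        }
        return (double) unknown / total > maxUnknownFraction;
    }

    public static void main(String[] args) {
        // Stand-in for a real word list (hypothetical sample data).
        Set<String> words = new HashSet<>(Arrays.asList(
                "the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"));
        System.out.println(looksLikeNoise("The quick brown fox jumps over the lazy dog", words, 5, 0.5));
        System.out.println(looksLikeNoise("fghfgh qsdqsd rtyrty the dog zxcv", words, 5, 0.5));
        System.out.println(looksLikeNoise("LOL", words, 5, 0.5));
    }
}
```

For a single-word field such as "name", the minimum word count would have to be lowered to 1, at which point every name missing from the list gets flagged.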

On most Linux installations, you can extract a word list from the spell checker aspell like this:

aspell --lang en dump master

You're going to need to start by defining "noise text" more effectively. Defining the problem is the hard part here. You can't write code that will say "get rid of strings that are sort of like _____." It looks like the pattern you've identified is "a consistent set of three characters in a row, and the set repeats at least once, but may not terminate cleanly (it could terminate on a character from the middle of the set)."

Now write a regular expression that matches that pattern, and test it.

But I bet there are other patterns that you're looking for...
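As a starting point, the pattern described above could be written as a Java regular expression along these lines. It is a sketch under the assumption that the whole "name" value is a single run of letters; the class and method names are invented for the example.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class RepeatedTripleCheck {

    // Three letters, then the same three letters repeated at least once,
    // optionally followed by a partial repeat (first letter, or first two).
    private static final Pattern REPEATED_TRIPLE = Pattern.compile(
            "(\\p{L})(\\p{L})(\\p{L})(?:\\1\\2\\3)+(?:\\1\\2?)?");

    /** Returns true when the whole value is a repeated three-letter group. */
    static boolean isRepeatedTriple(String value) {
        return REPEATED_TRIPLE.matcher(value.toLowerCase(Locale.ROOT)).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"fghfghfghfg", "qsdqsdqsd", "rtyrtyrty", "banana"}) {
            System.out.println(s + " -> " + isRepeatedTriple(s));
        }
    }
}
```

This matches the three examples from the question and nothing that lacks at least one full repeat of the first three letters, so other kinds of noise still need their own patterns.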

Inspect each word and see how much redundancy there is. If there are more than three consecutive repeated groups of letters, it is a good candidate for noise. Also, look for groups of letters that don't usually belong together and for groups of consecutive letters that are also consecutive on the keyboard. If a whole word is made of such letters that are keyboard neighbors, it also claims a spot on the noise list.
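One possible reading of the keyboard-neighbor idea, sketched in Java: flag a word when all of its distinct letters form one unbroken run of adjacent keys on a single keyboard row. That interpretation, the minimum length, the row tables and all the names here are assumptions made for the example. Note that the layout matters: "qsd" is a run on AZERTY but not on QWERTY.

```java
import java.util.Locale;

class KeyboardRunCheck {

    // Letter rows of two common layouts (letters only, left to right).
    static final String[] QWERTY = {"qwertyuiop", "asdfghjkl", "zxcvbnm"};
    static final String[] AZERTY = {"azertyuiop", "qsdfghjklm", "wxcvbn"};

    /**
     * Returns true when the word is at least minLength long and its distinct
     * letters occupy one contiguous stretch of a single keyboard row.
     */
    static boolean isKeyboardRun(String word, String[] rows, int minLength) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.isEmpty() || w.length() < minLength) {
            return false;
        }
        for (String row : rows) {
            boolean[] seen = new boolean[row.length()];
            int min = Integer.MAX_VALUE;
            int max = -1;
            boolean allOnRow = true;
            for (int i = 0; i < w.length(); i++) {
                int idx = row.indexOf(w.charAt(i));
                if (idx < 0) {
                    allOnRow = false; // this letter is not on the current row
                    break;
                }
                seen[idx] = true;
                min = Math.min(min, idx);
                max = Math.max(max, idx);
            }
            if (!allOnRow) {
                continue;
            }
            // Every key between the leftmost and rightmost used key must be used.
            boolean contiguous = true;
            for (int i = min; i <= max; i++) {
                if (!seen[i]) {
                    contiguous = false;
                    break;
                }
            }
            if (contiguous) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isKeyboardRun("fghfghfghfg", QWERTY, 6));
        System.out.println(isKeyboardRun("qsdqsdqsd", AZERTY, 6));
        System.out.println(isKeyboardRun("typewriter", QWERTY, 6));
    }
}
```

The minimum length is there because short real words such as "were" or "tree" also sit on a contiguous stretch of the top row, so the result is best treated as a list of candidates to review.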

Training an NLP classifier would probably be the best way to go. However, a simpler method might be to simply check that each word exists in a list of all known "valid" words. Most Unix systems have a file called /usr/share/dict/words that you can use for this purpose. Additionally, Ubuntu expands on this with /usr/share/dict/american-english, /usr/share/dict/american-english-huge, and /usr/share/dict/american-english-insane, each list more comprehensive than the last. These lists also include a lot of common misspellings, so you won't filter out text that's not technically a word, but clearly recognizable as a word.

If you're really ambitious, you can combine these approaches, and use these word lists to train a Bayesian or Maximum Entropy classifier.

There are a lot of good answers here. Which one(s) will work for you depends a lot on the specifics of your problem -- for example, is the input supposed to be English words, usernames, people's last names, etc.

One approach: write a program to analyze what you consider "valid" input. Keep track of how frequently every possible three-letter sequence appears in legitimate text. Then when you have input to check, look at each three-letter sequence of the input and look up its expected frequency. Something like "xzt" probably has a frequency near zero. If you have too many subsequences like that, mark it as garbage.

Problems with this:

You might treat bad spelling as garbage, for example if someone forgets to put a 'u' after a 'q' in a word.
You won't catch input like "thethethethe".
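A minimal Java sketch of the three-letter-sequence approach, assuming you have some collection of known-good text to learn from. The class name, the `minCount` cut-off and the `maxRareFraction` threshold are all placeholders for the example; sensible values depend on your data.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class TrigramNoiseCheck {

    // How often each three-character sequence occurred in the valid text.
    private final Map<String, Integer> counts = new HashMap<>();

    /** Counts every three-character sequence in the given valid texts. */
    void train(Iterable<String> validTexts) {
        for (String text : validTexts) {
            String t = text.toLowerCase(Locale.ROOT);
            for (int i = 0; i + 3 <= t.length(); i++) {
                counts.merge(t.substring(i, i + 3), 1, Integer::sum);
            }
        }
    }

    /** Fraction of the input's trigrams seen fewer than minCount times in training. */
    double rareFraction(String input, int minCount) {
        String t = input.toLowerCase(Locale.ROOT);
        int total = t.length() - 2;
        if (total <= 0) {
            return 0.0; // too short to contain a trigram
        }
        int rare = 0;
        for (int i = 0; i < total; i++) {
            if (counts.getOrDefault(t.substring(i, i + 3), 0) < minCount) {
                rare++;
            }
        }
        return (double) rare / total;
    }

    /** Marks the input as garbage when too many of its trigrams are rare. */
    boolean looksLikeGarbage(String input, int minCount, double maxRareFraction) {
        return rareFraction(input, minCount) > maxRareFraction;
    }

    public static void main(String[] args) {
        TrigramNoiseCheck model = new TrigramNoiseCheck();
        // Hypothetical training data; a real model needs far more text.
        model.train(Arrays.asList("another", "brother", "mother"));
        System.out.println(model.looksLikeGarbage("other", 1, 0.5));
        System.out.println(model.looksLikeGarbage("xztxzt", 1, 0.5));
    }
}
```

Using a fraction rather than an absolute count keeps long inputs from being penalized just for being long, and a single odd trigram from a misspelling is less likely to tip the result.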

Examples #1 and #2 can be removed by a parser that tries to figure out how to pronounce the text. Regardless of language, they're unpronounceable and thus not words.
