Detect and remove noise text [closed]
Given a database table with a huge amount of data in it, what is the best practice for removing noise text such as:
- fghfghfghfg
- qsdqsdqsd
- rtyrtyrty
That noise is stored in the "name" field.
I'm working on data with Java standard structures.
Removing stuff like that isn't as easy as it might seem.
For us humans, it's easy to see that "djkhfkjh" doesn't make any sense. But how would a computer detect this kind of noise? How would it know whether "Eyjafjallajökull" is just someone smashing the keyboard or the most talked-about mountain of the last couple of years?
You can't do this reliably without many false positives, so in the end you are back to separating the false positives from the true positives by hand.
Well, you can build a classifier using NLP methods, and train it on examples of noise and not-noise. One ready-made example you could use is the language detector from Apache Tika. If the language detector says 'beats me', that might be good enough.
Get a dictionary with as many names as you can find and filter your data to display the ones that are not in the dictionary. Then you have to review and delete them one by one to make sure you do not delete valid data. Sorting the list by name can help you delete more rows at a time.
If the rest of the text is English, you could use a word list. If more than a given percentage (say, 50%) of the words in the text are not in the word list, it is probably noise.
You may want to set a threshold of, say, 5 words, to prevent deleting posts like 'LOL'.
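A rough Java sketch of that check, in case it helps. The class name, the method name and the two constants are made up for illustration; the 50% and 5-word figures are just the ones suggested above, and the word list is assumed to be loaded into a lowercase Set beforehand.

```java
import java.util.Locale;
import java.util.Set;

class WordListNoiseFilter {
    // Illustrative thresholds (assumptions): tune them against your own data.
    static final double MAX_UNKNOWN_RATIO = 0.5;
    static final int MIN_WORDS = 5;

    /**
     * Returns true if more than MAX_UNKNOWN_RATIO of the words in the text
     * are missing from knownWords. Texts with fewer than MIN_WORDS words are
     * never flagged, so short entries like "LOL" survive.
     * knownWords is assumed to contain lowercase entries only.
     */
    static boolean looksLikeNoise(String text, Set<String> knownWords) {
        int total = 0;
        int unknown = 0;
        // Split on anything that is not a letter; skip the empty tokens
        // that split() produces when the text starts with a separator.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (word.isEmpty()) {
                continue;
            }
            total++;
            if (!knownWords.contains(word)) {
                unknown++;
            }
        }
        if (total < MIN_WORDS) {
            return false; // too short to judge
        }
        return (double) unknown / total > MAX_UNKNOWN_RATIO;
    }
}
```

Note that with a minimum of 5 words this never flags a single-word value such as "fghfghfghfg", so for a "name" column you would probably lower MIN_WORDS and accept more false positives.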
On most Linux installations, you can extract a word list from the spell checker aspell like this:
aspell --lang en dump master
You're going to need to start by defining "noise text" more effectively. Defining the problem is the hard part here. You can't write code that will say "get rid of strings that are sort of like _____." It looks like the pattern you've identified is "a consistent set of three characters in a row, and the set repeats at least once, but may not terminate cleanly (it could terminate on a character from the middle of the set)."
Now write a regular expression that matches that pattern, and test it.
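For the pattern as described (a group of three letters, repeated at least once, possibly cut off partway through the group), one possible regular expression in Java could look like the sketch below. The class and method names are invented for the example, and it only covers this one pattern.

```java
import java.util.regex.Pattern;

class RepeatedTripletMatcher {
    // Three letters captured one by one, the same three repeated at least
    // once, then optionally the first one or two letters of the group again
    // (the "does not terminate cleanly" case).
    private static final Pattern REPEATED_TRIPLET = Pattern.compile(
            "(\\p{L})(\\p{L})(\\p{L})(?:\\1\\2\\3)+(?:\\1\\2?)?");

    /** Returns true if the whole string is a repeated three-letter group. */
    static boolean isRepeatedTriplet(String s) {
        // matches() requires the entire input to fit the pattern.
        return REPEATED_TRIPLET.matcher(s).matches();
    }
}
```

The backreferences are case-sensitive here, so you may want to lowercase the value before testing it. It matches all three samples from the question, but it would also match something like "thethethe", so review the hits before deleting anything.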
But I bet there are other patterns that you're looking for...
Inspect each word and see how much redundancy is there. If there are more than three consecutive repeated groups of letters, it is a good candidate for noise. Also, look for groups of letters that don't usually belong together and for groups of consecutive letters that are also consecutive on the keyboard. If a whole word is made of such letters that are keyboard neighbors, it also claims a spot on the noise list.
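The keyboard-neighbor part could be sketched roughly like this in Java. Everything here is an assumption made for illustration: the class name, the two thresholds, the choice to look only at left/right neighbors within a row, and the choice to include both QWERTY and AZERTY rows (the "qsd" sample looks like an AZERTY home row).

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class KeyboardWalkDetector {
    // Letter rows of the two layouts assumed to be in use.
    private static final String[] ROWS = {
        "qwertyuiop", "asdfghjkl", "zxcvbnm",   // QWERTY
        "azertyuiop", "qsdfghjklm", "wxcvbn"    // AZERTY
    };

    // Every pair of letters that sit side by side in a row, in both orders.
    private static final Set<String> NEIGHBOR_PAIRS = new HashSet<>();
    static {
        for (String row : ROWS) {
            for (int i = 0; i + 1 < row.length(); i++) {
                String a = row.substring(i, i + 1);
                String b = row.substring(i + 1, i + 2);
                NEIGHBOR_PAIRS.add(a + b);
                NEIGHBOR_PAIRS.add(b + a);
            }
        }
    }

    // Illustrative thresholds (assumptions), not tuned on real data.
    static final int MIN_LENGTH = 6;
    static final double MIN_NEIGHBOR_RATIO = 0.65;

    /** Fraction of adjacent letter pairs that are also keyboard neighbors. */
    static double neighborRatio(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.length() < 2) {
            return 0.0;
        }
        int neighbors = 0;
        for (int i = 0; i + 1 < w.length(); i++) {
            if (NEIGHBOR_PAIRS.contains(w.substring(i, i + 2))) {
                neighbors++;
            }
        }
        return (double) neighbors / (w.length() - 1);
    }

    /** Flags words that are long enough and mostly made of neighbor pairs. */
    static boolean looksLikeKeyboardWalk(String word) {
        return word.length() >= MIN_LENGTH
                && neighborRatio(word) >= MIN_NEIGHBOR_RATIO;
    }
}
```

Short real words such as "were" consist entirely of neighboring keys, which is why the sketch ignores anything under MIN_LENGTH. Expect false positives anyway and treat the result as a candidate list.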
Training an NLP classifier would probably be the best way to go. However, a simpler method might be to simply check that each word exists in a list of all known "valid" words. Most Unix systems have a file called /usr/share/dict/words that you can use for this purpose. Additionally, Ubuntu expands on this with /usr/share/dict/american-english, /usr/share/dict/american-english-huge, and /usr/share/dict/american-english-insane, each list more comprehensive than the last. These lists also include a lot of common misspellings, so you won't filter out text that's not technically a word, but clearly recognizable as a word.
If you're really ambitious, you can combine these approaches, and use these word lists to train a Bayesian or Maximum Entropy classifier.
There are a lot of good answers here. Which one(s) will work for you depends a lot on the specifics of your problem -- for example, is the input supposed to be English words, usernames, people's last names, etc.
One approach: write a program to analyze what you consider "valid" input. Keep track of how frequently every possible three-letter sequence appears in legitimate text. Then when you have input to check, look at each three-letter sequence of the input and look up its expected frequency. Something like "xzt" probably has a frequency near zero. If you have too many subsequences like that, mark it as garbage.
Problems with this:
- You might treat bad spelling as garbage, for example if someone forgets to put a 'u' after a 'q' in a word.
- You won't catch input like "thethethethe".
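A minimal Java sketch of the three-letter-sequence idea, with the caveats above still applying. The class and method names are hypothetical, and the two parameters (the minimum count for a sequence to be considered normal, and the share of rare sequences you tolerate) are assumptions you would have to tune on your own data.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class TrigramModel {
    // How often each three-character sequence occurred in the valid samples.
    private final Map<String, Integer> counts = new HashMap<>();

    /** Counts every three-character sequence of one known-valid value. */
    void train(String validValue) {
        String s = validValue.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
    }

    /**
     * Fraction of the input's three-character sequences that were seen
     * fewer than minCount times during training. Inputs shorter than
     * three characters yield 0.0 because there is nothing to look up.
     */
    double rareTrigramRatio(String input, int minCount) {
        String s = input.toLowerCase(Locale.ROOT);
        int total = s.length() - 2;
        if (total <= 0) {
            return 0.0;
        }
        int rare = 0;
        for (int i = 0; i < total; i++) {
            if (counts.getOrDefault(s.substring(i, i + 3), 0) < minCount) {
                rare++;
            }
        }
        return (double) rare / total;
    }

    /** Marks the input as garbage when too many of its sequences are rare. */
    boolean looksLikeGarbage(String input, int minCount, double maxRareRatio) {
        return rareTrigramRatio(input, minCount) > maxRareRatio;
    }
}
```

You would call train() once for each value you trust (for example a list of known names), then run looksLikeGarbage() over the "name" column. The model is only as good as the sample it was trained on, so a small sample will flag plenty of legitimate values.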
Examples #1 and #2 can be removed by a parser that tries to figure out how to pronounce the text. Regardless of language they're unspeakable and thus not words.
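A real pronunciation parser is a much bigger job, but as a very crude stand-in you could look at the longest run of consonants in a word. The sketch below does that in Java; it is not a pronunciation model, and the class name, the vowel set (which counts "y" as a vowel) and the limit of 5 are all assumptions made for the example.

```java
import java.text.Normalizer;
import java.util.Locale;

class PronounceabilityCheck {
    // Illustrative limit (assumption): longer consonant runs are flagged.
    static final int MAX_CONSONANT_RUN = 5;
    private static final String VOWELS = "aeiouy";

    /** Length of the longest run of consecutive consonant letters. */
    static int longestConsonantRun(String word) {
        // Decompose accented letters and drop the accents so that, for
        // example, an o with diaeresis is treated as a plain "o".
        String s = Normalizer.normalize(word, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "")
                .toLowerCase(Locale.ROOT);
        int longest = 0;
        int current = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isLetter(c) && VOWELS.indexOf(c) < 0) {
                current++;
                longest = Math.max(longest, current);
            } else {
                current = 0; // a vowel or a non-letter ends the run
            }
        }
        return longest;
    }

    /** Flags words whose longest consonant run exceeds the limit. */
    static boolean looksUnpronounceable(String word) {
        return longestConsonantRun(word) > MAX_CONSONANT_RUN;
    }
}
```

This flags "fghfghfghfg" and "qsdqsdqsd" but not "rtyrtyrty", since the "y" breaks up the consonants, and it is biased towards languages written with Latin letters.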