Natural Language Processing
I have thousands of sentences in a file. I want to find only correct/useful English-language words. Is this possible with natural language processing?
Sample Sentence:
~@^.^@~ tic but sometimes world good famous tac Zorooooooooooo
I just want to extract only the English words, like
tic world good famous
Any advice on how I can achieve this? Thanks in advance.
You can use the WordNet API for looking up words.
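For example, with NLTK's WordNet interface (an assumption on my part; the answer doesn't name a specific WordNet binding), you can keep a token only if WordNet has an entry for it. A minimal sketch:

from nltk.corpus import wordnet  # requires: pip install nltk, then nltk.download('wordnet')

def is_english_word(token):
    # Keep a token only if WordNet has at least one synset for it.
    return len(wordnet.synsets(token)) > 0

sentence = "~@^.^@~ tic but sometimes world good famous tac Zorooooooooooo"
print([w for w in sentence.split() if is_english_word(w)])

Because WordNet lists only content words (nouns, verbs, adjectives, adverbs), this filter also drops function words such as "but", which may or may not be what you want.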
You need to compile a list of stop words (the ones you don't want to include in your search); afterwards you can filter your text using that stop-word list (a minimal sketch follows the links below). For details, have a look at these Wikipedia articles:
- http://en.wikipedia.org/wiki/Stop_words
- http://en.wikipedia.org/wiki/Natural_language_processing
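A minimal sketch of that filtering step, assuming NLTK's bundled English stop-word list (any hand-compiled list works the same way):

from nltk.corpus import stopwords  # requires nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
sentence = "~@^.^@~ tic but sometimes world good famous tac Zorooooooooooo"
# Drop the tokens that appear in the stop-word list.
print([w for w in sentence.split() if w.lower() not in stop_words])

On its own this only removes common function words; it will not reject junk tokens like "Zorooooooooooo", so it is best combined with a dictionary check such as the WordNet lookup above.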
You can use a language guesser that uses character n-gram statistics. Usually only a small amount of material is needed (both for training and classification). Links to literature and implementations can be found here:
http://odur.let.rug.nl/~vannoord/TextCat/
The methodology is very simple:
1. Collect a small amount of text for each language.
2. Extract and count the 1-grams through 5-grams occurring in the text.
3. Order these n-grams by frequency and take the best, say, 300. This forms the fingerprint of the language.
If you want to classify a text or a sentence, you apply steps 2 and 3 and compare the resulting fingerprint to the fingerprints collected during training. Calculate a score based on the rank differences of the n-grams; the language with the lowest score wins (see the sketch below).
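A minimal sketch of this rank-based ("out-of-place") comparison, with hypothetical one-line training texts standing in for real language samples:

from collections import Counter

def fingerprint(text, max_rank=300):
    # Top-ranked character 1- to 5-grams of a text, ordered by frequency.
    counts = Counter()
    text = text.lower()
    for n in range(1, 6):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(max_rank)]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(sample_fp, lang_fp, penalty=300):
    # Sum of rank differences; n-grams missing from the profile get a fixed penalty.
    return sum(abs(rank - lang_fp.get(gram, penalty)) for gram, rank in sample_fp.items())

# Hypothetical training material, one small text per language.
profiles = {lang: fingerprint(text) for lang, text in {
    "en": "the quick brown fox jumps over the lazy dog and the small cat",
    "de": "der schnelle braune fuchs springt ueber den faulen hund und die katze",
}.items()}

sample = fingerprint("a famous good world")
print(min(profiles, key=lambda lang: out_of_place(sample, profiles[lang])))  # lowest score wins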
You can use Python to achieve this. What you are looking for is filtering for English words.
First, tokenize the sentences (split each sentence into words).
Then use the Python langdetect library to check whether each word is English.
Finally, keep only the words that langdetect labels as English (a combined sketch follows the examples below).
How to install the library:
$ sudo pip install langdetect
Supported Python versions: 2.6, 2.7, 3.x.
>>> from langdetect import detect
>>> detect("War doesn't show who's right, just who's left.")
'en'
>>> detect("Ein, zwei, drei, vier")
'de'
https://pypi.python.org/pypi/langdetect
P.S.: Don't expect this to always work correctly:
>>> detect("today is a good day")
'so'
>>> detect("today is a good day.")
'so'
>>> detect("la vita e bella!")
'it'
>>> detect("khoobi? khoshi?")
'so'
>>> detect("wow")
'pl'
>>> detect("what a day")
'en'
>>> detect("yay!")
'so'
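Putting the three steps together, here is a minimal sketch; as the examples above show, detection on single short words is unreliable, so treat this as a rough filter (setting DetectorFactory.seed pins langdetect's otherwise non-deterministic results):

from langdetect import detect, DetectorFactory, LangDetectException

DetectorFactory.seed = 0  # make langdetect's results repeatable

sentence = "~@^.^@~ tic but sometimes world good famous tac Zorooooooooooo"
english_words = []
for token in sentence.split():              # 1. tokenize
    try:
        if detect(token) == 'en':           # 2. detect the language of each word
            english_words.append(token)     # 3. keep only the English ones
    except LangDetectException:             # tokens like "~@^.^@~" have no usable features
        pass
print(english_words)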
package com;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

public class TranscriberDemo {
    public static void main(String[] args) throws Exception {
        // Point Sphinx at the acoustic model, dictionary and language model.
        Configuration configuration = new Configuration();
        configuration.setAcousticModelPath("en-us");
        configuration.setDictionaryPath("Sample Dict File_2.dic");
        configuration.setLanguageModelPath("Sample Language Modeller_2.lm");
        //configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        //configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        //configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/language/en-us.lm.dmp");

        // Transcribe test.wav and print each recognized hypothesis.
        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
        InputStream stream = new FileInputStream(new File("test.wav"));
        recognizer.startRecognition(stream);

        SpeechResult result;
        while ((result = recognizer.getResult()) != null) {
            System.out.format("Hypothesis: %s\n", result.getHypothesis());
        }
        recognizer.stopRecognition();
    }
}