Training Hidden Markov Models without Tagged Corpus Data
For a linguistics course we implemented Part of Sp开发者_高级运维eech (POS) tagging using a hidden markov model, where the hidden variables were the parts of speech. We trained the system on some tagged data, and then tested it and compared our results with the gold data.
Would it have been possible to train the HMM without the tagged training set?
In theory you can do that. In that case you would use the Baum-Welch-Algorithm. It is described very well in Rabiner's HMM Tutorial.
However, having applied HMMs to part of speech, the error you get with the standard form will not be so satisfying. It is a form of expectation maximization which only converges to local maxima. Rule based approaches beat HMMs hands down, iirc.
I believe the natural language toolkit NLTK for python has an HMM implementation for that exact purpose.
NLP was a couple years ago, but I believe without tagging the HMM could help determine the symbol emission/state transition probabilities of n-grams (i.e. what are the odds of "world" occurring after "hello"), but not parts-of-speech. It needs the tagged corpus to learn how the POS interrelate.
If I'm way off on this let me know in the comments!
精彩评论