Using Markov models to convert all-caps to mixed-case and related problems
I've been thinking about using Markov techniques to restore missing information to natural language text.
- Restore all-caps text to mixed-case.
- Restore accents / diacritics to languages which should have them but have been converted to plain ASCII.
- Convert rough phonetic transcriptions back into native alphabets.
That seems to be in order of least difficult to most difficult. Basically the problem is resolving ambiguities based on context.
I can use Wiktionary as a dictionary and Wikipedia as a corpus using n-grams and Hidden Markov Models to resolve the ambiguities.
Am I on the right track? Are there already some services, libraries, or tools for this sort of thing?
Examples
- GEORGE LOST HIS SIM CARD IN THE BUSH ⇨ George lost his SIM card in the bush
- tantot il rit a gorge deployee ⇨ 开发者_StackOverflow社区tantôt il rit à gorge déployée
I think you can use Markov models (HMMs) for all three tasks, but also take a look at more modern models such as conditional random fields (CRFs). Also, here's some boost for your google-fu:
- Restore mixed case to text in all caps
This is called truecasing.
- Restore accents / diacritics to languages which should have them but have been converted to plain ASCII
I suspect Markov models are going to have a hard time on this. OTOH, labelled training data is free since you can just take a bunch of accented text in the target language and strip the accents. See also next answer.
- Convert rough phonetic transcriptions back into native alphabets
This seems strongly related to machine transliteration, which has been tried using pair HMMs (from bioinformatics/genome work).
I'll take a crack at fleshing out how you would accomplish these.
Capitalisation
This is fairly close to Named Entity Recognition and is an example of a 'sequence tagging problem'. Proper nouns should be initially capitalised, organisation names that are acronyms should be all capitalised, and then there are other examples that fall outside those categories. It seems to me that it would therefore be harder than NER and so a straightforward dictionary based approach probably wouldn't be an ideal solution.
If you were to use an Hidden Markov Model, this would amount to letting the 'hidden' state of an HMM be [lowerCase, initCaps, allCaps] and training on some data that you assume is correct (e.g. Wikipedia but there are many other sources too). You then infer the hidden state for words that you aren't sure are correctly capitalised. There are a bunch of HMM libraries out there, I'm sure you can find one to suit your needs. I'd say trying an HMM is a good initial choice.
Non ASCII characters
As you guessed, a tougher problem. If you tried to do this with an HMM at the word level, you would have an enormous number of hidden states, one for each accented word, which would probably be impossible to train. The problem is more tractable at the character level but you lose a tremendous amount of context if you only consider the previous character. If you start using n-grams instead of characters, your scaling problems come back. In short, I don't think this problem is like the previous one because the number of labels is too large to consider it a sequence labelling problem (I mean you can, it's just not practical).
I haven't heard of research in this area, then again I'm no expert. My best guess would be to use a general language model for the language you are interested in. You could use this to give you a probability of a sentence in the language. Then you could try replacing possibly accented characters to give the probabilities of those sentences and take the most likely, or use some threshold on the difference, or something like that. You could train an n-gram language model fairly easily on a large corpus of a certain language.
I have no idea if this would actually work, either in terms of accuracy or efficiency. I don't have direct experience of this particular problem.
Transliteration
No idea, to be honest. I don't know where you would find data to make a system of your own. After a brief search, I found the Google Transliteration service (with API). Perhaps it does what you're after. I don't even have enough experience in languages with other scripts to really know what it's doing.
精彩评论