Simple text parser
I want to create a very simple parser to convert:
"I wan't this to be ready by 10:15 p.m. today Mr. Gönzalés.!" to:
( 'I', ' ', 'wan', '\'', 't', ' ', 'this', ' ', 'to', ' ', 'be', ' ', 'ready', ' ', 'by', ' ', '10', ':', '15', ' ', 'p', '.', 'm', '.', ' ', 'today', ' ', 'Mr', '.', ' ', 'Gönzalés', '.', '!' )
So basically I want consecutive letters and numbers to be grouped into a single string. I'm using Python 3 and I don't want to install external libs. I also would like the solution to be as efficient as possible as I will be processing a book.
So what approaches would you recommend for solving this problem? Any examples?
The only way I can think of right now is to step through the text, character by character, in a for loop. But I'm guessing there's a better, more elegant approach.
Thanks,
Barry
You are looking for a procedure called tokenization. That means splitting raw text into discrete "tokens", in our case just words. For programming languages this is fairly easy, but unfortunately it is not so for natural language.
You need to do two things: split the text into sentences, and split the sentences into words. Usually we do this with regular expressions. Naïvely you could split sentences on the pattern ". ", i.e. a period followed by a space, and then split each sentence into words on spaces. This won't work very well, however, because abbreviations often end in periods too. As it turns out, tokenization and sentence segmentation are fairly tricky to get right. You could experiment with several regexps, but it would be better to use a ready-made tokenizer. I know you didn't want to install any external libs, but I'm sure this will spare you pain later on. NLTK has good tokenizers.
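If the exact output from the question is all you need, the standard-library re module may be enough. The sketch below is only an illustration under a few assumptions: a "word" is any run of Unicode letters and digits, everything else becomes a one-character token, and the names TOKEN_RE and tokenize are made up for this example. It also shows how the naive ". " sentence split goes wrong on the sample sentence.

```python
import re
import unicodedata

# Either a run of letters/digits, or one single other character
# (space, punctuation, underscore). [^\W_] means "\w without the underscore".
TOKEN_RE = re.compile(r"[^\W_]+|[\W_]")


def tokenize(text):
    """Split text into alphanumeric runs and single separator characters."""
    # NFC composes accented letters such as 'ö' into one code point,
    # so a combining accent does not split a word in two.
    return TOKEN_RE.findall(unicodedata.normalize("NFC", text))


sample = "I wan't this to be ready by 10:15 p.m. today Mr. Gönzalés.!"

# Naive sentence segmentation: cuts after "p.m" and after "Mr".
naive_sentences = sample.split(". ")

tokens = tokenize(sample)
print(naive_sentences)
print(tokens)
```

The naive split returns three "sentences" because it breaks at the abbreviations, which is the problem described above. The token list keeps every character, so joining the tokens gives back the (normalized) input. Note that this groups mixed runs such as "abc123" into one token; whether that is wanted depends on your text.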
I believe this is a solution:
import regex  # third-party package (pip install regex), not the built-in re

text = "123 2 can't, 4 Å, é, and 中ABC _ sh_t"
print(regex.findall(r'\d+|\P{alpha}|\p{alpha}+', text))
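Since regex has to be installed separately, here is a rough standard-library counterpart as a sketch. It is an approximation, not an exact equivalent: the built-in re module has no \p{...} classes, so I am assuming that [^\W\d_] (a word character that is neither a digit nor an underscore) is close enough to "letter" for this purpose. The name PATTERN is just for illustration.

```python
import re

# Three alternatives, tried in order:
#   \d+        a run of decimal digits
#   [^\W\d_]+  a run of word characters that are not digits or underscores
#   [\W_]      any single remaining character (space, punctuation, underscore)
PATTERN = re.compile(r"\d+|[^\W\d_]+|[\W_]")

text = "123 2 can't, 4 Å, é, and 中ABC _ sh_t"
tokens = PATTERN.findall(text)
print(tokens)
```

Every character falls into exactly one of the three alternatives, so nothing is dropped and joining the tokens reproduces the input. The edge cases differ slightly from \p{alpha}: for example, characters like superscript digits count as word characters but not as \d, so they end up in the "letter" runs here.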
Can it be improved?
Thanks!