Simple text parser

2023-03-30 19:47

I want to create a very simple parser to convert:

"I wan't this to be ready by 10:15 p.m. today Mr. Gönzalés.!" to:

(
  'I',
  ' ', 
  'wan',
  '\'', 
  't', 
  ' ',  
  'this', 
  ' ',  
  'to',
  ' ', 
  'be',
  ' ', 
  'ready',
  ' ', 
  'by',
  ' ', 
  '10', 
  ':', 
  '15',
  ' ', 
  'p',
  '.',
  'm',
  '.',
  ' ', 
  'today',
  ' ',
  'Mr',
  '.',
  ' ',
  'Gönzalés',
  '.',
  '!'
)

So basically I want consecutive letters and numbers to be grouped into a single string. I'm using Python 3 and I don't want to install external libs. I also would like the solution to be as efficient as possible as I will be processing a book.

What approaches would you recommend for solving this problem? Any examples?

The only way I can think of now is to step through the text, character by character, in a for loop. But I'm guessing there's a better, more elegant approach.

Thanks,

Barry

You are looking for a procedure called tokenization. That means splitting raw text into discrete "tokens", in our case just words. For programming languages this is fairly easy, but unfortunately it is not so for natural language.
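
As a rough illustration of what tokenization means for the sentence in the question, here is a minimal sketch using only the standard library. It assumes that "letters and numbers" means whatever `str.isalnum` accepts; the function name `tokenize` is made up for this example. It groups consecutive alphanumeric characters and emits every other character on its own:

```python
from itertools import groupby

def tokenize(text):
    """Yield runs of letters/digits as one token, every other character alone."""
    # groupby splits the text into consecutive runs that share the same
    # str.isalnum() result (True for letters and digits, False otherwise).
    for is_alnum, run in groupby(text, key=str.isalnum):
        if is_alnum:
            yield "".join(run)   # e.g. 'ready', '10', 'Mr'
        else:
            yield from run       # each space or punctuation mark separately

sentence = "I wan't this to be ready by 10:15 p.m. today Mr. Gönzalés.!"
tokens = tuple(tokenize(sentence))
print(tokens)
```

Joining the tokens back together gives the original string, so nothing is lost. Accented letters stay inside their word as long as they are stored as single precomposed characters.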

You need to do two things: split the text into sentences, and split the sentences into words. Usually this is done with regular expressions. Naively, you could split sentences on the pattern ". ", i.e. a period followed by a space, and then split each sentence into words on spaces. This won't work very well, however, because abbreviations often end in periods too. As it turns out, tokenization and sentence segmentation are fairly tricky to get right. You could experiment with several regexps, but it would be better to use a ready-made tokenizer. I know you didn't want to install any external libs, but I'm sure this will spare you pain later on. NLTK has good tokenizers.
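
To make the abbreviation problem concrete, here is a small sketch of the naive sentence split described above. The sample text is invented for illustration; it contains two sentences, but splitting on a period followed by whitespace produces four pieces:

```python
import re

# Made-up sample: two real sentences, with the abbreviations "p.m." and "Mr."
text = "I want this to be ready by 10:15 p.m. today Mr. Smith. Call me when it is done."

# Naive segmentation: treat every period followed by whitespace as a boundary.
naive_sentences = re.split(r"\.\s+", text)
print(naive_sentences)
# ['I want this to be ready by 10:15 p.m', 'today Mr', 'Smith', 'Call me when it is done.']
```

The split cuts after "p.m." and "Mr." as if they ended a sentence, which is exactly why a plain pattern is not enough for sentence segmentation.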

I believe this is a solution:

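# Note: "regex" is the third-party module from PyPI, not the built-in "re".
import regex

text = "123 2 can't, 4 Å, é, and 中ABC _ sh_t"
# Match a run of digits, a single non-alphabetic character, or a run of letters.
print(regex.findall(r'\d+|\P{alpha}|\p{alpha}+', text))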

Can it be improved?

Thanks!
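
Since the question asks for no external libs, one possible variation is to express roughly the same idea with the built-in `re` module. This is a sketch, not an exact equivalent of the `regex` pattern: it relies on `[^\W\d_]` ("a word character that is neither a digit nor an underscore") as a stand-in for a letter, which may differ on unusual Unicode characters. The names `TOKEN_RE` and `tokenize_lines` are invented for this example.

```python
import re

# A run of digits, a run of letters, or any single other character.
# Compiled once so it can be reused across a whole book.
TOKEN_RE = re.compile(r"\d+|[^\W\d_]+|[\W_]")

def tokenize_lines(lines):
    """Lazily yield tokens from an iterable of lines (e.g. an open file)."""
    for line in lines:
        yield from TOKEN_RE.findall(line)

text = "123 2 can't, 4 Å, é, and 中ABC _ sh_t"
print(TOKEN_RE.findall(text))
```

For a book, passing an open file object (for example `open("book.txt", encoding="utf-8")`, where the file name is hypothetical) to `tokenize_lines` processes the text line by line instead of loading all of it at once. Note that this pattern splits "abc123" into 'abc' and '123'; if letters and digits should stay together, `[^\W_]+|[\W_]` is a simpler alternative.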
