Tokenizing complex input
I'm attempting to tokenize the following input in Python:
text = 'This @example@ is "neither":/defn/neither complete[1] *nor* trite, *though _simple_*.'
I would like to produce something like the following while avoiding the use of regular expressions:
tokens = [
('text', 'This '),
('enter', 'code'),
('text', "example"),
('exit', None),
('text', ' is '),
('enter', 'a'),
('text', "neither"),
('href', "/defn/neither"),
('exit', None),
('text', ' complete'),
('enter', 'footnote'),
('id', 1),
('exit', None),
('text', ' '),
('enter', 'strong'),
('text', 'nor'),
('exit', None),
('text', ' trite, '),
('enter', 'strong'),
('text', 'though '),
('enter', 'em'),
('text', 'simple'),
('exit', None),
('exit', None),
('text', '.')
]
Pretend the above is being produced by a generator. My current implementation works, though the code is somewhat hideous and not easily extended to support links.
Any assistance would be greatly appreciated.
Update: I changed the desired output from a complex nested list structure to a simple stream of tuples; the indentation above is only for us humans. Formatting within the text of a link is OK. I now have a simple parser that generates the lexing result I'm looking for, but it still doesn't handle links or footnotes.
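Separately from that parser, here is a minimal sketch of how the flat stream still carries the nesting: depth goes up on every 'enter' and down on every 'exit', so the indented view can be rebuilt from the tuples alone. The helper name `indent_tokens` and the four-space indent are my own assumptions for illustration, not part of the real implementation.

```python
def indent_tokens(tokens, indent='    '):
    """Yield one display line per token, indented by nesting depth.

    Hypothetical helper: assumes a well-formed stream in which every
    ('enter', name) is eventually matched by an ('exit', None).
    """
    depth = 0
    for kind, value in tokens:
        if kind == 'exit':
            depth -= 1          # an exit lines up with its matching enter
        yield indent * depth + repr((kind, value))
        if kind == 'enter':
            depth += 1          # everything until the exit is one level deeper


if __name__ == '__main__':
    stream = [
        ('enter', 'strong'),
        ('text', 'though '),
        ('enter', 'em'),
        ('text', 'simple'),
        ('exit', None),
        ('exit', None),
    ]
    for line in indent_tokens(stream):
        print(line)
```

Since the helper is itself a generator, it can wrap whatever generator produces the tokens without buffering the whole stream.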
Well, here's a more complete parser with enough extensibility to do whatever I may need in the future. It only took three hours. It's not terribly speedy, but the output of the class of parser I'm writing is generally heavily cached anyway. Even with this tokenizer and parser in place, my full engine still comes in at under 75% of the SLoC of the default python-textile renderer while remaining somewhat faster. All without regular expressions.
Footnote parsing remains to be done, but that's minor compared to link parsing. The output (as of this posting) is:
tokens = [
('text', 'This '),
('enter', 'code'),
('text', 'example'),
('exit', None),
('text', ' is '),
('enter', 'a'),
('text', 'neither'),
('attr', ('href', '/defn/neither')),
('exit', None),
('text', ' complete[1] '),
('enter', 'strong'),
('text', 'nor'),
('exit', None),
('text', ' trite, '),
('enter', 'strong'),
('text', 'though '),
('enter', 'em'),
('text', 'simple'),
('exit', None),
('exit', None),
('text', '.')
]
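For anyone wanting to experiment, below is a minimal sketch of a regex-free, character-scanning generator that produces the stream above for the sample input. It is not the parser described in this post; the names `MARKERS` and `tokenize` are assumptions, and so are the rules that a link is `"text":url` with the URL running to the next whitespace and that a marker closes only the innermost open element. Footnotes are left as plain text, matching the output shown.

```python
# Assumed mapping of single-character markers to element names.
MARKERS = {'@': 'code', '*': 'strong', '_': 'em'}


def tokenize(text):
    """Yield (kind, value) tuples for a small Textile-like inline syntax."""
    open_tags = []   # stack of currently open elements
    buf = []         # pending plain-text characters
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch in MARKERS:
            if buf:
                yield ('text', ''.join(buf))
                buf = []
            tag = MARKERS[ch]
            if open_tags and open_tags[-1] == tag:
                open_tags.pop()              # marker closes the innermost element
                yield ('exit', None)
            else:
                open_tags.append(tag)        # marker opens a new element
                yield ('enter', tag)
            i += 1
        elif ch == '"':
            close = text.find('":', i + 1)   # end of the link text, if any
            if close == -1:
                buf.append(ch)               # a lone quote is ordinary text
                i += 1
                continue
            if buf:
                yield ('text', ''.join(buf))
                buf = []
            end = close + 2
            while end < n and not text[end].isspace():
                end += 1                     # URL runs to the next whitespace
            yield ('enter', 'a')
            # Formatting inside the link text is allowed, so recurse.
            yield from tokenize(text[i + 1:close])
            yield ('attr', ('href', text[close + 2:end]))
            yield ('exit', None)
            i = end
        else:
            buf.append(ch)
            i += 1
    if buf:
        yield ('text', ''.join(buf))
```

Calling `list(tokenize(text))` on the sample string from the question should give the same tuples as listed above. The sketch makes no attempt to handle markers inside ordinary words (such as `snake_case`) or to close elements left open at the end of the input, so treat it as a starting point only.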