How do I write a simple Ragel tokenizer (no backtracking)?
UPDATE 2
Original question: Can I avoid using Ragel's |**| if I don't need backtracking?
Updated answer: Yes, you can write a simple tokenizer with ()* if you don't need backtracking.
UPDATE 1
I realized that asking about XML tokenizing was a red herring, because what I'm doing is not specific to XML.
END UPDATES
I have a Ragel scanner/tokenizer that simply looks for FooBarEntity elements in files like:
<ABC >
<XYZ >
<FooBarEntity>
<Example >Hello world</Example >
</FooBarEntity>
</XYZ >
<XYZ >
<FooBarEntity>
<Example >sdrastvui</Example >
</FooBarEntity>
</XYZ >
</ABC >
The scanner version:
%%{
machine simple_scanner;
action Emit {
emit data[(ts+14)..(te-15)].pack('c*')
}
foo = '<FooBarEntity>' any+ :>> '</FooBarEntity>';
main := |*
foo => Emit;
any;
*|;
}%%
The non-scanner version (i.e. it uses ()* instead of |**|):
%%{
machine simple_tokenizer;
action MyTs {
my_ts = p
}
action MyTe {
my_te = p
}
action Emit {
emit data[my_ts...my_te].pack('c*')
my_ts = nil
my_te = nil
}
foo = '<FooBarEntity>' any+ >MyTs :>> '</FooBarEntity>' >MyTe %Emit;
main := ( foo | any+ )*;
}%%
I figured this out and wrote tests for it at https://github.com/seamusabshere/ruby_ragel_examples
You can see the reading/buffering code at https://github.com/seamusabshere/ruby_ragel_examples/blob/master/lib/simple_scanner.rl and https://github.com/seamusabshere/ruby_ragel_examples/blob/master/lib/simple_tokenizer.rl
You don't have to use a scanner to parse XML. I've implemented a simple XML parser in Ragel, without a scanner. Here is a blog post with some timings and more info.
Edit: You can do it many ways. You could use a scanner. Or you could parse for words and, when you see STARTANIMAL, start collecting words until you see STOPANIMAL.
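To make the word-collecting idea concrete, here is a minimal sketch in plain Ruby rather than Ragel. The method name collect_animals is made up for illustration, and it assumes the input is whitespace-separated words with STARTANIMAL and STOPANIMAL as literal markers.

```ruby
# Hypothetical helper (not from the original answer): returns one array of
# words for every STARTANIMAL ... STOPANIMAL run found in the text.
def collect_animals(text)
  groups = []
  current = nil # nil means "not currently between the two markers"
  text.split.each do |word|
    case word
    when 'STARTANIMAL'
      current = []                 # start collecting
    when 'STOPANIMAL'
      groups << current if current # finish the run, if one was open
      current = nil
    else
      current << word if current   # ignore words outside a run
    end
  end
  groups
end

p collect_animals('x STARTANIMAL cat dog STOPANIMAL y STARTANIMAL emu STOPANIMAL')
# prints [["cat", "dog"], ["emu"]]
```

The only state is whether a run is open, which is the same thing the ()* machine tracks with its entering and leaving actions.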
Rephrasing Occam: do not use a scanner unless you actually need one. Without a scanner you can process one symbol at a time, possibly reading it straight from the stream with no buffer.
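As a rough illustration of processing one symbol at a time, the sketch below hand-rolls the matching in plain Ruby and reads characters straight from an IO. It is not the Ragel machine from the question: the names each_entity and advance are hypothetical, it relies on the assumption that '<' occurs only at the start of each tag (so a failed partial match can restart at 0 or 1 without re-reading input), and unlike any+ it also accepts an empty body.

```ruby
require 'stringio'

OPEN_TAG  = '<FooBarEntity>'
CLOSE_TAG = '</FooBarEntity>'

# Hypothetical helper: advance a partial match of `tag` by one character and
# return the new match length. Assumes tag[0] appears nowhere else in the tag.
def advance(tag, matched, ch)
  return matched + 1 if tag[matched] == ch
  ch == tag[0] ? 1 : 0
end

# Yields the text between each OPEN_TAG / CLOSE_TAG pair, reading one
# character at a time and never looking back at earlier input.
def each_entity(io)
  inside = false
  matched = 0
  content = +''
  io.each_char do |ch|
    if !inside
      matched = advance(OPEN_TAG, matched, ch)
      if matched == OPEN_TAG.size
        inside = true
        matched = 0
        content = +''
      end
    else
      previous = matched
      matched = advance(CLOSE_TAG, matched, ch)
      if matched == CLOSE_TAG.size
        yield content
        inside = false
        matched = 0
      elsif matched <= previous
        # The candidate closing tag failed, so its prefix was ordinary content.
        content << CLOSE_TAG[0, previous]
        content << ch if matched.zero? # a fresh '<' stays held as a new candidate
      end
    end
  end
end

io = StringIO.new('<XYZ><FooBarEntity><Example>Hello world</Example></FooBarEntity></XYZ>')
each_entity(io) { |body| puts body }
# prints <Example>Hello world</Example>
```

The input is never buffered. The only memory is the match length and the body being accumulated for output, which is what lets this style work on a stream.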