Parse and charset: why my script doesn't work
I want to extract attribute1 and attribute3 values only. I don't understand why charset doesn't seem to work in my case to "skip" any other attributes (attribute3 is not extracted as I would like):
content: {<tag attribute1="valueattribute1" attribute2="valueattribute2" attribute3="valueattribute3">
</tag>
<tag attribute2="valueattribute21" attribute1="valueattribute11" >
</tag>
}
attribute1: [{attribute1="} copy valueattribute1 to {"} thru {"}]
attribute3: [{attribute3="} copy valueattribute3 to {"} thru {"}]
spacer: charset reduce [tab newline #" "]
letter: complement spacer
to-space: [some letter | end]
attributes-rule: [(valueattribute1: none valueattribute3: none) [attribute1 | none] any letter [attribute3 | none] (print valueattribute1 print valueattribute3)
| [attribute3 | none] any letter [attribute1 | none] (print valueattribute3 print valueattribute1
valueattribute1: none valueattribute3: none
)
| none
]
rule: [any [to {<tag } thru {<tag } attributes-rule {>} to {</tag>} thru {</tag>}] to end]
parse content rule
output is
>> parse content rule
valueattribute1
none
== true
>>
Firstly, you're not using parse/all. In Rebol 2 that means whitespace has effectively been stripped out before the parse runs. That's not true in Rebol 3: if your parse rules are in block format (as they are here), then /all is implied.
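To make the whitespace point concrete, here is a minimal sketch, assuming Rebol 2 (the /all refinement does not exist in Rebol 3). The input string is purely illustrative.

```rebol
REBOL []

; Rebol 2, no /all: whitespace between tokens is skipped, so this succeeds.
print parse "hello world" ["hello" "world"]

; With /all every character counts, so the rule fails at the space...
print parse/all "hello world" ["hello" "world"]

; ...unless the space is matched explicitly.
print parse/all "hello world" ["hello" " " "world"]
```

In Rebol 2 the three calls should report true, false and true respectively.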
(Note: There seemed to be consensus that Rebol 3 would throw out the non-block form of parse rules, in favor of the split function for those "minimal" parse scenarios. That would get rid of /all entirely. No action has yet been taken on this, unfortunately.)
Secondly your code has bugs, which I'm not going to spend time sorting out. (That's mostly because I think using Rebol's parse to process XML/HTML is a fairly silly idea :P)
But don't forget you have an important tool. If you use a set-word in the parse rule, it will capture the parse position into a variable. You can then print it out and see where you are. Change the part of attributes-rule where you first say any letter to pos: (print pos) any letter and you'll see this:
>> parse/all content rule
attribute2="valueattribute2" attribute3="valueattribute3">
</tag>
<tag attribute2="valueattribute21" attribute1="valueattribute11" >
</tag>
valueattribute1
none
== true
See the leading space? Your rules right before the any letter put you at a space... and since any letter also accepts zero letters, it matches nothing there, and everything after it is thrown off.
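The same set-word trick can be tried in isolation. This is a small sketch assuming Rebol 2; the input string is made up for illustration, and pos is just an ordinary word.

```rebol
REBOL []

; pos: is a set-word inside the rule; it stores the current input position.
parse/all {attribute1="v1" attribute2="v2"} [
    thru {attribute1="} to {"} skip
    pos: (print mold pos)   ; mold makes the leading space visible
    to end
]
```

Using mold rather than a plain print makes it easier to spot that the remaining input starts with a space.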
(Note: Rebol 3 has an even better debugging tool: the word ??. When you put it in the parse block, it tells you what token/rule you're currently processing as well as the state of the input. With this tool you can more easily find out what's going on:
>> parse "hello world" ["hello" ?? space ?? "world"]
space: " world"
"world": "world"
== true
...though it's really buggy on r3 mac intel right now.)
Additionally, if you're not using copy, then your pattern of to X thru X is unnecessary; you can achieve that with just thru X. If you want to do a copy, you can also do that with the briefer copy Y to X X, or if it's just a single symbol you could write the clearer copy Y to X skip.
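As a quick comparison of the two spellings, here is a sketch assuming Rebol 2. The words sample, a and b are illustrative names, not taken from the question.

```rebol
REBOL []

sample: {name="foo" other="bar"}

; Long form, as in the question: to X thru X around the copy.
parse/all sample [to {name="} thru {name="} copy a to {"} thru {"} to end]

; Shorter form: thru X, then copy ... to X skip (the delimiter is one character).
parse/all sample [thru {name="} copy b to {"} skip to end]

print [a b]
```

Both rules should leave the same captured string in a and b.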
In places where you see yourself writing repetitive code, remember that Rebol can go a step above by using compose etc:
>> temp: [thru (rejoin [{attribute} num {=}])
copy (to-word rejoin [{valueattribute} num]) to {"} thru {"}]
>> num: 1
>> attribute1: compose temp
== [thru "attribute1=" copy valueattribute1 to {"} thru {"}]
>> num: 2
>> attribute2: compose temp
== [thru "attribute2=" copy valueattribute2 to {"} thru {"}]
Short answer: [any letter] eats your attribute3="..." because the #"^"" character is, by your definition, a 'letter. Additionally, you may have problems where there is no attribute2: your generic second-attribute rule will then eat attribute3, and your attribute3 rule will have nothing left to match. It is better to be explicit that there is either an optional attribute2 or an optional anything-but-attribute3:
attribute1="foo" attribute2="bar" attribute3="foobar"
<- attribute1="..." -> <- any letter -> <- attribute3="..." ->
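The membership claim is easy to check directly. The sketch below assumes Rebol 2 and reuses the spacer and letter definitions from the question; the input string is made up.

```rebol
REBOL []

spacer: charset reduce [tab newline #" "]
letter: complement spacer

; Both lookups succeed: the quote and "=" are members of letter.
print find letter #"^""
print find letter #"="

; So any letter runs through a whole attribute, quotes included,
; and stops only at the next whitespace character.
parse/all {attribute2="v2" attribute3="v3"} [any letter pos: to end]
print mold pos
```

Whatever follows any letter therefore starts at whitespace, or at the end of the input if there is no whitespace left.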
Also, 'parse without the /all refinement ignores spaces (or at least is very unwieldy where spaces are concerned) - /all is highly recommended for this type of parsing.
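One way to avoid depending on attribute order is to copy each tag's attribute text first and then search that text once per attribute. The following is only a sketch under stated assumptions: Rebol 2, attributes separated by a single space, double-quoted values, and a hypothetical helper named get-attribute (not part of Rebol, and not the asker's approach).

```rebol
REBOL []

content: {<tag attribute1="valueattribute1" attribute2="valueattribute2" attribute3="valueattribute3">
</tag>
<tag attribute2="valueattribute21" attribute1="valueattribute11" >
</tag>
}

; get-attribute is a hypothetical helper, not part of Rebol.
; It returns the value of attribute NAME found in ATTRS, or none.
get-attribute: func [attrs [string! none!] name [string!] /local value] [
    value: none
    if attrs [
        ; compose inserts the search string, e.g. { attribute1="}
        parse/all attrs compose [
            thru (rejoin [" " name {="}]) copy value to {"} to end
        ]
    ]
    value
]

parse/all content [
    any [
        ; attrs receives everything between "<tag" and the closing ">"
        thru {<tag} copy attrs to {>} skip
        (
            print ["attribute1:" mold get-attribute attrs "attribute1"]
            print ["attribute3:" mold get-attribute attrs "attribute3"]
        )
    ]
    to end
]
```

Because each attribute is looked up independently, a missing attribute3 simply yields none instead of derailing the rest of the rule. Attributes separated by tabs or newlines would need a broader separator than the single space assumed here.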
Adding parse/all didn't seem to change anything. Finally, this seems to work (using a set-word has indeed been a great help for debugging!). What do you think?
content: {<tag attribute1="valueattribute1" attribute2="valueattribute2" attribute3="valueattribute3">
</tag>
<tag attribute2="valueattribute21" attribute1="valueattribute11" >
</tag>
}
attribute1: [to {attribute1="} thru {attribute1="} copy valueattribute1 to {"} thru {"}]
attribute3: [to {attribute3="} thru {attribute3="} copy valueattribute3 to {"} thru {"}]
letter: charset reduce ["ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890="]
attributes-rule: [(valueattribute1: none valueattribute3: none)
[attribute1 | none] any letter pos:
[attribute3 | none] (print valueattribute1 print valueattribute3)
| [attribute3 | none] any letter [attribute1 | none] (print valueattribute3 print valueattribute1
valueattribute1: none valueattribute3: none
)
| none
]
rule: [any [to {<tag } thru {<tag } attributes-rule {>} to {</tag>} thru {</tag>}] to end]
parse/all content rule
which outputs:
>> parse/all content rule
valueattribute1
valueattribute3
valueattribute11
none
== true
>>