Strange behavior with tagsoup and Groovy's XmlSlurper
Let's say I want to parse the phone number from an XML string like this:
str = """ <root>
<address>123 New York, NY 10019
<div class="phone"> (212) 212-0001</div>
</address>
</root>
"""
parser = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parseText (str)
println parser.address.div.text()
It doesn't print the phone number.
If I change the "div" element to "foo" like this:
str = """ <root>
<address>123 New York, NY 10019
<foo class="phone"> (212) 212-0001</foo>
</address>
</root>
"""
parser = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parseText (str)
println parser.address.foo.text()
Then it is able to parse and print the phone number.
What the heck is going on?
By the way, I am using Groovy 1.7.5 and TagSoup 1.2.
Just change the code to:
println parser.address.'div'.text()
This is the curse of Groovy and many other dynamic languages: "div" is a reserved method name, so you don't get the node but instead try to divide the "address" node :)
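For reference, here is a minimal sketch of the quoted property syntax on its own. It assumes well-formed XML and the plain XmlSlurper (no TagSoup on the classpath), so it only illustrates the syntax and does not reproduce the TagSoup behavior from the question; the variable names are made up for illustration.

```groovy
// Assumption: plain XmlSlurper on well-formed XML, no TagSoup involved.
// Newer Groovy versions may need: import groovy.xml.XmlSlurper
def xml = '<root><address>123 New York, NY 10019<div class="phone"> (212) 212-0001</div></address></root>'
def doc = new XmlSlurper().parseText(xml)

// Quoting the element name makes it a plain string property lookup on the GPath result.
def phone = doc.address.'div'.text().trim()
println phone
```

The same quoting works for any element name that is awkward to write as a bare identifier, for example names containing hyphens.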
I seem to recall that TagSoup normalizes HTML tags, i.e. it uppercases them. So the GPath expression you want is probably:
println parser.ADDRESS.DIV.text()
I find it handy to be able to print out the result of the parse, because then you can see why your GPath isn't working. Use this:
println groovy.xml.XmlUtil.serialize(parser)
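As a rough sketch of that debugging step, the snippet below serializes a parsed document back to text so the actual element names and nesting can be inspected. It assumes the plain XmlSlurper on well-formed XML rather than TagSoup, so the tree it prints is not what TagSoup would produce for the same input; swap in the TagSoup-backed parser to see the real structure.

```groovy
// Assumption: plain XmlSlurper on well-formed XML, no TagSoup involved.
// Newer Groovy versions may need: import groovy.xml.XmlSlurper
def xml = '<root><address>123 New York, NY 10019<div class="phone"> (212) 212-0001</div></address></root>'
def doc = new XmlSlurper().parseText(xml)

// XmlUtil.serialize turns the parsed GPathResult back into an XML string.
def dump = groovy.xml.XmlUtil.serialize(doc)
println dump
```

Comparing the printed element names and nesting against the GPath expression usually shows which step of the path fails to match.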
I know this question is very old, but I ran into it recently and this is what I used:
parser.'**'.findAll { it.name() == 'div' && it.@class.text() == 'phone' }.each { div ->
println div.text()
}
- Use depthFirst (the '**' shorthand) to find all tags
- Filter for elements named div that have the class phone
- Print the value (212) 212-0001
The Groovy version is 2.4.
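The same steps can also be written with the explicit depthFirst() method and collected into a list instead of printed one by one. This is only a sketch: it assumes the plain XmlSlurper on well-formed XML (not TagSoup), and the names doc and phones are illustrative.

```groovy
// Assumption: plain XmlSlurper on well-formed XML, no TagSoup involved.
// Newer Groovy versions may need: import groovy.xml.XmlSlurper
def xml = '''<root>
<address>123 New York, NY 10019
<div class="phone"> (212) 212-0001</div>
</address>
</root>'''
def doc = new XmlSlurper().parseText(xml)

// depthFirst() is the method form of the '**' shorthand: it walks every node in the tree.
def phones = doc.depthFirst()
        .findAll { it.name() == 'div' && it.@class.text() == 'phone' }  // keep only <div class="phone">
        .collect { it.text().trim() }                                   // extract the trimmed text
println phones
```

Because the search does not depend on where the div sits in the tree, it keeps working even if the parser rearranges the nesting.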