How can I get XML sub-tags by matching their contents against a regexp, without knowing their names?
I have XML which looks like this when simplified:
node_set = Nokogiri::XML('
<PARENT>
<SOME_TAG>12:12:1222</SOME_TAG>
<HOLY_TAG>12:12:1222</HOLY_TAG>
<MAJOR_TAG>12:12:1222</MAJOR_TAG>
<FOO_FOO>12:12:1222</FOO_FOO>
</PARENT>'
)
All I know is how to write a regexp for the contents, such as:
(\d+):(\d+):(\d+)
I read some articles on regexp matching on the official site, but found no answer on how to do this, only the mechanism for calling user-defined functions from the xpath method.
How can I get all these tags by regexp, without knowing their names?
Nokogiri does not support the XPath 2.0 matches() function, so you'll need to use Ruby to perform the regex match:
hits = node_set.xpath("//text()").grep(/\d+:\d+:\d+/).map(&:parent)
p hits.map(&:name)
#=> ["SOME_TAG", "HOLY_TAG", "MAJOR_TAG", "FOO_FOO"]
Step by step:
- Find all text nodes throughout the document.
- Reduce the list to only those that match the regex desired.
- Map the list to the parent elements of each text node.
The Enumerable#grep method is shorthand for .select{ |text| regex === text }.
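The equivalence between grep and select can be checked without Nokogiri. The sketch below uses a hypothetical FakeText struct (not part of Nokogiri) as a stand-in for a text node; it assumes only that the object converts to a string through to_str, which is what Regexp#=== relies on for non-String arguments.

```ruby
# Hypothetical stand-in for a text node: any object that responds to #to_str
# can be tested by Regexp#===, which is what Enumerable#grep calls.
FakeText = Struct.new(:content, :parent) do
  def to_str
    content
  end
end

nodes = [
  FakeText.new('12:12:1222', 'SOME_TAG'),
  FakeText.new('not a match', 'OTHER_TAG'),
  FakeText.new('1:2:3', 'FOO_FOO')
]
regex = /\d+:\d+:\d+/

# Both forms filter by regex === node, then map to the "parent" name.
via_grep   = nodes.grep(regex).map(&:parent)
via_select = nodes.select { |text| regex === text }.map(&:parent)
```

Both via_grep and via_select should hold the same two parent names, in document order.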
Alternatively, note that you can define your own custom XPath functions in Nokogiri that call back to Ruby, so you could pretend to be using the XPath 2.0 matches() function:
module FindWithRegex
  def self.matches(nodes, pattern, flags=nil)
    nodes.grep(Regexp.new(pattern, flags))
  end
end
hits = node_set.xpath('//*[matches(text(),"\d+:\d+:\d+")]',FindWithRegex)
p hits.map(&:name)
#=> ["SOME_TAG", "HOLY_TAG", "MAJOR_TAG", "FOO_FOO"]
However, because the function is called again for each candidate node (and so builds a new regexp from the string each time), it is not nearly as efficient:
require 'benchmark'
Benchmark.bm(15) do |x|
  N = 10000
  x.report('grep and map'){ N.times{
    node_set.xpath("//text()").grep(/\d+:\d+:\d+/).map(&:parent)
  }}
  x.report('custom function'){ N.times{
    node_set.xpath('//*[matches(text(),"\d+:\d+:\d+")]',FindWithRegex)
  }}
end
#=> user system total real
#=> grep and map 0.437000 0.016000 0.453000 ( 0.442044)
#=> custom function 1.653000 0.031000 1.684000 ( 1.694170)
You can speed it up by caching the compiled regex:
module FindWithRegex
  # Compiled regexps, keyed by the pattern string
  REs = {}
  def self.matches(nodes, pattern, flags=nil)
    nodes.grep(REs[pattern] ||= Regexp.new(pattern, flags))
  end
end
#=> user system total real
#=> grep and map 0.437000 0.016000 0.453000 ( 0.442044)
#=> cached regex 0.905000 0.000000 0.905000 ( 0.896090)
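One thing to watch with a cache keyed on the pattern string alone is that a later call with the same pattern but different flags would get the first compiled regex back. A minimal sketch of a cache keyed on both values follows; RegexCache and its fetch method are hypothetical names for illustration, not part of Nokogiri.

```ruby
# Hypothetical helper: memoizes compiled regexps by [pattern, flags].
module RegexCache
  # The default block compiles and stores a regexp on the first lookup.
  CACHE = Hash.new do |cache, key|
    pattern, flags = key
    cache[key] = Regexp.new(pattern, flags)
  end

  def self.fetch(pattern, flags = nil)
    CACHE[[pattern, flags]]
  end
end

plain  = RegexCache.fetch('\d+:\d+:\d+')
again  = RegexCache.fetch('\d+:\d+:\d+')           # served from the cache
nocase = RegexCache.fetch('abc', Regexp::IGNORECASE)
cased  = RegexCache.fetch('abc')                   # separate entry, no flags
```

A custom matches function could then call nodes.grep(RegexCache.fetch(pattern, flags)) instead of building the regexp itself.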
Here is a pure XPath 1.0 solution. Although XPath 1.0 has no native regex facility, this can still be achieved using the standard XPath 1.0 functions substring-before(), substring-after(), and translate():
/*/*[not(translate(substring-before(., ':'), '0123456789', ''))
     and
     not(translate(substring-before(substring-after(., ':'), ':'), '0123456789', ''))
     and
     not(translate(substring-after(substring-after(., ':'), ':'), '0123456789', ''))
    ]
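To make the logic of that expression easier to follow, here is a rough plain-Ruby mirror of its three checks. The helper names (substring_before, substring_after, digits_only?, selected?) are hypothetical and only imitate the XPath 1.0 functions; this is a sketch of the reasoning, not a replacement for evaluating the XPath.

```ruby
# XPath 1.0 substring-before(): '' when the separator is absent.
def substring_before(str, sep)
  i = str.index(sep)
  i ? str[0...i] : ''
end

# XPath 1.0 substring-after(): '' when the separator is absent.
def substring_after(str, sep)
  i = str.index(sep)
  i ? str[(i + sep.length)..-1] : ''
end

# translate(s, '0123456789', '') drops every digit; not(...) of a string
# is true only when the remaining string is empty.
def digits_only?(str)
  str.delete('0-9').empty?
end

# Mirrors the predicate: all three colon-separated fields must be digit-only.
def selected?(value)
  rest = substring_after(value, ':')
  [substring_before(value, ':'),
   substring_before(rest, ':'),
   substring_after(rest, ':')].all? { |field| digits_only?(field) }
end
```

Under this reading, selected?('12:12:1222') is true and selected?('12a:12:1222') is false. An empty field also passes each check (for instance when a colon is missing), so the expression is somewhat looser than \d+:\d+:\d+.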
XSLT-based verification:
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="node()|@*">
    <xsl:copy-of select=
    "/*/*[not(translate(substring-before(., ':'), '0123456789', ''))
          and
          not(translate(substring-before(substring-after(., ':'), ':'), '0123456789', ''))
          and
          not(translate(substring-after(substring-after(., ':'), ':'), '0123456789', ''))
         ]
    "/>
  </xsl:template>
</xsl:stylesheet>
This XSLT transformation simply evaluates the expression above and outputs the selected nodes. When applied to this XML document (the provided one, with "invalid" elements added):
<PARENT>
<SOME_TAG>12:12:1222</SOME_TAG>
<SOME_TAG2>12a:12:1222</SOME_TAG2>
<HOLY_TAG>12:12:1222</HOLY_TAG>
<HOLY_TAG2>12:12b:1222</HOLY_TAG2>
<MAJOR_TAG>12:12:1222</MAJOR_TAG>
<MAJOR_TAG2>12:12:1222c</MAJOR_TAG2>
<FOO_FOO>12:12:1222</FOO_FOO>
</PARENT>
the wanted, correctly selected nodes are output:
<SOME_TAG>12:12:1222</SOME_TAG>
<HOLY_TAG>12:12:1222</HOLY_TAG>
<MAJOR_TAG>12:12:1222</MAJOR_TAG>
<FOO_FOO>12:12:1222</FOO_FOO>