Retrieving text between tags

I need to create a regular expression to capture everything contained between two tags that are either <block color="crimson"> or <block color="green">, and there can be multiple lines between these tags. For example:

<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
<block color="blue">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...      
<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...

Each block tag marks the beginning of a new block. I have tried the following regular expression, but I am a bit lost on how to specify that anything can go between those parentheses, including multiple lines, and also how to specify that it needs to stop capturing once it reaches another block tag:

<block color="crimson">(\w+)|<block color="green">(\w+)

Whoops, I forgot to add that I am not interested in blocks that appear as:

<block color="purple">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...


I wouldn't suggest using a regular expression for this. First see if you can make the content valid HTML by adding closing tags. Then use something like Nokogiri. Here's a tutorial:

http://nokogiri.org/tutorials/parsing_an_html_xml_document.html

Even if you can't clean up the HTML, I'd give Nokogiri a shot. It has worked with some pretty broken HTML for me before.

Good luck!


Using regex for parsing HTML is asking for trouble except for the most trivial, controlled circumstances. A parser is more robust and, in the long run, usually a lot easier to maintain.

The HTML is invalid because the <block> tags are not terminated. That makes Nokogiri's parse ambiguous, but we can play a minor trick on it to fix things up and then parse it correctly:

html = <<EOT
<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
<block color="blue">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...      
<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
EOT

require 'nokogiri'
require 'pp'

doc = Nokogiri::HTML(html.gsub('<block', '</block><block'))
pp doc.search('block').map { |n| n.text }

>> ["\n        This is the text I need and\n        it may also  have other \n        tags in it, and all sorts of \n        things...\n",
>>  "\n        This is the text I need and\n        it may also  have other \n        tags in it, and all sorts of \n        things...\n",
>>  "\n        This is the text I need and\n        it may also  have other \n        tags in it, and all sorts of \n        things...      \n",
>>  "\n        This is the text I need and\n        it may also  have other \n        tags in it, and all sorts of \n        things...\n"]

By doing a search and replace, a closing </block> can be inserted in front of every <block> tag. That leaves a stray closing tag in front of the first block, but all the rest are close enough that Nokogiri's fix-up of the HTML will be sensible. Here's what the HTML looks like after the fix-up:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
</block><block color="blue">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
</block><block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...      
</block><block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
</block>
</body></html>

At that point Nokogiri can make sense of the document and search for the individual blocks. I'm using a CSS accessor, so if you need better granularity you can fine-tune the CSS, or switch to XPath instead.


str = %q(<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
<block color="blue">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...      
<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...)

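# Split on every opening block tag; the text after each tag becomes one element.
ar = str.split(/<block color="\w+">\n/)
ar.shift # discard the empty string that precedes the first tag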
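
Note that splitting discards the color of each block, so unwanted blocks (such as the purple one in the question) cannot be filtered out afterwards. One possible variation, offered only as a sketch, is to use String#scan so that each color is captured along with its text. The sample input and the names `wanted`, `pairs`, and `wanted_blocks` are made up for illustration, and the pattern assumes every tag has exactly the form `<block color="...">` followed by a newline.

```ruby
# Hypothetical sample input; the colors mirror the ones in the question.
str = <<~EOT
  <block color="green">
  first <p> green </p> text
  <block color="purple">
  unwanted text
  <block color="crimson">
  crimson text
EOT

# Colors to keep (assumption: taken from the regex in the question).
wanted = %w[green crimson]

# Capture the color and everything up to the next opening block tag or
# the end of the string. The /m flag lets . match newlines; the lazy .*?
# combined with the lookahead stops each match before the next tag.
pairs = str.scan(/<block color="(\w+)">\n(.*?)(?=<block color="\w+">|\z)/m)

# Keep only the text of the blocks whose color is wanted.
wanted_blocks = pairs.select { |color, _body| wanted.include?(color) }
                     .map { |_color, body| body }
```

Here `pairs` holds one `[color, text]` pair per block, and `wanted_blocks` holds only the text of the green and crimson ones. The usual caveat from the other answers still applies: this is only as reliable as the input is regular.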


Maybe a simple way to do this is to read the input line by line, checking whether each line starts with a block tag.
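
A rough sketch of that line-by-line idea follows. It assumes each opening tag sits on its own line in the form `<block color="...">`; the sample text and the names `wanted`, `blocks`, and `current` are made up for illustration. A real script would probably iterate over a file with File.foreach instead of a string.

```ruby
# Hypothetical sample input; the colors mirror the ones in the question.
text = <<~EOT
  <block color="green">
  first <p> green </p> text
  <block color="purple">
  unwanted text
  <block color="crimson">
  crimson text
EOT

wanted  = %w[green crimson] # colors to keep (assumption)
blocks  = []                # collected text of each wanted block
current = nil               # buffer for the block being read; nil while skipping

text.each_line do |line|
  if (m = line.match(/\A\s*<block color="(\w+)">/))
    # A new block starts: store the previous one if it was being kept.
    blocks << current if current
    # Start a fresh buffer only for wanted colors; otherwise skip lines.
    current = wanted.include?(m[1]) ? String.new : nil
  elsif current
    current << line
  end
end

# Store the final block, which has no following tag to close it.
blocks << current if current
```

After the loop, `blocks` contains the text of the green and crimson blocks only; the lines under the purple tag are skipped because no buffer is open while they are read.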
