retrieving text between tags

2023-02-20 16:20 Q&A author:

I need to create a regular expression to obtain everything contained between two tags that are either <block color="crimson"> or <block color="green">, and there can be multiple lines between these tags. For example:

<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
<block color="blue">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...      
<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...

Each block tag marks the beginning of a new block. I have tried the following regular expression, but I am a bit lost on how to specify that anything can go between those parentheses, including multiple lines, and also how to specify that it needs to stop retrieving things once it reaches another <block> tag:

<block color="crimson">(\w+)|<block color="green">(\w+)

Oops, I forgot to add that I am not interested in blocks that appear as:

<block color="purple">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...

I wouldn't suggest using a regular expression for this. First see if you can make the content valid HTML by adding closing tags. Then use something like Nokogiri; here's a tutorial:

http://nokogiri.org/tutorials/parsing_an_html_xml_document.html

Even if you can't clean up the HTML, I'd give Nokogiri a shot. It has handled some pretty broken HTML for me before.

Good luck!

Using regex for parsing HTML is asking for trouble except for the most trivial, controlled circumstances. A parser is more robust and, in the long run, usually a lot easier to maintain.

The HTML is invalid because the <block> tags are not terminated. That makes Nokogiri's parse ambiguous, but we can play a minor trick on it to get things fixed up and then parse it correctly:

html = <<EOT
<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
<block color="blue">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...      
<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
EOT

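require 'nokogiri'
require 'pp'

# Put a closing </block> in front of every opening <block> so that
# Nokogiri sees each block as terminated before the next one starts.
doc = Nokogiri::HTML(html.gsub('<block', '</block><block'))

# Collect the text of every <block> node; inner tags such as <p> are dropped.
pp doc.search('block').map { |n| n.text }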

>> ["\n        This is the text I need and\n        it may also  have other \n        tags in it, and all sorts of \n        things...\n",
>>  "\n        This is the text I need and\n        it may also  have other \n        tags in it, and all sorts of \n        things...\n",
>>  "\n        This is the text I need and\n        it may also  have other \n        tags in it, and all sorts of \n        things...      \n",
>>  "\n        This is the text I need and\n        it may also  have other \n        tags in it, and all sorts of \n        things...\n"]

The search and replace inserts a closing </block> in front of every <block> tag. The first one it inserts is wrong, because there is no open block for it to close, but the rest are close enough that Nokogiri's fix-up of the HTML will be sensible. Here's what the HTML looks like after the fix-up:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
</block><block color="blue">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
</block><block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...      
</block><block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
</block>
</body></html>

At that point Nokogiri can make sense of the document and search for the individual blocks. I'm using a CSS accessor, so if you need finer granularity you can fine-tune the CSS (for example, block[color="green"] selects only the green blocks), or switch to XPath instead.
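
The search-and-replace step leaves one stray </block> at the front and no closing tag after the last block, which Nokogiri's fix-up tolerates. As a minimal sketch, assuming the input contains no </block> tags to begin with, the markup could also be balanced in plain Ruby before parsing. The name close_blocks is a hypothetical helper, not part of Nokogiri:

```ruby
# Hypothetical helper (not a Nokogiri API): terminate each <block>
# right before the next one opens, and close the last one at the end.
# Assumes the input has no </block> tags of its own.
def close_blocks(html)
  return html unless html.include?('<block')

  fixed = html.gsub('<block', '</block><block') # close tag before every opener
  fixed = fixed.sub('</block>', '')             # drop the stray first close tag
  fixed + '</block>'                            # terminate the final block
end

sample = %(<block color="green">\nfirst\n<block color="blue">\nsecond\n)
puts close_blocks(sample)
```

The result can be passed to Nokogiri::HTML in place of the plain gsub output if you prefer not to rely on the parser's recovery.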

str = %q(<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
<block color="blue">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...      
<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...)

# Split on every opening block tag; each remaining piece is one block's body.
ar = str.split(/<block color="\w+">\n/)
ar.shift # drop the leading empty string produced by the split
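
Splitting throws away the color, so it cannot skip the purple blocks the question wants to ignore. A minimal sketch of a variant that keeps the color is shown here, assuming the tags always look exactly like <block color="..."> followed by a newline. WANTED, BLOCK_RE, wanted_blocks and the sample text are names and data made up for illustration:

```ruby
# Sample input in the same shape as the question's data (illustrative only).
text = <<~EOT
  <block color="green">
    first <p> green </p> text
  <block color="purple">
    unwanted text
  <block color="crimson">
    crimson text
EOT

# Colors the question asks for; purple is deliberately left out.
WANTED = %w[green crimson].freeze

# /m lets . match newlines, .*? is lazy, and the lookahead stops the
# match at the next opening <block tag or at the end of the string.
BLOCK_RE = /<block color="(\w+)">\n(.*?)(?=<block |\z)/m

# Returns the bodies of the blocks whose color is in `colors`.
def wanted_blocks(text, colors = WANTED)
  text.scan(BLOCK_RE)
      .select { |color, _body| colors.include?(color) }
      .map { |_color, body| body }
end

puts wanted_blocks(text)
```

This answers the two points the question was stuck on: the /m flag is what allows the match to span multiple lines, and the lookahead is what makes it stop at the next block tag. It is still a regex over markup, so the earlier warnings about fragility apply.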

Maybe a simpler way to do this is to read the input line by line, checking whether each line starts with <block.
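
The line-by-line idea could be sketched roughly as follows, assuming every block tag sits on its own line as in the question's sample. The helper name blocks_by_line and the default color list are assumptions for illustration:

```ruby
# Hypothetical helper: walk the input one line at a time and collect
# the body of each <block> whose color is in `wanted`.
# `lines` is anything that yields lines, e.g. String#each_line or File.foreach.
def blocks_by_line(lines, wanted = %w[green crimson])
  blocks = []
  keep = false

  lines.each do |line|
    if (m = line.match(/\A\s*<block color="(\w+)">/))
      # A new block starts here; decide whether to keep its body.
      keep = wanted.include?(m[1])
      blocks << +'' if keep
    elsif keep
      # Ordinary line inside a wanted block: append it to the current body.
      blocks.last << line
    end
  end

  blocks
end

sample = %(<block color="green">\n  keep me\n<block color="purple">\n  skip me\n)
p blocks_by_line(sample.each_line)
```

Because it never holds more than the collected bodies in memory, this approach also works on large files via File.foreach.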
