Parsing through text to find HTML tags in Ruby 1.9.x
I want to be able to match text in between two tags, starting at an opening tag and ending in a closing tag.
Say I have this block of text in a variable called 'text':
some text some text some text some text some text
<some_tag>
some text some text some text some text some text
</some_tag>
some text some text some text some text some text
I want to scan the contents of 'text', doing nothing until it finds an opening tag (in this case 'some_tag'). Once it finds an opening tag, I want it to capture everything until that tag closes.
I've been fooling around with blocks and regular expressions for about an hour now and cannot seem to figure out a good way to work this out.
I'd appreciate any and all pointers, thanks!
You should use a parser for HTML. Regex and HTML tend to make a volatile mix that leads to insanity in large doses.
Using Nokogiri:
require 'nokogiri'
html = <<EOT
some text some text some text some text some text
<p>
some text some text some text some text some text
</p>
some text some text some text some text some text
EOT
doc = Nokogiri::HTML::DocumentFragment.parse(html)
puts doc.search('p').map { |n| n.inner_text }
>> some text some text some text some text some text
This is searching through the HTML fragment, looking for <p> tags. For each one it finds, it'll extract the inner text.
I'm using Nokogiri's CSS mode, by using "p". I could use XPath instead, but CSS is understood by more people.