Fastest method to find a specific word in an XHTML document

What would be the fastest way to do this?

I have many HTML documents that might (or might not) contain the word "Instructions" followed by several lines of instructions. I want to parse the pages that contain the word "Instructions" and extract the lines that follow it.


Maybe something along these lines:

require 'rubygems'
require 'nokogiri'

def find_instructions(doc)
  doc.xpath('//body//text()').each do |text|
    # String is not Enumerable in Ruby 1.9+, so split the node into lines first
    instructions = text.content.lines.map(&:chomp).select do |line|
      # flip-flop matches all sections starting with
      # "Instructions" and ending with an empty line
      true if (line =~ /Instructions/)..(line =~ /^$/)
    end
    return instructions unless instructions.empty?
  end
  []
end

puts find_instructions(Nokogiri::HTML(DATA.read))


__END__
<html>
<head>
  <title>Instructions</title>
</head>
<body>
lorem
ipsum
<p>
lorem
ipsum
<p>
lorem
ipsum
<p>
Instructions
- Browse stackoverflow
- Answer questions
- ???
- Profit

More
<p>
lorem
ipsum
</body>
</html>


This is not the most "correct" way, but it will mostly work: use a Ruby regular expression to find the strings.

The regex you want is something like /instructions([^<]+)/i; the i flag makes the match case-insensitive, so it also finds "Instructions". This assumes that the instructions end at the next < character.
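A minimal sketch of the regex approach described above, using only the Ruby standard library. The helper name extract_instructions and the sample string are hypothetical, and the i flag is an assumption added so that the lowercase pattern also matches the capitalized word "Instructions". Because [^<]+ needs at least one character before the next tag, a bare <title>Instructions</title> is skipped and the match falls through to the body text.

```ruby
# Hypothetical helper: returns the text that follows "Instructions"
# up to the next "<", or nil when the word is not found.
def extract_instructions(html)
  match = html.match(/instructions([^<]+)/i)
  match && match[1].strip
end

# Made-up sample input, loosely modelled on the question.
sample = "<p>lorem</p><p>Instructions\n- Browse\n- Answer\n<p>more"
puts extract_instructions(sample)
```

This stops at the first tag after the heading, so instructions that contain inline markup such as <b> or <br> would be cut short; a real parser is the safer choice in that case.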


You can start by just testing if a document matches:

if File.read('docname.html') =~ /Instructions/
  # Parse the document to extract the instructions.
end

I'd then recommend using Hpricot to extract the part you want. How difficult that is depends on how your HTML is structured. Please post more details about the structure if you want more specific help.
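Since the question is about many documents, the cheap pre-check above can be applied to a whole directory before any parsing happens. The following is only a sketch: the helper name files_mentioning and the '*.html' glob pattern are assumptions, not something from the question.

```ruby
# Hypothetical helper: returns the sorted paths of the .html files in `dir`
# whose raw contents contain `word`. Only these files need a full parse.
def files_mentioning(dir, word = 'Instructions')
  Dir.glob(File.join(dir, '*.html')).sort.select do |path|
    File.read(path).include?(word)
  end
end
```

Calling files_mentioning('pages') would list the candidates in a directory named pages. A plain substring test is used instead of a regex because it does no pattern compilation or backtracking. It will also match files where the word appears only in the <title>, so the later parsing step still has to handle documents without an instructions section.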
