Fastest method to find a specific word in an XHTML document
What would be the fastest way to do this?
I have many HTML documents that might (or might not) contain the word "Instructions" followed by several lines of instructions. I want to parse the pages that contain the word "Instructions" and extract the lines that follow it.
Maybe something along these lines:
require 'rubygems'
require 'nokogiri'
def find_instructions(doc)
  # Check each text node under <body>, in document order
  doc.xpath('//body//text()').each do |text|
    # String is not Enumerable in Ruby 1.9+, so split into lines first
    instructions = text.content.lines.select do |line|
      # flip-flop matches all sections starting with
      # "Instructions" and ending with an empty line
      true if (line =~ /Instructions/)..(line =~ /^$/)
    end
    return instructions unless instructions.empty?
  end
  []
end
puts find_instructions(Nokogiri::HTML(DATA.read))
__END__
<html>
<head>
<title>Instructions</title>
</head>
<body>
lorem
ipsum
<p>
lorem
ipsum
<p>
lorem
ipsum
<p>
Instructions
- Browse stackoverflow
- Answer questions
- ???
- Profit
More
<p>
lorem
ipsum
</body>
</html>
This is not the most "correct" way, but it will mostly work. Use a Ruby regular expression to find the strings.
The regex you want is something like /Instructions([^<]+)/. This assumes that the instructions end at the next < character, that is, at the start of the next tag.
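As a minimal sketch of how that regex could be used, assuming the instructions sit in one run of text that ends at the next tag (the `extract_instructions` helper and the sample string are hypothetical, for illustration only):

```ruby
# Hypothetical helper: returns the lines that follow "Instructions",
# up to (but not including) the next "<" character, or [] if there is no match.
def extract_instructions(html)
  match = html.match(/Instructions([^<]+)/)
  return [] unless match

  # Strip surrounding whitespace from each captured line and drop blank lines.
  match[1].lines.map(&:strip).reject(&:empty?)
end

# Made-up sample input, not taken from a real page.
sample = "<p>lorem ipsum</p>\n<p>\nInstructions\n- Browse stackoverflow\n- Answer questions\n</p>"
puts extract_instructions(sample)
```

Because `[^<]+` needs at least one character, an occurrence that is immediately followed by a tag, such as `<title>Instructions</title>`, is skipped and the search moves on to the next occurrence.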
You can start by just testing whether a document matches:
if File.read('docname.html') =~ /Instructions/
  # Parse the document to extract the instructions.
end
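With many documents, the same test can serve as a cheap pre-filter so that only matching files are handed to a parser. A rough sketch, assuming the files sit in a local `docs` directory (the directory name and the `paths_with_instructions` helper are hypothetical):

```ruby
# Hypothetical helper: returns the paths whose contents mention "Instructions".
# A plain substring check avoids building a DOM for files that cannot match.
def paths_with_instructions(paths)
  paths.select { |path| File.read(path).include?('Instructions') }
end

# Assumed layout: the HTML documents live in a local "docs" directory.
candidates = paths_with_instructions(Dir.glob('docs/*.html'))
puts candidates
```

Note that this also selects pages where the word appears only in the title or in markup, so the parsing step still has to confirm that instructions are actually present.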
I'd then recommend using Hpricot to extract the part you want. How difficult that is depends on how your HTML is structured. Please post more details about the structure if you want more specific help.