Scrape URLs From Web
<a href="http://www.开发者_高级运维utoronto.ca/gdrs/" title="Rehabilitation Science"> Rehabilitation Science</a>
For the example above, I want to get the department name "Rehabilitation Science" and its homepage URL "http://www.utoronto.ca/gdrs/" at the same time.
Could someone please suggest some smart regular expressions that would do the job for me?
There's no reason to use regex to do this at all. Here's a solution using Nokogiri, which is the usual Ruby HTML/XML parser:
html = <<EOT
<p><a href="http://www.example.com/foo">foo</a></p>
<p><a href='http://www.example.com/foo1'>foo1</p></a>
<p><a href=http://www.example.com/foo2>foo2</a></p>
<p><a href = http://www.example.com/bar>bar</p>
<p><a
href="http://www.example.com/foobar"
>foobar</a></p>
<p><a
href="http://www.example.com/foobar2"
>foobar2</p>
EOT
require 'nokogiri'
doc = Nokogiri::HTML(html)
links = Hash[
  *doc.search('a').map { |a|
    [
      a['href'],
      a.content
    ]
  }.flatten
]
require 'pp'
pp links
# >> {"http://www.example.com/foo"=>"foo",
# >> "http://www.example.com/foo1"=>"foo1",
# >> "http://www.example.com/foo2"=>"foo2",
# >> "http://www.example.com/bar"=>"bar",
# >> "http://www.example.com/foobar"=>"foobar",
# >> "http://www.example.com/foobar2"=>"foobar2"}
This returns a hash of URLs as keys with the related content of the <a>
tag as the value. That means you'll only capture unique URLs, throwing away duplicates. If you want all URLs use:
links = doc.search('a').map { |a|
  [
    a['href'],
    a.content
  ]
}
which results in:
# >> [["http://www.example.com/foo", "foo"],
# >> ["http://www.example.com/foo1", "foo1"],
# >> ["http://www.example.com/foo2", "foo2"],
# >> ["http://www.example.com/bar", "bar"],
# >> ["http://www.example.com/foobar", "foobar"],
# >> ["http://www.example.com/foobar2", "foobar2"]]
I used the CSS accessor 'a' to locate the tags. I could use 'a[href]' if I wanted to grab only links, ignoring named anchors, as sketched below.
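As a minimal sketch of that narrower selector, reusing the doc parsed above:

# Matches only <a> tags that actually carry an href attribute, so
# named anchors such as <a name="top"> are skipped entirely.
links = doc.search('a[href]').map { |a| [a['href'], a.content] }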
Regexes are very fragile when dealing with HTML and XML because those markup formats are too free-form; they can vary in their format while remaining valid, especially HTML, which can vary wildly in its "correctness". If you don't own the generation of the file being parsed, then your regex-based code is at the mercy of whoever does generate it; a simple change in the file can break the pattern badly, resulting in a continual maintenance headache.
A parser, because it actually understands the internal structure of the file, can withstand those changes. Notice that I deliberately created some malformed HTML above, but the code didn't care. Compare the simplicity of the parser version to a regex solution and think of long-term maintainability.
I would suggest using an HTML parser as @mrk suggested, then running the result you get back through a regex tester. I like to use Rubular. It will show you what the regex is capturing, so you can avoid unwanted results. I've found that the expression /http[^"]+/ works well in a situation like this, because it grabs the entire URL even if there is no "www." and avoids capturing the quotes.
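As a hedged sketch of that two-step idea (the sample string is just the question's link):

html = '<a href="http://www.utoronto.ca/gdrs/" title="Rehabilitation Science"> Rehabilitation Science</a>'
# /http[^"]+/ stops at the closing quote, so the quotes are never captured.
puts html.scan(/http[^"]+/)
# >> http://www.utoronto.ca/gdrs/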
If you're building a spider, then Ruby's Mechanize is a great choice. To fetch a page and extract the links:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get "http://google.com/"
page.links.each do |link|
  puts link.href
  puts link.text
end
The documentation and the guide (that I linked to) lay out a lot of what you'll probably want to do. Using regular expressions to parse HTML (or XML) is notoriously tricky and error-prone. Using a full parser (as others have suggested) will save you effort and make your code more robust.
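To get the name/URL pairs the question asks for, a minimal sketch building on the page fetched above (google.com is just a stand-in URL):

# Each Mechanize link exposes .text and .href, so pairing them up is a one-liner.
pairs = page.links.map { |link| [link.text, link.href] }
pairs.each { |text, href| puts "#{text} => #{href}" }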
Trying not to overcomplicate this:
#<a .*?href="([^"]*)".*>([^<]+)</a>#i
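The # delimiters above read as PCRE-style; in Ruby the equivalent literal is %r{...}. A quick sketch against the question's markup (note the greedy .* can overshoot on a line with several links, which is exactly the fragility the parser answers warn about):

pattern = %r{<a .*?href="([^"]*)".*>([^<]+)</a>}i
sample = '<a href="http://www.utoronto.ca/gdrs/" title="Rehabilitation Science"> Rehabilitation Science</a>'
sample.scan(pattern) { |href, text| puts "#{text.strip} => #{href}" }
# >> Rehabilitation Science => http://www.utoronto.ca/gdrs/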
Here is my Ruby Approach:
require 'open-uri'

class HTMLScraper
  def initialize(page)
    @src = page
    # Read the page source (on Ruby 3+ use URI.open instead of open)
    open(@src) do |x|
      @html = x.read
    end
  end

  def parse_links
    # Capture [url, anchor text] pairs from every double-quoted href
    links = @html.scan(/<a\s+href\s*=\s*"([^"]+)"[^>]*>\s*([^<]+)\s*<\/a>/ui)
    puts "Link(s) Found:"
    links.each do |link|
      puts "\t#{link}"
    end
  end
end

url = "http://stackoverflow.com/questions"
test = HTMLScraper.new(url)
test.parse_links
This will give you an array of arrays, in which the first item of each inner array is the URL and the second is the title. Hope this helps, and note the u switch on the regex, which is there to avoid encoding problems.