Scrape URLs From Web
<a href="http://www.开发者_高级运维utoronto.ca/gdrs/" title="Rehabilitation Science"> Rehabilitation Science</a>
For the example above, I want to get the department name "Rehabilitation Science" and its homepage URL "http://www.utoronto.ca/gdrs/" at the same time.
Could someone please suggest some smart regular expressions that would do the job for me?
There's no reason to use regex to do this at all. Here's a solution using Nokogiri, which is the usual Ruby HTML/XML parser:
html = <<EOT
<p><a href="http://www.example.com/foo">foo</a></p>
<p><a href='http://www.example.com/foo1'>foo1</p></a>
<p><a href=http://www.example.com/foo2>foo2</a></p>
<p><a href = http://www.example.com/bar>bar</p>
<p><a
href="http://www.example.com/foobar"
>foobar</a></p>
<p><a
href="http://www.example.com/foobar2"
>foobar2</p>
EOT
require 'nokogiri'
doc = Nokogiri::HTML(html)
links = Hash[
  *doc.search('a').map { |a|
    [
      a['href'],
      a.content
    ]
  }.flatten
]
require 'pp'
pp links
# >> {"http://www.example.com/foo"=>"foo",
# >> "http://www.example.com/foo1"=>"foo1",
# >> "http://www.example.com/foo2"=>"foo2",
# >> "http://www.example.com/bar"=>"bar",
# >> "http://www.example.com/foobar"=>"foobar",
# >> "http://www.example.com/foobar2"=>"foobar2"}
This returns a hash of URLs as keys with the related content of the <a>
tag as the value. That means you'll only capture unique URLs, throwing away duplicates. If you want all URLs use:
links = doc.search('a').map { |a|
  [
    a['href'],
    a.content
  ]
}
which results in:
# >> [["http://www.example.com/foo", "foo"],
# >> ["http://www.example.com/foo1", "foo1"],
# >> ["http://www.example.com/foo2", "foo2"],
# >> ["http://www.example.com/bar", "bar"],
# >> ["http://www.example.com/foobar", "foobar"],
# >> ["http://www.example.com/foobar2", "foobar2"]]
I used the CSS accessor 'a' to locate the tags. I could use 'a[href]' if I wanted to grab only links, ignoring named anchors, as sketched below.
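As a minimal sketch of that narrower selector, reusing the doc parsed above:

# Matches only <a> tags that actually carry an href attribute, so
# named anchors such as <a name="top"> are skipped entirely.
links = doc.search('a[href]').map { |a| [a['href'], a.content] }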
Regexes are very fragile when dealing with HTML and XML because those markup formats are too free-form; they can vary in their format while remaining valid, especially HTML, which can vary wildly in its "correctness". If you don't own the generation of the file being parsed, then your regex-based code is at the mercy of whoever does generate it; a simple change in the file can break the pattern badly, resulting in a continual maintenance headache.
A parser, because it actually understands the internal structure of the file, can withstand those changes. Notice that I deliberately created some malformed HTML above, but the code didn't care. Compare the simplicity of the parser version to a regex solution and think of long-term maintainability.
I would suggest using an HTML parser as @mrk suggested, then running the result you get back through a regex tester. I like to use Rubular. It will show you what the regex is capturing, so you can avoid unwanted results. I've found that the expression /http[^"]+/ works well in a situation like this, because it grabs the entire URL even if there is no "www." and avoids capturing the quotes.
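As a hedged sketch of that two-step idea (the sample string is just the question's link):

html = '<a href="http://www.utoronto.ca/gdrs/" title="Rehabilitation Science"> Rehabilitation Science</a>'
# /http[^"]+/ stops at the closing quote, so the quotes are never captured.
puts html.scan(/http[^"]+/)
# >> http://www.utoronto.ca/gdrs/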
If you're building a spider, then Ruby's Mechanize is a great choice. To fetch a page and extract the links:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get "http://google.com/"
page.links.each do |link|
  puts link.href
  puts link.text
end
The documentation and the guide (that I linked to) lay out a lot of what you'll probably want to do. Using regular expressions to parse HTML (or XML) is notoriously tricky and error-prone. Using a full parser (as others have suggested) will save you effort and make your code more robust.
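To get the name/URL pairs the question asks for, a minimal sketch building on the page fetched above (google.com is just a stand-in URL):

# Each Mechanize link exposes .text and .href, so pairing them up is a one-liner.
pairs = page.links.map { |link| [link.text, link.href] }
pairs.each { |text, href| puts "#{text} => #{href}" }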
Trying not to overcomplicate this:
#<a .*?href="([^"]*)".*>([^<]+)</a>#i
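The # delimiters above read as PCRE-style; in Ruby the equivalent literal is %r{...}. A quick sketch against the question's markup (note the greedy .* can overshoot on a line with several links, which is exactly the fragility the parser answers warn about):

pattern = %r{<a .*?href="([^"]*)".*>([^<]+)</a>}i
sample = '<a href="http://www.utoronto.ca/gdrs/" title="Rehabilitation Science"> Rehabilitation Science</a>'
sample.scan(pattern) { |href, text| puts "#{text.strip} => #{href}" }
# >> Rehabilitation Science => http://www.utoronto.ca/gdrs/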
Here is my Ruby Approach:
require 'open-uri'

class HTMLScraper
  def initialize(page)
    @src = page
    # Read the page source (on Ruby 3+ use URI.open instead of open)
    open(@src) do |x|
      @html = x.read
    end
  end

  def parse_links
    # Capture [url, anchor text] pairs from every double-quoted href
    links = @html.scan(/<a\s+href\s*=\s*"([^"]+)"[^>]*>\s*([^<]+)\s*<\/a>/ui)
    puts "Link(s) Found:"
    links.each do |link|
      puts "\t#{link}"
    end
  end
end

url = "http://stackoverflow.com/questions"
test = HTMLScraper.new(url)
test.parse_links
This will give you an array of arrays, in which the first item of each inner array is the URL and the second is the title. Hope this helps, and note the u switch on the regex, which is there to avoid encoding problems.