CSS Selector for group of elements?

I'm trying to scrape an HTML site with this structure:

<a name="how"></a>
<div class="ignore"></div>
<p>...</p>
<p>...</p>
<p>...</p>
<h3>...</h3>
<p>...</p>
<ul>...</ul>
<p>...</p>
<p>...</p>
<p>...</p>
<p>...</p>
<a name="other-uses"></a>

I need to grab all of the p, h3 and ul tags between the two a[name] anchor elements.

Right now I successfully grabbed the first p:

a[name='how'] + div + p

but I'm not sure how to grab all of the elements between the two.

I'm using the ScrAPI Ruby scraping library, which accepts any valid CSS selector.


I don't believe this can be done in a single CSS selector, but would love to be proven wrong.

It can be done in a single XPath expression, however:

//*[preceding-sibling::a/@name="how" and following-sibling::a/@name="other-uses"]

So if an alternative scraping library is an option, such as Mechanize (which uses Nokogiri, an HTML parser with XPath support), the XPath above will do the job.

EDIT: for completeness, here's a fully functioning script that demonstrates the XPath expression using the Nokogiri HTML parser.

require 'nokogiri'

# Sample document mirroring the structure in the question.
html = <<ENDOFHTML
<html>
<body>
    <a name="how"></a>
    <div class="ignore"></div>
    <p>...</p>
    <p>...</p>
    <p>...</p>
    <h3>...</h3>
    <p>...</p>
    <ul>...</ul>
    <p>...</p>
    <p>...</p>
    <p>...</p>
    <p>...</p>
    <a name="other-uses"></a>
</body>
</html>
ENDOFHTML

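doc = Nokogiri::HTML.parse(html)

# Select every element that has the "how" anchor among its preceding siblings
# and the "other-uses" anchor among its following siblings.
puts doc.xpath('//*[preceding-sibling::a/@name="how" and following-sibling::a/@name="other-uses"]')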

Result:

<div class="ignore"></div>
<p>...</p>
<p>...</p>
<p>...</p>
<h3>...</h3>
<p>...</p>
<ul>...</ul>
<p>...</p>
<p>...</p>
<p>...</p>
<p>...</p>