CSS Selector for group of elements?
I'm trying to scrape an HTML site with this structure:
<a name="how"></a>
<div class="ignore"></div>
<p>...</p>
<p>...</p>
<p>...</p>
<h3>...</h3>
<p>...</p>
<ul>...</ul>
<p>...</p>
<p>...</p>
<p>...</p>
<p>...</p>
<a name="other-uses"></a>
I need to grab all of the p, h3 and ul tags between the two a[name] anchor elements.
Right now I successfully grabbed the first p:
a[name='how'] + div + p
but I'm not sure how to grab all of the elements between the two.
This is for the ScrAPI Ruby scraping library, which accepts all valid CSS selectors.
I don't believe this can be done in a single CSS selector, but would love to be proven wrong.
It can be done in a single XPath expression, however:
//*[preceding-sibling::a/@name="how" and following-sibling::a/@name="other-uses"]
so if an alternative scraping library is an option, such as Mechanize (which uses Nokogiri, an HTML parser with XPath support), the XPath above will do the job.
EDIT: for completeness, here's a fully functioning script that demonstrates the XPath using the Nokogiri HTML parser.
require 'rubygems' # only needed on Ruby 1.8; harmless on later versions
require 'nokogiri'
html = <<ENDOFHTML
<html>
<body>
<a name="how"></a>
<div class="ignore"></div>
<p>...</p>
<p>...</p>
<p>...</p>
<h3>...</h3>
<p>...</p>
<ul>...</ul>
<p>...</p>
<p>...</p>
<p>...</p>
<p>...</p>
<a name="other-uses"></a>
</body>
</html>
ENDOFHTML
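doc = Nokogiri::HTML.parse(html)
# Select every element that has the "how" anchor among its preceding siblings
# and the "other-uses" anchor among its following siblings.
puts doc.xpath('//*[preceding-sibling::a/@name="how" and following-sibling::a/@name="other-uses"]')
Result (the div.ignore element is matched too, because the expression selects every element between the two anchors):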
<div class="ignore"></div>
<p>...</p>
<p>...</p>
<p>...</p>
<h3>...</h3>
<p>...</p>
<ul>...</ul>
<p>...</p>
<p>...</p>
<p>...</p>
<p>...</p>