CSS Selector for group of elements?

2023-03-10 22:22

I'm trying to scrape an HTML site with this structure:

<a name="how"></a>
<div class="ignore"></div>
<p>...</p>
<p>...</p>
<p>...</p>
<h3>...</h3>
<p>...</p>
<ul>...</ul>
<p>...</p>
<p>...</p>
<p>...</p>
<p>...</p>
<a name="other-uses"></a>

I need to grab all of the p, h3 and ul tags between the two a[name] anchor elements.

So far I have managed to grab the first p:

a[name='how'] + div + p

but I'm not sure how to grab all of the elements between the two.

This is being used with the scrAPI Ruby scraping library, which accepts all valid CSS selectors.

I don't believe this can be done in a single CSS selector, but would love to be proven wrong.

It can be done in a single XPath expression, however:

//*[preceding-sibling::a/@name="how" and following-sibling::a/@name="other-uses"]

So if an alternative scraping library such as Mechanize is an option (it uses Nokogiri, an HTML parser with XPath support), the XPath above will do the job.

EDIT: for completeness, here's a fully functioning script that demonstrates the XPath using the Nokogiri HTML parser.

require 'nokogiri'

html = <<ENDOFHTML
<html>
<body>
    <a name="how"></a>
    <div class="ignore"></div>
    <p>...</p>
    <p>...</p>
    <p>...</p>
    <h3>...</h3>
    <p>...</p>
    <ul>...</ul>
    <p>...</p>
    <p>...</p>
    <p>...</p>
    <p>...</p>
    <a name="other-uses"></a>
</body>
</html>
ENDOFHTML

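doc = Nokogiri::HTML.parse(html)

# Select every element that has the "how" anchor among its preceding siblings
# and the "other-uses" anchor among its following siblings.
puts doc.xpath('//*[preceding-sibling::a/@name="how" and following-sibling::a/@name="other-uses"]')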

Result:

<div class="ignore"></div>
<p>...</p>
<p>...</p>
<p>...</p>
<h3>...</h3>
<p>...</p>
<ul>...</ul>
<p>...</p>
<p>...</p>
<p>...</p>
<p>...</p>
