
XPath expression to select text not in paragraph

I'm developing web scraping software that relies on XPath to extract information from web pages.

One application of the software is to scrape reviews of shows from websites. One page I'm trying to scrape is the Guardian's latest Edinburgh festival reviews: http://www.guardian.co.uk/culture/edinburghfestival+tone/reviews

The section I want is at the bottom, titled "Most recent". The XPath expression for the list of review items (that is the pic, the stars, the date, the blurb, etc) is

//ul[@id='auto-trail-block']

which returns a list of li elements, each corresponding to one review item.

If I want to refer to only the blurb, the closest I can get is to say

//ul[@id='auto-trail-block']/div[@class='trailtext']

but when I collect the text content from each item of the list, it includes lots of JavaScript and nasty stuff I don't need. I can't refer to the blurb itself because it is not inside a p element, but within a div element that contains script elements and strong elements that contain JavaScript and unrelated text respectively.

In the debugger, the DOM looks like this:

<ul id="auto-trail-block" ...>
  <li ...>
    <div ...>
    <div ...>
      <div ...>
      <div class="trailtext">
        <script ...>
        <div ...>
        <span ...>
        <strong .../>
        <br/>
        The Text I want to copy!
        <strong .../>
        <a .../>
        <div .../>
      </div>
    </div>
  </li>
  <li ...>
    ...
  </li>
  ...
</ul>

Is there any way to refer to the text content contained in just the div and not any of its subelements?


My approach would be to select the trailtext div, remove the script tags with their content and all HTML tags. What's left would be the content you want.

Just wondering - what does the inner text node of //ul[@id='auto-trail-block']/div[@class='trailtext'] return? I would guess mostly the blurb, so clearing out the script tags should almost get you there.


If you only want the text node children of div[@class='trailtext'], then use text():

//ul[@id='auto-trail-block']//div[@class='trailtext']/text()
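
As a rough illustration of what text() selects here, the following is a minimal Python sketch using only the standard library. It rests on several assumptions: the SAMPLE markup is a hypothetical, simplified, well-formed stand-in for the real page, and direct_text is a helper name made up for this example. The standard library's xml.etree.ElementTree supports only a subset of XPath without the text() node test, so the sketch emulates it: the text nodes that are direct children of an element are its own .text plus the .tail of each child. With a full XPath 1.0 engine, the expression above can be evaluated as is.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page markup (must be well-formed XML).
SAMPLE = """
<html>
  <body>
    <ul id="auto-trail-block">
      <li>
        <div>
          <div class="trailtext">
            <script>var x = 1;</script>
            <strong>4 stars</strong>
            <br/>
            The Text I want to copy!
            <strong>More</strong>
            <a href="#">link</a>
          </div>
        </div>
      </li>
    </ul>
  </body>
</html>
"""


def direct_text(element):
    """Return the non-blank text nodes that are direct children of element.

    This mirrors what element/text() selects in XPath: the element's leading
    text plus the text that follows each child element (its "tail").
    Text inside child elements such as script or strong is not included.
    """
    pieces = [element.text] + [child.tail for child in element]
    return [p.strip() for p in pieces if p and p.strip()]


root = ET.fromstring(SAMPLE)

# ElementTree's XPath subset handles the element part of the expression.
divs = root.findall(".//ul[@id='auto-trail-block']//div[@class='trailtext']")

# One blurb per review item; script and strong contents are left out.
blurbs = [" ".join(direct_text(div)) for div in divs]
print(blurbs)
```

Note that text() returns every direct text node, including whitespace-only ones between the child elements, so stripping and filtering out blank strings is usually still needed after the query.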