
Yahoo Pipes: How to parse sub-DIVs

For a page that has multiple DIVs, how can I fetch content only from the DIVs that contain useful text and skip the DIVs that hold ads, etc.?

For example, a page structure like this:

...

<div id="articlecopy">

  <div class="advertising 1">Ads I do not want to fetch.</div>

  <p>Useful texts go here</p>

  <div class="advertising 2">Ads I do not want to fetch.</div>

  <div class="related_articles_list">I do not want to read related articles so parse this part too</div>

</div>

...

In this fictional example, I want to get rid of the two advertising DIVs and the related-articles DIV. All I want is to fetch the useful content in the <p> inside the parent DIV.

Can Pipes do this?

Thank you.


Try the YQL module with xpath. Something along these lines:

SELECT * from html where url="http://MyWebPageWithAds.com" and xpath='//div/p'

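The query above retrieves every <p> element that is a direct child of a <div>. For the page in the question, xpath='//div[@id="articlecopy"]/p' would narrow the match to the article container. You can make the xpath more specific still if your DIVs have attributes.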
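To see what that xpath picks out without going through YQL, here is a minimal local sketch using Python's standard-library ElementTree. It assumes the markup is well-formed XML (real pages often are not), and the SAMPLE string and the select_paragraphs helper are made up for illustration. ElementTree only supports a subset of XPath, so the expression is written as './/div/p'.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the page in the question, written as well-formed XML.
SAMPLE = """
<html><body>
<div id="articlecopy">
  <div class="advertising 1">Ads I do not want to fetch.</div>
  <p>Useful texts go here</p>
  <div class="advertising 2">Ads I do not want to fetch.</div>
  <div class="related_articles_list">Related articles</div>
</div>
</body></html>
"""

def select_paragraphs(markup):
    """Return the text of every <p> that is a direct child of a <div>."""
    root = ET.fromstring(markup.strip())
    # './/div/p' is ElementTree's spelling of the '//div/p' xpath.
    return [p.text for p in root.findall(".//div/p")]

paragraphs = select_paragraphs(SAMPLE)
print(paragraphs)
```

Only the <p> text comes back; the advertising and related-articles DIVs are not matched because they are not <p> elements.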

Say for example you had a page with several DIVs, but the one you wanted looked like this:

<div>
    <div>Stuff I don't want</div>
    <div class="main_content">Stuff I want to add to my feed</div>
    <div>Other stuff I don't want</div> 
</div>

You would change the YQL string above to this:

SELECT * from html where url="http://MyWebPageWithAds.com" 
and xpath='//div/div[contains(@class,"main_content")]'
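The same class filter can be tried locally as a rough sketch. ElementTree's XPath subset has no contains() function, so this version selects the child DIVs with './/div/div' and does the substring check in Python instead. The SAMPLE string and the select_by_class helper are hypothetical names used only for this illustration, and the markup is assumed to be well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical copy of the example markup above.
SAMPLE = """<div>
    <div>Stuff I don't want</div>
    <div class="main_content">Stuff I want to add to my feed</div>
    <div>Other stuff I don't want</div>
</div>"""

def select_by_class(markup, wanted):
    """Return the text of each div-inside-a-div whose class contains `wanted`."""
    # Wrap in a <body> so the outer <div> is a descendant of the root.
    root = ET.fromstring("<body>" + markup + "</body>")
    return [
        d.text
        for d in root.iterfind(".//div/div")
        # Substring test, mirroring what contains(@class, "...") does in XPath.
        if wanted in d.get("class", "")
    ]

matches = select_by_class(SAMPLE, "main_content")
print(matches)
```

Note that contains() is a substring match, so a class such as "not_main_content" would also match; if that matters on your page, compare against the whole attribute value instead.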

I've only recently discovered YQL myself, and am fairly new to using xpaths, but it has worked for me so far.
