Yahoo Pipes: How to parse sub DIVs
For a page that has multiple DIVs, how can I fetch content only from the DIVs that contain useful text and avoid the other DIVs that are for ads, etc.?
For example, a page structure like this:
...
<div id="articlecopy">
<div class="advertising 1">Ads I do not want to fetch.</div>
<p>Useful texts go here</p>
<div class="advertising 2">Ads I do not want to fetch.</div>
<div class="related_articles_list">I do not want to read related articles, so skip this part too</div>
</div>
...
In this fictional example, I want to get rid of the two DIVs for advertising and the DIV for related articles. All I want is to fetch the useful content in the <p> inside the parent DIV.
Can Pipes do this?
Thank you.
Try the YQL module with xpath. Something along these lines:
SELECT * from html where url="http://MyWebPageWithAds.com" and xpath='//div/p'
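The query above retrieves every <p> element that is a direct child of a <div>. You can get fancy with XPath if your DIVs have attributes; for the sample in the question, //div[@id="articlecopy"]/p would restrict the match to paragraphs directly under that parent DIV.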
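If you want to check what //div/p selects before wiring it into a pipe, here is a minimal Python sketch run against the sample markup from the question. It assumes the markup is well-formed enough to parse as XML, and it uses the standard library's ElementTree, which supports only a subset of XPath (".//div/p" is its spelling of //div/p). The names SAMPLE and useful_paragraphs are made up for illustration; this is not how YQL works internally.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, wrapped in <html><body> so it has one root.
SAMPLE = """
<html><body>
<div id="articlecopy">
  <div class="advertising 1">Ads I do not want to fetch.</div>
  <p>Useful texts go here</p>
  <div class="advertising 2">Ads I do not want to fetch.</div>
  <div class="related_articles_list">Related articles.</div>
</div>
</body></html>
"""

def useful_paragraphs(markup):
    """Return the text of every <p> that is a direct child of a <div>."""
    root = ET.fromstring(markup.strip())
    # ".//div/p": any <div> below the root, then its <p> children.
    return ["".join(p.itertext()).strip() for p in root.findall(".//div/p")]

print(useful_paragraphs(SAMPLE))
```

The advertising and related-articles DIVs are skipped simply because they are not <p> elements, so nothing has to be removed explicitly.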
Say for example you had a page with several DIVs, but the one you wanted looked like this:
<div>
<div>Stuff I don't want</div>
<div class="main_content">Stuff I want to add to my feed</div>
<div>Other stuff I don't want</div>
</div>
You would change the YQL string above to this:
SELECT * from html where url="http://MyWebPageWithAds.com"
and xpath='//div/div[contains(@class,"main_content")]'
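The same selection can be tried locally with a rough Python sketch. ElementTree does not implement the XPath contains() function, so this version assumes it is acceptable to walk //div/div and do the substring test on the class attribute in Python instead. SAMPLE and divs_with_class are hypothetical names used only for this illustration.

```python
import xml.etree.ElementTree as ET

# The example structure from above, wrapped in <body> so the outer <div>
# is itself a descendant that ".//div/div" can start from.
SAMPLE = """
<body>
<div>
  <div>Stuff I don't want</div>
  <div class="main_content">Stuff I want to add to my feed</div>
  <div>Other stuff I don't want</div>
</div>
</body>
"""

def divs_with_class(markup, wanted):
    """Mimic //div/div[contains(@class, wanted)] with a substring test."""
    root = ET.fromstring(markup.strip())
    matches = []
    for div in root.findall(".//div/div"):
        # A missing class attribute is treated as an empty string.
        if wanted in div.get("class", ""):
            matches.append("".join(div.itertext()).strip())
    return matches

print(divs_with_class(SAMPLE, "main_content"))
```

Like contains(), the substring test would also match a class such as "main_content_old". If that matters on your page, comparing against div.get("class", "").split() is a stricter alternative.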
I've only recently discovered YQL myself, and am fairly new to using xpaths, but it has worked for me so far.