How to scrape the first paragraph from a Wikipedia page?
Let's say I want to grab the first paragraph of this Wikipedia page. How do I get the main text between the title and the contents box using XPath, or DOM and PHP, or something similar?
Is there any PHP library for that? I don't want to use the API because it's a bit complex.
Note: I just need this to add a widget under my pages that displays related info from Wikipedia.
Use the following XPath expression:
    /*/h:body//h:h1
  |
    /*/h:body//h:h1/following::node()
        [count(. | //h:table[@id='toc']/preceding::node())
        =
         count(//h:table[@id='toc']/preceding::node())
        ]
Here the prefix h: is bound to the XHTML namespace ("http://www.w3.org/1999/xhtml").
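The predicate above is the usual XPath 1.0 intersection idiom: a node that follows the h1 is kept only if adding it to the set of nodes preceding the toc table does not change that set's size, which means it is already a member. To make the logic concrete, here is a minimal Python sketch of the same selection using only the standard library. It is not the XPath itself: ElementTree does not support the following:: and preceding:: axes, so the two axes are emulated by hand, only element nodes are considered (no text or comment nodes), and only the first h1 and the first toc table are used. The SAMPLE document and the function name are hypothetical stand-ins, not the real Wikipedia markup.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"

# Hypothetical, heavily simplified stand-in for a Wikipedia article page.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="content">
      <h1>Example</h1>
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
      <table id="toc"><tr><td>Contents</td></tr></table>
      <p>After the contents box.</p>
    </div>
  </body>
</html>"""


def title_to_toc(root):
    """Return the h1 plus every element that follows it and precedes table#toc."""
    order = list(root.iter())                    # elements in document order
    parent = {c: p for p in order for c in p}    # child -> parent lookup

    h1 = next(e for e in order if e.tag == XHTML + "h1")
    toc = next(e for e in order
               if e.tag == XHTML + "table" and e.get("id") == "toc")

    # Emulates h1/following::*: later in document order, minus h1's descendants.
    h1_subtree = set(h1.iter())
    following = [e for e in order[order.index(h1) + 1:] if e not in h1_subtree]

    # Emulates table/preceding::*: earlier in document order, minus ancestors.
    ancestors = set()
    node = toc
    while node in parent:
        node = parent[node]
        ancestors.add(node)
    preceding = {e for e in order[:order.index(toc)] if e not in ancestors}

    # Union with the h1 itself, then the intersection of the two axes.
    return [h1] + [e for e in following if e in preceding]


root = ET.fromstring(SAMPLE)
for elem in title_to_toc(root):
    print(elem.tag.replace(XHTML, ""), "->", elem.text)
```

On a real article the first paragraph would then be the first p element in the returned list.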
The following XSLT transformation shows that this expression really produces the wanted result:
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:h="http://www.w3.org/1999/xhtml">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>

  <xsl:template match="/">
    <xsl:copy-of select=
     "/*/h:body//h:h1
    |
      /*/h:body//h:h1/following::node()
          [count(. | //h:table[@id='toc']/preceding::node())
          =
           count(//h:table[@id='toc']/preceding::node())
          ]
     "/>
  </xsl:template>
</xsl:stylesheet>
When this transformation is run on the XHTML document of the Wikipedia article (you also need to define two entities, &nbsp; and &reg;, for this document), the wanted result is produced.
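One way to define such entities is an internal DTD subset at the top of the document. The sketch below assumes the two entities in question are &nbsp; and &reg; and uses a made-up one-paragraph document; it only illustrates that a plain XML parser accepts the references once they are declared, and would otherwise reject them as undefined.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; the DOCTYPE's internal subset declares
# the two HTML entities that a plain XML parser does not know by default.
DOC = """<!DOCTYPE html [
  <!ENTITY nbsp "&#160;">
  <!ENTITY reg  "&#174;">
]>
<html xmlns="http://www.w3.org/1999/xhtml">
  <body><p>Wikipedia&reg; is&nbsp;a registered trademark.</p></body>
</html>"""

doc_root = ET.fromstring(DOC)
para = doc_root.find(".//{http://www.w3.org/1999/xhtml}p")

# The entity references are expanded to the declared characters.
print(repr(para.text))
```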