Extract all text from an HTML page without losing context
For a translation program, I am trying to extract text from an HTML file with about 95% accuracy so that I can translate the sentences and links.
For example:
<div><a href="stack">Overflow</a> <span>Texts <b>go</b> here</span></div>
This should give me two results to translate:
Overflow
Texts <b>go</b> here
Are there any suggestions or commercial packages for this problem?
I'm not exactly sure what you're asking, but look at simplehtmldom, a PHP library. Specifically, see the "Extract Contents from HTML" tab under Quick Start on its front page (can't link directly, sigh). With that you can extract the text of a page without all those pesky tags.
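If stripping every tag loses too much context, here is a rough sketch of the segmentation the question describes, written in Python with only the standard-library `html.parser`. It is not how simplehtmldom works; it is just one possible approach. The `INLINE_TAGS` set, the `SegmentExtractor` class, and the `extract_segments` helper are names made up for this illustration, and the choice of which tags count as "inline" is an assumption you would tune for your own pages.

```python
from html.parser import HTMLParser

# Assumption: these tags are inline formatting and stay inside a segment.
# Every other tag is treated as a boundary between segments.
INLINE_TAGS = {"b", "i", "em", "strong", "u"}


class SegmentExtractor(HTMLParser):
    """Collects translatable segments, keeping inline formatting tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.segments = []
        self._buffer = []
        self._has_text = False

    def _flush(self):
        # Emit the buffered pieces only if they contain real text.
        segment = "".join(self._buffer).strip()
        if self._has_text and segment:
            self.segments.append(segment)
        self._buffer = []
        self._has_text = False

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_TAGS:
            # Attributes on inline tags are dropped in this sketch.
            self._buffer.append(f"<{tag}>")
        else:
            self._flush()

    def handle_endtag(self, tag):
        if tag in INLINE_TAGS:
            self._buffer.append(f"</{tag}>")
        else:
            self._flush()

    def handle_data(self, data):
        if data.strip():
            self._has_text = True
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def extract_segments(html):
    parser = SegmentExtractor()
    parser.feed(html)
    parser.close()
    return parser.segments


example = '<div><a href="stack">Overflow</a> <span>Texts <b>go</b> here</span></div>'
print(extract_segments(example))
```

For the example in the question this produces the two segments `Overflow` and `Texts <b>go</b> here`. Known gaps: the contents of `<script>` and `<style>` are not filtered out, character entities are decoded and not re-escaped, and attributes such as `href`, `alt`, or `title` are ignored, so real pages will need more work to get near the 95% target.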