Html 2 Text - Remove "hidden" Text
I am currently looking for ways to read the visible text of a website and store it in a plain-text string using Java.
In other words, I'd like to convert something like this:
Hello <span style="display: none">stupid</span> World
into "Hello World"
or something like
<span>Un</span>friendly
into "Unfriendly" (and not something like "Un friendly")
or
Hello
World
into "Hello World" (as new lines are ignored in HTML)
Do you know of any lib capable of assisting in this task?
Cheers,
Matthias
Boilerpipe is a Java library that strips boilerplate (navigation, templates and similar clutter) from an HTML page and extracts its main text content.
Have a look at Cobra, the Java HTML parser and renderer from the Lobo project, to see whether its API offers a way to render the HTML and convert the result into plain text.
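If a library turns out to be more than you need, the three cases from the question can be illustrated with a small hand-rolled sketch that uses only the JDK. This is not a real HTML parser, just a rough illustration under some loud assumptions: the markup is reasonably well formed, "hidden" means an inline style attribute containing display: none (external stylesheets, visibility: hidden and the like are ignored), and comments, script/style contents and entities such as &amp;amp; are not handled. The names VisibleText and extract are made up for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library mentioned in this thread.
final class VisibleText {
    // One tag: group 1 is "/" for a closing tag, group 2 the tag name, group 3 the raw attributes.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>");

    // Assumption: only an inline style with "display: none" counts as hidden.
    private static final Pattern HIDDEN = Pattern.compile(
            "style\\s*=\\s*[\"'][^\"']*display\\s*:\\s*none", Pattern.CASE_INSENSITIVE);

    // A few void elements that never get a closing tag.
    private static final Set<String> VOID = Set.of("br", "hr", "img", "input", "meta", "link");

    static String extract(String html) {
        StringBuilder out = new StringBuilder();
        Deque<Boolean> openElements = new ArrayDeque<>(); // true = this element is hidden
        int hiddenDepth = 0;                              // number of open hidden ancestors
        int pos = 0;

        Matcher m = TAG.matcher(html);
        while (m.find()) {
            // Keep the text before this tag only if no enclosing element is hidden.
            if (hiddenDepth == 0) {
                out.append(html, pos, m.start());
            }
            pos = m.end();

            boolean closing = !m.group(1).isEmpty();
            String name = m.group(2).toLowerCase();
            String attrs = m.group(3);

            if (closing) {
                if (!openElements.isEmpty() && openElements.pop()) {
                    hiddenDepth--;
                }
            } else if (!attrs.endsWith("/") && !VOID.contains(name)) {
                boolean hidden = HIDDEN.matcher(attrs).find();
                openElements.push(hidden);
                if (hidden) {
                    hiddenDepth++;
                }
            }
        }
        if (hiddenDepth == 0) {
            out.append(html, pos, html.length());
        }
        // Tags add no spaces of their own; whitespace runs (including newlines) collapse to one space.
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // prints "Hello World"
        System.out.println(extract("Hello <span style=\"display: none\">stupid</span> World"));
        // prints "Unfriendly"
        System.out.println(extract("<span>Un</span>friendly"));
        // prints "Hello World"
        System.out.println(extract("Hello\nWorld"));
    }
}
```

Because removing a tag never inserts a space, "Un" and "friendly" stay joined, and the final whitespace collapse turns the newline into a single space. For real pages you would still want a proper parser (and, for stylesheet-based hiding, something that actually evaluates CSS) in place of the regex.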