HTML page to XHTML with TagSoup
Sorry if this is too simple, but I couldn't find a tutorial or any documentation for the Java version of TagSoup.
Basically I want to download an HTML webpage from the internet and turn it into XHTML, contained in a string. How can I do this with TagSoup?
Thanks!
Something like this:
wget -O - example.com/bad.html | java -jar tagsoup.jar
Or, from Java:
To parse HTML:
- Create an instance of org.ccil.cowan.tagsoup.Parser
- Provide your own SAX2 ContentHandler
- Provide an InputSource referring to the HTML
- And parse()!
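One way to get the XHTML into a string without writing a ContentHandler by hand is to let the JDK's identity Transformer act as the handler. The following is a minimal sketch under that assumption, not the only way to do it: the class name XhtmlSerializer and the method serialize are made up for illustration, and main uses the JDK's built-in XML parser as a stand-in so the sketch runs without TagSoup on the classpath.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper: the names are illustrative, not part of TagSoup.
class XhtmlSerializer {

    /** Runs the given SAX reader over the input and returns the serialized markup. */
    static String serialize(XMLReader reader, InputSource input) throws Exception {
        // A Transformer created without a stylesheet copies its input to its output.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.METHOD, "xml");
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        // The transformer installs its own ContentHandler on the reader and calls parse().
        identity.transform(new SAXSource(reader, input), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader from the JDK; it only accepts well-formed XML.
        // With TagSoup available, use: XMLReader reader = new org.ccil.cowan.tagsoup.Parser();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        InputSource input = new InputSource(new StringReader("<p>hello</p>"));
        System.out.println(serialize(reader, input));
    }
}
```

Because org.ccil.cowan.tagsoup.Parser implements XMLReader, it can be passed to serialize in place of the JDK reader, and the InputSource can wrap the stream of the downloaded page instead of a StringReader.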
The code below pulls down a web page with Apache HttpClient and parses it with TagSoup, building a XOM document:
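import java.io.IOException;
import java.io.InputStream;
import nu.xom.Builder;
import nu.xom.Document;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.StatusLine;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

HttpClient client = new DefaultHttpClient();
HttpGet request = new HttpGet("http://streak.espn.go.com/en/?date=20120824");
HttpResponse response = client.execute(request);
// Check if server response is valid
StatusLine status = response.getStatusLine();
if (status.getStatusCode() != 200) {
    throw new IOException("Invalid response from server: " + status.toString());
}
// Pull content stream from response
HttpEntity entity = response.getEntity();
InputStream inputStream = entity.getContent();
try
{
    XMLReader parser = XMLReaderFactory.createXMLReader("org.ccil.cowan.tagsoup.Parser");
    // Use the TagSoup parser to build an XOM document from the HTML stream.
    // build(String) would treat its argument as a URL, so pass the stream itself.
    Document doc = new Builder(parser).build(inputStream);
    // toXML() returns the serialized XHTML; toString() only gives a short
    // debugging description of the node, not the markup.
    String xhtml = doc.toXML();
}
// createXMLReader() and build() also throw the checked exceptions
// SAXException and ParsingException, which need handling as well.
catch(IOException e)
{ ... }
finally
{
    inputStream.close();
}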
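If pulling in Apache HttpClient is not desirable, the download step can probably be done with the JDK alone. This is a rough sketch under that assumption: PageSource and open are hypothetical names, and the status check simply mirrors the one above. It returns an InputSource that a SAX parser such as TagSoup's can read from.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import org.xml.sax.InputSource;

// Hypothetical helper: the names are illustrative only.
class PageSource {

    /** Opens the URL and wraps its body in an InputSource for a SAX parser. */
    static InputSource open(URL url) throws IOException {
        URLConnection connection = url.openConnection();
        // Only HTTP(S) connections carry a status code; other schemes skip the check.
        if (connection instanceof HttpURLConnection) {
            int code = ((HttpURLConnection) connection).getResponseCode();
            if (code != HttpURLConnection.HTTP_OK) {
                throw new IOException("Invalid response from server: " + code);
            }
        }
        InputSource source = new InputSource(connection.getInputStream());
        // The system ID lets the parser resolve relative references and report locations.
        source.setSystemId(url.toString());
        return source;
    }

    public static void main(String[] args) throws IOException {
        // Expects the page address as the first command-line argument.
        InputSource source = open(URI.create(args[0]).toURL());
        System.out.println("Opened " + source.getSystemId());
    }
}
```

The caller is responsible for closing the stream returned by getByteStream() once parsing is finished.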