How to get the first text inside HTML content?
Hi all, I have HTML text something like this:
<html><head><style type="text/css">
</style></head>
<body><div style="font-family:times new roman,new york,times,serif;font-size:14pt">first text<br><div><br></div><div style="font-family: times new roman,new york,times,serif; font-size: 14pt;"><br><div style="font-family: times new roman,new york,times,serif; font-size: 12pt;"><font size="2" face="Tahoma"><hr size="1"><b><span style="font-weight: bold;">one:</span></b> second text<br><b><span style="font-weight: bold;">two:</span></b> third text<br><b><span style="font-weight: bold;">three:</span></b> fourth text<br><b><span style="font-weight: bold;">five:</span></b> fifth text<br></font><br>
and I want to extract the text "first text" from the HTML content above. Note: this HTML content is not static, it's dynamic, so the general idea is to get the first plain text in an HTML document.
You tagged jsoup, so you're using Jsoup. That's already a good choice ;)
Here's how you could do it with Jsoup:
String html = "<html><head><style type=\"text/css\"></style></head><body><div style=\"font-family:times new roman,new york,times,serif;font-size:14pt\">first text<br><div><br></div><div style=\"font-family: times new roman,new york,times,serif; font-size: 14pt;\"><br><div style=\"font-family: times new roman,new york,times,serif; font-size: 12pt;\"><font size=\"2\" face=\"Tahoma\"><hr size=\"1\"><b><span style=\"font-weight: bold;\">one:</span></b> second text<br><b><span style=\"font-weight: bold;\">two:</span></b> third text<br><b><span style=\"font-weight: bold;\">three:</span></b> fourth text<br><b><span style=\"font-weight: bold;\">five:</span></b> fifth text<br></font><br>";
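Document document = Jsoup.parse(html);
// :matchesOwn(\\S) selects elements whose own text contains a non-whitespace character,
// so it works for dynamic content and does not rely on the word "text" being present.
// first() returns null if the document contains no text at all.
String firstText = document.select(":matchesOwn(\\S)").first().ownText();
System.out.println(firstText);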
Result:
first text
See also:
- Jsoup CSS selector syntax
You can use a SAX-style HTML parser, like TagSoup.
To do this, initialize the parser with a handler that extends DefaultHandler, detect the first time the characters(...) method is called with non-whitespace content, and save the result.
See http://sax.sourceforge.net/quickstart.html for some direction on how to set up the parser.
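For illustration, here is a minimal sketch of such a handler. It is not code from this answer: the class and method names (FirstTextSax, FirstTextHandler, firstText) are made up for the example, and it uses the JDK's built-in XML SAX parser, which only accepts well-formed markup. For real-world HTML you would presumably plug the same handler into TagSoup's parser instead. The sketch also assumes you want to skip the contents of style and script elements, and it buffers characters(...) calls because SAX may deliver one text run in several pieces.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class FirstTextSax {

    /** Keeps the first non-blank run of character data outside style/script. */
    static final class FirstTextHandler extends DefaultHandler {
        private final StringBuilder buffer = new StringBuilder();
        private boolean done = false;   // true once the first text run is complete
        private int skipDepth = 0;      // > 0 while inside <style> or <script>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            closeRun();
            if (isSkipped(qName)) skipDepth++;
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            closeRun();
            if (isSkipped(qName)) skipDepth--;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times for one text run, so accumulate.
            if (!done && skipDepth == 0) buffer.append(ch, start, length);
        }

        /** A tag boundary ends the current run; blank runs are discarded. */
        private void closeRun() {
            if (done) return;
            if (buffer.toString().trim().isEmpty()) buffer.setLength(0);
            else done = true;
        }

        private static boolean isSkipped(String qName) {
            return qName.equalsIgnoreCase("style") || qName.equalsIgnoreCase("script");
        }

        String result() {
            return buffer.toString().trim();
        }
    }

    /** Returns the first plain text in well-formed markup, or "" if there is none. */
    static String firstText(String markup) {
        FirstTextHandler handler = new FirstTextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
        return handler.result();
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><style type=\"text/css\">p { margin: 0; }</style></head>"
                + "<body><div>first text<br/><div><b>one:</b> second text</div></div></body></html>";
        System.out.println(firstText(xhtml)); // prints "first text"
    }
}
```

The handler keeps parsing after it has found the text. If that matters for large inputs, one common trick is to throw a SAXException from the handler to abort early and catch it in the caller.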
If you want something fairly simple, look at my PageScraper class, which was designed for use on Java ME platforms and so will work pretty much anywhere. It's nothing fancy, but it's an easy way to transform a text stream into tags and non-tags. It loads attributes lazily, so it's pretty quick to use if you're basically ignoring tags.