How to get the first text inside HTML content?

Hi all, I have HTML text something like:

<html><head><style type="text/css">
</style></head>
<body><div style="font-family:times new roman,new york,times,serif;font-size:14pt">first text<br><div><br></div><div style="font-family: times new roman,new york,times,serif; font-size: 14pt;"><br><div style="font-family: times new roman,new york,times,serif; font-size: 12pt;"><font size="2" face="Tahoma"><hr size="1"><b><span style="font-weight: bold;">one:</span></b> second text<br><b><span style="font-weight: bold;">two:</span></b> third text<br><b><span style="font-weight: bold;">three:</span></b> fourth text<br><b><span style="font-weight: bold;">five:</span></b> fifth text<br></font><br>

I want to extract the text "first text" from the HTML content above. Note: this HTML content is not static, it's dynamic, so the general idea is to get the first plain text in an HTML document.


You tagged jsoup, so you're using Jsoup. That's already a good choice ;)

Here's how you could do it with Jsoup (the snippet needs imports of org.jsoup.Jsoup and org.jsoup.nodes.Document):

String html = "<html><head><style type=\"text/css\"></style></head><body><div style=\"font-family:times new roman,new york,times,serif;font-size:14pt\">first text<br><div><br></div><div style=\"font-family: times new roman,new york,times,serif; font-size: 14pt;\"><br><div style=\"font-family: times new roman,new york,times,serif; font-size: 12pt;\"><font size=\"2\" face=\"Tahoma\"><hr size=\"1\"><b><span style=\"font-weight: bold;\">one:</span></b> second text<br><b><span style=\"font-weight: bold;\">two:</span></b> third text<br><b><span style=\"font-weight: bold;\">three:</span></b> fourth text<br><b><span style=\"font-weight: bold;\">five:</span></b> fifth text<br></font><br>";
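Document document = Jsoup.parse(html);
// :containsOwn(text) matches elements whose own text contains the substring "text" (case-insensitive).
// first() takes the first match in document order; ownText() returns only that element's direct text.
String firstText = document.select(":containsOwn(text)").first().ownText();
System.out.println(firstText);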

Result:

first text

See also:

  • Jsoup CSS selector syntax


You can use a SAX-style HTML parser, like TagSoup.

To do this, initialize the parser with a handler that extends DefaultHandler, detect the first time the characters(...) method is called, and save the result.

See http://sax.sourceforge.net/quickstart.html for some direction on how to set up the parser.
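
As a rough illustration of the handler idea, here is a minimal sketch. It is not TagSoup-specific: it uses the JDK's built-in SAX parser, which only accepts well-formed XHTML, so for real-world HTML you would presumably swap in TagSoup's XMLReader instead. The names FirstTextSax and FirstTextHandler are made up for this example. Because SAX may deliver one text run in several characters(...) calls, the sketch buffers text until the next tag, and it skips style and script content (an assumption about what "plain text" should mean).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class FirstTextSax {

    /** Hypothetical handler: remembers the first non-blank run of character data. */
    static final class FirstTextHandler extends DefaultHandler {
        private final StringBuilder buffer = new StringBuilder();
        private int skipDepth = 0;   // > 0 while inside <style> or <script>
        private String firstText;    // stays null until a non-blank run is seen

        String firstText() {
            return firstText;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            flush();
            if (isSkipped(qName)) skipDepth++;
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            flush();
            if (isSkipped(qName)) skipDepth--;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // One text run may arrive in several chunks, so only buffer here.
            if (firstText == null && skipDepth == 0) {
                buffer.append(ch, start, length);
            }
        }

        @Override
        public void endDocument() {
            flush();
        }

        // A tag boundary ends the current text run; keep it if it is the first non-blank one.
        private void flush() {
            String text = buffer.toString().trim();
            buffer.setLength(0);
            if (firstText == null && !text.isEmpty()) {
                firstText = text;
            }
        }

        private static boolean isSkipped(String qName) {
            return "style".equalsIgnoreCase(qName) || "script".equalsIgnoreCase(qName);
        }
    }

    /** Returns the first non-blank text in the given well-formed XHTML, or null if there is none. */
    static String firstText(String xhtml) {
        try {
            // Assumption: the JDK parser stands in for an HTML-tolerant XMLReader such as TagSoup's.
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            FirstTextHandler handler = new FirstTextHandler();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xhtml)));
            return handler.firstText();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><style type=\"text/css\">p{}</style></head>"
                + "<body><div>first text<br/><div><b>one:</b> second text</div></div></body></html>";
        System.out.println(firstText(xhtml)); // prints: first text
    }
}
```

The sketch keeps parsing after the first text is found; for large documents you could stop early by throwing a SAXException from the handler and catching it in the caller.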


If you want something fairly simple, look at my PageScraper class, which was designed for use on Java ME platforms and so will work pretty much anywhere. It's nothing fancy, but it's an easy way to transform a text stream into tags and non-tags. It does lazy loading of attributes, so it's pretty quick to use if you're basically ignoring tags.
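
To give an idea of what splitting a stream into tags and non-tags can look like, here is a small standalone sketch. It is not the PageScraper API (FirstTextScanner is a name invented for this example), and it makes simplifying assumptions: every '<' starts a tag, every '>' ends one, and comments, script/style bodies, and character entities are not treated specially.

```java
final class FirstTextScanner {

    /** Returns the first non-blank run of text found outside of tags, or null if there is none. */
    static String firstText(String html) {
        StringBuilder text = new StringBuilder();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {
                // Ignore everything inside a tag, including its attributes.
                if (c == '>') inTag = false;
            } else if (c == '<') {
                // A tag starts: the text run collected so far is complete.
                String run = text.toString().trim();
                if (!run.isEmpty()) return run;
                text.setLength(0);
                inTag = true;
            } else {
                text.append(c);
            }
        }
        // Text after the last tag, if any.
        String tail = text.toString().trim();
        return tail.isEmpty() ? null : tail;
    }

    public static void main(String[] args) {
        String html = "<html><head><style type=\"text/css\">\n</style></head>\n"
                + "<body><div style=\"font-size:14pt\">first text<br><div><br></div>";
        System.out.println(firstText(html)); // prints: first text
    }
}
```

Because it never builds a tree, this kind of scan tolerates unclosed tags like the ones in the question, at the cost of being easy to fool with unusual markup.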
