Strip HTML tags from Scala String
I am developing a web app using Scala and the Lift framework. I have a record in the DB that contains the HTML perex of a page:
<b>Hi all, this is perex</b>
In one scenario I need to show this perex to the user, but without the HTML tags:
Hi all, this is perex
Is it possible to do this in Scala? I tried searching with Google, but with no success.
Thanks for all replies.
If the string is valid XML then you can use:
scala.xml.XML.loadString("<b>Hi all, this is perex</b>").text
If it's not valid XML, then you can use scala.util.matching.Regex or an HTML parsing library like http://jsoup.org/
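As a rough illustration of the regex route, here is a minimal sketch that uses only the standard library. The object name StripTags and the pattern are assumptions made for this example, not something prescribed by the answer. The pattern simply removes anything that looks like a tag, so it does not decode entities such as &amp;, and it can be fooled by a '>' inside an attribute value, by comments, or by script content.

```scala
import scala.util.matching.Regex

object StripTags {
  // Hypothetical pattern: '<', then any run of characters other than '>', then '>'.
  private val Tag: Regex = "<[^>]*>".r

  // Removes every substring that matches the tag pattern.
  def stripTags(html: String): String =
    Tag.replaceAllIn(html, "")

  def main(args: Array[String]): Unit =
    println(stripTags("<b>Hi all, this is perex</b>")) // prints: Hi all, this is perex
}
```

For anything beyond simple, trusted fragments, a real HTML parser is likely the safer choice.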
The best solution I've found was to use cyberneko to parse your string and do some "clever" SAX event handling.
cyberneko will parse your HTML even if it's invalid, which is the case for the vast majority of the HTML you're likely to encounter in the wild.
If you register a custom ContentHandler that essentially ignores all but the character events and just appends those to a string builder, you'll get a good first approximation, with one annoying flaw: words separated by a block element will end up concatenated (for<br/>example => forexample).
A better solution is to get a list of all block elements and have your ContentHandler listen to startElement events. If the element is a block one, just append a space character to your string builder.
Note that while this seems to work fine, it might not be perfect for your use case. <br/> is not, for example, turned into a line break. It shouldn't be too much work to add this if it's required, though.
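The handler described above could look roughly like the sketch below. The names TextExtractor and TextExtractorDemo and the short element list are assumptions for illustration only (br is included because it is the example used above). To keep the sketch self-contained it is driven by the JDK's own SAX parser, which only accepts well-formed input; for real-world HTML the same handler would be registered with cyberneko's SAX parser instead. Element names are lower-cased before the lookup because an HTML parser may report them in upper case.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.{Attributes, InputSource}
import org.xml.sax.helpers.DefaultHandler

// Collects character data and inserts a space wherever a listed element starts.
class TextExtractor extends DefaultHandler {
  // Illustrative subset of elements treated as word separators, not an exhaustive list.
  private val blockElements = Set("br", "p", "div", "li", "tr", "td", "h1", "h2", "h3")
  private val text = new java.lang.StringBuilder

  override def startElement(uri: String, localName: String, qName: String, attrs: Attributes): Unit = {
    if (blockElements.contains(qName.toLowerCase)) text.append(' ')
    ()
  }

  override def characters(ch: Array[Char], start: Int, length: Int): Unit = {
    text.append(ch, start, length)
    ()
  }

  // Accumulated text with leading and trailing whitespace removed.
  def result: String = text.toString.trim
}

object TextExtractorDemo {
  // Parses well-formed markup with the JDK SAX parser and returns its text content.
  def extract(markup: String): String = {
    val handler = new TextExtractor
    val parser = SAXParserFactory.newInstance().newSAXParser()
    parser.parse(new InputSource(new StringReader(markup)), handler)
    handler.result
  }

  def main(args: Array[String]): Unit =
    println(extract("<p>for<br/>example</p>")) // prints: for example
}
```

Turning <br/> into a line break would be a matter of appending '\n' instead of a space for that element.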
TagSoup should meet your requirement of parsing real-world HTML.
sbt dependency:
libraryDependencies += "org.ccil.cowan.tagsoup" % "tagsoup" % "1.2.1"
Sample code:
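import org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
import scala.xml.{Elem, XML}
import scala.xml.factory.XMLLoader

object TagSoupXmlLoader {
  // TagSoup's SAX parser factory tolerates malformed, real-world HTML
  private val factory = new SAXFactoryImpl()

  // Returns a scala.xml loader backed by a fresh TagSoup parser
  def get(): XMLLoader[Elem] =
    XML.withSAXParser(factory.newSAXParser())
}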
Usage:
val root = TagSoupXmlLoader.get().load("http://www.google.com")
println(root)      // the parsed document as XML
println(root.text) // the text content only, without the tags