Unescaping HTML entities (including named ones)
This question is similar to the Remove html character entities in a string question asked earlier on Stack Overflow. The accepted answer, however, does not address the issue of named HTML entities, e.g. ä
for the character ä
; It therefore cannot unescape all HTML.
I have some legacy HTML which uses named HTML entities for non-ASCII characters. That is, ö
instead of ö
, ä
instead of ä
and so on. A full list of all named HTML entities is available on Wikipedia.
I'd like to unescape these HTML entities into their character equivalents, in a quick and efficient manner.
I have the code to do this in Python 3, using regular expressions:
import re
import html.entities
s = re.sub(r'&(\w+?);', lambda m: chr(html.entities.name2codepoint[m.group(1)]), s)
Regular expressions, however, don't seem very popular, fast or easy to use in Haskell.
Text.HTML.TagSoup.Entity
(tagsoup) has a useful table and functions for mapping named entities tpo codepoints. Using this, and the regex-tdfa package, I've fashioned an extremely slow equivalent in Haskell:
{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString.Lazy.Char8 as L
import Data.ByteString.Lazy.UTF8 as UTF8
import Text.HTML.TagSoup.Entity (lookupEntity)
import Text.Regex.TDFA ((=~~))
unescapeEntites :: L.ByteString -> L.ByteString
unescapeEntites = regexReplaceBy "&#?[[:alnum:]]+;" $ lookupMatch
where
lookupMatch m =
case lookupEntity (L.unpack . L.tail . L.init $ m) of
Nothing -> m
Just x -> UTF8.fromString [x]
-- regex replace taken from http://mutelight.org/articles/g开发者_StackOverflow社区enerating-a-permalink-slug-in-haskell
regexReplaceBy :: L.ByteString -> (L.ByteString -> L.ByteString) -> L.ByteString -> L.ByteString
regexReplaceBy regex f text = go text []
where
go str res =
if L.null str
then L.concat . reverse $ res
else
case (str =~~ regex) :: Maybe (L.ByteString, L.ByteString, L.ByteString) of
Nothing -> L.concat . reverse $ (str : res)
Just (bef, match , aft) -> go aft (f match : bef : res)
The unescapeEntities
function runs several orders of magnitude slower than the Python version above. The Python code can convert about 130 MB in 7 seconds, whereas my Haskell version runs for several minutes.
I'm looking for a better solution, primarily in terms of speed. But I'd also like to avoid regular expressions, if possible (speed and avoiding regular expressions seem to go hand in hand in Haskell anyway).
Here's my version. It uses String (instead of ByteString).
import Text.HTML.TagSoup.Entity (lookupEntity)
unescapeEntities :: String -> String
unescapeEntities [] = []
unescapeEntities ('&':xs) =
let (b, a) = break (== ';') xs in
case (lookupEntity b, a) of
(Just c, ';':as) -> c : unescapeEntities as
_ -> '&' : unescapeEntities xs
unescapeEntities (x:xs) = x : unescapeEntities xs
I would guess it's faster because it doesn't use the expensive regex operations. I haven't tested it. You could adapt it for ByteString or for Data.Text if you need it faster.
You could install the web-encodings package, take the Sourcecode of the decodeHtml function and add the characters you need (works for me). This is all you need:
import Data.Maybe
import qualified Web.Encodings.StringLike as SL
import Web.Encodings.StringLike (StringLike)
import Data.Char (ord)
-- | Decode HTML-encoded content into plain content.
--
-- Note: this does not support all HTML entities available. It also swallows
-- all failures.
decodeHtml :: StringLike s => s -> s
decodeHtml s = case SL.uncons s of
Nothing -> SL.empty
Just ('&', xs) -> fromMaybe ('&' `SL.cons` decodeHtml xs) $ do
(before, after) <- SL.breakCharMaybe ';' xs
c <- case SL.unpack before of -- this are small enough that unpack is ok
"lt" -> return '<'
"gt" -> return '>'
"amp" -> return '&'
"quot" -> return '"'
'#' : 'x' : hex -> readHexChar hex
'#' : 'X' : hex -> readHexChar hex
'#' : dec -> readDecChar dec
_ -> Nothing -- just to shut up a warning
return $ c `SL.cons` decodeHtml after
Just (x, xs) -> x `SL.cons` decodeHtml xs
readHexChar :: String -> Maybe Char
readHexChar s = helper 0 s where
helper i "" = return $ toEnum i
helper i (c:cs) = do
c' <- hexVal c
helper (i * 16 + c') cs
hexVal :: Char -> Maybe Int
hexVal c
| '0' <= c && c <= '9' = Just $ ord c - ord '0'
| 'A' <= c && c <= 'F' = Just $ ord c - ord 'A' + 10
| 'a' <= c && c <= 'f' = Just $ ord c - ord 'a' + 10
| otherwise = Nothing
readDecChar :: String -> Maybe Char
readDecChar s = do
case reads s of
(i, _):_ -> Just $ toEnum (i :: Int)
_ -> Nothing
I did not test performance though. But it might be a nice sample if you this can be done without regexps as well.
精彩评论