Unescaping HTML entities (including named ones)

2023-03-23 01:46 问答作者：

This question is similar to the Remove html character entities in a string question asked earlier on Stack Overflow. The accepted answer, however, does not address the issue of named HTML entities, e.g. ä for the character ä; It therefore cannot unescape all HTML.

I have some legacy HTML which uses named HTML entities for non-ASCII characters. That is, ö instead of ö, ä instead of ä and so on. A full list of all named HTML entities is available on Wikipedia.

I'd like to unescape these HTML entities into their character equivalents, in a quick and efficient manner.

I have the code to do this in Python 3, using regular expressions:

import re
import html.entities

s = re.sub(r'&(\w+?);', lambda m: chr(html.entities.name2codepoint[m.group(1)]), s)

Regular expressions, however, don't seem very popular, fast or easy to use in Haskell.

Text.HTML.TagSoup.Entity (tagsoup) has a useful table and functions for mapping named entities tpo codepoints. Using this, and the regex-tdfa package, I've fashioned an extremely slow equivalent in Haskell:

{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString.Lazy.Char8 as L
import Data.ByteString.Lazy.UTF8 as UTF8
import Text.HTML.TagSoup.Entity (lookupEntity)
import Text.Regex.TDFA ((=~~))

unescapeEntites :: L.ByteString -> L.ByteString
unescapeEntites = regexReplaceBy "&#?[[:alnum:]]+;" $ lookupMatch
 where
  lookupMatch m =
    case lookupEntity (L.unpack . L.tail . L.init $ m) of
      Nothing -> m
      Just x -> UTF8.fromString [x]

-- regex replace taken from http://mutelight.org/articles/g开发者_StackOverflow社区enerating-a-permalink-slug-in-haskell
regexReplaceBy :: L.ByteString -> (L.ByteString -> L.ByteString) -> L.ByteString -> L.ByteString
regexReplaceBy regex f text = go text []
 where
  go str res =
    if L.null str
      then L.concat . reverse $ res
      else
        case (str =~~ regex) :: Maybe (L.ByteString, L.ByteString, L.ByteString) of
          Nothing -> L.concat . reverse $ (str : res)
          Just (bef, match , aft) -> go aft (f match : bef : res)

The unescapeEntities function runs several orders of magnitude slower than the Python version above. The Python code can convert about 130 MB in 7 seconds, whereas my Haskell version runs for several minutes.

I'm looking for a better solution, primarily in terms of speed. But I'd also like to avoid regular expressions, if possible (speed and avoiding regular expressions seem to go hand in hand in Haskell anyway).

Here's my version. It uses String (instead of ByteString).

import Text.HTML.TagSoup.Entity (lookupEntity)

unescapeEntities :: String -> String
unescapeEntities [] = []
unescapeEntities ('&':xs) = 
  let (b, a) = break (== ';') xs in
  case (lookupEntity b, a) of
    (Just c, ';':as) ->  c  : unescapeEntities as    
    _                -> '&' : unescapeEntities xs
unescapeEntities (x:xs) = x : unescapeEntities xs

I would guess it's faster because it doesn't use the expensive regex operations. I haven't tested it. You could adapt it for ByteString or for Data.Text if you need it faster.

You could install the web-encodings package, take the Sourcecode of the decodeHtml function and add the characters you need (works for me). This is all you need:

import Data.Maybe
import qualified Web.Encodings.StringLike as SL
import Web.Encodings.StringLike (StringLike)
import Data.Char (ord)

-- | Decode HTML-encoded content into plain content.
--
-- Note: this does not support all HTML entities available. It also swallows
-- all failures.
decodeHtml :: StringLike s => s -> s
decodeHtml s = case SL.uncons s of
    Nothing -> SL.empty
    Just ('&', xs) -> fromMaybe ('&' `SL.cons` decodeHtml xs) $ do
        (before, after) <- SL.breakCharMaybe ';' xs
        c <- case SL.unpack before of -- this are small enough that unpack is ok
            "lt" -> return '<'
            "gt" -> return '>'
            "amp" -> return '&'
            "quot" -> return '"'
            '#' : 'x' : hex -> readHexChar hex
            '#' : 'X' : hex -> readHexChar hex
            '#' : dec -> readDecChar dec
            _ -> Nothing -- just to shut up a warning
        return $ c `SL.cons` decodeHtml after
    Just (x, xs) -> x `SL.cons` decodeHtml xs

readHexChar :: String -> Maybe Char
readHexChar s = helper 0 s where
    helper i "" = return $ toEnum i
    helper i (c:cs) = do
        c' <- hexVal c
        helper (i * 16 + c') cs

hexVal :: Char -> Maybe Int
hexVal c
    | '0' <= c && c <= '9' = Just $ ord c - ord '0'
    | 'A' <= c && c <= 'F' = Just $ ord c - ord 'A' + 10
    | 'a' <= c && c <= 'f' = Just $ ord c - ord 'a' + 10
    | otherwise = Nothing

readDecChar :: String -> Maybe Char
readDecChar s = do
    case reads s of
        (i, _):_ -> Just $ toEnum (i :: Int)
        _ -> Nothing

I did not test performance though. But it might be a nice sample if you this can be done without regexps as well.

继续阅读：haskell string

Unescaping HTML entities (including named ones)

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？