What strategies are there for escaping character entities?

2022-12-13 21:11 问答作者：

We are doing Natural Language Processing on a range of English language documents (mainly scientific) and run into problems in carrying non-ANSI characters through the various components. The documents may be "ASCII", UNICODE, PDF, or HTML. We cannot predict at this stage what tools will be in our chain or whether they will allow character encodings other than ANSI. Even ISO-Latin characters expressed in UNICODE will give problems (e.g. displaying incorrectly in browsers). We are likely to encounter a range of symbols including mathematical and Greek. We would like to "flatten" these into a text string which will survive multistep processing (including XML and regex tools) and then possibly reconstitute it in the last step (although it is the semantics rather than the typography we are concerned with so this is a minor concern).

I appreciate that there is no absolute answer - any escapi开发者_如何学运维ng can clash in some cases - but I am looking for something allong the lines of XML's <![CDATA[ ...]]> which will survive most non-recursive XML operations. Characters such as [ are bad as they are common in regexes. So I'm wondering if there is a generally adopted approach rather than inventing our own.

A typical example is the "degrees" symbol:

HTML Entity (decimal)   &#176;
HTML Entity (hex)   &#xb0;
HTML Entity (named)     &deg;
How to type in Microsoft Windows    Alt +00B0
Alt 0176
Alt 248
UTF-8 (hex)     0xC2 0xB0 (c2b0)
UTF-8 (binary)  11000010:10110000
UTF-16 (hex)    0x00B0 (00b0)
UTF-16 (decimal)    176
UTF-32 (hex)    0x000000B0 (00b0)
UTF-32 (decimal)    176
C/C++/Java source code  "\u00B0"
Python source code  u"\u00B0"

We are also likely to encounter TeX

$10\,^{\circ}{\rm C}$

\degree

so backslashes, curlies and dollars are a poor idea.

We could for example use markup like:

__deg__
__#176__

and this will probably work but I'd appreciate advice from those who have similar problems.

update I accept @MichaelB's insistence that we use UTF-8 throughout. I am worried that some of our tools may not conform and if so I'll revisit this. Note that my original question is not well worded - read his answer and the link in it.

Get someone to do this who really understands character encodings. It looks like you don't, because you're not using the terminology correctly. Alternatively, read this.
Do not brew up your own escape scheme - it will cause you more problems than it will solve. Instead, normalize the various source encodings to UTF-8 (which is really just one such escape scheme, except efficient and standardized) and handle character encodings correctly. Perhaps use UTF-7 if you're really that scared of high bits.
In this day and age, not handling character encodings correctly is not acceptable. If a tool doesn't, abandon it - it is most likely very bad quality code in many other ways as well and not worth the hassle using.

Maybe I don't get the problem correctly, but I would create a very unique escape marker which is unlikely to be touched, and then use it to enclose the entity encoded as a base32 string.

Eventually, you can transmit the unique markers and their number along the chain through a separate channel, and check their presence and number at the end.

Example, something like

the value of the temperature was 18 cd48d8c50d7f40aeb6a164181b17feee EZSGKZY= cd48d8c50d7f40aeb6a164181b17feee

your marker is a uuid, and the entity is &deg encoded in base32. You then pass along the marker cd48d8c50d7f40aeb6a164181b17feee. It cannot be corrupted (if it gets corrupted, your filters will probably corrupt anything made of letters and numbers anyway, but at least you can exclude them because they are fixed length) and you can always recover the content by looking inside the two markers.

Of course, if you have uuids in your documents, this could represent a problem, but since you are not transmitting them as authorized markers along the lateral channel, they won't be recognized as such (and in any case, what's inbetween won't validate as a base32 string anyway).

If you need to search for them, then you can keep the uuid subdivision, and then use a proper regexp to spot these occurrences. Example:

>>> re.search("(\w{8}-\w{4}-\w{4}-\w{4}-\w{12})(.*?)(\\1)", s)
<_sre.SRE_Match object at 0x1003d31f8>
>>> _.groups()
('6d378205-1265-44e4-80b8-a47d1ceaad51', ' EZSGKZY= ', '6d378205-1265-44e4-80b8-a47d1ceaad51')
>>>

If you really need a specific "token" to test, you can use a uuid1, with a very defined specification of a node:

>>> uuid.uuid1(node=0x1234567890)  
UUID('bdcce554-e95d-11de-bd0f-001234567890')
>>> uuid.uuid1(node=0x1234567890)  
UUID('c4c57a91-e95d-11de-90ca-001234567890')
>>>

You can use anything you prefer as a node, the uuid will be unique, but you can still test for presence (although you can get false positives).

继续阅读：character-encoding escaping

What strategies are there for escaping character entities?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？