saxutils.escape() escapes semicolon twice
I'm trying to escape a semicolon with the saxutils.escape method.
saxutils.escape('<;', {';': '&#59;'})
I expect it to produce
'&lt;&#59;'
But it gives
'&lt&#59;&#59;'
Is this by design? And how can I get my expected result?
Your problem is that saxutils.escape works in two steps. First, it escapes <, >, and &, then it applies the entities dictionary to the result of that first pass. So once < has been replaced by &lt;, you've got &lt;; and the second pass replaces both semicolons, so you end up with &lt&#59;&#59;.
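The two passes described above can be reproduced with a small sketch. The helper two_pass_escape is hypothetical and only imitates the described behaviour; it is not the library source. The replacement string "&#59;" is also an assumption, and any entity ending in a semicolon behaves the same way.

```python
from xml.sax.saxutils import escape

# Assumed replacement string for ";" (illustrative only).
SEMI = "&#59;"

def two_pass_escape(text, entities):
    """Hypothetical imitation of the two passes, not the library source."""
    # Pass 1: the three built-in replacements; "&" has to go first.
    text = text.replace("&", "&amp;")
    text = text.replace(">", "&gt;")
    text = text.replace("<", "&lt;")
    # Pass 2: user-supplied entities, applied to the output of pass 1,
    # so the ";" that pass 1 just produced gets replaced as well.
    for char, replacement in entities.items():
        text = text.replace(char, replacement)
    return text

result = escape("<;", {";": SEMI})
imitation = two_pass_escape("<;", {";": SEMI})
print(result)     # &lt&#59;&#59;
print(imitation)  # &lt&#59;&#59;
```

Both calls print the doubled form because the semicolon that closes &lt; is already in the string when the entities dictionary is applied.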
Basically, what it's doing makes sense. If you need to escape semicolons, it's not because of HTML reasons, so it must be to double-escape them. In this situation, it makes sense to escape the semicolons created by HTML-required escaping.
You can't get your desired result with saxutils.escape. You need to use another method of escaping. See the Python Wiki page on escaping HTML for some ideas.
You can also use something like what is in my answer to What is the best way to do a find and replace of multiple queries on multiple files? to replace semicolons simultaneously with the other patterns, so you don't double-substitute anything.
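One possible way to do a simultaneous replacement is a single regular-expression pass. This is a sketch and not necessarily what the linked answer does. The names REPLACEMENTS, PATTERN, and escape_once are made up for illustration, and "&#59;" is an assumed replacement for the semicolon.

```python
import re

# Assumed table; "&#59;" is only an illustrative replacement for ";".
REPLACEMENTS = {"&": "&amp;", ">": "&gt;", "<": "&lt;", ";": "&#59;"}

# One alternation that matches any key; re.escape keeps keys literal.
PATTERN = re.compile("|".join(re.escape(key) for key in REPLACEMENTS))

def escape_once(text):
    """Replace every key in one left-to-right pass over the input."""
    # Replacement text is never rescanned, so nothing is escaped twice.
    return PATTERN.sub(lambda match: REPLACEMENTS[match.group(0)], text)

print(escape_once("<;"))  # &lt;&#59;
```

Because the input is scanned only once, the semicolon that ends &lt; is never seen as a character to replace.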
That's because escape() takes the final semicolon character of the escaped < (that is, &lt;) into account, and replaces it with &#59; as instructed. Therefore, <; gives &lt&#59;&#59;.
Semicolons normally do not need to be escaped that way, so I don't think it's a bug in the function, only an edge case with this specific character.
This is by design, and Frédéric Hamidi has explained why this is so.
So how can you get what you want?
Taking @agf's suggestion:
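# Map each character to its escaped form; everything else passes through.
escape_table = {
    "&": "&amp;",
    ">": "&gt;",
    "<": "&lt;",
    ";": "&#59;",
}

def escape(text):
    # Each input character is looked up exactly once, so nothing is escaped twice.
    return "".join(escape_table.get(c, c) for c in text)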