开发者

How do you make strings "XML safe"?

I am responding to an AJAX call by sending it an XML document through PHP echos. In order to form this XML document, I loop through the records of a database. The problem is that the database includes records that have '<' symbols in th开发者_高级运维em. So naturally, the browser throws an error at that particular spot. How can this be fixed?


Since PHP 5.4 you can use:

htmlspecialchars($string, ENT_XML1);

You should specify the encoding, such as:

htmlspecialchars($string, ENT_XML1, 'UTF-8');

Update

Note that the above will only convert:

  • & to &amp;
  • < to &lt;
  • > to &gt;

If you want to escape text for use in an attribute enclosed in double quotes:

htmlspecialchars($string, ENT_XML1 | ENT_COMPAT, 'UTF-8');

will convert " to &quot; in addition to &, < and >.


And if your attributes are enclosed in single quotes:

htmlspecialchars($string, ENT_XML1 | ENT_QUOTES, 'UTF-8');

will convert ' to &apos; in addition to &, <, > and ".

(Of course you can use this even outside of attributes).


See the manual entry for htmlspecialchars.


By either escaping those characters with htmlspecialchars, or, perhaps more appropriately, using a library for building XML documents, such as DOMDocument or XMLWriter.

Another alternative would be to use CDATA sections, but then you'd have to look out for occurrences of ]]>.

Take also into consideration that that you must respect the encoding you define for the XML document (by default UTF-8).


1) You can wrap your text as CDATA like this:

<mytag>
    <![CDATA[Your text goes here. Btw: 5<6 and 6>5]]>
</mytag>

see http://www.w3schools.com/xml/xml_cdata.asp

2) As already someone said: Escape those chars. E.g. like so:

5&lt;6 and 6&gt;5


Try this:

$str = htmlentities($str,ENT_QUOTES,'UTF-8');

So, after filtering your data using htmlentities() function, you can use the data in XML tag like:

<mytag>$str</mytag>


If at all possible, its always a good idea to create your XML using the XML classes rather than string manipulation - one of the benefits being that the classes will automatically escape characters as needed.


Adding this in case it helps someone.

As I am working with Japanese characters, encoding has also been set appropriately. However, from time to time, I find that htmlentities and htmlspecialchars are not sufficient.

Some user inputs contain special characters that are not stripped by the above functions. In those cases I have to do this:

preg_replace('/[\x00-\x1f]/','',htmlspecialchars($string))

This will also remove certain xml-unsafe control characters like Null character or EOT. You can use this table to determine which characters you wish to omit.


I prefer the way Golang does quote escaping for XML (and a few extras like newline escaping, and escaping some other characters), so I have ported its XML escape function to PHP below

function isInCharacterRange(int $r): bool {
    return $r == 0x09 ||
            $r == 0x0A ||
            $r == 0x0D ||
            $r >= 0x20 && $r <= 0xDF77 ||
            $r >= 0xE000 && $r <= 0xFFFD ||
            $r >= 0x10000 && $r <= 0x10FFFF;
}

function xml(string $s, bool $escapeNewline = true): string {
    $w = '';

    $Last = 0;
    $l = strlen($s);
    $i = 0;

    while ($i < $l) {
        $r = mb_substr(substr($s, $i), 0, 1);
        $Width = strlen($r);
        $i += $Width;
        switch ($r) {
            case '"':
                $esc = '&#34;';
                break;
            case "'":
                $esc = '&#39;';
                break;
            case '&':
                $esc = '&amp;';
                break;
            case '<':
                $esc = '&lt;';
                break;
            case '>':
                $esc = '&gt;';
                break;
            case "\t":
                $esc = '&#x9;';
                break;
            case "\n":
                if (!$escapeNewline) {
                    continue 2;
                }
                $esc = '&#xA;';
                break;
            case "\r":
                $esc = '&#xD;';
                break;
            default:
                if (!isInCharacterRange(mb_ord($r)) || (mb_ord($r) === 0xFFFD && $Width === 1)) {
                    $esc = "\u{FFFD}";
                    break;
                }

                continue 2;
        }
        $w .= substr($s, $Last, $i - $Last - $Width) . $esc;
        $Last = $i;
    }
    $w .= substr($s, $Last);
    return $w;
}

Note you'll need at least PHP7.2 because of the mb_ord usage, or you'll have to swap it out for another polyfill, but these functions are working great for us!

For anyone curious, here is the relevant Go source https://golang.org/src/encoding/xml/xml.go?s=44219:44263#L1887

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜