How do you make strings "XML safe"?
I am responding to an AJAX call by sending it an XML document through PHP echos. In order to form this XML document, I loop through the records of a database. The problem is that the database includes records that have '<' symbols in th开发者_高级运维em. So naturally, the browser throws an error at that particular spot. How can this be fixed?
Since PHP 5.4 you can use:
htmlspecialchars($string, ENT_XML1);
You should specify the encoding, such as:
htmlspecialchars($string, ENT_XML1, 'UTF-8');
Update
Note that the above will only convert:
&
to&
<
to<
>
to>
If you want to escape text for use in an attribute enclosed in double quotes:
htmlspecialchars($string, ENT_XML1 | ENT_COMPAT, 'UTF-8');
will convert "
to "
in addition to &
, <
and >
.
And if your attributes are enclosed in single quotes:
htmlspecialchars($string, ENT_XML1 | ENT_QUOTES, 'UTF-8');
will convert '
to '
in addition to &
, <
, >
and "
.
(Of course you can use this even outside of attributes).
See the manual entry for htmlspecialchars.
By either escaping those characters with htmlspecialchars
, or, perhaps more appropriately, using a library for building XML documents, such as DOMDocument or XMLWriter.
Another alternative would be to use CDATA sections, but then you'd have to look out for occurrences of ]]>
.
Take also into consideration that that you must respect the encoding you define for the XML document (by default UTF-8).
1) You can wrap your text as CDATA like this:
<mytag>
<![CDATA[Your text goes here. Btw: 5<6 and 6>5]]>
</mytag>
see http://www.w3schools.com/xml/xml_cdata.asp
2) As already someone said: Escape those chars. E.g. like so:
5<6 and 6>5
Try this:
$str = htmlentities($str,ENT_QUOTES,'UTF-8');
So, after filtering your data using htmlentities()
function, you can use the data in XML tag like:
<mytag>$str</mytag>
If at all possible, its always a good idea to create your XML using the XML classes rather than string manipulation - one of the benefits being that the classes will automatically escape characters as needed.
Adding this in case it helps someone.
As I am working with Japanese characters, encoding has also been set appropriately. However, from time to time, I find that htmlentities
and htmlspecialchars
are not sufficient.
Some user inputs contain special characters that are not stripped by the above functions. In those cases I have to do this:
preg_replace('/[\x00-\x1f]/','',htmlspecialchars($string))
This will also remove certain xml-unsafe
control characters like Null character
or EOT
. You can use this table to determine which characters you wish to omit.
I prefer the way Golang does quote escaping for XML (and a few extras like newline escaping, and escaping some other characters), so I have ported its XML escape function to PHP below
function isInCharacterRange(int $r): bool {
return $r == 0x09 ||
$r == 0x0A ||
$r == 0x0D ||
$r >= 0x20 && $r <= 0xDF77 ||
$r >= 0xE000 && $r <= 0xFFFD ||
$r >= 0x10000 && $r <= 0x10FFFF;
}
function xml(string $s, bool $escapeNewline = true): string {
$w = '';
$Last = 0;
$l = strlen($s);
$i = 0;
while ($i < $l) {
$r = mb_substr(substr($s, $i), 0, 1);
$Width = strlen($r);
$i += $Width;
switch ($r) {
case '"':
$esc = '"';
break;
case "'":
$esc = ''';
break;
case '&':
$esc = '&';
break;
case '<':
$esc = '<';
break;
case '>':
$esc = '>';
break;
case "\t":
$esc = '	';
break;
case "\n":
if (!$escapeNewline) {
continue 2;
}
$esc = '
';
break;
case "\r":
$esc = '
';
break;
default:
if (!isInCharacterRange(mb_ord($r)) || (mb_ord($r) === 0xFFFD && $Width === 1)) {
$esc = "\u{FFFD}";
break;
}
continue 2;
}
$w .= substr($s, $Last, $i - $Last - $Width) . $esc;
$Last = $i;
}
$w .= substr($s, $Last);
return $w;
}
Note you'll need at least PHP7.2 because of the mb_ord
usage, or you'll have to swap it out for another polyfill, but these functions are working great for us!
For anyone curious, here is the relevant Go source https://golang.org/src/encoding/xml/xml.go?s=44219:44263#L1887
精彩评论