How to skip invalid characters in XML file using PHP
I'm trying to parse an XML file using PHP, but I get an error message:
parser error : Char 0x0 out of allowed range in
I think it's because of the content of the XML: it seems to contain a special symbol, "☆". Any ideas what I can do to fix it?
I also get:
parser error : Premature end of data in tag item line
What might be causing that error?
I'm using simplexml_load_file.
Update:
I tried to find the line that triggers the error and pasted its content into a separate XML file, and that file parses fine! So I still cannot figure out what makes parsing the full XML file fail. PS: it's a huge XML file, over 100 MB. Could the size cause the parse error?
Do you have control over the XML? If so, ensure the data is enclosed in <![CDATA[ .. ]]> blocks.
And you also need to clear the invalid characters:
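/**
 * Replaces bytes that are not allowed in XML 1.0 with a space.
 *
 * The string is processed byte by byte, so ord() only returns 0-255 and the
 * range checks above 0xFF never apply; the bytes of multibyte UTF-8
 * sequences (0x80-0xFF) pass through unchanged.
 *
 * @access public
 * @param string $value
 * @return string
 */
function stripInvalidXml($value)
{
    $ret = "";
    // Not empty(): that would also discard the valid string "0".
    if ($value === null || $value === "")
    {
        return $ret;
    }
    $length = strlen($value);
    for ($i = 0; $i < $length; $i++)
    {
        $current = ord($value[$i]);
        if (($current == 0x9) ||
            ($current == 0xA) ||
            ($current == 0xD) ||
            (($current >= 0x20) && ($current <= 0xD7FF)) ||
            (($current >= 0xE000) && ($current <= 0xFFFD)) ||
            (($current >= 0x10000) && ($current <= 0x10FFFF)))
        {
            $ret .= chr($current);
        }
        else
        {
            $ret .= " ";
        }
    }
    return $ret;
}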
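Since the file in the question is over 100 MB, building the cleaned string one byte at a time in memory may be slow. A minimal sketch of the same idea applied chunk by chunk follows; the function name cleanXmlFile and its parameters are made up for illustration, and it assumes an ASCII-compatible encoding such as UTF-8 (in UTF-16, 0x00 bytes are legitimate). It only handles the forbidden control bytes, which are single bytes, so a chunk boundary cannot split a match.

```php
<?php
/**
 * Hypothetical helper (not part of the answer above): copies $src to $dst
 * and replaces control bytes that XML 1.0 forbids with a space.
 */
function cleanXmlFile(string $src, string $dst, int $chunkSize = 1048576): void
{
    $in = fopen($src, 'rb');
    if ($in === false) {
        throw new RuntimeException("Cannot open $src for reading");
    }
    $out = fopen($dst, 'wb');
    if ($out === false) {
        fclose($in);
        throw new RuntimeException("Cannot open $dst for writing");
    }
    while (!feof($in)) {
        $chunk = fread($in, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break;
        }
        // Tab (0x09), LF (0x0A) and CR (0x0D) are allowed and left alone.
        fwrite($out, preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', ' ', $chunk));
    }
    fclose($in);
    fclose($out);
}
```

The cleaned copy can then be passed to simplexml_load_file() as usual.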
I decided to test all Unicode code points (0-1114111), encoded as UTF-8, to make sure things work as they should. Using preg_replace() on its own returns NULL due to errors when testing all of these values. This is the solution I've come up with.
$utf_8_range = range(0, 1114111);
$output = ords_to_utfstring($utf_8_range);
$sanitized = sanitize_for_xml($output);
/**
 * Removes characters that are not valid in XML 1.0.
 *
 * @access public
 * @param string $input
 * @return string
 */
function sanitize_for_xml($input) {
    // Convert input to UTF-8, dropping characters that cannot be converted.
    $old_setting = ini_set('mbstring.substitute_character', 'none');
    $input = mb_convert_encoding($input, 'UTF-8', 'auto');
    ini_set('mbstring.substitute_character', $old_setting);
    // Use fast preg_replace. If failure, use slower chr => int => chr conversion.
    // The character class mirrors the XML 1.0 Char production, including
    // the supplementary planes (U+10000 - U+10FFFF).
    $output = preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u', '', $input);
    if (is_null($output)) {
        // Convert to ints.
        // Convert ints back into a string.
        $output = ords_to_utfstring(utfstring_to_ords($input), TRUE);
    }
    return $output;
}
/**
 * Given a UTF-8 string, output an array of ordinal values.
 *
 * @param string $input
 *   UTF-8 string.
 * @param string $encoding
 *   Defaults to UTF-8.
 *
 * @return array
 *   Array of ordinal values representing the input string.
 */
function utfstring_to_ords($input, $encoding = 'UTF-8'){
    // Turn a string of unicode characters into UCS-4BE, which is a Unicode
    // encoding that stores each character as a 4 byte integer. This accounts for
    // the "UCS-4"; the "BE" suffix indicates that the integers are stored in
    // big-endian order. The reason for this encoding is that each character is a
    // fixed size, making iterating over the string simpler.
    $input = mb_convert_encoding($input, "UCS-4BE", $encoding);
    // Visit each unicode character. The length is computed once, not on
    // every iteration.
    $ords = array();
    $length = mb_strlen($input, "UCS-4BE");
    for ($i = 0; $i < $length; $i++) {
        // Now we have 4 bytes. Find their total numeric value.
        $s2 = mb_substr($input, $i, 1, "UCS-4BE");
        $val = unpack("N", $s2);
        $ords[] = $val[1];
    }
    return $ords;
}
/**
 * Given an array of ints representing Unicode chars, outputs a UTF-8 string.
 *
 * @param array $ords
 *   Array of integers representing Unicode characters.
 * @param bool $scrub_XML
 *   Set to TRUE to remove non valid XML characters.
 *
 * @return string
 *   UTF-8 String.
 */
function ords_to_utfstring($ords, $scrub_XML = FALSE) {
    $output = '';
    foreach ($ords as $ord) {
        // < 0: Negative numbers.
        // 55296 - 57343: Surrogate Range.
        // 65279: BOM (byte order mark).
        // > 1114111: Out of range.
        if (   $ord < 0
            || ($ord >= 0xD800 && $ord <= 0xDFFF)
            || $ord == 0xFEFF
            || $ord > 0x10ffff) {
            // Skip non valid UTF-8 values (and the BOM).
            continue;
        }
        // < 9: Anything Below 9.
        // 11: Vertical Tab.
        // 12: Form Feed.
        // 14-31: Unprintable control codes.
        // 65534, 65535: Unicode noncharacters.
        elseif ($scrub_XML && (
               $ord < 0x9
            || $ord == 0xB
            || $ord == 0xC
            || ($ord > 0xD && $ord < 0x20)
            || $ord == 0xFFFE
            || $ord == 0xFFFF
            )) {
            // Skip non valid XML values.
            continue;
        }
        // Up to 127: 1 Byte char.
        elseif ($ord <= 0x007f) {
            $output .= chr($ord);
            continue;
        }
        // Up to 2047: 2 Byte char.
        elseif ($ord <= 0x07ff) {
            $output .= chr(0xc0 | ($ord >> 6));
            $output .= chr(0x80 | ($ord & 0x003f));
            continue;
        }
        // Up to 65535: 3 Byte char.
        elseif ($ord <= 0xffff) {
            $output .= chr(0xe0 | ($ord >> 12));
            $output .= chr(0x80 | (($ord >> 6) & 0x003f));
            $output .= chr(0x80 | ($ord & 0x003f));
            continue;
        }
        // Up to 1114111: 4 Byte char.
        elseif ($ord <= 0x10ffff) {
            $output .= chr(0xf0 | ($ord >> 18));
            $output .= chr(0x80 | (($ord >> 12) & 0x3f));
            $output .= chr(0x80 | (($ord >> 6) & 0x3f));
            $output .= chr(0x80 | ($ord & 0x3f));
            continue;
        }
    }
    return $output;
}
And to do this on a simple object or array:
// Recursive sanitize_for_xml.
function recursive_sanitize_for_xml(&$input){
    if (is_null($input) || is_bool($input) || is_numeric($input)) {
        return;
    }
    if (!is_array($input) && !is_object($input)) {
        $input = sanitize_for_xml($input);
    }
    else {
        foreach ($input as &$value) {
            recursive_sanitize_for_xml($value);
        }
    }
}
Certain Unicode characters must not appear in XML 1.0:
- C0 control codes (U+0000 - U+001F) except tab, CR and LF.
- UTF-16 surrogates (U+D800 - U+DFFF). These are invalid in UTF-8 as well and indicate more serious problems when encountered.
- U+FFFE and U+FFFF.
But in practice, you often have to handle XML that was carelessly produced from other sources and contains such characters. If you want to handle this special case of invalid XML in a UTF-8 encoded string, I'd suggest:
$str = preg_replace(
    '/[\x00-\x08\x0B\x0C\x0E-\x1F]|\xED[\xA0-\xBF].|\xEF\xBF[\xBE\xBF]/',
    "\xEF\xBF\xBD",
    $str
);
This doesn't use the u (Unicode) regex modifier but works directly on UTF-8 encoded bytes for extra performance. The parts of the pattern are:
- Invalid control chars: [\x00-\x08\x0B\x0C\x0E-\x1F]
- UTF-16 surrogates: \xED[\xA0-\xBF].
- Non-characters U+FFFE and U+FFFF: \xEF\xBF[\xBE\xBF]
Invalid characters are replaced with the replacement character U+FFFD (�) instead of simply stripping them. This makes it easier to diagnose invalid chars and can even prevent security issues.
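For illustration, here is a rough sketch of how the replacement could be wrapped in a function and used before parsing. The function name replaceInvalidXmlChars and the sample string are made up for this example; it assumes the input is meant to be UTF-8.

```php
<?php
/**
 * Hypothetical wrapper around the byte-level pattern: replaces characters
 * that XML 1.0 forbids with U+FFFD.
 */
function replaceInvalidXmlChars(string $str): string
{
    $clean = preg_replace(
        '/[\x00-\x08\x0B\x0C\x0E-\x1F]|\xED[\xA0-\xBF].|\xEF\xBF[\xBE\xBF]/',
        "\xEF\xBF\xBD",
        $str
    );
    // preg_replace() returns null if the regex engine reports an error.
    if ($clean === null) {
        throw new RuntimeException('preg_replace failed: ' . preg_last_error());
    }
    return $clean;
}

// Sample input with a NUL byte inside the element content.
$raw = "<item>bad\x00char</item>";
$xml = simplexml_load_string(replaceInvalidXmlChars($raw));
echo (string) $xml, "\n";
```

Tab, LF and CR are not touched, so normal formatting of the document survives the cleanup.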
If you have control over the data, ensure that it is encoded correctly, i.e. that it is in the encoding you promised in the XML declaration. For example, if you have:
<?xml version="1.0" encoding="UTF-8"?>
then you'll need to ensure your data is in UTF-8.
If you don't have control over the data, yell at those who do.
You can use a tool like xmllint to check which part(s) of the data are not valid.
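If xmllint is not at hand, PHP's own libxml functions can report roughly the same information. The following is only a sketch: the helper name xmlParseErrors is invented here, and for a very large file you would point simplexml_load_file() at the file instead of passing a string.

```php
<?php
/**
 * Hypothetical helper: returns the libxml parse errors for an XML string
 * as "line N, column M: message" strings (empty array if it parses).
 */
function xmlParseErrors(string $xml): array
{
    // Collect errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $messages = [];
    if (simplexml_load_string($xml) === false) {
        foreach (libxml_get_errors() as $error) {
            $messages[] = sprintf(
                'line %d, column %d: %s',
                $error->line,
                $error->column,
                trim($error->message)
            );
        }
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $messages;
}

// Sample document with a tag that is never closed.
foreach (xmlParseErrors("<items>\n<item>unclosed\n</items>") as $message) {
    echo $message, "\n";
}
```

The line numbers in the output tell you where in the document to look.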
My problem was the "&" character (hex 0x26), so I raised the lower bound of the allowed range from 0x20 to 0x28. Note that this also replaces the space and every other character from 0x20 to 0x27:
function stripInvalidXml($value)
{
    $ret = "";
    if ($value === null || $value === "")
    {
        return $ret;
    }
    $length = strlen($value);
    for ($i = 0; $i < $length; $i++)
    {
        // $value[$i], not $value{$i}: the curly-brace syntax was removed in PHP 8.
        $current = ord($value[$i]);
        if (($current == 0x9) ||
            ($current == 0xA) ||
            ($current == 0xD) ||
            (($current >= 0x28) && ($current <= 0xD7FF)) ||
            (($current >= 0xE000) && ($current <= 0xFFFD)) ||
            (($current >= 0x10000) && ($current <= 0x10FFFF)))
        {
            $ret .= chr($current);
        }
        else
        {
            $ret .= " ";
        }
    }
    return $ret;
}
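If you build the XML yourself, an alternative to stripping "&" is to escape it. This is a sketch of a different technique than the function above, using PHP's htmlspecialchars() with the ENT_XML1 flag; the sample text is made up, and it only applies to raw text before it is placed inside an element, not to a whole XML document (that would escape the markup too).

```php
<?php
// Raw text that contains characters with special meaning in XML.
$text = 'Fish & Chips <fresh>';

// ENT_XML1 selects XML 1.0 escaping rules; ENT_QUOTES also escapes quotes.
$escaped = htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');

// The escaped text can now be embedded in an element safely.
$xml = simplexml_load_string("<item>$escaped</item>");

// Reading the element back yields the original text.
echo (string) $xml, "\n";
```

This keeps the "&" in the data instead of replacing it with a space.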
Make sure your XML source is valid. See http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
For a non-destructive method of loading this type of input into a SimpleXMLElement, see my answer on How to handle invalid unicode with simplexml
I used this to clean the string:
public static function Clean($inputName)
{
    $strName = trim($inputName);
    if ($strName != "")
    {
        $strName = iconv("UTF-8", "UTF-8//IGNORE", $strName); // drop all non utf-8 characters
        $strName = str_replace(array('\\','/',':','*','?','"','<','>','|'), '@', $strName);
        // The original assigned to an undefined $string here, so this line had no effect.
        $strName = preg_replace('/[\x00-\x1F\x7F\xA0]/u', '', $strName);
        // [\x00-\x1F] control characters http://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx
        // Invalid control chars: [\x00-\x08\x0B\x0C\x0E-\x1F]
        // UTF-16 surrogates: \xED[\xA0-\xBF].
        // Non-characters U+FFFE and U+FFFF: \xEF\xBF[\xBE\xBF]
        // Invalid characters are replaced with the replacement character U+FFFD
        $strName = preg_replace(
            '/[\x00-\x08\x0B\x0C\x0E-\x1F]|\xED[\xA0-\xBF].|\xEF\xBF[\xBE\xBF]/',
            "\xEF\xBF\xBD",
            $strName);
        // Reduce all multiple whitespace to a single space
        // $strName = preg_replace('/\s+/', ' ', $strName);
        if (trim($strName) == "")
        {
            $strName = "@" . "empty-name";
        }
    }
    else
    {
        $strName = " ";
    }
    return $strName;
}
Not a PHP solution, but it works:
Download Notepad++: https://notepad-plus-plus.org/
Open your .xml file in Notepad++.
From the main menu: Search -> Search Mode, set this to: Extended
Then,
Replace -> Find what: \x00; Replace with: {leave empty}
Then click Replace All.
Rob
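For completeness, roughly the same NUL-byte removal can be done in PHP itself. This is a minimal sketch with a made-up function name, and it only removes 0x00 bytes, nothing else:

```php
<?php
/**
 * Hypothetical helper: removes every NUL byte (0x00) from the data,
 * mirroring the Notepad++ "replace \x00 with nothing" steps.
 */
function stripNulBytes(string $data): string
{
    return str_replace("\x00", '', $data);
}

// Sample string with two NUL bytes.
echo stripNulBytes("<item>a\x00b\x00c</item>"), "\n";
```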