
XML parse error: 'Invalid character'

I'm using the Google Weather API for a widget.

Everything worked fine until today, when I hit a problem I cannot solve. When the API is called with this location:

http://www.google.com/ig/api?weather=dunjkovec,medimurska,croatia&hl=en

I get this error:

XML parse error 9 'Invalid character' at line 1, column 169 (byte index 199)

I suspect that the problem is here: Nedelišće

The code block is this one:

$parser = xml_parser_create('UTF-8');
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);
$ok = xml_parse_into_struct($parser, $data, $values);
if (!$ok) {
    $errmsg = sprintf("XML parse error %d '%s' at line %d, column %d (byte index %d)",
    xml_get_error_code($parser),
    xml_error_string(xml_get_error_code($parser)),
    xml_get_current_line_number($parser),
    xml_get_current_column_number($parser),
    xml_get_current_byte_index($parser));
}

$data holds the content of the XML, and $values is empty after the call.
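
Since the parser reports a byte index, one way to narrow this down is to look at the raw bytes around that position. The following is only a sketch: hex_context is a hypothetical helper name, and the sample string with a lone 0xB9 byte is made up for illustration (it is not the actual API response).

```php
<?php
// Hypothetical helper: list the bytes around a parser-reported byte index
// as "position:HEX" pairs, marking the reported position with ">".
function hex_context(string $data, int $index, int $radius = 8): string
{
    $start = max(0, $index - $radius);
    $slice = (string) substr($data, $start, 2 * $radius + 1);
    if ($slice === '') {
        return '';
    }
    $out = [];
    foreach (str_split($slice) as $offset => $byte) {
        $pos = $start + $offset;
        $marker = ($pos === $index) ? '>' : ' ';
        $out[] = sprintf('%s%d:%02X', $marker, $pos, ord($byte));
    }
    return implode(' ', $out);
}

// Made-up sample: 0xB9 on its own is a continuation byte, so it is not
// valid UTF-8 and a parser created with 'UTF-8' would reject it.
$sample = "<city data=\"Nedeli\xB9\"/>";
echo hex_context($sample, 18, 2), "\n";
```

In the real script the call would be hex_context($data, xml_get_current_byte_index($parser)). A byte of 0x80 or above that is not part of a valid multi-byte sequence suggests the data is not really UTF-8.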

Can someone help me? Thank you very much!

EDIT----------------------------------

After reading Hussein's post I discovered that the problem is in the way the file gets retrieved.

I tried file_get_contents and cURL. Both return:

That is the line that creates problems. Or so I thought! I tried html_entity_decode($data, ENT_NOQUOTES, 'UTF-8') and it didn't work. Then I made a discovery: I can't echo the contents of the XML; I can only print_r them and see the result in the HTML source! It works with any other location in the world; only this one creates problems... I wanna cry :-(

EDIT 2--------------------------------

For anybody who cares: I fixed the problem with these lines of code, run right after retrieving the XML file from the API:

$data = mb_convert_encoding($data, 'UTF-8', mb_detect_encoding($data, 'UTF-8, ISO-8859-1', true));
$data = html_entity_decode($data,ENT_NOQUOTES,'UTF-8'); 

Then parse the XML; it works like a charm. I marked Hussein's answer because it got me on the right track.


After reading about your problem, I tried the same thing on my machine. What I did:

1. Downloaded the XML file to my local machine from the URL you posted.
2. Used your XML parsing script to build the structure from the XML.

Surprisingly, it worked perfectly on my machine, even though the XML contains the word Nedelišće. So I think the problem lies in the way the XML file is read.

It would be easier to debug if you could tell me how you are reading the XML from the Google API. Are you using cURL?

EDIT -----------------------------------------------

Hi 0plus1,

I have prepared a helper function that converts those special characters to HTML entities so that the data can be parsed.

I am pasting the entire code here. Use the following script:

function utf8tohtml($utf8, $encodeTags)
{
    $result = '';
    for ($i = 0; $i < strlen($utf8); $i++)
    {
        $char = $utf8[$i];
        $ascii = ord($char);
        if ($ascii < 128)
        {
            // one-byte character
            $result .= ($encodeTags) ? htmlentities($char , ENT_QUOTES, 'UTF-8') : $char;
        } else if ($ascii < 192)
        {
            // non-utf8 character or not a start byte
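        } else if ($ascii < 224)
        {
            // two-byte character: emit a numeric reference, because named
            // entities such as &scaron; are not defined in plain XML
            $ascii1 = ord($utf8[$i+1]);
            $unicode = (31 & $ascii) * 64 +
                (63 & $ascii1);
            $result .= "&#$unicode;";
            $i++;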
        } else if ($ascii < 240)
        {
            // three-byte character
            $ascii1 = ord($utf8[$i+1]);
            $ascii2 = ord($utf8[$i+2]);
            $unicode = (15 & $ascii) * 4096 +
                (63 & $ascii1) * 64 +
                (63 & $ascii2);
            $result .= "&#$unicode;";
            $i += 2;
        } else if ($ascii < 248)
        {
            // four-byte character
            $ascii1 = ord($utf8[$i+1]);
            $ascii2 = ord($utf8[$i+2]);
            $ascii3 = ord($utf8[$i+3]);
            $unicode = (7 & $ascii) * 262144 +
                (63 & $ascii1) * 4096 +
                (63 & $ascii2) * 64 +
                (63 & $ascii3);
            $result .= "&#$unicode;";
            $i += 3;
        }
    }
    return $result;
}


$curlHandle = curl_init();
$serviceUrl = "http://www.google.com/ig/api?weather=dunjkovec,medimurska,croatia&hl=en";
// setup the basic options for the curl
curl_setopt($curlHandle , CURLOPT_URL, $serviceUrl);
curl_setopt($curlHandle , CURLOPT_HEADER , 0);
curl_setopt($curlHandle , CURLOPT_HTTPHEADER , array("Cache-Control: no-cache","Content-type: application/x-www-form-urlencoded;charset=UTF-8"));
curl_setopt($curlHandle , CURLOPT_FOLLOWLOCATION , true);
curl_setopt($curlHandle , CURLOPT_RETURNTRANSFER , true);
curl_setopt($curlHandle , CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)');
$data = curl_exec($curlHandle);
// echo $data;
$data = utf8tohtml($data , false);
echo $data;

$parser = xml_parser_create("UTF-8");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_parser_set_option($parser, XML_OPTION_SKIP_WHITE, 1);
$ok = xml_parse_into_struct($parser, $data, $values);
if (!$ok) {
    $errmsg = sprintf("XML parse error %d '%s' at line %d, column %d (byte index %d)",
    xml_get_error_code($parser),
    xml_error_string(xml_get_error_code($parser)),
    xml_get_current_line_number($parser),
    xml_get_current_column_number($parser),
    xml_get_current_byte_index($parser));
}
echo "<pre>";
print_r($values);
echo "</pre>";

Hope this will help.

Thanks!

Hussain.
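
As a side note, the same idea (turning every non-ASCII character into a numeric reference) can probably be had without a hand-written loop. This is a minimal sketch, assuming the mbstring extension is available and the input is already valid UTF-8; to_numeric_entities is a hypothetical name, not part of the script above.

```php
<?php
// Sketch: replace every code point from U+0080 upward with a decimal
// numeric character reference. ASCII (including < > & ") is left alone.
function to_numeric_entities(string $utf8): string
{
    // Each convmap group is [first code point, last code point, offset, mask].
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $convmap, 'UTF-8');
}

// \u{0161} is š and \u{0107} is ć, written as escapes so the example does
// not depend on the encoding of the source file.
echo to_numeric_entities("<city data=\"Nedeli\u{0161}\u{0107}e\"/>"), "\n";
```

Numeric references are understood by any XML parser, which is why they are a safer target here than named HTML entities.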


The Content-Type header field of the response declares the content as ISO 8859-1 (see the response on Web-Sniffer.net), not UTF-8. So either specify ISO-8859-1 as the encoding, or omit that parameter and let xml_parser_create try to identify the encoding.
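
If you would rather not hard-code the charset, one option is to read it from the Content-Type value and convert the body to UTF-8 before parsing. A rough sketch, assuming mbstring is loaded; both function names are hypothetical, and the sample bytes are made up rather than taken from the API:

```php
<?php
// Hypothetical helper: extract the charset parameter from a Content-Type value.
function charset_from_content_type(?string $contentType, string $default = 'UTF-8'): string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)"?/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return $default;
}

// Convert a response body to UTF-8 according to the declared charset.
// Note: mb_convert_encoding rejects charset names it does not know.
function body_to_utf8(string $body, ?string $contentType): string
{
    $charset = charset_from_content_type($contentType);
    return $charset === 'UTF-8'
        ? $body
        : mb_convert_encoding($body, 'UTF-8', $charset);
}

// 0xE9 is "é" in ISO-8859-1; after conversion it becomes the bytes C3 A9.
echo bin2hex(body_to_utf8("caf\xE9", 'text/xml; charset=ISO-8859-1')), "\n";
```

With cURL, the header value is available after curl_exec() through curl_getinfo($curlHandle, CURLINFO_CONTENT_TYPE).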


Again, which PHP version are you using? xml_parser_create takes an encoding parameter, but in some versions it applies only to the output, not the input. See http://www.php.net/manual/en/function.xml-parser-create.php

You might want to consider creating an empty utf-8 string and then filling it with the XML retrieved from Google, or explicitly converting the string to UTF-8.

string utf8_encode ( string $data )

Google is correctly informing us the data is UTF-8, but only in the header, not in the actual XML.
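
For the explicit conversion, something along these lines might do. It is a sketch only: ensure_utf8 is a hypothetical name, and the ISO-8859-1 fallback is an assumption about what the bytes are when they are not valid UTF-8. It uses mbstring instead of utf8_encode, which is deprecated as of PHP 8.2.

```php
<?php
// Sketch: keep data that is already valid UTF-8 untouched; otherwise treat
// it as the fallback single-byte encoding (assumed here, not detected).
function ensure_utf8(string $data, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data;
    }
    return mb_convert_encoding($data, 'UTF-8', $fallback);
}

// Already UTF-8: returned unchanged. Lone 0xE9: converted to C3 A9.
echo bin2hex(ensure_utf8("caf\xC3\xA9")), "\n";
echo bin2hex(ensure_utf8("caf\xE9")), "\n";
```

Checking first avoids double-encoding data that was valid UTF-8 to begin with, which is the usual risk of calling a converter unconditionally.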
