PHP - how to detect encoding?

I'm using Amazon's API to obtain the description of books. The API returns XML responses and the description is marked up (with HTML) very poorly. To deal with this poorly marked up description, which oftentimes breaks the layout of my site, I'm trying to use HTML Tidy to "clean it up."

To prevent "weird" characters from being displayed on my web page, I think I need to tell Tidy what the input encoding is and what the desired output encoding is. I know I want the output to be UTF-8. However, I'm not sure how to determine the encoding of the input (Amazon's book description).

I've tried something like this:

mb_detect_encoding($amazon_description);

It's helped, but I'm still occasionally getting weird characters (a black diamond with a question mark in it: �). My guess is that I'm not detecting the encoding properly.

Any suggestions on what I need to do?

EDIT:

This is my current solution:

$sanitized_amazon_markup = preg_replace('/[^\w`~!@#$%^&*()-=_+[\]{}|;\':",.\/<>? ]/', '', $sanitized_amazon_markup);

I'm not sure about this as this may delete stuff that I should be keeping.


Can you provide your Tidy repairString call?

If you set the input-encoding and output-encoding Tidy options, try dropping them and passing the encoding as the third argument of repairString instead, something like this:

$oTidy = new tidy();
$page_content = $oTidy->repairString($page_content,
    array("show-errors" => 0, "show-warnings" => false),
    "utf8"
);

Edit:

After doing some tests, what I said above only works if $page_content is already UTF-8 encoded before repairString is called.

If the text is not already UTF-8, it will most likely be ISO-8859-1 (latin1).
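To see why ISO-8859-1 text ends up as � on a UTF-8 page, here is a minimal sketch. The sample strings are made up for illustration and are not taken from Amazon's API; only mb_check_encoding() is a real PHP function here.

```php
<?php
// "café" encoded as ISO-8859-1: the lone byte 0xE9 is not a valid UTF-8 sequence,
// so a browser reading the page as UTF-8 shows a replacement character instead.
$latin1 = "caf\xE9";

// The same word encoded as UTF-8: "é" takes two bytes, 0xC3 0xA9.
$utf8 = "caf\xC3\xA9";

var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)
```

Running a check like this on a description that displays badly should show whether the bytes are valid UTF-8 at all.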

May I suggest you try:

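// Try UTF-8 first, then ISO-8859-1; strict mode (third argument) rejects
// a candidate encoding as soon as the string contains a byte invalid in it.
$charset = mb_detect_encoding($amazon_description, 'UTF-8, ISO-8859-1', true);
if ($charset === "ISO-8859-1") {
    // Convert latin1 bytes to UTF-8 before handing the text to Tidy.
    $amazon_description = utf8_encode($amazon_description);
}
$oTidy = new tidy();
// The third argument tells Tidy that both input and output are UTF-8.
$amazon_description = $oTidy->repairString($amazon_description,
    array("show-errors" => 0, "show-warnings" => false),
    "utf8"
);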
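Note that utf8_encode() is deprecated as of PHP 8.2. A possible variant of the conversion step, as a sketch only: to_utf8() is a hypothetical helper name, and it assumes, like the code above, that anything that is not valid UTF-8 is ISO-8859-1.

```php
<?php
// Hypothetical helper (not part of PHP or Tidy): returns $text as UTF-8,
// assuming that any input that is not valid UTF-8 is ISO-8859-1.
function to_utf8(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave it untouched
    }
    // Same ISO-8859-1 to UTF-8 conversion as utf8_encode(), without the deprecated call.
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

$clean = to_utf8("caf\xE9");   // made-up ISO-8859-1 input ("café")
var_dump(bin2hex($clean));     // string(10) "636166c3a9"
```

The result can then be passed to repairString with "utf8" as the third argument, as above. If the descriptions can arrive in other encodings (Windows-1252, for example), the ISO-8859-1 assumption would need to be revisited.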