PHP - how to detect encoding?
I'm using Amazon's API to obtain the description of books. The API returns XML responses and the description is marked up (with HTML) very poorly. To deal with this poorly marked up description, which oftentimes breaks the layout of my site, I'm trying to use HTML Tidy to "clean it up."
In order to prevent "weird" characters from being displayed on my web page, I think I need to tell Tidy what the input encoding is and what the desired output encoding is. I know I want the output to be UTF-8. However, I'm not sure how to determine the encoding of the input (Amazon's book description).
I've tried something like this:
mb_detect_encoding($amazon_description);
It's helped, but I'm still occasionally getting weird characters (a black diamond with a question mark in it: �). My guess is that I'm not detecting the encoding properly.
Any suggestions on what I need to do?
EDIT:
This is my current solution:
$sanitized_amazon_markup = preg_replace('/[^\w`~!@#$%^&*()-=_+[\]{}|;\':",.\/<>? ]/', '', $sanitized_amazon_markup);
I'm not sure about this as this may delete stuff that I should be keeping.
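To make that concern concrete, here is a small sketch of what an ASCII whitelist does to valid UTF-8 text. The sample string is made up, and the pattern is a reduced version of the one above. Without the `u` modifier the pattern is applied byte by byte, so every byte of a multibyte character falls outside the whitelist and is removed.

```php
<?php
// Hypothetical sample text; assumes this source file is saved as UTF-8.
$sample = 'Gabriel García Márquez';

// Reduced whitelist: word characters and spaces only.
// Without the /u modifier PCRE matches single bytes, so the bytes that
// make up "í" and "á" are not in the whitelist and get deleted.
$stripped = preg_replace('/[^\w ]/', '', $sample);

echo $stripped, PHP_EOL;
```

The accented letters disappear along with any stray bytes, which is why converting the input to UTF-8 is usually safer than filtering it.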
Can you provide your tidy repairString call?
If you set input-encoding and output-encoding in the Tidy options, try dropping them and using the third argument of repairString instead, something like this:
$oTidy = new tidy();
$page_content = $oTidy->repairString($page_content,
array("show-errors" => 0, "show-warnings" => false),
"utf8"
);
Edit:
After some testing: what I said above only works if $page_content is already UTF-8 encoded before repairString is called. If the input is not UTF-8, it will most likely be ISO-8859-1 (latin1).
May I suggest you try:
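// Strict mode (third argument) checks the whole string against each candidate.
$charset = mb_detect_encoding($amazon_description, 'UTF-8, ISO-8859-1', true);
if ($charset === "ISO-8859-1") {
    // Same conversion as utf8_encode(), which is deprecated as of PHP 8.2.
    $amazon_description = mb_convert_encoding($amazon_description, 'UTF-8', 'ISO-8859-1');
}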
$oTidy = new tidy();
$amazon_description = $oTidy->repairString($amazon_description,
array("show-errors" => 0, "show-warnings" => false),
"utf8"
);
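The detect-then-convert step above can also be wrapped in a small helper. This is only a sketch: `to_utf8` and its `$fallback` parameter are hypothetical names, not part of Tidy or mbstring, and it assumes that any input which is not valid UTF-8 is ISO-8859-1. It uses `mb_check_encoding` to validate the whole string rather than guessing from a list of candidates.

```php
<?php
// Hypothetical helper: return $text as UTF-8.
// Assumption: anything that is not valid UTF-8 is encoded as $fallback.
function to_utf8(string $text, string $fallback = 'ISO-8859-1'): string
{
    // mb_check_encoding() validates the entire string as UTF-8.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Otherwise reinterpret the bytes as the fallback encoding.
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}

$latin1 = "caf\xE9";         // "café" encoded as ISO-8859-1
$utf8   = to_utf8($latin1);  // the same word as UTF-8 bytes
```

The result can then be passed to repairString with "utf8" as the third argument. If the descriptions turn out to use a different legacy encoding, pass its name as the second argument of the helper.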